Quality Control and Integration of Genotypes from Two Calling Pipelines for Whole Genome Sequence Data in the Alzheimer’s Disease Sequencing Project

2018
Author(s):
Adam C. Naj
Honghuang Lin
Badri N. Vardarajan
Simon White
Daniel Lancour
...  

Abstract: The Alzheimer’s Disease Sequencing Project (ADSP) performed whole genome sequencing (WGS) of 584 subjects from 111 multiplex families at three sequencing centers. Genotype calling of single nucleotide variants (SNVs) and insertion-deletion variants (indels) was performed centrally using GATK-HaplotypeCaller and Atlas V2. The ADSP Quality Control (QC) Working Group applied QC protocols to project-level variant call format files (VCFs) from each pipeline, and developed and implemented a novel protocol, termed “consensus calling,” to combine genotype calls from both pipelines into a single high-quality set. QC was applied to autosomal bi-allelic SNVs and indels, and included pipeline-recommended QC filters, variant-level QC, and sample-level QC. Low-quality variants or genotypes were excluded, and sample outliers were noted. Quality was assessed by examining Mendelian inconsistencies (MIs) among 67 parent-offspring pairs, and MIs were used to establish additional genotype-specific filters for GATK calls. After QC, 578 subjects remained. Pipeline-specific QC excluded ~12.0% of GATK and 14.5% of Atlas SNVs. Between pipelines, ~91% of SNV genotypes across all QCed variants were concordant; 4.23% and 4.56% of genotypes were exclusive to Atlas or GATK, respectively; the remaining ~0.01% of discordant genotypes were excluded. For indels, variant-level QC excluded ~36.8% of GATK and 35.3% of Atlas indels. Between pipelines, ~55.6% of indel genotypes were concordant; 10.3% and 28.3% were exclusive to Atlas or GATK, respectively; and ~0.29% of discordant genotypes were excluded. The final WGS consensus dataset contains 27,896,774 SNVs and 3,133,926 indels and is publicly available.

Abbreviations: AD, Alzheimer’s disease; QC, Quality Control; LSSAC, Large-Scale Sequencing and Analysis Center; Broad, Broad Institute Genomics Service; Baylor, Baylor College of Medicine Human Genome Sequencing Center; WashU, Washington University-St. Louis McDonnell Genome Institute; WGS, whole genome sequencing; WES, whole exome sequencing; indel, insertion-deletion variants; VCF, variant call format; MI, Mendelian inconsistency; MC, Mendelian consistency; GWAS, genome-wide association study; VR, referent allele read depth; DP, overall read depth; MS, mapping score; GQ, genotype quality score; Ti/Tv, Transition/Transversion; CS, concordance code
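The consensus-calling idea summarized in the abstract (keep concordant genotypes, keep genotypes called by only one pipeline, exclude discordant ones) and the parent-offspring Mendelian inconsistency check can be sketched roughly as follows. This is a minimal illustration, not the ADSP protocol itself: the function names, the status labels, and the encoding of a bi-allelic genotype as an alternate-allele count (0, 1, or 2, with `None` for a missing or QC-failed call) are all assumptions made for the example.

```python
from typing import Optional, Tuple


def consensus_genotype(
    gatk: Optional[int], atlas: Optional[int]
) -> Tuple[Optional[int], str]:
    """Combine one genotype from each pipeline (hypothetical helper).

    Genotypes are alternate-allele counts (0, 1, 2); None means the
    pipeline produced no call that passed its own QC.
    """
    if gatk is None and atlas is None:
        return None, "missing"
    if gatk is None:
        return atlas, "atlas_only"   # genotype exclusive to Atlas
    if atlas is None:
        return gatk, "gatk_only"     # genotype exclusive to GATK
    if gatk == atlas:
        return gatk, "concordant"    # both pipelines agree
    return None, "discordant"        # pipelines disagree: exclude


def is_mendelian_inconsistent(parent: int, child: int) -> bool:
    """Check a single parent-offspring pair at a bi-allelic site.

    With only one parent observed, the pair is inconsistent exactly
    when parent and child are opposite homozygotes (0 vs 2), because
    the child must carry one of the parent's alleles.
    """
    return {parent, child} == {0, 2}


if __name__ == "__main__":
    print(consensus_genotype(1, 1))
    print(consensus_genotype(0, 2))
    print(is_mendelian_inconsistent(0, 2))
```

In a sketch like this, the returned status label plays a role similar to a per-genotype concordance code; how the actual ADSP files encode concordance is not described in the abstract.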

2019
Vol 15
pp. P1312-P1312
Author(s):
Badri N. Vardarajan
James Jaworski
Gary W. Beecham
Sandra Barral
Dolly Reyes-Dumeyer
...  

2020
Vol 16 (S3)
Author(s):
Gina M. Peloso
Yanbing Wang
Honghuang Lin
Chloé Sarnowski
Achilleas N. Pitsillides
...  

2015
Vol 11 (7S_Part_5)
pp. P250-P251
Author(s):
Anastasia Grigorenko
Fedor Gusev
Denis Reshetov
Tatiana Andreeva
Lev Shagam
...  

2017
Author(s):  
Jonathon Brenner ◽  
Laurynas Kalesinskas ◽  
Catherine Putonti

Abstract

Background: The persistent decrease in cost and difficulty of whole genome sequencing of microbial organisms has led to a dramatic increase in the number of species and strains characterized from a wide variety of environments. Microbial genome sequencing can now be conducted by small laboratories and as part of undergraduate curricula. While sequencing is routine in microbiology, assembly, annotation and downstream analyses still require computational resources and expertise, often necessitating familiarity with programming languages. To address this problem, we have created a light-weight, user-friendly tool for the assembly and annotation of microbial sequencing projects.

Results: The Prokaryotic Assembly and Annotation Tool, Peasant, automates the processes of read quality control, genome assembly, and annotation for microbial sequencing projects. High-quality assemblies and annotations can be generated by Peasant without the need for programming expertise or high-performance computing resources. Furthermore, statistics are calculated so that users can evaluate their sequencing project. To illustrate the computational speed and accuracy of Peasant, the SRA records of 322 Illumina platform whole genome sequencing assays for Bacillus species were retrieved from NCBI, assembled and annotated on a single desktop computer. From the assemblies and annotations produced, a comprehensive analysis of the diversity of over 200 high-quality samples was conducted, looking at both the 16S rRNA phylogenetic marker and the Bacillus core genome.

Conclusions: Peasant provides an intuitive solution for high-quality whole genome sequence assembly and annotation for users with limited programming experience and/or computational resources. The analysis of the Bacillus whole genome sequencing projects exemplifies the utility of this tool. Furthermore, the study conducted here provides insight into the diversity of the species, the largest such comparison conducted to date.


2020
Author(s):
Dmitry Prokopenko
Sarah L. Morgan
Kristina Mullin
Oliver Hofmann
Brad Chapman
...  

Abstract

Introduction: Genome-wide association studies have identified numerous genetic loci associated with Alzheimer’s disease (AD). Whole-genome sequencing (WGS) now permits genome-wide analyses to identify rare variants contributing to AD risk.

Methods: We performed single-variant and spatial clustering-based testing on rare variants (minor allele frequency ≤1%) in a family-based WGS association study of 2,247 subjects from 605 multiplex AD families, followed by replication in 1,669 unrelated individuals.

Results: We identified 13 new AD candidate loci that yielded consistent rare-variant signals in discovery and replication cohorts (4 from single-variant, 9 from spatial-clustering), implicating these genes: FNBP1L, SEL1L, LINC00298, PRKCH, C15ORF41, C2CD3, KIF2A, APC, LHX9, NALCN, CTNNA2, SYTL3, CLSTN2.

Discussion: Downstream analyses of these novel loci highlight synaptic function, in contrast to common AD-associated variants, which implicate innate immunity. These loci have not been previously associated with AD, emphasizing the ability of WGS to identify AD-associated rare variants, particularly outside of coding regions.


2021
Author(s):  
Jacob Househam ◽  
William CH Cross ◽  
Giulio Caravagna

Abstract: Cancer is a global health issue that places enormous demands on healthcare systems. Basic research, the development of targeted treatments, and the utility of DNA sequencing in clinical settings have been significantly improved with the introduction of whole genome sequencing. However, the broad applications of this technology come with complications. To date there has been very little standardisation in how data quality is assessed, leading to inconsistencies in analyses and disparate conclusions. Manual checking and complex consensus calling strategies often do not scale to large sample numbers, which leads to procedural bottlenecks. To address this issue, we present a quality control method that integrates point mutations, copy numbers, and other metrics into a single quantitative score. We demonstrate its power on 1,065 whole genomes from a large-scale pan-cancer cohort, and on multi-region data of two colorectal cancer patients. We highlight how our approach significantly improves the generation of cancer mutation data, providing visualisations for cross-referencing with other analyses. Our approach is fully automated, designed to work downstream of any bioinformatic pipeline, and can automatise tool parameterization, paving the way for fast computational assessment of data quality in the era of whole genome sequencing.


2019
Vol 6 (2)
Author(s):
Bradley T Endres
Khurshida Begum
Hua Sun
Seth T Walk
Ali Memariani
...  

Abstract

Background: The epidemic Clostridioides difficile ribotype 027 strain resulted from the dissemination of 2 separate fluoroquinolone-resistant lineages: FQR1 and FQR2. Both lineages were reported to originate in North America; however, confirmatory large-scale investigations of C difficile ribotype 027 epidemiology using whole genome sequencing have not been undertaken in the United States.

Methods: Whole genome sequencing and single-nucleotide polymorphism (SNP) analysis were performed on 76 clinical ribotype 027 isolates obtained from hospitalized patients in Texas with C difficile infection and compared with 32 previously sequenced worldwide strains. Maximum-likelihood phylogeny based on a set of core genome SNPs was used to construct phylogenetic trees investigating strain macro- and microevolution. Bayesian phylogenetic and phylogeographic analyses were used to incorporate temporal and geographic variables with the SNP strain analysis.

Results: Whole genome sequence analysis identified 2841 SNPs including 900 nonsynonymous mutations, 1404 synonymous substitutions, and 537 intergenic changes. Phylogenetic analysis separated the strains into 2 prominent groups, which grossly differed by 28 SNPs: the FQR1 and FQR2 lineages. Five isolates were identified as pre-epidemic strains. Phylogeny demonstrated unique clustering and resistance genes in Texas strains, indicating that spatiotemporal bias has defined the microevolution of ribotype 027 genetics.

Conclusions: Clostridioides difficile ribotype 027 lineages emerged earlier than previously reported, coinciding with increased use of fluoroquinolones. Both FQR1 and FQR2 ribotype 027 epidemic lineages are present in Texas, but they have evolved geographically to represent region-specific public health threats.

