A globally diverse reference alignment and panel for imputation of mitochondrial DNA variants

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Tim W. McInerney ◽  
Brian Fulton-Howard ◽  
Christopher Patterson ◽  
Devashi Paliwal ◽  
Lars S. Jermiin ◽  
...  

Background: Variation in mitochondrial DNA (mtDNA) identified by genotyping microarrays or by sequencing only the hypervariable regions of the genome may be insufficient to reliably assign mitochondrial genomes to phylogenetic lineages or haplogroups. This lack of resolution can limit functional and clinical interpretation of a substantial body of existing mtDNA data. To address this limitation, we developed and evaluated a large, curated reference alignment of complete mtDNA sequences as part of a pipeline for imputing missing mtDNA single nucleotide variants (mtSNVs). We call our reference alignment and pipeline MitoImpute. Results: We aligned the sequences of 36,960 complete human mitochondrial genomes downloaded from GenBank, filtered and controlled for quality. These sequences were reformatted for use in the imputation software IMPUTE2. We assessed the imputation accuracy of MitoImpute by measuring haplogroup and genotype concordance in data from the 1000 Genomes Project and the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The mean improvement of haplogroup assignment in the 1000 Genomes samples was 42.7% (Matthews correlation coefficient = 0.64). In the ADNI cohort, we imputed missing single nucleotide variants. Conclusion: These results show that our reference alignment and panel can be used to impute missing mtSNVs in existing data obtained using microarrays, thereby broadening the scope of functional and clinical investigation of mtDNA. This improvement may be particularly useful in studies where participants have been recruited over time and mtDNA data obtained using different methods, enabling better integration of early data collected using less accurate methods with more recent sequence data.
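The abstract reports a Matthews correlation coefficient (MCC) without defining it. As a minimal sketch, assuming a simple binary setting (for example, "haplogroup assigned correctly" versus "incorrectly"), the MCC can be computed from a 2×2 confusion matrix as below. The function name and the zero-denominator convention are illustrative assumptions, and the paper may well have used a multiclass form of the statistic.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient from confusion-matrix counts.

    Illustrative helper, not taken from the MitoImpute pipeline.
    Returns 0.0 when the denominator is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    print(matthews_corrcoef(tp=40, tn=45, fp=5, fn=10))
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect agreement), which makes it less sensitive to class imbalance than raw accuracy.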

2019 ◽  
Author(s):  
Tim W McInerney ◽  
Brian Fulton-Howard ◽  
Christopher Patterson ◽  
Devashi Paliwal ◽  
Lars S Jermiin ◽  
...  

Background: Variation in mitochondrial DNA (mtDNA) identified by genotyping microarrays or by sequencing only hypervariable regions of the genome may be insufficient to reliably assign mitochondrial genomes to phylogenetic lineages or haplogroups. This lack of resolution can limit functional and clinical interpretation of a substantial body of existing mtDNA data. To address this limitation, we developed and evaluated a method for imputing missing mtDNA single nucleotide variants (mtSNVs) that uses a large reference alignment of complete mtDNA sequences. The method and reference alignment are combined into a pipeline, which we call MitoImpute. Results: We aligned the sequences of 36,960 complete human mitochondrial genomes downloaded from GenBank, filtered and controlled for quality. These sequences were reformatted for use in the imputation software IMPUTE2. We assessed the imputation accuracy of MitoImpute by measuring haplogroup and genotype concordance in data from the 1000 Genomes Project and the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The mean improvement of haplogroup assignment in the 1000 Genomes samples was 42.7% (Matthews correlation coefficient = 0.64). In the ADNI cohort, we imputed missing single nucleotide variants. Conclusions: These results show that our reference alignment and panel can be used to impute missing mtSNVs in existing data obtained using microarrays, thereby broadening the scope of functional and clinical investigation of mtDNA. This improvement may be particularly useful in studies where participants have been recruited over time and mtDNA data obtained using different methods, enabling better integration of early data collected using less accurate methods with more recent sequence data.


2020 ◽  
Vol 6 (22) ◽  
pp. eaaz7835 ◽  
Author(s):  
Sungwon Jeon ◽  
Youngjune Bhak ◽  
Yeonsong Choi ◽  
Yeonsu Jeon ◽  
Seunghoon Kim ◽  
...  

We present the initial phase of the Korean Genome Project (Korea1K), including 1094 whole genomes (sequenced at an average depth of 31×), along with data of 79 quantitative clinical traits. We identified 39 million single-nucleotide variants and indels of which half were singleton or doubleton and detected Korean-specific patterns based on several types of genomic variations. A genome-wide association study illustrated the power of whole-genome sequences for analyzing clinical traits, identifying nine more significant candidate alleles than previously reported from the same linkage disequilibrium blocks. Also, Korea1K, as a reference, showed better imputation accuracy for Koreans than the 1KGP panel. As proof of utility, germline variants in cancer samples could be filtered out more effectively when the Korea1K variome was used as a panel of normals compared to non-Korean variome sets. Overall, this study shows that Korea1K can be a useful genotypic and phenotypic resource for clinical and ethnogenetic studies.


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Alexia L. Weeks ◽  
Richard W. Francis ◽  
Joao I. C. F. Neri ◽  
Nathaly M. C. Costa ◽  
Nivea M. R. Arrais ◽  
...  

Exome sequencing is widely used in the diagnosis of rare genetic diseases and provides useful variant data for analysis of complex diseases. There is not always adequate population-specific reference data to assist in assigning a diagnostic variant to a specific clinical condition. Here we provide a catalogue of variants called after sequencing the exomes of 45 babies from Rio Grande do Norte in Brazil. Sequence data were processed using an ‘intersect-then-combine’ (ITC) approach, using GATK and SAMtools to call variants. A total of 612,761 variants were identified in at least one individual in this Brazilian cohort, including 559,448 single nucleotide variants (SNVs) and 53,313 insertions/deletions. Of these, 58,111 overlapped with nonsynonymous (nsSNVs) or splice site (ssSNVs) SNVs in dbNSFP. As an aid to clinical diagnosis of rare diseases, we used the American College of Medical Genetics and Genomics (ACMG) guidelines to assign pathogenic/likely pathogenic status to 185 (0.32%) of the 58,111 nsSNVs and ssSNVs. Our data set provides a useful reference point for diagnosis of rare diseases in Brazil.
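The abstract names an ‘intersect-then-combine’ approach but does not spell it out. One plausible reading, sketched here under that assumption, is that calls from the two callers are first intersected within each sample and the agreed calls are then combined (unioned) across samples. The data layout, type alias and function name below are hypothetical and do not reflect the actual GATK or SAMtools output formats.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def intersect_then_combine(
    calls_by_sample: Dict[str, Dict[str, Set[Variant]]]
) -> Set[Variant]:
    """Keep variants that every caller agrees on within a sample,
    then take the union of those agreed calls across all samples."""
    combined: Set[Variant] = set()
    for by_caller in calls_by_sample.values():
        caller_sets = list(by_caller.values())
        if not caller_sets:
            continue  # no calls recorded for this sample
        agreed = set.intersection(*caller_sets)  # intersect step
        combined |= agreed                       # combine step
    return combined


if __name__ == "__main__":
    a = ("1", 100, "A", "G")
    b = ("1", 200, "C", "T")
    c = ("2", 300, "G", "A")
    calls = {
        "sample1": {"gatk": {a, b}, "samtools": {b, c}},
        "sample2": {"gatk": {c}, "samtools": {c}},
    }
    print(sorted(intersect_then_combine(calls)))
```

Intersecting first trades some sensitivity for specificity within each sample, while the later union keeps any variant that was confidently seen in at least one individual.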


2021 ◽  
Author(s):  
Kishwar Shafin ◽  
Trevor Pesout ◽  
Pi-Chuan Chang ◽  
Maria Nattestad ◽  
Alexey Kolesnikov ◽  
...  

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data has demonstrated a long read length, but current interpretation methods for its novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single nucleotide variant identification method at the whole-genome scale and produces high-quality single nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with performance superior to the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio-HiFi-polished).


2019 ◽  
Vol 36 (7) ◽  
pp. 2295-2297
Author(s):  
Christina Nieuwoudt ◽  
Angela Brooks-Wilson ◽  
Jinko Graham

Summary: We present the R package SimRVSequences to simulate sequence data for pedigrees. SimRVSequences allows for simulations of large numbers of single-nucleotide variants (SNVs) and scales well with increasing numbers of pedigrees. Users provide a sample of pedigrees and SNV data from a sample of unrelated individuals. Availability and implementation: SimRVSequences is publicly available on CRAN https://cran.r-project.org/web/packages/SimRVSequences/. Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Matthew A. Field ◽  
Gaetan Burgio ◽  
Jalila Al Shekaili ◽  
Simon J. Foote ◽  
Matthew C. Cook ◽  
...  

Identification of sequence variation from short-read sequence data is subject to common-yet-intermittent miscalling that occurs in a sequence-intrinsic manner. We identify that recurrent false positive single nucleotide variants are strongly present in databases of human sequence variation and demonstrate how each individual sample generates a unique set of recurrent false positive variants. These recurrent miscalls result from known difficulties aligning short-read sequence data between redundant genomic regions. We could replicate, catalogue and remove three quarters of these recurrent miscalls for any given exome with as few as ten rounds of read resampling, realignment and recalling. The removal of such misleading variants reduces the search space for identification of disease-causing variants.

List of abbreviations: SNV, single nucleotide variant; RFP, recurrent false positive; ENU, N-ethyl-N-nitrosourea.


2021 ◽  
Author(s):  
Hamid Reza Ghanavi ◽  
Victoria Twort ◽  
Tobias Joannes Hartman ◽  
Reza Zahiri ◽  
Niklas Wahlberg

The use of molecular data to study the evolutionary history of different organisms revolutionized the field of systematics. Now, with the appearance of high-throughput sequencing (HTS) technologies, more and more genetic sequence data are available. One of the important sources of genetic data for phylogenetic analyses has been mitochondrial DNA. The limitations of mitochondrial DNA for the study of phylogenetic relationships were thoroughly explored in the age of single-locus phylogenies. Now, with the appearance of genomic-scale data, more and more mitochondrial genomes are available. Here we assemble 47 mitochondrial genomes using whole-genome Illumina short reads of representatives of the family Erebidae (Lepidoptera), in order to evaluate the accuracy of mitochondrial genome application in resolving deep phylogenetic relationships. We find that mitogenomes are inadequate for resolving subfamily-level relationships in Erebidae, but given good taxon sampling, we see their potential in resolving lower-level phylogenetic relationships.


2014 ◽  
Vol 16 (1) ◽  
pp. 75-88 ◽  
Author(s):  
David H. Spencer ◽  
Manoj Tyagi ◽  
Francesco Vallania ◽  
Andrew J. Bredemeyer ◽  
John D. Pfeifer ◽  
...  

F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 1336 ◽  
Author(s):  
Gianmarco Contino ◽  
Matthew D. Eldridge ◽  
Maria Secrier ◽  
Lawrence Bower ◽  
Rachael Fels Elliott ◽  
...  

Esophageal adenocarcinoma (EAC) is highly mutated and molecularly heterogeneous. The number of cell lines available for study is limited and their genome has been only partially characterized. The availability of an accurate annotation of their mutational landscape is crucial for accurate experimental design and correct interpretation of genotype-phenotype findings. We performed high coverage, paired end whole genome sequencing on eight EAC cell lines—ESO26, ESO51, FLO-1, JH-EsoAd1, OACM5.1 C, OACP4 C, OE33, SK-GT-4—all verified against original patient material, and one esophageal high grade dysplasia cell line, CP-D. We have made available the aligned sequence data and report single nucleotide variants (SNVs), small insertions and deletions (indels), and copy number alterations, identified by comparison with the human reference genome and known single nucleotide polymorphisms (SNPs). We compare these putative mutations to mutations found in primary tissue EAC samples, to inform the use of these cell lines as a model of EAC.


2016 ◽  
Author(s):  
Michael A. Eberle ◽  
Epameinondas Fritzilas ◽  
Peter Krusche ◽  
Morten Källberg ◽  
Benjamin L. Moore ◽  
...  

Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalogue of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of seventeen individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased “platinum” variant catalogue of 4.7 million single nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and eleven children of this pedigree. Platinum genotypes are highly concordant with the current catalogue of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%), and add a validated truth catalogue that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission (“non-platinum”) revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.
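The idea of keeping only variants "consistent with the pattern of inheritance" can be illustrated with a much simpler genotype-level check. The sketch below tests whether a child's unphased diploid genotype could arise from one allele of each parent. It is an assumption-laden simplification with a hypothetical function name, not the haplotype-transmission method used in the study, which works on phased haplotypes across the whole pedigree.

```python
from itertools import product
from typing import Tuple

Genotype = Tuple[str, str]  # unphased diploid genotype, e.g. ("A", "G")


def is_mendelian_consistent(mother: Genotype, father: Genotype, child: Genotype) -> bool:
    """Return True if the child could inherit one allele from each parent.

    Allele order is ignored because the genotypes are treated as unphased.
    """
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


if __name__ == "__main__":
    print(is_mendelian_consistent(("A", "A"), ("A", "G"), ("A", "G")))
    print(is_mendelian_consistent(("A", "A"), ("A", "G"), ("G", "G")))
```

A site that fails such a check in a trio is a candidate genotyping error or de novo event, which is in the same spirit as the "non-platinum" category described in the abstract.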

