A globally diverse reference alignment and panel for imputation of mitochondrial DNA variants

Abstract. Background: Variation in mitochondrial DNA (mtDNA) identified by genotyping microarrays or by sequencing only the hypervariable regions of the genome may be insufficient to reliably assign mitochondrial genomes to phylogenetic lineages or haplogroups. This lack of resolution can limit functional and clinical interpretation of a substantial body of existing mtDNA data. To address this limitation, we developed and evaluated a large, curated reference alignment of complete mtDNA sequences as part of a pipeline for imputing missing mtDNA single nucleotide variants (mtSNVs). We call our reference alignment and pipeline MitoImpute. Results: We aligned the sequences of 36,960 complete human mitochondrial genomes downloaded from GenBank, filtered and controlled for quality. These sequences were reformatted for use in the imputation software IMPUTE2. We assessed the imputation accuracy of MitoImpute by measuring haplogroup and genotype concordance in data from the 1000 Genomes Project and the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The mean improvement of haplogroup assignment in the 1000 Genomes samples was 42.7% (Matthews correlation coefficient = 0.64). In the ADNI cohort, we imputed missing single nucleotide variants. Conclusion: These results show that our reference alignment and panel can be used to impute missing mtSNVs in existing data obtained using microarrays, thereby broadening the scope of functional and clinical investigation of mtDNA. This improvement may be particularly useful in studies where participants have been recruited over time and mtDNA data have been obtained using different methods, enabling better integration of early data collected using less accurate methods with more recent sequence data.

A globally diverse reference alignment and panel for imputation of mitochondrial DNA variants

10.1101/649293 ◽

2019 ◽

Cited By ~ 1

Author(s):

Tim W McInerney ◽

Brian Fulton-Howard ◽

Christopher Patterson ◽

Devashi Paliwal ◽

Lars S Jermiin ◽

...

Keyword(s):

Mitochondrial DNA ◽

Sequence Data ◽

Clinical Investigation ◽

Imputation Accuracy ◽

Reference Alignment ◽

Mitochondrial Genomes ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Phylogenetic Lineages ◽

Matthews Correlation Coefficient

Abstract. Background: Variation in mitochondrial DNA (mtDNA) identified by genotyping microarrays or by sequencing only hypervariable regions of the genome may be insufficient to reliably assign mitochondrial genomes to phylogenetic lineages or haplogroups. This lack of resolution can limit functional and clinical interpretation of a substantial body of existing mtDNA data. To address this limitation, we developed and evaluated a method for imputing missing mtDNA single nucleotide variants (mtSNVs) that uses a large reference alignment of complete mtDNA sequences. The method and reference alignment are combined into a pipeline, which we call MitoImpute. Results: We aligned the sequences of 36,960 complete human mitochondrial genomes downloaded from GenBank, filtered and controlled for quality. These sequences were reformatted for use in the imputation software IMPUTE2. We assessed the imputation accuracy of MitoImpute by measuring haplogroup and genotype concordance in data from the 1000 Genomes Project and the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The mean improvement of haplogroup assignment in the 1000 Genomes samples was 42.7% (Matthews correlation coefficient = 0.64). In the ADNI cohort, we imputed missing single nucleotide variants. Conclusions: These results show that our reference alignment and panel can be used to impute missing mtSNVs in existing data obtained using microarrays, thereby broadening the scope of functional and clinical investigation of mtDNA. This improvement may be particularly useful in studies where participants have been recruited over time and mtDNA data have been obtained using different methods, enabling better integration of early data collected using less accurate methods with more recent sequence data.
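
For readers unfamiliar with the Matthews correlation coefficient quoted above, the following is a minimal sketch of its standard two-class definition. The abstract does not state how the coefficient was computed for haplogroup assignment (it may well be a multi-class variant), so the function name and all counts here are illustrative assumptions, not the authors' method or data.

```python
import math


def matthews_corrcoef_binary(tp: int, tn: int, fp: int, fn: int) -> float:
    """Two-class Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention when
    one row or column of the confusion matrix is empty).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, chosen only to show the range of the statistic.
perfect = matthews_corrcoef_binary(tp=50, tn=50, fp=0, fn=0)   # all calls correct
inverted = matthews_corrcoef_binary(tp=0, tn=0, fp=50, fn=50)  # all calls wrong
mixed = matthews_corrcoef_binary(tp=40, tn=45, fp=5, fn=10)    # mostly correct
print(perfect, inverted, round(mixed, 3))
```

The coefficient ranges from -1 (complete disagreement) through 0 (no better than chance) to +1 (perfect agreement), which is why it is often preferred to raw accuracy when classes are unbalanced.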

Korean Genome Project: 1094 Korean personal genomes with clinical information

Science Advances ◽

10.1126/sciadv.aaz7835 ◽

2020 ◽

Vol 6 (22) ◽

pp. eaaz7835 ◽

Cited By ~ 2

Author(s):

Sungwon Jeon ◽

Youngjune Bhak ◽

Yeonsong Choi ◽

Yeonsu Jeon ◽

Seunghoon Kim ◽

...

Keyword(s):

Genome Wide Association Study ◽

Imputation Accuracy ◽

Clinical Information ◽

Genome Project ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Genome Wide ◽

A Genome ◽

Whole Genomes ◽

Personal Genomes

We present the initial phase of the Korean Genome Project (Korea1K), including 1094 whole genomes (sequenced at an average depth of 31×), along with data of 79 quantitative clinical traits. We identified 39 million single-nucleotide variants and indels of which half were singleton or doubleton and detected Korean-specific patterns based on several types of genomic variations. A genome-wide association study illustrated the power of whole-genome sequences for analyzing clinical traits, identifying nine more significant candidate alleles than previously reported from the same linkage disequilibrium blocks. Also, Korea1K, as a reference, showed better imputation accuracy for Koreans than the 1KGP panel. As proof of utility, germline variants in cancer samples could be filtered out more effectively when the Korea1K variome was used as a panel of normals compared to non-Korean variome sets. Overall, this study shows that Korea1K can be a useful genotypic and phenotypic resource for clinical and ethnogenetic studies.

Reference exome data for a Northern Brazilian population

Scientific Data ◽

10.1038/s41597-020-00703-y ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Alexia L. Weeks ◽

Richard W. Francis ◽

Joao I. C. F. Neri ◽

Nathaly M. C. Costa ◽

Nivea M. R. Arrais ◽

...

Keyword(s):

Rare Diseases ◽

Sequence Data ◽

Genetic Diseases ◽

Specific Reference ◽

Single Nucleotide Variants ◽

Data Set ◽

Single Nucleotide ◽

Rare Genetic Diseases ◽

Genetics And Genomics ◽

Brazilian Cohort

Abstract. Exome sequencing is widely used in the diagnosis of rare genetic diseases and provides useful variant data for analysis of complex diseases. There is not always adequate population-specific reference data to assist in assigning a diagnostic variant to a specific clinical condition. Here we provide a catalogue of variants called after sequencing the exomes of 45 babies from Rio Grande do Norte in Brazil. Sequence data were processed using an ‘intersect-then-combine’ (ITC) approach, using GATK and SAMtools to call variants. A total of 612,761 variants were identified in at least one individual in this Brazilian cohort, including 559,448 single nucleotide variants (SNVs) and 53,313 insertions/deletions. Of these, 58,111 overlapped with nonsynonymous (nsSNVs) or splice site (ssSNVs) SNVs in dbNSFP. As an aid to clinical diagnosis of rare diseases, we used the American College of Medical Genetics and Genomics (ACMG) guidelines to assign pathogenic/likely pathogenic status to 185 (0.32%) of the 58,111 nsSNVs and ssSNVs. Our data set provides a useful reference point for diagnosis of rare diseases in Brazil.
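
The counts quoted in this abstract can be cross-checked with simple arithmetic. The short sketch below only re-derives the total and the percentage from the figures given above. The variable names are illustrative, and nothing here reproduces the authors' variant-calling pipeline.

```python
# Figures copied from the abstract above.
snv_count = 559_448
indel_count = 53_313
reported_total = 612_761
annotated_snvs = 58_111        # nsSNVs and ssSNVs overlapping dbNSFP
pathogenic_or_likely = 185

# SNVs plus insertions/deletions should equal the reported total.
computed_total = snv_count + indel_count

# Share of annotated SNVs given pathogenic/likely pathogenic status, in percent.
pathogenic_pct = 100 * pathogenic_or_likely / annotated_snvs

print(computed_total == reported_total, round(pathogenic_pct, 2))
```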

Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks

10.1101/2021.03.04.433952 ◽

2021 ◽

Author(s):

Kishwar Shafin ◽

Trevor Pesout ◽

Pi-Chuan Chang ◽

Maria Nattestad ◽

Alexey Kolesnikov ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Variant Calling ◽

High Accuracy ◽

Superior Performance ◽

Read Length ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Short Read ◽

Long Read

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data has demonstrated a long read length, but current interpretation methods for its novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single nucleotide variant identification method at the whole-genome scale and produces high-quality single nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with performance superior to the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio-HiFi-polished).

SimRVSequences: an R package to simulate genetic sequence data for pedigrees

Bioinformatics ◽

10.1093/bioinformatics/btz881 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2295-2297

Author(s):

Christina Nieuwoudt ◽

Angela Brooks-Wilson ◽

Jinko Graham

Keyword(s):

Sequence Data ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Genetic Sequence ◽

Large Numbers

Abstract. Summary: We present the R package SimRVSequences to simulate sequence data for pedigrees. SimRVSequences allows for simulations of large numbers of single-nucleotide variants (SNVs) and scales well with increasing numbers of pedigrees. Users provide a sample of pedigrees and SNV data from a sample of unrelated individuals. Availability and implementation: SimRVSequences is publicly available on CRAN at https://cran.r-project.org/web/packages/SimRVSequences/. Supplementary information: Supplementary data are available at Bioinformatics online.

Cleaning clinical genomic data: Simple identification and removal of recurrently miscalled variants in single genomes

10.1101/237107 ◽

2017 ◽

Author(s):

Matthew A. Field ◽

Gaetan Burgio ◽

Jalila Al Shekaili ◽

Simon J. Foote ◽

Matthew C. Cook ◽

...

Keyword(s):

False Positive ◽

Sequence Variation ◽

Sequence Data ◽

Search Space ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Short Read ◽

Human Sequence ◽

Short Read Sequence ◽

Genomic Regions

Abstract. Identification of sequence variation from short-read sequence data is subject to common-yet-intermittent miscalling that occurs in a sequence-intrinsic manner. We identify that recurrent false positive single nucleotide variants are strongly present in databases of human sequence variation and demonstrate how each individual sample generates a unique set of recurrent false positive variants. These recurrent miscalls result from known difficulties aligning short-read sequence data between redundant genomic regions. We could replicate, catalogue and remove three quarters of these recurrent miscalls for any given exome with as little as ten rounds of read resampling, realignment and recalling. The removal of such misleading variants reduces the search space for identification of disease-causing variants. List of abbreviations: SNV, single nucleotide variant; RFP, recurrent false positive; ENU, N-ethyl-N-nitrosourea.

The (non) accuracy of mitochondrial genomes for family level phylogenetics: the case of erebid moths (Lepidoptera; Erebidae)

10.1101/2021.07.14.452330 ◽

2021 ◽

Author(s):

Hamid Reza Ghanavi ◽

Victoria Twort ◽

Tobias Joannes Hartman ◽

Reza Zahiri ◽

Niklas Wahlberg

Keyword(s):

Mitochondrial DNA ◽

Phylogenetic Relationships ◽

High Throughput Sequencing ◽

Sequence Data ◽

Phylogenetic Analyses ◽

Molecular Data ◽

Mitochondrial Genomes ◽

The Family ◽

History Of ◽

Genomic Scale

The use of molecular data to study the evolutionary history of different organisms has revolutionized the field of systematics. Now, with the appearance of high-throughput sequencing (HTS) technologies, more and more genetic sequence data are available. One of the important sources of genetic data for phylogenetic analyses has been mitochondrial DNA. The limitations of mitochondrial DNA for the study of phylogenetic relationships were thoroughly explored in the age of single-locus phylogenies. Now, with the appearance of genomic-scale data, more and more mitochondrial genomes are available. Here we assemble 47 mitochondrial genomes using whole-genome Illumina short reads of representatives of the family Erebidae (Lepidoptera), in order to evaluate the accuracy of mitochondrial genomes in resolving deep phylogenetic relationships. We find that mitogenomes are inadequate for resolving subfamily-level relationships in Erebidae, but given good taxon sampling, we see their potential in resolving lower-level phylogenetic relationships.

Performance of Common Analysis Methods for Detecting Low-Frequency Single Nucleotide Variants in Targeted Next-Generation Sequence Data

Journal of Molecular Diagnostics ◽

10.1016/j.jmoldx.2013.09.003 ◽

2014 ◽

Vol 16 (1) ◽

pp. 75-88 ◽

Cited By ~ 82

Author(s):

David H. Spencer ◽

Manoj Tyagi ◽

Francesco Vallania ◽

Andrew J. Bredemeyer ◽

John D. Pfeifer ◽

...

Keyword(s):

Sequence Data ◽

Low Frequency ◽

Next Generation ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Analysis Methods

Whole-genome sequencing of nine esophageal adenocarcinoma cell lines

F1000Research ◽

10.12688/f1000research.7033.1 ◽

2016 ◽

Vol 5 ◽

pp. 1336 ◽

Cited By ~ 8

Author(s):

Gianmarco Contino ◽

Matthew D. Eldridge ◽

Maria Secrier ◽

Lawrence Bower ◽

Rachael Fels Elliott ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Esophageal Adenocarcinoma ◽

Cell Lines ◽

Genome Sequencing ◽

Sequence Data ◽

Whole Genome ◽

Nucleotide Polymorphisms ◽

Single Nucleotide Variants ◽

High Coverage ◽

Single Nucleotide

Esophageal adenocarcinoma (EAC) is highly mutated and molecularly heterogeneous. The number of cell lines available for study is limited and their genomes have been only partially characterized. An accurate annotation of their mutational landscape is crucial for sound experimental design and correct interpretation of genotype-phenotype findings. We performed high-coverage, paired-end whole genome sequencing on eight EAC cell lines (ESO26, ESO51, FLO-1, JH-EsoAd1, OACM5.1 C, OACP4 C, OE33, SK-GT-4), all verified against original patient material, and one esophageal high-grade dysplasia cell line, CP-D. We have made available the aligned sequence data and report single nucleotide variants (SNVs), small insertions and deletions (indels), and copy number alterations, identified by comparison with the human reference genome and known single nucleotide polymorphisms (SNPs). We compare these putative mutations to mutations found in primary tissue EAC samples, to inform the use of these cell lines as a model of EAC.

A reference dataset of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree

10.1101/055541 ◽

2016 ◽

Cited By ~ 18

Author(s):

Michael A. Eberle ◽

Epameinondas Fritzilas ◽

Peter Krusche ◽

Morten Källberg ◽

Benjamin L. Moore ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Objective Assessment ◽

Variant Calling ◽

Whole Genome Sequence ◽

Reference Dataset ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Genome Wide ◽

Transmission Information

Abstract. Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalogue of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of seventeen individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased “platinum” variant catalogue of 4.7 million single nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and eleven children of this pedigree. Platinum genotypes are highly concordant with the current catalogue of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%), and add a validated truth catalogue that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission (“non-platinum”) revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.
