scholarly journals Detection of long repeat expansions from PCR-free whole-genome sequence data

2016 ◽  
Author(s):  
Egor Dolzhenko ◽  
Joke J.F.A. van Vugt ◽  
Richard J. Shaw ◽  
Mitchell A. Bekritsky ◽  
Marka van Blitterswijk ◽  
...  

AbstractIdentifying large repeat expansions such as those that cause amyotrophic lateral sclerosis (ALS) and Fragile X syndrome is challenging for short-read (100-150 bp) whole genome sequencing (WGS) data. A solution to this problem is an important step towards integrating WGS into precision medicine. We have developed a software tool called ExpansionHunter that, using PCR-free WGS short-read data, can genotype repeats at the locus of interest, even if the expanded repeat is larger than the read length. We applied our algorithm to WGS data from 3,001 ALS patients who have been tested for the presence of the C9orf72 repeat expansion with repeat-primed PCR (RP-PCR). Taking the RP-PCR calls as the ground truth, our WGS-based method identified pathogenic repeat expansions with 98.1% sensitivity and 99.7% specificity. Further inspection identified that all 11 conflicts were resolved as errors in the original RP-PCR results. Compared against this updated result, ExpansionHunter correctly classified all (212/212) of the expanded samples as either expansions (208) or potential expansions (4). Additionally, 99.9% (2,786/2,789) of the wild type samples were correctly classified as wild type by this method with the remaining two identified as possible expansions. We further applied our algorithm to a set of 144 samples where every sample had one of eight different pathogenic repeat expansions including examples associated with fragile X syndrome, Friedreich’s ataxia and Huntington’s disease and correctly flagged all of the known repeat expansions. Finally, we tested the accuracy of our method for short repeats by comparing our genotypes with results from 860 samples sized using fragment length analysis and determined that our calls were >95% accurate. ExpansionHunter can be used to accurately detect known pathogenic repeat expansions and provides researchers with a tool that can be used to identify new pathogenic repeat expansions.


2017 ◽  
Vol 27 (11) ◽  
pp. 1895-1903 ◽  
Author(s):  
Egor Dolzhenko ◽  
Joke J.F.A. van Vugt ◽  
Richard J. Shaw ◽  
Mitchell A. Bekritsky ◽  
Marka van Blitterswijk ◽  
...  


2018 ◽  
Vol 3 ◽  
pp. 21 ◽  
Author(s):  
Khadija Said Mohammed ◽  
Nelson Kibinge ◽  
Pjotr Prins ◽  
Charles N. Agoti ◽  
Matthew Cotten ◽  
...  

Background: High-throughput whole genome sequencing facilitates investigation of minority virus sub-populations from virus positive samples. Minority variants are useful in understanding within and between host diversity, population dynamics and can potentially assist in elucidating person-person transmission pathways. Several minority variant callers have been developed to describe low frequency sub-populations from whole genome sequence data. These callers differ based on bioinformatics and statistical methods used to discriminate sequencing errors from low-frequency variants. Methods: We evaluated the diagnostic performance and concordance between published minority variant callers used in identifying minority variants from whole-genome sequence data from virus samples. We used the ART-Illumina read simulation tool to generate three artificial short-read datasets of varying coverage and error profiles from an RSV reference genome. The datasets were spiked with nucleotide variants at predetermined positions and frequencies. Variants were called using FreeBayes, LoFreq, Vardict, and VarScan2. The variant callers’ agreement in identifying known variants was quantified using two measures; concordance accuracy and the inter-caller concordance. Results: The variant callers reported differences in identifying minority variants from the datasets. Concordance accuracy and inter-caller concordance were positively correlated with sample coverage. FreeBayes identified the majority of variants although it was characterised by variable sensitivity and precision in addition to a high false positive rate relative to the other minority variant callers and which varied with sample coverage. LoFreq was the most conservative caller. Conclusions: We conducted a performance and concordance evaluation of four minority variant calling tools used to identify and quantify low frequency variants. Inconsistency in the quality of sequenced samples impacts on sensitivity and accuracy of minority variant callers. Our study suggests that combining at least three tools when identifying minority variants is useful in filtering errors when calling low frequency variants.



2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Arthur W. Pightling ◽  
James B. Pettengill ◽  
Yu Wang ◽  
Hugh Rand ◽  
Errol Strain

AbstractAlthough it is assumed that contamination in bacterial whole-genome sequencing causes errors, the influences of contamination on clustering analyses, such as single-nucleotide polymorphism discovery, phylogenetics, and multi-locus sequencing typing, have not been quantified. By developing and analyzing 720 Listeria monocytogenes, Salmonella enterica, and Escherichia coli short-read datasets, we demonstrate that within-species contamination causes errors that confound clustering analyses, while between-species contamination generally does not. Contaminant reads mapping to references or becoming incorporated into chimeric sequences during assembly are the sources of those errors. Contamination sufficient to influence clustering analyses is present in public sequence databases.



2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Sai Chen ◽  
Peter Krusche ◽  
Egor Dolzhenko ◽  
Rachel M. Sherman ◽  
Roman Petrovski ◽  
...  

AbstractAccurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long-read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies.



2019 ◽  
Author(s):  
Sai Chen ◽  
Peter Krusche ◽  
Egor Dolzhenko ◽  
Rachel M. Sherman ◽  
Roman Petrovski ◽  
...  

AbstractAccurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies.



2018 ◽  
Vol 3 ◽  
pp. 21
Author(s):  
Khadija Said Mohammed ◽  
Nelson Kibinge ◽  
Pjotr Prins ◽  
Charles N. Agoti ◽  
Matthew Cotten ◽  
...  

Background: High-throughput whole genome sequencing facilitates investigation of minority sub-populations from virus positive samples. Minority variants are useful in understanding within and between host diversity, population dynamics and can potentially help to elucidate person-person transmission chains. Several minority variant callers have been developed to describe the minority variants sub-populations from whole genome sequence data. However, they differ on bioinformatics and statistical approaches used to discriminate sequencing errors from low-frequency variants. Methods: We evaluated the diagnostic performance and concordance between published minority variant callers used in identifying minority variants from whole-genome sequence data. The ART-Illumina read simulation tool was used to generate three artificial short-read datasets of varying coverage and error profiles from an RSV reference genome. The datasets were spiked with nucleotide variants at predetermined positions and frequencies. Variants were called using FreeBayes, LoFreq, Vardict, and VarScan2. The variant callers’ agreement in identifying known variants was quantified using two measures; concordance accuracy and the inter-caller concordance. Results: The variant callers reported differences in identifying minority variants from the datasets. Concordance accuracy and inter-caller concordance were positively correlated with sample coverage. FreeBayes identified majority of the variants although it was characterised by variable sensitivity and precision in addition to a high false positive rate relative to the other minority variant callers and which varied with sample coverage. LoFreq was the most conservative caller. Conclusions: We conducted a performance and concordance evaluation of four minority variant calling tools used to identify and quantify low frequency variants. Inconsistency in the quality of sequenced samples impact on sensitivity and accuracy of minority variant callers. Our study suggests that combining at least three tools when identifying minority variants is useful in filtering errors when calling low frequency variants.



2021 ◽  
Author(s):  
Marco Toffoli ◽  
Xiao Chen ◽  
Fritz J Sedlazeck ◽  
Chiao-Yin Lee ◽  
Stephen Mullin ◽  
...  

GBA variants cause the autosomal recessive Gaucher disease, and carriers are at increased risk of Parkinson disease (PD) and Lewy body dementia (LBD). The presence of a highly homologous nearby pseudogene (GBAP1) predisposes to a range of structural variants arising from either gene conversion or reciprocal recombination, the latter resulting in copy number gains or losses, complicating genetic testing and analysis. To date, short-read sequencing has not been able to fully resolve these or other variants in the key homology region, and targeted long-read sequencing has not previously resolved reciprocal recombinants. We present and validate two independent methods to resolve recombinant alleles and other variants in GBA: Gauchian, a novel bioinformatics tool for short-read, whole-genome sequencing data analysis, and Oxford Nanopore long-read sequencing after enrichment with appropriate PCR. The methods were concordant for 42 samples including 30 with a range of recombinants and GBAP1-related mutations, and Gauchian outperforms the GATK Best Practices pipeline. Applying Gauchian to Illumina sequencing of over 10,000 individuals from publicly available cohorts shows that copy number variants (CNVs) spanning GBAP1 are relatively common in Africans. CNV frequencies in PD and LBD are similar to controls, but gains may coexist with other mutations in patients, and a modifying effect cannot be excluded. Gauchian detects a higher frequency of GBA variants in LBD than PD, especially severe ones. These findings highlight the importance of accurate GBA mutation detection in these patients, which is possible by either Gauchian analysis of short-read whole genome sequencing, or targeted long-read sequencing.



Author(s):  
Amnon Koren ◽  
Dashiell J Massey ◽  
Alexa N Bracci

Abstract Motivation Genomic DNA replicates according to a reproducible spatiotemporal program, with some loci replicating early in S phase while others replicate late. Despite being a central cellular process, DNA replication timing studies have been limited in scale due to technical challenges. Results We present TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole genome sequence data obtained from proliferating cell samples. The presence of replicating cells in a biological specimen leads to non-uniform representation of genomic DNA that depends on the timing of replication of different genomic loci. Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing. It provides a straightforward approach for measuring replication timing and can readily be applied at scale. Availability and Implementation TIGER is available at https://github.com/TheKorenLab/TIGER. Supplementary information Supplementary data are available at Bioinformatics online



Sign in / Sign up

Export Citation Format

Share Document