Paragraph: A graph-based structural variant genotyper for short-read sequence data

Mapping Intimacies ◽

10.1101/635011 ◽

2019 ◽

Cited By ~ 5

Author(s):

Sai Chen ◽

Peter Krusche ◽

Egor Dolzhenko ◽

Rachel M. Sherman ◽

Roman Petrovski ◽

...

Keyword(s):

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome ◽

Structural Variations ◽

Short Read ◽

Three Samples ◽

Genomics Research ◽

Long Read ◽

Short Read Sequence ◽

Population Scale

Abstract: Accurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long-read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies.

Paragraph: a graph-based structural variant genotyper for short-read sequence data

Genome Biology ◽

10.1186/s13059-019-1909-7 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 19

Author(s):

Sai Chen ◽

Peter Krusche ◽

Egor Dolzhenko ◽

Rachel M. Sherman ◽

Roman Petrovski ◽

...

Keyword(s):

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome ◽

Structural Variations ◽

Short Read ◽

Three Samples ◽

Genomics Research ◽

Long Read ◽

Short Read Sequence ◽

Population Scale

Abstract: Accurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long-read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies.

Comprehensive analysis of GBA using a novel algorithm for Illumina whole-genome sequence data or targeted Nanopore sequencing

10.1101/2021.11.12.21266253 ◽

2021 ◽

Author(s):

Marco Toffoli ◽

Xiao Chen ◽

Fritz J Sedlazeck ◽

Chiao-Yin Lee ◽

Stephen Mullin ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Copy Number ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Short Read ◽

Increased Risk ◽

Long Read

GBA variants cause the autosomal recessive Gaucher disease, and carriers are at increased risk of Parkinson disease (PD) and Lewy body dementia (LBD). The presence of a highly homologous nearby pseudogene (GBAP1) predisposes to a range of structural variants arising from either gene conversion or reciprocal recombination, the latter resulting in copy number gains or losses, complicating genetic testing and analysis. To date, short-read sequencing has not been able to fully resolve these or other variants in the key homology region, and targeted long-read sequencing has not previously resolved reciprocal recombinants. We present and validate two independent methods to resolve recombinant alleles and other variants in GBA: Gauchian, a novel bioinformatics tool for short-read, whole-genome sequencing data analysis, and Oxford Nanopore long-read sequencing after enrichment with appropriate PCR. The methods were concordant for 42 samples including 30 with a range of recombinants and GBAP1-related mutations, and Gauchian outperforms the GATK Best Practices pipeline. Applying Gauchian to Illumina sequencing of over 10,000 individuals from publicly available cohorts shows that copy number variants (CNVs) spanning GBAP1 are relatively common in Africans. CNV frequencies in PD and LBD are similar to controls, but gains may coexist with other mutations in patients, and a modifying effect cannot be excluded. Gauchian detects a higher frequency of GBA variants in LBD than PD, especially severe ones. These findings highlight the importance of accurate GBA mutation detection in these patients, which is possible by either Gauchian analysis of short-read whole genome sequencing, or targeted long-read sequencing.

Erratum to: Detection and validation of structural variations in bovine whole-genome sequence data

Genetics Selection Evolution ◽

10.1186/s12711-017-0305-6 ◽

2017 ◽

Vol 49 (1) ◽

Author(s):

Long Chen ◽

Amanda J. Chamberlain ◽

Coralie M. Reich ◽

Hans D. Daetwyler ◽

Ben J. Hayes

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome ◽

Structural Variations ◽

Genome Sequence Data

Detection of long repeat expansions from PCR-free whole-genome sequence data

10.1101/093831 ◽

2016 ◽

Cited By ~ 3

Author(s):

Egor Dolzhenko ◽

Joke J.F.A. van Vugt ◽

Richard J. Shaw ◽

Mitchell A. Bekritsky ◽

Marka van Blitterswijk ◽

...

Keyword(s):

Fragile X Syndrome ◽

Sequence Data ◽

Fragile X ◽

Software Tool ◽

Whole Genome Sequence ◽

Read Length ◽

Whole Genome ◽

Wild Type ◽

Short Read ◽

Repeat Expansions

Abstract: Identifying large repeat expansions such as those that cause amyotrophic lateral sclerosis (ALS) and Fragile X syndrome is challenging for short-read (100-150 bp) whole genome sequencing (WGS) data. A solution to this problem is an important step towards integrating WGS into precision medicine. We have developed a software tool called ExpansionHunter that, using PCR-free WGS short-read data, can genotype repeats at the locus of interest, even if the expanded repeat is larger than the read length. We applied our algorithm to WGS data from 3,001 ALS patients who have been tested for the presence of the C9orf72 repeat expansion with repeat-primed PCR (RP-PCR). Taking the RP-PCR calls as the ground truth, our WGS-based method identified pathogenic repeat expansions with 98.1% sensitivity and 99.7% specificity. Further inspection identified that all 11 conflicts were resolved as errors in the original RP-PCR results. Compared against this updated result, ExpansionHunter correctly classified all (212/212) of the expanded samples as either expansions (208) or potential expansions (4). Additionally, 99.9% (2,786/2,789) of the wild type samples were correctly classified as wild type by this method with the remaining two identified as possible expansions. We further applied our algorithm to a set of 144 samples where every sample had one of eight different pathogenic repeat expansions including examples associated with fragile X syndrome, Friedreich’s ataxia and Huntington’s disease and correctly flagged all of the known repeat expansions. Finally, we tested the accuracy of our method for short repeats by comparing our genotypes with results from 860 samples sized using fragment length analysis and determined that our calls were >95% accurate. ExpansionHunter can be used to accurately detect known pathogenic repeat expansions and provides researchers with a tool that can be used to identify new pathogenic repeat expansions.
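The classification rates quoted in this abstract follow from the reported counts. A minimal Python sketch of that arithmetic is shown here; the helper name `percent` is an assumption made for illustration, and only the counts come from the abstract.

```python
def percent(correct, total):
    """Return correct/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 1)


# Counts as reported in the abstract.
# Expanded samples flagged as expansions (208) or potential expansions (4).
expanded_rate = percent(208 + 4, 212)
# Wild type samples classified as wild type.
wild_type_rate = percent(2786, 2789)

print(expanded_rate, wild_type_rate)
```

Rounding to one decimal place matches the precision used in the abstract.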

Using Short Read Sequencing to Characterise Balanced Reciprocal Translocations in Pigs

10.21203/rs.3.rs-28830/v1 ◽

2020 ◽

Author(s):

Aniek Cornelia Bouwman ◽

Martijn F.L. Derks ◽

Marleen L.W.J. Broekhuijse ◽

Barbara Harlizius ◽

Roel F. Veerkamp

Keyword(s):

Sequence Data ◽

Variant Calling ◽

Reciprocal Translocations ◽

Short Read ◽

Short Read Sequencing ◽

Long Read ◽

Short Read Sequence ◽

Staining Techniques ◽

Chromosome Staining ◽

Paired End Sequencing

Background: A balanced constitutional reciprocal translocation (RT) is a mutual exchange of terminal segments of two non-homologous chromosomes without any loss or gain of DNA in germline cells. Carriers of balanced RTs are viable individuals with no apparent phenotypical consequences. However, these animals produce unbalanced gametes and therefore show reduced fertility and offspring with congenital abnormalities. This cytogenetic abnormality is usually detected using chromosome staining techniques. The aim of this study was to test the possibilities of using paired-end short-read sequencing for detection of balanced RTs in boars and to investigate their breakpoints and junctions. Results: Balanced RTs were recovered in a blinded analysis, using the structural variant calling software DELLY, in 6 of the 7 carriers with 30-fold short-read paired-end sequencing. In 15 non-carriers we did not detect any RTs. Reducing the coverage to 20-fold, 15-fold and 10-fold showed that at least 20-fold coverage is required to obtain good results. One RT was not detected in the blind screening; however, a highly likely RT was discovered after unblinding. This RT was located in a repetitive region, showing the limitations of short-read sequence data. The detailed analysis of the breakpoints and junctions suggested three junctions showing microhomology, three junctions with blunt-end ligation, and three micro-insertions at the breakpoint junctions. The detected RTs were also shown to disrupt genes. Conclusions: We conclude that paired-end short-read sequence data can be used to detect and characterize balanced reciprocal translocations if the sequencing depth is at least 20-fold coverage. However, translocations in repetitive areas may require large fragments or even long-read sequence data.

Evaluating the performance of tools used to call minority variants from whole genome short-read data

Wellcome Open Research ◽

10.12688/wellcomeopenres.13538.2 ◽

2018 ◽

Vol 3 ◽

pp. 21 ◽

Cited By ~ 3

Author(s):

Khadija Said Mohammed ◽

Nelson Kibinge ◽

Pjotr Prins ◽

Charles N. Agoti ◽

Matthew Cotten ◽

...

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

False Positive Rate ◽

Low Frequency ◽

Whole Genome Sequence ◽

Whole Genome ◽

Short Read ◽

Genome Sequence Data ◽

Minority Variants ◽

Minority Variant

Background: High-throughput whole genome sequencing facilitates investigation of minority virus sub-populations from virus positive samples. Minority variants are useful in understanding within- and between-host diversity and population dynamics, and can potentially assist in elucidating person-to-person transmission pathways. Several minority variant callers have been developed to describe low frequency sub-populations from whole genome sequence data. These callers differ in the bioinformatics and statistical methods used to discriminate sequencing errors from low-frequency variants. Methods: We evaluated the diagnostic performance and concordance between published minority variant callers used in identifying minority variants from whole-genome sequence data from virus samples. We used the ART-Illumina read simulation tool to generate three artificial short-read datasets of varying coverage and error profiles from an RSV reference genome. The datasets were spiked with nucleotide variants at predetermined positions and frequencies. Variants were called using FreeBayes, LoFreq, Vardict, and VarScan2. The variant callers’ agreement in identifying known variants was quantified using two measures: concordance accuracy and inter-caller concordance. Results: The variant callers differed in identifying minority variants from the datasets. Concordance accuracy and inter-caller concordance were positively correlated with sample coverage. FreeBayes identified the majority of variants, although it showed variable sensitivity and precision as well as a high false positive rate relative to the other minority variant callers, which varied with sample coverage. LoFreq was the most conservative caller. Conclusions: We conducted a performance and concordance evaluation of four minority variant calling tools used to identify and quantify low frequency variants. Inconsistency in the quality of sequenced samples affects the sensitivity and accuracy of minority variant callers. Our study suggests that combining at least three tools when identifying minority variants is useful in filtering errors when calling low frequency variants.
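The recommendation to combine at least three tools can be pictured as a simple consensus filter. The Python sketch below is an illustration only: the function names, the `(position, base)` variant encoding, and the toy call sets are all assumptions made here, not data or code from the study, and it does not reproduce the study's concordance measures.

```python
from collections import Counter


def consensus_calls(calls_by_tool, min_tools=3):
    """Keep variants reported by at least `min_tools` of the callers."""
    counts = Counter(
        variant for calls in calls_by_tool.values() for variant in set(calls)
    )
    return {variant for variant, n in counts.items() if n >= min_tools}


def sensitivity_precision(called, truth):
    """Return (sensitivity, precision) of a call set against known variants."""
    true_positives = len(called & truth)
    sensitivity = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(called) if called else 0.0
    return sensitivity, precision


# Toy data, invented for illustration: variants as (position, alternate base).
truth = {(100, "A"), (250, "T"), (400, "G")}
calls = {
    "FreeBayes": {(100, "A"), (250, "T"), (400, "G"), (777, "C")},
    "LoFreq": {(100, "A"), (250, "T")},
    "VarDict": {(100, "A"), (250, "T"), (777, "C")},
    "VarScan2": {(100, "A"), (400, "G")},
}

consensus = consensus_calls(calls, min_tools=3)
sens, prec = sensitivity_precision(consensus, truth)
print(sorted(consensus), round(sens, 2), round(prec, 2))
```

In this toy example the consensus drops the false call supported by only two tools, at the cost of also dropping one true variant that only two tools reported, which is the usual trade-off of a stricter vote threshold.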

Genomic analysis of carbapenemase-encoding plasmids from Klebsiella pneumoniae across Europe highlights three major patterns of dissemination

10.1101/2019.12.19.873935 ◽

2019 ◽

Cited By ~ 3

Author(s):

Sophia David ◽

Victoria Cohen ◽

Sandra Reuter ◽

Anna E. Sheppard ◽

Tommaso Giani ◽

...

Keyword(s):

Klebsiella Pneumoniae ◽

Sequence Data ◽

Genomic Analysis ◽

Carbapenem Resistance ◽

Short Read ◽

Primary Mechanism ◽

Carbapenemase Gene ◽

Long Read ◽

Short Read Sequence ◽

Stable Association

Abstract: The incidence of Klebsiella pneumoniae infections that are resistant to carbapenems, a last-line class of antibiotics, has been rapidly increasing. The primary mechanism of carbapenem resistance is production of carbapenemase enzymes, which are most frequently encoded on plasmids by blaOXA-48-like, blaVIM, blaNDM and blaKPC genes. Using short-read sequence data, we previously analysed genomes of 1717 isolates from the K. pneumoniae species complex submitted during the European survey of carbapenemase-producing Enterobacteriaceae (EuSCAPE). Here, we investigated the diversity, prevalence and transmission dynamics of carbapenemase-encoding plasmids using long-read sequencing of representative isolates (n=79) from this collection in combination with short-read data from all isolates. We highlight three major patterns by which carbapenemase genes have disseminated via plasmids. First, blaOXA-48-like genes have spread across diverse lineages primarily via a highly conserved, epidemic pOXA-48-like plasmid. Second, blaVIM and blaNDM genes have spread via transient associations of diverse plasmids with numerous lineages. Third, blaKPC genes have transmitted predominantly by stable association with one clonal lineage (ST258/512) despite frequent mobilisation between pre-existing yet diverse plasmids within the lineage. Despite contrasts in these three modes of carbapenemase gene spread, which can be summarised as using one plasmid/multiple lineages, multiple plasmids/multiple lineages, and multiple plasmids/one lineage, all are underpinned by significant propagation along high-risk clonal lineages.

A Comparison between Hi-C and 10X Genomics Linked Read Sequencing for Whole Genome Phasing in Hanwoo Cattle

Genes ◽

10.3390/genes11030332 ◽

2020 ◽

Vol 11 (3) ◽

pp. 332 ◽

Cited By ~ 1

Author(s):

Krishnamoorthy Srikanth ◽

Jong-Eun Park ◽

Dajeong Lim ◽

Jihye Cha ◽

Sang-Rae Cho ◽

...

Keyword(s):

Sample Preparation ◽

Sequence Data ◽

Whole Genome ◽

Library Preparation ◽

Preparation Methods ◽

Short Read ◽

Long Read ◽

Hanwoo Cattle ◽

Genome Scale ◽

Sample Preparation Methods

Until recently, genome-scale phasing was limited due to the short read sizes of sequence data. Although long-read sequencing can overcome this limitation, it requires extensive error correction. The emergence of technologies such as 10X Genomics linked-read sequencing and Hi-C, which use short-read sequencers along with library preparation protocols that facilitate long-read assemblies, has greatly reduced the complexity of genome-scale phasing. Moreover, it is possible to accurately assemble the phased genome of individual samples using these methods. Therefore, in this study, we compared three phasing strategies, namely 10X-LG, 10X-HapCut2, and HiC-HapCut2, which combined two sample preparation methods with the Long Ranger pipeline of 10X Genomics and the HapCut2 software, and assessed their performance and accuracy. We found that 10X-LG had the best phasing performance among the methods analyzed. It had the highest phasing rate (89.6%), longest adjusted N50 (1.24 Mb), and lowest switch error rate (0.07%). Moreover, the phasing accuracy and yield of 10X-LG stayed over 90% for distances up to 4 Mb and 550 Kb, respectively, which were considerably higher than those of 10X-HapCut2 and HiC-HapCut2. The results of this study will serve as a good reference for future benchmarking studies and also for reference-based imputation in Hanwoo.
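Two of the metrics named in this abstract, N50 and switch error rate, have common textbook definitions that can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: it computes a plain N50 rather than the adjusted N50 reported in the study, the function names are invented here, haplotypes are encoded as strings of `0`/`1` alleles at heterozygous sites, and the example inputs are toy values.

```python
def n50(lengths):
    """Return the plain N50: the largest length L such that blocks of
    length >= L together cover at least half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def switch_error_rate(truth, phased):
    """Fraction of adjacent heterozygous-site pairs whose relative phase
    in `phased` disagrees with `truth` (one haplotype given for each)."""
    if len(truth) != len(phased):
        raise ValueError("haplotypes must have the same number of sites")
    if len(truth) < 2:
        raise ValueError("need at least two sites")
    # True where the phased allele matches the truth haplotype at that site.
    agree = [t == p for t, p in zip(truth, phased)]
    # A switch is a change in agreement between consecutive sites.
    switches = sum(a != b for a, b in zip(agree, agree[1:]))
    return switches / (len(agree) - 1)


# Toy inputs, invented for illustration.
block_lengths = [100, 200, 300, 400]
example_n50 = n50(block_lengths)
example_ser = switch_error_rate("000111", "000000")
print(example_n50, example_ser)
```

A phasing that is the exact complement of the truth haplotype has a switch error rate of zero under this definition, since only changes in relative phase between neighbouring sites are counted.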

Within-species contamination of bacterial whole-genome sequence data has a greater influence on clustering analyses than between-species contamination

Genome Biology ◽

10.1186/s13059-019-1914-x ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 2

Author(s):

Arthur W. Pightling ◽

James B. Pettengill ◽

Yu Wang ◽

Hugh Rand ◽

Errol Strain

Keyword(s):

Escherichia Coli ◽

Single Nucleotide Polymorphism ◽

Sequence Data ◽

Whole Genome Sequence ◽

Single Nucleotide Polymorphism Discovery ◽

Whole Genome ◽

Nucleotide Polymorphism ◽

Single Nucleotide ◽

Short Read ◽

Polymorphism Discovery

Abstract: Although it is assumed that contamination in bacterial whole-genome sequencing causes errors, the influences of contamination on clustering analyses, such as single-nucleotide polymorphism discovery, phylogenetics, and multi-locus sequence typing, have not been quantified. By developing and analyzing 720 Listeria monocytogenes, Salmonella enterica, and Escherichia coli short-read datasets, we demonstrate that within-species contamination causes errors that confound clustering analyses, while between-species contamination generally does not. Contaminant reads mapping to references or becoming incorporated into chimeric sequences during assembly are the sources of those errors. Contamination sufficient to influence clustering analyses is present in public sequence databases.

Detection and assembly of novel sequence insertions using Linked-Read technology

10.1101/551028 ◽

2019 ◽

Cited By ~ 3

Author(s):

Dmitry Meleshko ◽

Patrick Marks ◽

Stephen Williams ◽

Iman Hajirasouliha

Keyword(s):

Dna Sequences ◽

De Novo Assembly ◽

De Novo ◽

Supplementary Information ◽

Computational Techniques ◽

Whole Genome ◽

Structural Variations ◽

Short Read ◽

Link Type ◽

Long Read

Abstract: Motivation: Emerging Linked-Read (aka read-cloud) technologies such as the 10x Genomics Chromium system have great potential for accurate detection and phasing of large-scale human genome structural variations (SVs). By leveraging the long-range information encoded in Linked-Read sequencing, computational techniques are able to detect and characterize complex structural variations that were previously undetectable by short-read methods. However, there is no available Linked-Read method for detection and assembly of novel sequence insertions, DNA sequences present in a given sequenced sample but missing in the reference genome, without requiring whole genome de novo assembly. In this paper, we propose a novel integrated alignment-based and local-assembly-based algorithm, Novel-X, that effectively uses the barcode information encoded in Linked-Read sequencing datasets to improve detection of such events without the need for whole genome de novo assembly. We evaluated our method on two haploid human genomes, CHM1 and CHM13, sequenced on the 10x Genomics Chromium system. These genomes have also recently been characterized with high-coverage PacBio long reads. We also tested our method on NA12878, the well-known HapMap CEPH diploid genome, and the child genome in a Yoruba trio (NA19240), which was recently studied on multiple sequencing platforms. Detecting insertion events is very challenging using short reads, and the only viable available solution is long-read sequencing (e.g. PacBio or ONT). Our experiments, however, show that Novel-X finds many insertions that cannot be found by state-of-the-art tools using short-read sequencing data but are present in PacBio data. Since Linked-Read sequencing is significantly cheaper than long-read sequencing, our method using Linked-Reads enables routine large-scale screenings of sequenced genomes for novel sequence insertions. Availability: Software is freely available at https://github.com/1dayac/[email protected] Supplementary information: Supplementary data are available at https://github.com/1dayac/novel_insertions_supplementary