LinkedSV for detection of mosaic structural variants from linked-read exome and genome sequencing data

AbstractLinked-read sequencing provides long-range information on short-read sequencing data by barcoding reads originating from the same DNA molecule, and can improve the detection and breakpoint identification for structural variants (SVs). We present LinkedSV for SV detection on linked-read sequencing data. LinkedSV considers barcode overlapping and enriched fragment endpoints as signals to detect large SVs, while it leverages read depth, paired-end signals and local assembly to detect small SVs. Benchmarking studies demonstrates that LinkedSV outperforms existing tools, especially on exome data and on somatic SVs with low variant allele frequencies. We demonstrate clinical cases where LinkedSV identifies disease causal SVs from linked-read exome sequencing data missed by conventional exome sequencing, and show examples where LinkedSV identifies SVs missed by high-coverage long-read sequencing. In summary, LinkedSV can detect SVs missed by conventional short-read and long-read sequencing approaches, and may resolve negative cases from clinical genome/exome sequencing studies.

Download Full-text

LinkedSV for detection of mosaic structural variants from linked-read exome and genome sequencing data

Nature Communications ◽

10.1038/s41467-019-13397-7 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 1

Author(s):

Li Fang ◽

Charlly Kao ◽

Michael V. Gonzalez ◽

Fernanda A. Mafra ◽

Renata Pellegrino da Silva ◽

...

Keyword(s):

Exome Sequencing ◽

Read Depth ◽

Structural Variants ◽

Sequencing Data ◽

High Coverage ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Studies ◽

Long Read ◽

Local Assembly

AbstractLinked-read sequencing provides long-range information on short-read sequencing data by barcoding reads originating from the same DNA molecule, and can improve detection and breakpoint identification for structural variants (SVs). Here we present LinkedSV for SV detection on linked-read sequencing data. LinkedSV considers barcode overlapping and enriched fragment endpoints as signals to detect large SVs, while it leverages read depth, paired-end signals and local assembly to detect small SVs. Benchmarking studies demonstrate that LinkedSV outperforms existing tools, especially on exome data and on somatic SVs with low variant allele frequencies. We demonstrate clinical cases where LinkedSV identifies disease-causal SVs from linked-read exome sequencing data missed by conventional exome sequencing, and show examples where LinkedSV identifies SVs missed by high-coverage long-read sequencing. In summary, LinkedSV can detect SVs missed by conventional short-read and long-read sequencing approaches, and may resolve negative cases from clinical genome/exome sequencing studies.

Download Full-text

NextSV: a meta-caller for structural variants from low-coverage long-read sequencing data

10.1101/092544 ◽

2016 ◽

Author(s):

Li Fang ◽

Jiang Hu ◽

Depeng Wang ◽

Kai Wang

Keyword(s):

Whole Genome ◽

Ashkenazi Jewish ◽

Structural Variants ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Human Genomes ◽

Long Read ◽

Personal Genomes ◽

Low Coverage

AbstractBackgroundStructural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, it is unclear what coverage is needed and how to optimally use the aligners and SV callers.ResultsIn this study, we developed NextSV, a meta-caller to perform SV calling from low coverage long-read sequencing data. NextSV integrates three aligners and three SV callers and generates two integrated call sets (sensitive/stringent) for different analysis purposes. We evaluated SV calling performance of NextSV under different PacBio coverages on two personal genomes, NA12878 and HX1. Our results showed that, compared with running any single SV caller, NextSV stringent call set had higher precision and balanced accuracy (F1 score) while NextSV sensitive call set had a higher recall. At 10X coverage, the recall of NextSV sensitive call set was 93.5% to 94.1% for deletions and 87.9% to 93.2% for insertions, indicating that ~10X coverage might be an optimal coverage to use in practice, considering the balance between the sequencing costs and the recall rates. We further evaluated the Mendelian errors on an Ashkenazi Jewish trio dataset.ConclusionsOur results provide useful guidelines for SV detection from low coverage whole-genome PacBio data and we expect that NextSV will facilitate the analysis of SVs on long-read sequencing data.

Download Full-text

Genotyping structural variants in pangenome graphs using the vg toolkit

10.1101/654566 ◽

2019 ◽

Cited By ~ 7

Author(s):

Glenn Hickey ◽

David Heller ◽

Jean Monlong ◽

Jonas A. Sibbesen ◽

Jouni Sirén ◽

...

Keyword(s):

De Novo ◽

State Of The Art ◽

Effective Means ◽

Point Mutations ◽

Structural Variants ◽

Short Read ◽

Yeast Strains ◽

Sequencing Studies ◽

Long Read

AbstractStructural variants (SVs) remain challenging to represent and study relative to point mutations despite their demonstrated importance. We show that variation graphs, as implemented in the vg toolkit, provide an effective means for leveraging SV catalogs for short-read SV genotyping experiments. We benchmarked vg against state-of-the-art SV genotypers using three sequence-resolved SV catalogs generated by recent long-read sequencing studies. In addition, we use assemblies from 12 yeast strains to show that graphs constructed directly from aligned de novo assemblies improve genotyping compared to graphs built from intermediate SV catalogs in the VCF format.

Download Full-text

Rapid Mycobacterium tuberculosis spoligotyping from uncorrected long reads using Galru

10.1101/2020.05.31.126490 ◽

2020 ◽

Author(s):

Andrew J. Page ◽

Nabil-Fareed Alikhan ◽

Michael Strinden ◽

Thanh Le Viet ◽

Timofey Skvortsov

Keyword(s):

Mycobacterium Tuberculosis ◽

State Of The Art ◽

Sequence Data ◽

Human Pathogen ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Long Reads ◽

Long Read

AbstractSpoligotyping of Mycobacterium tuberculosis provides a subspecies classification of this major human pathogen. Spoligotypes can be predicted from short read genome sequencing data; however, no methods exist for long read sequence data such as from Nanopore or PacBio. We present a novel software package Galru, which can rapidly detect the spoligotype of a Mycobacterium tuberculosis sample from as little as a single uncorrected long read. It allows for near real-time spoligotyping from long read data as it is being sequenced, giving rapid sample typing. We compare it to the existing state of the art software and find it performs identically to the results obtained from short read sequencing data. Galru is freely available from https://github.com/quadram-institute-bioscience/galru under the GPLv3 open source licence.

Download Full-text

Complete Genome Sequence of Rubrobacter xylanophilus Strain AA3-22, Isolated from Arima Onsen in Japan

Microbiology Resource Announcements ◽

10.1128/mra.00818-19 ◽

2019 ◽

Vol 8 (34) ◽

Cited By ~ 1

Author(s):

Natsuki Tomariguchi ◽

Kentaro Miyazaki

Keyword(s):

Genome Sequence ◽

Complete Genome Sequence ◽

Complete Genome ◽

Hot Spring ◽

Sequencing Data ◽

Short Read ◽

Content Type ◽

Short Read Sequencing ◽

Oxford Nanopore ◽

Long Read

Rubrobacter xylanophilus strain AA3-22, belonging to the phylum Actinobacteria, was isolated from nonvolcanic Arima Onsen (hot spring) in Japan. Here, we report the complete genome sequence of this organism, which was obtained by combining Oxford Nanopore long-read and Illumina short-read sequencing data.

Download Full-text

Systematic analysis of dark and camouflaged genes: disease-relevant genes hiding in plain sight

10.1101/514497 ◽

2019 ◽

Cited By ~ 1

Author(s):

Mark T. W. Ebbert ◽

Tanner D. Jensen ◽

Karen Jansen-West ◽

Jonathon P. Sens ◽

Joseph S. Reddy ◽

...

Keyword(s):

Alzheimer’S Disease ◽

Alzheimer's Disease ◽

Sequencing Data ◽

Systematic Analysis ◽

Protein Coding ◽

Short Read ◽

Sequencing Project ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read

AbstractBackgroundThe human genome contains ‘dark’ gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions that are ‘dark by depth’ (few mappable reads) and others that are ‘camouflaged’ (ambiguous alignment), and we assess how well long-read technologies resolve these regions. We further present an algorithm to resolve most camouflaged regions (including in short-read data) and apply it to the Alzheimer’s Disease Sequencing Project (ADSP; 13142 samples), as a proof of principle.ResultsBased on standard whole-genome lllumina sequencing data, we identified 37873 dark regions in 5857 gene bodies (3635 protein-coding) from pathways important to human health, development, and reproduction. Of the 5857 gene bodies, 494 (8.4%) were 100% dark (142 protein-coding) and 2046 (34.9%) were ≥5% dark (628 protein-coding). Exactly 2757 dark regions were in protein-coding exons (CDS) across 744 genes. Long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduced dark CDS regions to approximately 45.1%, 33.3%, and 18.2% respectively. Applying our algorithm to the ADSP, we rescued 4622 exonic variants from 501 camouflaged genes, including a rare, ten-nucleotide frameshift deletion in CR1, a top Alzheimer’s disease gene, found in only five ADSP cases and zero controls.ConclusionsWhile we could not formally assess the CR1 frameshift mutation in Alzheimer’s disease (insufficient sample-size), we believe it merits investigating in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.

Download Full-text

Impact of short-read sequencing on the misassembly of a plant genome

10.21203/rs.3.rs-32139/v2 ◽

2021 ◽

Author(s):

Peipei Wang ◽

Fanrui Meng ◽

Bethany M. Moore ◽

Shin-Han Shiu

Keyword(s):

Great Majority ◽

Plant Genome ◽

High Coverage ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Downstream Analysis ◽

Genomic Regions ◽

Simple Sequence

Abstract Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively.Results: To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions: Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads.

Download Full-text

Impact of short-read sequencing on the misassembly of a plant genome

10.21203/rs.3.rs-32139/v1 ◽

2020 ◽

Author(s):

Peipei Wang ◽

Fanrui Meng ◽

Bethany M. Moore ◽

Shin-Han Shiu

Keyword(s):

Great Majority ◽

Plant Genome ◽

High Coverage ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Downstream Analysis ◽

Genomic Regions ◽

Simple Sequence

Abstract Background Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. Results To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads.

Download Full-text

Local adaptation and archaic introgression shape global diversity at human structural variant loci

10.1101/2021.01.26.428314 ◽

2021 ◽

Author(s):

Stephanie M. Yan ◽

Rachel M. Sherman ◽

Dylan J. Taylor ◽

Divya R. Nair ◽

Andrew N. Bortvin ◽

...

Keyword(s):

Local Adaptation ◽

Human Populations ◽

Sequencing Data ◽

High Coverage ◽

Short Read ◽

Asian Populations ◽

Long Read ◽

Archaic Introgression ◽

Immune Related Genes

AbstractLarge genomic insertions, deletions, and inversions are a potent source of functional and fitness-altering variation, but are challenging to resolve with short-read DNA sequencing alone. While recent long-read sequencing technologies have greatly expanded the catalog of structural variants (SVs), their costs have so far precluded their application at population scales. Given these limitations, the role of SVs in human adaptation remains poorly characterized. Here, we used a graph-based approach to genotype 107,866 long-read-discovered SVs in short-read sequencing data from diverse human populations. We then applied an admixture-aware method to scan these SVs for patterns of population-specific frequency differentiation—a signature of local adaptation. We identified 220 SVs exhibiting extreme frequency differentiation, including several SVs that were among the lead variants at their corresponding loci. The top two signatures traced to separate insertion and deletion polymorphisms at the immunoglobulin heavy chain locus, together tagging a 325 Kbp haplotype that swept to high frequency and was subsequently fragmented by recombination. Alleles defining this haplotype are nearly fixed (60-95%) in certain Southeast Asian populations, but are rare or absent from other global populations composing the 1000 Genomes Project. Further investigation revealed that the haplotype closely matches with sequences observed in two of three high-coverage Neanderthal genomes, providing strong evidence of a Neanderthal-introgressed origin. This extraordinary episode of positive selection, which we infer to have occurred between 1700 and 8400 years ago, corroborates the role of immune-related genes as prominent targets of adaptive archaic introgression. Our study demonstrates how combining recent advances in genome sequencing, genotyping algorithms, and population genetic methods can reveal signatures of key evolutionary events that remained hidden within poorly resolved regions of the genome.

Download Full-text

Complete Genome Sequence of Vibrio rotiferianus Strain AM7

Microbiology Resource Announcements ◽

10.1128/mra.01591-19 ◽

2020 ◽

Vol 9 (21) ◽

Author(s):

Kentaro Miyazaki ◽

Apirak Wiseschart ◽

Kusol Pootanakit ◽

Kei Kitahara

Keyword(s):

Genome Sequence ◽

Complete Genome Sequence ◽

Complete Genome ◽

The Novel ◽

Sequencing Data ◽

Short Read ◽

Content Type ◽

Short Read Sequencing ◽

Oxford Nanopore ◽

Long Read

ABSTRACT We isolated the novel strain Vibrio rotiferianus AM7 from the shell of an abalone. In this article, we report the complete genome sequence of this organism, which was obtained by combining Oxford Nanopore long-read and Illumina short-read sequencing data.

Download Full-text