Will long-read sequencing technologies replace short-read sequencing technologies in the next 10 years?

AbstractBackground: Many neurodegenerative diseases are caused by nucleotide repeat expansions, but most expansions, like the C9orf72 ‘GGGGCC’ (G4C2) repeat that causes approximately 5-7% of all amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD) cases, are too long to sequence using short-read sequencing technologies. It is unclear whether long-read sequencing technologies can traverse these long, challenging repeat expansions. Here, we demonstrate that two long-read sequencing technologies, Pacific Biosciences’ (PacBio) and Oxford Nanopore Technologies’ (ONT), can sequence through disease-causing repeats cloned into plasmids, including the FTD/ALS-causing G4C2 repeat expansion. We also report the first long-read sequencing data characterizing the C9orf72 G4C2 repeat expansion at the nucleotide level in two symptomatic expansion carriers using PacBio whole-genome sequencing and a no-amplification (No-Amp) targeted approach based on CRISPR/Cas9.Results: Both the PacBio and ONT platforms successfully sequenced through the repeat expansions in plasmids. Throughput on the MinlON was a challenge for whole-genome sequencing; we were unable to attain reads covering the human C9orf72 repeat expansion using 15 flow cells. We obtained 8x coverage across the C9orf72 locus using the PacBio Sequel, accurately reporting the unexpanded allele at eight repeats, and reading through the entire expansion with 1324 repeats (7941 nucleotides). Using the No-Amp targeted approach, we attained >800x coverage and were able to identify the unexpanded allele, closely estimate expansion size, and assess nucleotide content in a single experiment. We estimate the individual’s repeat region was >99% G4C2 content, though we cannot rule out small interruptions.Conclusions: Our findings indicate that long-read sequencing is well suited to characterizing known repeat expansions, and for discovering new disease-causing, disease-modifying, or risk-modifying repeat expansions that have gone undetected with conventional short-read sequencing. The PacBio No-Amp targeted approach may have future potential in clinical and genetic counseling environments. Larger and deeper long-read sequencing studies in C9orf72 expansion carriers will be important to determine heterogeneity and whether the repeats are interrupted by non-G4C2 content, potentially mitigating or modifying disease course or age of onset, as interruptions are known to do in other repeat-expansion disorders. These results have broad implications across all diseases where the genetic etiology remains unclear.

Download Full-text

Systematic analysis of dark and camouflaged genes: disease-relevant genes hiding in plain sight

10.1101/514497 ◽

2019 ◽

Cited By ~ 1

Author(s):

Mark T. W. Ebbert ◽

Tanner D. Jensen ◽

Karen Jansen-West ◽

Jonathon P. Sens ◽

Joseph S. Reddy ◽

...

Keyword(s):

Alzheimer’S Disease ◽

Alzheimer's Disease ◽

Sequencing Data ◽

Systematic Analysis ◽

Protein Coding ◽

Short Read ◽

Sequencing Project ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read

AbstractBackgroundThe human genome contains ‘dark’ gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions that are ‘dark by depth’ (few mappable reads) and others that are ‘camouflaged’ (ambiguous alignment), and we assess how well long-read technologies resolve these regions. We further present an algorithm to resolve most camouflaged regions (including in short-read data) and apply it to the Alzheimer’s Disease Sequencing Project (ADSP; 13142 samples), as a proof of principle.ResultsBased on standard whole-genome lllumina sequencing data, we identified 37873 dark regions in 5857 gene bodies (3635 protein-coding) from pathways important to human health, development, and reproduction. Of the 5857 gene bodies, 494 (8.4%) were 100% dark (142 protein-coding) and 2046 (34.9%) were ≥5% dark (628 protein-coding). Exactly 2757 dark regions were in protein-coding exons (CDS) across 744 genes. Long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduced dark CDS regions to approximately 45.1%, 33.3%, and 18.2% respectively. Applying our algorithm to the ADSP, we rescued 4622 exonic variants from 501 camouflaged genes, including a rare, ten-nucleotide frameshift deletion in CR1, a top Alzheimer’s disease gene, found in only five ADSP cases and zero controls.ConclusionsWhile we could not formally assess the CR1 frameshift mutation in Alzheimer’s disease (insufficient sample-size), we believe it merits investigating in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.

Download Full-text

Impact of short-read sequencing on the misassembly of a plant genome

10.21203/rs.3.rs-32139/v2 ◽

2021 ◽

Author(s):

Peipei Wang ◽

Fanrui Meng ◽

Bethany M. Moore ◽

Shin-Han Shiu

Keyword(s):

Great Majority ◽

Plant Genome ◽

High Coverage ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Downstream Analysis ◽

Genomic Regions ◽

Simple Sequence

Abstract Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively.Results: To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions: Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads.

Download Full-text

Impact of short-read sequencing on the misassembly of a plant genome

10.21203/rs.3.rs-32139/v1 ◽

2020 ◽

Author(s):

Peipei Wang ◽

Fanrui Meng ◽

Bethany M. Moore ◽

Shin-Han Shiu

Keyword(s):

Great Majority ◽

Plant Genome ◽

High Coverage ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Downstream Analysis ◽

Genomic Regions ◽

Simple Sequence

Abstract Background Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. Results To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads.

Download Full-text

Impact of short-read sequencing on the misassembly of a plant genome

10.21203/rs.3.rs-32139/v3 ◽

2021 ◽

Author(s):

Peipei Wang ◽

Fanrui Meng ◽

Bethany M. Moore ◽

Shin-Han Shiu

Keyword(s):

Great Majority ◽

Plant Genome ◽

High Coverage ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Downstream Analysis ◽

Genomic Regions ◽

Simple Sequence

Abstract Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively.Results: To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions: Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads and the generality of these causes and factors should be tested further in other species.

Download Full-text

Computational Methods for Chromosome-Scale Haplotype Reconstruction

10.20944/preprints202101.0116.v1 ◽

2021 ◽

Author(s):

Shilpa Garg

Keyword(s):

Genetic Variation ◽

Whole Genome ◽

Haplotype Reconstruction ◽

High Quality ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Evolutionary Studies ◽

Long Read ◽

Haplotype Information

High-quality chromosome-scale haplotype sequences— of diploid genomes, polyploid genomes and metagenomes — provide important insights into genetic variation associated with disease and biodiversity. However, whole-genome short read sequencing does not yield haplotype information that spans whole chromosomes directly. Computational assembly of shorter haplotype fragments is required for haplotype reconstruction, which can be challenging owing to limited fragment lengths and high haplotype and repeat variability across genomes. Recent advancements in long-read and chromosome-scale sequencing technologies, alongside computational innovations, are improving the reconstruction of haplotypes at the level of whole chromosomes. Here, we review recent methodological progress in these areas and discuss perspectives that could enable routine high-quality haplotype reconstruction in clinical and evolutionary studies.

Download Full-text

Computational methods for chromosome-scale haplotype reconstruction

Genome Biology ◽

10.1186/s13059-021-02328-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Shilpa Garg

Keyword(s):

Genetic Variation ◽

Computational Methods ◽

Whole Genome ◽

Haplotype Reconstruction ◽

High Quality ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Haplotype Information

AbstractHigh-quality chromosome-scale haplotype sequences of diploid genomes, polyploid genomes, and metagenomes provide important insights into genetic variation associated with disease and biodiversity. However, whole-genome short read sequencing does not yield haplotype information spanning whole chromosomes directly. Computational assembly of shorter haplotype fragments is required for haplotype reconstruction, which can be challenging owing to limited fragment lengths and high haplotype and repeat variability across genomes. Recent advancements in long-read and chromosome-scale sequencing technologies, alongside computational innovations, are improving the reconstruction of haplotypes at the level of whole chromosomes. Here, we review recent and discuss methodological progress and perspectives in these areas.

Download Full-text

Impact of short-read sequencing on the misassembly of a plant genome

BMC Genomics ◽

10.1186/s12864-021-07397-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Peipei Wang ◽

Fanrui Meng ◽

Bethany M. Moore ◽

Shin-Han Shiu

Keyword(s):

Great Majority ◽

Plant Genome ◽

High Coverage ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Read ◽

Downstream Analysis ◽

Genomic Regions ◽

Simple Sequence

Abstract Background Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. Results To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads and the generality of these causes and factors should be tested further in other species.

Download Full-text

Illumina But With Nanopore: Sequencing Illumina libraries at high accuracy on the ONT MinION using R2C2

10.1101/2021.10.30.466545 ◽

2021 ◽

Author(s):

Alexander Zee ◽

Dori Zhi Qian Deng ◽

Matthew Adams ◽

Kayla D. Schimke ◽

Russell Corbett-Detig ◽

...

Keyword(s):

Tandem Repeats ◽

Illumina Miseq ◽

High Accuracy ◽

Error Rates ◽

Short Read ◽

Rolling Circle ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Real Time Analysis ◽

Long Read

High-throughput short-read sequencing has taken on a central role in research and diagnostics. Literally hundreds of different assays exist today to take advantage of Illumina short-read sequencers, the predominant short-read sequencing technology available today. Although other short read sequencing technologies exist, the ubiquity of Illumina sequencers in sequencing core facilities, and the inertia associated with the research enterprise as a whole have limited their adoption. Among a new generation of sequencing technologies, Oxford Nanopore Technologies (ONT) holds a unique position because the ONT MinION, an error-prone long-read sequencer, is associated with little to no capital cost. Here we show that we can make short-read Illumina libraries compatible with the long-read ONT MinION by circularizing and rolling circle amplifying the short library molecules using the R2C2 method. This results in longer DNA molecules containing tandem repeats of the original short library molecules. This longer DNA is ideally suited for the ONT MinION, and after sequencing, the tandem repeats in the resulting raw reads can be converted into millions of high-accuracy consensus reads with similar error rates to that of the Illumina MiSeq. We highlight this capability by producing and benchmarking RNA-seq, ChIP-seq, as well as regular and target-enriched Tn5 libraries. We also explore the use of this approach for rapid evaluation of sequencing library metrics by implementing a real-time analysis workflow.

Download Full-text

Biosynthetic potential of uncultured Antarctic soil bacteria revealed through long-read metagenomic sequencing

The ISME Journal ◽

10.1038/s41396-021-01052-3 ◽

2021 ◽

Author(s):

Valentin Waschulin ◽

Chiara Borsetto ◽

Robert James ◽

Kevin K. Newsham ◽

Stefano Donadio ◽

...

Keyword(s):

Genome Mining ◽

Gene Clusters ◽

Biosynthetic Gene Cluster ◽

Full Length ◽

Metagenomic Sequencing ◽

Short Read ◽

Short Read Sequencing ◽

Rich Diversity ◽

Long Read ◽

The Rich

AbstractThe growing problem of antibiotic resistance has led to the exploration of uncultured bacteria as potential sources of new antimicrobials. PCR amplicon analyses and short-read sequencing studies of samples from different environments have reported evidence of high biosynthetic gene cluster (BGC) diversity in metagenomes, indicating their potential for producing novel and useful compounds. However, recovering full-length BGC sequences from uncultivated bacteria remains a challenge due to the technological restraints of short-read sequencing, thus making assessment of BGC diversity difficult. Here, long-read sequencing and genome mining were used to recover >1400 mostly full-length BGCs that demonstrate the rich diversity of BGCs from uncultivated lineages present in soil from Mars Oasis, Antarctica. A large number of highly divergent BGCs were not only found in the phyla Acidobacteriota, Verrucomicrobiota and Gemmatimonadota but also in the actinobacterial classes Acidimicrobiia and Thermoleophilia and the gammaproteobacterial order UBA7966. The latter furthermore contained a potential novel family of RiPPs. Our findings underline the biosynthetic potential of underexplored phyla as well as unexplored lineages within seemingly well-studied producer phyla. They also showcase long-read metagenomic sequencing as a promising way to access the untapped genetic reservoir of specialised metabolite gene clusters of the uncultured majority of microbes.

Download Full-text