Illuminating the dark side of the human transcriptome with long read transcript sequencing

2020 ◽  
Author(s):  
Richard Kuo ◽  
Yuanyuan Cheng ◽  
Runxuan Zhang ◽  
John W.S. Brown ◽  
Jacqueline Smith ◽  
...  

Abstract Background: The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein coding genes. Accurate high-throughput long read transcript sequencing can now provide additional evidence for rare transcripts and genes, such as mono-exonic and non-coding genes, that were previously either undetectable or impossible to differentiate from sequencing noise. Results: We developed the Transcriptome Annotation by Modular Algorithms (TAMA) software to leverage the power of long read transcript sequencing and to address the issues with current data processing pipelines. TAMA achieved high sensitivity and precision for gene and transcript model predictions in both reference guided and unguided approaches in our benchmark tests using simulated Pacific Biosciences (PacBio) and Nanopore sequencing data and real PacBio datasets. By analyzing PacBio Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using TAMA and other commonly used tools, we found that the convention of using alignment identity to measure error correction performance does not reflect the actual gain in accuracy of predicted transcript models. In addition, inter-read error correction can cause major changes to read mapping, resulting in potentially over 6,000 erroneous gene model predictions in the Iso-Seq based human genome annotation. Using TAMA's genome assembly based error correction and gene feature evidence, we predicted 2,566 putative novel non-coding genes and 1,557 putative novel protein coding gene models. Conclusions: Long read transcript sequencing data has the power to identify novel genes within the highly annotated human genome. The use of parameter tuning and the extensive output information of the TAMA software package allows for in-depth exploration of eukaryotic transcriptomes.
We have found long read evidence for thousands of unannotated genes within the human genome. More development in sequencing library preparation and data processing is required to differentiate sequencing noise from real genes in long read RNA sequencing data.
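The abstract's point that alignment identity can improve without the predicted transcript model improving can be made concrete with a small sketch. This is illustrative toy code, not TAMA's implementation; the counts and exon coordinates are invented.

```python
# Toy illustration (not TAMA code): error correction can raise a read's
# alignment identity while still shifting an exon boundary, so identity
# alone does not measure transcript-model accuracy.

def alignment_identity(matches: int, mismatches: int,
                       insertions: int, deletions: int) -> float:
    """Fraction of aligned columns that are exact base matches."""
    aligned = matches + mismatches + insertions + deletions
    return matches / aligned if aligned else 0.0

def same_transcript_model(exons_a, exons_b) -> bool:
    """Two transcript models agree only if every exon boundary matches."""
    return exons_a == exons_b

# Raw read: 92% identity, with the correct splice junctions.
raw_identity = alignment_identity(920, 50, 15, 15)
# "Corrected" read: 99% identity, but one exon boundary has shifted.
corrected_identity = alignment_identity(990, 4, 3, 3)

raw_model = [(100, 250), (400, 600)]
corrected_model = [(100, 250), (395, 600)]  # junction moved by 5 bp

print(round(raw_identity, 2), round(corrected_identity, 2))      # 0.92 0.99
print(same_transcript_model(raw_model, corrected_model))         # False
```

Identity rose from 0.92 to 0.99, yet the gene model the corrected read implies is wrong, which is the mismatch between the two metrics that the benchmark highlights.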



BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Richard I. Kuo ◽  
Yuanyuan Cheng ◽  
Runxuan Zhang ◽  
John W. S. Brown ◽  
Jacqueline Smith ◽  
...  



2019 ◽  
Author(s):  
Richard I. Kuo ◽  
Yuanyuan Cheng ◽  
Jacqueline Smith ◽  
Alan L. Archibald ◽  
David W. Burt

Abstract The human transcriptome is among the best annotated of any eukaryotic species. However, limitations in technology have biased discovery toward protein coding spliced genes. Accurate high-throughput long read RNA sequencing now has the potential to investigate genes that were previously undetectable. Using our Transcriptome Annotation by Modular Algorithms (TAMA) tool kit to analyze the Pacific Biosciences Universal Human Reference RNA Sequel II Iso-Seq dataset, we discovered thousands of potential novel genes and identified challenges in both RNA preparation and long read data processing that have major implications for transcriptome annotation.


2020 ◽  
Author(s):  
Richard Kuo ◽  
Yuanyuan Cheng ◽  
Runxuan Zhang ◽  
John W.S. Brown ◽  
Jacqueline Smith ◽  
...  

Abstract Background The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein coding genes. Accurate high-throughput long read transcript sequencing can now provide stronger evidence for genes that were previously either undetectable or impossible to differentiate from sequencing noise, such as rare transcripts, mono-exonic genes, and non-coding genes. Results We analyzed Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using the Transcriptome Annotation by Modular Algorithms (TAMA) software. We found that the convention of using mapping identity to measure error correction performance does not reflect the actual gain in accuracy of predicted transcript models. In addition, inter-read error correction leads to thousands of erroneous gene models. Using genome assembly based error correction and gene feature evidence, we identified thousands of potentially functional novel genes. Conclusions The standard practice of inter-read error correction for long read RNA sequencing data could be responsible for genome annotations containing thousands of biologically inaccurate gene models. More than half of all real genes in the human genome may still be missing from current public annotations. Better methods are required for differentiating sequencing noise from real genes in long read RNA sequencing data.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Guillaume Holley ◽  
Doruk Beyter ◽  
Helga Ingimundardottir ◽  
Peter L. Møller ◽  
Snædis Kristmundsdottir ◽  
...  

Abstract A major challenge with long read sequencing data is its high error rate of up to 15%. We present Ratatosk, a method to correct long reads with short read data. We demonstrate on 5 human genome trios that Ratatosk reduces the error rate of long reads 6-fold on average, with a median error rate as low as 0.22%. SNP calls in Ratatosk-corrected reads are nearly 99% accurate, and indel call accuracy is increased by up to 37%. An assembly of Ratatosk-corrected reads from an Ashkenazi individual yields a contig N50 of 45 Mbp and fewer misassemblies than a PacBio HiFi read assembly.
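The contig N50 quoted above is a standard assembly-contiguity statistic; a minimal sketch of how it is computed (toy lengths, not the paper's data):

```python
# Contig N50: the length of the shortest contig in the smallest set of
# longest contigs that together cover at least half the assembly.

def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 if empty)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached half the total assembly size
            return length
    return 0

print(n50([50, 40, 30, 20, 10]))  # 40: the 50+40 Mbp contigs cover >= half of 150
```

So "contig N50 of 45 Mbp" means half of the assembled sequence lies in contigs at least 45 Mbp long.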


2017 ◽  
Author(s):  
Jia-Xing Yue ◽  
Gianni Liti

Abstract Long-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with a small genome and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level, end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for S. cerevisiae, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organism. Applying LRSDAY to an S. cerevisiae strain takes ∼43 hrs to generate a complete and well-annotated genome from ∼100X Pacific Biosciences (PacBio) reads using four threads.


2021 ◽  
Author(s):  
Barış Ekim ◽  
Bonnie Berger ◽  
Rayan Chikhi

DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the problem of assembling such reads into genomes, which poses challenges in terms of accuracy and computational resources when using cutting-edge assembly approaches, e.g. those based on overlapping reads using minimizer sketches. Here, we introduce the concept of minimizer-space sequencing data analysis, where the minimizers, rather than DNA nucleotides, are the atomic tokens of the alphabet. By projecting DNA sequences into ordered lists of minimizers, our key idea is to enumerate what we call k-min-mers: k-mers over a larger alphabet consisting of minimizer tokens. Our approach, mdBG or minimizer-dBG, achieves orders-of-magnitude improvements in both speed and memory usage over existing methods without much loss of accuracy. We demonstrate three use cases of mdBG: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, we implemented mdBG in software we call rust-mdbg, resulting in ultra-fast, low-memory and highly contiguous assembly of PacBio HiFi reads. A human genome is assembled in under 10 minutes using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 minutes using 1 GB RAM. For pangenome graphs, we newly enable a graphical representation of a collection of 661,405 bacterial genomes as an mdBG and successfully search it (in minimizer-space) for anti-microbial resistance (AMR) genes. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics and pangenomics.
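The minimizer-space idea described above can be sketched in a few lines: project a DNA string onto its ordered list of window minimizers, then enumerate k-min-mers over those tokens. This is a hedged toy sketch, not the rust-mdbg implementation; the m, w and k parameters and the lexicographic ordering are illustrative choices.

```python
# Toy minimizer-space sketch (not rust-mdbg): window minimizers as
# tokens, then k-min-mers = k-mers over the minimizer alphabet.

def minimizers(seq: str, m: int = 3, w: int = 4):
    """For each window of w consecutive m-mers, keep the lexicographically
    smallest one; deduplicate consecutive picks at the same position."""
    mins, last_pos = [], -1
    for i in range(len(seq) - m - w + 2):
        window = [(seq[j:j + m], j) for j in range(i, i + w)]
        mer, pos = min(window)  # ties broken by leftmost position
        if pos != last_pos:
            mins.append(mer)
            last_pos = pos
    return mins

def k_min_mers(mins, k: int = 3):
    """k-mers over the ordered list of minimizer tokens."""
    return [tuple(mins[i:i + k]) for i in range(len(mins) - k + 1)]

tokens = minimizers("ACGTACGGTACGTTACG")
print(tokens)
print(k_min_mers(tokens, k=2))
```

Because each minimizer token summarizes a stretch of bases, graphs and overlaps computed in minimizer space handle far fewer symbols than nucleotide space, which is where the speed and memory gains come from.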


2019 ◽  
Author(s):  
Mark T. W. Ebbert ◽  
Tanner D. Jensen ◽  
Karen Jansen-West ◽  
Jonathon P. Sens ◽  
Joseph S. Reddy ◽  
...  

Abstract Background The human genome contains 'dark' gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions that are 'dark by depth' (few mappable reads) and others that are 'camouflaged' (ambiguous alignment), and we assess how well long-read technologies resolve these regions. We further present an algorithm to resolve most camouflaged regions (including in short-read data) and apply it to the Alzheimer's Disease Sequencing Project (ADSP; 13,142 samples) as a proof of principle. Results Based on standard whole-genome Illumina sequencing data, we identified 37,873 dark regions in 5,857 gene bodies (3,635 protein-coding) from pathways important to human health, development, and reproduction. Of the 5,857 gene bodies, 494 (8.4%) were 100% dark (142 protein-coding) and 2,046 (34.9%) were ≥5% dark (628 protein-coding). Exactly 2,757 dark regions were in protein-coding exons (CDS) across 744 genes. Long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduced dark CDS regions to approximately 45.1%, 33.3%, and 18.2%, respectively. Applying our algorithm to the ADSP, we rescued 4,622 exonic variants from 501 camouflaged genes, including a rare, ten-nucleotide frameshift deletion in CR1, a top Alzheimer's disease gene, found in only five ADSP cases and zero controls. Conclusions While we could not formally assess the CR1 frameshift mutation in Alzheimer's disease (insufficient sample size), we believe it merits investigation in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.
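The "dark by depth" notion above (few mappable reads) can be illustrated by scanning a per-base coverage track for low-depth stretches. A minimal sketch, with an invented threshold and toy depth track; the paper's actual criteria may differ:

```python
# Toy "dark by depth" scan (illustrative thresholds, not the paper's):
# report contiguous stretches where read depth falls below a cutoff.

def dark_by_depth(depths, min_depth: int = 5, min_len: int = 3):
    """Return (start, end) half-open intervals where depth < min_depth
    for at least min_len consecutive bases."""
    regions, start = [], None
    for i, d in enumerate(depths):
        if d < min_depth:
            if start is None:
                start = i          # entering a low-coverage stretch
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None           # leaving the stretch
    if start is not None and len(depths) - start >= min_len:
        regions.append((start, len(depths)))  # stretch runs to the end
    return regions

track = [30, 28, 2, 1, 0, 0, 25, 3, 27, 1, 1, 2, 2]
print(dark_by_depth(track))  # [(2, 6), (9, 13)]
```

Camouflaged regions are the harder case: reads map there, but to several near-identical loci at once, so depth alone cannot flag them.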


2021 ◽  
Author(s):  
Elvisa Mehinovic ◽  
Teddi Gray ◽  
Meghan Campbell ◽  
Jenny Ekholm ◽  
Aaron Wenger ◽  
...  

ABSTRACT Currently, protein-coding de novo variants and large copy number variants have been identified as important for ∼30% of individuals with autism. One approach to identify relevant variation in individuals who lack these types of events is to utilize newer genomic technologies. In this study, highly accurate PacBio HiFi long-read sequencing was applied to a family with autism, treatment-refractory epilepsy, cognitive impairment, and mild dysmorphic features (two affected female full siblings, parents, and one unaffected sibling) with no known clinical variant. From our long-read sequencing data, a de novo missense variant in the KCNC2 gene (encoding the Kv3.2 protein) was identified in both affected children. This variant was phased to the paternal chromosome of origin and is likely a germline mosaic. In silico assessment of the variant revealed it was in the top 0.05% of all conserved bases in the genome, and it was predicted to be damaging by PolyPhen-2, MutationTaster, and SIFT. It was not present in any controls from public genome databases, nor in a joint-call set we generated across 49 individuals with publicly available PacBio HiFi data. This specific missense mutation (Val473Ala) has been shown in both an ortholog and a paralog of Kv3.2 to accelerate current decay, shift the voltage dependence of activation, and prevent the channel from entering a long-lasting open state. Seven additional missense mutations have been identified in other individuals with neurodevelopmental disorders (p = 1.03 × 10⁻⁵). KCNC2 is most highly expressed in the brain, in particular the thalamus, and is enriched in GABAergic neurons. Long-read sequencing was useful in discovering the relevant variant in this family with autism that had remained a mystery for several years, and it will potentially have great benefits in the clinic once it is widely available.


2019 ◽  
Vol 21 (4) ◽  
pp. 1164-1181 ◽  
Author(s):  
Leandro Lima ◽  
Camille Marchet ◽  
Ségolène Caboche ◽  
Corinne Da Silva ◽  
Benjamin Istace ◽  
...  

Abstract Motivation Nanopore long-read sequencing technology offers promising alternatives to high-throughput short read sequencing, especially in the context of RNA-sequencing. However, this technology is currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries, and open reading frames, and the creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed, and options for the error correction of Nanopore RNA-sequencing long reads remain limited. Results In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that reports not only classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform, and splice site detection. We find that long read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error correction tools should be used, depending on the application type. Benchmarking software: https://gitlab.com/leoisl/LR_EC_analyser
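One of the benchmark effects named above, bias toward the major isoform, can be sketched as a simple before/after comparison of read-to-isoform assignments. The counts and the specific metric below are illustrative assumptions, not the benchmark's actual data or code:

```python
# Toy sketch of major-isoform bias (invented counts, not LR_EC_analyser):
# if error correction nudges reads toward a gene's dominant isoform, the
# major-isoform fraction rises and minor isoforms lose support.

from collections import Counter

def major_isoform_fraction(assignments):
    """Fraction of reads assigned to the single most abundant isoform."""
    counts = Counter(assignments)
    return max(counts.values()) / len(assignments)

before = ["iso1"] * 60 + ["iso2"] * 25 + ["iso3"] * 15   # pre-correction
after  = ["iso1"] * 80 + ["iso2"] * 15 + ["iso3"] * 5    # post-correction

print(major_isoform_fraction(before))  # 0.6
print(major_isoform_fraction(after))   # 0.8
```

A correction tool that scores well on base accuracy could still shift this fraction substantially, which is why the benchmark reports isoform-level metrics alongside the classical ones.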

