Transcriptome assembly from long-read RNA-seq alignments with StringTie2

Genome Biology ◽

10.1186/s13059-019-1910-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 54

Author(s):

Sam Kovaka ◽

Aleksey V. Zimin ◽

Geo M. Pertea ◽

Roham Razaghi ◽

Steven L. Salzberg ◽

...

Keyword(s):

Single Molecule ◽

Transcriptome Assembly ◽

Rna Seq ◽

Ability To Work ◽

Single Molecule Sequencing ◽

Short Read ◽

New Methods ◽

Long Reads ◽

Long Read

Abstract. RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.

QAlign: Aligning nanopore reads accurately using current-level modeling

10.1101/862813 ◽

2019 ◽

Author(s):

Dhaivat Joshi ◽

Shunfu Mao ◽

Sreeram Kannan ◽

Suhas Diggavi

Keyword(s):

Reference Genome ◽

Genomic Analysis ◽

Vital Role ◽

High Error Rate ◽

Sequencing Technology ◽

Long Reads ◽

A Genome ◽

Long Read ◽

Nanopore Sequencer ◽

Sequencing Process

Abstract. Motivation: Efficient and accurate alignment of DNA/RNA sequence reads to each other or to a reference genome/transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology, and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Properly utilizing the noise and error characteristics inherent in the sequencing process can play a vital role in constructing a robust aligner. In this paper, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome/transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running them through a sequence aligner. Results: We show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2%, 2.5% and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets. Availability: https://github.com/joshidhaivat/QAlign.git
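The conversion step described in the QAlign abstract can be illustrated with a small sketch. This is not QAlign's code: the per-base values, the averaging "current model", the k-mer size, and the thresholds below are all invented for illustration, whereas a real pipeline would presumably look up measured current levels from a nanopore pore-model table. The sketch only shows the general shape of the idea, which is that each k-mer is mapped to a current-like value and that value is binned into a small alphabet.

```python
import bisect

# Hypothetical per-base contributions; invented for illustration only,
# NOT values taken from any real nanopore pore model.
TOY_BASE_LEVEL = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}


def toy_kmer_current(kmer):
    """Return a made-up 'current level' for a k-mer (mean of base values)."""
    return sum(TOY_BASE_LEVEL[base] for base in kmer) / len(kmer)


def quantize_read(seq, k=3, thresholds=(1.0, 2.0)):
    """Convert a nucleotide read into a string of discretized level symbols.

    Each overlapping k-mer is mapped to its toy current level, and the level
    is binned by the sorted thresholds: two thresholds give symbols 0, 1, 2.
    """
    symbols = []
    for i in range(len(seq) - k + 1):
        level = toy_kmer_current(seq[i:i + k])
        symbols.append(str(bisect.bisect_right(thresholds, level)))
    return "".join(symbols)


if __name__ == "__main__":
    print(quantize_read("AAACCCGGGTTT"))  # prints 0001112222
```

In this toy setting both the reads and the reference would be passed through `quantize_read` before being handed to an ordinary aligner, so that sequences producing similar current levels map to the same symbol.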

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

10.1101/008003 ◽

2014 ◽

Cited By ~ 13

Author(s):

Konstantin Berlin ◽

Sergey Koren ◽

Chen-Shan Chin ◽

James Drake ◽

Jane M Landolin ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Locality Sensitive Hashing ◽

Model Organisms ◽

Smrt Sequencing ◽

High Coverage ◽

Celera Assembler ◽

Single Molecule Sequencing ◽

Long Reads ◽

Long Read

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
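The MinHash idea behind MHAP can be sketched as follows. This is a minimal illustration of generic MinHash over k-mer sets, not the MHAP implementation: the hash function (seeded BLAKE2b), the k-mer size, and the sketch size are assumptions chosen for brevity. The fraction of positions at which two sketches agree estimates the Jaccard similarity of the underlying k-mer sets, which lets likely overlaps be found without aligning every pair of reads.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """64-bit hash of a k-mer under a given seed (illustrative choice)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=8, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all k-mers."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching sketch positions estimates Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, k=8):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    read_a = "ACGTTGCAAGCTTAGGCTAACGTTAGCATCGGATCGATTACG"
    read_b = "TTGCAAGCTTAGGCTAACGTTAGCATCGGATCGATTACGGGCA"
    est = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
    print(f"estimated={est:.2f} exact={exact_jaccard(read_a, read_b):.2f}")
```

A larger `num_hashes` reduces the variance of the estimate at the cost of a longer sketch; the sketch size stays fixed regardless of read length, which is what makes the comparison cheap for long reads.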

Comparative Transcriptome Profiling of Disruptive Technology, Single-Molecule Direct RNA Sequencing

Current Bioinformatics ◽

10.2174/1574893614666191017154427 ◽

2020 ◽

Vol 15 (2) ◽

pp. 165-172

Author(s):

Chaithra Pradeep ◽

Dharam Nandan ◽

Arya A. Das ◽

Dinesh Velayutham

Keyword(s):

Rna Sequencing ◽

Single Molecule ◽

Transcriptome Assembly ◽

Transcriptome Profiling ◽

Read Length ◽

Complex Nature ◽

Disruptive Technology ◽

Sequencing Technology ◽

Sequencing Technologies ◽

Long Read

Background: The standard approach for transcriptomic profiling involves high-throughput short-read sequencing technology, mainly dominated by Illumina. However, short reads have limitations in transcriptome assembly and in obtaining full-length transcripts, owing to the complex nature of transcriptomes with variable lengths and multiple alternatively spliced isoforms. Recent advances in long-read sequencing by Oxford Nanopore Technologies (ONT) offer both cDNA and direct RNA sequencing and have brought a paradigm change in sequencing technology, greatly improving assembly and expression estimates. ONT enables molecules to be sequenced without fragmentation, resulting in ultra-long reads that allow entire genes and transcripts to be fully characterized. The direct RNA sequencing method, in addition, circumvents the reverse transcription and amplification steps. Objective: In this study, RNA sequencing methods were assessed by comparing data from Illumina (ILM), ONT cDNA (OCD) and ONT direct RNA (ODR). Methods: The sensitivity and specificity of isoform detection were determined from the data generated by Illumina, ONT cDNA and ONT direct RNA sequencing technologies, using Saccharomyces cerevisiae as a model. Comparative studies were conducted with two pipelines to detect isoforms, novel genes and variable gene length. Results: Mapping metrics and qualitative profiles for the different pipelines are presented to understand these disruptive technologies. The variability in sequencing technology and in the analysis pipeline was studied.

lra: the Long Read Aligner for Sequences and Contigs

10.1101/2020.11.15.383273 ◽

2020 ◽

Author(s):

Jingwen Ren ◽

Mark JP Chaisson

Keyword(s):

Single Molecule ◽

De Novo ◽

Data Types ◽

Single Molecule Sequencing ◽

Detection Algorithms ◽

Link Type ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Linear Cost

Abstract. Motivation: It is computationally challenging to detect variation by aligning long reads from single-molecule sequencing (SMS) instruments, or megabase-scale contigs from SMS assemblies. One approach to efficiently align long sequences is sparse dynamic programming (SDP), where exact matches are found between the sequence and the genome, and optimal chains of matches are found representing a rough alignment. Sequence variation is more accurately modeled when alignments are scored with a gap penalty that is a convex function of the gap length. Because previous implementations of SDP used a linear-cost gap function that does not accurately model variation, and implementations of alignment that have a convex gap penalty are either inefficient or use heuristics, we developed a method, lra, that uses SDP with a convex-cost gap penalty. We use lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. Results: Across all data types, the runtime of lra is between 52-168% of the state-of-the-art aligner minimap2 when generating SAM alignment, and 9-15% of an alternative method, ngmlr. This alignment approach may be used to provide additional evidence of SV calls in PacBio datasets, and an increase in sensitivity and specificity on ONT data with current SV detection algorithms. The number of calls discovered using pbsv with lra alignments is within 98.3-98.6% of calls made from minimap2 alignments on the same data, and gives a nominal 0.2-0.4% increase in F1 score by Truvari analysis. On ONT data with SV called using Sniffles, the number of calls made from lra alignments is 3% greater than minimap2-based calls, and 30% greater than ngmlr-based calls, with a 4.6-5.5% increase in Truvari F1 score. When applied to calling variation from de novo assembly contigs, there is a 5.8% increase in SV calls compared to minimap2+paftools, with a 4.3% increase in Truvari F1 score. Availability and implementation: Available in bioconda: https://anaconda.org/bioconda/lra and github: https://github.com/ChaissonLab/[email protected], [email protected]
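The chaining step that the lra abstract describes (exact matches joined into an optimal chain, with a penalty on the gap between consecutive matches) can be sketched as below. This is a simple quadratic-time illustration under assumed conventions, not lra's sparse dynamic programming algorithm: anchors are hypothetical `(query_pos, target_pos, length)` tuples, the gap is taken as the difference in diagonal between consecutive anchors, and the gap cost is passed in as a function. The abstract does not state the exact convex function lra uses, so the example run uses a plain linear cost; any other function of gap length could be passed in the same way.

```python
def chain_anchors(anchors, gap_cost):
    """Find the highest-scoring colinear chain of exact-match anchors.

    anchors:  iterable of (query_pos, target_pos, length) tuples.
    gap_cost: function mapping a non-negative gap length to a penalty.
    Returns (best_score, chain), where chain is a list of anchors in order.
    """
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return 0.0, []

    score = [float(a[2]) for a in anchors]  # best chain score ending at each anchor
    prev = [-1] * n                         # back-pointer to the previous anchor

    for j in range(n):
        qj, tj, lj = anchors[j]
        for i in range(j):
            qi, ti, li = anchors[i]
            # Anchor i must end before anchor j starts in both sequences.
            if qi + li <= qj and ti + li <= tj:
                # Gap length: difference between query and target spacing.
                gap = abs((qj - (qi + li)) - (tj - (ti + li)))
                candidate = score[i] + lj - gap_cost(gap)
                if candidate > score[j]:
                    score[j] = candidate
                    prev[j] = i

    best = max(range(n), key=score.__getitem__)
    best_score = score[best]
    chain = []
    while best != -1:
        chain.append(anchors[best])
        best = prev[best]
    chain.reverse()
    return best_score, chain


if __name__ == "__main__":
    toy_anchors = [(0, 0, 10), (12, 12, 10), (15, 40, 10), (30, 32, 10)]
    linear = lambda gap: 0.5 * gap  # illustrative linear gap cost
    print(chain_anchors(toy_anchors, linear))
```

With the toy anchors above, the off-diagonal anchor `(15, 40, 10)` is left out of the chain because joining it would incur a large gap penalty. The choice of `gap_cost` controls how readily the chain accepts such jumps, which is where the linear versus convex distinction in the abstract comes in.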

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract. Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect the nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes, and they typically switch from one haplotype to another. Herein we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.

Single-molecule long-read sequencing reveals a conserved selection mechanism determining intact long RNA and miRNA profiles in sperm

10.1101/2020.05.28.122382 ◽

2020 ◽

Author(s):

Yu H. Sun ◽

Anqi Wang ◽

Chi Song ◽

Rajesh K. Srivastava ◽

Kin Fai Au ◽

...

Keyword(s):

Single Molecule ◽

Ribosomal Proteins ◽

Selection Process ◽

Future Research ◽

Selection Mechanism ◽

Bioinformatics Pipeline ◽

Long Reads ◽

Evolutionarily Conserved ◽

Long Read ◽

Early Trauma

Abstract. Sperm contributes diverse RNAs to the zygote. While sperm small RNAs have been shown to be shaped by paternal environments and impact offspring phenotypes, we know little about long RNAs in sperm, including mRNAs and long non-coding RNAs. Here, by integrating PacBio single-molecule long reads with Illumina short reads, we found 2,778 sperm intact long transcript (SpILT) species in mouse. The SpILT profile is evolutionarily conserved between rodents and primates. mRNAs encoding ribosomal proteins are enriched in SpILTs, and in mice they are sensitive to early trauma. Mouse and human SpILT profiles are determined by a post-transcriptional selection process during spermiogenesis, and are co-retained in sperm with base pair-complementary miRNAs. In sum, we have developed a bioinformatics pipeline to define intact transcripts, added SpILTs into the “sperm RNA code” for use in future research and potential diagnosis, and uncovered selection mechanism(s) controlling sperm RNA profiles.

Single-Molecule Long-Read Sequencing Reveals the Diversity of Full-Length Transcripts in Leaves of Gnetum (Gnetales)

International Journal of Molecular Sciences ◽

10.3390/ijms20246350 ◽

2019 ◽

Vol 20 (24) ◽

pp. 6350 ◽

Cited By ~ 2

Author(s):

Nan Deng ◽

Chen Hou ◽

Fengfeng Ma ◽

Caixia Liu ◽

Yuxin Tian

Keyword(s):

Single Molecule ◽

Developmental Stages ◽

Alternative Polyadenylation ◽

Full Length ◽

Stomatal Development ◽

Rna Seq ◽

Leaf Transcriptome ◽

Long Read ◽

Non Coding Rnas ◽

A Site

The limitations of RNA sequencing make it difficult to accurately predict alternative splicing (AS) and alternative polyadenylation (APA) events and long non-coding RNAs (lncRNAs), all of which reveal transcriptomic diversity and the complexity of gene regulation. Gnetum, a genus with ambiguous phylogenetic placement in seed plants, has a distinct stomatal structure and photosynthetic characteristics. In this study, a full-length transcriptome of Gnetum luofuense leaves at different developmental stages was sequenced with the latest PacBio Sequel platform. After correction by short reads generated by Illumina RNA-Seq, 80,496 full-length transcripts were obtained, of which 5269 reads were identified as isoforms of novel genes. Additionally, 1660 lncRNAs and 12,998 AS events were detected. In total, 5647 genes in G. luofuense leaves showed APA, with at least one poly(A) site. Moreover, 67 and 30 genes from the bHLH gene family, which plays an important role in stomatal development and photosynthesis, were identified from the G. luofuense genome and leaf transcripts, respectively. This leaf transcriptome supplements the reference genome of G. luofuense, and the AS events and lncRNAs detected provide valuable resources for future studies investigating the low photosynthetic capacity of Gnetum.

Defining Blood Group Gene Reference Alleles by Long-Read Sequencing: Proof of Concept in the ACKR1 Gene Encoding the Duffy Antigens

Transfusion Medicine and Hemotherapy ◽

10.1159/000504584 ◽

2019 ◽

Vol 47 (1) ◽

pp. 23-32 ◽

Cited By ~ 2

Author(s):

Yann Fichou ◽

Isabelle Berlivet ◽

Gaëlle Richard ◽

Christophe Tournamille ◽

Lilian Castilho ◽

...

Keyword(s):

Blood Group ◽

Single Molecule ◽

Pcr Amplification ◽

Null Alleles ◽

Sequencing Technology ◽

Gene Encoding ◽

Next Generation Sequencing Technology ◽

Sequencing Technologies ◽

Long Read ◽

Long Range Pcr

Background: In the novel era of blood group genomics, (re-)defining reference gene/allele sequences of blood group genes has become an important goal, both for diagnostic and research purposes. As novel, potent sequencing technologies are available, we sought to investigate the variability encountered in the three most common alleles of ACKR1, the gene encoding the clinically relevant Duffy antigens, at the haplotype level by a long-read sequencing approach. Materials and Methods: After long-range PCR amplification spanning the whole ACKR1 gene locus (∼2.5 kilobases), amplicons generated from 81 samples with known genotypes were sequenced in a single read by using the Pacific Biosciences (PacBio) single-molecule, real-time (SMRT) sequencing technology. Results: High-quality sequencing reads were obtained for the 162 alleles (accuracy >0.999). Twenty-two nucleotide variations reported in databases were identified, defining 19 haplotypes: four, eight, and seven haplotypes in 46 ACKR1*01, 63 ACKR1*02, and 53 ACKR1*02N.01 alleles, respectively. Discussion: Overall, we have defined a subset of reference alleles by third-generation (long-read) sequencing. This technology, which provides a “longitudinal” overview of the loci of interest (several thousand base pairs) and is complementary to the second-generation (short-read) next-generation sequencing technology, is of critical interest for resolving novel, rare, and null alleles.

Contig annotation tool CAT robustly classifies assembled metagenomic contigs and long sequences

10.1101/072868 ◽

2016 ◽

Cited By ~ 13

Author(s):

Diego D. Cambuy ◽

Felipe H. Coutinho ◽

Bas E. Dutilh

Keyword(s):

Single Molecule ◽

Dna Sequences ◽

Taxonomic Classification ◽

Annotation Tool ◽

Single Molecule Sequencing ◽

Short Read ◽

Long Read ◽

Micro Organisms ◽

Taxonomic Annotation

Abstract. In modern-day metagenomics, there is an increasing need for robust taxonomic annotation of long DNA sequences from unknown micro-organisms. Long metagenomic sequences may be derived from assembly of short-read metagenomes, or from long-read single-molecule sequencing. Here we introduce CAT, a pipeline for robust taxonomic classification of long DNA sequences. We show that CAT correctly classifies contigs at different taxonomic levels, even in simulated metagenomic datasets that are very distantly related to the sequences in the database. CAT is implemented in Python and the required scripts can be freely downloaded from GitHub.