Trans-NanoSim characterizes and simulates nanopore RNA-sequencing data

Abstract Background Compared with second-generation sequencing technologies, third-generation single-molecule RNA sequencing has unprecedented advantages; the long reads it generates facilitate isoform-level transcript characterization. In particular, the Oxford Nanopore Technology sequencing platforms have become more popular in recent years owing to their relatively high affordability and portability compared with other third-generation sequencing technologies. To aid the development of analytical tools that leverage the power of this technology, simulated data provide a cost-effective solution with ground truth. However, a nanopore sequence simulator targeting transcriptomic data is not available yet. Findings We introduce Trans-NanoSim, a tool that simulates reads with technical and transcriptome-specific features learnt from nanopore RNA-sequencing data. We comprehensively benchmarked Trans-NanoSim on direct RNA and complementary DNA datasets describing human and mouse transcriptomes. Through comparison against other nanopore read simulators, we show the unique advantage and robustness of Trans-NanoSim in capturing the characteristics of nanopore complementary DNA and direct RNA reads. Conclusions As a cost-effective alternative to sequencing real transcriptomes, Trans-NanoSim will facilitate the rapid development of analytical tools for nanopore RNA-sequencing data. Trans-NanoSim and its pre-trained models are freely accessible at https://github.com/bcgsc/NanoSim.

RNA Transcriptome Mapping with GraphMap

10.1101/160085 ◽

2017 ◽

Cited By ~ 1

Author(s):

Krešimir Križanović ◽

Ivan Sović ◽

Ivan Krpelnik ◽

Mile Šikić

Keyword(s):

Third Generation ◽

Sequencing Data ◽

Mapping Algorithm ◽

Gene Annotations ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Oxford Nanopore ◽

RNA Mapping ◽

Synthetic Datasets ◽

Generation Sequencing

Abstract Next-generation sequencing (NGS) technologies have made RNA sequencing widely accessible and applicable in many areas of research. In recent years, third-generation sequencing technologies have matured and are slowly replacing NGS for DNA sequencing. This paper presents a novel tool for RNA mapping guided by gene annotations. The tool is an adapted version of a previously developed DNA mapper, GraphMap, tailored for third-generation sequencing data, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies devices. It uses gene annotations to generate a transcriptome, uses a DNA mapping algorithm to map reads to the transcriptome, and finally transforms the mappings back to genome coordinates. The modified version of GraphMap is compared on several synthetic datasets with state-of-the-art RNA-seq mappers that can work with third-generation sequencing data. The results show that our tool outperforms other tools in general mapping quality.
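
The last step described above, transforming a transcriptome mapping back to genome coordinates, can be illustrated with a minimal Python sketch. It is not taken from GraphMap: the function name `transcript_to_genome`, the 0-based half-open exon intervals, and the example coordinates are all assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def transcript_to_genome(exons, strand, tpos):
    """Convert a 0-based transcript offset to a 0-based genomic position.

    exons  -- list of (start, end) genomic intervals, 0-based half-open,
              sorted by ascending genomic start (hypothetical layout)
    strand -- "+" or "-"
    tpos   -- 0-based offset along the spliced transcript (5' to 3')
    """
    # On the minus strand the transcript starts in the rightmost exon.
    ordered = exons if strand == "+" else exons[::-1]
    lengths = [end - start for start, end in ordered]
    ends = list(accumulate(lengths))  # cumulative transcript end offsets
    if not 0 <= tpos < ends[-1]:
        raise ValueError("transcript position outside the transcript")
    i = bisect_right(ends, tpos)               # exon containing tpos
    offset = tpos - (ends[i] - lengths[i])     # offset inside that exon
    start, end = ordered[i]
    return start + offset if strand == "+" else end - 1 - offset


# Hypothetical two-exon transcript.
example_exons = [(100, 110), (200, 220)]
print(transcript_to_genome(example_exons, "+", 10))
```

A full mapper would also have to split an alignment that spans an exon boundary into several genomic blocks; the sketch converts single positions only.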

TRiCoLOR: tandem repeat profiling using whole-genome long-read sequencing data

GigaScience ◽

10.1093/gigascience/giaa101 ◽

2020 ◽

Vol 9 (10) ◽

Cited By ~ 1

Author(s):

Davide Bolognini ◽

Alberto Magi ◽

Vladimir Benes ◽

Jan O Korbel ◽

Tobias Rausch

Keyword(s):

Tandem Repeat ◽

Error Rates ◽

Sequencing Error ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Third Generation ◽

Sequencing Data ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Generation Sequencing

Abstract Background Tandem repeat sequences are widespread in the human genome, and their expansions cause multiple repeat-mediated disorders. Genome-wide discovery approaches are needed to fully elucidate their roles in health and disease, but resolving tandem repeat variation accurately remains a challenging task. While traditional mapping-based approaches using short-read data have severe limitations in the size and type of tandem repeats they can resolve, recent third-generation sequencing technologies exhibit substantially higher sequencing error rates, which complicates repeat resolution. Results We developed TRiCoLOR, a freely available tool for tandem repeat profiling using error-prone long reads from third-generation sequencing technologies. The method can identify repetitive regions in sequencing data without prior knowledge of their motifs or locations and resolve repeat multiplicity and period size in a haplotype-specific manner. The tool includes methods to interactively visualize the identified repeats and to trace their Mendelian consistency in pedigrees. Conclusions TRiCoLOR demonstrates excellent performance and improved sensitivity and specificity compared with alternative tools on synthetic data. For real human whole-genome sequencing data, TRiCoLOR achieves high validation rates, suggesting its suitability to identify tandem repeat variation in personal genomes.

Reference-free reconstruction and quantification of transcriptomes from Nanopore long-read sequencing

10.1101/2020.02.08.939942 ◽

2020 ◽

Author(s):

Ivan de la Rubia ◽

Joel A. Indi ◽

Silvia Carbonell-Sala ◽

Julien Lagarde ◽

M Mar Albà ◽

...

Keyword(s):

Single Molecule ◽

Reference Genome ◽

Simulated Data ◽

Cost Effective ◽

DNA Assembly ◽

Sequencing Data ◽

Consensus Sequences ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read

Abstract Single-molecule long-read sequencing with Nanopore provides an unprecedented opportunity to measure transcriptomes from any sample [1–3]. However, current analysis methods rely on the comparison with a reference genome or transcriptome [2,4,5], or the use of multiple sequencing technologies [6,7], thereby precluding cost-effective studies in species with no genome assembly available, in individuals underrepresented in the existing reference, and for the discovery of disease-specific transcripts not directly identifiable from a reference genome. Methods for DNA assembly [8–10] cannot be directly transferred to transcriptomes since their consensus sequences lack the required interpretability for genes with multiple transcript isoforms. To address these challenges, we have developed RATTLE, the first tool to perform reference-free reconstruction and quantification of transcripts from Nanopore long reads. Using simulated data, isoform spike-ins, and sequencing data from tissues and cell lines, we demonstrate that RATTLE accurately determines transcript sequence and abundance, is comparable to reference-based methods, and shows saturation in the number of predicted transcripts with increasing number of input reads.

IsoDetect: Detection of splice isoforms from third generation long reads based on short feature sequences

Current Bioinformatics ◽

10.2174/1574893615666200316101205 ◽

2020 ◽

Vol 15 ◽

Author(s):

Hongdong Li ◽

Wenjing Zhang ◽

Yuwen Luo ◽

Jianxin Wang

Keyword(s):

Sequence Similarity ◽

Detection Methods ◽

Sequence Information ◽

Third Generation ◽

Sequencing Data ◽

Splice Isoforms ◽

Third Generation Sequencing ◽

Long Reads ◽

Feature Sequence ◽

Generation Sequencing

Aims: Accurately detect isoforms from third-generation sequencing data. Background: Transcriptome annotation is the basis for the analysis of gene expression and regulation. The transcriptome annotation of many organisms, such as humans, is far from complete, due partly to the challenge of identifying isoforms that are produced from the same gene through alternative splicing. Third-generation sequencing (TGS) reads provide an unprecedented opportunity for detecting isoforms because their length exceeds that of most isoforms. One limitation of current TGS read-based isoform detection methods is that they are exclusively based on sequence reads, without incorporating the sequence information of known isoforms. Objective: Develop an efficient method for isoform detection. Method: Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequence at each exon-exon junction is extracted from annotated isoforms as the “short feature sequence”, which is used to distinguish different splice isoforms. Second, we align these feature sequences to long reads and divide the long reads into groups that contain the same set of feature sequences, thereby avoiding pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For the long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms. Result: Tested on two datasets from Calypte anna and zebra finch, IsoDetect showed higher speed and compelling accuracy compared with four existing methods. Conclusion: IsoDetect is a promising method for isoform detection. Other: This paper was accepted by the CBC2019 conference.
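
The first two steps of the method, extracting junction feature sequences and grouping reads by the set of features they contain, can be sketched as follows. This is an illustrative simplification and not the IsoDetect implementation: the helper names, the flank length `k`, and the use of exact substring matching in place of alignment are all assumptions.

```python
from collections import defaultdict


def junction_features(exon_seqs, k=4):
    """Return one feature per exon-exon junction of an annotated isoform:
    the last k bases of the upstream exon joined to the first k bases of
    the downstream exon (k is a hypothetical parameter)."""
    return [up[-k:] + down[:k] for up, down in zip(exon_seqs, exon_seqs[1:])]


def group_reads(reads, features):
    """Group read names by the set of feature sequences each read contains.

    Exact substring search stands in for alignment here, so sequencing
    errors inside a junction would be missed by this sketch.
    """
    groups = defaultdict(list)
    for name, seq in reads.items():
        key = frozenset(f for f in features if f in seq)
        groups[key].append(name)
    return dict(groups)


# Hypothetical gene with three exons and two annotated isoforms.
e1, e2, e3 = "AAAACCCC", "GGGGTTTT", "ACGTACGT"
features = junction_features([e1, e2, e3]) + junction_features([e1, e3])
reads = {
    "r1": e1 + e2 + e3,      # full-length read, includes exon 2
    "r2": e1 + e3,           # read that skips exon 2
    "r3": "CCCCGGGGTTTTACGT",  # fragment spanning both junctions
    "r4": "ACGTACGTAC",      # no junction covered
}
for key, names in group_reads(reads, features).items():
    print(sorted(key), names)
```

Reads that land in the group with an empty feature set correspond to the case the abstract handles separately, by clustering on sequence similarity alone.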

Sequoia: an interactive visual analytics platform for interpretation and feature extraction from nanopore sequencing datasets

BMC Genomics ◽

10.1186/s12864-021-07791-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Ratanond Koonchanok ◽

Swapna Vidhur Daulatabad ◽

Quoseena Mir ◽

Khairi Reda ◽

Sarath Chandra Janga

Keyword(s):

Single Molecule ◽

Visual Analytics ◽

Visual Analysis ◽

Direct Sequencing ◽

Visual Exploration ◽

Nanopore Sequencing ◽

Sequencing Data ◽

RNA Sequences ◽

Sequencing Technologies ◽

Signal Features

Abstract Background Direct-sequencing technologies, such as Oxford Nanopore’s, are delivering long RNA reads with great efficacy and convenience. These technologies afford an ability to detect post-transcriptional modifications at a single-molecule resolution, promising new insights into the functional roles of RNA. However, realizing this potential requires new tools to analyze and explore this type of data. Result Here, we present Sequoia, a visual analytics tool that allows users to interactively explore nanopore sequences. Sequoia combines a Python-based backend with a multi-view visualization interface, enabling users to import raw nanopore sequencing data in Fast5 format, cluster sequences based on electric-current similarities, and drill down into signals to identify properties of interest. We demonstrate the application of Sequoia by generating and analyzing ~500k reads from direct RNA sequencing data of the human HeLa cell line. We focus on comparing signal features from m6A and m5C RNA modifications as the first step towards building automated classifiers. We show how, through iterative visual exploration and tuning of dimensionality reduction parameters, we can separate modified RNA sequences from their unmodified counterparts. We also document new, qualitative signal signatures that distinguish these modifications from otherwise normal RNA bases, which we were able to discover from the visualization. Conclusions Sequoia’s interactive features complement existing computational approaches in nanopore-based RNA workflows. The insights gleaned through visual analysis should help users in developing rationales, hypotheses, and insights into the dynamic nature of RNA. Sequoia is available at https://github.com/dnonatar/Sequoia.

Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm

Bioinformatics ◽

10.1093/bioinformatics/btaa179 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3669-3679 ◽

Cited By ~ 3

Author(s):

Can Firtina ◽

Jeremie S Kim ◽

Mohammed Alser ◽

Damla Senol Cali ◽

A Ercument Cicek ◽

...

Keyword(s):

Genome Analysis ◽

Supplementary Information ◽

Third Generation ◽

Sequencing Technology ◽

Base Pairs ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Generation Sequencing ◽

Large Genomes

Abstract Motivation Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively. Results We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward–Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. 
Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. Availability and implementation Source code is available at https://github.com/CMU-SAFARI/Apollo. Supplementary information Supplementary data are available at Bioinformatics online.
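
Step (iii) above, decoding a trained model with the Viterbi algorithm, can be illustrated on a toy hidden Markov model. The sketch below is a generic log-space Viterbi decoder, not Apollo's profile-HMM code; the two states, the emission alphabet, and all probabilities are made-up values, and every probability is assumed to be strictly positive so that the logarithms are defined.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observation sequence.

    Works in log space to avoid underflow; assumes every probability
    used is strictly positive.
    """
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in states}
    back = []  # one dict of back-pointers per observation after the first
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            # Best predecessor state for being in state s at this step.
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            score[s] = (prev[best] + math.log(trans_p[best][s])
                        + math.log(emit_p[s][o]))
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical two-state model: state A mostly emits "x", state B mostly "y".
states = ["A", "B"]
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
print(viterbi("xxyy", states, start_p, trans_p, emit_p))
```

In a profile HMM the states would be match, insertion, and deletion states laid out along the assembly rather than two free-standing states, but the dynamic-programming recurrence is the same.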

Measuring evolutionary cancer dynamics from genome sequencing, one patient at a time

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2020-0075 ◽

2020 ◽

Vol 0 (0) ◽

Author(s):

Giulio Caravagna

Keyword(s):

Genome Sequencing ◽

Cancer Evolution ◽

Sequencing Data ◽

Evolutionary Forces ◽

Sequencing Technologies ◽

Cancer Genome Sequencing ◽

Multiple Resolutions ◽

Multiple Patients ◽

Single Tumour ◽

Generation Sequencing

Abstract Cancers progress through the accumulation of somatic mutations which accrue during tumour evolution, allowing some cells to proliferate in an uncontrolled fashion. This growth process is intimately related to latent evolutionary forces moulding the genetic and epigenetic composition of tumour subpopulations. Understanding cancer therefore requires understanding these selective pressures. The widespread adoption of next-generation sequencing technologies opens up the possibility of measuring molecular profiles of cancers at multiple resolutions, across one or multiple patients. In this review we discuss how cancer genome sequencing data from a single tumour can be used to understand these evolutionary forces, overviewing mathematical models and inferential methods adopted in the field of cancer evolution.

A Transposon Story: From TE Content to TE Dynamic Invasion of Drosophila Genomes Using the Single-Molecule Sequencing Technology from Oxford Nanopore

Cells ◽

10.3390/cells9081776 ◽

2020 ◽

Vol 9 (8) ◽

pp. 1776

Author(s):

Mourdas Mohamed ◽

Nguyet Thi-Minh Dang ◽

Yuki Ogyama ◽

Nelly Burlet ◽

Bruno Mugat ◽

...

Keyword(s):

Single Molecule ◽

Wild Type ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

In The Wild ◽

Successive Generations ◽

Type Strains

Transposable elements (TEs) are the main components of genomes. However, due to their repetitive nature, they are very difficult to study using data obtained with short-read sequencing technologies. Here, we describe an efficient pipeline to accurately recover TE insertion (TEI) sites and sequences from long reads obtained by Oxford Nanopore Technology (ONT) sequencing. With this pipeline, we could precisely describe the landscapes of the most recent TEIs in wild-type strains of Drosophila melanogaster and Drosophila simulans. Their comparison suggests that this subset of TE sequences is more similar than previously thought in these two species. The chromosome assemblies obtained using this pipeline also allowed recovering piRNA cluster sequences, which was impossible using short-read sequencing. Finally, we used our pipeline to analyze ONT sequencing data from a D. melanogaster unstable line in which LTR transposition was derepressed for 73 successive generations. We could rely on single reads to identify new insertions with intact target site duplications. Moreover, the detailed analysis of TEIs in the wild-type strains and the unstable line did not support the trap model claiming that piRNA clusters are hotspots of TE insertions.

Next-Generation Sequencing Technologies in Blood Group Typing

Transfusion Medicine and Hemotherapy ◽

10.1159/000504765 ◽

2019 ◽

Vol 47 (1) ◽

pp. 4-13 ◽

Cited By ~ 1

Author(s):

Daniel Fürst ◽

Chrysanthi Tsamadou ◽

Christine Neuchel ◽

Hubert Schrezenmeier ◽

Joannis Mytilineos ◽

...

Keyword(s):

Next Generation Sequencing ◽

Blood Group ◽

Large Scale ◽

Cost Effective ◽

Molecular Testing ◽

Blood Group Antigens ◽

Next Generation ◽

Sequencing Technologies ◽

Blood Group Typing ◽

Generation Sequencing

Sequencing of the human genome has led to the definition of the genes for most of the relevant blood group systems, and the polymorphisms responsible for most of the clinically relevant blood group antigens are characterized. Molecular blood group typing is used in situations where erythrocytes are not available or where serological testing was inconclusive or not possible due to the lack of antisera. Also, molecular testing may be more cost-effective in certain situations. Molecular typing approaches are mostly based on either PCR with specific primers, DNA hybridization, or DNA sequencing. Particularly the transition of sequencing techniques from Sanger-based sequencing to next-generation sequencing (NGS) technologies has led to exciting new possibilities in blood group genotyping. We describe briefly the currently available NGS platforms and their specifications, depict the genetic background of blood group polymorphisms, and discuss applications for NGS approaches in immunohematology. As an example, we delineate a protocol for large-scale donor blood group screening established and in use at our institution. Furthermore, we discuss technical challenges and limitations as well as the prospect for future developments, including long-read sequencing technologies.

Lung transplantation for patients with severe COVID-19

Science Translational Medicine ◽

10.1126/scitranslmed.abe4282 ◽

2020 ◽

Vol 12 (574) ◽

pp. eabe4282 ◽

Cited By ~ 1

Author(s):

Ankit Bharat ◽

Melissa Querrey ◽

Nikolay S. Markov ◽

Samuel Kim ◽

Chitaru Kurihara ◽

...

Keyword(s):

Respiratory Failure ◽

Pulmonary Fibrosis ◽

Lung Transplantation ◽

Single Cell ◽

RNA Sequencing ◽

Lung Tissue ◽

Single Molecule ◽

Sequencing Data ◽

Native Lung ◽

Single Cell RNA Sequencing

Lung transplantation can potentially be a life-saving treatment for patients with nonresolving COVID-19–associated respiratory failure. Concerns limiting lung transplantation include recurrence of SARS-CoV-2 infection in the allograft, technical challenges imposed by viral-mediated injury to the native lung, and the potential risk for allograft infection by pathogens causing ventilator-associated pneumonia in the native lung. Additionally, the native lung might recover, resulting in long-term outcomes preferable to those of transplant. Here, we report the results of lung transplantation in three patients with nonresolving COVID-19–associated respiratory failure. We performed single-molecule fluorescence in situ hybridization (smFISH) to detect both positive and negative strands of SARS-CoV-2 RNA in explanted lung tissue from the three patients and in additional control lung tissue samples. We conducted extracellular matrix imaging and single-cell RNA sequencing on explanted lung tissue from the three patients who underwent transplantation and on warm postmortem lung biopsies from two patients who had died from COVID-19–associated pneumonia. Lungs from these five patients with prolonged COVID-19 disease were free of SARS-CoV-2 as detected by smFISH, but pathology showed extensive evidence of injury and fibrosis that resembled end-stage pulmonary fibrosis. Using machine learning, we compared single-cell RNA sequencing data from the lungs of patients with late-stage COVID-19 to that from the lungs of patients with pulmonary fibrosis and identified similarities in gene expression across cell lineages. Our findings suggest that some patients with severe COVID-19 develop fibrotic lung disease for which lung transplantation is their only option for survival.