Comparison of Short-Read Sequence Aligners Indicates Strengths and Weaknesses for Biologists to Consider

Aligning short-read sequences is the foundational step to most genomic and transcriptomic analyses, but not all tools perform equally, and choosing among the growing body of available tools can be daunting. Here, in order to increase awareness in the research community, we discuss the merits of common algorithms and programs in a way that should be approachable to biologists with limited experience in bioinformatics. We will only in passing consider the effects of data cleanup, a precursor analysis to most alignment tools, and no consideration will be given to downstream processing of the aligned fragments. To compare aligners [Bowtie2, Burrows Wheeler Aligner (BWA), HISAT2, MUMmer4, STAR, and TopHat2], an RNA-seq dataset was used containing data from 48 geographically distinct samples of the grapevine powdery mildew fungus Erysiphe necator. Based on alignment rate and gene coverage, all aligners performed well with the exception of TopHat2, which HISAT2 superseded. BWA perhaps had the best performance in these metrics, except for longer transcripts (>500 bp) for which HISAT2 and STAR performed well. HISAT2 was ~3-fold faster than the next fastest aligner in runtime, which we consider a secondary factor in most alignments. At the end, this direct comparison of commonly used aligners illustrates key considerations when choosing which tool to use for the specific sequencing data and objectives. No single tool meets all needs for every user, and there are many quality aligners available.

Download Full-text

The long and the short of it: unlocking nanopore long-read RNA sequencing data with short-read differential expression analysis tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab028 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Xueyi Dong ◽

Luyi Tian ◽

Quentin Gouil ◽

Hasaru Kariyawasam ◽

Shian Su ◽

...

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Transcriptomic Analysis ◽

Statistical Testing ◽

Rna Seq ◽

Sequencing Data ◽

Short Read ◽

Sequencing Platform ◽

Long Read

Abstract Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to the high sequence error and small library sizes, which decreases quantification accuracy and reduces power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) as well as a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 was analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.

Download Full-text

NASA GeneLab RNA-Seq Consensus Pipeline: Standardized Processing of Short-Read RNA-Seq Data

10.1101/2020.11.06.371724 ◽

2020 ◽

Author(s):

Eliah G. Overbey ◽

Amanda M. Saravia-Butler ◽

Zhe Zhang ◽

Komal S. Rathi ◽

Homer Fogle ◽

...

Keyword(s):

Expression Profiles ◽

Gene Expression Profiles ◽

Rna Seq ◽

Sequencing Data ◽

Analysis Pipeline ◽

Short Read ◽

Gene Quantification ◽

Working Groups ◽

Using Data ◽

Data Analysis Pipeline

SummaryWith the development of transcriptomic technologies, we are able to quantify precise changes in gene expression profiles from astronauts and other organisms exposed to spaceflight. Members of NASA GeneLab and GeneLab-associated analysis working groups (AWGs) have developed a consensus pipeline for analyzing short-read RNA-sequencing data from spaceflight-associated experiments. The pipeline includes quality control, read trimming, mapping, and gene quantification steps, culminating in the detection of differentially expressed genes. This data analysis pipeline and the results of its execution using data submitted to GeneLab are now all publicly available through the GeneLab database. We present here the full details and rationale for the construction of this pipeline in order to promote transparency, reproducibility and reusability of pipeline data, to provide a template for data processing of future spaceflight-relevant datasets, and to encourage cross-analysis of data from other databases with the data available in GeneLab.

Download Full-text

FastqPuri: high-performance preprocessing of RNA-seq data

10.1101/480707 ◽

2018 ◽

Author(s):

Paula Pérez-Rubio ◽

Claudio Lottaz ◽

Julia C Engelmann

Keyword(s):

High Performance ◽

Sequence Data ◽

Transcript Expression ◽

Memory Usage ◽

Rna Seq ◽

Sequencing Data ◽

Total Size ◽

Short Read ◽

Quality Filtering ◽

Sequence Quality

AbstractBackgroundRNA sequencing (RNA-seq) has become the standard means of analyzing gene and transcript expression in high-throughput. While previously sequence alignment was a time demanding step, fast alignment methods and even more so transcript counting methods which avoid mapping and quantify gene and transcript expression by evaluating whether a read is compatible with a transcript, have led to significant speed-ups in data analysis. Now, the most time demanding step in the analysis of RNA-seq data is preprocessing the raw sequence data, such as running quality control and adapter, contamination and quality filtering before transcript or gene quantification. To do so, many researchers chain different tools, but a comprehensive, flexible and fast software that covers all preprocessing steps is currently missing.ResultsWe here present FastqPuri, a light-weight and highly efficient preprocessing tool for fastq data. FastqPuri provides sequence quality reports on the sample and dataset level with new plots which facilitate decision making for subsequent quality filtering. Moreover, FastqPuri efficiently removes adapter sequences and sequences from biological contamination from the data. It accepts both single- and paired-end data in uncompressed or compressed fastq files. FastqPuri can be run stand-alone and is suitable to be run within pipelines. We benchmarked FastqPuri against existing tools and found that FastqPuri is superior in terms of speed, memory usage, versatility and comprehensiveness. Conclusions: FastqPuri is a new tool which covers all aspects of short read sequence data preprocessing. It was designed for RNA-seq data to meet the needs for fast preprocessing of fastq data to allow transcript and gene counting, but it is suitable to process any short read sequencing data of which high sequence quality is needed, such as for genome assembly or SNV (single nucleotide variant) detection. FastqPuri is most flexible in filtering undesired biological sequences by offering two approaches to optimize speed and memory usage dependent on the total size of the potential contaminating sequences. FastqPuri is available at https://github.com/jengelmann/FastqPuri. It is implemented in C and R and licensed under GPL v3.

Download Full-text

gplas: a comprehensive tool for plasmid analysis using short-read graphs

10.1101/835900 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sergio Arredondo-Alonso ◽

Martin Bootsma ◽

Yaïr Hein ◽

Malbert R.C. Rogers ◽

Jukka Corander ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Bacterial Genome ◽

Workflow Management ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Genetic Traits ◽

Short Read ◽

Sequence Composition ◽

Short Read Sequence

ABSTRACTSummaryPlasmids can horizontally transmit genetic traits, enabling rapid bacterial adaptation to new environments and hosts. Short-read whole-genome sequencing data is often applied to large-scale bacterial comparative genomics projects but the reconstruction of plasmids from these data is facing severe limitations, such as the inability to distinguish plasmids from each other in a bacterial genome. We developed gplas, a new approach to reliably separate plasmid contigs into discrete components using sequence composition, coverage, assembly graph information and clustering based on a pruned network of plasmid unitigs. Gplas facilitates the analysis of large numbers of bacterial isolates and allows a detailed analysis of plasmid epidemiology based solely on short read sequence data.Availability and implementationGplas is written in R, Bash and uses a Snakemake pipeline as a workflow management system. Gplas is available under the GNU General Public License v3.0 at https://gitlab.com/sirarredondo/[email protected]

Download Full-text

CoGe LoadExp+: A web-based suite that integrates next-gen sequencing data analysis workflows and visualization

10.1101/118802 ◽

2017 ◽

Cited By ~ 1

Author(s):

Jeffrey W. Grover ◽

Matthew Bomhoff ◽

Sean Davey ◽

Brian D. Gregory ◽

Rebecca A. Mosher ◽

...

Keyword(s):

Downstream Processing ◽

Biological Research ◽

Chromatin Binding ◽

Rna Seq ◽

Sequencing Data ◽

Skill Levels ◽

Web Based ◽

Sequencing Data Analysis ◽

Bioinformatics Workflows ◽

Variant Identification

AbstractTo make genomic and epigenomic analyses more widely available to the biological research community, we have created LoadExp+, a suite of bioinformatics workflows integrated with the web-based comparative genomics platform, CoGe. LoadExp+ allows users to perform transcriptomic (RNA-seq), epigenomic (bisulfite-seq), chromatin-binding (ChIP-seq), variant identification (SNPs), and population genetics analyses against any genome in CoGe, including genomes integrated by users themselves. Through LoadExp+’s integration with CoGe’s existing features, all analyses are available for visualization and additional downstream processing, and are available for export to CyVerse’s data management and analysis platforms. LoadExp+ provides easy-to-use functionality to manage genomics and epigenomics data throughout its entire lifecycle and facilitates greater accessibility of genomics analyses to researchers of all skill levels. LoadExp+ can be accessed at https://genomevolution.org.

Download Full-text

An Overview of the Application of Deep Learning in Short Read Sequence Classification

10.1101/2020.09.19.304782 ◽

2020 ◽

Author(s):

Kristaps Bebris ◽

Inese Polaka

Keyword(s):

Deep Learning ◽

Sequence Classification ◽

Sequencing Data ◽

Sequencing Technology ◽

Short Read ◽

Short Read Sequencing ◽

Current State ◽

Learning Techniques ◽

Short Read Sequence

AbstractAdvances in sequencing technology have led to an ever increasing amount of available short read sequencing data. This has, consequently, exacerbated the need for efficient and precise classification tools that can be used in the analysis of this data. As it stands, recent years have shown that massive leaps in performance can be achieved when it comes to approaches that are based in heuristics, and alongside these improvements there has been an ever increasing interest in applying deep learning techniques to revolutionize this classification task. We attempt to gather up these approaches and to evaluate their performance in a reproducible fashion to get a better perspective on the current state of deep learning based methods when it comes to the classification of short read sequencing data.

Download Full-text

The Lair: A resource for exploratory analysis of published RNA-Seq data

10.1101/056200 ◽

2016 ◽

Author(s):

Harold Pimentel ◽

Pascal Sturmfels ◽

Nicolas Bray ◽

Páll Melsted ◽

Lior Pachter

Keyword(s):

Large Scale ◽

Exploratory Analysis ◽

Technical Expertise ◽

Rna Seq ◽

Sequencing Data ◽

Short Read ◽

Link Type ◽

Short Read Archive ◽

Published Research

AbstractIncreased emphasis on reproducibility of published research in the last few years has led to the large-scale archiving of sequencing data. While this data can, in theory, be used to reproduce results in papers, it is typically not easily usable in practice. We introduce a series of tools for processing and analyzing RNA-Seq data in the Short Read Archive, that together have allowed us to build an easily extendable resource for analysis of data underlying published papers. Our system makes the exploration of data easily accessible and usable without technical expertise. Our database and associated tools can be accessed at The Lair: http://pachterlab.github.io/lair

Download Full-text

An Overview of the Application of Deep Learning in Short-Read Sequence Classification

Information Technology and Management Science ◽

10.7250/itms-2020-0005 ◽

2020 ◽

Vol 23 ◽

pp. 35-40

Author(s):

Kristaps Bebris ◽

Inese Polaka

Keyword(s):

Deep Learning ◽

Sequence Classification ◽

Sequencing Data ◽

Sequencing Technology ◽

Short Read ◽

Short Read Sequencing ◽

Current State ◽

Learning Techniques ◽

Short Read Sequence

Advances in sequencing technology have led to an ever increasing amount of available short-read sequencing data. This has, consequently, exacerbated the need for efficient and precise classification tools that can be used in the analysis of these data. As it stands, recent years have shown that massive leaps in performance can be achieved when it comes to approaches that are based on heuristics, and apart from these improvements there has been an ever increasing interest in applying deep learning techniques to revolutionize this classification task. We attempt to study these approaches and to evaluate their performance in a reproducible fashion to get a better perspective on the current state of deep learning based methods when it comes to the classification of short-read sequencing data

Download Full-text

The long and the short of it: unlocking nanopore long-read RNA sequencing data with short-read tools

10.1101/2020.06.28.176727 ◽

2020 ◽

Cited By ~ 1

Author(s):

Xueyi Dong ◽

Luyi Tian ◽

Quentin Gouil ◽

Hasaru Kariyawasam ◽

Shian Su ◽

...

Keyword(s):

Ground Truth ◽

Transcriptomic Analysis ◽

Statistical Testing ◽

Rna Seq ◽

Sequencing Data ◽

Short Read ◽

Sequencing Platform ◽

Oxford Nanopore ◽

Long Read ◽

Usage Analysis

Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to small library sizes and high sequence error, which decreases quantification accuracy and reduces power for statistical testing. Here, we report the analysis of two nanopore sequencing RNA-seq datasets with the goal of obtaining gene-level and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (“sequins”) as well as a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 were analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using short-read software and methods that are already in wide use can yield meaningful results.

Download Full-text

Advancing clinical genomics and precision medicine with GVViZ: FAIR bioinformatics platform for variable gene-disease annotation, visualization, and expression analysis

Human Genomics ◽

10.1186/s40246-021-00336-1 ◽

2021 ◽

Vol 15 (1) ◽

Author(s):

Zeeshan Ahmed ◽

Eduard Gibert Renart ◽

Saman Zeeshan ◽

XinQi Dong

Keyword(s):

Data Analysis ◽

Patient Care ◽

Expression Analysis ◽

High Throughput ◽

Gene Annotation ◽

Next Generation Sequencing Data ◽

Rna Seq ◽

Sequencing Data ◽

Complex Disorders ◽

Transcriptomics Data

Abstract Background Genetic disposition is considered critical for identifying subjects at high risk for disease development. Investigating disease-causing and high and low expressed genes can support finding the root causes of uncertainties in patient care. However, independent and timely high-throughput next-generation sequencing data analysis is still a challenge for non-computational biologists and geneticists. Results In this manuscript, we present a findable, accessible, interactive, and reusable (FAIR) bioinformatics platform, i.e., GVViZ (visualizing genes with disease-causing variants). GVViZ is a user-friendly, cross-platform, and database application for RNA-seq-driven variable and complex gene-disease data annotation and expression analysis with a dynamic heat map visualization. GVViZ has the potential to find patterns across millions of features and extract actionable information, which can support the early detection of complex disorders and the development of new therapies for personalized patient care. The execution of GVViZ is based on a set of simple instructions that users without a computational background can follow to design and perform customized data analysis. It can assimilate patients’ transcriptomics data with the public, proprietary, and our in-house developed gene-disease databases to query, easily explore, and access information on gene annotation and classified disease phenotypes with greater visibility and customization. To test its performance and understand the clinical and scientific impact of GVViZ, we present GVViZ analysis for different chronic diseases and conditions, including Alzheimer’s disease, arthritis, asthma, diabetes mellitus, heart failure, hypertension, obesity, osteoporosis, and multiple cancer disorders. The results are visualized using GVViZ and can be exported as image (PNF/TIFF) and text (CSV) files that include gene names, Ensembl (ENSG) IDs, quantified abundances, expressed transcript lengths, and annotated oncology and non-oncology diseases. Conclusions We emphasize that automated and interactive visualization should be an indispensable component of modern RNA-seq analysis, which is currently not the case. However, experts in clinics and researchers in life sciences can use GVViZ to visualize and interpret the transcriptomics data, making it a powerful tool to study the dynamics of gene expression and regulation. Furthermore, with successful deployment in clinical settings, GVViZ has the potential to enable high-throughput correlations between patient diagnoses based on clinical and transcriptomics data.

Download Full-text