Computational Methods for Transcript Assembly from RNA-Seq Reads

Abstract As the cost of single-cell RNA-seq experiments has decreased, an increasing number of datasets have become available. Combining newly generated and publicly accessible datasets is challenging due to non-biological signals, commonly known as batch effects. Although several computational methods can remove batch effects, evaluating which method performs best is not straightforward. Here we present BatchBench (https://github.com/cellgeni/batchbench), a modular and flexible pipeline for comparing batch correction methods for single-cell RNA-seq data. We apply BatchBench to eight methods, highlighting their methodological differences, and assess their performance and computational requirements through a compendium of well-studied datasets. This systematic comparison guides users in the choice of batch correction tool, and the pipeline makes it easy to evaluate other datasets.

A multi-sample approach increases the accuracy of transcript assembly

Nature Communications ◽

10.1038/s41467-019-12990-0 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 2

Author(s):

Li Song ◽

Sarven Sabunciyan ◽

Guangyu Yang ◽

Liliana Florea

Keyword(s):

Statistical Models ◽

Rna Seq ◽

Weighted Voting ◽

Large Numbers ◽

Splice Graph ◽

Voting Scheme ◽

Functional Analyses ◽

Multiple Samples ◽

Programming Algorithms ◽

Transcript Assembly

Abstract Transcript assembly from RNA-seq reads is a critical step in gene expression and subsequent functional analyses. Here we present PsiCLASS, an accurate and efficient transcript assembler based on an approach that simultaneously analyzes multiple RNA-seq samples. PsiCLASS combines mixture statistical models for exonic feature selection across multiple samples with splice-graph-based dynamic programming algorithms and a weighted voting scheme for transcript selection. PsiCLASS achieves a significantly better sensitivity-precision tradeoff, with precision up to 2- to 3-fold higher than the StringTie system and Scallop plus TACO, the two best current approaches. PsiCLASS is efficient and scalable, assembling 667 GEUVADIS samples in 9 h, and has robust accuracy with large numbers of samples.
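The sensitivity-precision tradeoff mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say how the two measures were computed, so the code below assumes one common convention: a transcript counts as correct when its full intron chain matches a reference transcript exactly. The function name and the toy coordinates are hypothetical.

```python
def precision_sensitivity(assembled, reference):
    """Compare two collections of transcripts.

    Each transcript is represented by its intron chain: a tuple of
    (donor, acceptor) coordinate pairs. Exact-match scoring is an
    assumption made for illustration, not the paper's stated protocol.
    """
    assembled, reference = set(assembled), set(reference)
    matched = assembled & reference
    # Precision: fraction of assembled transcripts found in the reference.
    precision = len(matched) / len(assembled) if assembled else 0.0
    # Sensitivity: fraction of reference transcripts that were recovered.
    sensitivity = len(matched) / len(reference) if reference else 0.0
    return precision, sensitivity


# Toy data (hypothetical coordinates).
reference = {
    ((100, 200), (300, 400)),
    ((100, 200), (350, 400)),
    ((500, 600),),
}
assembled = {
    ((100, 200), (300, 400)),
    ((500, 600),),
    ((700, 800),),
    ((900, 950),),
}

precision, sensitivity = precision_sensitivity(assembled, reference)
print(f"precision={precision:.2f} sensitivity={sensitivity:.2f}")
```

In this toy case two of the four assembled transcripts match, and two of the three reference transcripts are recovered, so an assembler that reports fewer spurious transcripts raises precision without necessarily changing sensitivity.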

Big Data to the Bench: Transcriptome Analysis for Undergraduates

CBE—Life Sciences Education ◽

10.1187/cbe.18-08-0161 ◽

2019 ◽

Vol 18 (2) ◽

pp. ar19

Author(s):

Carl Procko ◽

Steven Morrison ◽

Courtney Dunar ◽

Sara Mills ◽

Brianna Maldonado ◽

...

Keyword(s):

Gene Expression ◽

Big Data ◽

Computational Methods ◽

Liberal Arts ◽

Plant Responses ◽

Rna Seq ◽

Research Experience ◽

Modern Biology ◽

Lab Experiments ◽

Wet Lab

Next-generation sequencing (NGS)-based methods are revolutionizing biology. Their prevalence requires biologists to be increasingly knowledgeable about computational methods to manage the enormous scale of data. As such, early introduction to NGS analysis and conceptual connection to wet-lab experiments is crucial for training young scientists. However, significant challenges impede the introduction of these methods into the undergraduate classroom, including the need for specialized computer programs and knowledge of computer coding. Here, we describe a semester-long, course-based undergraduate research experience at a liberal arts college combining RNA-sequencing (RNA-seq) analysis with student-driven, wet-lab experiments to investigate plant responses to light. Students derived hypotheses based on analysis of RNA-seq data and designed follow-up studies of gene expression and plant growth. Our assessments indicate that students acquired knowledge of big data analysis and computer coding; however, earlier exposure to computational methods may be beneficial. Our course requires minimal prior knowledge of plant biology, is easy to replicate, and can be modified to a shorter, directed-inquiry module. This framework promotes exploration of the links between gene expression and phenotype using examples that are clear and tractable and improves computational skills and bioinformatics self-efficacy to prepare students for the “big data” era of modern biology.

Evaluation of transcript assembly in multiple porcine tissues suggests optimal sequencing depth for RNA-Seq using total RNA library

Animal Gene ◽

10.1016/j.angen.2020.200105 ◽

2020 ◽

Vol 17-18 ◽

pp. 200105

Author(s):

Brittney N. Keel ◽

William T. Oliver ◽

John W. Keele ◽

Amanda K. Lindholm-Perry

Keyword(s):

Sequencing Depth ◽

Rna Seq ◽

Total Rna ◽

Optimal Sequencing ◽

Transcript Assembly ◽

Porcine Tissues

Computational methods for annotation of plant regulatory non-coding RNAs using RNA-seq

Briefings in Bioinformatics ◽

10.1093/bib/bbaa322 ◽

2020 ◽

Author(s):

A T Vivek ◽

Shailesh Kumar

Keyword(s):

Computational Methods ◽

High Throughput ◽

Rna Seq ◽

Physiological Mechanisms ◽

Software Packages ◽

Genome Wide ◽

A Genome ◽

Wide Scale ◽

Plant Transcriptome ◽

Non Coding Rnas

Abstract The plant transcriptome encompasses numerous endogenous, regulatory non-coding RNAs (ncRNAs) that play a major biological role in regulating key physiological mechanisms. While studies have shown that ncRNAs are extremely diverse and ubiquitous, the functions of the vast majority of ncRNAs are still unknown. With an ever-increasing number of ncRNAs under study, it is essential to identify, categorize and annotate these ncRNAs on a genome-wide scale. The use of high-throughput RNA sequencing (RNA-seq) technologies provides a broader picture of the non-coding component of the transcriptome, enabling the comprehensive identification and annotation of all major ncRNAs across samples. However, the detection of known and emerging classes of ncRNAs from RNA-seq data demands complex computational methods owing to their unique as well as similar characteristics. Here, we discuss major plant endogenous, regulatory ncRNAs in an RNA sample followed by computational strategies applied to discover each class of ncRNAs using RNA-seq. We also provide a collection of relevant software packages and databases to present a comprehensive bioinformatics toolbox for plant ncRNA researchers. We assume that the discussions in this review will provide a rationale for the discovery of all major categories of plant ncRNAs.

Transcript Assembly and Quantification by RNA-Seq Reveals Significant Differences in Gene Expression and Genetic Variants in Mosquitoes of the Culex pipiens (Diptera: Culicidae) Complex

Journal of Medical Entomology ◽

10.1093/jme/tjaa167 ◽

2020 ◽

Author(s):

David S Kang ◽

Sungshil Kim ◽

Michael A Cotten ◽

Cheolho Sim

Keyword(s):

Gene Expression ◽

Culex Pipiens ◽

Nucleotide Polymorphisms ◽

Rna Seq ◽

Culex Pipiens Quinquefasciatus ◽

Niche Differences ◽

Differential Gene ◽

Culex Pipiens Molestus ◽

Transcript Assembly ◽

Blood Meals

Abstract The taxonomy of the Culex pipiens complex of mosquitoes is still debated, but in North America it is generally regarded as including Culex pipiens pipiens, Culex pipiens molestus, and Culex quinquefasciatus (or Culex pipiens quinquefasciatus). Although these mosquitoes have very similar morphometry, they each have unique life strategies specifically adapted to their ecological niche. Differences include the capability for overwintering diapause, bloodmeal preference, mating behaviors, and reliance on blood meals to produce eggs. Here, we used RNA-seq transcriptome analysis to investigate the differential gene expression and nucleotide polymorphisms that may be linked to the divergent traits, specifically between Cx. pipiens pipiens and Cx. pipiens molestus.

Computational methods for alternative splicing detection using RNA-seq

Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics - BCB'13 ◽

10.1145/2506583.2506666 ◽

2013 ◽

Author(s):

Ruolin Liu ◽

Julie Dickerson

Keyword(s):

Alternative Splicing ◽

Computational Methods ◽

Rna Seq ◽

Splicing Detection

Transcript assembly improves expression quantification of transposable elements in single-cell RNA-seq data

Genome Research ◽

10.1101/gr.265173.120 ◽

2020 ◽

Vol 31 (1) ◽

pp. 88-100

Author(s):

Wanqing Shao ◽

Ting Wang

Keyword(s):

Transposable Elements ◽

Single Cell ◽

Rna Seq ◽

Transcript Assembly ◽

Expression Quantification

Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation

10.1101/025767 ◽

2015 ◽

Cited By ~ 6

Author(s):

Michael I Love ◽

John B Hogenesch ◽

Rafael A Irizarry

Keyword(s):

Computational Methods ◽

Gc Content ◽

Science Research ◽

Transcript Abundance ◽

Transcript Expression ◽

Rna Seq ◽

Fold Reduction ◽

Visualization Tools ◽

Positive Results ◽

Sequence Bias

RNA-seq technology is widely used in biomedical and basic science research. These studies rely on complex computational methods that quantify expression levels for observed transcripts. We find that current computational methods can lead to hundreds of false positive results related to alternative isoform usage. This flaw in the current methodology stems from a lack of modeling sample-specific bias that leads to drops in coverage and is related to sequence features like fragment GC content and GC stretches. By incorporating features that explain this bias into transcript expression models, we greatly increase the specificity of transcript expression estimates, with more than a four-fold reduction in the number of false positives for reported changes in expression. We introduce alpine, a method for estimation of bias-corrected transcript abundance. The method is available as a Bioconductor package that includes data visualization tools useful for bias discovery.
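The sequence features named in this abstract, fragment GC content and GC stretches, can be computed with a few lines of code. The sketch below is only an illustration of those two features; it is not the alpine model or its API, and the function names and the contiguous-run definition of a "GC stretch" are assumptions.

```python
def gc_content(fragment):
    """Return the fraction of G and C bases in a fragment sequence."""
    fragment = fragment.upper()
    if not fragment:
        return 0.0
    return sum(base in "GC" for base in fragment) / len(fragment)


def longest_gc_stretch(fragment):
    """Return the length of the longest uninterrupted run of G/C bases.

    Defining a stretch as a contiguous run is an assumption made here
    for illustration; a bias model may define stretches differently.
    """
    longest = current = 0
    for base in fragment.upper():
        current = current + 1 if base in "GC" else 0
        longest = max(longest, current)
    return longest


fragment = "ATGCGCGGAT"  # hypothetical fragment sequence
print(gc_content(fragment), longest_gc_stretch(fragment))
```

Features like these could then be supplied as covariates to a model of fragment coverage, which is the general idea the abstract describes.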

A tool for the comparison of transcript differential expression analysis pipelines

10.7287/peerj.preprints.2212 ◽

2016 ◽

Author(s):

Stefano Beretta ◽

Yuri Pirola ◽

Valeria Ranzani ◽

Grazisa Rossetti ◽

Raoul Bonnal ◽

...

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

State Of The Art ◽

A Priori ◽

Differential Expression Analysis ◽

Workflow Management ◽

Transcript Level ◽

Rna Seq ◽

Art Methods ◽

Transcript Assembly

MOTIVATION Long non-coding RNAs (lncRNAs) have recently gained interest, especially for their involvement in controlling several cell processes, but a full understanding of their role is lacking. Differential Expression (DE) analysis is one of the most important tasks in the analysis of RNA-seq data, since it potentially points out genes involved in the regulation of the condition under study. However, a classical analysis at gene level may disregard the role of Alternative Splicing (AS) in regulating cell conditions. This is the case, for example, when a given gene is expressed in all the different conditions, but the expressed isoform differs significantly between conditions (that is, an isoform switch). A transcript level analysis may better shed light on this case, especially in studies whose goal is, for example, a better understanding of the behavior of lncRNAs in T lymphocytes, which are fundamental in studies of specific diseases, such as cancer. After Cufflinks/Cuffdiff, several approaches for DE analysis at isoform/transcript level have been proposed. However, their results are often sensitive to the upstream analysis such as read mapping, transcript reconstruction and quantification, and it is often hard to choose "a priori" the most appropriate combination of tools. This work presents a tool for assisting the user in this choice, and lays the basis for a study devoted to the characterization of lncRNAs and the identification of isoform switch events. Our tool includes a framework for the description and the execution of a set of DE pipelines over the same input dataset, as well as a set of tools for reconciling and comparing the results.

METHOD We designed an automated and easily customizable tool which is able to execute a set of existing pipelines for DE analysis at transcript level starting from RNA-seq data. Our method is built upon Snakemake, a workflow management system, with the specific goal of reducing the complexity of creating workflows. This approach guarantees that the experimentation is fully replicable and easy to customize. Each considered pipeline is structured in three steps: (i) transcript assembly, (ii) quantification, and (iii) DE analysis. By default, our tool builds and compares 9 different pipelines, each taking as input the same set of RNA-seq reads, obtained by combining different state-of-the-art methods to perform the transcript assembly (TA step) with different state-of-the-art methods to perform quantification and differential expression analysis (Q+DE step). More precisely, the 9 pipelines are obtained by combining two tools (Cufflinks and StringTie) and a Reference Annotation (Ensembl annotated transcripts) for the TA step, with three tools (Cuffquant+Cuffdiff, StringTie-B+Ballgown and Kallisto+Sleuth) for the Q+DE step.

Abstract truncated at 3,000 characters - the full version is available in the pdf file
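The count of 9 default pipelines follows from pairing each of the three TA options with each of the three Q+DE options. The sketch below enumerates those pairs in plain Python; it is not the tool's Snakemake configuration, and the variable names and dictionary keys are hypothetical.

```python
from itertools import product

# Options for the transcript assembly (TA) step, as listed in the abstract.
ta_options = ["Cufflinks", "StringTie", "Ensembl reference annotation"]

# Options for the quantification and DE (Q+DE) step, as listed in the abstract.
q_de_options = ["Cuffquant+Cuffdiff", "StringTie-B+Ballgown", "Kallisto+Sleuth"]

# Every TA option paired with every Q+DE option: 3 x 3 = 9 pipelines.
pipelines = [
    {"transcript_assembly": ta, "quantification_and_de": q_de}
    for ta, q_de in product(ta_options, q_de_options)
]

for number, pipeline in enumerate(pipelines, start=1):
    print(number, pipeline["transcript_assembly"], "->", pipeline["quantification_and_de"])
```

Adding a fourth assembler to `ta_options` would grow the set to 12 pipelines, which is why a workflow manager is useful for keeping the runs consistent.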
