Computational Methods for Transcript Assembly from RNA-SEQ Reads

Author(s):  
Stefan Canzar ◽  
Liliana Florea
2020 ◽  
Author(s):  
Ruben Chazarra-Gil ◽  
Stijn van Dongen ◽  
Vladimir Yu Kiselev ◽  
Martin Hemberg

Abstract As the cost of single-cell RNA-seq experiments has decreased, an increasing number of datasets are now available. Combining newly generated and publicly accessible datasets is challenging due to non-biological signals, commonly known as batch effects. Although there are several computational methods available that can remove batch effects, evaluating which method performs best is not straightforward. Here we present BatchBench (https://github.com/cellgeni/batchbench), a modular and flexible pipeline for comparing batch correction methods for single-cell RNA-seq data. We apply BatchBench to eight methods, highlighting their methodological differences and assessing their performance and computational requirements through a compendium of well-studied datasets. This systematic comparison guides users in the choice of batch correction tool, and the pipeline makes it easy to evaluate other datasets.


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Li Song ◽  
Sarven Sabunciyan ◽  
Guangyu Yang ◽  
Liliana Florea

Abstract Transcript assembly from RNA-seq reads is a critical step in gene expression and subsequent functional analyses. Here we present PsiCLASS, an accurate and efficient transcript assembler based on an approach that simultaneously analyzes multiple RNA-seq samples. PsiCLASS combines mixture statistical models for exonic feature selection across multiple samples with splice graph based dynamic programming algorithms and a weighted voting scheme for transcript selection. PsiCLASS achieves significantly better sensitivity-precision tradeoff, and renders precision up to 2-3 fold higher than the StringTie system and Scallop plus TACO, the two best current approaches. PsiCLASS is efficient and scalable, assembling 667 GEUVADIS samples in 9 h, and has robust accuracy with large numbers of samples.
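
The sensitivity-precision tradeoff mentioned above can be illustrated with a minimal sketch. This is not PsiCLASS's evaluation code; it assumes, for illustration only, that each transcript is represented by a hashable identifier such as its intron chain, and that an assembled transcript counts as correct when it exactly matches a reference transcript. The function name `sensitivity_precision` and the toy coordinates are hypothetical.

```python
def sensitivity_precision(assembled, reference):
    """Return (sensitivity, precision) for two sets of transcript identifiers.

    Assumption: a transcript is a hashable object (here a tuple of intron
    coordinates) and a match means exact equality with a reference transcript.
    """
    matched = assembled & reference
    # Sensitivity: fraction of reference transcripts that were recovered.
    sensitivity = len(matched) / len(reference) if reference else 0.0
    # Precision: fraction of assembled transcripts that are in the reference.
    precision = len(matched) / len(assembled) if assembled else 0.0
    return sensitivity, precision


# Toy intron chains: each intron is (start, end); the values are made up.
reference = {
    ((100, 200), (300, 400)),
    ((100, 200), (350, 400)),
    ((500, 600),),
    ((700, 800), (900, 950)),
}
assembled = {
    ((100, 200), (300, 400)),
    ((500, 600),),
    ((120, 200), (300, 400)),  # not in the reference
}

sens, prec = sensitivity_precision(assembled, reference)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

An assembler that reports more transcripts tends to raise sensitivity at the cost of precision, which is why the two numbers are usually reported together.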


2019 ◽  
Vol 18 (2) ◽  
pp. ar19
Author(s):  
Carl Procko ◽  
Steven Morrison ◽  
Courtney Dunar ◽  
Sara Mills ◽  
Brianna Maldonado ◽  
...  

Next-generation sequencing (NGS)-based methods are revolutionizing biology. Their prevalence requires biologists to be increasingly knowledgeable about computational methods to manage the enormous scale of data. As such, early introduction to NGS analysis and conceptual connection to wet-lab experiments is crucial for training young scientists. However, significant challenges impede the introduction of these methods into the undergraduate classroom, including the need for specialized computer programs and knowledge of computer coding. Here, we describe a semester-long, course-based undergraduate research experience at a liberal arts college combining RNA-sequencing (RNA-seq) analysis with student-driven, wet-lab experiments to investigate plant responses to light. Students derived hypotheses based on analysis of RNA-seq data and designed follow-up studies of gene expression and plant growth. Our assessments indicate that students acquired knowledge of big data analysis and computer coding; however, earlier exposure to computational methods may be beneficial. Our course requires minimal prior knowledge of plant biology, is easy to replicate, and can be modified to a shorter, directed-inquiry module. This framework promotes exploration of the links between gene expression and phenotype using examples that are clear and tractable and improves computational skills and bioinformatics self-efficacy to prepare students for the “big data” era of modern biology.


Animal Gene ◽  
2020 ◽  
Vol 17-18 ◽  
pp. 200105
Author(s):  
Brittney N. Keel ◽  
William T. Oliver ◽  
John W. Keele ◽  
Amanda K. Lindholm-Perry

Author(s):  
A T Vivek ◽  
Shailesh Kumar

Abstract The plant transcriptome encompasses numerous endogenous, regulatory non-coding RNAs (ncRNAs) that play a major biological role in regulating key physiological mechanisms. While studies have shown that ncRNAs are extremely diverse and ubiquitous, the functions of the vast majority of ncRNAs are still unknown. With an ever-increasing number of ncRNAs under study, it is essential to identify, categorize and annotate these ncRNAs on a genome-wide scale. The use of high-throughput RNA sequencing (RNA-seq) technologies provides a broader picture of the non-coding component of the transcriptome, enabling the comprehensive identification and annotation of all major ncRNAs across samples. However, the detection of known and emerging classes of ncRNAs from RNA-seq data demands complex computational methods owing to their unique as well as shared characteristics. Here, we discuss the major plant endogenous, regulatory ncRNAs in an RNA sample, followed by the computational strategies applied to discover each class of ncRNAs using RNA-seq. We also provide a collection of relevant software packages and databases to present a comprehensive bioinformatics toolbox for plant ncRNA researchers. We expect that the discussions in this review will provide a rationale for the discovery of all major categories of plant ncRNAs.


Author(s):  
David S Kang ◽  
Sungshil Kim ◽  
Michael A Cotten ◽  
Cheolho Sim

Abstract The taxonomy of the Culex pipiens complex of mosquitoes is still debated, but in North America it is generally regarded as including Culex pipiens pipiens, Culex pipiens molestus, and Culex quinquefasciatus (or Culex pipiens quinquefasciatus). Although these mosquitoes have very similar morphometry, they each have unique life strategies specifically adapted to their ecological niche. Differences include the capability for overwintering diapause, bloodmeal preference, mating behaviors, and reliance on blood meals to produce eggs. Here, we used RNA-seq transcriptome analysis to investigate the differential gene expression and nucleotide polymorphisms that may be linked to the divergent traits, specifically between Cx. pipiens pipiens and Cx. pipiens molestus.


2015 ◽  
Author(s):  
Michael I Love ◽  
John B Hogenesch ◽  
Rafael A Irizarry

RNA-seq technology is widely used in biomedical and basic science research. These studies rely on complex computational methods that quantify expression levels for observed transcripts. We find that current computational methods can lead to hundreds of false positive results related to alternative isoform usage. This flaw in the current methodology stems from a lack of modeling sample-specific bias that leads to drops in coverage and is related to sequence features like fragment GC content and GC stretches. By incorporating features that explain this bias into transcript expression models, we greatly increase the specificity of transcript expression estimates, with more than a four-fold reduction in the number of false positives for reported changes in expression. We introduce alpine, a method for estimation of bias-corrected transcript abundance. The method is available as a Bioconductor package that includes data visualization tools useful for bias discovery.
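
The sequence features named above, fragment GC content and GC stretches, can be sketched as follows. This is not the alpine API (alpine is a Bioconductor package for R); it is a minimal Python illustration. The function names are hypothetical, and defining a "GC stretch" as the longest uninterrupted run of G or C bases is an assumption made here for simplicity, which may differ from the definition used by the method.

```python
def gc_content(fragment):
    """Fraction of bases in the fragment that are G or C (case-insensitive)."""
    seq = fragment.upper()
    if not seq:
        raise ValueError("fragment must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def longest_gc_stretch(fragment):
    """Length of the longest uninterrupted run of G/C bases.

    Assumption: this run-length definition is illustrative only.
    """
    longest = current = 0
    for base in fragment.upper():
        if base in "GC":
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


fragment = "ATGGCCATTA"  # toy sequence, not real data
print(gc_content(fragment), longest_gc_stretch(fragment))
```

Features like these could then enter a per-sample bias model as covariates; how they are weighted is specific to the method and is not reproduced here.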


2016 ◽  
Author(s):  
Stefano Beretta ◽  
Yuri Pirola ◽  
Valeria Ranzani ◽  
Grazisa Rossetti ◽  
Raoul Bonnal ◽  
...  

MOTIVATION Long non-coding RNAs (lncRNAs) have recently gained interest, especially for their involvement in controlling several cell processes, but a full understanding of their role is lacking. Differential Expression (DE) analysis is one of the most important tasks in the analysis of RNA-seq data, since it potentially points out genes involved in the regulation of the condition under study. However, a classical analysis at gene level may disregard the role of Alternative Splicing (AS) in regulating cell conditions. This is the case, for example, when a given gene is expressed in all the different conditions, but the expressed isoform differs significantly across conditions (that is, an isoform switch). A transcript-level analysis may better shed light on this case, especially in studies whose goal is, for example, a better understanding of the behavior of lncRNAs in T lymphocytes, which are fundamental in studies of specific diseases, such as cancer. After Cufflinks/Cuffdiff, several approaches for DE analysis at isoform/transcript level have been proposed. However, their results are often sensitive to the upstream analysis, such as read mapping, transcript reconstruction and quantification, and it is often hard to choose "a priori" the most appropriate combination of tools. This work presents a tool for assisting the user in this choice, and lays the groundwork for a study devoted to the characterization of lncRNAs and the identification of isoform switch events. Our tool includes a framework for the description and the execution of a set of DE pipelines over the same input dataset, as well as a set of tools for reconciling and comparing the results. METHOD We designed an automated and easily customizable tool which is able to execute a set of existing pipelines for DE analysis at transcript level starting from RNA-seq data. 
Our method is built upon Snakemake, a workflow management system, with the specific goal of reducing the complexity of creating workflows. This approach guarantees that the experimentation is fully replicable and easy to customize. Each considered pipeline is structured in three steps: (i) transcript assembly, (ii) quantification, and (iii) DE analysis. By default, our tool builds and compares 9 different pipelines, each taking as input the same set of RNA-seq reads, obtained by combining different state-of-the-art methods to perform the transcript assembly (TA step) with different state-of-the-art methods to perform quantification and differential expression analysis (Q+DE step). More precisely, the 9 pipelines are obtained by combining two tools (Cufflinks and StringTie) and a Reference Annotation (Ensembl annotated transcripts) for the TA step, with three tools (Cuffquant+Cuffdiff, StringTie-B+Ballgown and Kallisto+Sleuth) for the Q+DE step. Abstract truncated at 3,000 characters - the full version is available in the pdf file
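
The count of 9 pipelines follows from pairing each of the three TA options with each of the three Q+DE options. A minimal sketch of that enumeration is shown below; it is not the authors' Snakemake workflow, and the function name `enumerate_pipelines` and the label format are hypothetical. Only the tool names are taken from the abstract.

```python
from itertools import product

# Transcript assembly (TA) options named in the abstract.
ta_steps = ["Cufflinks", "StringTie", "Ensembl annotation"]
# Quantification and differential expression (Q+DE) options named in the abstract.
q_de_steps = ["Cuffquant+Cuffdiff", "StringTie-B+Ballgown", "Kallisto+Sleuth"]


def enumerate_pipelines(ta, q_de):
    """Pair every TA option with every Q+DE option (Cartesian product)."""
    return [f"{assembler} -> {quantifier}" for assembler, quantifier in product(ta, q_de)]


pipelines = enumerate_pipelines(ta_steps, q_de_steps)
for name in pipelines:
    print(name)
print(len(pipelines), "pipelines")
```

In a workflow manager, the same Cartesian product would typically be expressed through the list of target outputs rather than an explicit loop.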

