The Oyster River Protocol: A Multi Assembler and Kmer Approach For de novo Transcriptome Assembly

Population Genomics ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

De Novo Transcriptome ◽

Link Type ◽

Biological Phenomena ◽

Complicated Process ◽

Downstream Analysis

Abstract: Characterizing transcriptomes in non-model organisms has resulted in a massive increase in our understanding of biological phenomena. This boon, largely made possible via high-throughput sequencing, means that studies of functional, evolutionary, and population genomics are now being done by hundreds or even thousands of labs around the world. For many, these studies begin with a de novo transcriptome assembly, which is a technically complicated process involving several discrete steps. The Oyster River Protocol (ORP), described here, implements a standardized and benchmarked set of bioinformatic processes, resulting in an assembly with enhanced qualities over other standard assembly methods. Specifically, ORP-produced assemblies have higher Detonate and TransRate scores and mapping rates, largely because the protocol leverages a multi-assembler and kmer assembly process, thereby bypassing the shortcomings of any one approach. These improvements are important, as previously unassembled transcripts are included in ORP assemblies, resulting in a significant enhancement of the power of downstream analysis. Further, as part of this study, I show that assembly quality is unrelated to the number of reads generated, above 30 million reads. Code Availability: The version-controlled open-source code is available at https://github.com/macmanes-lab/Oyster_River_Protocol. Instructions for software installation and use, and other details are available at http://oyster-river-protocol.rtfd.org/.

The Oyster River Protocol: a multi-assembler and kmer approach for de novo transcriptome assembly

PeerJ ◽

10.7717/peerj.5428 ◽

2018 ◽

Vol 6 ◽

pp. e5428 ◽

Cited By ~ 22

Author(s):

Matthew D. MacManes

Establishing evidenced-based best practice for the de novo assembly and evaluation of transcriptomes from non-model organisms

10.1101/035642 ◽

2015 ◽

Cited By ~ 25

Author(s):

Matthew D MacManes

Keyword(s):

Best Practice ◽

Population Genomics ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

Single Individual ◽

Biological Phenomena ◽

Or Gene ◽

Evidenced Based

Characterizing transcriptomes in both model and non-model organisms has resulted in a massive increase in our understanding of biological phenomena. This boon, largely made possible via high-throughput sequencing, means that studies of functional, evolutionary, and population genomics are now being done by hundreds or even thousands of labs around the world. For many, these studies begin with a de novo transcriptome assembly, which is a technically complicated process involving several discrete steps. Each step may be accomplished in one of several different ways, using different software packages, each producing different results. This analytical complexity raises the question: which method(s) are optimal? Using reference and non-reference based evaluative methods, I propose a set of guidelines that aim to standardize and facilitate the process of transcriptome assembly. These recommendations include the generation of between 20 million and 40 million sequencing reads from a single individual where possible, error correction of reads, gentle quality trimming, assembly filtering using Transrate and/or gene expression, annotation using dammit, and appropriate reporting. These recommendations have been extensively benchmarked and applied to publicly available transcriptomes, resulting in improvements in both content and contiguity. To facilitate the implementation of the proposed standardized methods, I have released a set of version-controlled open-source code, The Oyster River Protocol for Transcriptome Assembly, available at http://oyster-river-protocol.rtfd.org/.
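The read-depth recommendation above (between 20 million and 40 million reads) can be checked before assembly. The following is a minimal sketch, not part of the Oyster River Protocol itself: the function names `count_fastq_reads` and `within_recommended_depth` are hypothetical, the input is assumed to be an uncompressed FASTQ stream with exactly four lines per record, and the abstract does not say whether the range refers to individual reads or read pairs.

```python
import io
from typing import TextIO


def count_fastq_reads(handle: TextIO) -> int:
    """Count records in a FASTQ stream, assuming four lines per record."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of four")
    return n_lines // 4


def within_recommended_depth(
    n_reads: int, low: int = 20_000_000, high: int = 40_000_000
) -> bool:
    """Return True if n_reads falls inside the recommended range (inclusive)."""
    return low <= n_reads <= high


# Tiny in-memory example with two records (hypothetical data).
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
n = count_fastq_reads(example)
print(n, within_recommended_depth(n))
```

For real data the handle would typically come from `open(path)`; compressed input would need `gzip.open(path, "rt")` instead.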

The brain transcriptome of the wolf spider, Schizocosa ocreata

BMC Research Notes ◽

10.1186/s13104-021-05648-y ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Daniel Stribling ◽

Peter L. Chang ◽

Justin E. Dalton ◽

Christopher A. Conow ◽

Malcolm Rosenthal ◽

...

Keyword(s):

Gene Expression ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

De Novo Transcriptome ◽

Wolf Spiders ◽

Schizocosa Ocreata ◽

Genomic Studies ◽

The Brain

Objectives: Arachnids have fascinating and unique biology, particularly for questions on sex differences and behavior, creating the potential for development of powerful emerging models in this group. Recent advances in genomic techniques have paved the way for a significant increase in the breadth of genomic studies in non-model organisms. One growing area of research is comparative transcriptomics. When phylogenetic relationships to model organisms are known, comparative genomic studies provide context for analysis of homologous genes and pathways. The goal of this study was to lay the groundwork for comparative transcriptomics of sex differences in the brain of wolf spiders, a non-model organism of the phylum Euarthropoda, by generating transcriptomes and analyzing gene expression. Data description: To examine sex-differential gene expression, short read transcript sequencing and de novo transcriptome assembly were performed. Messenger RNA was isolated from brain tissue of male and female subadult and mature wolf spiders (Schizocosa ocreata). The raw data consist of sequences for the two different life stages in each sex. Computational analyses on these data include de novo transcriptome assembly and differential expression analyses. Sample-specific and combined transcriptomes, gene annotations, and differential expression results are described in this data note and are available from public databases.

rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data

10.1101/420208 ◽

2018 ◽

Cited By ~ 13

Author(s):

Elena Bushmanova ◽

Dmitry Antipov ◽

Alla Lapidus ◽

Andrey D. Prjibelski

Keyword(s):

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

Challenging Problem ◽

Rna Seq ◽

De Novo Transcriptome ◽

Weak Points ◽

Transcriptome Reconstruction ◽

Evaluation Approaches ◽

Genome Assembler

Summary: The ability to generate large RNA-seq datasets has led to the development of various reference-based and de novo transcriptome assemblers, each with its own strengths and limitations. While reference-based tools are widely used in various transcriptomic studies, their application is limited to model organisms with finished and annotated genomes. De novo transcriptome reconstruction from short reads remains an open and challenging problem, which is complicated by the varying expression levels across different genes, alternative splicing, and paralogous genes. In this paper we describe a novel transcriptome assembler called rnaSPAdes, which is developed on top of the SPAdes genome assembler and explores surprising computational parallels between assembly of transcriptomes and single-cell genomes. We also present quality assessment reports for rnaSPAdes assemblies, compare it with modern transcriptome assembly tools using several evaluation approaches on various RNA-Seq datasets, and briefly highlight strong and weak points of different assemblers. Availability and implementation: rnaSPAdes is implemented in C++ and Python and is freely available at cab.spbu.ru/software/rnaspades/.

A Pipeline for Non-model Organisms for de novo Transcriptome Assembly, Annotation, and Gene Ontology Analysis Using Open Tools: Case Study with Scots Pine

BIO-PROTOCOL ◽

10.21769/bioprotoc.3912 ◽

2021 ◽

Vol 11 (3) ◽

Author(s):

Gustavo Duarte ◽

Polina Yu. ◽

Stanislav Geras’kin

Keyword(s):

Gene Ontology ◽

Scots Pine ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

Gene Ontology Analysis ◽

De Novo Transcriptome

High-throughput sequencing and de novo transcriptome assembly of Swertia japonica to identify genes involved in the biosynthesis of therapeutic metabolites

Plant Cell Reports ◽

10.1007/s00299-016-2021-z ◽

2016 ◽

Vol 35 (10) ◽

pp. 2091-2111 ◽

Cited By ~ 20

Author(s):

Amit Rai ◽

Michimi Nakamura ◽

Hiroki Takahashi ◽

Hideyuki Suzuki ◽

Kazuki Saito ◽

...

Keyword(s):

High Throughput ◽

De Novo ◽

Transcriptome Assembly ◽

De Novo Transcriptome ◽

Swertia Japonica

A scalable and memory-efficient algorithm for de novo transcriptome assembly of non-model organisms

BMC Genomics ◽

10.1186/s12864-017-3735-1 ◽

2017 ◽

Vol 18 (S4) ◽

Cited By ~ 5

Author(s):

Sing-Hoi Sze ◽

Meaghan L. Pimsler ◽

Jeffery K. Tomberlin ◽

Corbin D. Jones ◽

Aaron M. Tarone

Keyword(s):

Efficient Algorithm ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

De Novo Transcriptome ◽

Memory Efficient

DTA-SiST: de novo transcriptome assembly by using simplified suffix trees

BMC Bioinformatics ◽

10.1186/s12859-019-3272-9 ◽

2019 ◽

Vol 20 (S25) ◽

Author(s):

Jin Zhao ◽

Haodi Feng ◽

Daming Zhu ◽

Chi Zhang ◽

Ying Xu

Keyword(s):

Suffix Tree ◽

De Novo ◽

State Of The Art ◽

Linear Time ◽

Transcriptome Assembly ◽

Suffix Trees ◽

De Novo Transcriptome ◽

Hybrid Strategy

Background: Alternative splicing allows the pre-mRNAs of a gene to be spliced into various mRNAs, which greatly increases the diversity of proteins. High-throughput sequencing of mRNAs has revolutionized our ability to reconstruct transcripts. However, the massive volume of short reads makes de novo transcript assembly an algorithmic challenge. Results: We develop a novel radical framework, called DTA-SiST, for de novo transcriptome assembly based on suffix trees. DTA-SiST first extends contigs by reads that have the longest overlaps with the contigs’ termini. These reads can be found in time linear in the lengths of the reads through a well-designed suffix tree structure. Then, DTA-SiST constructs splicing graphs based on contigs for each gene locus. Finally, DTA-SiST proposes two strategies to extract transcript-representing paths: a depth-first enumeration strategy and a hybrid strategy based on length and coverage. We implemented the above two strategies and compared them with the state-of-the-art de novo assemblers on both simulated and real datasets. Experimental results showed that the depth-first enumeration strategy always performs better on recall, and also better on precision for smaller datasets, while the hybrid strategy leads on precision for big datasets. Conclusions: DTA-SiST is more competitive than the other compared de novo assemblers, especially on the precision measure, owing to the read-based contig extension strategy and the elegant transcript extraction rules.
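The contig-extension step described in the abstract above (extending a contig with the read that overlaps its end the most) can be illustrated with a small sketch. This is not the DTA-SiST implementation: it swaps the suffix tree for a naive scan over all reads, extends only to the right, and ignores reverse complements and sequencing errors. The names `longest_overlap`, `extend_contig`, and the `min_overlap` parameter are hypothetical.

```python
def longest_overlap(contig: str, read: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of contig that equals a prefix of read."""
    max_len = min(len(contig), len(read))
    for length in range(max_len, min_overlap - 1, -1):
        if contig.endswith(read[:length]):
            return length
    return 0


def extend_contig(seed: str, reads: list[str], min_overlap: int = 3) -> str:
    """Greedily extend seed to the right with the best-overlapping unused read."""
    contig = seed
    unused = list(reads)
    while True:
        best_read, best_len = None, 0
        for read in unused:
            olen = longest_overlap(contig, read, min_overlap)
            # Keep only reads that overlap more than the current best
            # and that contribute at least one new base.
            if best_len < olen < len(read):
                best_read, best_len = read, olen
        if best_read is None:
            return contig
        contig += best_read[best_len:]
        unused.remove(best_read)  # each read is used at most once


# Hypothetical toy input: two reads chain onto the seed, one does not overlap.
print(extend_contig("ATGCG", ["GCGTAC", "TACGGA", "CCCC"]))
```

The naive scan costs time proportional to the number of reads times the read length for every extension; avoiding that cost is the stated motivation for the suffix tree structure in the abstract.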

Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms

BMC Bioinformatics ◽

10.1186/1471-2105-13-170 ◽

2012 ◽

Vol 13 (1) ◽

pp. 170 ◽

Cited By ~ 24

Author(s):

Berat Z Haznedaroglu ◽

Darryl Reeves ◽

Hamid Rismani-Yazdi ◽

Jordan Peccia

Keyword(s):

High Throughput ◽

Functional Annotation ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

Sequencing Data ◽

De Novo Transcriptome ◽

Short Read ◽

Short Read Sequencing

First de-novo transcriptome assembly of a South American frog, Oreobates cruralis, enables population genomic studies of Neotropical amphibians

10.7287/peerj.preprints.2980v1 ◽

2017 ◽

Author(s):

Santiago Montero-Mendieta ◽

Manfred Grabherr ◽

Henrik Lantz ◽

Ignacio De la Riva ◽

Jennifer A Leonard ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

De Novo ◽

Transcriptome Assembly ◽

Cost Effective ◽

Model Organisms ◽

South American ◽

Whole Genome ◽

Rna Seq ◽

De Novo Transcriptome

Whole-genome sequencing is opening the door to novel insights into the population structure and evolutionary history of poorly known species. In organisms with large genomes, which include most amphibians, whole-genome sequencing is excessively challenging and transcriptome sequencing (RNA-seq) represents a cost-effective tool to explore genome-wide variability. Non-model organisms do not usually have a reference genome to facilitate assembly, so the transcriptome sequence must be assembled de-novo. We used RNA-seq to obtain the transcriptome profile for Oreobates cruralis, a poorly known South American direct-developing frog. In total, 550,871 transcripts were assembled, corresponding to 422,999 putative genes. Of those, we identified 23,500, 37,349, 38,120 and 45,885 genes present in the Pfam, EggNOG, KEGG and GO databases, respectively. Interestingly, our results suggested that genes related to the immune system and defense mechanisms are abundant in the transcriptome of O. cruralis. We also present a workflow to assist with pre-processing, assembling, evaluating and functionally annotating a de-novo transcriptome from RNA-seq data of non-model organisms. Our workflow guides the inexperienced user in an intuitive way through all the necessary steps to build de-novo transcriptome assemblies using readily available software and is freely available at: https://github.com/biomendi/PRACTICAL-GUIDE-TO-BUILD-DE-NOVO-TRANSCRIPTOME-ASSEMBLIES-FOR-NON-MODEL-ORGANISMS/wiki