Challenges and advances for transcriptome assembly in non-model species

Mapping Intimacies ◽

10.1101/084145 ◽

2016 ◽

Cited By ~ 2

Author(s):

Arnaud Ungaro ◽

Nicolas Pech ◽

Jean-François Martin ◽

R.J. Scott McCairns ◽

Jean-Philippe Mévy ◽

...

Keyword(s):

Fish Species ◽

High Performance ◽

De Novo ◽

Transcriptome Assembly ◽

Read Length ◽

Model Organisms ◽

Bias Error ◽

Model Species ◽

Guided Assembly ◽

Sequencing Platforms

Abstract. Analyses of high-throughput transcriptome sequences of non-model organisms are based on two main approaches: de novo assembly and genome-guided assembly using mapping to assign reads prior to assembly. Given the limits of mapping reads to a reference when it is highly divergent, as is frequently the case for non-model species, we evaluate whether using blastn would outperform mapping methods for read assignment in such situations (>15% divergence). We demonstrate its high performance by using simulated reads of lengths corresponding to those generated by the most common sequencing platforms, and over a realistic range of genetic divergence (0% to 30% divergence). Here we focus on gene identification and not on resolving the whole set of transcripts (i.e. the complete transcriptome). For simulated datasets, the transcriptome-guided assembly based on blastn recovers 94.8% of genes irrespective of read length at 0% divergence; however, assignment rate of reads is negatively correlated with both increasing divergence level and decreasing read length. Nevertheless, we still observe 92.6% of recovered genes at 30% divergence irrespective of read length. This analysis also produces a categorization of genes relative to their assignment, and suggests guidelines for data processing prior to analyses of comparative transcriptomics and gene expression to minimize potential inferential bias associated with incorrect transcript assignment. We also compare the performances of de novo assembly alone vs in combination with a transcriptome-guided assembly based on blastn via simulation and empirically, using data from a cyprinid fish species and from an oak species. For any simulated scenario, the transcriptome-guided assembly using blastn outperforms the de novo approach alone, including when the divergence level is beyond the reach of mapping methods.
Combining de novo assembly and a related reference transcriptome for read assignment also addresses the bias/error in contigs caused by the dependence on a related reference alone. Empirical data corroborate those findings when assembling transcriptomes from the two non-model organisms: Parachondrostoma toxostoma (fish) and Quercus pubescens (plant). For the fish species, out of the 31,944 genes known from D. rerio, the guided and de novo assemblies recover respectively 20,605 and 20,032 genes but the performance of the guided assembly approach is much higher for both the contiguity and completeness metrics. For the oak, out of the 29,971 genes known from Vitis vinifera, the transcriptome-guided and de novo assemblies display similar performance but the new guided approach detects 16,326 genes where the de novo assembly only detects 9,385 genes.
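The empirical gene counts above are easier to compare as fractions of each reference gene set. The following is a minimal sketch of that arithmetic; the counts are copied from the abstract, while the function name `recovery_rate` and the dictionary layout are illustrative assumptions rather than anything used by the authors.

```python
# Gene counts quoted in the abstract:
# (genes known in the reference, genes found by guided assembly, genes found de novo).
counts = {
    "P. toxostoma vs D. rerio": (31_944, 20_605, 20_032),
    "Q. pubescens vs V. vinifera": (29_971, 16_326, 9_385),
}


def recovery_rate(recovered: int, reference_total: int) -> float:
    """Percentage of reference genes detected in an assembly (hypothetical helper)."""
    return 100.0 * recovered / reference_total


for label, (total, guided, de_novo) in counts.items():
    print(
        f"{label}: guided {recovery_rate(guided, total):.1f}%, "
        f"de novo {recovery_rate(de_novo, total):.1f}%"
    )
```

Expressed this way, the two approaches differ by roughly two percentage points for the fish but by more than twenty for the oak, which matches the contrast the abstract draws between the two datasets.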

The brain transcriptome of the wolf spider, Schizocosa ocreata

BMC Research Notes ◽

10.1186/s13104-021-05648-y ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Daniel Stribling ◽

Peter L. Chang ◽

Justin E. Dalton ◽

Christopher A. Conow ◽

Malcolm Rosenthal ◽

...

Keyword(s):

Gene Expression ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

De Novo Transcriptome Assembly ◽

De Novo Transcriptome ◽

Wolf Spiders ◽

Schizocosa Ocreata ◽

Genomic Studies ◽

The Brain

Abstract. Objectives: Arachnids have fascinating and unique biology, particularly for questions on sex differences and behavior, creating the potential for development of powerful emerging models in this group. Recent advances in genomic techniques have paved the way for a significant increase in the breadth of genomic studies in non-model organisms. One growing area of research is comparative transcriptomics. When phylogenetic relationships to model organisms are known, comparative genomic studies provide context for analysis of homologous genes and pathways. The goal of this study was to lay the groundwork for comparative transcriptomics of sex differences in the brain of wolf spiders, a non-model organism of the phylum Euarthropoda, by generating transcriptomes and analyzing gene expression. Data description: To examine sex-differential gene expression, short read transcript sequencing and de novo transcriptome assembly were performed. Messenger RNA was isolated from brain tissue of male and female subadult and mature wolf spiders (Schizocosa ocreata). The raw data consist of sequences for the two different life stages in each sex. Computational analyses on these data include de novo transcriptome assembly and differential expression analyses. Sample-specific and combined transcriptomes, gene annotations, and differential expression results are described in this data note and are available from public databases.

A practical guide to build de-novo assemblies for single tissues of non-model organisms: the example of a Neotropical frog

PeerJ ◽

10.7717/peerj.3702 ◽

2017 ◽

Vol 5 ◽

pp. e3702 ◽

Cited By ~ 5

Author(s):

Santiago Montero-Mendieta ◽

Manfred Grabherr ◽

Henrik Lantz ◽

Ignacio De la Riva ◽

Jennifer A. Leonard ◽

...

Keyword(s):

Defense Mechanisms ◽

De Novo ◽

Transcriptome Assembly ◽

Cost Effective ◽

Model Organisms ◽

Rna Seq ◽

Assembly Pipeline ◽

Wide Variability ◽

History Of ◽

Inexperienced User

Whole genome sequencing (WGS) is a very valuable resource to understand the evolutionary history of poorly known species. However, in organisms with large genomes, such as most amphibians, WGS is still excessively challenging and transcriptome sequencing (RNA-seq) represents a cost-effective tool to explore genome-wide variability. Non-model organisms do not usually have a reference genome and the transcriptome must be assembled de-novo. We used RNA-seq to obtain the transcriptomic profile for Oreobates cruralis, a poorly known South American direct-developing frog. In total, 550,871 transcripts were assembled, corresponding to 422,999 putative genes. Of those, we identified 23,500, 37,349, 38,120 and 45,885 genes present in the Pfam, EggNOG, KEGG and GO databases, respectively. Interestingly, our results suggested that genes related to immune system and defense mechanisms are abundant in the transcriptome of O. cruralis. We also present a pipeline to assist with pre-processing, assembling, evaluating and functionally annotating a de-novo transcriptome from RNA-seq data of non-model organisms. Our pipeline guides the inexperienced user in an intuitive way through all the necessary steps to build de-novo transcriptome assemblies using readily available software and is freely available at: https://github.com/biomendi/TRANSCRIPTOME-ASSEMBLY-PIPELINE/wiki.

rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data

10.1101/420208 ◽

2018 ◽

Cited By ~ 13

Author(s):

Elena Bushmanova ◽

Dmitry Antipov ◽

Alla Lapidus ◽

Andrey D. Prjibelski

Keyword(s):

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

Challenging Problem ◽

Rna Seq ◽

De Novo Transcriptome ◽

Weak Points ◽

Transcriptome Reconstruction ◽

Evaluation Approaches ◽

Genome Assembler

Abstract. Summary: The possibility to generate large RNA-seq datasets has led to the development of various reference-based and de novo transcriptome assemblers with their own strengths and limitations. While reference-based tools are widely used in various transcriptomic studies, their application is limited to model organisms with finished and annotated genomes. De novo transcriptome reconstruction from short reads remains an open and challenging problem, which is complicated by the varying expression levels across different genes, alternative splicing and paralogous genes. In this paper we describe a novel transcriptome assembler called rnaSPAdes, which is developed on top of the SPAdes genome assembler and explores surprising computational parallels between assembly of transcriptomes and single-cell genomes. We also present quality assessment reports for rnaSPAdes assemblies, compare it with modern transcriptome assembly tools using several evaluation approaches on various RNA-Seq datasets, and briefly highlight strong and weak points of different assemblers. Availability and implementation: rnaSPAdes is implemented in C++ and Python and is freely available at cab.spbu.ru/software/rnaspades/.

K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity

10.1101/149948 ◽

2017 ◽

Author(s):

Chang Sik Kim ◽

Martyn D. Winn ◽

Vipin Sachdeva ◽

Kirk E. Jordan

Keyword(s):

Clustering Algorithm ◽

De Novo ◽

Transcriptome Assembly ◽

Initial Step ◽

Computer Hardware ◽

Model Organisms ◽

De Bruijn Graph ◽

Memory Representation ◽

Novel Approach ◽

Sequencing Problems

Abstract. Background: De novo transcriptome assembly is an important technique for understanding gene expression in non-model organisms. Many de novo assemblers using the de Bruijn graph of a set of the RNA sequences rely on in-memory representation of this graph. However, current methods analyse the complete set of read-derived k-mer sequences at once, resulting in the need for computer hardware with large shared memory. Results: We introduce a novel approach that clusters k-mers as the first step. The clusters correspond to small sets of gene products, which can be processed quickly to give candidate transcripts. We implement the clustering step using the MapReduce approach for parallelising the analysis of large datasets, which enables the use of compute clusters. The computational task is distributed across the compute system, and no specialised hardware is required. Using this approach, we have re-implemented the Inchworm module from the widely used Trinity pipeline, and tested the method in the context of the full Trinity pipeline. Validation tests on a range of real datasets show large reductions in the runtime and per-node memory requirements, when making use of a compute cluster. Conclusions: Our study shows that MapReduce-based clustering has great potential for distributing challenging sequencing problems, without loss of accuracy. Although we have focussed on the Trinity package, we propose that such clustering is a useful initial step for other assembly pipelines.
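The idea of clustering k-mers before assembly can be illustrated with a toy, single-process sketch. It is not the authors' MapReduce implementation: the function names, the union-find merging, and the rule that adjacent k-mers within a read belong to the same cluster are all assumptions made for illustration, and reverse complements are ignored.

```python
from collections import defaultdict


def kmers(read: str, k: int):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def map_read(read: str, k: int):
    """'Map' step: emit pairs of adjacent k-mers (they overlap by k-1 bases)."""
    ks = list(kmers(read, k))
    if len(ks) == 1:
        return [(ks[0], ks[0])]  # a lone k-mer still forms its own cluster
    return list(zip(ks, ks[1:]))


def cluster_kmers(reads, k: int):
    """'Reduce' step: merge linked k-mers into clusters with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for read in reads:
        for a, b in map_read(read, k):
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    clusters = defaultdict(set)
    for kmer in parent:
        clusters[find(kmer)].add(kmer)
    return list(clusters.values())


# The first two reads share the k-mer GTAC, so they fall into one cluster;
# the third read shares nothing with them and forms a second cluster.
reads = ["ACGTAC", "GTACGG", "TTTTGA"]
for cluster in cluster_kmers(reads, k=4):
    print(sorted(cluster))
```

In a distributed setting the map step would run independently per read and the merging would be partitioned across nodes; here both run in one loop purely to show how each cluster becomes a small, separately processable unit.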

Proteotranscriptomics assisted gene annotation and spatial proteomics of Bombyx mori BmN4 cell line

10.21203/rs.3.rs-23159/v2 ◽

2020 ◽

Author(s):

Michal Levin ◽

Marion Scheibe ◽

Falk Butter

Keyword(s):

Mass Spectrometry ◽

Bombyx Mori ◽

Cell Line ◽

De Novo ◽

High Resolution Mass Spectrometry ◽

Gene Annotation ◽

Transcriptome Assembly ◽

Model Organisms ◽

Sequence Information ◽

A Genome

Abstract. Background: The process of identifying all coding regions in a genome is crucial for any study at the level of molecular biology, ranging from single-gene cloning to genome-wide measurements using RNA-Seq or mass spectrometry. While satisfactory annotation has been made feasible for well-studied model organisms through great efforts of big consortia, for most systems this kind of data is either absent or not adequately precise. Results: Combining in-depth transcriptome sequencing and high resolution mass spectrometry, we here use proteotranscriptomics to improve gene annotation of protein-coding genes in the Bombyx mori cell line BmN4, which is an increasingly used tool for the analysis of piRNA biogenesis and function. Using this approach we provide the exact coding sequence and evidence for more than 6,200 genes on the protein level. Furthermore, using spatial proteomics, we establish the subcellular localization of thousands of these proteins. We show that our approach outperforms current Bombyx mori annotation attempts in terms of accuracy and coverage. Conclusions: We show that proteotranscriptomics is an efficient, cost-effective and accurate approach to improve previous annotations or generate new gene models. As this technique is based on de-novo transcriptome assembly, it makes it possible to study any species, even in the absence of the genome sequence information without which proteogenomics would be impossible.

A Pipeline for Non-model Organisms for de novo Transcriptome Assembly, Annotation, and Gene Ontology Analysis Using Open Tools: Case Study with Scots Pine

BIO-PROTOCOL ◽

10.21769/bioprotoc.3912 ◽

2021 ◽

Vol 11 (3) ◽

Author(s):

Gustavo Duarte ◽

Polina Yu. ◽

Stanislav Geras’kin

Keyword(s):

Gene Ontology ◽

Scots Pine ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

Gene Ontology Analysis ◽

De Novo Transcriptome Assembly ◽

De Novo Transcriptome

An exploration of assembly strategies and quality metrics on the accuracy of the Knightia excelsa (rewarewa) genome.

10.22541/au.161048558.86691399/v1 ◽

2021 ◽

Author(s):

Ann McCartney ◽

Elena Hilario ◽

Seung-Sub Choi ◽

Joseph Guhlin ◽

Jessie Prebble ◽

...

Keyword(s):

New Zealand ◽

De Novo ◽

Quality Metrics ◽

Read Length ◽

Model Organisms ◽

Sequencing Data ◽

Contig Assembly ◽

High Quality ◽

Aotearoa New Zealand ◽

Long Read

We used long read sequencing data generated from Knightia excelsa R.Br., a nectar producing Proteaceae tree endemic to Aotearoa New Zealand, to explore how sequencing data type, volume and workflows can impact final assembly accuracy and chromosome construction. Establishing a high-quality genome for this species has specific cultural importance to Māori, the indigenous people, as well as commercial importance to honey producers in Aotearoa New Zealand. Assemblies were produced by five long read assemblers using data subsampled based on read lengths, two polishing strategies, and two Hi-C mapping methods. Our results from subsampling the data by read length showed that each assembler tested performed differently depending on the coverage and the read length of the data. Assemblies that used longer read lengths (>30 kb) and lower coverage were the most contiguous and the most complete in terms of k-mers and genes. The final genome assembly was constructed into pseudo-chromosomes using all available data assembled with FLYE, polished using Racon/Medaka/Pilon combined, scaffolded using SALSA2 and AllHiC, curated using Juicebox, and validated by synteny with Macadamia. We highlighted the importance of developing assembly workflows based on the volume and type of sequencing data and establishing a set of robust quality metrics for generating high quality assemblies. Scaffolding analyses highlighted that problems found in the initial assemblies could not be resolved accurately by utilizing Hi-C data and that scaffolded assemblies were more accurate when the underlying contig assembly was of higher accuracy. These findings provide insight into what is required for future high-quality de-novo assemblies of non-model organisms.
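The abstract compares assemblies by contiguity without naming a specific metric. A widely used contiguity summary is N50, sketched below as an assumption about the kind of metric involved rather than a statement of what the authors computed; the function name and the example contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Two hypothetical assemblies with the same total length (290 units):
contiguous = [80, 70, 50, 40, 30, 20]
fragmented = [10] * 29
print(n50(contiguous), n50(fragmented))
```

Because N50 rewards long contigs regardless of whether they are correct, it is usually read alongside completeness and accuracy measures, which is consistent with the abstract's call for a set of quality metrics rather than a single number.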

TransPi – a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly

10.1101/2021.02.18.431773 ◽

2021 ◽

Author(s):

R.E. Rivera-Vicéns ◽

C. Garcia Escudero ◽

N. Conci ◽

M. Eitel ◽

G. Wörheide

Keyword(s):

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

Rna Seq ◽

Analysis Pipeline ◽

User Input ◽

Genome Data ◽

Differential Gene ◽

Transcriptomic Level ◽

Genome Information

Abstract. The use of RNA-Seq data and the generation of de novo transcriptome assemblies have been pivotal for studies in ecology and evolution. This is distinctly true for non-model organisms, where no genome information is available; yet, studies of differential gene expression, DNA enrichment baits design, and phylogenetics can all be accomplished with the data gathered at the transcriptomic level. Multiple tools are available for transcriptome assembly; however, no single tool can provide the best assembly for all datasets. Therefore, a multi-assembler approach, followed by a reduction step, is often sought to generate an improved representation of the assembly. To reduce errors in these complex analyses while at the same time attaining reproducibility and scalability, automated workflows have been essential in the analysis of RNA-Seq data. However, most of these tools are designed for species where genome data is used as reference for the assembly process, limiting their use in non-model organisms. We present TransPi, a comprehensive pipeline for de novo transcriptome assembly, with minimum user input but without losing the ability of a thorough analysis. A combination of different model organisms, k-mer sets, read lengths, and read quantities were used for assessing the tool. Furthermore, a total of 49 non-model organisms, spanning different phyla, were also analyzed. Compared to approaches using single assemblers only, TransPi produces higher BUSCO completeness percentages, and a concurrent significant reduction in duplication rates. TransPi is easy to configure and can be deployed seamlessly using Conda, Docker and Singularity.
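The "multi-assembler approach, followed by a reduction step" can be pictured with a deliberately naive sketch: pool the contigs from several assemblers, then discard any contig that is already contained in a longer one on either strand. This is a stand-in for illustration only and is not TransPi's reduction method; the assembler labels, function names and sequences are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def reduce_contigs(assemblies):
    """Pool contigs and drop any contained in a longer kept contig (either strand)."""
    pooled = {c.upper() for contigs in assemblies.values() for c in contigs}
    kept = []
    # Longest first, so shorter redundant contigs are tested against longer ones.
    for contig in sorted(pooled, key=lambda s: (-len(s), s)):
        rc = reverse_complement(contig)
        if any(contig in longer or rc in longer for longer in kept):
            continue
        kept.append(contig)
    return kept


# Hypothetical output of two assemblers run on the same reads.
assemblies = {
    "assembler_a": ["ACGTTGCAAT", "GGGCCC"],
    "assembler_b": ["CGTTGC", "ATTGCAACGT", "TTTAAACC"],
}
print(reduce_contigs(assemblies))
```

Exact containment is far stricter than the similarity-based clustering real pipelines rely on, and the quadratic scan would not scale to full transcriptomes, but it shows why pooling assemblies inflates duplication and why a reduction step is needed to bring it back down.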

De novo transcriptome assembly for a non-model species, the blood-sucking bug Triatoma brasiliensis, a vector of Chagas disease

Genetica ◽

10.1007/s10709-014-9790-5 ◽

2014 ◽

Vol 143 (2) ◽

pp. 225-239 ◽

Cited By ~ 15

Author(s):

A. Marchant ◽

F. Mougel ◽

C. Almeida ◽

E. Jacquin-Joly ◽

J. Costa ◽

...

Keyword(s):

Chagas Disease ◽

De Novo ◽

Transcriptome Assembly ◽

De Novo Transcriptome Assembly ◽

De Novo Transcriptome ◽

Model Species ◽

Blood Sucking

A scalable and memory-efficient algorithm for de novo transcriptome assembly of non-model organisms

BMC Genomics ◽

10.1186/s12864-017-3735-1 ◽

2017 ◽

Vol 18 (S4) ◽

Cited By ~ 5

Author(s):

Sing-Hoi Sze ◽

Meaghan L. Pimsler ◽

Jeffery K. Tomberlin ◽

Corbin D. Jones ◽

Aaron M. Tarone

Keyword(s):

Efficient Algorithm ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

De Novo Transcriptome Assembly ◽

De Novo Transcriptome ◽

Memory Efficient