scholarly journals Effect of de novo transcriptome assembly on transcript quantification

2018 ◽  
Author(s):  
Ping-Han Hsieh ◽  
Yen-Jen Oyang ◽  
Chien-Yu Chen

AbstractBackgroundCorrect quantification of transcript expression is essential to understand the functional products of the genome in different physiological conditions and developmental stages. Recently, the development of high-throughput RNA sequencing (RNA-Seq) allows the researchers to perform transcriptome analysis for the organisms without the reference genome and transcriptome. For such projects, de novo transcriptome assembly must be carried out prior to quantification. However, a large number of erroneous contigs produced by the assemblers might result in unreliable estimation on the abundance of transcripts. In this regard, this study comprehensively investigates how assembly quality affects the performance of quantification for RNA-Seq analysis based on de novo transcriptome assembly.ResultsSeveral important factors that might seriously affect the accuracy of the RNA-Seq analysis were thoroughly discussed. First, we found that the assemblers perform comparatively well for the transcriptomes with lower biological complexity. Second, we examined the over-extended and incomplete contigs, and then demonstrated that assembly completeness has a strong impact on the estimation of contig abundance. Lastly, we investigated the behavior of the quantifiers with respect to sequence ambiguity which might be originally present in the transcriptome or accidentally produced by assemblers. The results suggest that the quantifiers often over-estimate the expression of family-collapse contigs and under-estimate the expression of duplicated contigs. For organisms without reference transcriptome, it remained challenging to detect the inaccurate abundance estimation on family-collapse contigs. On the contrary, we observed that the situation of under-estimation on duplicated contigs can be warned through analyzing the read distribution of the duplicated contigs.ConclusionsIn summary, we explicated the behavior of quantifiers when erroneous contigs are present and we outlined the potential problems that the assemblers might cause for the downstream analysis of RNA-Seq. We anticipate the analytic results conducted in this study provides valuable insights for future development of transcriptome assembly and quantification.Availabilitywe proposed an open-source Python based package QuantEval that builds connected components for the assembled contigs based on sequence similarity and evaluates the quantification results for each connected component. The package can be downloaded from https://github.com/dn070017/QuantEval.


PLoS ONE ◽  
2015 ◽  
Vol 10 (5) ◽  
pp. e0125722 ◽  
Author(s):  
Yuli Li ◽  
Xiliang Wang ◽  
Tingting Chen ◽  
Fuwen Yao ◽  
Cuiping Li ◽  
...  


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3702 ◽  
Author(s):  
Santiago Montero-Mendieta ◽  
Manfred Grabherr ◽  
Henrik Lantz ◽  
Ignacio De la Riva ◽  
Jennifer A. Leonard ◽  
...  

Whole genome sequencing (WGS) is a very valuable resource to understand the evolutionary history of poorly known species. However, in organisms with large genomes, as most amphibians, WGS is still excessively challenging and transcriptome sequencing (RNA-seq) represents a cost-effective tool to explore genome-wide variability. Non-model organisms do not usually have a reference genome and the transcriptome must be assembledde-novo. We used RNA-seq to obtain the transcriptomic profile forOreobates cruralis, a poorly known South American direct-developing frog. In total, 550,871 transcripts were assembled, corresponding to 422,999 putative genes. Of those, we identified 23,500, 37,349, 38,120 and 45,885 genes present in the Pfam, EggNOG, KEGG and GO databases, respectively. Interestingly, our results suggested that genes related to immune system and defense mechanisms are abundant in the transcriptome ofO. cruralis. We also present a pipeline to assist with pre-processing, assembling, evaluating and functionally annotating ade-novotranscriptome from RNA-seq data of non-model organisms. Our pipeline guides the inexperienced user in an intuitive way through all the necessary steps to buildde-novotranscriptome assemblies using readily available software and is freely available at:https://github.com/biomendi/TRANSCRIPTOME-ASSEMBLY-PIPELINE/wiki.



2020 ◽  
Author(s):  
Maxim Ivanov ◽  
Albin Sandelin ◽  
Sebastian Marquardt

Abstract Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing number of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data. Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5' and 3' tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts.We reconstructed de novo the transcriptional landscape of wild type Arabidopsis thaliana seedlings as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A.thaliana genome research.Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge on the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.



2019 ◽  
Vol 10 (2) ◽  
pp. 443-454
Author(s):  
Chang Liu ◽  
Cornelius Tlotliso Sello ◽  
Yujian Sui ◽  
Jingtao Hu ◽  
Shaokang Chen ◽  
...  

In order to enrich the Anser cygnoides genome and identify the gene expression profiles of primary and secondary feather follicles development, de novo transcriptome assembly of skin tissues was established by analyzing three developmental stages at embryonic day 14, 18, and 28 (E14, E18, E28). Sequencing output generated 436,730,608 clean reads from nine libraries and de novo assembled into 56,301 unigenes. There were 2,298, 9,423 and 12,559 unigenes showing differential expression in three stages respectively. Furthermore, differentially expressed genes (DEGs) were functionally classified according to genes ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and series-cluster analysis. Relevant specific GO terms such as epithelium development, regulation of keratinocyte proliferation, morphogenesis of an epithelium were identified. In all, 15,144 DEGs were clustered into eight profiles with distinct expression patterns and 2,424 DEGs were assigned to 198 KEGG pathways. Skin development related pathways (mitogen-activated protein kinase signaling pathway, extra-cellular matrix -receptor interaction, Wingless-type signaling pathway) and genes (delta like canonical Notch ligand 1, fibroblast growth factor 2, Snail family transcriptional repressor 2, bone morphogenetic protein 6, polo like kinase 1) were identified, and eight DEGs were selected to verify the reliability of transcriptome results by real-time quantitative PCR. The findings of this study will provide the key insights into the complicated molecular mechanism and breeding techniques underlying the developmental characteristics of skin and feather follicles in Anser cygnoides.





2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Ping-Han Hsieh ◽  
Yen-Jen Oyang ◽  
Chien-Yu Chen

An amendment to this paper has been published and can be accessed via a link at the top of the paper.



2011 ◽  
Vol 54 (12) ◽  
pp. 1129-1133 ◽  
Author(s):  
Geng Chen ◽  
KangPing Yin ◽  
Charles Wang ◽  
TieLiu Shi


2018 ◽  
Author(s):  
Elena Bushmanova ◽  
Dmitry Antipov ◽  
Alla Lapidus ◽  
Andrey D. Prjibelski

AbstractSummaryPossibility to generate large RNA-seq datasets has led to development of various reference-based and de novo transcriptome assemblers with their own strengths and limitations. While reference-based tools are widely used in various transcriptomic studies, their application is limited to the model organisms with finished and annotated genomes. De novo transcriptome reconstruction from short reads remains an open challenging problem, which is complicated by the varying expression levels across different genes, alternative splicing and paralogous genes. In this paper we describe a novel transcriptome assembler called rnaSPAdes, which is developed on top of SPAdes genome assembler and explores surprising computational parallels between assembly of transcriptomes and single-cell genomes. We also present quality assessment reports for rnaSPAdes assemblies, compare it with modern transcriptome assembly tools using several evaluation approaches on various RNA-Seq datasets, and briefly highlight strong and weak points of different assemblers.Availability and implementationrnaSPAdes is implemented in C++ and Python and is freely available at cab.spbu.ru/software/rnaspades/.



2017 ◽  
Author(s):  
Arnaud Meng ◽  
Erwan Corre ◽  
Ian Probert ◽  
Andres Gutierrez-Rodriguez ◽  
Raffaele Siano ◽  
...  

ABSTRACTDinoflagellates are one of the most abundant and functionally diverse groups of eukaryotes. Despite an overall scarcity of genomic information for dinoflagellates, constantly emerging high-throughput sequencing resources can be used to characterize and compare these organisms. We assembled de novo and processed 46 dinoflagellate transcriptomes and used a sequence similarity network (SSN) to compare the underlying genomic basis of functional features within the group. This approach constitutes the most comprehensive picture to date of the genomic potential of dinoflagellates. A core proteome composed of 252 connected components (CCs) of putative conserved protein domains (pCDs) was identified. Of these, 206 were novel and 16 lacked any functional annotation in public databases. Integration of functional information in our network analyses allowed investigation of pCDs specifically associated to functional traits. With respect to toxicity, sequences homologous to those of proteins involved in toxin biosynthesis pathways (e.g. sxtA1-4 and sxtG) were not specific to known toxin-producing species. Although not fully specific to symbiosis, the most represented functions associated with proteins involved in the symbiotic trait were related to membrane processes and ion transport. Overall, our SSN approach led to identification of 45,207 and 90,794 specific and constitutive pCDs of respectively the toxic and symbiotic species represented in our analyses. Of these, 56% and 57% respectively (i.e. 25,393 and 52,193 pCDs) completely lacked annotation in public databases. This stresses the extent of our lack of knowledge, while emphasizing the potential of SSNs to identify candidate pCDs for further functional genomic characterization.



Sign in / Sign up

Export Citation Format

Share Document