CAARS: comparative assembly and annotation of RNA-Seq data

Abstract Motivation RNA sequencing (RNA-Seq) is a widely used approach to obtain transcript sequences in non-model organisms, notably for performing comparative analyses. However, current bioinformatic pipelines do not take full advantage of pre-existing reference data in related species for improving RNA-Seq assembly, annotation and gene family reconstruction. Results We built an automated pipeline named CAARS to combine novel data from RNA-Seq experiments with existing multi-species gene family alignments. RNA-Seq reads are assembled into transcripts by both de novo and assisted assemblies. Then, CAARS incorporates transcripts into gene families, builds gene alignments and trees and uses phylogenetic information to classify the genes as orthologs and paralogs of existing genes. We used CAARS to assemble and annotate RNA-Seq data in rodents and fishes using distantly related genomes as reference, a difficult case for this kind of analysis. We showed CAARS assemblies are more complete and accurate than those assembled by a standard pipeline consisting of de novo assembly coupled with annotation by sequence similarity on a guide species. In addition to annotated transcripts, CAARS provides gene family alignments and trees, annotated with orthology relationships, directly usable for downstream comparative analyses. Availability and implementation CAARS is implemented in Python and Ocaml and is freely available at https://github.com/carinerey/caars. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Identification and Expression Analysis of the Genes Involved in the Raffinose Family Oligosaccharides Pathway of Phaseolus vulgaris and Glycine max

Plants ◽

10.3390/plants10071465 ◽

2021 ◽

Vol 10 (7) ◽

pp. 1465

Author(s):

Ramon de Koning ◽

Raphaël Kiekens ◽

Mary Esther Muyoka Toili ◽

Geert Angenon

Keyword(s):

Common Bean ◽

Seed Development ◽

Expression Analysis ◽

De Novo ◽

Expression Patterns ◽

Gene Families ◽

Rna Seq ◽

Raffinose Family Oligosaccharides ◽

Specific Expression ◽

Raffinose Synthase

Raffinose family oligosaccharides (RFO) play an important role in plants but are also considered to be antinutritional factors. A profound understanding of the galactinol and RFO biosynthetic gene families and the expression patterns of the individual genes is a prerequisite for the sustainable reduction of the RFO content in the seeds, without compromising normal plant development and functioning. In this paper, an overview of the annotation and genetic structure of all galactinol- and RFO biosynthesis genes is given for soybean and common bean. In common bean, three galactinol synthase genes, two raffinose synthase genes and one stachyose synthase gene were identified for the first time. To discover the expression patterns of these genes in different tissues, two expression atlases have been created through re-analysis of publicly available RNA-seq data. De novo expression analysis through an RNA-seq study during seed development of three varieties of common bean gave more insight into the expression patterns of these genes during the seed development. The results of the expression analysis suggest that different classes of galactinol- and RFO synthase genes have tissue-specific expression patterns in soybean and common bean. With the obtained knowledge, important galactinol- and RFO synthase genes that specifically play a key role in the accumulation of RFOs in the seeds are identified. These candidate genes may play a pivotal role in reducing the RFO content in the seeds of important legumes which could improve the nutritional quality of these beans and would solve the discomforts associated with their consumption.

Download Full-text

Microbial Functional Responses to Cholesterol Catabolism in Denitrifying Sludge

mSystems ◽

10.1128/msystems.00113-18 ◽

2018 ◽

Vol 3 (5) ◽

Cited By ~ 6

Author(s):

Sean Ting-Shyang Wei ◽

Yu-Wei Wu ◽

Tzong-Huei Lee ◽

Yi-Shiang Huang ◽

Cheng-Yu Yang ◽

...

Keyword(s):

16S Rrna ◽

De Novo ◽

Anthropogenic Impacts ◽

Sequence Similarity ◽

Community Level ◽

Model Organisms ◽

Substrate Uptake ◽

Functional Responses ◽

Content Type ◽

Rare Biosphere

ABSTRACTThe 2,3-secopathway, the pathway for anaerobic cholesterol degradation, has been established in the denitrifying betaproteobacteriumSterolibacterium denitrificans. However, knowledge of how microorganisms respond to cholesterol at the community level is elusive. Here, we applied mesocosm incubation and 16S rRNA sequencing to reveal that, in denitrifying sludge communities, three betaproteobacterial operational taxonomic units (OTUs) with low (94% to 95%) 16S rRNA sequence similarity toStl. denitrificansare cholesterol degraders and members of the rare biosphere. Metatranscriptomic and metabolite analyses show that these degraders adopt the 2,3-secopathway to sequentially catalyze the side chain and sterane of cholesterol and that two molybdoenzymes—steroid C25 dehydrogenase and 1-testosterone dehydrogenase/hydratase—are crucial for these bioprocesses, respectively. The metatranscriptome further suggests that these betaproteobacterial degraders display chemotaxis and motility toward cholesterol and that FadL-like transporters may be the key components for substrate uptake. Also, these betaproteobacteria are capable of transporting micronutrients and synthesizing cofactors essential for cellular metabolism and cholesterol degradation; however, the required cobalamin is possibly provided by cobalamin-de novo-synthesizing gamma-, delta-, and betaproteobacteria via the salvage pathway. Overall, our results indicate that the ability to degrade cholesterol in sludge communities is reserved for certain rare biosphere members and that C25 dehydrogenase can serve as a biomarker for sterol degradation in anoxic environments.IMPORTANCESteroids are ubiquitous and abundant natural compounds that display recalcitrance. Biodegradation via sludge communities in wastewater treatment plants is the primary removal process for steroids. To date, compared to studies for aerobic steroid degradation, the knowledge of anaerobic degradation of steroids has been based on only a few model organisms. Due to the increase of anthropogenic impacts, steroid inputs may affect microbial diversity and functioning in ecosystems. Here, we first investigated microbial functional responses to cholesterol, the most abundant steroid in sludge, at the community level. Our metagenomic and metatranscriptomic analyses revealed that the capacities for cholesterol approach, uptake, and degradation are unique traits of certain low-abundance betaproteobacteria, indicating the importance of the rare biosphere in bioremediation. Apparent expression of genes involved in cofactorde novosynthesis and salvage pathways suggests that these micronutrients play important roles for cholesterol degradation in sludge communities.

Download Full-text

AStrap: identification of alternative splicing from transcript sequences without a reference genome

Bioinformatics ◽

10.1093/bioinformatics/bty1008 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2654-2656 ◽

Cited By ~ 5

Author(s):

Guoli Ji ◽

Wenbin Ye ◽

Yaru Su ◽

Moliang Chen ◽

Guangzao Huang ◽

...

Keyword(s):

Machine Learning ◽

Alternative Splicing ◽

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Supplementary Information ◽

Model Organisms ◽

Sequencing Data ◽

Extensive Evaluation ◽

Reference Genomes

Abstract Summary Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity, however, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify much more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

A practical guide to buildde-novoassemblies for single tissues of non-model organisms: the example of a Neotropical frog

PeerJ ◽

10.7717/peerj.3702 ◽

2017 ◽

Vol 5 ◽

pp. e3702 ◽

Cited By ~ 5

Author(s):

Santiago Montero-Mendieta ◽

Manfred Grabherr ◽

Henrik Lantz ◽

Ignacio De la Riva ◽

Jennifer A. Leonard ◽

...

Keyword(s):

Defense Mechanisms ◽

De Novo ◽

Transcriptome Assembly ◽

Cost Effective ◽

Model Organisms ◽

Rna Seq ◽

Assembly Pipeline ◽

Wide Variability ◽

History Of ◽

Inexperienced User

Whole genome sequencing (WGS) is a very valuable resource to understand the evolutionary history of poorly known species. However, in organisms with large genomes, as most amphibians, WGS is still excessively challenging and transcriptome sequencing (RNA-seq) represents a cost-effective tool to explore genome-wide variability. Non-model organisms do not usually have a reference genome and the transcriptome must be assembledde-novo. We used RNA-seq to obtain the transcriptomic profile forOreobates cruralis, a poorly known South American direct-developing frog. In total, 550,871 transcripts were assembled, corresponding to 422,999 putative genes. Of those, we identified 23,500, 37,349, 38,120 and 45,885 genes present in the Pfam, EggNOG, KEGG and GO databases, respectively. Interestingly, our results suggested that genes related to immune system and defense mechanisms are abundant in the transcriptome ofO. cruralis. We also present a pipeline to assist with pre-processing, assembling, evaluating and functionally annotating ade-novotranscriptome from RNA-seq data of non-model organisms. Our pipeline guides the inexperienced user in an intuitive way through all the necessary steps to buildde-novotranscriptome assemblies using readily available software and is freely available at:https://github.com/biomendi/TRANSCRIPTOME-ASSEMBLY-PIPELINE/wiki.

Download Full-text

CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009631 ◽

2021 ◽

Vol 17 (11) ◽

pp. e1009631

Author(s):

Raquel Linheiro ◽

John Archer

Keyword(s):

De Novo ◽

Simulated Data ◽

Real Data ◽

Gene Families ◽

Classification Systems ◽

Whole Body ◽

Cdna Libraries ◽

Sequence Information ◽

Rna Seq ◽

High Quality

With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. As a demonstration of CStones ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species, Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes; the latter two being high-quality, well established, de novo assembers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs, representing two adult D. melanogaster whole-body samples were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStones assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools. Within a related side study, we explore the effects that chimera’s within reference sets have on the identification of differentially expression genes. CStone is available at: https://sourceforge.net/projects/cstone/.

Download Full-text

Understanding the Early Evolutionary Stages of a Tandem Drosophilamelanogaster-Specific Gene Family: A Structural and Functional Population Study

Molecular Biology and Evolution ◽

10.1093/molbev/msaa109 ◽

2020 ◽

Vol 37 (9) ◽

pp. 2584-2600 ◽

Cited By ~ 3

Author(s):

Bryan D Clifton ◽

Jamie Jimenez ◽

Ashlyn Kimura ◽

Zeinab Chahine ◽

Pablo Librado ◽

...

Keyword(s):

Gene Family ◽

Sequence Similarity ◽

Gene Families ◽

Read Depth ◽

Specific Gene ◽

Protein Variant ◽

Protein Coding ◽

Expression Levels ◽

Number Variation ◽

Reference Quality

Abstract Gene families underlie genetic innovation and phenotypic diversification. However, our understanding of the early genomic and functional evolution of tandemly arranged gene families remains incomplete as paralog sequence similarity hinders their accurate characterization. The Drosophila melanogaster-specific gene family Sdic is tandemly repeated and impacts sperm competition. We scrutinized Sdic in 20 geographically diverse populations using reference-quality genome assemblies, read-depth methodologies, and qPCR, finding that ∼90% of the individuals harbor 3–7 copies as well as evidence of population differentiation. In strains with reliable gene annotations, copy number variation (CNV) and differential transposable element insertions distinguish one structurally distinct version of the Sdic region per strain. All 31 annotated copies featured protein-coding potential and, based on the protein variant encoded, were categorized into 13 paratypes differing in their 3′ ends, with 3–5 paratypes coexisting in any strain examined. Despite widespread gene conversion, the only copy present in all strains has functionally diverged at both coding and regulatory levels under positive selection. Contrary to artificial tandem duplications of the Sdic region that resulted in increased male expression, CNV in cosmopolitan strains did not correlate with expression levels, likely as a result of differential genome modifier composition. Duplicating the region did not enhance sperm competitiveness, suggesting a fitness cost at high expression levels or a plateau effect. Beyond facilitating a minimally optimal expression level, Sdic CNV acts as a catalyst of protein and regulatory diversity, showcasing a possible evolutionary path recently formed tandem multigene families can follow toward long-term consolidation in eukaryotic genomes.

Download Full-text

De novo transcriptomes of 14 gammarid individuals for proteogenomic analysis of seven taxonomic groups

Scientific Data ◽

10.1038/s41597-019-0192-5 ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 5

Author(s):

Yannick Cogne ◽

Davide Degli-Esposti ◽

Olivier Pible ◽

Duarte Gouveia ◽

Adeline François ◽

...

Keyword(s):

Aquatic Ecosystems ◽

Molecular Diversity ◽

De Novo ◽

Model Organisms ◽

Rna Seq ◽

Specific Expression ◽

Gammarus Fossarum ◽

Male Individual ◽

Taxonomic Groups ◽

Whole Transcriptome

Abstract Gammarids are amphipods found worldwide distributed in fresh and marine waters. They play an important role in aquatic ecosystems and are well established sentinel species in ecotoxicology. In this study, we sequenced the transcriptomes of a male individual and a female individual for seven different taxonomic groups belonging to the two genera Gammarus and Echinogammarus: Gammarus fossarum A, G. fossarum B, G. fossarum C, Gammarus wautieri, Gammarus pulex, Echinogammarus berilloni, and Echinogammarus marinus. These taxa were chosen to explore the molecular diversity of transcribed genes of genotyped individuals from these groups. Transcriptomes were de novo assembled and annotated. High-quality assembly was confirmed by BUSCO comparison against the Arthropod dataset. The 14 RNA-Seq-derived protein sequence databases proposed here will be a significant resource for proteogenomics studies of these ecotoxicologically relevant non-model organisms. These transcriptomes represent reliable reference sequences for whole-transcriptome and proteome studies on other gammarids, for primer design to clone specific genes or monitor their specific expression, and for analyses of molecular differences between gammarid species.

Download Full-text

rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data

10.1101/420208 ◽

2018 ◽

Cited By ~ 13

Author(s):

Elena Bushmanova ◽

Dmitry Antipov ◽

Alla Lapidus ◽

Andrey D. Prjibelski

Keyword(s):

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

Challenging Problem ◽

Rna Seq ◽

De Novo Transcriptome ◽

Weak Points ◽

Transcriptome Reconstruction ◽

Evaluation Approaches ◽

Genome Assembler

AbstractSummaryPossibility to generate large RNA-seq datasets has led to development of various reference-based and de novo transcriptome assemblers with their own strengths and limitations. While reference-based tools are widely used in various transcriptomic studies, their application is limited to the model organisms with finished and annotated genomes. De novo transcriptome reconstruction from short reads remains an open challenging problem, which is complicated by the varying expression levels across different genes, alternative splicing and paralogous genes. In this paper we describe a novel transcriptome assembler called rnaSPAdes, which is developed on top of SPAdes genome assembler and explores surprising computational parallels between assembly of transcriptomes and single-cell genomes. We also present quality assessment reports for rnaSPAdes assemblies, compare it with modern transcriptome assembly tools using several evaluation approaches on various RNA-Seq datasets, and briefly highlight strong and weak points of different assemblers.Availability and implementationrnaSPAdes is implemented in C++ and Python and is freely available at cab.spbu.ru/software/rnaspades/.

Download Full-text

The evaluation of RNA-Seq de novo assembly by PacBio long read sequencing

10.1101/735621 ◽

2019 ◽

Author(s):

Yifan Yang ◽

Michael Gribskov

Keyword(s):

Real Time ◽

De Novo ◽

Critical Issue ◽

Evaluation Methods ◽

Model Organisms ◽

Rna Seq ◽

Long Reads ◽

Long Read ◽

Set Up ◽

Downstream Analysis

AbstractRNA-Seq de novo assembly is an important method to generate transcriptomes for non-model organisms before any downstream analysis. Given many great de novo assembly methods developed by now, one critical issue is that there is no consensus on the evaluation of de novo assembly methods yet. Therefore, to set up a benchmark for evaluating the quality of de novo assemblies is very critical. Addressing this challenge will help us deepen the insights on the properties of different de novo assemblers and their evaluation methods, and provide hints on choosing the best assembly sets as transcriptomes of non-model organisms for the further functional analysis. In this article, we generate a “real time” transcriptome using PacBio long reads as a benchmark for evaluating five de novo assemblers and two model-based de novo assembly evaluation methods. By comparing the de novo assmblies generated by RNA-Seq short reads with the “real time” transcriptome from the same biological sample, we find that Trinity is best at the completeness by generating more assemblies than the alternative assemblers, but less continuous and having more misassemblies; Oases is best at the continuity and specificity, but less complete; The performance of SOAPdenovo-Trans, Trans-AByss and IDBA-Tran are in between of five assemblers. For evaluation methods, DETONATE leverages multiple aspects of the assembly set and ranks the assembly set with an average performance as the best, meanwhile the contig score can serve as a good metric to select assemblies with high completeness, specificity, continuity but not sensitive to misassemblies; TransRate contig score is useful for removing misassemblies, yet often the assemblies in the optimal set is too few to be used as a transcriptome.

Download Full-text

Ion channel profiling of the Lymnaea stagnalis ganglia via transcriptome analysis

10.21203/rs.3.rs-31358/v1 ◽

2020 ◽

Author(s):

Nan Dong ◽

Julia Bandura ◽

Zhaolei Zhang ◽

Yan Wang ◽

Karine Labadie ◽

...

Keyword(s):

Ion Channels ◽

Reference Genome ◽

Lymnaea Stagnalis ◽

De Novo ◽

Model Organisms ◽

Sequence Length ◽

Pond Snail ◽

Sequence Information ◽

Functional Domain ◽

Rna Seq

Abstract Background. The pond snail Lymnaea stagnalis (L. stagnalis) has been widely used as a model organism in neurobiology, ecotoxicology, and parasitology due to the relative simplicity of its CNS. However, its usefulness is restricted by a limited availability of transcriptome data. While sequence information for the L. stagnalis CNS transcripts has been obtained from EST library and a de novo RNA-seq assembly, the quality of these assemblies is limited by a combination of low coverage of EST libraries, the fragmented nature of de novo assemblies, and lack of reference genome. Results. In this study, taking advantage of the recent availability of the L. stagnalis reference genome, we generated an RNA-seq library from the adult L. stagnalis CNS, using a combination of genome-guided and de novo assembly programs to identify 17,832 protein-coding L. stagnalis transcripts. We combined our library with existing resources to produce a transcript set with greater sequence length, completeness, and diversity than previously available ones. Using our assembly and functional domain analysis, we profiled L. stagnalis CNS transcripts encoding ion channels and ionotropic receptors, which are key proteins for CNS function, and compared their sequences to other vertebrate and invertebrate model organisms. Interestingly, L. stagnalis transcripts encoding numerous putative Ca2+ channels showed the most sequence similarity to those of mouse, zebrafish, Xenopus tropicalis, fruit fly, and C. elegans, suggesting that many calcium channel-related signaling pathways may be evolutionarily conserved. Conclusions. Our study provides the most thorough characterization to date of the L. stagnalis transcriptome and provides insights into differences between vertebrates and invertebrates in CNS transcript diversity, according to function and protein class. Furthermore, this study is, to the best of our knowledge, the first to provide a complete characterization of the ion channels of a single species, opening new avenues for future research on fundamental neurobiological processes.

Download Full-text