TranscriptomeReconstructoR: data-driven annotation of complex transcriptomes

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Maxim Ivanov ◽  
Albin Sandelin ◽  
Sebastian Marquardt

Abstract Background The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data. Results We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: (i) full-length RNA-seq for detection of splicing patterns and (ii) high-throughput 5′ and 3′ tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. We reconstructed de novo the transcriptional landscape of wild-type Arabidopsis thaliana seedlings and Saccharomyces cerevisiae cells as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the most commonly used community gene models, TAIR10 and Araport11 for A. thaliana and SacCer3 for S. cerevisiae. In particular, we identify multiple transient transcripts missing from the existing annotations. Our new annotations promise to improve the quality of A. thaliana and S. cerevisiae genome research. Conclusions Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline requires prior knowledge only of the reference genomic DNA sequence, not of the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.
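
The idea of using 5′ and 3′ tag data to define gene borders for full-length reads can be illustrated with a minimal Python sketch. This is not the package's algorithm (TranscriptomeReconstructoR is an R package, and its internals are not described in the abstract); the function names `nearest` and `refine_borders`, the `max_dist` tolerance, and the toy coordinates are all assumptions made for illustration, and only the forward strand is handled.

```python
from bisect import bisect_left

def nearest(sorted_positions, pos, max_dist):
    """Return the entry of sorted_positions closest to pos, or None if it is farther than max_dist."""
    if not sorted_positions:
        return None
    i = bisect_left(sorted_positions, pos)
    # Only the neighbours on either side of the insertion point can be the closest
    candidates = sorted_positions[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda p: abs(p - pos))
    return best if abs(best - pos) <= max_dist else None

def refine_borders(reads, tss_peaks, pas_peaks, max_dist=50):
    """Snap (start, end) long reads to nearby 5' (TSS) and 3' (PAS) tag peaks.

    Hypothetical rule: a read is kept only if both of its ends fall within
    max_dist of a peak from the independent tag datasets.
    """
    tss_sorted = sorted(tss_peaks)
    pas_sorted = sorted(pas_peaks)
    refined = set()
    for start, end in reads:
        tss = nearest(tss_sorted, start, max_dist)
        pas = nearest(pas_sorted, end, max_dist)
        if tss is not None and pas is not None:
            refined.add((tss, pas))
    return sorted(refined)

# Toy data: two reads agree with the peaks, two have one unsupported end
reads = [(105, 990), (98, 1004), (400, 995), (102, 1500)]
print(refine_borders(reads, tss_peaks=[100], pas_peaks=[1000]))
```

In this toy case the two reads with ragged but supported ends collapse onto a single transcript model, while the reads with an unsupported 5′ or 3′ end are discarded.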

2020 ◽  
Author(s):  
Maxim Ivanov ◽  
Albin Sandelin ◽  
Sebastian Marquardt

Abstract Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data. Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5' and 3' tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. We reconstructed de novo the transcriptional landscape of wild-type Arabidopsis thaliana seedlings as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A. thaliana genome research. Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline requires prior knowledge only of the reference genomic DNA sequence, not of the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.




2014 ◽  
Vol 2014 ◽  
pp. 1-8
Author(s):  
Momchilo Vuyisich ◽  
Ayesha Arefin ◽  
Karen Davenport ◽  
Shihai Feng ◽  
Cheryl Gleasner ◽  
...  

Sequencing bacterial genomes has traditionally required large amounts of genomic DNA (~1 μg). There have been few studies to determine the effects of the input DNA amount or library preparation method on the quality of sequencing data. Several new commercially available library preparation methods enable shotgun sequencing from as little as 1 ng of input DNA. In this study, we evaluated the NEBNext Ultra library preparation reagents for sequencing bacterial genomes. We have evaluated the utility of NEBNext Ultra for resequencing and de novo assembly of four bacterial genomes and compared its performance with the TruSeq library preparation kit. The NEBNext Ultra reagents enable high quality resequencing and de novo assembly of a variety of bacterial genomes when using 100 ng of input genomic DNA. For the two most challenging genomes (Burkholderia spp.), which have the highest GC content and are the longest, we also show that the quality of both resequencing and de novo assembly is not decreased when only 10 ng of input genomic DNA is used.


2016 ◽  
Author(s):  
Ying Wang ◽  
Kun Liu ◽  
De Bi ◽  
Biao Shou Zhou ◽  
Wen Jian Shao

Background. Resurrection plants constitute a unique cadre within angiosperms. Boea clarkeana Hemsl. (Boea, Gesneriaceae) is a desiccation-tolerant dicotyledonous herb that is endemic to China. Although research on angiosperms with desiccation tolerance (DT) could be instructive for crops, genomic resources for B. clarkeana remain scarce. In addition, transcriptome sequencing could be an effective way to study desiccation-tolerant plants. Methods. In the present study, we used the Illumina HiSeq™ 2000 platform and de novo assembly technology to obtain leaf transcriptomes of B. clarkeana and conducted a BLASTX alignment of the sequencing data against protein databases for sequence classification and annotation. Then, based on the sequence information obtained, we developed EST-SSR markers by means of EST-SSR mining, primer design and polymorphism identification. Results. A total of 91,449 unigenes were generated from the leaf cDNA library of B. clarkeana in this study. Based on a sequence similarity search with a known protein database, 72,087 unigenes were annotated. Among the annotated unigenes, a total of 71,170 unigenes showed significant similarity to known proteins of 463 popular model species in the Nr database, and 59,962 unigenes and 32,336 unigenes were assigned to GO classifications and COG, respectively. In addition, 44,924 unigenes were mapped in 128 KEGG pathways. Furthermore, a total of 7,610 unigenes with 8,563 microsatellites were found. Seventy-four primer pairs were selected from 436 primer pairs designed for polymorphism validation. SSRs with higher polymorphism rates were concentrated on dinucleotides, pentanucleotides and hexanucleotides. Finally, 17 pairs with highly polymorphic and stable loci were selected for polymorphism screening. There were a total of 65 alleles, with 2–6 alleles at each locus. Mainly due to the unique biological characteristics of plants, the HE, HO and PIC per locus were very low, ranging from 0 to 0.196, 0.082 to 0.14 and 0 to 0.155, respectively. Discussion. A substantial number of transcriptome sequences of B. clarkeana were generated in this study, which is the first molecular-level analysis of this plant. These sequences are valuable resources for gene annotation and discovery and molecular marker development. These sequences could also provide a valuable basis for the future molecular study of B. clarkeana.
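
The EST-SSR mining step mentioned in the Methods amounts to scanning each unigene for short motifs repeated in tandem. The Python sketch below shows one common way to do this with regular expressions; it is an illustration only, since the abstract does not name the tool or thresholds that were used. The `MIN_REPEATS` cut-offs, the function name `find_ssrs`, and the example sequence are assumptions.

```python
import re

# Hypothetical minimum repeat counts per motif length (not taken from the study)
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect microsatellites in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, n in min_repeats.items():
        # One motif of unit_len bases followed by at least n - 1 further copies of itself
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit (e.g. "ATAT" = "AT" x 2)
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(hits)

# Toy sequence containing an (AT)8 and a (GAA)5 repeat
example = "GGC" + "AT" * 8 + "CCGT" + "GAA" * 5 + "TT"
print(find_ssrs(example))
```

Primers would then be designed in the sequence flanking each reported repeat; that step is outside the scope of this sketch.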




2022 ◽  
Vol 12 ◽  
Author(s):  
Bingdong Liu ◽  
Liujing Huang ◽  
Zhihong Liu ◽  
Xiaohan Pan ◽  
Zongbing Cui ◽  
...  

Advances in next-generation sequencing (NGS) have revolutionized microbial studies in many fields, especially clinical investigation. As the "second human genome", the microbiota offers a new perspective for understanding the biological and pathological basis of various diseases. However, the massive amounts of sequencing data remain a major challenge for researchers, especially those unfamiliar with microbial data analysis. Mathematical algorithms and approaches introduced from other scientific fields bring a bewildering array of computational tools and require a higher level of scripting experience. Moreover, large cohort studies with extensive metadata, including age, body mass index (BMI), gender, medical results, and other subject-related information, further aggravate this situation. Thus, it is necessary to develop efficient and convenient software for clinical microbiome data analysis. The EasyMicroPlot (EMP) package aims to provide an easy-to-use microbial analysis tool on the R platform that accomplishes the core tasks of downstream metagenomic analysis and incorporates the analysis and visualization methods popular in clinical microbial studies. To illustrate how EMP works, 694 bio-samples from the Guangdong Gut Microbiome Project (GGMP) were selected and analyzed with the EMP package. Our analysis demonstrated the influence of dietary style on gut microbiota and showed that the EMP package is a powerful and convenient way to address problems in this field.


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3422 ◽  
Author(s):  
Ying Wang ◽  
Kun Liu ◽  
De Bi ◽  
Shoubiao Zhou ◽  
Jianwen Shao

Background Desiccation-tolerant (DT) plants can recover full metabolic competence upon rehydration after losing most of their cellular water (>95%) for extended periods of time. Functional genomic approaches such as transcriptome sequencing can help us understand how DT plants survive and respond to dehydration, which has great significance for plant biology and for improving the drought tolerance of crops. Boea clarkeana Hemsl. (Gesneriaceae) is a DT dicotyledonous herb. Its genomic sequence characteristics remain unknown. Based on transcriptomic analyses, polymorphic EST-SSR (simple sequence repeats in expressed sequence tags) molecular primers can be designed, which will greatly facilitate further investigations of the population genetics and demographic histories of DT plants. Methods In the present study, we used the Illumina HiSeq™ 2000 platform and de novo assembly technology to obtain leaf transcriptomes of B. clarkeana and conducted a BLASTX alignment of the sequencing data against protein databases for sequence classification and annotation. Then, based on the sequence information, the EST-SSR markers were developed, and the functional annotation of ESTs containing polymorphic SSRs was obtained through BLASTX. Results A total of 91,449 unigenes were generated from the leaf cDNA library of B. clarkeana. Based on a sequence similarity search with a known protein database, 72,087 unigenes were annotated. Among the annotated unigenes, a total of 71,170 unigenes showed significant similarity to the known proteins of 463 popular model species in the Nr database, and 59,962 unigenes and 32,336 unigenes were assigned to Gene Ontology (GO) classifications and Cluster of Orthologous Groups (COG), respectively. In addition, 44,924 unigenes were mapped in 128 KEGG pathways. Furthermore, a total of 7,610 unigenes with 8,563 microsatellites were found. Seventy-four primer pairs were selected from 436 primer pairs designed for polymorphism validation. SSRs with higher polymorphism rates were concentrated on dinucleotides, pentanucleotides and hexanucleotides. Finally, 17 pairs with stable, highly polymorphic loci were selected for polymorphism screening. There was a total of 65 alleles, with 2–6 alleles at each locus. Primarily due to the unique biological characteristics of plants, the HE (0–0.196), HO (0.082–0.14) and PIC (0–0.155) per locus were very low. The functional annotation distribution centered on ESTs containing di- and tri-nucleotide SSRs, and the ESTs containing primers BC2, BC4 and BC12 were annotated to vegetative dehydration/desiccation pathways. Discussion This work is the first genetic study of B. clarkeana as a new plant resource of DT genes. A substantial number of transcriptome sequences were generated in this study. These sequences are valuable resources for gene annotation and discovery as well as molecular marker development. These sequences could also provide a valuable basis for future molecular studies of B. clarkeana.
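
The per-locus statistics reported above (HE, HO and PIC) have standard textbook definitions, sketched below in Python. The abstract does not state which software or estimator the authors used, so this is an assumption: HE is taken as 1 − Σp², PIC follows the Botstein et al. formula, and the function names and the toy genotypes are hypothetical.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Allele frequencies from a list of diploid genotypes given as (allele_a, allele_b) pairs."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return [c / total for c in counts.values()]

def observed_heterozygosity(genotypes):
    """HO: fraction of individuals carrying two different alleles."""
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)

def expected_heterozygosity(freqs):
    """HE = 1 - sum(p_i^2), the heterozygosity expected under random mating."""
    return 1.0 - sum(p * p for p in freqs)

def pic(freqs):
    """PIC = 1 - sum(p_i^2) - sum over i < j of 2 * p_i^2 * p_j^2."""
    cross = sum(2 * freqs[i] ** 2 * freqs[j] ** 2
                for i in range(len(freqs))
                for j in range(i + 1, len(freqs)))
    return expected_heterozygosity(freqs) - cross

# Toy locus: ten individuals, one rare allele "B" carried by two heterozygotes
genotypes = [("A", "A")] * 8 + [("A", "B")] * 2
freqs = allele_frequencies(genotypes)
print(observed_heterozygosity(genotypes), expected_heterozygosity(freqs), pic(freqs))
```

A locus dominated by one allele, as in this toy example, gives low values for all three statistics, which is the pattern the abstract describes.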


2021 ◽  
Vol 15 (1) ◽  
Author(s):  
Zeeshan Ahmed ◽  
Eduard Gibert Renart ◽  
Saman Zeeshan ◽  
XinQi Dong

Abstract Background Genetic disposition is considered critical for identifying subjects at high risk for disease development. Investigating disease-causing genes and genes with high or low expression can support finding the root causes of uncertainties in patient care. However, independent and timely high-throughput next-generation sequencing data analysis is still a challenge for non-computational biologists and geneticists. Results In this manuscript, we present a findable, accessible, interactive, and reusable (FAIR) bioinformatics platform, i.e., GVViZ (visualizing genes with disease-causing variants). GVViZ is a user-friendly, cross-platform database application for RNA-seq-driven variable and complex gene-disease data annotation and expression analysis with a dynamic heat map visualization. GVViZ has the potential to find patterns across millions of features and extract actionable information, which can support the early detection of complex disorders and the development of new therapies for personalized patient care. The execution of GVViZ is based on a set of simple instructions that users without a computational background can follow to design and perform customized data analysis. It can assimilate patients’ transcriptomics data with public, proprietary, and our in-house developed gene-disease databases to query, easily explore, and access information on gene annotation and classified disease phenotypes with greater visibility and customization. To test its performance and understand the clinical and scientific impact of GVViZ, we present GVViZ analysis for different chronic diseases and conditions, including Alzheimer’s disease, arthritis, asthma, diabetes mellitus, heart failure, hypertension, obesity, osteoporosis, and multiple cancer disorders. The results are visualized using GVViZ and can be exported as image (PNG/TIFF) and text (CSV) files that include gene names, Ensembl (ENSG) IDs, quantified abundances, expressed transcript lengths, and annotated oncology and non-oncology diseases. Conclusions We emphasize that automated and interactive visualization should be an indispensable component of modern RNA-seq analysis, which is currently not the case. However, experts in clinics and researchers in life sciences can use GVViZ to visualize and interpret the transcriptomics data, making it a powerful tool to study the dynamics of gene expression and regulation. Furthermore, with successful deployment in clinical settings, GVViZ has the potential to enable high-throughput correlations between patient diagnoses based on clinical and transcriptomics data.


2021 ◽  
Vol 17 (11) ◽  
pp. e1009631
Author(s):  
Raquel Linheiro ◽  
John Archer

With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn-based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of the underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. As a demonstration of CStone's ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species: Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes, the latter two being high-quality, well-established de novo assemblers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs, representing two adult D. melanogaster whole-body samples were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStone's assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools. Within a related side study, we explore the effects that chimeras within reference sets have on the identification of differentially expressed genes. CStone is available at: https://sourceforge.net/projects/cstone/.
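
The link between graph structure and contig ambiguity can be shown with a small Python sketch of a de Bruijn graph walk. It is a simplified illustration, not CStone's implementation: CStone assigns one of three levels whose criteria are not given in the abstract, whereas this sketch uses only two hypothetical labels ("unambiguous" and "ambiguous"). The function names, the value of `k`, and the toy reads are assumptions.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in at least one read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def label_contig(graph, start):
    """Extend a contig from start along the unique path and report whether a branch stopped it."""
    # Incoming-edge counts let the walk detect merges as well as forks
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    contig, node, seen = start, start, {start}
    while True:
        successors = graph.get(node, set())
        if not successors:
            return contig, "unambiguous"   # clean dead end, no alternative path met
        if len(successors) > 1:
            return contig, "ambiguous"     # fork: more than one way to continue
        nxt = next(iter(successors))
        if indegree[nxt] > 1 or nxt in seen:
            return contig, "ambiguous"     # merge with another path, or a cycle
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt

# One transcript gives a single path; a second, diverging transcript creates a fork
single = build_de_bruijn(["ATGGCGT"], k=4)
forked = build_de_bruijn(["ATGGCGT", "ATGGCAA"], k=4)
print(label_contig(single, "ATG"))
print(label_contig(forked, "ATG"))
```

A contig that ends at a fork is one where an assembler has to choose between alternative continuations, which is where chimeric joins can arise; labelling it makes that uncertainty visible downstream.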


2021 ◽  
Author(s):  
Daniel Osorio ◽  
Marieke Lydia Kuijjer ◽  
James J. Cai

Motivation: Characterizing cells with rare molecular phenotypes is one of the promises of high-throughput single-cell RNA sequencing (scRNA-seq) techniques. However, collecting enough cells with the desired molecular phenotype in a single experiment is challenging, requiring several sample preprocessing steps to filter and collect the desired cells experimentally before sequencing. Data integration of multiple public single-cell experiments stands as a solution to this problem, allowing the collection of enough cells exhibiting the desired molecular signatures. By increasing the sample size of the desired cell type, this approach enables a robust cell type transcriptome characterization. Results: Here, we introduce rPanglaoDB, an R package to download and merge the uniformly processed and annotated scRNA-seq data provided by the PanglaoDB database. To show the potential of rPanglaoDB for collecting rare cell types by integrating multiple public datasets, we present a biological application collecting and characterizing a set of 157 fibrocytes. Fibrocytes are a rare monocyte-derived cell type that exhibits both the inflammatory features of macrophages and the tissue remodeling properties of fibroblasts. This constitutes the first unbiased transcriptome profile report for fibrocytes. We compared the transcriptomic profile of the fibrocytes against the fibroblasts collected from the same tissue samples and confirmed their association with healing processes in tissue damage and infection through the activation of the prostaglandin biosynthesis and regulation pathway. Availability and Implementation: rPanglaoDB is implemented as an R package available through the CRAN repositories at https://CRAN.R-project.org/package=rPanglaoDB.

