Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes

2018
Author(s):
Lisa K. Johnson
Harriet Alexander
C. Titus Brown

Abstract
Background: De novo transcriptome assemblies are required prior to analyzing RNAseq data from a species without an existing reference genome or transcriptome. Despite the prevalence of transcriptomic studies, the effects of using different workflows, or “pipelines”, on the resulting assemblies are poorly understood. Here, a pipeline was programmatically automated and used to assemble and annotate raw transcriptomic short read data collected by the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP). The resulting transcriptome assemblies were evaluated and compared against assemblies that were previously generated with a different pipeline developed by the National Center for Genome Resources (NCGR).
Results: New transcriptome assemblies contained the majority of previous contigs as well as new content. On average, 7.8% of the annotated contigs in the new assemblies were novel gene names not found in the previous assemblies. Taxonomic trends were observed in the assembly metrics, with assemblies from the Dinoflagellata and Ciliophora phyla showing a higher percentage of open reading frames and number of contigs than transcriptomes from other phyla.
Conclusions: Given current bioinformatics approaches, there is no single ‘best’ reference transcriptome for a particular set of raw data. As the optimum transcriptome is a moving target, improving (or not) with new tools and approaches, automated and programmable pipelines are invaluable for managing the computationally intensive tasks required for re-processing large sets of samples with revised pipelines and ensuring a common evaluation workflow is applied to all samples. Thus, re-assembling existing data with new tools using automated and programmable pipelines may yield more accurate identification of taxon-specific trends across samples in addition to novel and useful products for the community.
Key Points: Re-assembly with new tools can yield new results. Automated and programmable pipelines can be used to process arbitrarily many samples. Analyzing many samples using a common pipeline identifies taxon-specific trends.


2017
Author(s):
Jonathan Schmitz
Kristian Ullrich
Erich Bornberg-Bauer

Abstract A recent surge of studies suggested that many novel genes arise de novo from previously non-coding DNA and not by duplication. However, since most studies concentrated on longer evolutionary time scales and rarely considered protein structural properties, it remains unclear how these properties are shaped by evolution, depend on genetic mechanisms and influence gene survival. Here we compare open reading frames (ORFs) from high coverage transcriptomes from mouse and another four mammals covering 160 million years of evolution. We find that novel ORFs pervasively emerge from intergenic and intronic regions but are rapidly lost again while relatively fewer arise from duplications but are retained over much longer times. Surprisingly, disorder and other protein properties of young ORFs do not change with gene age. Only length and nucleotide composition change, probably to avoid aggregation. Thus de novo genes resemble frozen accidents of randomly emerged ORFs which survived initial purging, likely because they are functional.



2019
Vol 109 (2)
pp. 222-224
Author(s):
Margarita Gomila
Eduardo Moralejo
Antonio Busquets
Guillem Segui
Diego Olmo
...  

Xylella fastidiosa is a plant-pathogenic bacterium that causes serious diseases in many crops of economic importance and is a quarantine organism in the European Union. This study reports a de novo-assembled draft genome sequence of the first isolates causing Pierce’s disease in Europe: X. fastidiosa subsp. fastidiosa strains XYL1732/17 and XYL2055/17. Both strains were isolated from grapevines (Vitis vinifera) showing Pierce’s disease symptoms at two different locations in Mallorca, Spain. The XYL1732/17 genome is 2,444,109 bp long, with a G+C content of 51.5%; it contains 2,359 open reading frames and 48 tRNA genes. The XYL2055/17 genome is 2,456,780 bp long, with a G+C content of 51.5%; it contains 2,384 open reading frames and 48 tRNA genes.



2020
Vol 12 (11)
pp. 2183-2195
Author(s):  
Daniel Dowling ◽  
Jonathan F Schmitz ◽  
Erich Bornberg-Bauer

Abstract In addition to known genes, much of the human genome is transcribed into RNA. Chance formation of novel open reading frames (ORFs) can lead to the translation of myriad new proteins. Some of these ORFs may yield advantageous adaptive de novo proteins. However, widespread translation of noncoding DNA can also produce hazardous protein molecules, which can misfold and/or form toxic aggregates. The dynamics of how de novo proteins emerge from potentially toxic raw materials and what influences their long-term survival are unknown. Here, using transcriptomic data from human and five other primates, we generate a set of transcribed human ORFs at six conservation levels to investigate which properties influence the early emergence and long-term retention of these expressed ORFs. As these taxa diverged from each other relatively recently, we present a fine scale view of the evolution of novel sequences over recent evolutionary time. We find that novel human-restricted ORFs are preferentially located on GC-rich gene-dense chromosomes, suggesting their retention is linked to pre-existing genes. Sequence properties such as intrinsic structural disorder and aggregation propensity—which have been proposed to play a role in survival of de novo genes—remain unchanged over time. Even very young sequences code for proteins with low aggregation propensities, suggesting that genomic regions with many novel transcribed ORFs are concomitantly less likely to produce ORFs which code for harmful toxic proteins. Our data indicate that the survival of these novel ORFs is largely stochastic rather than shaped by selection.



2006
Vol 188 (17)
pp. 6261-6268
Author(s):
Jonathon P. Audia
Herbert H. Winkler

ABSTRACT The obligate intracytoplasmic pathogen Rickettsia prowazekii relies on the transport of many essential compounds from the cytoplasm of the eukaryotic host cell in lieu of de novo synthesis, an evolutionary outcome undoubtedly linked to obligatory growth in this metabolite-replete niche. The paradigm for the study of rickettsial transport systems is the ATP/ADP translocase Tlc1, which exchanges bacterial ADP for host cell ATP as a source of energy, rather than as a source of adenylate. Interestingly, the R. prowazekii genome encodes four open reading frames that are highly homologous to the well-characterized ATP/ADP translocase Tlc1. Therefore, by annotation, the R. prowazekii genome encodes a total of five ATP/ADP translocases: Tlc1, Tlc2, Tlc3, Tlc4, and Tlc5. We have confirmed by quantitative reverse transcriptase PCR that mRNAs corresponding to all five tlc homologues are expressed in R. prowazekii growing in L-929 cells and have shown their heterologous protein expression in Escherichia coli, suggesting that none of the tlc genes are pseudogenes in the process of evolutionary meltdown. However, we demonstrate by heterologous expression in E. coli that only Tlc1 functions as an ATP/ADP transporter. A survey of nucleotides and nucleosides has determined that Tlc4 transports CTP, UTP, and GDP. Intriguingly, although GTP was not transported by Tlc4, it was an inhibitor of CTP and UTP uptake and demonstrated a Ki similar to that of GDP. In addition, we demonstrate that Tlc5 transports GTP and GDP. We postulate that Tlc4 and Tlc5 serve the primary function of maintaining intracellular pools of nucleotides for rickettsial nucleic acid biosynthesis and do not provide the cell with nucleoside triphosphates as an energy source, as is the case for Tlc1. Although heterologous expression of Tlc2 and Tlc3 was observed in E. coli, we were unable to identify substrates for these proteins.



2015
Author(s):
Lorenzo Calviello
Neelanjan Mukherjee
Emanuel Wyler
Henrik Zauber
Antje Hirsekorn
...  

RNA sequencing protocols allow for quantifying gene expression regulation at each individual step, from transcription to protein synthesis. Ribosome Profiling (Ribo-seq) maps the positions of translating ribosomes over the entire transcriptome. Despite its great potential, a rigorous statistical approach to identify translated regions by means of the characteristic three-nucleotide periodicity of Ribo-seq data is not yet available. To fill this gap, we developed RiboTaper, which quantifies the significance of periodic Ribo-seq reads via spectral analysis methods. We applied RiboTaper on newly generated, deep Ribo-seq data in HEK293 cells, to derive an extensive map of translation that covers Open Reading Frame (ORF) annotations for more than 11,000 protein-coding genes. We also find distinct ribosomal signatures for several hundred detected upstream ORFs and ORFs in annotated non-coding genes (ncORFs). Mass spectrometry data confirms that RiboTaper achieves excellent coverage of the cellular proteome and validates dozens of novel peptide products. Collectively, RiboTaper (available at https://ohlerlab.mdc-berlin.de/software/ ) is a powerful method for comprehensive de novo identification of actively used ORFs in the human genome.
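To make the idea of three-nucleotide periodicity concrete, the following is a minimal sketch and not RiboTaper's actual method: it replaces the spectral approach named in the abstract with a plain discrete Fourier transform evaluated at frequency 1/3. The function name `periodicity_power` and the toy read counts are hypothetical, introduced only for illustration.

```python
import cmath
import math


def periodicity_power(counts, period=3):
    """Return the fraction of non-constant signal power at frequency 1/period.

    `counts` is a hypothetical per-nucleotide read-count profile whose length
    is a multiple of `period`. This is a simple DFT, an assumption made for
    illustration; it is not the statistical test used by RiboTaper.
    """
    n = len(counts)
    if n == 0 or n % period != 0:
        raise ValueError("length must be a positive multiple of the period")

    # Remove the mean so that overall coverage does not count as signal.
    mean = sum(counts) / n
    centered = [c - mean for c in counts]
    total = sum(x * x for x in centered)
    if total == 0:
        return 0.0  # flat profile: no periodic component at all

    # DFT coefficient at the bin corresponding to one cycle per `period` nt.
    k = n // period
    coeff = sum(
        x * cmath.exp(-2j * math.pi * k * t / n)
        for t, x in enumerate(centered)
    )

    # For a real signal, bins k and n - k carry equal power (Parseval).
    return 2 * abs(coeff) ** 2 / (n * total)


# Toy profiles: reads on every third nucleotide versus every second one.
in_frame = [10, 0, 0] * 4
off_period = [1, 0] * 6
print(periodicity_power(in_frame), periodicity_power(off_period))
```

A value near 1 indicates that almost all variation in the profile follows a three-nucleotide rhythm, as expected for actively translated regions; a value near 0 indicates no such rhythm. A real analysis would also need a significance test, which this sketch omits.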



Plants
2020
Vol 9 (4)
pp. 469
Author(s):
Denis O. Omelchenko
Maxim S. Makarenko
Artem S. Kasianov
Mikhail I. Schelkunov
Maria D. Logacheva
...  

Shepherd’s purse (Capsella bursa-pastoris) is a cosmopolitan annual weed and a promising model plant for studying allopolyploidization in the evolution of angiosperms. Though plant mitochondrial genomes are a valuable source of genetic information, they are hard to assemble. At present, only the complete mitogenome of C. rubella is available out of all species of the genus Capsella. In this work, we have assembled the complete mitogenome of C. bursa-pastoris using high-precision PacBio SMRT third-generation sequencing technology. It is 287,799 bp long and contains 32 protein-coding genes, 3 rRNAs, 25 tRNAs corresponding to 15 amino acids, and 8 open reading frames (ORFs) supported by RNAseq data. Though many repeat regions have been found, none of them is longer than 1 kbp, and the most frequent structural variant originated from these repeats is present in only 4% of the mitogenome copies. The mitochondrial DNA sequence of C. bursa-pastoris differs from C. rubella, but not from C. orientalis, by two long inversions, suggesting that C. orientalis could be its maternal progenitor species. In total, 377 C to U RNA editing sites have been detected. All genes except cox1 and atp8 contain RNA editing sites, and most of them lead to non-synonymous changes of amino acids. Most of the identified RNA editing sites are identical to corresponding RNA editing sites in A. thaliana.



2021
Vol 22 (11)
pp. 5476
Author(s):
Bing Wang
Zhiwei Wang
Ni Pan
Jiangmei Huang
Cuihong Wan

Small open reading frames (sORFs) have translational potential to produce peptides that play essential roles in various biological processes. Nevertheless, many sORF-encoded peptides (SEPs) are still on the prediction level. Here, we construct a strategy to analyze SEPs by combining top-down and de novo sequencing to improve SEP identification and sequence coverage. With de novo sequencing, we identified 1682 peptides mapping to 2544 human sORFs, which were all first characterized in this work. Two-thirds of these new sORFs have reading frame shifts and use a non-ATG start codon. The top-down approach identified 241 human SEPs, with high sequence coverage. The average length of the peptides from the bottom-up database search was 19 amino acids (AA); from de novo sequencing, it was 9 AA; and from the top-down approach, it was 25 AA. The longer peptide positively boosts the sequence coverage, more efficiently distinguishing SEPs from the known gene coding sequence. Top-down has the advantage of identifying peptides with sequential K/R or high K/R content, which is unfavorable in the bottom-up approach. Our method can explore new coding sORFs and obtain highly accurate sequences of their SEPs, which can also benefit future function research.



2014
Author(s):  
John Stewart Taylor

In 2009 Knowles and McLysaght reported the discovery of three human genes derived from non-coding DNA. They provided evidence that these genes, CLUU1, C22orf45, and DNAH10OS, were transcribed and translated, they identified orthologous non-coding DNA in chimpanzee (Pan troglodytes) and macaque (Macaca mulatta), and for each gene they located the critical “enabler” mutations that extended the open reading frames (ORFs) allowing the production of a protein. These genes had no BLASTp hits in any other genome and were considered to be novel human genes, possibly responsible for human-specific traits. Since the discovery of these genes, new high quality Denisovan and Neanderthal genomes have been reported. I used these resources in an effort to determine whether or not CLUU1, C22orf45, and DNAH10OS were truly human-specific.



2019
Vol 37 (4)
pp. 1165-1178
Author(s):
Paco Majic
Joshua L Payne

Abstract Regulatory networks control the spatiotemporal gene expression patterns that give rise to and define the individual cell types of multicellular organisms. In eumetazoa, distal regulatory elements called enhancers play a key role in determining the structure of such networks, particularly the wiring diagram of “who regulates whom.” Mutations that affect enhancer activity can therefore rewire regulatory networks, potentially causing adaptive changes in gene expression. Here, we use whole-tissue and single-cell transcriptomic and chromatin accessibility data from mouse to show that enhancers play an additional role in the evolution of regulatory networks: They facilitate network growth by creating transcriptionally active regions of open chromatin that are conducive to de novo gene evolution. Specifically, our comparative transcriptomic analysis with three other mammalian species shows that young, mouse-specific intergenic open reading frames are preferentially located near enhancers, whereas older open reading frames are not. Mouse-specific intergenic open reading frames that are proximal to enhancers are more highly and stably transcribed than those that are not proximal to enhancers or promoters, and they are transcribed in a limited diversity of cellular contexts. Furthermore, we report several instances of mouse-specific intergenic open reading frames proximal to promoters showing evidence of being repurposed enhancers. We also show that open reading frames gradually acquire interactions with enhancers over macroevolutionary timescales, helping integrate genes—those that have arisen de novo or by other means—into existing regulatory networks. Taken together, our results highlight a dual role of enhancers in expanding and rewiring gene regulatory networks.



2017
Vol 5 (6)
Author(s):
Christopher Van Horn
Chung-Jan Chang
Jianchi Chen

ABSTRACT This study reports a de novo-assembled draft genome sequence of Xylella fastidiosa subsp. multiplex strain BB01 causing blueberry bacterial leaf scorch in Georgia, USA. The BB01 genome is 2,517,579 bp, with a G+C content of 51.8%, 2,943 open reading frames (ORFs), and 48 RNA genes.


