Integrated modeling of protein-coding genes in the Manduca sexta genome using RNA-seq data from the biochemical model insect

Author(s):  
Xiaolong Cao
2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Mikhail Pomaznoy ◽  
Ashu Sethi ◽  
Jason Greenbaum ◽  
Bjoern Peters

Abstract RNA-seq methods are widely utilized for transcriptomic profiling of biological samples. However, there are known caveats of this technology which can skew the gene expression estimates. Specifically, if the library preparation protocol does not retain RNA strand information, then some genes can be erroneously quantitated. Although strand-specific protocols have been established, a significant portion of RNA-seq data is generated in a non-strand-specific manner. We used a comprehensive stranded RNA-seq dataset of 15 blood cell types to identify genes for which expression would be erroneously estimated if strand information was not available. We found that about 10% of all genes and 2.5% of protein coding genes have a two-fold or higher difference in estimated expression when strand information of the reads was ignored. We used parameters of read alignments of these genes to construct a machine learning model that can identify which genes in an unstranded dataset might have incorrect expression estimates and which ones do not. We also show that differential expression analysis of genes with biased expression estimates in unstranded read data can be recovered by limiting the reads considered to those which span exonic boundaries. The resulting approach is implemented as a package available at https://github.com/mikpom/uslcount.
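
The two-fold criterion in the abstract above can be illustrated with a minimal sketch. It assumes per-gene expression estimates are already available from the same library quantified twice, once using strand information and once ignoring it. The function name `strand_bias_flags`, the pseudocount, and the example counts are hypothetical and are not taken from the uslcount package.

```python
import math

def strand_bias_flags(stranded, unstranded, fold=2.0, pseudocount=1.0):
    """Return gene -> True when the unstranded estimate differs from the
    stranded estimate by at least `fold` in either direction.

    `stranded` and `unstranded` map gene IDs to expression estimates.
    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    threshold = math.log2(fold)
    flags = {}
    for gene, s in stranded.items():
        u = unstranded.get(gene, 0.0)
        log_ratio = math.log2((u + pseudocount) / (s + pseudocount))
        flags[gene] = abs(log_ratio) >= threshold
    return flags

# Hypothetical counts, invented for illustration only.
stranded = {"geneA": 100.0, "geneB": 10.0, "geneC": 50.0}
unstranded = {"geneA": 105.0, "geneB": 45.0, "geneC": 55.0}
flags = strand_bias_flags(stranded, unstranded)
```

In this toy input only `geneB` is flagged, because reads from an overlapping antisense feature would inflate its unstranded estimate more than two-fold.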


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Lars Gabriel ◽  
Katharina J. Hoff ◽  
Tomáš Brůna ◽  
Mark Borodovsky ◽  
Mario Stanke

Abstract Background BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently and to pick the one output that is likely better. Therefore, one or the other type of extrinsic evidence would remain unexploited. Results We present TSEBRA, software that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. Conclusion TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.
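
The idea of comparing overlapping transcripts by their evidence support can be sketched as follows. This is a simplified illustration, not TSEBRA's actual rule set: the single `support` score, the `margin` parameter, and all names below are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transcript:
    name: str
    start: int      # inclusive genomic start
    end: int        # inclusive genomic end
    support: float  # hypothetical score, e.g. fraction of introns backed by evidence

def overlaps(a, b):
    """True when the two inclusive intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end

def select_transcripts(candidates, margin=0.0):
    """Drop a transcript when an overlapping one has clearly better support."""
    rejected = set()
    for a in candidates:
        for b in candidates:
            if a is not b and overlaps(a, b) and a.support + margin < b.support:
                rejected.add(a.name)
    return [t for t in candidates if t.name not in rejected]

# Hypothetical predictions pooled from two gene finders.
pooled = [
    Transcript("b1_t1", 100, 500, 0.9),
    Transcript("b2_t1", 300, 700, 0.4),
    Transcript("b2_t2", 1000, 1500, 0.2),
]
kept = select_transcripts(pooled)
```

Here `b2_t1` is dropped because it overlaps the better-supported `b1_t1`, while `b2_t2` is kept despite its low score because nothing overlaps it.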


Blood ◽  
2016 ◽  
Vol 128 (22) ◽  
pp. 2705-2705 ◽  
Author(s):  
Lara Rizzotto ◽  
Arianna Bottoni ◽  
Tzung-Huei Lai ◽  
Chaomei Liu ◽  
Pearlly S Yan ◽  
...  

Abstract Chronic lymphocytic leukemia (CLL) follows a variable clinical course mostly dependent upon genomic factors, with a subset of patients having low risk disease and others displaying rapid progression associated with clonal evolution. Epigenetic mechanisms such as DNA promoter hypermethylation were shown to have a role in CLL evolution where the acquisition of increasingly heterogeneous DNA methylation patterns occurred in conjunction with clonal evolution of genetic aberrations and was associated with disease progression. However, the role of epigenetic mechanisms regulated by the histone deacetylase group of transcriptional repressors in the progression of CLL has not been well characterized. The histone deacetylases (HDACs) 1 and 2 are recruited onto gene promoters and form a complex with the histone demethylase KDM1. Once recruited, the complex mediates the removal of acetyl groups from specific lysines on histones (H3K9 and H3K14) thus triggering the demethylation of lysine 4 (H3K4me3) and the silencing of gene expression. CLL is characterized by the dysregulation of numerous coding and non coding genes, many of which have key roles in regulating the survival or progression of CLL. For instance, our group showed that the levels of HDAC1 were elevated in high risk as compared to low risk CLL or normal lymphocytes and this over-expression was responsible for the silencing of miR-106b, miR-15, miR-16, and miR-29b which affected CLL survival by modulating the expression of key anti-apoptotic proteins Bcl-2 and Mcl-1. To characterize the HDAC-repressed gene signature in high risk CLL, we conducted chromatin immunoprecipitation (ChIP) of the nuclear lysates from 3 high risk and 3 low risk CLL patients using antibodies against HDAC1, HDAC2 and KDM1 or non-specific IgG, sequenced and aligned the eluted DNA to a reference genome and determined the binding of HDAC1, HDAC2 and KDM1 at the promoters for all protein coding and microRNA genes. 
Preliminary results from this ChIP-seq showed a strong recruitment of HDAC1, HDAC2 and KDM1 to the promoters of several microRNA as well as protein coding genes in high risk CLL. To further corroborate these data we performed ChIP-Seq in the same 6 CLL samples to analyze the levels of H3K4me2 and H3K4me3 around gene promoters before and after 6h exposure to the HDACi panobinostat. Our goal was to demonstrate that HDAC inhibition elicited an increase in the levels of acetylation on histones and triggered the accrual of H3K4me2 at the repressed promoter, events likely to facilitate the recruitment of RNA polymerase II to this promoter. Initial analysis confirmed a robust accumulation of H3K4me2 and H3K4me3 marks at the gene promoters of representative genes that recruited HDAC1 and its co-repressors in the previous ChIP-Seq analysis in high risk CLL patients. Finally, 5 aggressive CLL samples were treated with the HDACi abexinostat for 48h and RNA before and after treatment was subjected to RNA-seq for small and large RNA to confirm that the regions of chromatin uncoiled by HDACi treatment were actively transcribed. HDAC inhibition induced the expression of a large number of miRNA genes as well as key protein coding genes, such as miR-29b, miR-210, miR-182, miR-183, miR-95, miR-940, FOXO3, EBF1 and BCL2L11. Of note, some of the predicted or validated targets of the induced miRNAs were key facilitators in the progression of CLL, such as BTK, SYK, MCL-1, BCL-2, TCL1, and ROR1. Moreover, RNA-seq showed that the expression of these protein coding genes was reduced by 2- to 33-fold upon HDAC inhibition. We plan to extend the RNA-seq to 5 CLL samples with indolent disease and combine all the data to identify a common signature of protein coding and miRNA genes that recruited the HDAC1 complex, accumulated activating histone modifications upon treatment with HDACi and altered gene and miRNA expression after HDAC inhibition in high risk CLL versus low risk CLL. 
The signature will then be validated on a large cohort of indolent and aggressive CLL patients. Our final goal is to define a signature of coding and non coding genes silenced by HDACs in high risk CLL and its role in facilitating disease progression. Disclosures Woyach: Acerta: Research Funding; Karyopharm: Research Funding; Morphosys: Research Funding.


Blood ◽  
2012 ◽  
Vol 120 (21) ◽  
pp. 3298-3298 ◽  
Author(s):  
Eric R. Londin ◽  
Eleftheria Hatzimichael ◽  
Phillipe Loher ◽  
Yue Zhao ◽  
Yi Jing ◽  
...  

Abstract 3298: The anucleate platelets play a critical role in the formation of thrombi and prevention of bleeding. While the repertoire of platelet transcripts is a reflection of the megakaryocyte at the time of platelet differentiation, post-transcriptional events are known to occur. Furthermore, a strong correlation between the expressed mRNAs and proteome has been identified. Having a complete understanding of the platelet transcriptome is important for generating insights into the genetic basis of platelet disease traits. To capture the complexity of the platelet transcriptome, we performed RNA sequencing (RNA-seq) in leukocyte-depleted platelets from 10 males, with median age of 24.5 yrs and unremarkable medical history. Their short and long RNA platelet transcriptomes were analyzed on the SOLiD 5500xl sequencing platform. We generated ∼3.5 billion sequence reads, ∼40% of which could be mapped uniquely to the human genome. Our analysis revealed that ∼9,000 distinct protein-coding mRNAs and ∼800 microRNAs (miRNAs) were present in the transcriptome of each of the 10 sequenced individuals. Comparison of the levels of mRNA expression across the 10 individuals showed an exceptional level of consistency with pair-wise Pearson correlation values ≥0.98. The miRNA expression profiles across the 10 individuals showed a similar consistency with pair-wise Pearson correlation values ≥0.98. Surprisingly, we found that these mRNAs and miRNAs accounted for a little over 1/2 of all of the uniquely mapped sequence reads, suggesting the abundant presence of additional non-protein coding RNA (ncRNA) transcripts. Using the annotated entries of the latest release of the ENSEMBL database, we investigated the genetic make-up of these other transcripts. We found that ∼25% of each individual's uniquely mapped reads corresponded to non-protein coding transcripts from mRNA-coding loci. These reads accounted for more than 10,000 distinct such transcripts. 
In addition, each of the individuals in our cohort expressed an average of ∼1,500 pseudogenes and ∼200 long intergenic non-coding RNAs (lincRNAs). The short RNA profiles of the ten individuals revealed an abundance of diverse categories of ncRNAs including the signal recognition particle RNA (srpRNA), small nuclear RNA (snRNA) and small cytoplasmic RNAs (scRNA). These ncRNAs are involved in the processing of pre-mRNAs and their presence and prevalence in the anucleate platelet suggests the existence of a complex network of mRNA processing that persists after the megakaryocyte fragmentation. We also investigated the RNA-omes of the ten individuals for evidence of transcription of the pyknon category of ncRNAs. Pyknons are of particular interest because each has numerous intergenic and intronic copies whereas nearly all known human protein-coding genes contain one or more pyknons in their mRNA. Recent experimental work has shown that intergenic instances of the pyknons are transcribed in a tissue- and cell-state specific manner. An average of ∼100,000 pyknons are transcribed in each of the 10 sequenced individuals suggesting the possibility of a far-reaching network of interactions that link exonic space to distant non-exonic regions and are active in platelets. Lastly, we found that a large variety of distinct repeat element categories are expressed in the RNA-omes (both short and long) of these individuals. Among the most abundantly represented categories of repeat elements were DNA transposons, long terminal repeat (LTR) retrotransposons, and non-LTR retrotransposons such as long interspersed elements (LINEs) and short interspersed elements (SINEs). In summary, our RNA-seq analyses have revealed a spectrum of platelet transcripts that transcends protein-coding genes and miRNAs. 
Indeed, the transcripts that have their source in genomic features not previously discussed or analyzed in the platelet context represent a very significant portion of all platelet transcripts. This in turn suggests an unanticipated richness, and presumably commensurate complexity, for the platelet transcriptome. While the role of these novel non-protein coding RNAs is currently unknown it is expected that at least some of them may be of functional significance which will in turn permit a better understanding of the molecular mechanisms that regulate platelet physiology and may contribute to processes beyond thrombosis and hemostasis. Disclosures: No relevant conflicts of interest to declare.
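
The pair-wise Pearson comparison reported in the abstract above can be reproduced in outline with the following sketch. It assumes each individual is represented by a vector of expression values over the same ordered set of transcripts; the donor names and values are invented for illustration, and the abstract does not state whether raw or transformed values were correlated.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def pairwise_pearson(profiles):
    """Map each unordered pair of sample names to its correlation."""
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }

# Hypothetical expression profiles over the same four transcripts.
profiles = {
    "donor1": [1.0, 2.0, 3.0, 4.0],
    "donor2": [2.0, 4.0, 6.0, 8.0],
    "donor3": [4.0, 3.0, 2.0, 1.0],
}
correlations = pairwise_pearson(profiles)
```

A consistency check like the one in the abstract would then test whether the minimum value in `correlations` meets a chosen cutoff such as 0.98.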


mBio ◽  
2015 ◽  
Vol 6 (6) ◽  
Author(s):  
Vojtěch David ◽  
Pavel Flegontov ◽  
Evgeny Gerasimov ◽  
Goro Tanifuji ◽  
Hassan Hashimi ◽  
...  

ABSTRACT Perkinsela is an enigmatic early-branching kinetoplastid protist that lives as an obligate endosymbiont inside Paramoeba (Amoebozoa). We have sequenced the highly reduced mitochondrial genome of Perkinsela, which possesses only six protein-coding genes (cox1, cox2, cox3, cob, atp6, and rps12), despite the fact that the organelle itself contains more DNA than is present in either the host or endosymbiont nuclear genomes. An in silico analysis of two Perkinsela strains showed that mitochondrial RNA editing and processing machineries typical of kinetoplastid flagellates are generally conserved, and all mitochondrial transcripts undergo U-insertion/deletion editing. Canonical kinetoplastid mitochondrial ribosomes are also present. We have developed software tools for accurate and exhaustive mapping of transcriptome sequencing (RNA-seq) reads with extensive U-insertions/deletions, which allows detailed investigation of RNA editing via deep sequencing. With these methods, we show that up to 50% of reads for a given edited region contain errors of the editing system or, less likely, correspond to alternatively edited transcripts. IMPORTANCE Uridine insertion/deletion-type RNA editing, which occurs in the mitochondrion of kinetoplastid protists, has been well-studied in the model parasite genera Trypanosoma, Leishmania, and Crithidia. Perkinsela provides a unique opportunity to broaden our knowledge of RNA editing machinery from an evolutionary perspective, as it represents the earliest kinetoplastid branch and is an obligatory endosymbiont with extensive reductive trends. Interestingly, up to 50% of mitochondrial transcripts in Perkinsela contain errors. Our study was complemented by use of newly developed software designed for accurate mapping of extensively edited RNA-seq reads obtained by deep sequencing.


2020 ◽  
Vol 6 (2) ◽  
pp. 15 ◽  
Author(s):  
Lucas Maciel ◽  
David Morales-Vicente ◽  
Sergio Verjovski-Almeida

Schistosoma japonicum is a flatworm that causes schistosomiasis, a neglected tropical disease. S. japonicum RNA-Seq analyses have been previously reported in the literature on females and males obtained during sexual maturation from 14 to 28 days post-infection in mouse, resulting in the identification of protein-coding genes and pathways whose expression levels were related to sexual development. However, this work did not include an analysis of long non-coding RNAs (lncRNAs). Here, we applied a pipeline to identify and annotate lncRNAs in 66 publicly available S. japonicum RNA-Seq libraries from different life-cycle stages. We also performed co-expression analyses to find stage-specific lncRNAs possibly related to sexual maturation. We identified 12,291 S. japonicum expressed lncRNAs. Sequence similarity search and synteny conservation indicated that some 14% of S. japonicum intergenic lncRNAs have synteny conservation with S. mansoni intergenic lncRNAs. Co-expression analyses showed that lncRNAs and protein-coding genes in S. japonicum males and females have a dynamic co-expression throughout sexual maturation, showing differential expression between the sexes; the protein-coding genes were related to the nervous system development, lipid and drug metabolism, and overall parasite survival. The co-expression pattern suggests that lncRNAs possibly regulate these processes or are regulated by the same activation program as that of protein-coding genes.


Animals ◽  
2021 ◽  
Vol 11 (3) ◽  
pp. 625
Author(s):  
Dongdong Bo ◽  
Xunping Jiang ◽  
Guiqiong Liu ◽  
Ruixue Hu ◽  
Yuqing Chong

Long intergenic non-coding RNAs (lincRNAs) regulate testicular development by acting on protein-coding genes. However, little is known about whether lincRNAs and protein-coding genes exhibit the same expression pattern in the same phase of postnatal testicular development in goats. Therefore, this study aimed to demonstrate the expression patterns and roles of lincRNAs during the postnatal development of the goat testis. Herein, the testes of Yiling goats with average ages of 0, 30, 60, 90, 120, 150, and 180 days postnatal (DP) were used for RNA-seq. In total, 20,269 lincRNAs were identified, including 16,931 novel lincRNAs. We identified seven time-specifically diverse lincRNA modules and six mRNA modules by weighted gene co-expression network analysis (WGCNA). Interestingly, the down-regulation of growth-related lincRNAs was nearly one month earlier than the up-regulation of spermatogenesis-related lincRNAs, while the down-regulation of growth-related protein-coding genes and the correspondent up-regulation of spermatogenesis-related protein-coding genes occurred at the same age. Then, potential lincRNA target genes were predicted. Moreover, the co-expression network of lincRNAs demonstrated that ENSCHIT00000000777, ENSCHIT00000002069, and ENSCHIT00000005076 were the key lincRNAs in the process of testis development. Our study discovered the divergent regulation patterns of lincRNA on spermatogenesis and testis growth, providing a fresh insight into age-biased changes in lincRNA expression in the goat testis.


2019 ◽  
Author(s):  
Jing Li ◽  
Urminder Singh ◽  
Zebulun Arendsee ◽  
Eve Syrkin Wurtele

Abstract The “dark transcriptome” can be considered the multitude of sequences that are transcribed but not annotated as genes. We evaluated expression of 6,692 annotated genes and 29,354 unannotated ORFs in the Saccharomyces cerevisiae genome across diverse environmental, genetic and developmental conditions (3,457 RNA-Seq samples). Over 48% of the transcribed ORFs have translation evidence. Phylostratigraphic analysis infers most of these transcribed ORFs would encode species-specific proteins (“orphan-ORFs”); hundreds have mean expression comparable to annotated genes. These data reveal unannotated ORFs most likely to be protein-coding genes. We partitioned a co-expression matrix by Markov Chain Clustering; the resultant clusters contain 2,468 orphan-ORFs. We provide the aggregated RNA-Seq yeast data with extensive metadata as a project in MetaOmGraph, a tool designed for interactive analysis and visualization. This approach enables reuse of public RNA-Seq data for exploratory discovery, providing a rich context for experimentalists to make novel, experimentally-testable hypotheses about candidate genes.


2019 ◽  
Author(s):  
Deirdre C. Tatomer ◽  
Nathan D. Elrod ◽  
Dongming Liang ◽  
Mei-Sheng Xiao ◽  
Jeffrey Z. Jiang ◽  
...  

ABSTRACT Cellular homeostasis requires transcriptional outputs to be coordinated, and many events post transcription initiation can dictate the levels and functions of mature transcripts. To systematically identify regulators of inducible gene expression, we performed high-throughput RNAi screening of the Drosophila Metallothionein A (MtnA) promoter. This revealed that the Integrator complex, which has a well-established role in 3’ end processing of small nuclear RNAs (snRNAs), attenuates MtnA transcription during copper stress. Integrator complex subunit 11 (IntS11) endonucleolytically cleaves MtnA transcripts, resulting in premature transcription termination and degradation of the nascent RNAs by the RNA exosome, a complex also identified in the screen. Using RNA-seq, we then identified >400 additional Drosophila protein-coding genes whose expression increases upon Integrator depletion. We focused on a subset of these genes and confirmed that Integrator is bound to their 5’ ends and negatively regulates their transcription via IntS11 endonuclease activity. Many non-catalytic Integrator subunits, which are largely dispensable for snRNA processing, also have regulatory roles at these protein-coding genes, possibly by controlling Integrator recruitment or RNA polymerase II dynamics. Altogether, our results suggest that attenuation via Integrator cleavage limits production of many full-length mRNAs, allowing precise control of transcription outputs.

