Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi

2017 ◽  
Author(s):  
Jens Keilwagen ◽  
Frank Hartung ◽  
Michael Paulini ◽  
Sven O. Twardziok ◽  
Jan Grau

Motivation: Genome annotation is of key importance in many research questions. The identification of protein-coding genes is often based on transcriptome sequencing data, ab-initio or homology-based prediction. Recently, it was demonstrated that intron position conservation improves homology-based gene prediction, and that experimental data improves ab-initio gene prediction. Results: Here, we present an extension of the gene prediction tool GeMoMa that utilizes amino acid sequence conservation, intron position conservation and optionally RNA-seq data for homology-based gene prediction. We show on published benchmark data for plants, animals and fungi that GeMoMa performs better than the gene prediction programs BRAKER1, MAKER2, and CodingQuarry, and purely RNA-seq-based pipelines for transcript identification. In addition, we demonstrate that using multiple reference organisms may help to further improve the performance of GeMoMa. Finally, we apply GeMoMa to four nematode species and to the recently published barley reference genome indicating that current annotations of protein-coding genes may be refined using GeMoMa predictions. Availability: GeMoMa has been published under GNU GPL3 and is freely available at http://www.jstacs.de/index.php/[email protected]

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Lars Gabriel ◽  
Katharina J. Hoff ◽  
Tomáš Brůna ◽  
Mark Borodovsky ◽  
Mario Stanke

Abstract Background: BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently, and to pick one, likely better output. Therefore, one or another type of the extrinsic evidence would remain unexploited. Results: We present TSEBRA, a software that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. Conclusion: TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.


2013 ◽  
Vol 42 (5) ◽  
pp. 2820-2832 ◽  
Author(s):  
Nicolas Philippe ◽  
Elias Bou Samra ◽  
Anthony Boureux ◽  
Alban Mancheron ◽  
Florence Rufflé ◽  
...  

Abstract Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. Particularly, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species-comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as ‘TranscriRef’). We then annotated 750 000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ∼34 000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.


2021 ◽  
Author(s):  
Lars Gabriel ◽  
Katharina J Hoff ◽  
Tomas Bruna ◽  
Mark Borodovsky ◽  
Mario Stanke

Background: BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently, and to pick one, likely better output. Therefore, one or another type of the extrinsic evidence would remain unexploited. Results: We present TSEBRA, a software that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. Conclusion: TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.


2018 ◽  
Author(s):  
Sonali Arora ◽  
Siobhan S. Pattwell ◽  
Eric C. Holland ◽  
Hamid Bolouri

RNA-sequencing data is widely used to identify disease biomarkers and therapeutic targets. Here, using data from five RNA-seq processing pipelines applied to 6,690 human tumor and normal tissues, we show that for >12% of protein-coding genes, in at least 1% of samples, current best-in-class RNA-seq processing pipelines differ in their abundance estimates by more than four-fold using the same samples and the same set of RNA-seq reads, raising clinical concern.
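The statistic quoted in this abstract (genes whose abundance estimates differ by more than four-fold between pipelines in at least 1% of samples) can be illustrated with a minimal sketch. This is not the authors' code: the function name, the nested-dictionary input layout, the pseudocount, and the toy numbers are all assumptions made for illustration.

```python
def fraction_discordant_genes(estimates, fold=4.0, min_sample_fraction=0.01,
                              pseudocount=1.0):
    """Fraction of genes whose pipelines disagree by more than `fold`
    in at least `min_sample_fraction` of samples.

    estimates[gene][sample] is a list of abundance estimates, one per
    pipeline (hypothetical layout). The pseudocount is an assumption that
    avoids division by zero for unexpressed genes.
    """
    if not estimates:
        return 0.0
    discordant = 0
    for per_sample in estimates.values():
        n_bad = 0
        for values in per_sample.values():
            highest = max(values) + pseudocount
            lowest = min(values) + pseudocount
            if highest / lowest > fold:
                n_bad += 1
        if per_sample and n_bad / len(per_sample) >= min_sample_fraction:
            discordant += 1
    return discordant / len(estimates)


# Toy data: two genes, two samples, three pipelines each (made-up values).
toy = {
    "g1": {"s1": [10, 12, 11], "s2": [9, 50, 10]},
    "g2": {"s1": [5, 6, 5], "s2": [7, 7, 8]},
}
print(fraction_discordant_genes(toy))
```

In the toy data only `g1` has a sample where the largest and smallest estimates differ by more than four-fold, so half of the genes are counted as discordant.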


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Mikhail Pomaznoy ◽  
Ashu Sethi ◽  
Jason Greenbaum ◽  
Bjoern Peters

Abstract RNA-seq methods are widely utilized for transcriptomic profiling of biological samples. However, there are known caveats of this technology which can skew the gene expression estimates. Specifically, if the library preparation protocol does not retain RNA strand information then some genes can be erroneously quantitated. Although strand-specific protocols have been established, a significant portion of RNA-seq data is generated in a non-strand-specific manner. We used a comprehensive stranded RNA-seq dataset of 15 blood cell types to identify genes for which expression would be erroneously estimated if strand information was not available. We found that about 10% of all genes and 2.5% of protein-coding genes have a two-fold or higher difference in estimated expression when strand information of the reads was ignored. We used parameters of read alignments of these genes to construct a machine learning model that can identify which genes in an unstranded dataset might have incorrect expression estimates and which ones do not. We also show that differential expression analysis of genes with biased expression estimates in unstranded read data can be recovered by limiting the reads considered to those which span exonic boundaries. The resulting approach is implemented as a package available at https://github.com/mikpom/uslcount.
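The two-fold criterion mentioned in this abstract can be sketched as a simple comparison of per-gene counts obtained with and without strand information. The sketch below is only an illustration and does not reproduce the uslcount package or its machine learning model; the function name, the pseudocount, and the toy counts are assumptions.

```python
def flag_strand_biased(stranded, unstranded, fold=2.0, pseudocount=1.0):
    """Return sorted gene names whose count changes by `fold` or more
    when strand information is ignored.

    `stranded` and `unstranded` map gene name -> read count (hypothetical
    inputs). The pseudocount is an assumption that keeps the ratio defined
    for genes with zero reads.
    """
    flagged = []
    for gene, stranded_count in stranded.items():
        unstranded_count = unstranded.get(gene, 0.0)
        ratio = (unstranded_count + pseudocount) / (stranded_count + pseudocount)
        if ratio >= fold or ratio <= 1.0 / fold:
            flagged.append(gene)
    return sorted(flagged)


# Toy counts (made up): geneB picks up antisense reads when strand is ignored.
stranded_counts = {"geneA": 100, "geneB": 10, "geneC": 50}
unstranded_counts = {"geneA": 110, "geneB": 45, "geneC": 52}
print(flag_strand_biased(stranded_counts, unstranded_counts))
```

Only `geneB` crosses the two-fold threshold in the toy data; the other two genes change by a few percent.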


2021 ◽  
Author(s):  
Mirko Brüggemann

Most cellular processes are regulated by RNA-binding proteins (RBPs). These RBPs usually use defined binding sites to recognize and directly interact with their target RNA molecule. Individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP) experiments are an important tool to describe such interactions in cell cultures in vivo. This experimental protocol yields millions of individual sequencing reads from which the binding spectrum of the RBP under study can be deduced. In this PhD thesis I studied how RNA processing is driven by RBP binding by analyzing iCLIP-derived sequencing datasets. First, I described a complete data analysis pipeline to detect RBP binding sites from iCLIP sequencing reads. This workflow covers all essential processing steps, from the first quality control to the final annotation of binding sites. I described the accurate integration of biological iCLIP replicates to boost the initial peak calling step while ensuring high specificity through replicate reproducibility analysis. Further, I proposed a routine to level binding site width to streamline downstream analysis processes. This was exemplified in the reanalysis of the binding spectrum of the U2 small nuclear RNA auxiliary factor 2 (U2AF2, U2AF65). I recaptured the known dominance of U2AF65 to bind to intronic sequences of protein-coding genes, where it likely recognizes the polypyrimidine tract as part of the core spliceosome machinery. In the second part of my thesis, I analyzed the binding spectrum of the serine and arginine rich splicing factor 6 (SRSF6) in the context of diabetes. In pancreatic beta-cells, the expression of SRSF6 is regulated by the transcription factor GLIS3, which is encoded by a diabetes susceptibility gene. It is known that SRSF6 promotes beta-cell death through the splicing dysregulation of genes essential to beta-cell function and survival.
However, the exact mechanism of how these RNAs are targeted by SRSF6 remains poorly understood. Here, I applied the defined iCLIP processing pipeline to describe the binding landscape of the splicing factor SRSF6 in the human pancreatic beta-cell line EndoC-βH1. The initial binding site definition revealed a predominant binding to coding sequences (CDS) of protein-coding genes. This was followed up by extensive motif analysis, which revealed a purine-rich binding motif that was so far unknown in human. SRSF6 seemed to specifically recognize repetitions of the triplet GAA. I also showed that the number of contiguous triplets correlated with increasing binding site strength. I further integrated RNA-sequencing data from the same cell type, with SRSF6 under knockdown (KD) and basal conditions, to analyze SRSF6-related splicing changes. I showed that the exact positioning of SRSF6 on alternatively spliced exons regulates the produced transcript isoforms. This mechanism seemed to control exons in several known susceptibility genes for diabetes. In summary, in my PhD thesis, I presented a comprehensive workflow for the processing of iCLIP-derived sequencing data. I applied this pipeline to a dataset from pancreatic beta-cells to unveil the impact of SRSF6-mediated splicing changes. Thus, my analysis provides novel insights into the regulation of diabetes susceptibility genes.


2019 ◽  
Vol 8 (31) ◽  
Author(s):  
Rikky W. Purbojati ◽  
Daniela I. Drautz-Moses ◽  
Akira Uchida ◽  
Anthony Wong ◽  
Megan E. Clare ◽  
...  

Brevundimonas sp. strain SGAir0440 was isolated from indoor air samples collected in Singapore. Its genome was assembled using single-molecule real-time sequencing data, resulting in one circular chromosome with a length of 3.1 Mbp. The genome consists of 3,033 protein-coding genes, 48 tRNAs, and 6 rRNA operons.


Blood ◽  
2016 ◽  
Vol 128 (22) ◽  
pp. 2705-2705 ◽  
Author(s):  
Lara Rizzotto ◽  
Arianna Bottoni ◽  
Tzung-Huei Lai ◽  
Chaomei Liu ◽  
Pearlly S Yan ◽  
...  

Abstract Chronic lymphocytic leukemia (CLL) follows a variable clinical course mostly dependent upon genomic factors, with a subset of patients having low risk disease and others displaying rapid progression associated with clonal evolution. Epigenetic mechanisms such as DNA promoter hypermethylation were shown to have a role in CLL evolution where the acquisition of increasingly heterogeneous DNA methylation patterns occurred in conjunction with clonal evolution of genetic aberrations and was associated with disease progression. However, the role of epigenetic mechanisms regulated by the histone deacetylase group of transcriptional repressors in the progression of CLL has not been well characterized. The histone deacetylases (HDACs) 1 and 2 are recruited onto gene promoters and form a complex with the histone demethylase KDM1. Once recruited, the complex mediates the removal of acetyl groups from specific lysines on histones (H3K9 and H3K14) thus triggering the demethylation of lysine 4 (H3K4me3) and the silencing of gene expression. CLL is characterized by the dysregulation of numerous coding and non-coding genes, many of which have key roles in regulating the survival or progression of CLL. For instance, our group showed that the levels of HDAC1 were elevated in high risk as compared to low risk CLL or normal lymphocytes and this over-expression was responsible for the silencing of miR-106b, miR-15, miR-16, and miR-29b which affected CLL survival by modulating the expression of key anti-apoptotic proteins Bcl-2 and Mcl-1. To characterize the HDAC-repressed gene signature in high risk CLL, we conducted chromatin immunoprecipitation (ChIP) of the nuclear lysates from 3 high risk and 3 low risk CLL patients using antibodies against HDAC1, HDAC2 and KDM1 or non-specific IgG, sequenced and aligned the eluted DNA to a reference genome and determined the binding of HDAC1, HDAC2 and KDM1 at the promoters for all protein coding and microRNA genes.
Preliminary results from this ChIP-seq showed a strong recruitment of HDAC1, HDAC2 and KDM1 to the promoters of several microRNA as well as protein coding genes in high risk CLL. To further corroborate these data we performed ChIP-Seq in the same 6 CLL samples to analyze the levels of H3K4me2 and H3K4me3 around gene promoters before and after 6h exposure to the HDACi panobinostat. Our goal was to demonstrate that HDAC inhibition elicited an increase in the levels of acetylation on histones and triggered the accrual of H3K4me2 at the repressed promoter, events likely to facilitate the recruitment of RNA polymerase II to this promoter. Initial analysis confirmed a robust accumulation of H3K4me2 and H3K4me3 marks at the gene promoters of representative genes that recruited HDAC1 and its co-repressors in the previous ChIP-Seq analysis in high risk CLL patients. Finally, 5 aggressive CLL samples were treated with the HDACi abexinostat for 48h and RNA before and after treatment was subjected to RNA-seq for small and large RNA to confirm that the regions of chromatin uncoiled by HDACi treatment were actively transcribed. HDAC inhibition induced the expression of a large number of miRNA genes as well as key protein coding genes, such as miR-29b, miR-210, miR-182, miR-183, miR-95, miR-940, FOXO3, EBF1 and BCL2L11. Of note, some of the predicted or validated targets of the induced miRNAs were key facilitators in the progression of CLL, such as BTK, SYK, MCL-1, BCL-2, TCL1, and ROR1. Moreover, RNA-seq showed that the expression of these protein coding genes was reduced by 2- to 33-fold upon HDAC inhibition. We plan to extend the RNA-seq to 5 CLL samples with indolent disease and combine all the data to identify a common signature of protein coding and miRNA genes that recruited the HDAC1 complex, accumulated activating histone modifications upon treatment with HDACi and altered gene and miRNA expression after HDAC inhibition in high risk CLL versus low risk CLL.
The signature will then be validated on a large cohort of indolent and aggressive CLL patients. Our final goal is to define a signature of coding and non-coding genes silenced by HDACs in high risk CLL and its role in facilitating disease progression. Disclosures Woyach: Acerta: Research Funding; Karyopharm: Research Funding; Morphosys: Research Funding.


GigaScience ◽  
2019 ◽  
Vol 8 (10) ◽  
Author(s):  
Yunhai Guo ◽  
Yi Zhang ◽  
Qin Liu ◽  
Yun Huang ◽  
Guangyao Mao ◽  
...  

Abstract Background: Achatina fulica, the giant African snail, is the largest terrestrial mollusk species. Owing to its voracious appetite, wide environmental adaptability, high growth rate, and reproductive capacity, it has become an invasive species across the world, mainly in Southeast Asia, Japan, the western Pacific islands, and China. This pest can damage agricultural crops and is an intermediate host of many parasites that can threaten human health. However, genomic information of A. fulica remains limited, hindering genetic and genomic studies for invasion control and management of the species. Findings: Using a k-mer–based method, we estimated the A. fulica genome size to be 2.12 Gb, with a high repeat content up to 71%. Roughly 101.6 Gb genomic long-read data of A. fulica were generated from the Pacific Biosciences sequencing platform and assembled to produce a first A. fulica genome of 1.85 Gb with a contig N50 length of 726 kb. Using contact information from the Hi-C sequencing data, we successfully anchored 99.32% contig sequences into 31 chromosomes, leading to the final contig and scaffold N50 length of 721 kb and 59.6 Mb, respectively. The continuity, completeness, and accuracy were evaluated by genome comparison with other mollusk genomes, BUSCO assessment, and genomic read mapping. A total of 23,726 protein-coding genes were predicted from the assembled genome, among which 96.34% of the genes were functionally annotated. The phylogenetic analysis using whole-genome protein-coding genes revealed that A. fulica separated from a common ancestor with Biomphalaria glabrata ∼182 million years ago. Conclusion: To our knowledge, the A. fulica genome is the first terrestrial mollusk genome published to date. The chromosome sequence of A. fulica will provide the research community with a valuable resource for population genetics and environmental adaptation studies for the species, as well as investigations of the chromosome-level of evolution within mollusks.
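The contig and scaffold N50 values reported in this abstract follow the standard definition: the length L such that sequences of length L or longer together cover at least half of the assembly. A minimal sketch of that calculation is given below; the function name and the toy contig lengths are illustrative assumptions, not data from the A. fulica assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches half of the summed length; the length of
    the sequence that crosses this point is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (made up); total is 290, so half is 145.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```

For the toy lengths, the two longest contigs (80 and 70) already sum to 150, which exceeds half of the total, so the N50 is 70.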

