WISCOD: A Statistical Web-Enabled Tool for the Identification of Significant Protein Coding Regions

Classically, gene prediction programs are based on detecting signals such as boundary sites (splice sites, starts, and stops) and coding regions in the DNA sequence in order to build potential exons and join them into a gene structure. Although nowadays it is possible to improve their performance with additional information from related species or/and cDNA databases, further improvement at any step could help to obtain better predictions. Here, we present WISCOD, a web-enabled tool for the identification of significant protein coding regions, a novel software tool that tackles the exon prediction problem in eukaryotic genomes. WISCOD has the capacity to detect real exons from large lists of potential exons, and it provides an easy way to use globalPvalue called expected probability of being a false exon (EPFE) that is useful for ranking potential exons in a probabilistic framework, without additional computational costs. The advantage of our approach is that it significantly increases the specificity and sensitivity (both between 80% and 90%) in comparison to other ab initio methods (where they are in the range of 70–75%). WISCOD is written in JAVA and R and is available to download and to run in a local mode on Linux and Windows platforms.

Download Full-text

TSEBRA: transcript selector for BRAKER

BMC Bioinformatics ◽

10.1186/s12859-021-04482-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Lars Gabriel ◽

Katharina J. Hoff ◽

Tomáš Brůna ◽

Mark Borodovsky ◽

Mario Stanke

Keyword(s):

Statistical Models ◽

Gene Prediction ◽

Software Tool ◽

Genome Project ◽

Rna Seq ◽

Protein Coding ◽

Homologous Protein ◽

Protein Coding Genes ◽

Overlapping Transcripts ◽

Eukaryotic Genomes

Abstract Background BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and, then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently, and to pick one, likely better output. Therefore, one or another type of the extrinsic evidence would remain unexploited. Results We present TSEBRA, a software that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. Conclusion TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.

Download Full-text

TSEBRA: Transcript Selector for BRAKER

10.1101/2021.06.07.447316 ◽

2021 ◽

Author(s):

Lars Gabriel ◽

Katharina J Hoff ◽

Tomas Bruna ◽

Mark Borodovsky ◽

Mario Stanke

Keyword(s):

Statistical Models ◽

Gene Prediction ◽

Software Tool ◽

Genome Project ◽

Rna Seq ◽

Protein Coding ◽

Homologous Protein ◽

Protein Coding Genes ◽

Overlapping Transcripts ◽

Eukaryotic Genomes

Background: BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and, then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently, and to pick one, likely better output. Therefore, one or another type of the extrinsic evidence would remain unexploited. Results: We present TSEBRA, a software that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. Conclusion: TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.

Download Full-text

Gene prediction by multiple syntenic alignment

Journal of Integrative Bioinformatics ◽

10.1515/jib-2005-13 ◽

2005 ◽

Vol 2 (1) ◽

pp. 38-47

Author(s):

Said S. Adi ◽

Carlos E. Ferreira

Keyword(s):

Computer Program ◽

Genomic Dna ◽

Gene Prediction ◽

Genomic Sequences ◽

Prediction Problem ◽

Huge Amount ◽

Protein Coding ◽

Coding Regions ◽

Conserved Regions ◽

Similarity Information

Summary Given the increasing number of available genomic sequences, one now faces the task of identifying their functional parts, like the protein coding regions. The gene prediction problem can be addressed in several ways. One of the most promising methods makes use of similarity information between the genomic DNA and previously annotated sequences (proteins, cDNAs and ESTs). Recently, given the huge amount of newly sequenced genomes, new similarity-based methods are being successfully applied in the task of gene prediction. The so-called comparative-based methods lie in the similarities shared by regions of two evolutionary related genomic sequences. Despite the number of different gene prediction approaches in the literature, this problem remains challenging. In this paper we present a new comparative-based approach to the gene prediction problem. It is based on a syntenic alignment of three or more genomic sequences. With syntenic alignment we mean an alignment that is constructed taking into account the fact that the involved sequences include conserved regions intervened by unconserved ones. We have implemented the proposed algorithm in a computer program and confirm the validity of the approach on a benchmark including triples of human, mouse and rat genomic sequences.

Download Full-text

Distinctive functional regime of endogenous lncRNAs in dark regions of human genome

10.1101/2020.12.06.413880 ◽

2020 ◽

Author(s):

Anyou Wang ◽

Rong Hai

Keyword(s):

Human Genome ◽

Rna Processing ◽

Self Regulation ◽

Post Translational Modification ◽

Protein Coding ◽

Noncoding Regions ◽

Coding Regions ◽

Rnaseq Data ◽

Response To Stress ◽

Eukaryotic Genomes

AbstractEukaryotic genomes gradually gain noncoding regions when advancing evolution and human genome actively transcribes >90% of its noncoding regions1, suggesting their criticality in evolutionary human genome. Yet <1% of them have been functionally characterized2, leaving most human genome in dark. Here we systematically decode endogenous lncRNAs located in unannotated regions of human genome and decipher a distinctive functional regime of lncRNAs hidden in massive RNAseq data. LncRNAs divergently distribute across chromosomes, independent of protein-coding regions. Their transcriptions barely initiate on promoters through polymerase II, but mostly on enhancers. Yet conventional enhancer activators(e.g. H3K4me1) only account for a small proportion of lncRNA activation, suggesting alternatively unknown mechanisms initiating the majority of lncRNAs. Meanwhile, lncRNA-self regulation also notably contributes to lncRNA activation. LncRNAs trans-regulate broad bioprocesses, including transcription and RNA processing, cell cycle, respiration, response to stress, chromatin organization, post-translational modification, and development. Overall lncRNAs govern their owned regime distinctive from protein’s.

Download Full-text

A Fast DFT based Gene Prediction Algorithm for Identification of Protein Coding Regions

Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005. ◽

10.1109/icassp.2005.1416388 ◽

2006 ◽

Cited By ~ 22

Author(s):

S. Datta ◽

A. Asif

Keyword(s):

Gene Prediction ◽

Prediction Algorithm ◽

Protein Coding ◽

Coding Regions

Download Full-text

Gene Prediction by Spectral Rotation Measure: A New Method for Identifying Protein-Coding Regions

Genome Research ◽

10.1101/gr.1261703 ◽

2003 ◽

Cited By ~ 14

Author(s):

D. Kotlar

Keyword(s):

Gene Prediction ◽

New Method ◽

Protein Coding ◽

Coding Regions ◽

Rotation Measure

Download Full-text

NOVOMIR: De Novo Prediction of MicroRNA-Coding Regions in a Single Plant-Genome

Journal of Nucleic Acids ◽

10.4061/2010/495904 ◽

2010 ◽

Vol 2010 ◽

pp. 1-10 ◽

Cited By ~ 25

Author(s):

Jan-Hendrik Teune ◽

Gerhard Steger

Keyword(s):

Noncoding Rna ◽

De Novo ◽

Plant Genome ◽

Messenger Rnas ◽

Protein Coding ◽

Rna Molecules ◽

Coding Regions ◽

Mirna Genes ◽

Genomic Scale ◽

Eukaryotic Genomes

MicroRNAs (miRNA) are small regulatory, noncoding RNA molecules that are transcribed as primary miRNAs (pri-miRNA) from eukaryotic genomes. At least in plants, their regulatory activity is mediated through base-pairing with protein-coding messenger RNAs (mRNA) followed by mRNA degradation or translation repression. We describeNOVOMIR, a program for the identification of miRNA genes in plant genomes. It uses a series of filter steps and a statistical model to discriminate a pre-miRNA from other RNAs and does rely neither on prior knowledge of a miRNA target nor on comparative genomics. The sensitivity and specificity ofNOVOMIR for detection of premiRNAs fromArabidopsis thalianais ~0.83 and ~0.99, respectively. Plant pre-miRNAs are more heterogeneous with respect to size and structure than animal pre-miRNAs. Despite these difficulties,NOVOMIR is well suited to perform searches for pre-miRNAs on a genomic scale.NOVOMIR is written in Perl and relies on two additional, free programs for prediction of RNA secondary structure (RNALFOLD, RNASHAPES).

Download Full-text

Gene Prediction by the Scale-limited Gabor Wavelet Transform for Identifying the Protein Coding Regions

2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV) ◽

10.1109/icarcv50220.2020.9305490 ◽

2020 ◽

Author(s):

Qian Zheng ◽

Wenxiang Zhou ◽

Tao Chen ◽

Lei Xie ◽

Hong Ye Su

Keyword(s):

Wavelet Transform ◽

Gene Prediction ◽

Gabor Wavelet ◽

Protein Coding ◽

Gabor Wavelet Transform ◽

Coding Regions

Download Full-text

Evolutionary Analysis of DNA-Protein-Coding Regions Based on a Genetic Code Cube Metric

Current Topics in Medicinal Chemistry ◽

10.2174/1568026613666131204110022 ◽

2014 ◽

Vol 14 (3) ◽

pp. 407-417

Author(s):

Robersy Sanchez

Keyword(s):

Genetic Code ◽

Evolutionary Analysis ◽

Protein Coding ◽

Coding Regions

Download Full-text

The open targets post-GWAS analysis pipeline

Bioinformatics ◽

10.1093/bioinformatics/btaa020 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2936-2937 ◽

Cited By ~ 4

Author(s):

Gareth Peat ◽

William Jones ◽

Michael Nuhn ◽

José Carlos Marugán ◽

William Newell ◽

...

Keyword(s):

Drug Targets ◽

Gene Expression Regulation ◽

Association Studies ◽

Genome Wide Association Studies ◽

Protein Coding ◽

Data Resource ◽

Coding Regions ◽

Genome Wide ◽

Causal Genes ◽

Interactive Data

Abstract Motivation Genome-wide association studies (GWAS) are a powerful method to detect even weak associations between variants and phenotypes; however, many of the identified associated variants are in non-coding regions, and presumably influence gene expression regulation. Identifying potential drug targets, i.e. causal protein-coding genes, therefore, requires crossing the genetics results with functional data. Results We present a novel data integration pipeline that analyses GWAS results in the light of experimental epigenetic and cis-regulatory datasets, such as ChIP-Seq, Promoter-Capture Hi-C or eQTL, and presents them in a single report, which can be used for inferring likely causal genes. This pipeline was then fed into an interactive data resource. Availability and implementation The analysis code is available at www.github.com/Ensembl/postgap and the interactive data browser at postgwas.opentargets.io.

Download Full-text