GENETACK: FRAMESHIFT IDENTIFICATION IN PROTEIN-CODING SEQUENCES BY THE VITERBI ALGORITHM

We describe a new program for ab initio frameshift detection in protein-coding nucleotide sequences. The task is to distinguish same-strand overlapping ORFs that arise from a frameshifted gene from same-strand overlapping ORFs that encompass true overlapping or adjacent genes. The GeneTack program uses a hidden Markov model (HMM) of a genomic sequence with possibly frameshifted protein-coding regions. The Viterbi algorithm finds the maximum likelihood path that discriminates between true adjacent genes and adjacent protein-coding regions that only appear to be separate entities because of frameshifts. The program can therefore identify spurious predictions made by a conventional gene-finding program misled by a frameshift. We tested GeneTack, as well as two earlier programs, FrameD and FSFind, on 17 prokaryotic genomes with frameshifts introduced randomly into known genes. The average frameshift prediction accuracy of GeneTack, measured as (Sn + Sp)/2, was higher by a significant margin than that of the two other programs. In addition, the average accuracy of GeneTack compares favorably with that of the FSFind-BLAST program, which uses a protein database search to verify predicted frameshifts, even though GeneTack does not use external evidence. GeneTack is freely available at .
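As a rough illustration of the decoding step mentioned in the abstract, the sketch below runs the Viterbi algorithm on a toy two-state HMM. The state names ("N" for non-coding, "C" for coding), all probabilities, and the function name `viterbi` are assumptions made for this example only; GeneTack's actual model, with its frame-specific coding states, is not reproduced here.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state path for obs under a discrete HMM.

    Works in log space to avoid numerical underflow on long sequences.
    """
    # best[s]: log-probability of the best path that ends in state s
    best = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # one dict of back-pointers per position after the first
    for symbol in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            # Pick the predecessor state that maximizes the path score into s
            p = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            best[s] = prev[p] + math.log(trans_p[p][s]) + math.log(emit_p[s][symbol])
            pointers[s] = p
        back.append(pointers)
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy parameters (hypothetical): the coding state "C" prefers G/C,
# the non-coding state "N" prefers A/T, and both states tend to persist.
STATES = ["N", "C"]
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

path = viterbi("AAAAGGGGGGAAAA", STATES, START, TRANS, EMIT)
print("".join(path))  # NNNNCCCCCCNNNN
```

In this toy setting the decoded path switches state exactly where the nucleotide composition changes, because two state switches cost less than mislabeling the G-rich stretch.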

GeneSV – an Approach to Help Characterize Possible Variations in Genomic and Protein Sequences

Bioinformatics and Biology Insights ◽

10.4137/bbi.s13076 ◽

2014 ◽

Vol 8 ◽

pp. BBI.S13076 ◽

Cited By ~ 4

Author(s):

Adam Zemla ◽

Tanya Kostova ◽

Rodion Gorchakov ◽

Evgeniya Volkova ◽

David W. C. Beasley ◽

...

Keyword(s):

Amino Acid ◽

Reverse Genetics ◽

Genomic Sequence ◽

Rna Virus ◽

Protein Sequences ◽

Single Amino Acid ◽

Amino Acid Substitutions ◽

Protein Coding ◽

Nucleotide Variability ◽

Coding Regions

A computational approach for identification and assessment of genomic sequence variability (GeneSV) is described. For a given nucleotide sequence, GeneSV collects information about the permissible nucleotide variability (changes that potentially preserve function) observed in corresponding regions of genomic sequences, and combines it with conservation/variability results from protein sequence and structure-based analyses of the evaluated protein-coding regions. GeneSV was used to predict the effects (functional vs. non-functional) of 37 amino acid substitutions on the NS5 polymerase (RdRp) of dengue virus type 2 (DENV-2), 36 of which are not observed in any publicly available DENV-2 sequence. Thirty-two novel mutants with single amino acid substitutions in the RdRp were generated using a DENV-2 reverse genetics system. In 81% (26 of 32) of the predictions tested, GeneSV correctly predicted the viability of the introduced mutations. GeneSV was also correct for 4 of 5 (80%) mutants with double amino acid substitutions proximal in structure to one another. The predictive capabilities of the system were illustrated on the dengue RNA virus, but the general approach described in the manuscript, which characterizes real or theoretically possible variations in genomic and protein sequences, can be applied to any organism.

BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database

10.1101/2020.08.10.245134 ◽

2020 ◽

Cited By ~ 5

Author(s):

Tomáš Brůna ◽

Katharina J. Hoff ◽

Alexandre Lomsadze ◽

Mario Stanke ◽

Mark Borodovsky

Keyword(s):

Genome Annotation ◽

Prediction Accuracy ◽

Gene Prediction ◽

Accurate Method ◽

Eukaryotic Genome ◽

Structural Annotation ◽

Protein Database ◽

Protein Coding ◽

Annotation Pipeline ◽

The One

Full automation of gene prediction has become an important bioinformatics task since the advent of next-generation sequencing. The eukaryotic genome annotation pipeline BRAKER1 combined self-training GeneMark-ET with AUGUSTUS to generate gene coordinates with the support of transcriptomic data. Here, we introduce BRAKER2, a pipeline with GeneMark-EP+ and AUGUSTUS externally supported by cross-species protein sequences aligned to the genome. Among the challenges addressed in developing the new pipeline was the generation of reliable hints to the locations of protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. Under equal conditions, the gene prediction accuracy of BRAKER2 was shown to be higher than that of MAKER2, another genome annotation pipeline. Also, in comparison with BRAKER1 supported by a large volume of transcript data, BRAKER2 could produce better gene prediction accuracy if the evolutionary distances to the reference species in the protein database were rather small. Overall, our tests demonstrated that fully automatic BRAKER2 is a fast and accurate method for structural annotation of novel eukaryotic genomes.

A regulatory-sequence classifier with a neural network for genomic information processing

10.1101/355974 ◽

2018 ◽

Cited By ~ 1

Author(s):

Koh Onimaru ◽

Osamu Nishimura ◽

Shigehiro Kuraku

Keyword(s):

Deep Learning ◽

Genomic Sequence ◽

Regulatory Sequence ◽

Sequence Information ◽

Regulatory Sequences ◽

Genomic Information ◽

Protein Coding ◽

Coding Regions ◽

Gene Regulatory ◽

Genomic Sequence Information

Genotype-phenotype mapping is one of the fundamental challenges in biology. The difficulties stem in part from the large amount of sequence information and the puzzling genomic code, particularly of non-protein-coding regions such as gene regulatory sequences. Recently, however, deep learning-based methods were shown to be able to decipher the gene regulatory code of genomes. Still, prediction accuracy needs improvement. Here, we report the design of convolution layers that efficiently process genomic sequence information and the development of a software package, DeepGMAP, for training and comparing different deep learning-based models (https://github.com/koonimaru/DeepGMAP). First, we demonstrate that our convolution layers, termed forward- and reverse-sequence scan (FRSS) layers, enhance the power to predict gene regulatory sequences. Second, we assess previous studies and identify problems associated with data structures that caused overfitting. Finally, we introduce several visualization methods that provide insights into the syntax of gene regulatory sequences.
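The abstract does not say how FRSS layers are implemented, so the following is only a toy sketch of the general idea the name suggests: applying one and the same filter to a sequence and to its reverse complement. The helper names (`reverse_complement`, `scan`, `scan_both_strands`) and the position-weight filter are hypothetical and are not taken from DeepGMAP.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def scan(seq, weights):
    """Slide a position-weight filter along seq and return one score per window.

    weights is a list of dicts, one per filter position, mapping base -> weight.
    Bases missing from a dict contribute 0.
    """
    width = len(weights)
    return [
        sum(weights[i].get(seq[start + i], 0.0) for i in range(width))
        for start in range(len(seq) - width + 1)
    ]

def scan_both_strands(seq, weights):
    """Return the best filter score found on either strand orientation."""
    forward = scan(seq, weights)
    reverse = scan(reverse_complement(seq), weights)
    return max(forward + reverse)

# Hypothetical filter that matches the motif GGA
GGA_FILTER = [{"G": 1.0}, {"G": 1.0}, {"A": 1.0}]

# The motif is present only on the opposite strand of this sequence (as TCC)
print(max(scan("AATCCAA", GGA_FILTER)))          # 1.0
print(scan_both_strands("AATCCAA", GGA_FILTER))  # 3.0
```

Scanning both orientations lets a single filter detect a motif regardless of which strand carries it, so no separate filter for the reverse-complemented motif is needed.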

Evolutionary Analysis of DNA-Protein-Coding Regions Based on a Genetic Code Cube Metric

Current Topics in Medicinal Chemistry ◽

10.2174/1568026613666131204110022 ◽

2014 ◽

Vol 14 (3) ◽

pp. 407-417

Author(s):

Robersy Sanchez

Keyword(s):

Genetic Code ◽

Evolutionary Analysis ◽

Protein Coding ◽

Coding Regions

The open targets post-GWAS analysis pipeline

Bioinformatics ◽

10.1093/bioinformatics/btaa020 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2936-2937 ◽

Cited By ~ 4

Author(s):

Gareth Peat ◽

William Jones ◽

Michael Nuhn ◽

José Carlos Marugán ◽

William Newell ◽

...

Keyword(s):

Drug Targets ◽

Gene Expression Regulation ◽

Association Studies ◽

Genome Wide Association Studies ◽

Protein Coding ◽

Data Resource ◽

Coding Regions ◽

Genome Wide ◽

Causal Genes ◽

Interactive Data

Motivation: Genome-wide association studies (GWAS) are a powerful method to detect even weak associations between variants and phenotypes; however, many of the identified associated variants are in non-coding regions, and presumably influence gene expression regulation. Identifying potential drug targets, i.e. causal protein-coding genes, therefore, requires crossing the genetics results with functional data. Results: We present a novel data integration pipeline that analyses GWAS results in the light of experimental epigenetic and cis-regulatory datasets, such as ChIP-Seq, Promoter-Capture Hi-C or eQTL, and presents them in a single report, which can be used for inferring likely causal genes. This pipeline was then fed into an interactive data resource. Availability and implementation: The analysis code is available at www.github.com/Ensembl/postgap and the interactive data browser at postgwas.opentargets.io.

Novel exon 1 protein‐coding regions N‐terminally extend human KCNE3 and KCNE4

The FASEB Journal ◽

10.1096/fj.201600467r ◽

2016 ◽

Vol 30 (8) ◽

pp. 2959-2969 ◽

Cited By ~ 8

Author(s):

Geoffrey W. Abbott

Keyword(s):

Protein Coding ◽

Coding Regions ◽

Exon 1 ◽

Novel Exon

Protein-coding structured RNAs: A computational survey of conserved RNA secondary structures overlapping coding regions in drosophilids

Biochimie ◽

10.1016/j.biochi.2011.07.023 ◽

2011 ◽

Vol 93 (11) ◽

pp. 2019-2023 ◽

Cited By ~ 8

Author(s):

Sven Findeiß ◽

Jan Engelhardt ◽

Sonja J. Prohaska ◽

Peter F. Stadler

Keyword(s):

Secondary Structures ◽

Protein Coding ◽

Rna Secondary Structures ◽

Coding Regions

Structure and expression of canary myc family genes

Molecular and Cellular Biology ◽

10.1128/mcb.11.3.1770-1776.1991 ◽

1991 ◽

Vol 11 (3) ◽

pp. 1770-1776

Author(s):

R G Collum ◽

D F Clayton ◽

F W Alt

Keyword(s):

Untranslated Region ◽

Untranslated Regions ◽

Coding Region ◽

Protein Coding ◽

Coding Regions ◽

Neuronal Precursors ◽

Myc Gene ◽

Mature Neurons

We found that the canary N-myc gene is highly related to mammalian N-myc genes in both the protein-coding region and the long 3' untranslated region. Examined coding regions of the canary c-myc gene were also highly related to their mammalian counterparts, but in contrast to N-myc, the canary and mammalian c-myc genes were quite divergent in their 3' untranslated regions. We readily detected N-myc and c-myc expression in the adult canary brain and found N-myc expression both at sites of proliferating neuronal precursors and in mature neurons.

Sequence and phylogenetic analysis of the non-structural 3A and 3B protein-coding regions of foot-and-mouth disease virus subtype A Iran 05

Journal of Veterinary Science ◽

10.4142/jvs.2010.11.3.243 ◽

2010 ◽

Vol 11 (3) ◽

pp. 243

Author(s):

Saber Jelokhani-Niaraki ◽

Majid Esmaelizad ◽

Morteza Daliri ◽

Rasoul Vaez-Torshizi ◽

Morteza Kamalzadeh ◽

...

Keyword(s):

Phylogenetic Analysis ◽

Disease Virus ◽

Foot And Mouth Disease ◽

Mouth Disease ◽

Protein Coding ◽

Coding Regions ◽

Mouth Disease Virus ◽

Foot And Mouth ◽

Subtype A ◽

Virus Subtype

Genome Sequence of Rheinheimera salexigens sp. nov. Isolated from a Fishing Hook off O‘ahu, Hawai‘i

Genome Announcements ◽

10.1128/genomea.01390-16 ◽

2016 ◽

Vol 4 (6) ◽

Cited By ~ 1

Author(s):

Xuehua Wan ◽

Shaobin Hou ◽

Kazukuni Hayashi ◽

James Anderson ◽

Stuart P. Donachie

Keyword(s):

Genome Sequence ◽

Draft Genome ◽

Draft Genome Sequence ◽

Protein Coding ◽

Coding Sequences ◽

Coding Regions ◽

Roche 454

Rheinheimera salexigens KH87T is an obligately halophilic gammaproteobacterium. The strain’s draft genome sequence, generated by the Roche 454 GS FLX+ platform, comprises two scaffolds of ~3.4 Mbp and ~3 kbp, with 3,030 protein-coding sequences and 58 tRNA coding regions. The G+C content is 42 mol%.
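For readers unfamiliar with the G+C statistic quoted above, a minimal sketch of how such a percentage can be computed from a nucleotide string follows. The function name `gc_content` and the choice to ignore ambiguous bases such as N are assumptions for this example; the announcement does not state how its value was calculated.

```python
def gc_content(seq):
    """Return the G+C percentage of a DNA sequence.

    Only unambiguous bases (A, C, G, T) are counted; anything else,
    such as N, is ignored. This handling is an assumption of the sketch.
    """
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)

print(gc_content("ATGC"))  # 50.0
```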
