GENETACK: FRAMESHIFT IDENTIFICATION IN PROTEIN-CODING SEQUENCES BY THE VITERBI ALGORITHM

2010 ◽  
Vol 08 (03) ◽  
pp. 535-551 ◽  
Author(s):  
IVAN ANTONOV ◽  
MARK BORODOVSKY

We describe a new program for ab initio frameshift detection in protein-coding nucleotide sequences. The task is to distinguish same-strand overlapping ORFs that arise from a frameshifted gene from same-strand overlapping ORFs that encompass true overlapping or adjacent genes. The GeneTack program uses a hidden Markov model (HMM) of genomic sequence with possibly frameshifted protein-coding regions. The Viterbi algorithm finds the maximum likelihood path that discriminates between true adjacent genes and those adjacent protein-coding regions that just appear to be separate entities due to frameshifts. Therefore, the program can identify spurious predictions made by a conventional gene-finding program misled by a frameshift. We tested GeneTack, as well as two earlier programs, FrameD and FSFind, on 17 prokaryotic genomes with frameshifts introduced randomly into known genes. We observed that the average frameshift prediction accuracy of GeneTack, in terms of (Sn + Sp)/2 values, was higher by a significant margin than the accuracy of the two other programs. In addition, we observed that the average accuracy of GeneTack compares favorably with the accuracy of the FSFind-BLAST program, which uses protein database search to verify predicted frameshifts, even though GeneTack does not use external evidence. GeneTack is freely available at .
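To make the role of the Viterbi algorithm concrete, here is a minimal sketch of log-space Viterbi decoding on a toy two-state model. It is not GeneTack's HMM: the state names, the per-window observations ("f1", "f2" standing for which frame looks most coding-like in a window), and all probabilities are hypothetical values chosen for illustration. The idea it shows is that sticky transitions absorb an isolated noisy window, while a sustained change of signal produces a state switch that could be read as a candidate frameshift.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path for the observations (log-space Viterbi)."""
    # best[t][s]: best log-probability of any path that ends in state s at step t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []  # back[t][s]: predecessor of s on the best path at step t + 1
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def log_table(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


def frame_switches(path):
    """Indices at which the decoded state differs from the previous one."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


# Hypothetical toy model (assumption, not taken from the paper).
STATES = ["frame1", "frame2"]
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = log_table({
    "frame1": {"frame1": 0.95, "frame2": 0.05},
    "frame2": {"frame1": 0.05, "frame2": 0.95},
})
LOG_EMIT = log_table({
    "frame1": {"f1": 0.8, "f2": 0.2},
    "frame2": {"f1": 0.2, "f2": 0.8},
})

observations = ["f1", "f1", "f2", "f1", "f1", "f2", "f2", "f2", "f2", "f2"]
path = viterbi(observations, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print(path)
print(frame_switches(path))
```

With these toy parameters, the single "f2" at index 2 is treated as noise and the decoded path switches state only once, at the start of the sustained "f2" run.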

2014 ◽  
Vol 8 ◽  
pp. BBI.S13076 ◽  
Author(s):  
Adam Zemla ◽  
Tanya Kostova ◽  
Rodion Gorchakov ◽  
Evgeniya Volkova ◽  
David W. C. Beasley ◽  
...  

A computational approach for identification and assessment of genomic sequence variability (GeneSV) is described. For a given nucleotide sequence, GeneSV collects information about the permissible nucleotide variability (changes that potentially preserve function) observed in corresponding regions in genomic sequences, and combines it with conservation/variability results from protein sequence and structure-based analyses of evaluated protein coding regions. GeneSV was used to predict effects (functional vs. non-functional) of 37 amino acid substitutions on the NS5 polymerase (RdRp) of dengue virus type 2 (DENV-2), 36 of which are not observed in any publicly available DENV-2 sequence. 32 novel mutants with single amino acid substitutions in the RdRp were generated using a DENV-2 reverse genetics system. In 81% (26 of 32) of predictions tested, GeneSV correctly predicted viability of introduced mutations. In 4 of 5 (80%) mutants with double amino acid substitutions proximal in structure to one another, GeneSV was also correct in its predictions. The predictive capabilities of the developed system were illustrated on dengue RNA virus, but the general approach described in the manuscript for characterizing real or theoretically possible variations in genomic and protein sequences can be applied to any organism.


Author(s):  
Tomáš Brůna ◽  
Katharina J. Hoff ◽  
Alexandre Lomsadze ◽  
Mario Stanke ◽  
Mark Borodovsky

Full automation of gene prediction has become an important bioinformatics task since the advent of next generation sequencing. The eukaryotic genome annotation pipeline BRAKER1 combined self-training GeneMark-ET with AUGUSTUS to generate gene coordinates with the support of transcriptomic data. Here, we introduce BRAKER2, a pipeline with GeneMark-EP+ and AUGUSTUS externally supported by cross-species protein sequences aligned to the genome. Among the challenges addressed in the development of the new pipeline was the generation of reliable hints to the locations of protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. Under equal conditions, the gene prediction accuracy of BRAKER2 was shown to be higher than that of MAKER2, another genome annotation pipeline. Also, in comparison with BRAKER1 supported by a large volume of transcript data, BRAKER2 could produce better gene prediction accuracy if the evolutionary distances to the reference species in the protein database were rather small. Overall, our tests demonstrated that fully automatic BRAKER2 is a fast and accurate method for structural annotation of novel eukaryotic genomes.


2018 ◽  
Author(s):  
Koh Onimaru ◽  
Osamu Nishimura ◽  
Shigehiro Kuraku

Genotype-phenotype mapping is one of the fundamental challenges in biology. The difficulties stem in part from the large amount of sequence information and the puzzling genomic code, particularly of non-protein-coding regions such as gene regulatory sequences. Recently, however, deep learning-based methods were shown to be able to decipher the gene regulatory code of genomes. Still, prediction accuracy needs improvement. Here, we report the design of convolution layers that efficiently process genomic sequence information and the development of a software package, DeepGMAP, to train and compare different deep learning-based models (https://github.com/koonimaru/DeepGMAP). First, we demonstrate that our convolution layers, termed forward- and reverse-sequence scan (FRSS) layers, enhance the power to predict gene regulatory sequences. Second, we assessed previous studies and identified problems associated with data structures that caused overfitting. Finally, we introduce several visualization methods that provide insights into the syntax of gene regulatory sequences.
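The abstract does not say how FRSS layers are implemented, so the following is only a rough illustration of the general idea of scanning a sequence on both strands with one filter, not DeepGMAP code. The function names, the 0/1 motif filter, and the choice of taking the per-position maximum of the two strand scores are all assumptions made for this sketch.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def motif_filter(motif):
    """Build a toy filter: one weight per base per position, 1.0 on a match."""
    return [{base: 1.0 if base == m else 0.0 for base in "ACGT"} for m in motif]


def scan(seq, weights):
    """Slide the filter along seq and return one score per window."""
    width = len(weights)
    return [
        sum(weights[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]


def forward_reverse_scan(seq, weights):
    """Score each window on both strands and keep the larger score.

    Combining by max is an assumption of this sketch.
    """
    forward = scan(seq, weights)
    reverse = scan(reverse_complement(seq), weights)
    # Reversing the list aligns each reverse-strand window with the
    # forward-strand window covering the same positions.
    reverse.reverse()
    return [max(f, r) for f, r in zip(forward, reverse)]


scores = forward_reverse_scan("TTCAGG", motif_filter("TGA"))
print(scores)
```

Here the motif TGA never appears on the forward strand of TTCAGG, but its reverse complement TCA does, so the window starting at index 1 receives the full score only because the reverse strand is scanned as well.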


2020 ◽  
Vol 36 (9) ◽  
pp. 2936-2937 ◽  
Author(s):  
Gareth Peat ◽  
William Jones ◽  
Michael Nuhn ◽  
José Carlos Marugán ◽  
William Newell ◽  
...  

Motivation: Genome-wide association studies (GWAS) are a powerful method to detect even weak associations between variants and phenotypes; however, many of the identified associated variants are in non-coding regions, and presumably influence gene expression regulation. Identifying potential drug targets, i.e. causal protein-coding genes, therefore, requires crossing the genetics results with functional data. Results: We present a novel data integration pipeline that analyses GWAS results in the light of experimental epigenetic and cis-regulatory datasets, such as ChIP-Seq, Promoter-Capture Hi-C or eQTL, and presents them in a single report, which can be used for inferring likely causal genes. This pipeline was then fed into an interactive data resource. Availability and implementation: The analysis code is available at www.github.com/Ensembl/postgap and the interactive data browser at postgwas.opentargets.io.


Biochimie ◽  
2011 ◽  
Vol 93 (11) ◽  
pp. 2019-2023 ◽  
Author(s):  
Sven Findeiß ◽  
Jan Engelhardt ◽  
Sonja J. Prohaska ◽  
Peter F. Stadler

1991 ◽  
Vol 11 (3) ◽  
pp. 1770-1776
Author(s):  
R G Collum ◽  
D F Clayton ◽  
F W Alt

We found that the canary N-myc gene is highly related to mammalian N-myc genes in both the protein-coding region and the long 3' untranslated region. Examined coding regions of the canary c-myc gene were also highly related to their mammalian counterparts, but in contrast to N-myc, the canary and mammalian c-myc genes were quite divergent in their 3' untranslated regions. We readily detected N-myc and c-myc expression in the adult canary brain and found N-myc expression both at sites of proliferating neuronal precursors and in mature neurons.


2010 ◽  
Vol 11 (3) ◽  
pp. 243
Author(s):  
Saber Jelokhani-Niaraki ◽  
Majid Esmaelizad ◽  
Morteza Daliri ◽  
Rasoul Vaez-Torshizi ◽  
Morteza Kamalzadeh ◽  
...  

2016 ◽  
Vol 4 (6) ◽  
Author(s):  
Xuehua Wan ◽  
Shaobin Hou ◽  
Kazukuni Hayashi ◽  
James Anderson ◽  
Stuart P. Donachie

Rheinheimera salexigens KH87T is an obligately halophilic gammaproteobacterium. The strain’s draft genome sequence, generated by the Roche 454 GS FLX+ platform, comprises two scaffolds of ~3.4 Mbp and ~3 kbp, with 3,030 protein-coding sequences and 58 tRNA coding regions. The G+C content is 42 mol%.
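For readers unfamiliar with the G+C figure quoted above, the following small sketch shows how a G+C percentage can be computed from a nucleotide string. The function name and the decision to ignore ambiguous bases such as N are assumptions of this example, not details from the genome announcement.

```python
def gc_content(seq):
    """Return the G+C percentage of a sequence, counting only A, C, G, T."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = gc + sum(seq.count(base) for base in "AT")
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total


print(gc_content("ATGCATGC"))
```

For a multi-scaffold assembly, the scaffolds would be concatenated (or their base counts summed) before taking the ratio, so that longer scaffolds contribute proportionally more.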

