Accurate annotation of protein coding sequences with IDTAXA

2021 ◽  
Vol 3 (3) ◽  
Author(s):  
Nicholas P Cooley ◽  
Erik S Wright

Abstract The observed diversity of protein coding sequences continues to increase far more rapidly than knowledge of their functions, making classification algorithms essential for assigning a function to proteins using only their sequence. Most pipelines for annotating proteins rely on searches for homologous sequences in databases of previously annotated proteins using BLAST or HMMER. Here, we develop a new approach for classifying proteins into a taxonomy of functions and demonstrate its utility for genome annotation. Our algorithm, IDTAXA, was more accurate than BLAST or HMMER at assigning sequences to KEGG ortholog groups. Moreover, IDTAXA correctly avoided classifying sequences with novel functions to existing groups, which is a common error mode for classification approaches that rely on E-values as a proxy for confidence. We demonstrate IDTAXA’s utility for annotating eukaryotic and prokaryotic genomes by assigning functions to proteins within a multi-level ontology, and we apply IDTAXA to detect contamination in eukaryotic genomes. Finally, we re-annotated 8604 microbial genomes with known antibiotic resistance phenotypes to discover two novel associations between proteins and antibiotic resistance. IDTAXA is available as a web tool (http://DECIPHER.codes/Classification.html) or as part of the open source DECIPHER R package from Bioconductor.

2019 ◽  
Author(s):  
Deepank R Korandla ◽  
Jacob M Wozniak ◽  
Anaamika Campeau ◽  
David J Gonzalez ◽  
Erik S Wright

Abstract
Motivation: A core task of genomics is to identify the boundaries of protein coding genes, which may cover over 90% of a prokaryote's genome. Several programs are available for gene finding, yet it is currently unclear how well these programs perform and whether any offers superior accuracy. This is in part because there is no universal benchmark for gene finding and, therefore, most developers select their own benchmarking strategy.
Results: Here, we introduce AssessORF, a new approach for benchmarking prokaryotic gene predictions based on evidence from proteomics data and the evolutionary conservation of start and stop codons. We applied AssessORF to compare gene predictions offered by GenBank, GeneMarkS-2, Glimmer and Prodigal on genomes spanning the prokaryotic tree of life. Gene predictions were 88–95% in agreement with the available evidence, with Glimmer performing the worst but no clear winner. All programs were biased towards selecting start codons that were upstream of the actual start. Given these findings, there remains considerable room for improvement, especially in the detection of correct start sites.
Availability and implementation: AssessORF is available as an R package via the Bioconductor package repository.
Supplementary information: Supplementary data are available at Bioinformatics online.
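
The start-site problem described in this abstract can be illustrated with a minimal Python sketch. It is not how AssessORF works, and the function name `candidate_starts` is hypothetical. Assuming the common prokaryotic start codons (ATG, GTG, TTG) and the standard stop codons, it scans one forward reading frame and lists every in-frame start that could pair with each stop. When a stop has more than one candidate start, a gene finder has to choose among them, which is where a bias towards upstream starts can arise.

```python
# Illustrative only: enumerate candidate start codons per stop codon
# in a single forward reading frame. Codon sets are assumptions based on
# the usual prokaryotic start codons and the standard stop codons.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_starts(seq, frame=0):
    """Return a list of (stop_index, [start_indices]) for one forward frame."""
    seq = seq.upper()
    starts = []   # in-frame starts seen since the previous stop
    results = []
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in START_CODONS:
            starts.append(i)
        elif codon in STOP_CODONS:
            if starts:
                results.append((i, starts))
            starts = []  # a stop closes the current open reading frame
    return results


if __name__ == "__main__":
    # ATG AAA GTG CCC TAA: two possible starts share the same stop.
    print(candidate_starts("ATGAAAGTGCCCTAA"))
```

In the example sequence both ATG (index 0) and GTG (index 6) are in frame with the stop at index 12, so sequence alone does not settle which one is the true start.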


Author(s):  
Yulia M. Suvorova ◽  
Eugene V. Korotkov

Abstract Triplet periodicity (TP) is a distinctive feature of the protein coding sequences of both prokaryotic and eukaryotic genomes. In this work, we explored the TP difference inside and between 45 prokaryotic genomes. We constructed two hypotheses of TP distribution on a set of coding sequences and generated artificial datasets that correspond to the hypotheses. We found that TP is more similar inside a genome than between genomes and that TP distribution inside a real genome dataset corresponds to the hypothesis which implies that a common TP pattern exists for the majority of sequences inside a genome. Additionally, we performed gene classification based on TP matrices. This classification showed that TP allows identification of the genome to which a given gene belongs with more than 85% accuracy.
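
The abstract does not define the TP matrix or the similarity measure used for classification, so the following Python sketch rests on assumptions. It takes a TP matrix to be a 3x4 table of nucleotide frequencies at each of the three codon positions, and compares two matrices by the sum of absolute differences. The names `tp_matrix` and `tp_distance` are hypothetical.

```python
# A minimal sketch, assuming a TP matrix is a 3x4 table of per-codon-position
# nucleotide frequencies. The definition used in the paper may differ.
NUCLEOTIDES = "ACGT"


def tp_matrix(seq):
    """Return a 3x4 matrix: rows are codon positions, columns are A, C, G, T."""
    counts = [[0] * 4 for _ in range(3)]
    for i, base in enumerate(seq.upper()):
        if base in NUCLEOTIDES:  # skip ambiguous symbols such as N
            counts[i % 3][NUCLEOTIDES.index(base)] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def tp_distance(a, b):
    """Sum of absolute differences between two TP matrices (assumed measure)."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    gene = tp_matrix("ATGATGATG")
    other = tp_matrix("ATGCTGATG")
    print(gene)
    print(tp_distance(gene, other))
```

Under these assumptions, a gene could be assigned to the genome whose average TP matrix lies closest to the gene's own matrix. That is one plausible reading of "classification based on TP matrices", not a description of the authors' procedure.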


2015 ◽  
Vol 376 ◽  
pp. 8-14 ◽  
Author(s):  
Jia-Feng Yu ◽  
Qing-Li Chen ◽  
Jing Ren ◽  
Yan-Ling Yang ◽  
Ji-Hua Wang ◽  
...  

2019 ◽  
Vol 8 (23) ◽  
Author(s):  
Si Chul Kim ◽  
Hyo Jung Lee

Here, we report the draft genome sequence of Pseudorhodobacter sp. strain E13, a Gram-negative, aerobic, nonflagellated, and rod-shaped bacterium which was isolated from the Yellow Sea in South Korea. The assembled genome sequence is 3,878,578 bp long with 3,646 protein-coding sequences in 159 contigs.


Insects ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 326
Author(s):  
Yu-Jun Wang ◽  
Hua-Ling Wang ◽  
Xiao-Wei Wang ◽  
Shu-Sheng Liu

Females and males often differ obviously in morphology and behavior, and the differences between sexes are the result of natural selection and/or sexual selection. To a great extent, the differences between the two sexes are the result of differential gene expression. In haplodiploid insects, this phenomenon is obvious, since males develop from unfertilized zygotes and females develop from fertilized zygotes. Whiteflies of the Bemisia tabaci species complex are typical haplodiploid insects, and some species of this complex are important pests of many crops worldwide. Here, we report the transcriptome profiles of males and females in three species of this whitefly complex. Between-species comparisons revealed that non-sex-biased genes display higher variation than male-biased or female-biased genes. Sex-biased genes evolve at a slow rate in protein coding sequences and gene expression and have a pattern of evolution that differs from those of social haplodiploid insects and diploid animals. Genes with high evolutionary rates are more related to non-sex-biased traits—such as nutrition, immune system, and detoxification—than to sex-biased traits, indicating that the evolution of protein coding sequences and gene expression has been mainly driven by non-sex-biased traits.


2018 ◽  
Vol 7 (14) ◽  
Author(s):  
Nikolay V. Volozhantsev ◽  
Angelina A. Kislichkina ◽  
Anastasia I. Lev ◽  
Ekaterina V. Solovieva ◽  
Vera P. Myakinina ◽  
...  

We report here the genome sequences of 10 Klebsiella pneumoniae strains of capsular type K2 isolated in Russia from patients in an infectious clinical hospital and neurosurgical intensive care unit. The draft genome sizes range from 5.34 to 5.87 Mb and include 5,448 to 6,137 protein-coding sequences.


2016 ◽  
Vol 4 (6) ◽  
Author(s):  
Xuehua Wan ◽  
Shaobin Hou ◽  
Kazukuni Hayashi ◽  
James Anderson ◽  
Stuart P. Donachie

Rheinheimera salexigens KH87 T is an obligately halophilic gammaproteobacterium. The strain’s draft genome sequence, generated by the Roche 454 GS FLX+ platform, comprises two scaffolds of ~3.4 Mbp and ~3 kbp, with 3,030 protein-coding sequences and 58 tRNA coding regions. The G+C content is 42 mol%.


2019 ◽  
Vol 8 (43) ◽  
Author(s):  
Zothanpuia ◽  
Ajit Kumar Passari ◽  
Purbajyoti Deka ◽  
Vinay Rajput ◽  
Lakshmi P. M. Priya ◽  
...  

We report the draft genome sequence of Streptomyces sp. strain BPSDS2, isolated from freshwater sediments in Northeast India. The draft genome has a size of 8.27 Mb and 7,559 protein-coding sequences.

