Study of triplet periodicity differences inside and between genomes

Author(s):  
Yulia M. Suvorova ◽  
Eugene V. Korotkov

Abstract Triplet periodicity (TP) is a distinctive feature of the protein coding sequences of both prokaryotic and eukaryotic genomes. In this work, we explored the TP difference inside and between 45 prokaryotic genomes. We constructed two hypotheses of TP distribution on a set of coding sequences and generated artificial datasets that correspond to the hypotheses. We found that TP is more similar inside a genome than between genomes and that TP distribution inside a real genome dataset corresponds to the hypothesis which implies that a common TP pattern exists for the majority of sequences inside a genome. Additionally, we performed gene classification based on TP matrixes. This classification showed that TP allows identification of the genome to which a given gene belongs with more than 85% accuracy.
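
The abstract above refers to TP matrixes without defining them. As a rough illustration only, one common way to summarize triplet periodicity is a 3 x 4 table that counts each nucleotide at each of the three codon positions. The sketch below assumes that definition; the function names `tp_matrix` and `tp_frequencies` are hypothetical and are not taken from the paper, whose exact matrix construction and classification method may differ.

```python
NUCLEOTIDES = "ACGT"  # column order of the matrix (an assumption of this sketch)


def tp_matrix(seq):
    """Count nucleotides by codon position.

    Returns a 3x4 list of lists: rows are codon positions 0, 1, 2
    (sequence index modulo 3), columns follow NUCLEOTIDES.
    Characters other than A, C, G, T (e.g. N) are skipped but still
    advance the position, so the reading frame is preserved.
    """
    counts = [[0] * len(NUCLEOTIDES) for _ in range(3)]
    for i, base in enumerate(seq.upper()):
        col = NUCLEOTIDES.find(base)
        if col >= 0:
            counts[i % 3][col] += 1
    return counts


def tp_frequencies(seq):
    """Normalize each row of the count matrix to relative frequencies."""
    freqs = []
    for row in tp_matrix(seq):
        total = sum(row)
        freqs.append([c / total if total else 0.0 for c in row])
    return freqs


if __name__ == "__main__":
    # A perfectly periodic toy sequence: every codon is ATG.
    for row in tp_frequencies("ATGATGATG"):
        print(row)
```

In a perfectly periodic toy sequence each row concentrates on a single nucleotide. Real coding sequences show a weaker but still position-dependent bias, which is the kind of signal a TP-based comparison of genes would rely on.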

2021 ◽  
Vol 10 (28) ◽  
Author(s):  
Ryosuke Nakai ◽  
Hiroyuki Kusada ◽  
Fumihiro Sassa ◽  
Susumu Morigasaki ◽  
Hisayoshi Hayashi ◽  
...  

We report the draft genome sequence of a novel Rhodospirillales bacterium strain, TMPK1, isolated from a micropore-filtered soil suspension. This strain has a genome of 4,249,070 bp, comprising 4,151 protein-coding sequences. The genome sequence data further suggest that strain TMPK1 is an alphaproteobacterium capable of carotenoid production.


2019 ◽  
Vol 8 (7) ◽  
Author(s):  
Juan J. Marizcurrena ◽  
Danilo Morales ◽  
Pablo Smircich ◽  
Susana Castro-Sowinski

We report the draft genome sequence of the Antarctic UV-resistant bacterium Sphingomonas sp. strain UV9. The strain has a genome size of 4.25 Mb, a 65.62% GC content, and 3,879 protein-coding sequences.


2019 ◽  
Vol 8 (27) ◽  
Author(s):  
Ji Young Jung ◽  
Jin-Woo Jeong ◽  
Seung-Young Lee ◽  
Hyun Mi Jin ◽  
Hee Won Choi ◽  
...  

ABSTRACT Leuconostoc kimchii strain NKJ218 was isolated from homemade kimchi in South Korea. The whole genome was sequenced using the PacBio RS II and Illumina NovaSeq 6000 platforms. Here, we report a genome sequence of strain NKJ218, which consists of a 1.9-Mbp chromosome and three plasmid contigs. A total of 2,005 coding sequences (CDS) were predicted, including 1,881 protein-coding sequences.


2021 ◽  
Vol 3 (3) ◽  
Author(s):  
Nicholas P Cooley ◽  
Erik S Wright

Abstract The observed diversity of protein coding sequences continues to increase far more rapidly than knowledge of their functions, making classification algorithms essential for assigning a function to proteins using only their sequence. Most pipelines for annotating proteins rely on searches for homologous sequences in databases of previously annotated proteins using BLAST or HMMER. Here, we develop a new approach for classifying proteins into a taxonomy of functions and demonstrate its utility for genome annotation. Our algorithm, IDTAXA, was more accurate than BLAST or HMMER at assigning sequences to KEGG ortholog groups. Moreover, IDTAXA correctly avoided classifying sequences with novel functions to existing groups, which is a common error mode for classification approaches that rely on E-values as a proxy for confidence. We demonstrate IDTAXA’s utility for annotating eukaryotic and prokaryotic genomes by assigning functions to proteins within a multi-level ontology, and we apply IDTAXA to detect genome contamination in eukaryotic genomes. Finally, we re-annotated 8604 microbial genomes with known antibiotic resistance phenotypes to discover two novel associations between proteins and antibiotic resistance. IDTAXA is available as a web tool (http://DECIPHER.codes/Classification.html) or as part of the open source DECIPHER R package from Bioconductor.


2006 ◽  
Vol 72 (5) ◽  
pp. 3274-3283 ◽  
Author(s):  
Agnès Cimerman ◽  
Guillaume Arnaud ◽  
Xavier Foissac

ABSTRACT Phytoplasmas are unculturable bacterial plant pathogens transmitted by phloem-feeding hemipteran insects. DNA of phytoplasmas is difficult to purify because of their exclusive phloem location and low abundance in plants. To overcome this constraint, suppression subtractive hybridization (SSH) was modified and used to selectively amplify DNA of the stolbur phytoplasma infecting a periwinkle plant. Plasmid libraries were constructed, and the origins of the DNA inserts were verified by hybridization and PCR screenings. After a single round of SSH, there was still a significant level of contamination with plant DNA (around 50%). However, the modified SSH, which included a second round of subtraction (double SSH), resulted in an increased phytoplasma DNA purity (97%). Results validated double SSH as an efficient way to produce a genome survey for microbial agents unavailable in culture. Assembly of 266 insert sequences revealed 181 phytoplasma genetic loci which were annotated. Comparative analysis of 113 kbp indicated that among 217 protein coding sequences, 83% were homologous to “Candidatus Phytoplasma asteris” (OY-M strain) genes, with hits widely distributed along the chromosome. Most of the stolbur-specific SSH sequences were orphan genes, with the exception of two partial coding sequences encoding proteins homologous to a mycoplasma surface protein and riboflavin kinase.


2015 ◽  
Vol 376 ◽  
pp. 8-14 ◽  
Author(s):  
Jia-Feng Yu ◽  
Qing-Li Chen ◽  
Jing Ren ◽  
Yan-Ling Yang ◽  
Ji-Hua Wang ◽  
...  

2017 ◽  
Vol 3 ◽  
pp. e118 ◽  
Author(s):  
Andrew E. Webb ◽  
Thomas A. Walsh ◽  
Mary J. O’Connell

Background Large-scale molecular evolutionary analyses of protein coding sequences require a number of preparatory, interrelated steps, from finding gene families to generating alignments and phylogenetic trees and assessing selective pressure variation. Each phase of these analyses can present significant challenges, particularly when working with entire proteomes (all protein coding sequences in a genome) from a large number of species. Methods We present VESPA, software capable of automating a selective pressure analysis using codeML in addition to the preparatory analyses and summary statistics. VESPA is written in Python and Perl and is designed to run within a UNIX environment. Results We have benchmarked VESPA and our results show that the method is consistent, performs well on both large-scale and smaller-scale datasets, and produces results in line with previously published datasets. Discussion Large-scale gene family identification, sequence alignment, and phylogeny reconstruction are all important aspects of large-scale molecular evolutionary analyses. VESPA provides flexible software for simplifying these processes along with downstream selective pressure variation analyses. The software automatically interprets results from codeML and produces simplified summary files to assist the user in better understanding the results. VESPA may be found at the following website: http://www.mol-evol.org/VESPA.


2019 ◽  
Vol 8 (23) ◽  
Author(s):  
Si Chul Kim ◽  
Hyo Jung Lee

Here, we report the draft genome sequence of Pseudorhodobacter sp. strain E13, a Gram-negative, aerobic, nonflagellated, and rod-shaped bacterium which was isolated from the Yellow Sea in South Korea. The assembled genome sequence is 3,878,578 bp long with 3,646 protein-coding sequences in 159 contigs.

