CANT-HYD: A Curated Database of Phylogeny-Derived Hidden Markov Models for Annotation of Marker Genes Involved in Hydrocarbon Degradation

2022 ◽  
Vol 12 ◽  
Author(s):  
Varada Khot ◽  
Jackie Zorz ◽  
Daniel A. Gittins ◽  
Anirban Chakraborty ◽  
Emma Bell ◽  
...  

Many pathways for hydrocarbon degradation have been discovered, yet there are no dedicated tools to identify and predict the hydrocarbon degradation potential of microbial genomes and metagenomes. Here we present the Calgary approach to ANnoTating HYDrocarbon degradation genes (CANT-HYD), a database of 37 HMMs of marker genes involved in anaerobic and aerobic degradation pathways of aliphatic and aromatic hydrocarbons. Using this database, we identify understudied or overlooked hydrocarbon degradation potential in many phyla. We also demonstrate its application in analyzing high-throughput sequence data by predicting hydrocarbon utilization in large metagenomic datasets from diverse environments. CANT-HYD is available at https://github.com/dgittins/CANT-HYD-HydrocarbonBiodegradation.

2021 ◽  
Author(s):  
Varada Khot ◽  
Jackie Zorz ◽  
Daniel A Gittins ◽  
Anirban Chakraborty ◽  
Emma Bell ◽  
...  

Discovery of microbial hydrocarbon degradation pathways has traditionally relied on laboratory isolation and characterization of microorganisms. Although many metabolic pathways for hydrocarbon degradation have been discovered, the absence of tools dedicated to their annotation makes it difficult to identify the relevant genes and predict the hydrocarbon degradation potential of microbial genomes and metagenomes. Furthermore, sequence homology between hydrocarbon degradation genes and genes with other functions often results in misannotation. A tool that systematically identifies hydrocarbon metabolic potential is therefore needed. We present the Calgary approach to ANnoTating HYDrocarbon degradation genes (CANT-HYD), a database containing HMMs of 37 marker genes involved in anaerobic and aerobic degradation pathways of aliphatic and aromatic hydrocarbons. Using this database, we show that hydrocarbon metabolic potential is widespread in the tree of life and identify understudied or overlooked hydrocarbon degradation potential in many phyla. We also demonstrate scalability by analyzing large metagenomic datasets for the prediction of hydrocarbon utilization in diverse environments. To the best of our knowledge, CANT-HYD is the first comprehensive tool for robust and accurate identification of marker genes associated with aerobic and anaerobic hydrocarbon degradation.


2018 ◽  
Vol 30 (1) ◽  
pp. 216-236
Author(s):  
Rasmus Troelsgaard ◽  
Lars Kai Hansen

Model-based classification of sequence data using a set of hidden Markov models is a well-known technique. The involved score function, which is often based on the class-conditional likelihood, can, however, be computationally demanding, especially for long data sequences. Inspired by recent theoretical advances in spectral learning of hidden Markov models, we propose a score function based on third-order moments. In particular, we propose to use the Kullback-Leibler divergence between theoretical and empirical third-order moments for classification of sequence data with discrete observations. The proposed method provides lower computational complexity at classification time than the usual likelihood-based methods. In order to demonstrate the properties of the proposed method, we perform classification of both simulated data and empirical data from a human activity recognition study.
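The moment-based score described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the "third-order moments" are the joint probabilities of three consecutive symbols, that each HMM is stationary, that every emission probability is positive, and that the divergence is taken as KL(empirical || theoretical); the abstract does not state the direction. All function names and the two toy models are hypothetical.

```python
import math
from itertools import product


def stationary(trans, iters=500):
    """Approximate the stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(trans)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * trans[i][j] for i in range(n)) for j in range(n)]
    return pi


def theoretical_moments(trans, emit):
    """P(x1, x2, x3) for three consecutive symbols emitted by a stationary HMM."""
    n, m = len(trans), len(emit[0])
    pi = stationary(trans)
    moments = {}
    for x1, x2, x3 in product(range(m), repeat=3):
        p = 0.0
        # Sum over all hidden-state triples (h1, h2, h3).
        for h1, h2, h3 in product(range(n), repeat=3):
            p += (pi[h1] * emit[h1][x1] * trans[h1][h2] * emit[h2][x2]
                  * trans[h2][h3] * emit[h3][x3])
        moments[(x1, x2, x3)] = p
    return moments


def empirical_moments(seq, m, pseudo=1e-3):
    """Smoothed relative frequencies of consecutive symbol triples in seq."""
    counts = {t: pseudo for t in product(range(m), repeat=3)}
    for i in range(len(seq) - 2):
        counts[tuple(seq[i:i + 3])] += 1.0
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q); assumes q is positive wherever p is."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


def classify(seq, models):
    """Return the name of the model whose theoretical moments are closest to the data."""
    m = len(next(iter(models.values()))[1][0])  # alphabet size from an emission row
    emp = empirical_moments(seq, m)
    return min(models, key=lambda name: kl(emp, theoretical_moments(*models[name])))


# Two hypothetical two-state models over the alphabet {0, 1}: (transitions, emissions).
EMIT = [[0.9, 0.1], [0.1, 0.9]]
MODELS = {
    "sticky": ([[0.9, 0.1], [0.1, 0.9]], EMIT),
    "alternating": ([[0.1, 0.9], [0.9, 0.1]], EMIT),
}
```

Scoring needs only the triple counts of the sequence, which are collected in one pass. In practice the theoretical moments would be computed once per model and cached, whereas a likelihood score runs the forward algorithm over the full sequence for every model.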


2010 ◽  
Vol 10 (4-6) ◽  
pp. 449-464
Author(s):  
HENNING CHRISTIANSEN ◽  
CHRISTIAN THEIL HAVE ◽  
OLE TORP LASSEN ◽  
MATTHIEU PETIT

Abstract: A Hidden Markov Model (HMM) is a common statistical model which is widely used for analysis of biological sequence data and other sequential phenomena. In the present paper we show how HMMs can be extended with side-constraints and present constraint solving techniques for efficient inference. Defining HMMs with side-constraints in Constraint Logic Programming has advantages in terms of more compact expression and pruning opportunities during inference. We present a PRISM-based framework for extending HMMs with side-constraints and show how well-known constraints such as cardinality and all_different are integrated. We experimentally validate our approach on the biologically motivated problem of global pairwise alignment.
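To illustrate what a cardinality side-constraint does during inference, here is a minimal Python sketch of Viterbi decoding in which one state may be visited at most a fixed number of times. It is not the PRISM framework from the paper and uses no constraint solver. The constraint is handled by tracking a visit counter alongside the state, which is one simple way to get the pruning effect the abstract mentions. The function name and all parameters are hypothetical.

```python
import math


def constrained_viterbi(obs, states, start, trans, emit, limited, max_visits):
    """Most probable state path in which `limited` occurs at most `max_visits` times.

    Returns (log_probability, path). Assumes at least one path satisfies the constraint.
    """
    def lp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[(state, count)] = (log-probability, path); count = visits to `limited` so far.
    best = {}
    for s in states:
        c = 1 if s == limited else 0
        if c <= max_visits:
            best[(s, c)] = (lp(start[s]) + lp(emit[s][obs[0]]), [s])

    for o in obs[1:]:
        nxt = {}
        for (prev, c), (score, path) in best.items():
            for s in states:
                c2 = c + (1 if s == limited else 0)
                if c2 > max_visits:
                    continue  # prune: extending this path would violate the constraint
                cand = score + lp(trans[prev][s]) + lp(emit[s][o])
                if (s, c2) not in nxt or cand > nxt[(s, c2)][0]:
                    nxt[(s, c2)] = (cand, path + [s])
        best = nxt

    return max(best.values(), key=lambda v: v[0])
```

Setting `max_visits` to the sequence length makes the constraint inactive, and the result is then an ordinary Viterbi path. Tighter limits shrink the set of (state, count) pairs that survive each step.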


2019 ◽  
Author(s):  
Rodney T. Richardson ◽  
Douglas B. Sponsler ◽  
Harper McMinn-Sauder ◽  
Reed M. Johnson

Summary: The community-level analysis of samples containing diverse genetic material, via metabarcoding and metagenomic approaches, is increasingly popular. While the production of sequence data for such studies has become straightforward, questions remain about how best to analyze and taxonomically characterize sequence data. For many sequence classification approaches, an important component of the workflow involves the curation of reference sequences. Ideally, this involves trimming away extraneous sequence at the 3 prime and 5 prime ends of the target marker of interest, as well as the removal of reference sequence duplicates. Here, we present MetaCurator, a software package written in Python, designed for automated reference sequence curation and highly generalizable across markers and study systems. MetaCurator is organized in a modular fashion, so users can implement tools individually in addition to utilizing the automated and flexible MetaCurator parental code. Aside from modules used to organize and format taxonomic lineage data, MetaCurator contains two signature tools. IterRazor utilizes profile hidden Markov models and an iterative search framework to exhaustively identify and extract the precise amplicon marker of interest from available reference sequence data. DerepByTaxonomy then facilitates sequence dereplication using a taxonomically aware approach, removing duplicates only when they belong to the same taxon. This is important for cases of incomplete lineage sorting between species and for highly conserved markers, such as plant rbcL and trnL, which often display no sequence divergence across taxa, even at the genus level.
Availability and implementation: MetaCurator is supported on OSX and Linux (RedHat/CentOS) and is freely available under a GPL v3.0 license at https://github.com/RTRichar/[email protected] information: Code associated with this work is available at https://github.com/RTRichar/MetabarcodeDBsV2 and additional analysis is presented in supplementary files.
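The idea of taxonomically aware dereplication can be sketched in a few lines of Python. This is an illustration of the principle only, not MetaCurator's DerepByTaxonomy code. The record layout `(seq_id, taxon, sequence)` and the function name are hypothetical, and the rank at which "same taxon" is judged is left to whatever label the caller supplies.

```python
def derep_by_taxonomy(records):
    """Keep the first record seen for each (sequence, taxon) pair.

    `records` is an iterable of (seq_id, taxon, sequence) tuples. Identical
    sequences are treated as duplicates only when their taxon labels match,
    so a sequence shared by two taxa is retained once for each of them.
    """
    seen = set()
    kept = []
    for seq_id, taxon, sequence in records:
        key = (sequence.upper(), taxon)  # case-insensitive sequence comparison
        if key in seen:
            continue
        seen.add(key)
        kept.append((seq_id, taxon, sequence))
    return kept
```

With plain sequence-level dereplication, a conserved marker shared by two species would collapse to one reference entry and lose one of the taxonomic labels. Keying on the pair keeps both.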


Viruses ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 195 ◽  
Author(s):  
Cynthia Maria Chibani ◽  
Anton Farr ◽  
Sandra Klama ◽  
Sascha Dietrich ◽  
Heiko Liesegang

This work reports ClassiPhage, a method to classify phage genomes using sequence-derived taxonomic features. ClassiPhage uses a set of phage-specific Hidden Markov Models (HMMs) generated from clusters of related proteins. The method was validated on all publicly available genomes of phages that are known to infect Vibrionaceae. The phages belong to the well-described phage families of Myoviridae, Podoviridae, Siphoviridae, and Inoviridae. The achieved classification is consistent with the assignments of the International Committee on Taxonomy of Viruses (ICTV): all tested phages were assigned to the corresponding group of the ICTV database. In addition, 44 out of 58 genomes of Vibrio phages not yet classified could be assigned to a phage family. The remaining 14 genomes may represent phages of new families or subfamilies. Comparative genomics indicates that the ability of the approach to identify and classify phages is correlated with the conserved genomic organization. ClassiPhage classifies phages exclusively based on genome sequence data and can be applied to distinct phage genomes as well as to prophage regions within host genomes. Possible applications include (a) classifying phages from assembled metagenomes; and (b) the identification and classification of integrated prophages and the splitting of phage families into subfamilies.


PLoS ONE ◽  
2014 ◽  
Vol 9 (8) ◽  
pp. e105067 ◽  
Author(s):  
Peter Skewes-Cox ◽  
Thomas J. Sharpton ◽  
Katherine S. Pollard ◽  
Joseph L. DeRisi

2006 ◽  
Vol 17 (4) ◽  
pp. 1734-1743 ◽  
Author(s):  
Bill Wickstead ◽  
Keith Gull

Kinesin superfamily proteins are ubiquitous to all eukaryotes and essential for several key cellular processes. With the establishment of genome sequence data for a substantial number of eukaryotes, it is now possible for the first time to analyze the complete kinesin repertoires of a diversity of organisms from most eukaryotic kingdoms. Such a “holistic” approach, using 486 kinesin-like sequences from 19 eukaryotes analyzed by Bayesian techniques, identifies three new kinesin families and two new phylum-specific groups, and unites two previously identified families. The paralogue distribution suggests that the eukaryotic cenancestor possessed nearly all kinesin families. However, multiple losses in individual lineages mean that no family is ubiquitous to all organisms and that the present-day distribution reflects common biology more than it does common ancestry. In particular, the distribution of four families (Kinesin-2, -9, and the proposed new families Kinesin-16 and -17) correlates with the possession of cilia/flagella, and this can be used to predict a flagellar function for two new kinesin families. Finally, we present a set of hidden Markov models that can reliably place most new kinesin sequences into families, even when from an organism at a great evolutionary distance from those in the analysis.

