CANT-HYD: A Curated Database of Phylogeny-Derived Hidden Markov Models for Annotation of Marker Genes Involved in Hydrocarbon Degradation

Many pathways for hydrocarbon degradation have been discovered, yet there are no dedicated tools to identify and predict the hydrocarbon degradation potential of microbial genomes and metagenomes. Here we present the Calgary approach to ANnoTating HYDrocarbon degradation genes (CANT-HYD), a database of 37 HMMs of marker genes involved in anaerobic and aerobic degradation pathways of aliphatic and aromatic hydrocarbons. Using this database, we identify understudied or overlooked hydrocarbon degradation potential in many phyla. We also demonstrate its application in analyzing high-throughput sequence data by predicting hydrocarbon utilization in large metagenomic datasets from diverse environments. CANT-HYD is available at https://github.com/dgittins/CANT-HYD-HydrocarbonBiodegradation.

CANT-HYD: A curated database of phylogeny-derived Hidden Markov Models for annotation of marker genes involved in hydrocarbon degradation

10.1101/2021.06.10.447808 ◽

2021 ◽

Author(s):

Varada Khot ◽

Jackie Zorz ◽

Daniel A Gittins ◽

Anirban Chakraborty ◽

Emma Bell ◽

...

Keyword(s):

Markov Models ◽

Hydrocarbon Degradation ◽

Marker Genes ◽

Degradation Pathways ◽

Metabolic Potential ◽

Accurate Identification ◽

Isolation And Characterization ◽

Anaerobic Hydrocarbon Degradation ◽

Aliphatic And Aromatic Hydrocarbons

Discovery of microbial hydrocarbon degradation pathways has traditionally relied on laboratory isolation and characterization of microorganisms. Although many metabolic pathways for hydrocarbon degradation have been discovered, the absence of tools dedicated to their annotation makes it difficult to identify the relevant genes and predict the hydrocarbon degradation potential of microbial genomes and metagenomes. Furthermore, sequence homology between hydrocarbon degradation genes and genes with other functions often results in misannotation. A tool that systematically identifies hydrocarbon metabolic potential is therefore needed. We present the Calgary approach to ANnoTating HYDrocarbon degradation genes (CANT-HYD), a database containing HMMs of 37 marker genes involved in anaerobic and aerobic degradation pathways of aliphatic and aromatic hydrocarbons. Using this database, we show that hydrocarbon metabolic potential is widespread in the tree of life and identify understudied or overlooked hydrocarbon degradation potential in many phyla. We also demonstrate scalability by analyzing large metagenomic datasets for the prediction of hydrocarbon utilization in diverse environments. To the best of our knowledge, CANT-HYD is the first comprehensive tool for robust and accurate identification of marker genes associated with aerobic and anaerobic hydrocarbon degradation.
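As a rough illustration of how HMM hits might be screened to limit the misannotation described above, the sketch below assumes the models were searched with HMMER's hmmsearch and the results written in its tabular (--tblout) format. The model names, bit-score cutoffs, gene identifiers, and example rows are placeholders invented for illustration, not values taken from CANT-HYD.

```python
from typing import Dict, List, Tuple

# Hypothetical per-model bit-score cutoffs (assumption: NOT the database's values).
CUTOFFS: Dict[str, float] = {"AlkB": 200.0, "AssA": 500.0}


def parse_tblout(text: str) -> List[Tuple[str, str, float]]:
    """Return (target, query_model, full_sequence_bit_score) for each hit row."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        # tblout columns: 0 target name, 1 accession, 2 query name,
        # 3 accession, 4 full-sequence E-value, 5 full-sequence score, ...
        hits.append((fields[0], fields[2], float(fields[5])))
    return hits


def confident_hits(hits, cutoffs):
    """Keep hits whose bit score meets the cutoff of the model that matched."""
    return [h for h in hits if h[1] in cutoffs and h[2] >= cutoffs[h[1]]]


# Made-up rows in tblout layout; the second row mimics a weak hit to a homolog.
EXAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias ...
gene_001 - AlkB - 1.2e-80 265.3 0.1 2.0e-80 264.6 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein
gene_002 - AlkB - 3.0e-10 41.7 0.0 4.1e-10 41.3 0.0 1.1 1 0 0 1 1 1 1 fatty acid desaturase
gene_003 - AssA - 0 812.9 0.0 0 812.7 0.0 1.0 1 0 0 1 1 1 1 glycyl radical enzyme
"""

kept = confident_hits(parse_tblout(EXAMPLE), CUTOFFS)
for target, model, score in kept:
    print(target, model, score)
```

Filtering on a per-model score rather than one global E-value is one plausible way to separate true marker genes from homologs with other functions; the thresholds a real analysis should use would have to come from the database documentation.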

Sequence Classification Using Third-Order Moments

Neural Computation ◽

10.1162/neco_a_01033 ◽

2018 ◽

Vol 30 (1) ◽

pp. 216-236

Author(s):

Rasmus Troelsgaard ◽

Lars Kai Hansen

Keyword(s):

Hidden Markov Models ◽

Markov Models ◽

Sequence Data ◽

Hidden Markov ◽

Score Function ◽

Simulated Data ◽

Discrete Observations ◽

Third Order ◽

Leibler Divergence

Model-based classification of sequence data using a set of hidden Markov models is a well-known technique. The involved score function, which is often based on the class-conditional likelihood, can, however, be computationally demanding, especially for long data sequences. Inspired by recent theoretical advances in spectral learning of hidden Markov models, we propose a score function based on third-order moments. In particular, we propose to use the Kullback-Leibler divergence between theoretical and empirical third-order moments for classification of sequence data with discrete observations. The proposed method provides lower computational complexity at classification time than the usual likelihood-based methods. In order to demonstrate the properties of the proposed method, we perform classification of both simulated data and empirical data from a human activity recognition study.
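The idea of scoring a sequence by comparing third-order moments can be sketched as follows. This is a brute-force illustration under several assumptions, not the authors' algorithm: the theoretical moment is taken to be the joint probability of three consecutive symbols with the initial distribution assumed stationary, the empirical moment is a smoothed count of sliding triples, and the divergence is computed as KL(empirical || theoretical). The two toy models and the `pseudo` smoothing constant are invented for the example.

```python
import math
from collections import Counter
from itertools import product


def theoretical_moments(pi, A, B):
    """P(x1, x2, x3) under an HMM (pi: initial, A: transitions, B: emissions)."""
    n_states, n_symbols = len(pi), len(B[0])
    moments = {}
    for x1, x2, x3 in product(range(n_symbols), repeat=3):
        total = 0.0
        for h1, h2, h3 in product(range(n_states), repeat=3):
            total += (pi[h1] * B[h1][x1] * A[h1][h2] * B[h2][x2]
                      * A[h2][h3] * B[h3][x3])
        moments[(x1, x2, x3)] = total
    return moments


def empirical_moments(seq, n_symbols, pseudo=1e-3):
    """Smoothed relative frequencies of consecutive symbol triples."""
    counts = Counter(zip(seq, seq[1:], seq[2:]))
    total = (len(seq) - 2) + pseudo * n_symbols ** 3
    return {t: (counts[t] + pseudo) / total
            for t in product(range(n_symbols), repeat=3)}


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) over a shared finite support."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


def classify(seq, models, n_symbols):
    """Return the name of the model whose moments are closest to the data."""
    emp = empirical_moments(seq, n_symbols)
    scores = {name: kl(emp, theoretical_moments(*params))
              for name, params in models.items()}
    return min(scores, key=scores.get)


EMIT = [[0.9, 0.1], [0.1, 0.9]]
MODELS = {
    "sticky": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], EMIT),  # states persist
    "flippy": ([0.5, 0.5], [[0.1, 0.9], [0.9, 0.1]], EMIT),  # states alternate
}

print(classify([0] * 10 + [1] * 10, MODELS, 2))
print(classify([0, 1] * 10, MODELS, 2))
```

In practice the theoretical moments depend only on the model, so they could be computed once per class and reused; only the triple counts and the divergence would then be evaluated for each new sequence.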

Inference with constrained hidden Markov models in PRISM

Theory and Practice of Logic Programming ◽

10.1017/s1471068410000219 ◽

2010 ◽

Vol 10 (4-6) ◽

pp. 449-464

Author(s):

HENNING CHRISTIANSEN ◽

CHRISTIAN THEIL HAVE ◽

OLE TORP LASSEN ◽

MATTHIEU PETIT

Keyword(s):

Markov Model ◽

Statistical Model ◽

Markov Models ◽

Sequence Data ◽

Hidden Markov ◽

Pairwise Alignment ◽

Constraint Solving ◽

Biological Sequence ◽

Side Constraints ◽

Compact Expression

A Hidden Markov Model (HMM) is a common statistical model that is widely used for analysis of biological sequence data and other sequential phenomena. In the present paper we show how HMMs can be extended with side-constraints and present constraint solving techniques for efficient inference. Defining HMMs with side-constraints in Constraint Logic Programming has advantages in terms of more compact expression and pruning opportunities during inference. We present a PRISM-based framework for extending HMMs with side-constraints and show how well-known constraints such as cardinality and all_different are integrated. We experimentally validate our approach on the biologically motivated problem of global pairwise alignment.
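To make the notion of a side-constraint concrete, the sketch below runs Viterbi decoding under a cardinality constraint: one chosen state may appear at most a given number of times in the decoded path. It is written in plain Python rather than PRISM and is not the framework from the paper; the constraint is handled by tracking a usage counter in the dynamic-programming state and pruning any extension that would exceed the limit. All states, probabilities, and observations are invented for illustration.

```python
def constrained_viterbi(obs, states, start, trans, emit, limited, max_count):
    """Most probable path in which `limited` occurs at most `max_count` times."""
    # best[(state, count)] = (probability, path) for the best path ending in
    # `state` that has used `limited` exactly `count` times so far.
    best = {}
    for s in states:
        c = 1 if s == limited else 0
        if c <= max_count:
            best[(s, c)] = (start[s] * emit[s][obs[0]], [s])
    for o in obs[1:]:
        nxt = {}
        for (prev, c), (p, path) in best.items():
            for s in states:
                c2 = c + (1 if s == limited else 0)
                if c2 > max_count:
                    continue  # prune: cardinality constraint would be violated
                cand = p * trans[prev][s] * emit[s][o]
                if (s, c2) not in nxt or cand > nxt[(s, c2)][0]:
                    nxt[(s, c2)] = (cand, path + [s])
        best = nxt
    return max(best.values(), key=lambda v: v[0]) if best else (0.0, [])


STATES = ["A", "B"]
START = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
EMIT = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
OBS = ["y", "y", "y", "x"]

# A limit equal to the sequence length leaves the decoding unconstrained.
free_p, free_path = constrained_viterbi(OBS, STATES, START, TRANS, EMIT, "B", len(OBS))
tight_p, tight_path = constrained_viterbi(OBS, STATES, START, TRANS, EMIT, "B", 1)
print(free_path, tight_path)
```

Carrying the counter in the search state is what lets infeasible partial paths be discarded early, which is the kind of pruning opportunity the abstract attributes to constraint-based formulations.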

Learning Frequent Episodes Based Hierarchical Hidden Markov Models in Sequence Data

Communications in Computer and Information Science - Advanced Research on Computer Science and Information Engineering ◽

10.1007/978-3-642-21411-0_19 ◽

2011 ◽

pp. 120-124 ◽

Cited By ~ 1

Author(s):

Li Wan

Keyword(s):

Hidden Markov Models ◽

Markov Models ◽

Sequence Data ◽

Hidden Markov ◽

Frequent Episodes ◽

Hierarchical Hidden Markov Models

Identifying Virus-Like Regions in Microbial Genomes Using Hidden Markov Models

10.1007/978-3-030-67742-8_17 ◽

2021 ◽

pp. 263-270

Author(s):

Frank O. Aylward

Keyword(s):

Hidden Markov Models ◽

Markov Models ◽

Hidden Markov ◽

Microbial Genomes

Combining Sequence Analysis and Hidden Markov Models in the Analysis of Complex Life Sequence Data

Life Course Research and Social Policies - Sequence Analysis and Related Approaches ◽

10.1007/978-3-319-95420-2_11 ◽

2018 ◽

pp. 185-200 ◽

Cited By ~ 4

Author(s):

Satu Helske ◽

Jouni Helske ◽

Mervi Eerola

Keyword(s):

Sequence Analysis ◽

Hidden Markov Models ◽

Markov Models ◽

Sequence Data ◽

Hidden Markov

MetaCurator: A hidden Markov model-based toolkit for extracting and curating sequences from taxonomically-informative genetic markers

10.1101/672782 ◽

2019 ◽

Cited By ~ 1

Author(s):

Rodney T. Richardson ◽

Douglas B. Sponsler ◽

Harper McMinn-Sauder ◽

Reed M. Johnson

Keyword(s):

Markov Models ◽

Sequence Data ◽

Hidden Markov ◽

Incomplete Lineage Sorting ◽

Genetic Material ◽

Sequence Divergence ◽

Reference Sequence ◽

Sequence Classification ◽

Lineage Sorting ◽

Link Type

Summary: The community-level analysis of samples containing diverse genetic material, via metabarcoding and metagenomic approaches, is increasingly popular. While the production of sequence data for such studies has become straightforward, questions remain about how best to analyze and taxonomically characterize sequence data. For many sequence classification approaches, an important component of the workflow involves the curation of reference sequences. Ideally, this involves trimming away extraneous sequence at the 3 prime and 5 prime ends of the target marker of interest, as well as the removal of reference sequence duplicates. Here, we present MetaCurator, a software package written in Python, designed for automated reference sequence curation and highly generalizable across markers and study systems. MetaCurator is organized in a modular fashion, so users can implement tools individually in addition to utilizing the automated and flexible MetaCurator parental code. Aside from modules used to organize and format taxonomic lineage data, MetaCurator contains two signature tools. IterRazor utilizes profile hidden Markov models and an iterative search framework to exhaustively identify and extract the precise amplicon marker of interest from available reference sequence data. DerepByTaxonomy then facilitates sequence dereplication using a taxonomically aware approach, removing duplicates only when they belong to the same taxon. This is important for cases of incomplete lineage sorting between species and for highly conserved markers, such as plant rbcL and trnL, which often display no sequence divergence across taxa, even at the genus level. Availability and implementation: MetaCurator is supported on OSX and Linux (RedHat/CentOS) and is freely available under a GPL v3.0 license at https://github.com/RTRichar/[email protected] information: Code associated with this work is available at https://github.com/RTRichar/MetabarcodeDBsV2 and additional analysis is presented in supplementary files.
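The taxonomically aware dereplication described above can be illustrated with a minimal sketch. It is not MetaCurator's implementation: it assumes each reference record is a tuple of identifier, taxon label, and sequence, and that "duplicate" means an exact (case-insensitive) sequence match. The record identifiers and taxon names are hypothetical.

```python
def dereplicate_by_taxon(records):
    """Drop a record only if the same sequence was already kept for its taxon."""
    seen = set()
    kept = []
    for rec_id, taxon, seq in records:
        key = (taxon, seq.upper())  # identical sequence AND identical taxon
        if key not in seen:
            seen.add(key)
            kept.append((rec_id, taxon, seq))
    return kept


# Hypothetical reference records: (identifier, taxon, marker sequence).
RECORDS = [
    ("ref1", "Genus_a species_1", "ACGTACGT"),
    ("ref2", "Genus_a species_1", "acgtacgt"),  # same taxon, same sequence
    ("ref3", "Genus_a species_2", "ACGTACGT"),  # same sequence, other taxon
    ("ref4", "Genus_a species_1", "ACGTTCGT"),  # same taxon, new sequence
]

for rec in dereplicate_by_taxon(RECORDS):
    print(rec[0], rec[1])
```

Keying on the taxon as well as the sequence is what preserves `ref3`: a plain sequence-only dereplication would discard it and lose the information that two species share an identical marker.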

Classifying the Unclassified: A Phage Classification Method

Viruses ◽

10.3390/v11020195 ◽

2019 ◽

Vol 11 (2) ◽

pp. 195 ◽

Cited By ~ 10

Author(s):

Cynthia Maria Chibani ◽

Anton Farr ◽

Sandra Klama ◽

Sascha Dietrich ◽

Heiko Liesegang

Keyword(s):

Comparative Genomics ◽

Hidden Markov Models ◽

Genome Sequence ◽

Genomic Organization ◽

Markov Models ◽

Sequence Data ◽

Hidden Markov ◽

International Committee ◽

Related Proteins

This work reports the method ClassiPhage to classify phage genomes using sequence-derived taxonomic features. ClassiPhage uses a set of phage-specific Hidden Markov Models (HMMs) generated from clusters of related proteins. The method was validated on all publicly available genomes of phages that are known to infect Vibrionaceae. The phages belong to the well-described phage families of Myoviridae, Podoviridae, Siphoviridae, and Inoviridae. The achieved classification is consistent with the assignments of the International Committee on Taxonomy of Viruses (ICTV); all tested phages were assigned to the corresponding group of the ICTV database. In addition, 44 out of 58 genomes of Vibrio phages not yet classified could be assigned to a phage family. The remaining 14 genomes may represent phages of new families or subfamilies. Comparative genomics indicates that the ability of the approach to identify and classify phages is correlated with conserved genomic organization. ClassiPhage classifies phages exclusively based on genome sequence data and can be applied to distinct phage genomes as well as to prophage regions within host genomes. Possible applications include (a) classifying phages from assembled metagenomes; and (b) the identification and classification of integrated prophages and the splitting of phage families into subfamilies.

Profile Hidden Markov Models for the Detection of Viruses within Metagenomic Sequence Data

PLoS ONE ◽

10.1371/journal.pone.0105067 ◽

2014 ◽

Vol 9 (8) ◽

pp. e105067 ◽

Cited By ~ 73

Author(s):

Peter Skewes-Cox ◽

Thomas J. Sharpton ◽

Katherine S. Pollard ◽

Joseph L. DeRisi

Keyword(s):

Hidden Markov Models ◽

Markov Models ◽

Sequence Data ◽

Hidden Markov ◽

Metagenomic Sequence ◽

Profile Hidden Markov Models ◽

Metagenomic Sequence Data

A “Holistic” Kinesin Phylogeny Reveals New Kinesin Families and Predicts Protein Functions

Molecular Biology of the Cell ◽

10.1091/mbc.e05-11-1090 ◽

2006 ◽

Vol 17 (4) ◽

pp. 1734-1743 ◽

Cited By ~ 101

Author(s):

Bill Wickstead ◽

Keith Gull

Keyword(s):

Markov Models ◽

Sequence Data ◽

Hidden Markov ◽

Holistic Approach ◽

Common Ancestry ◽

Bayesian Techniques ◽

Cellular Processes ◽

Protein Functions ◽

Kinesin Superfamily ◽

First Time

Kinesin superfamily proteins are ubiquitous to all eukaryotes and essential for several key cellular processes. With the establishment of genome sequence data for a substantial number of eukaryotes, it is now possible for the first time to analyze the complete kinesin repertoires of a diversity of organisms from most eukaryotic kingdoms. Such a “holistic” approach, using 486 kinesin-like sequences from 19 eukaryotes analyzed by Bayesian techniques, identifies three new kinesin families and two new phylum-specific groups, and unites two previously identified families. The paralogue distribution suggests that the eukaryotic cenancestor possessed nearly all kinesin families. However, multiple losses in individual lineages mean that no family is ubiquitous to all organisms and that the present-day distribution reflects common biology more than it does common ancestry. In particular, the distribution of four families (Kinesin-2, -9, and the proposed new families Kinesin-16 and -17) correlates with the possession of cilia/flagella, and this can be used to predict a flagellar function for two new kinesin families. Finally, we present a set of hidden Markov models that can reliably place most new kinesin sequences into families, even when from an organism at a great evolutionary distance from those in the analysis.
