A new algorithm to train hidden Markov models for biological sequences with partial labels

2021, Vol 22 (1)
Author(s): Jiefu Li, Jung-Youn Lee, Li Liao

Abstract

Background: Hidden Markov models (HMMs) are a powerful tool for analyzing biological sequences in a wide variety of applications, from profiling functional protein families to identifying functional domains. HMMs are normally trained either by maximum likelihood using counting, when sequences are labelled, or by expectation maximization, such as the Baum–Welch algorithm, when sequences are unlabelled. Increasingly, however, sequences are only partially labelled. In this paper, we design a new training method based on the Baum–Welch algorithm to train HMMs for biological problems in which only partial labelling is available.

Results: Compared with a similar, previously reported method designed for active learning in text mining, our method achieves significant improvements in model training, as demonstrated by higher accuracy when the trained models are tested for decoding on both synthetic and real data.

Conclusions: A novel training method is developed to improve the training of hidden Markov models by utilizing partially labelled data. The method will aid the detection of de novo motifs and signals in biological sequence data. In particular, it will be deployed in active learning mode in ongoing research on detecting plasmodesmata targeting signals, and its performance will be assessed with validation from wet-lab experiments.
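The abstract does not spell out the training algorithm, so the following is only a minimal sketch of one generic way partial labels can enter a Baum–Welch-style computation, not the authors' method. It assumes that a labelled position simply rules out every other hidden state in the forward recursion. The function name `forward_partial` and all model numbers are hypothetical.

```python
def forward_partial(pi, A, B, obs, labels):
    """Joint probability of the observations and the known labels.

    pi[i]     initial probability of hidden state i
    A[i][j]   transition probability from state i to state j
    B[i][o]   probability that state i emits symbol o
    labels[t] known state index at position t, or None if unlabelled

    Assumption: a label forbids all other states at that position.
    """
    n = len(pi)

    def allowed(t, state):
        return labels[t] is None or labels[t] == state

    # Initialise, zeroing states that contradict a label at position 0.
    alpha = [pi[i] * B[i][obs[0]] if allowed(0, i) else 0.0 for i in range(n)]

    # Standard forward recursion with the same masking at every step.
    for t in range(1, len(obs)):
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
            if allowed(t, j) else 0.0
            for j in range(n)
        ]
    return sum(alpha)


# Toy two-state model; every number here is made up for illustration.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]

unlabelled = forward_partial(pi, A, B, obs, [None, None, None])
partly = forward_partial(pi, A, B, obs, [None, 1, None])
fully = forward_partial(pi, A, B, obs, [0, 1, 0])
print(unlabelled, partly, fully)
```

With no labels the function reduces to the ordinary forward likelihood used by Baum–Welch, and with every position labelled it reduces to the probability of a single state path, which is the quantity that counting-based maximum likelihood works with. Partial labelling sits between these two extremes.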

2018, Vol 30 (1), pp. 216-236
Author(s): Rasmus Troelsgaard, Lars Kai Hansen

Model-based classification of sequence data using a set of hidden Markov models is a well-known technique. The involved score function, which is often based on the class-conditional likelihood, can, however, be computationally demanding, especially for long data sequences. Inspired by recent theoretical advances in spectral learning of hidden Markov models, we propose a score function based on third-order moments. In particular, we propose to use the Kullback-Leibler divergence between theoretical and empirical third-order moments for classification of sequence data with discrete observations. The proposed method provides lower computational complexity at classification time than the usual likelihood-based methods. In order to demonstrate the properties of the proposed method, we perform classification of both simulated data and empirical data from a human activity recognition study.
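As a rough illustration of the scoring idea described in this abstract, the sketch below compares the empirical distribution of consecutive symbol triples in a sequence with the triple distribution implied by each candidate HMM, and picks the model with the smallest Kullback-Leibler divergence. It is a brute-force toy, not the authors' spectral formulation: the direction of the divergence (empirical relative to theoretical), the use of `pi` as the distribution of the first state in each triple, and all names and model parameters are assumptions made here.

```python
from collections import Counter
from itertools import product
from math import log


def theoretical_triples(pi, A, B, n_symbols):
    """P(x1=a, x2=b, x3=c) under the HMM, summed over all state paths."""
    n = len(pi)
    return {
        (a, b, c): sum(
            pi[i] * B[i][a] * A[i][j] * B[j][b] * A[j][k] * B[k][c]
            for i in range(n) for j in range(n) for k in range(n)
        )
        for a, b, c in product(range(n_symbols), repeat=3)
    }


def empirical_triples(seq):
    """Relative frequency of each consecutive symbol triple in seq."""
    counts = Counter(zip(seq, seq[1:], seq[2:]))
    total = sum(counts.values())
    return {triple: c / total for triple, c in counts.items()}


def kl_score(seq, model, n_symbols):
    """KL(empirical || theoretical); assumes the model gives every
    observed triple a strictly positive probability."""
    pi, A, B = model
    theo = theoretical_triples(pi, A, B, n_symbols)
    emp = empirical_triples(seq)
    return sum(q * log(q / theo[t]) for t, q in emp.items())


def classify(seq, models, n_symbols):
    """Return the name of the model with the lowest divergence."""
    return min(models, key=lambda name: kl_score(seq, models[name], n_symbols))


# Two hypothetical classes: one tends to stay in a state, one to switch.
emit = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "sticky": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], emit),
    "flip": ([0.5, 0.5], [[0.1, 0.9], [0.9, 0.1]], emit),
}
runs = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
alternating = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(classify(runs, models, 2), classify(alternating, models, 2))
```

In this sketch the theoretical triple table depends only on the model, so it could be computed once per class and reused for every sequence to be classified.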


2005, Vol 03 (06), pp. 1441-1460
Author(s): Steinar Thorvaldsen

Hidden Markov Models (HMMs) can be extremely useful tools for the analysis of data from biological sequences, and provide a probabilistic model of protein families. Most reviews and general introductions follow the excellent tutorial by Rabiner [1], where the focus is outside biology. Mendel's famous experiments in plant hybridisation were published in 1866 and are often considered the icebreaking work of modern genetics. He had no prior knowledge of the dual nature of genes, but through a series of experiments he was able to anticipate the hidden concept and name it "Elemente". In this paper we present the background, theory and algorithms of HMMs based on examples from Mendel's experiments, and introduce the toolbox "mendelHMM". This approach is considered to have some intuitive advantages in a biological and bioinformatical setting. Applications to analysing bio-sequences such as nucleic acids and proteins are also discussed.

