Dynamic order Markov model for categorical sequence clustering

2021, Vol 8 (1)
Author(s): Rongbo Chen, Haojun Sun, Lifei Chen, Jianfei Zhang, Shengrui Wang

Abstract: Markov models are extensively used for categorical sequence clustering and classification due to their inherent ability to capture complex chronological dependencies hidden in sequential data. Existing Markov models are based on an implicit assumption that the probability of the next state depends on the preceding context/pattern, which consists of consecutive states. This restriction hampers the models, since some patterns, disrupted by noise, may not be frequent enough in consecutive form but are frequent in sparse form, so the models cannot make full use of the information hidden in the sequential data. A sparse pattern is a pattern in which one or more of the states between the first and last state are replaced by wildcards that can be matched by a subset of values in the state set. In this paper, we propose a new model that generalizes the conventional Markov approach, making it capable of dealing with sparse patterns and handling their length adaptively, i.e., allowing variable-length patterns with a variable number of wildcards. The model, named Dynamic order Markov model (DOMM), allows deriving a new similarity measure between a sequence and a set of sequences/cluster. DOMM builds a sparse pattern from sub-frequent patterns that contain significant statistical information veiled by the noise. To implement DOMM, we propose a sparse pattern detector (SPD) based on the probability suffix tree (PST) capable of discovering both sparse and consecutive patterns, and then we develop a divisive clustering algorithm, named DMSC, for Dynamic order Markov model for categorical sequence clustering. Experimental results on real-world datasets demonstrate the promising performance of the proposed model.
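The difference between a consecutive and a sparse pattern can be illustrated with a small counting sketch. This is not the authors' SPD or PST implementation; the function name `count_pattern`, the toy sequences, and the choice to represent a wildcard as a Python set of allowed symbols are all assumptions made for illustration.

```python
def count_pattern(sequences, pattern):
    """Count the windows in `sequences` that match `pattern`.

    Each pattern slot is either a single symbol (exact match) or a set of
    allowed symbols (a wildcard restricted to a subset of the state set).
    """
    def matches(symbol, slot):
        if isinstance(slot, (set, frozenset)):
            return symbol in slot
        return symbol == slot

    k = len(pattern)
    total = 0
    for seq in sequences:
        # Slide a window of the pattern's length over the sequence.
        for i in range(len(seq) - k + 1):
            if all(matches(seq[i + j], pattern[j]) for j in range(k)):
                total += 1
    return total


# Toy data (hypothetical): noise replaces the middle state in some sequences.
sequences = ["ABCD", "AXCD", "AYCE", "ABCE"]
consecutive = ["A", "B", "C"]            # strict consecutive pattern
sparse = ["A", {"B", "X", "Y"}, "C"]     # middle state replaced by a wildcard

consecutive_count = count_pattern(sequences, consecutive)
sparse_count = count_pattern(sequences, sparse)
```

On this toy data the consecutive pattern matches two windows while the sparse pattern matches four, which is the kind of frequency gain under noise that the abstract describes.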

Author(s): Zekun Xu, Ye Liu

The hidden Markov model (HMM) has been a popular choice for financial time series modeling due to its advantage in capturing dynamic regimes. However, the HMM's implicit assumption that the state duration follows a geometric distribution is too strong to hold in practice. In this work, we propose a regularized vector autoregressive hidden semi-Markov model to analyze multivariate financial time series. One challenge in such a model setting is that the number of parameters is too large to be reliably estimated unless the time series is extremely long. To address this issue, an augmented EM algorithm is developed for parameter estimation by using regularized estimators for the state-dependent covariance matrices and autoregression matrices in the M-step. The performance of the proposed model is evaluated in a simulation experiment and demonstrated with New York Stock Exchange financial portfolio data.
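The geometric-duration assumption mentioned above follows directly from the self-transition probability of a state: if a state returns to itself with probability a at each step, the chance of staying exactly d steps is a^(d-1)(1-a). The sketch below only tabulates that standard fact; the function names and the value a = 0.9 are illustrative assumptions, not taken from the paper.

```python
def duration_pmf(self_transition_prob, max_duration):
    """P(duration = d) for d = 1..max_duration in an ordinary HMM state."""
    a = self_transition_prob
    return [a ** (d - 1) * (1 - a) for d in range(1, max_duration + 1)]


def expected_duration(self_transition_prob):
    """Mean of the geometric duration distribution, 1 / (1 - a)."""
    return 1.0 / (1.0 - self_transition_prob)


a = 0.9  # hypothetical self-transition probability
pmf = duration_pmf(a, 500)
mean_duration = expected_duration(a)
```

The probability mass is always largest at d = 1 and decays monotonically, whatever the value of a. A hidden semi-Markov model replaces this implied distribution with an explicitly modeled one.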


Author(s): Marius Ötting, Roland Langrock, Antonello Maruotti

Abstract: We investigate the potential occurrence of change points, commonly referred to as “momentum shifts”, in the dynamics of football matches. For that purpose, we model minute-by-minute in-game statistics of Bundesliga matches using hidden Markov models (HMMs). To allow for within-state dependence of the variables, we formulate multivariate state-dependent distributions using copulas. For the Bundesliga data considered, we find that the fitted HMMs comprise states which can be interpreted as a team showing different levels of control over a match. Our modelling framework enables inference related to causes of momentum shifts and team tactics, which is of much interest to managers, bookmakers, and sports fans.


2021, Vol 15 (5), pp. 1-32
Author(s): Quang-huy Duong, Heri Ramampiaro, Kjetil Nørvåg, Thu-lan Dam

Dense subregion (subgraph and subtensor) detection is a well-studied area with a wide range of applications, and numerous efficient approaches and algorithms have been proposed. Approximation approaches are commonly used for detecting dense subregions due to the complexity of the exact methods. Existing algorithms are generally efficient for dense subtensor and subgraph detection, and can perform well in many applications. However, most of the existing works utilize the state-of-the-art greedy 2-approximation algorithm, which provides solutions with only a loose theoretical density guarantee. The main drawback of most of these algorithms is that they can estimate only one subtensor, or subgraph, at a time, with a low guarantee on its density. While some methods can, on the other hand, estimate multiple subtensors, they can give a guarantee on the density with respect to the input tensor for the first estimated subtensor only. We address these drawbacks by providing both theoretical and practical solutions for estimating multiple dense subtensors in tensor data and giving a higher lower bound of the density. In particular, we guarantee and prove a higher lower bound on the density of the estimated subgraphs and subtensors. We also propose a novel approach to show that there are multiple dense subtensors with a guarantee on their density that is greater than the lower bound used in the state-of-the-art algorithms. We evaluate our approach with extensive experiments on several real-world datasets, which demonstrate its efficiency and feasibility.
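For readers unfamiliar with the greedy 2-approximation the abstract refers to, the following is a minimal sketch of the classic greedy peeling procedure for the graph case, with density defined as edges divided by nodes. It is not the authors' multi-subtensor method; the function name, the simple quadratic-time node selection, and the toy graph are assumptions for illustration.

```python
def greedy_densest_subgraph(edges):
    """Greedy peeling: repeatedly remove a minimum-degree node and keep the
    densest intermediate subgraph (density = number of edges / number of nodes)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops in this sketch
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    remaining = set(adj)
    num_edges = sum(len(neigh) for neigh in adj.values()) // 2
    best_nodes = set(remaining)
    best_density = num_edges / len(remaining) if remaining else 0.0

    while len(remaining) > 1:
        # Pick a minimum-degree node (ties broken by node label).
        u = min(remaining, key=lambda x: (len(adj[x]), x))
        for v in adj[u]:
            adj[v].discard(u)
        num_edges -= len(adj[u])
        adj[u] = set()
        remaining.remove(u)

        density = num_edges / len(remaining)
        if density > best_density:
            best_density, best_nodes = density, set(remaining)

    return best_nodes, best_density


# Toy graph (hypothetical): a 4-clique on nodes 0..3 with a tail 3-4-5.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
nodes, density = greedy_densest_subgraph(edges)
```

Note that the procedure returns a single subgraph, which is the limitation the abstract highlights; a priority queue or degree buckets would be needed to make the peeling step efficient on large inputs.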


2021, Vol 15 (6), pp. 1-18
Author(s): Kai Liu, Xiangyu Li, Zhihui Zhu, Lodewijk Brand, Hua Wang

Nonnegative Matrix Factorization (NMF) is broadly used to determine class membership in a variety of clustering applications. From movie recommendations and image clustering to visual feature extraction, NMF has been applied to a large number of knowledge discovery and data mining problems. Traditional optimization methods, such as the Multiplicative Updating Algorithm (MUA), solve the NMF problem by utilizing an auxiliary function to ensure that the objective monotonically decreases. Although the objective in MUA converges, there exists no proof to show that the learned matrix factors converge as well. Without this rigorous analysis, the clustering performance and stability of the NMF algorithms cannot be guaranteed. To address this knowledge gap, in this article, we study the factor-bounded NMF problem and provide a solution algorithm with proven convergence by rigorous mathematical analysis, which ensures that both the objective and matrix factors converge. In addition, we show the relationship between MUA and our solution followed by an analysis of the convergence of MUA. Experiments on both toy data and real-world datasets validate the correctness of our proposed method and its utility as an effective clustering algorithm.
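As a point of reference for the MUA discussed above, here is a minimal pure-Python sketch of the standard multiplicative updates for the squared Frobenius objective ||V - WH||². It is not the factor-bounded algorithm proposed in the article; the helper names, the random initialization, and the toy matrix are assumptions, and the sketch assumes a strictly positive V so that no denominator becomes zero.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_loss(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def nmf_multiplicative(V, rank, iters=100, seed=0):
    """Multiplicative updates; records the loss after every iteration."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(rank)]
    losses = [frobenius_loss(V, W, H)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / b for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / b for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
        losses.append(frobenius_loss(V, W, H))
    return W, H, losses


# Toy data (hypothetical): two visible blocks, so rank 2 is a natural choice.
V = [[5.0, 3.0, 1.0, 1.0],
     [4.0, 2.0, 1.0, 1.0],
     [1.0, 1.0, 5.0, 4.0],
     [1.0, 1.0, 4.0, 5.0]]
W, H, losses = nmf_multiplicative(V, rank=2)
```

The recorded `losses` show the non-increasing objective that the auxiliary-function argument guarantees. Nothing in this loop, however, checks whether W and H themselves settle, which is the gap the article addresses.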


2018, Vol 42 (4), pp. 380
Author(s): Jiqiong You, Yuejen Zhao, Paul Lawton, Steven Guthridge, Stephen P. McDonald, ...

Objective: The aim of the present study was to evaluate the potential effects of different health intervention strategies on demand for renal replacement therapy (RRT) services in the Northern Territory (NT). Methods: A Markov chain simulation model was developed to estimate demand for haemodialysis (HD) and kidney transplantation (Tx) over the next 10 years, based on RRT registry data between 2002 and 2013. Four policy-relevant scenarios were evaluated: (1) increased Tx; (2) increased self-care dialysis; (3) reduced incidence of end-stage kidney disease (ESKD); and (4) reduced mortality. Results: There were 957 new cases of ESKD during the study period, with most patients being Indigenous people (85%). The median age was 50 years at onset and 57 years at death, 12 and 13 years younger respectively than Australian medians. The prevalence of RRT increased 5.6% annually, 20% higher than the national rate (4.7%). If current trends continue (baseline scenario), the demand for facility-based HD (FHD) would approach 100 000 treatments (95% confidence interval 75 000–121 000) in 2023, a 5% annual increase. Increasing Tx (0.3%), increasing self-care (5%) and reducing incidence (5%) each attenuate demand for FHD to ~70 000 annually by 2023. Conclusions: The present study demonstrates the effects of changing service patterns to increase Tx, self-care and prevention, all of which will substantially attenuate the growth in FHD requirements in the NT. What is known about the topic? The burden of ESKD is projected to increase in the NT, with demand for FHD doubling every 15 years. Little is known about the potential effect of changes in health policy and clinical practice on demand. What does this paper add? This study assessed the usefulness of a stochastic Markov model to evaluate the effects of potential policy changes on FHD demand. What are the implications for practitioners? The scenarios simulated by the stochastic Markov models suggest that changes in current ESKD management practices would have a large effect on future demand for FHD.
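The general shape of a Markov chain demand projection can be sketched as follows. This is a deterministic expected-count cohort version rather than the study's stochastic model, and every number in it (state set, transition probabilities, starting counts, yearly new cases) is a made-up placeholder, not a value from the registry data.

```python
# Hypothetical yearly transition probabilities between treatment states.
TRANSITIONS = {
    "HD":   {"HD": 0.85, "Tx": 0.03, "Dead": 0.12},
    "Tx":   {"HD": 0.04, "Tx": 0.93, "Dead": 0.03},
    "Dead": {"HD": 0.00, "Tx": 0.00, "Dead": 1.00},
}


def project(counts, transitions, new_cases_per_year, years):
    """Advance expected patient counts year by year.

    New ESKD cases are assumed to enter the "HD" state at the end of each year.
    Returns the list of yearly count dictionaries, starting with the input.
    """
    history = [dict(counts)]
    for _ in range(years):
        nxt = {state: 0.0 for state in transitions}
        for src, n in counts.items():
            for dst, p in transitions[src].items():
                nxt[dst] += n * p
        nxt["HD"] += new_cases_per_year
        counts = nxt
        history.append(dict(counts))
    return history


start = {"HD": 500.0, "Tx": 100.0, "Dead": 0.0}  # placeholder starting cohort
baseline = project(start, TRANSITIONS, new_cases_per_year=80.0, years=10)
reduced_incidence = project(start, TRANSITIONS, new_cases_per_year=76.0, years=10)
```

Comparing `baseline[-1]["HD"]` with `reduced_incidence[-1]["HD"]` mirrors how a scenario such as reduced incidence is evaluated against the baseline. A stochastic version would sample individual transitions instead of propagating expected counts.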


2016, Vol 25 (05), pp. 1640001
Author(s): Sotirios Chatzis, Dimitrios Kosmopoulos, George Papadourakis

Hidden Markov models (HMMs) are a popular approach for modeling sequential data, typically based on the assumption of a first-order Markov chain. In other words, only one-step-back dependencies are modeled, which is a rather unrealistic assumption in most applications. In this paper, we propose a method for postulating HMMs with approximately infinitely-long time-dependencies. Our approach considers the whole history of model states in the postulated dependencies, by making use of a recently proposed nonparametric Bayesian method for modeling label sequences with infinitely-long time dependencies, namely the sequence memoizer. By employing a mean-field-like approximation, we derive training and inference algorithms for our model with computational costs identical to those of simple first-order HMMs, despite its infinitely-long time-dependencies. The efficacy of our proposed model is experimentally demonstrated.


2018, Vol 6 (1), pp. 41-64
Author(s): Aslak Tveito, Mary M. Maleckar, Glenn T. Lines

Abstract: Single channel dynamics can be modeled using stochastic differential equations, and the dynamics of the state of the channel (e.g. open, closed, inactivated) can be represented using Markov models. Such models can also be used to represent the effect of mutations as well as the effect of drugs used to alleviate deleterious effects of mutations. Based on the Markov model and the stochastic models of the single channel, it is possible to derive deterministic partial differential equations (PDEs) giving the probability density functions (PDFs) of the states of the Markov model. In this study, we have analyzed PDEs modeling wild type (WT) channels, mutant channels (MT) and mutant channels for which a drug has been applied (MTD). Our aim is to show that it is possible to optimize the parameters of a given drug such that the solution of the MTD model is very close to that of the WT: the mutation’s effect is, theoretically, reduced significantly. We will present the mathematical framework underpinning this methodology and apply it to several examples. In particular, we will show that it is possible to use the method to, theoretically, improve the properties of some well-known existing drugs.


Entropy, 2021, Vol 23 (3), pp. 313
Author(s): Imon Banerjee, Vinayak A. Rao, Harsha Honnappa

Datasets displaying temporal dependencies abound in science and engineering applications, with Markov models representing a simplified and popular view of the temporal dependence structure. In this paper, we consider Bayesian settings that place prior distributions over the parameters of the transition kernel of a Markov model, and seek to characterize the resulting, typically intractable, posterior distributions. We present a Probably Approximately Correct (PAC)-Bayesian analysis of variational Bayes (VB) approximations to tempered Bayesian posterior distributions, bounding the model risk of the VB approximations. Tempered posteriors are known to be robust to model misspecification, and their variational approximations do not suffer the usual problems of overconfident approximations. Our results tie the risk bounds to the mixing and ergodic properties of the Markov data generating model. We illustrate the PAC-Bayes bounds through a number of example Markov models, and also consider the situation where the Markov model is misspecified.


2017, Vol 33 (8), pp. 2765-2779
Author(s): António Simões, José Manuel Viegas, José Torres Farinha, Inácio Fonseca
