mdl principle Latest Research Papers

Mint: MDL-based approach for Mining INTeresting Numerical Pattern Sets

Data Mining and Knowledge Discovery ◽

10.1007/s10618-021-00799-9 ◽

2021 ◽

Author(s):

Tatiana Makhalova ◽

Sergei O. Kuznetsov ◽

Amedeo Napoli

Keyword(s):

Data Mining ◽

Pattern Mining ◽

Research Area ◽

Subgroup Discovery ◽

Mdl Principle

AbstractPattern mining is well established in data mining research, especially for mining binary datasets. Surprisingly, there is much less work about numerical pattern mining and this research area remains under-explored. In this paper we propose Mint, an efficient MDL-based algorithm for mining numerical datasets. The MDL principle is a robust and reliable framework widely used in pattern mining, and as well in subgroup discovery. In Mint we reuse MDL for discovering useful patterns and returning a set of non-redundant overlapping patterns with well-defined boundaries and covering meaningful groups of objects. Mint is not alone in the category of numerical pattern miners based on MDL. In the experiments presented in the paper we show that Mint outperforms competitors among which IPD, RealKrimp, and Slim.

Download Full-text

Learning CNF Theories Using MDL and Predicate Invention

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/358 ◽

2021 ◽

Author(s):

Arcchit Jain ◽

Clément Gautrais ◽

Angelika Kimmig ◽

Luc De Raedt

Keyword(s):

Machine Learning ◽

Pattern Mining ◽

Frequent Pattern Mining ◽

Frequent Pattern ◽

Predicate Invention ◽

Mdl Principle ◽

Distinguishing Features ◽

Novel Algorithm ◽

Inverse Resolution

We revisit the problem of learning logical theories from examples, one of the most quintessential problems in machine learning. More specifically, we develop an approach to learn CNF-formulae from satisfiability. This is a setting in which the examples correspond to partial interpretations and an example is classified as positive when it is logically consistent with the theory. We present a novel algorithm, called Mistle -- Minimal SAT Theory Learner, for learning such theories. The distinguishing features are that 1) Mistle performs predicate invention and inverse resolution, 2) is based on the MDL principle to compress the data, and 3) combines this with frequent pattern mining to find the most interesting theories. The experiments demonstrate that Mistle can learn CNF theories accurately and works well in tasks involving compression and classification.

Download Full-text

Log Pattern Mining for Distributed System Maintenance

Complexity ◽

10.1155/2020/6628165 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Jia Chen ◽

Peng Wang ◽

Shiqing Du ◽

Wei Wang

Keyword(s):

Distributed System ◽

Pattern Mining ◽

Minimum Description Length ◽

Log Analysis ◽

Event Sequences ◽

Mdl Principle ◽

Temporal Relationships ◽

Efficiency And Effectiveness ◽

System Maintenance ◽

Rich Information

Due to the complexity of the network structure, log analysis is usually necessary for the maintenance of network-based distributed systems since logs record rich information about the system behaviors. In recent years, numerous works have been proposed for log analysis; however, they ignore temporal relationships between logs. In this paper, we target on the problem of mining informative patterns from temporal log data. We propose an approach to discover sequential patterns from event sequences with temporal regularities. Discovered patterns are useful for engineers to understand the behaviors of a network-based distributed system. To solve the well-known problem of pattern explosion, we resort to the minimum description length (MDL) principle and take a step forward in summarizing the temporal relationships between adjacent events of a pattern. Experiments on real log datasets prove the efficiency and effectiveness of our method.

Download Full-text

Online summarization of dynamic graphs using subjective interestingness for sequential data

Data Mining and Knowledge Discovery ◽

10.1007/s10618-020-00714-8 ◽

2020 ◽

Author(s):

Sarang Kapoor ◽

Dhish Kumar Saxena ◽

Matthijs van Leeuwen

Keyword(s):

Real World ◽

Information Gain ◽

Current Knowledge ◽

Maximum Entropy Principle ◽

Sequential Data ◽

Dynamic Graphs ◽

Dynamic Graph ◽

Real World Data ◽

Mdl Principle ◽

Subjective Interestingness

Abstract Many real-world phenomena can be represented as dynamic graphs, i.e., networks that change over time. The problem of dynamic graph summarization, i.e., to succinctly describe the evolution of a dynamic graph, has been widely studied. Existing methods typically use objective measures to find fixed structures such as cliques, stars, and cores. Most of the methods, however, do not consider the problem of online summarization, where the summary is incrementally conveyed to the analyst as the graph evolves, and (thus) do not take into account the knowledge of the analyst at a specific moment in time. We address this gap in the literature through a novel, generic framework for subjective interestingness for sequential data. Specifically, we iteratively identify atomic changes, called ‘actions’, that provide most information relative to the current knowledge of the analyst. For this, we introduce a novel information gain measure, which is motivated by the minimum description length (MDL) principle. With this measure, our approach discovers compact summaries without having to decide on the number of patterns. As such, we are the first to combine approaches for data mining based on subjective interestingness (using the maximum entropy principle) with pattern-based summarization (using the MDL principle). We instantiate this framework for dynamic graphs and dense subgraph patterns, and present DSSG, a heuristic algorithm for the online summarization of dynamic graphs by means of informative actions, each of which represents an interpretable change to the connectivity structure of the graph. The experiments on real-world data demonstrate that our approach effectively discovers informative summaries. We conclude with a case study on data from an airline network to show its potential for real-world applications.

Download Full-text

Discovering outstanding subgroup lists for numeric targets using MDL

10.31219/osf.io/wsdez ◽

2020 ◽

Author(s):

Hugo Manuel Proença ◽

Peter Grünwald ◽

Thomas Bäck ◽

Matthijs van Leeuwen

Keyword(s):

State Of The Art ◽

Minimum Description Length ◽

Problem Formulation ◽

Subgroup Discovery ◽

Target Variable ◽

Trade Off ◽

Mdl Principle ◽

Large Numbers ◽

The One ◽

Target Attribute

The task of subgroup discovery (SD) is to find interpretable descriptions of subsets of a dataset that stand out with respect to a target attribute. To address the problem of mining large numbers of redundant subgroups, subgroup set discovery (SSD) has been proposed. State-of-the-art SSD methods have their limitations though, as they typically heavily rely on heuristics and/or user-chosen hyperparameters. We propose a dispersion-aware problem formulation for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists. We argue that the best subgroup list is the one that best summarizes the data given the overall distribution of the target. We restrict our focus to a single numeric target variable and show that our formalization coincides with an existing quality measure when finding a single subgroup, but that---in addition---it allows to trade off subgroup quality with the complexity of the subgroup. We next propose SSD++, a heuristic algorithm for which we empirically demonstrate that it returns outstanding subgroup lists: non-redundant sets of compact subgroups that stand out by having strongly deviating means and small spread.

Download Full-text

Vouw: Geometric Pattern Mining Using the MDL Principle

Lecture Notes in Computer Science - Advances in Intelligent Data Analysis XVIII ◽

10.1007/978-3-030-44584-3_13 ◽

2020 ◽

pp. 158-170

Author(s):

Micky Faas ◽

Matthijs van Leeuwen

Keyword(s):

Pattern Mining ◽

Geometric Pattern ◽

Mdl Principle

Download Full-text

Detecting Metachanges in Data Streams from the Viewpoint of the MDL Principle

Entropy ◽

10.3390/e21121134 ◽

2019 ◽

Vol 21 (12) ◽

pp. 1134 ◽

Cited By ~ 1

Author(s):

Shintaro Fukushima ◽

Kenji Yamanishi

Keyword(s):

Data Streams ◽

Data Stream ◽

Minimum Description Length ◽

Detection Algorithm ◽

Change Points ◽

Detection Methods ◽

Code Length ◽

Warning Signals ◽

Mdl Principle ◽

Synthetic Datasets

This paper addresses the issue of how we can detect changes of changes, which we call metachanges, in data streams. A metachange refers to a change in patterns of when and how changes occur, referred to as “metachanges along time” and “metachanges along state”, respectively. Metachanges along time mean that the intervals between change points significantly vary, whereas metachanges along state mean that the magnitude of changes varies. It is practically important to detect metachanges because they may be early warning signals of important events. This paper introduces a novel notion of metachange statistics as a measure of the degree of a metachange. The key idea is to integrate metachanges along both time and state in terms of “code length” according to the minimum description length (MDL) principle. We develop an online metachange detection algorithm (MCD) based on the statistics to apply it to a data stream. With synthetic datasets, we demonstrated that MCD detects metachanges earlier and more accurately than existing methods. With real datasets, we demonstrated that MCD can lead to the discovery of important events that might be overlooked by conventional change detection methods.

Download Full-text

Model Selection for Non-Negative Tensor Factorization with Minimum Description Length

Entropy ◽

10.3390/e21070632 ◽

2019 ◽

Vol 21 (7) ◽

pp. 632

Author(s):

Yunhui Fu ◽

Shin Matsushima ◽

Kenji Yamanishi

Keyword(s):

Missing Values ◽

Minimum Description Length ◽

Selection Criterion ◽

Real Data ◽

Tensor Factorization ◽

Trial And Error ◽

Negative Factor ◽

Mdl Principle ◽

Normalized Maximum Likelihood ◽

Selection For

Non-negative tensor factorization (NTF) is a widely used multi-way analysis approach that factorizes a high-order non-negative data tensor into several non-negative factor matrices. In NTF, the non-negative rank has to be predetermined to specify the model and it greatly influences the factorized matrices. However, its value is conventionally determined by specialists’ insights or trial and error. This paper proposes a novel rank selection criterion for NTF on the basis of the minimum description length (MDL) principle. Our methodology is unique in that (1) we apply the MDL principle on tensor slices to overcome a problem caused by the imbalance between the number of elements in a data tensor and that in factor matrices, and (2) we employ the normalized maximum likelihood (NML) code-length for histogram densities. We employ synthetic and real data to empirically demonstrate that our method outperforms other criteria in terms of accuracies for estimating true ranks and for completing missing values. We further show that our method can produce ranks suitable for knowledge discovery.

Download Full-text