Constraint-Based Pattern Discovery

Author(s):  
Francesco Bonchi

Devising fast and scalable algorithms able to crunch huge amounts of data was for many years one of the main goals of data mining research. But then we realized that this was not enough: no matter how efficient such algorithms are, the results we obtain are often of limited use in practice. Typically, the knowledge we seek lies in a small pool of local patterns hidden within an ocean of irrelevant patterns generated from a sea of data. It is therefore the volume of the results itself that creates a second-order mining problem for the human expert. This is typically the case for association rules and frequent itemset mining (Agrawal & Srikant, 1994), to which many researchers have dedicated their (mainly algorithmic) investigations over the last decade. The computational problem is that of efficiently mining, from a database of transactions, those itemsets which satisfy a user-defined constraint of minimum frequency. Recently the research community has turned its attention to more complex kinds of frequent patterns extracted from more structured data: sequences, trees, and graphs. These different kinds of patterns have different peculiarities and application fields, but they all share the same computational aspects: a usually very large input, an exponential search space, and a far too large solution set. This situation (too much data yielding too many patterns) is harmful for two reasons. First, performance degrades: mining generally becomes inefficient or, often, simply unfeasible. Second, identifying the fragments of interesting knowledge, blurred within a huge quantity of mostly useless patterns, is difficult. The paradigm of constraint-based pattern mining was introduced as a solution to both these problems. In this paradigm, it is the user who specifies to the system what is interesting for the current application: constraints are a tool to drive the mining process towards potentially interesting patterns; moreover, they can be pushed deep inside the mining algorithm in order to fight the exponential search space curse and to achieve better performance (Srikant et al., 1997; Ng et al., 1998; Han et al., 1999; Grahne et al., 2000).
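
To make the constraint-pushing idea concrete, the following minimal sketch pushes an anti-monotone constraint (here, a hypothetical per-item price budget) into an Apriori-style level-wise miner, discarding candidate itemsets before their support is ever counted. All names and data are illustrative, not from any of the cited systems.

```python
def apriori_with_constraint(transactions, min_support, constraint):
    """Level-wise frequent itemset mining that pushes an
    anti-monotone constraint into candidate generation."""
    items = {i for t in transactions for i in t}
    # Seed with frequent 1-itemsets that already satisfy the constraint.
    current = [frozenset([i]) for i in sorted(items)
               if sum(1 for t in transactions if i in t) >= min_support
               and constraint(frozenset([i]))]
    result = list(current)
    k = 2
    while current:
        # Join step: candidates of size k from surviving (k-1)-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune with the constraint *before* the expensive support count:
        # anti-monotonicity guarantees no superset can recover.
        candidates = [c for c in candidates if constraint(c)]
        current = [c for c in candidates
                   if sum(1 for t in transactions if c <= t) >= min_support]
        result.extend(current)
        k += 1
    return result

# Example: itemsets whose total price stays below a budget.
price = {"a": 5, "b": 9, "c": 2}
data = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
print(apriori_with_constraint(data, 2, lambda s: sum(price[i] for i in s) <= 10))
```

The budget constraint is anti-monotone because adding items can only increase the total price, so once an itemset exceeds the budget, every superset does too, and the whole branch can be cut before counting support.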

Sequential pattern mining is a data mining approach that aims to discover interesting common patterns in sequence datasets; it has attracted significant research interest due to its real-world applications in fields such as web click-stream mining, retail business, the stock market, and bioinformatics. Each sequence in a sequence dataset is composed of time-ordered events, and each event is an itemset. The task is to discover all subsequences whose frequency exceeds a given minimum support threshold. Discovering sequential patterns is expensive in both mining time and memory, because the search space grows aggressively: the number of frequent subsequences explodes with sequence length, with the number of distinct items, and with the volume of the sequence dataset. Research in this domain therefore aims at effective data structures that address frequency counting and the large search space, as well as scalable algorithms that reduce execution time and memory usage. We propose two efficient data structures, the Pre-order Post-order Coded Aggregate Tree (PPCA-Tree) for compact representation of the sequence dataset and the Root-node List of First-Occurrence Sub Trees Map (RLFOST-Map) for efficient representation of projected databases. We also develop an efficient Partially ordered Sequential PAttern Mining algorithm called PSPAM, and a parallel implementation called PAPSPAM, based on the PPCA-Tree and RLFOST-Map, which eliminate reconstruction of the projected databases. Experimental analysis on various synthetic datasets shows that PSPAM and PAPSPAM outperform PrefixSpan and other conventional and state-of-the-art algorithms on dense datasets, with better scalability.
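
The PPCA-Tree and RLFOST-Map structures are not reproduced here; as a baseline for orientation, the following is a minimal PrefixSpan-style sketch of prefix projection over sequences of single items, the kind of repeated projected-database construction that PSPAM is designed to avoid.

```python
def prefixspan(sequences, min_support, prefix=()):
    """Minimal PrefixSpan-style mining over sequences of single items.
    (Illustrative baseline only; the PPCA-Tree/RLFOST-Map structures of
    PSPAM avoid re-building these projected databases.)"""
    patterns = []
    # Count the items that can extend the current prefix (one count per
    # sequence, so the count is the pattern's support).
    counts = {}
    for seq in sequences:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, sup in sorted(counts.items()):
        if sup < min_support:
            continue
        new_prefix = prefix + (item,)
        patterns.append((new_prefix, sup))
        # Project: keep the suffix after the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in sequences if item in seq]
        patterns.extend(prefixspan(projected, min_support, new_prefix))
    return patterns

# Example: frequent subsequences with support >= 2.
data = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"]]
print(prefixspan(data, 2))
```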


2017, Vol. 16 (06), pp. 1549-1579
Author(s):
Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Tzung-Pei Hong, Han-Chieh Chao

Frequent itemset mining (FIM) is a fundamental set of techniques used to discover useful and meaningful relationships between items in transaction databases. In recent decades, extensions of FIM such as weighted frequent itemset mining (WFIM) and frequent itemset mining in uncertain databases (UFIM) have been proposed. WFIM considers that items may have different weights/importance; it can thus discover itemsets that are more useful and meaningful by ignoring irrelevant itemsets with lower weights. UFIM takes into account that data collected in a real-life environment may often be inaccurate, imprecise, or incomplete. Recently, these two ideas were combined in the HEWI-Uapriori algorithm, which considers both item weights and transaction uncertainty to mine high expected weighted itemsets (HEWIs) using a two-phase Apriori-based approach. Although the upper bound proposed in HEWI-Uapriori can reduce the size of the search space, the algorithm still generates a large number of candidates and uses a level-wise search. In this paper, a more efficient algorithm named HEWI-Utree is developed to mine HEWIs without performing multiple database scans and without generating candidates. The algorithm relies on three novel structures, the element (E)-table, the weighted-probability (WP)-table, and the WP-tree, to maintain the information required for identifying and pruning unpromising itemsets early. Experimental results show that the proposed algorithm is generally much more efficient than traditional methods for WFIM and UFIM, as well as the state-of-the-art HEWI-Uapriori algorithm, in terms of runtime, memory consumption, and scalability.
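
As a point of reference, the sketch below computes the expected weighted support of an itemset under common conventions (item independence within a transaction, itemset weight as the mean of its item weights); the paper's exact HEWI measure may differ in details.

```python
def expected_weighted_support(itemset, database, weights):
    """Expected weighted support of an itemset in an uncertain database.
    Each transaction maps items to existence probabilities; under the
    usual independence assumption, the itemset's expected support in a
    transaction is the product of its items' probabilities.  The itemset
    weight is taken as the mean of its item weights (a common WFIM
    convention; the paper's definition may differ)."""
    exp_sup = 0.0
    for trans in database:
        if all(i in trans for i in itemset):
            p = 1.0
            for i in itemset:
                p *= trans[i]
            exp_sup += p
    w = sum(weights[i] for i in itemset) / len(itemset)
    return w * exp_sup

db = [{"a": 0.9, "b": 0.5}, {"a": 0.6, "c": 1.0}, {"b": 0.8, "c": 0.4}]
wt = {"a": 1.2, "b": 0.7, "c": 1.0}
print(expected_weighted_support({"a"}, db, wt))       # 1.2 * (0.9 + 0.6)
print(expected_weighted_support({"a", "b"}, db, wt))  # 0.95 * (0.9 * 0.5)
```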


2017, Vol. 10 (2), pp. 67
Author(s):
Vina Ayumi, L.M. Rasdi Rere, Mohamad Ivan Fanany, Aniati Murni Arymurthy

Metaheuristic algorithms are powerful optimization methods that solve problems by exploring the typically large solution search spaces of instances believed to be hard in general. However, the performance of these algorithms depends significantly on the setting of their parameters, which is not easy to do accurately and depends heavily on the problem's characteristics. Many methods have been proposed to fine-tune the parameters automatically, including fuzzy logic, chaos, random adjustment, and others. For many years these methods have been developed independently for the automatic setting of metaheuristic parameters, and the integration of two or more of them has not yet been much explored. Thus, a method that combines the advantages of chaos and random adjustment is proposed. Several popular metaheuristic algorithms are used to test the performance of the proposed method: simulated annealing, particle swarm optimization, differential evolution, and harmony search. The case study of this research is contrast enhancement for the Cameraman, Lena, Boat, and Rice images. In general, the simulation results show that the proposed methods are better than the original metaheuristics, chaotic metaheuristics, and metaheuristics with random adjustment.
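
One illustrative way to combine the two ingredients is to drive a metaheuristic's control parameter with a blend of a logistic-map chaotic sequence and uniform random adjustment; the sketch below (not the paper's exact scheme) applies the blend to a simulated-annealing cooling factor.

```python
import random

def chaotic_random_schedule(x0=0.7, r=4.0, blend=0.5):
    """Generator mixing a logistic-map chaotic sequence with uniform
    random adjustment, as one illustrative way to vary a metaheuristic
    parameter in (0, 1) online.  `blend` sets the chaos/random mix."""
    x = x0
    while True:
        x = r * x * (1.0 - x)          # logistic map, chaotic for r = 4
        yield blend * x + (1.0 - blend) * random.random()

# Example: drive the simulated-annealing cooling factor with the schedule.
schedule = chaotic_random_schedule()
temperature = 100.0
for step in range(5):
    alpha = 0.90 + 0.09 * next(schedule)   # cooling factor in [0.90, 0.99)
    temperature *= alpha
    print(f"step {step}: alpha={alpha:.4f}, T={temperature:.2f}")
```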


2021
Author(s):
Giacomo Bertoldi, Stefano Campanella, Emanuele Cordano, Alberto Sartori

Proper characterization of uncertainty remains a major research and operational challenge in Earth and Environmental Systems Models (EESMs). In fact, model calibration is often more art than science: one must make several discretionary choices, guided more by one's own experience and intuition than by the scientific method. In practice, this means that the result of calibration (CA) can be suboptimal. One of the challenges of CA is the large number of parameters involved in EESMs, which are therefore usually selected with the help of a preliminary sensitivity analysis (SA). Finally, the computational burden of EESMs and the large volume of the search space make SA and CA very time-consuming processes.

This work applies a modern HPC approach to optimize a complex, over-parameterized hydrological model, improving the computational efficiency of SA/CA. We apply the derivative-free optimization algorithms implemented in the Facebook Nevergrad Python library (Rapin and Teytaud, 2018) on an HPC cluster, thanks to the Dask framework (Dask Development Team, 2016).

The approach has been applied to the GEOtop hydrological model (Rigon et al., 2006; Endrizzi et al., 2014) to predict the time evolution of variables such as soil water content and evapotranspiration for several mountain agricultural sites in South Tyrol with different elevations, land covers (pasture, meadow, orchard), and soil types.

We performed simulations on one-dimensional domains, where the model solves the energy and water budget equations in a column of soil and neglects lateral water fluxes. Even neglecting the distribution of parameters across soil layers and considering a homogeneous column, one has tens of parameters controlling soil and vegetation properties, of which only a few are experimentally available.

Because the interpretation of global SA can be difficult or misleading, and the number of model evaluations needed by SA is comparable to that of CA, we employed the following strategy: we performed CA using a full set of continuous parameters, and then SA after CA, using the samples collected during CA, to interpret the results. Given the above-mentioned computational challenges, this strategy is possible only using HPC resources. For this reason, we focused on the computational aspects of calibration from an HPC perspective and examined the scaling of these algorithms and their implementation up to 1024 cores on a cluster. Other issues that we had to address were the complex shape of the search space and the robustness of CA and SA against model convergence failures.

HPC techniques allow calibrating models with a high number of parameters within a reasonable computing time while exploring the parameter space properly. This is particularly important with noisy, multimodal objective functions. In our case, HPC was essential to determine the parameters controlling the water retention curve, which is highly nonlinear. The developed framework, which is published and freely available on GitHub, also shows how libraries and tools used within the machine learning community can be useful and easily adapted to EESM CA.
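
A minimal sketch of this setup follows, using the public Nevergrad API with a toy objective standing in for a GEOtop run (the misfit function and the bounds are illustrative assumptions, not the published calibration code).

```python
import nevergrad as ng  # Facebook's derivative-free optimization library

# Toy stand-in for a GEOtop run: returns a scalar misfit for a parameter
# vector (in the real setup this would launch the model and compare
# simulated to observed soil moisture).
def misfit(params):
    return sum((p - 0.3) ** 2 for p in params)

# Ten bounded, continuous parameters, as in a 1-D soil-column calibration.
parametrization = ng.p.Array(shape=(10,)).set_bounds(0.0, 1.0)
optimizer = ng.optimizers.NGOpt(parametrization=parametrization, budget=500)

# On a cluster, evaluations can instead be dispatched through a Dask
# client passed as the `executor` argument of minimize(), since Dask
# futures expose the concurrent.futures-style interface it expects.
recommendation = optimizer.minimize(misfit)
print(recommendation.value)
```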


Author(s):  
Mohammad Al Hasan

The research on mining interesting patterns from transactional or scientific datasets has matured over the last two decades. At present, numerous algorithms exist to mine patterns of variable complexity, such as sets, sequences, trees, and graphs. Collectively, they are referred to as Frequent Pattern Mining (FPM) algorithms. FPM is useful in most of the prominent knowledge discovery tasks, like classification, clustering, and outlier detection. It can further be used in database tasks, like indexing and hashing, when storing a large collection of patterns. But the usage of FPM in real-life knowledge discovery systems is considerably low in comparison to its potential. The prime reason is the lack of interpretability caused by the enormity of the output-set size. For instance, a moderately sized graph dataset with merely a thousand graphs can produce millions of frequent graph patterns for a reasonable support value. This is expected, due to the combinatorial search space of pattern mining. However, classification, clustering, and other similar knowledge discovery tasks should not use that many patterns as their knowledge nuggets (features), as doing so would increase the time and memory complexity of the system. Moreover, it can deteriorate the task quality because of the well-known "curse of dimensionality" effect. So, in recent years, researchers have felt the need to summarize the output set of FPM algorithms so that the summary set is small, non-redundant, and discriminative. There are different summarization techniques: lossless, profile-based, cluster-based, statistical, etc. In this article, we overview the main concepts of these summarization techniques, with a comparative discussion of their strengths, weaknesses, applicability, and computational cost.
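
As one concrete instance of the cluster/profile-based family, the sketch below greedily picks the k patterns with the largest marginal transaction coverage; it is illustrative only and not taken from any of the surveyed techniques' original papers.

```python
def greedy_summary(patterns, transactions, k):
    """Greedy coverage-based summarization: repeatedly pick the pattern
    that covers the most transactions not yet covered by the summary."""
    covered = set()
    summary = []
    candidates = list(patterns)
    for _ in range(min(k, len(candidates))):
        best, best_gain = None, 0
        for pat in candidates:
            gain = sum(1 for idx, t in enumerate(transactions)
                       if idx not in covered and pat <= t)
            if gain > best_gain:
                best, best_gain = pat, gain
        if best is None:        # nothing adds coverage; stop early
            break
        summary.append(best)
        covered |= {idx for idx, t in enumerate(transactions) if best <= t}
        candidates.remove(best)
    return summary

pats = [frozenset("ab"), frozenset("a"), frozenset("c")]
data = [{"a", "b"}, {"a"}, {"c"}, {"a", "c"}]
print(greedy_summary(pats, data, 2))   # [frozenset({'a'}), frozenset({'c'})]
```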


2021, Vol. 11 (1), pp. 18-37
Author(s):
Mehmet Bicer, Daniel Indictor, Ryan Yang, Xiaowen Zhang

Association rule mining is a common technique used for discovering interesting frequent patterns in data acquired in various application domains. The search space explodes combinatorially as the size of the data increases. Furthermore, the introduction of new data can invalidate old frequent patterns and introduce new ones. Hence, while finding association rules efficiently is an important problem, maintaining and updating them is also crucial. Several algorithms have been introduced to find association rules efficiently; one of them is Apriori. There are also algorithms written to update or maintain existing association rules, and Update With Early Pruning (UWEP) is one such algorithm. In this paper, the authors show that under certain conditions it is preferable to use an incremental algorithm as opposed to the classic Apriori algorithm. They also propose new implementation techniques and improvements to the original UWEP paper in an algorithm called UWEP2. These include the use of memoization and lazy evaluation to reduce scans of the dataset.
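
A minimal sketch of the memoization-plus-lazy-evaluation idea follows: support counts per database partition are cached so repeated lookups never rescan a partition, and the old database is consulted only when the increment alone cannot decide. Names and thresholds are illustrative assumptions, not the UWEP2 implementation.

```python
from functools import lru_cache

OLD_DB = [frozenset(t) for t in ({"a", "b"}, {"a", "c"}, {"b", "c"})]
NEW_DB = [frozenset(t) for t in ({"a", "b"}, {"a", "b", "c"})]

@lru_cache(maxsize=None)
def support(itemset, which):
    """Memoized support count: repeated lookups during the incremental
    update never rescan a partition twice for the same itemset."""
    db = OLD_DB if which == "old" else NEW_DB
    return sum(1 for t in db if itemset <= t)

def still_frequent(itemset, min_sup_ratio):
    """Combine counts from the old database and the increment lazily:
    the increment is scanned first, the old partition only if needed."""
    inc = support(itemset, "new")
    total_size = len(OLD_DB) + len(NEW_DB)
    # Early pruning: even if the itemset appeared in every old
    # transaction, could it still reach the threshold?
    if inc + len(OLD_DB) < min_sup_ratio * total_size:
        return False
    return inc + support(itemset, "old") >= min_sup_ratio * total_size

print(still_frequent(frozenset({"a", "b"}), 0.5))   # True: 2 + 1 >= 2.5
```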


Author(s):
Logeswaran K., Suresh P., Savitha S., Prasanna Kumar K. R.

In recent years, data analysts have faced many challenges in mining high utility itemsets (HUIs) from transactional databases using existing traditional techniques. The challenges in utility mining algorithms are the exponentially growing search space and choosing a minimum utility threshold appropriate to the given database. To overcome these challenges, evolutionary algorithm-based techniques can be used to mine HUIs from a transactional database. However, testing each of the supporting fitness functions in the optimization problem is very inefficient and increases the time complexity of the algorithm. To overcome this drawback, a reinforcement learning-based approach is proposed for improving the efficiency of the algorithm: the most appropriate fitness function for evaluation is selected automatically during execution. Furthermore, as distinct functions prove themselves during the optimization process, the currently optimal function is selected dynamically.
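
The chapter's actual learner and reward signal are not specified here; purely as an illustration of the idea, the sketch below treats candidate fitness functions as arms of an epsilon-greedy bandit and learns online which one to evaluate.

```python
import random

def epsilon_greedy_fitness_selector(fitness_fns, rounds=100, eps=0.1):
    """Treat each candidate fitness function as a bandit arm and learn
    online which one to trust, instead of evaluating all of them at
    every step.  A minimal sketch of the reinforcement-learning idea;
    the chapter's reward signal and learner may differ."""
    value = [0.0] * len(fitness_fns)   # running mean reward per function
    pulls = [0] * len(fitness_fns)
    for _ in range(rounds):
        if random.random() < eps:                     # explore
            arm = random.randrange(len(fitness_fns))
        else:                                         # exploit
            arm = max(range(len(fitness_fns)), key=lambda i: value[i])
        reward = fitness_fns[arm]()   # e.g. improvement in best utility found
        pulls[arm] += 1
        value[arm] += (reward - value[arm]) / pulls[arm]
    return max(range(len(fitness_fns)), key=lambda i: value[i])

# Toy arms standing in for HUI fitness functions of different quality.
fns = [lambda: random.gauss(0.2, 0.1), lambda: random.gauss(0.5, 0.1)]
print("selected fitness function:", epsilon_greedy_fitness_selector(fns))
```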


Author(s):
Mohammad Karim Sohrabi, Hossein Azgomi

Various problems arise when mining massive datasets, among which finding similar documents can be singled out. The shingling method converts this problem to a set-based problem. Some existing methods use min-hashing to compress the sets produced by shingling and then exploit the LSH method to find candidate pairs for similarity search among all pairs of documents. In this paper, an Apriori-based method is proposed for finding similar documents based on the frequent itemset mining approach. To this end, the Apriori algorithm is modified and customized for the similarity search problem. Modeling the similarity search problem as a frequent pattern mining problem, using a modified version of Apriori, and dynamically selecting the minimum support threshold are the most important advantages of the proposed method, which lead to appropriate execution times and high-quality results. The proposed method finds similar documents in less time than the combined method and the MCVM method because it generates fewer candidate pairs. Furthermore, experimental results show the high quality of the answers produced by the proposed method.
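
One way to picture the reduction is to treat documents as items and shingles as transactions, so that similar document pairs appear as frequent 2-itemsets; the sketch below does exactly that (the dynamic threshold selection of the actual method is not reproduced).

```python
from itertools import combinations
from collections import defaultdict

def shingles(text, k=4):
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def similar_pairs(docs, k=4, min_shared=3):
    """Cast similarity search as frequent 2-itemset mining: a pair of
    documents 'co-occurring' in at least min_shared shingles is a
    candidate similar pair."""
    shingle_to_docs = defaultdict(set)
    for name, text in docs.items():
        for s in shingles(text, k):
            shingle_to_docs[s].add(name)
    pair_counts = defaultdict(int)
    for members in shingle_to_docs.values():
        for pair in combinations(sorted(members), 2):
            pair_counts[pair] += 1
    return [p for p, c in pair_counts.items() if c >= min_shared]

docs = {"d1": "data mining at scale",
        "d2": "data mining at speed",
        "d3": "deep learning"}
print(similar_pairs(docs))   # d1 and d2 share many 4-shingles
```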


2018, Vol. 7 (3.8), pp. 77
Author(s):
Prof. Mangesh Ghonge, Miss Neha Rane

Pattern mining is arguably the most fundamental and crucial part of data mining. Itemset mining plays a vital role in acquiring important correlations among the data. Originally, the notion of itemset mining was used to find the most frequently occurring items in an itemset. In some situations, however, it is necessary to locate items whose utility value is below the threshold, because they are nevertheless of great use. Assigning a weight to each distinct item makes the mining of such patterns effective and efficient. Different mining algorithms are used to obtain correlations among the data items based on how frequently the items occur in the dataset. Frequent itemset mining finds the items that occur frequently, whereas infrequent itemset mining finds the items that occur very rarely; determining the latter kind of data is harder than finding data that occurs frequently. Frequent Itemset Mining (FISM) locates large and frequent itemsets in huge data such as market baskets. Such data has two properties that are not addressed by FISM: the mixture property and the projection property. The proposed system combines both the mixture and projection properties and further provides automated support thresholds.
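
For concreteness, the following sketch computes one common variant of weighted support (plain support scaled by the mean item weight); the chapter's exact measure and its automated thresholds may differ.

```python
def weighted_support(itemset, transactions, weights):
    """Weighted support: the itemset's plain support scaled by the mean
    weight of its items, one common convention in weighted itemset
    mining (illustrative; the chapter's measure may differ)."""
    count = sum(1 for t in transactions if itemset <= t)
    mean_w = sum(weights[i] for i in itemset) / len(itemset)
    return mean_w * count / len(transactions)

data = [{"milk", "bread"}, {"milk"}, {"bread", "butter"}]
wt = {"milk": 0.4, "bread": 0.8, "butter": 1.0}
# A low-frequency but heavily weighted itemset can still pass a threshold.
print(weighted_support(frozenset({"bread", "butter"}), data, wt))  # 0.9 * 1/3
```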


2008, Vol. 17 (02), pp. 303-320
Author(s):
Wei Song, Bingru Yang, Zhangyan Xu

Because of the inherent computational complexity, mining the complete set of frequent itemsets in dense datasets remains a challenging task. Mining Maximal Frequent Itemsets (MFIs) is an alternative that addresses the problem. The Set-Enumeration Tree (SET) is a common data structure used in several MFI mining algorithms; for this kind of algorithm, the process of mining MFIs can be viewed as a search in the set-enumeration tree. To reduce the search space, this paper proposes a new MFI mining algorithm, Index-MaxMiner, employing a hybrid search strategy that blends breadth-first and depth-first search. First, the index array is proposed, and a bitmap-based algorithm for computing the index array is presented. By adding a subsume index to frequent items, Index-MaxMiner discovers all candidate MFIs in a single breadth-first pass, which avoids first-level nodes that would not participate in the answer set and drastically reduces the number of candidate itemsets. Then, for the candidate MFIs, a depth-first search strategy is used to generate all MFIs. Thus, jumping search in the SET is implemented, and the search space is greatly reduced. The experimental results show that the proposed algorithm is efficient, especially for dense datasets.
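
The sketch below illustrates only the bitmap side of the approach: integer bitmaps (one bit per transaction) make support counting a bitwise AND plus a popcount, here inside a toy depth-first MFI miner. The subsume index and the hybrid breadth/depth strategy of Index-MaxMiner are not reproduced.

```python
def mine_mfi(transactions, min_support):
    """Toy maximal-frequent-itemset miner using integer bitmaps for
    support counting, in the spirit of the bitmap machinery of
    Index-MaxMiner (illustrative only)."""
    items = sorted({i for t in transactions for i in t})
    # Bitmap of each item: bit j is set iff transaction j contains it.
    bitmap = {i: sum(1 << j for j, t in enumerate(transactions) if i in t)
              for i in items}
    found = []

    def dfs(itemset, bits, tail):
        extensible = False
        for idx, item in enumerate(tail):
            new_bits = bits & bitmap[item]          # support via AND
            if bin(new_bits).count("1") >= min_support:
                extensible = True
                dfs(itemset + [item], new_bits, tail[idx + 1:])
        if not extensible and itemset:
            found.append(frozenset(itemset))

    dfs([], (1 << len(transactions)) - 1, items)
    # Keep only itemsets not contained in another reported itemset.
    return [s for s in found if not any(s < o for o in found)]

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(mine_mfi(data, 2))   # maximal itemsets with support >= 2
```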

