HIGH CANDIDATES GENERATION: A NEW EFFICIENT METHOD FOR MINING SHARE-FREQUENT PATTERNS

Share-frequent pattern mining is more practical than traditional frequent pattern mining because it can reflect useful knowledge such as the total costs and profits of patterns. Mining share-frequent patterns has become one of the most important research issues in data mining. However, previous algorithms extract a large number of candidates and spend a lot of time generating and testing useless candidates during the mining process. This paper proposes a new efficient method for discovering share-frequent patterns. The new method reduces the number of candidates by generating candidates only from patterns with a high transaction-measure-value. The downward closure property of transaction-measure-value patterns ensures the correctness of the proposed method. Experimental results on dense and sparse datasets show that the proposed method is very efficient in terms of execution time. It also decreases the number of useless candidates generated during the mining process by at least 70%.
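
The pruning idea in this abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the toy database, the function names, and the exact definitions (share as the itemset's own quantities divided by the database total, and a bound that sums whole-transaction values) are assumptions based on the usual share-frequent formulation. The sketch only shows why a downward-closed upper bound lets a level-wise search skip candidates safely.

```python
from itertools import combinations

# Hypothetical toy database: each transaction maps item -> purchased quantity.
DB = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 3, "c": 2},
    {"b": 1, "d": 4},
    {"a": 2, "b": 1, "c": 3, "d": 1},
]

def tmv(t):
    """Transaction measure value: total quantity in one transaction."""
    return sum(t.values())

TOTAL = sum(tmv(t) for t in DB)  # total measure value of the database

def share(itemset):
    """Quantity contributed by the itemset's own items, over TOTAL (assumed definition)."""
    lmv = sum(sum(t[i] for i in itemset)
              for t in DB if all(i in t for i in itemset))
    return lmv / TOTAL

def tmv_bound(itemset):
    """Upper bound on share: sums whole-transaction values, so it is downward closed."""
    return sum(tmv(t) for t in DB if all(i in t for i in itemset)) / TOTAL

def mine(min_share):
    """Level-wise search that only extends itemsets whose bound is high enough."""
    items = sorted({i for t in DB for i in t})
    level = [frozenset([i]) for i in items if tmv_bound([i]) >= min_share]
    result = {}
    while level:
        for x in level:
            s = share(x)
            if s >= min_share:
                result[x] = s
        # Join itemsets of size k into candidates of size k + 1.
        joined = {a | b for a, b in combinations(level, 2)
                  if len(a | b) == len(a) + 1}
        level = [c for c in joined if tmv_bound(c) >= min_share]
    return result
```

Calling `mine(0.3)` returns the itemsets whose share is at least 30% of the total quantity. Because the bound never increases when an item is added, any candidate pruned by `tmv_bound` cannot have a share-frequent superset.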

A Synopsis Based Approach for Itemset Frequency Estimation over Massive Multi-Transaction Stream

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3465238 ◽

2021 ◽

Vol 16 (2) ◽

pp. 1-30

Author(s):

Guangtao Wang ◽

Gao Cong ◽

Ying Zhang ◽

Zhen Hai ◽

Jieping Ye

Keyword(s):

Frequency Estimation ◽

Frequent Itemsets ◽

Frequent Itemset ◽

Experimental Results ◽

Closure Property ◽

Frequent Itemset Mining ◽

Itemset Mining ◽

Minimum Value ◽

Downward Closure ◽

Bounded Size

Streams in which multiple transactions are associated with the same key are prevalent in practice, e.g., a customer has multiple shopping records arriving at different times. Itemset frequency estimation on such streams is very challenging because sampling-based methods, such as the widely used reservoir sampling, cannot be used. In this article, we propose a novel k-Minimum Value (KMV) synopsis based method to estimate the frequency of itemsets over multi-transaction streams. First, we extract the KMV synopsis for each item from the stream. Then, we propose a novel estimator to estimate the frequency of an itemset over the KMV synopses. Compared to the existing estimator, our method is not only more accurate and more efficient to calculate but also follows the downward-closure property. These properties enable our new estimator to be incorporated into existing frequent itemset mining (FIM) algorithms (e.g., FP-Growth) to mine frequent itemsets over multi-transaction streams. To demonstrate this, we implement a KMV synopsis based FIM algorithm by integrating our estimator into existing FIM algorithms, and we prove that it is capable of guaranteeing the accuracy of FIM with a bounded size of KMV synopsis. Experimental results on massive streams show that our estimator significantly improves accuracy for both itemset frequency estimation and FIM compared to the existing estimators.
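
For readers unfamiliar with KMV synopses, the following Python sketch shows the classic building block: keep the k smallest hash values of the keys seen so far and estimate the number of distinct keys from the k-th smallest value. It is a sketch under stated assumptions and does not reproduce the article's itemset estimator; the hash function (SHA-1 mapped to [0, 1)) and the function names are hypothetical choices made for illustration.

```python
import hashlib
import heapq

def h(key):
    """Map a key to a pseudo-uniform float in [0, 1) using SHA-1 (illustrative choice)."""
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_synopsis(keys, k):
    """Scan the stream once and keep the k smallest distinct hash values."""
    heap = []        # max-heap via negated values, holds at most k hashes
    members = set()  # hash values currently stored in the heap
    for key in keys:
        v = h(key)
        if v in members:
            continue  # repeated key: already represented in the synopsis
        if len(heap) < k:
            heapq.heappush(heap, -v)
            members.add(v)
        elif v < -heap[0]:
            evicted = -heapq.heappushpop(heap, -v)
            members.discard(evicted)
            members.add(v)
    return sorted(-x for x in heap)

def estimate_distinct(synopsis, k):
    """Classic KMV estimate (k - 1) / (k-th smallest hash); exact if fewer than k keys."""
    if len(synopsis) < k:
        return float(len(synopsis))
    return (k - 1) / synopsis[k - 1]
```

Because the synopsis hashes keys rather than transactions, a customer who appears in many transactions is counted once, which is the situation the abstract describes. The memory cost is bounded by k regardless of the stream length.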

Research on Data Mining Optimization and Security Based on MapReduce

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.631-632.1053 ◽

2014 ◽

Vol 631-632 ◽

pp. 1053-1056

Author(s):

Hui Xia

Keyword(s):

Data Mining ◽

Execution Time ◽

Cluster Computing ◽

Limited Resource ◽

Experimental Results ◽

Computing Environment ◽

Cluster Systems ◽

National Education ◽

Distributed Cluster ◽

Data Optimization

The paper addresses the problem of limited resources for data optimization, covering the efficiency, reliability, scalability, and security of data in distributed cluster systems with huge datasets. The study's experimental results indicate that the MapReduce tool developed improves data optimization. The system exhibits undesirable speedup with smaller datasets, but reasonable speedup is achieved with datasets large enough to match the number of computing nodes, reducing execution time by 30% compared to normal data mining and processing. The MapReduce tool handles data growth well, especially with a larger number of computing nodes. Scaleup grows gracefully as the data and the number of computing nodes increase. Data security is guaranteed at all computing nodes because data is replicated at various nodes on the cluster system, which also makes it reliable. Our implementation of MapReduce runs in the distributed cluster computing environment of a national education web portal and is highly scalable.

Novel User Level Data Leakage Detection Algorithm

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.g5313.0881019 ◽

2019 ◽

Vol 8 (10) ◽

pp. 2378-2381

Keyword(s):

Performance Metrics ◽

Detection Algorithm ◽

Point Of View ◽

Experimental Results ◽

Detection Technique ◽

Data Sources ◽

Research Issue ◽

Leakage Detection ◽

Important Research ◽

Level Data

Data leakage detection (DLD) is the most widely used detection technique in many applications. Detecting data leakage from various data sources is an important research issue. Several researchers have contributed to data leakage detection by proposing various techniques. In the existing DLD techniques, performance metrics such as accuracy and time have been neglected. In this paper, we propose a new DLD algorithm named the novel user level data leakage detection algorithm (NULDLDA). In the proposed NULDLDA, we consider the user's point of view to determine which agent, among several existing agents, leaked the data. We have implemented NULDLDA and compared it with existing DLD. The experimental results indicate that the proposed NULDLDA improves on DLD with respect to time and accuracy.

An Intelligent Decision in Smart Systems Using A Weighted Frequent Itemset Mining Algorithm

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit1195296 ◽

2019 ◽

pp. 526-536

Author(s):

K. Lavanya ◽

K. Triveni ◽

K. Bala Mamatha ◽

K. Meghana ◽

Dr. G. Sanjay Gandhi

Keyword(s):

Execution Time ◽

Computation Time ◽

Search Space ◽

Frequent Itemsets ◽

Frequent Itemset ◽

Closure Property ◽

Smart Systems ◽

Intelligent Decision ◽

Input Dataset ◽

Downward Closure

Intelligent decision making is the key technology of smart systems. Data mining technology has been playing an increasingly important role in decision-making activities. Once weights are introduced, weighted frequent itemsets no longer satisfy the downward closure property. As a result, the search space of frequent itemsets cannot be narrowed using the downward closure property, which leads to poor time efficiency. In this paper, the weight judgment downward closure property for weighted frequent itemsets and the existence property of weighted frequent subsets are first introduced and proved. The Fuzzy-based WARM satisfies the downward closure property and prunes insignificant rules by assigning a weight to the itemset. This reduces the computation time and execution time. This paper presents an Enhanced Fuzzy-based Weighted Association Rule Mining (E-FWARM) algorithm for efficient mining of frequent itemsets. A pre-filtering method is applied to the input dataset to remove items with low variance. Data discretization is performed, and E-FWARM is applied to mine the frequent itemsets. The experimental results show that the proposed E-FWARM algorithm yields more frequent items and association rules, higher accuracy, and lower execution time than the existing algorithms.
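
The claim that weights break downward closure can be checked with a small Python example. The sketch assumes one common definition of weighted support (mean item weight multiplied by plain support); the abstract does not state which definition the paper uses, and the weights, transactions, and threshold below are hypothetical values picked to expose the effect.

```python
# Hypothetical item weights and transactions, chosen to expose the problem.
WEIGHTS = {"a": 0.1, "b": 0.9}
TRANSACTIONS = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"a"}]
MIN_WSUP = 0.3  # hypothetical minimum weighted support

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in TRANSACTIONS if set(itemset) <= t)
    return hits / len(TRANSACTIONS)

def weighted_support(itemset):
    """Assumed definition: mean item weight times plain support."""
    mean_weight = sum(WEIGHTS[i] for i in itemset) / len(itemset)
    return mean_weight * support(itemset)

def is_weighted_frequent(itemset):
    return weighted_support(itemset) >= MIN_WSUP

if __name__ == "__main__":
    for itemset in (("a",), ("b",), ("a", "b")):
        label = "frequent" if is_weighted_frequent(itemset) else "infrequent"
        print(itemset, round(weighted_support(itemset), 3), label)
```

Here the single item `a` falls below the threshold while its superset `{a, b}` passes it, so pruning `a` early would lose a weighted frequent itemset. This is the situation that a modified closure property has to repair.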

Personalized Web Information Recommendation Based on Data Mining

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.225-226.546 ◽

2011 ◽

Vol 225-226 ◽

pp. 546-549 ◽

Cited By ~ 1

Author(s):

Bo He

Keyword(s):

Data Mining ◽

Recommendation System ◽

Experimental Results ◽

User Profiles ◽

Important Research ◽

Web Information ◽

Research Task ◽

Information Recommendation ◽

Recommendation Strategy

Personalized web information recommendation has become an increasingly important research task. This paper establishes user profiles and puts forward a recommendation strategy. On this basis, the paper designs a personalized web information recommendation system based on data mining, namely PWIRS. The experimental results indicate that the recommendation strategy of PWIRS is feasible.

A Disk-Based Algorithm for Fast Outlier Detection in Large Datasets

Intelligent Databases ◽

10.4018/978-1-59904-120-9.ch002 ◽

2011 ◽

pp. 29-43

Author(s):

Faxin Zhao ◽

Yubin Bao ◽

Huanliang Sun ◽

Ge Yu

Keyword(s):

Data Mining ◽

Outlier Detection ◽

Large Datasets ◽

Index Structure ◽

Research Issue ◽

Important Research ◽

Cluster Technique ◽

Data Points ◽

Data Objects ◽

Number Of Cells

In the data mining field, outlier detection is an important research issue. The number of cells in the cell-based disk algorithm increases exponentially, and the performance of this algorithm decreases dramatically as the number of cells and data points grows. Through further analysis, we find that there are many empty cells that are useless for outlier detection. This chapter therefore proposes a novel index structure, called CD-Tree, in which only non-empty cells are stored, and a cluster technique is adopted to store the data objects in the same cell in linked disk pages. Experiments are conducted to test the performance of the proposed algorithms. The experimental results show that the CD-Tree structure and the cluster technique based disk algorithm outperform the cell-based disk algorithm, and that the proposed algorithm can process higher dimensionality than the old one.
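
The benefit of storing only non-empty cells can be sketched in a few lines of Python. This is not the CD-Tree itself: a plain dictionary keyed by cell coordinates stands in for the tree index, and the function name, cell width, and sample points are hypothetical.

```python
from collections import defaultdict

def build_sparse_grid(points, cell_width):
    """Group points by grid cell, creating entries only for cells that hold data."""
    grid = defaultdict(list)
    for p in points:
        # Integer cell coordinates obtained by flooring each dimension.
        cell = tuple(int(x // cell_width) for x in p)
        grid[cell].append(p)
    return dict(grid)

if __name__ == "__main__":
    # Hypothetical 3-dimensional points in the unit cube.
    pts = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.9, 0.9, 0.9)]
    grid = build_sparse_grid(pts, 0.25)
    cells_per_axis = 4  # 1.0 / 0.25
    print(len(grid), "non-empty cells out of", cells_per_axis ** 3)
```

A dense grid over d dimensions needs a number of cells that is exponential in d, while the number of non-empty cells can never exceed the number of points. That gap is what makes a sparse index attractive as dimensionality grows.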

An Approach for Interesting Subgraph Mining from Web Log Data Using W-Gaston Algorithm

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s0218488519500132 ◽

2019 ◽

Vol 27 (02) ◽

pp. 277-301

Author(s):

N. Jayalakshmi ◽

P. Padmaja ◽

G. Jaya Suma

Keyword(s):

Data Mining ◽

Research Topic ◽

Experimental Results ◽

Frequent Patterns ◽

Graph Database ◽

Essential Information ◽

Log Data ◽

Web Log ◽

Subgraph Mining ◽

Discovery Phase

Graph-Based Data Mining (GBDM) is an emerging research topic concerned with retrieving essential information from graph databases. Many algorithms exist that find frequent patterns in a given graph database. One such algorithm, Gaston, uses frequency-based support to discover frequent patterns. The discovery phase of the Gaston algorithm is time-consuming, and the pages that captured the interest of users are ignored by the existing Gaston algorithm. This paper proposes the Weighted-Gaston (W-Gaston) algorithm, which modifies the existing Gaston algorithm. Four interestingness measures are developed based on frequency, entropy, and page duration for the retrieval of interesting subgraphs. The proposed measures comprise four types of support: (1) support based on page duration (W-Support), (2) support based on entropy (E-Support), (3) support based on page duration and entropy (WE-Support), and (4) support based on frequency, page duration, and entropy (FWE-Support). The proposed work is simulated using the MSNBC and weblog databases. The experimental results show that the proposed algorithm performs well compared with the existing algorithms.

FP-outlier: Frequent pattern based outlier detection

Computer Science and Information Systems ◽

10.2298/csis0501103h ◽

2005 ◽

Vol 2 (1) ◽

pp. 103-118 ◽

Cited By ~ 86

Author(s):

Zengyou He ◽

Xiaofei Xu ◽

Zhexue Huang ◽

Shengchun Deng

Keyword(s):

Data Mining ◽

Outlier Detection ◽

Frequent Itemsets ◽

Research Community ◽

Experimental Results ◽

New Method ◽

Frequent Pattern ◽

Data Detection ◽

Frequent Patterns ◽

Data Set

An outlier in a dataset is an observation or a point that is considerably dissimilar to or inconsistent with the remainder of the data. Detection of such outliers is important for many applications and has recently attracted much attention in the data mining research community. In this paper, we present a new method to detect outliers by discovering frequent patterns (or frequent itemsets) from the data set. Outliers are defined as the data transactions that contain fewer frequent patterns in their itemsets. We define a measure called FPOF (Frequent Pattern Outlier Factor) to detect the outlier transactions and propose the FindFPOF algorithm to discover outliers. The experimental results show that our approach outperforms the existing methods in identifying interesting outliers.
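
A frequent-pattern outlier score of this kind can be sketched in Python as follows. The sketch assumes that FPOF for a transaction is the sum of the supports of the frequent patterns it contains, divided by the total number of frequent patterns; the abstract does not spell the formula out, so treat that as an assumption. The brute-force pattern enumeration, the toy transactions, and the threshold are hypothetical and are not the FindFPOF algorithm.

```python
from itertools import combinations

def frequent_patterns(transactions, min_support):
    """Brute-force enumeration of frequent itemsets; suitable for toy data only."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    patterns = {}
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            s = sum(1 for t in transactions if set(cand) <= t) / n
            if s >= min_support:
                patterns[frozenset(cand)] = s
    return patterns

def fpof(transaction, patterns):
    """Assumed FPOF: summed support of contained patterns over the pattern count."""
    if not patterns:
        return 0.0
    contained = sum(s for p, s in patterns.items() if p <= transaction)
    return contained / len(patterns)

if __name__ == "__main__":
    # Hypothetical transactions; the last one shares no items with the rest.
    data = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"d"}]
    pats = frequent_patterns(data, min_support=0.4)
    scores = [fpof(t, pats) for t in data]
    print(min(range(len(data)), key=lambda i: scores[i]))  # index of the likeliest outlier
```

Transactions with the lowest scores contain the fewest frequent patterns and are the outlier candidates. In this toy data the transaction `{"d"}` contains no frequent pattern at all, so its score is zero.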

Adaptive Initialization Method Based on Spatial Local Information for k-Means Algorithm

Mathematical Problems in Engineering ◽

10.1155/2014/761468 ◽

2014 ◽

Vol 2014 ◽

pp. 1-11 ◽

Cited By ~ 4

Author(s):

Honghong Liao ◽

Jinhai Xiang ◽

Weiping Sun ◽

Jianghua Dai ◽

Shengsheng Yu

Keyword(s):

Machine Learning ◽

Data Mining ◽

Learning Community ◽

Clustering Algorithm ◽

Local Density ◽

Data Distribution ◽

Initial Guess ◽

Local Information ◽

Research Issue ◽

Important Research

The k-means algorithm is a widely used clustering algorithm in the data mining and machine learning communities. However, the initial guess of the cluster centers strongly affects the clustering result, which means that improper initialization cannot lead to a desirable clustering result. How to choose suitable initial centers is an important research issue for the k-means algorithm. In this paper, we propose an adaptive initialization framework based on spatial local information (AIF-SLI), which takes advantage of the local density of the data distribution. As it is difficult to estimate density correctly, we develop two approximate estimations: density by t-nearest neighborhoods (t-NN) and density by ϵ-neighborhoods (ϵ-Ball), leading to two implementations of the proposed framework. Our empirical study on more than 20 datasets shows promising performance of the proposed framework and indicates that it has several advantages: (1) it can find reasonable candidates for initial centers effectively; (2) it can significantly reduce the iterations of k-means methods; (3) it is robust to outliers; and (4) it is easy to implement.
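
The two density estimates named in the abstract can be sketched in Python as follows. The exact formulas and the center selection rule of AIF-SLI are not given here, so everything below is an assumption made for illustration: t-NN density is taken as the inverse of the mean distance to the t nearest neighbours, ϵ-Ball density as the number of other points within ϵ, and `pick_centers` is a hypothetical greedy rule that favours dense points far from the centers already chosen.

```python
import math

def tnn_density(points, t):
    """Assumed t-NN density: inverse mean distance to the t nearest neighbours."""
    dens = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:t]
        dens.append(1.0 / (sum(d) / len(d) + 1e-12))
    return dens

def eps_ball_density(points, eps):
    """Assumed eps-ball density: number of other points within distance eps."""
    return [sum(1 for q in points if math.dist(p, q) <= eps) - 1 for p in points]

def pick_centers(points, density, k):
    """Hypothetical rule: start at the densest point, then maximise density * separation."""
    chosen = [max(range(len(points)), key=lambda i: density[i])]
    while len(chosen) < k:
        def score(i):
            gap = min(math.dist(points[i], points[c]) for c in chosen)
            return density[i] * gap
        chosen.append(max(range(len(points)), key=score))
    return [points[i] for i in chosen]

if __name__ == "__main__":
    # Two tight hypothetical clusters plus one isolated point.
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
    print(pick_centers(pts, eps_ball_density(pts, 1.5), 2))
```

Weighting the separation by density keeps the isolated point from being selected, which is one plausible reading of the robustness to outliers that the abstract reports. Either density function can be passed to `pick_centers`.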

OFCOD: On the Fly Clustering Based Outlier Detection Framework

Data ◽

10.3390/data6010001 ◽

2020 ◽

Vol 6 (1) ◽

pp. 1

Author(s):

Ahmed Elmogy ◽

Hamada Rizk ◽

Amany M. Sarhan

Keyword(s):

Data Mining ◽

Image Processing ◽

Intrusion Detection ◽

Real Time ◽

Outlier Detection ◽

Real World ◽

Medical Data ◽

Experimental Results ◽

Real Time Applications ◽

Real World Datasets

In data mining, outlier detection is a major challenge, as it plays an important role in many applications such as medical data, image processing, fraud detection, and intrusion detection. A wide variety of clustering-based approaches have been developed to detect outliers. However, they are by nature time-consuming, which restricts their use in real-time applications. Furthermore, outlier detection requests are handled one at a time, which means that each request is initiated individually with a particular set of parameters. In this paper, the first clustering-based outlier detection framework, On the Fly Clustering Based Outlier Detection (OFCOD), is presented. OFCOD enables analysts to find outliers effectively, in time and on request, even within huge datasets. The proposed framework has been tested and evaluated using two real-world datasets with different features and applications: one with 699 records and another with five million records. The experimental results show that the proposed framework outperforms other existing approaches across several evaluation metrics.
