scholarly journals Comparing sets of patterns with the Jaccard index

Author(s):  
Sam Fletcher ◽  
Md Zahidul Islam

The ability to extract knowledge from data has been the driving force of Data Mining since its inception, and of statistical modeling long before even that. Actionable knowledge often takes the form of patterns, where a set of antecedents can be used to infer a consequent. In this paper we offer a solution to the problem of comparing different sets of patterns. Our solution allows comparisons between sets of patterns that were derived from different techniques (such as different classification algorithms), or made from different samples of data (such as temporal data or data perturbed for privacy reasons). We propose using the Jaccard index to measure the similarity between sets of patterns by converting each pattern into a single element within the set. Our measure focuses on providing conceptual simplicity, computational simplicity, interpretability, and wide applicability. The results of this measure are compared to prediction accuracy in the context of a real-world data mining scenario.

Data mining helps to solve many problems in the area of medical diagnosis using real-world data. However, much of the data is unrealizable as it does not have desirable features and contains a lot of gaps and errors. A complete set of data is a prerequisite for precise grouping and classification of a dataset. Preprocessing is a data mining technique that transforms the unrefined dataset into reliable and useful data. It is used for resolving the issues and changes raw data for next level processing. Discretization is a necessary step for data preprocessing task. It reduces the large chunks of numeric values to a group of well-organized values. It offers remarkable improvements in speed and accuracy in classification. This paper investigates the impact of preprocessing on the classification process. This work implements three techniques such as NaiveBayes, Logistic Regression, and SVM to classify Diabetes dataset. The experimental system is validated using discretize techniques and various classification algorithms.


2018 ◽  
Vol 7 (3.3) ◽  
pp. 233
Author(s):  
Prakash Chandra Jena ◽  
Subhendu Kumar Pani ◽  
Debahuti Mishra

Several data mining techniques have been proposed to take out hidden information from databases. Data mining and knowledge extraction becomes challenging when data is massive, distributed and heterogeneous. Classification is an extensively applied task in data mining for prediction. Huge numbers of machine learning techniques have been developed for the purpose. Ensemble learning merges multiple base classifiers to improve the performance of individual classification algorithms. In particular, ensemble learning plays a significant role in distributed data mining. So, study of ensemble learning is crucial in order to apply it in real-world data mining problems. We propose a technique to construct ensemble of classifiers and study its performance using popular learning techniques on a range of publicly available datasets from biomedical domain.  


Plants ◽  
2021 ◽  
Vol 10 (1) ◽  
pp. 95
Author(s):  
Heba Kurdi ◽  
Amal Al-Aldawsari ◽  
Isra Al-Turaiki ◽  
Abdulrahman S. Aldawood

In the past 30 years, the red palm weevil (RPW), Rhynchophorus ferrugineus (Olivier), a pest that is highly destructive to all types of palms, has rapidly spread worldwide. However, detecting infestation with the RPW is highly challenging because symptoms are not visible until the death of the palm tree is inevitable. In addition, the use of automated RPW weevil identification tools to predict infestation is complicated by a lack of RPW datasets. In this study, we assessed the capability of 10 state-of-the-art data mining classification algorithms, Naive Bayes (NB), KSTAR, AdaBoost, bagging, PART, J48 Decision tree, multilayer perceptron (MLP), support vector machine (SVM), random forest, and logistic regression, to use plant-size and temperature measurements collected from individual trees to predict RPW infestation in its early stages before significant damage is caused to the tree. The performance of the classification algorithms was evaluated in terms of accuracy, precision, recall, and F-measure using a real RPW dataset. The experimental results showed that infestations with RPW can be predicted with an accuracy up to 93%, precision above 87%, recall equals 100%, and F-measure greater than 93% using data mining. Additionally, we found that temperature and circumference are the most important features for predicting RPW infestation. However, we strongly call for collecting and aggregating more RPW datasets to run more experiments to validate these results and provide more conclusive findings.


2021 ◽  
Author(s):  
Kek Zhi Xuan ◽  
Shuhaida Ismail ◽  
Intan Syazwani Noorain ◽  
Nur Aliaa Dalila A. Muhaime

2007 ◽  
Vol 18 (3) ◽  
pp. 255-279 ◽  
Author(s):  
P. Compieta ◽  
S. Di Martino ◽  
M. Bertolotto ◽  
F. Ferrucci ◽  
T. Kechadi

Sign in / Sign up

Export Citation Format

Share Document