A Comparative Analysis of Classification Algorithms on Diverse Datasets

Data mining is the computational process of finding patterns in large data sets. Classification, one of the main domains of data mining, generalizes a known structure so that it can be applied to a new dataset to predict the class of its instances. Many classification algorithms are used to classify data sets. They are based on different methods, such as probability, decision trees, neural networks, nearest neighbor, Boolean and fuzzy logic, and kernel-based methods. In this paper, we apply three diverse classification algorithms to ten datasets. The datasets were selected based on their size and/or the number and nature of their attributes. Results are discussed using performance evaluation measures such as precision, accuracy, F-measure, Kappa statistics, mean absolute error, relative absolute error, and ROC Area. The comparative analysis uses accuracy, precision, and F-measure. We specify the features and limitations of the classification algorithms for datasets of diverse nature.
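
As a rough illustration of the evaluation measures named in this abstract, the sketch below derives accuracy, precision, recall, F-measure, and Cohen's kappa from binary confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the paper; they only show how the measures relate to each other.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common evaluation measures from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```

With these counts the accuracy is 0.85 while kappa is 0.70, which shows how kappa discounts the agreement that would be expected by chance.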

Comparative Analysis of Decision Tree Algorithms for Data Warehouse Fragmentation*

Revista Perspectiva Empresarial ◽

10.16967/23898186.667 ◽

2020 ◽

Vol 7 (2-1) ◽

pp. 31-43

Author(s):

Nidia Rodríguez Mazahua ◽

Lisbeth Rodríguez Mazahua ◽

Asdrúbal López Chau ◽

Giner Alor Hernández

Keyword(s):

Data Mining ◽

Comparative Analysis ◽

Decision Tree ◽

Data Warehouse ◽

Data Sets ◽

Star Schema ◽

Tree Algorithms ◽

Horizontal Fragmentation ◽

Roc Area ◽

F Measure

One of the main problems faced by Data Warehouse designers is fragmentation. Several studies have proposed data mining-based horizontal fragmentation methods. However, no horizontal fragmentation technique exists that uses a decision tree. This paper presents an analysis of different decision tree algorithms to select the best one for implementing the fragmentation method. The analysis was performed with Weka version 3.9.4, considering four evaluation metrics (Precision, ROC Area, Recall, and F-measure) for different data sets selected using the Star Schema Benchmark. The results showed that the two best algorithms were J48 and Random Forest in most cases; nevertheless, J48 was selected because it is more efficient in building the model.

Linear Approximation of F-measure for the Performance Evaluation of Classification Algorithms on Imbalanced Data Sets

IEEE Transactions on Knowledge and Data Engineering ◽

10.1109/tkde.2020.2986749 ◽

2020 ◽

pp. 1-1

Author(s):

Tzu-Tsung Wong

Keyword(s):

Performance Evaluation ◽

Linear Approximation ◽

Imbalanced Data ◽

Classification Algorithms ◽

Data Sets ◽

Imbalanced Data Sets ◽

F Measure

Comparative analysis of HAR datasets using classification algorithms

Computer Science and Information Systems ◽

10.2298/csis201221043n ◽

2021 ◽

pp. 43-43

Author(s):

Suvra Nayak ◽

Chhabi Panigrahi ◽

Bibudhendu Pati ◽

Sarmistha Nanda ◽

Meng-Yen Hsieh

Keyword(s):

Comparative Analysis ◽

Assisted Living ◽

Absolute Error ◽

Vital Role ◽

Living Condition ◽

Classification Algorithms ◽

Kappa Statistics ◽

Human Society ◽

Learning Approaches ◽

Human Beings

In the current research and development era, Human Activity Recognition (HAR) plays a vital role in analyzing the movements and activities of human beings. The main objective of HAR is to infer current behaviour from previously extracted information. Nowadays, the continuous improvement of human living conditions is changing society dramatically. To detect human activities, various devices, such as smartphones and smart watches, use different types of sensors, such as multimodal sensors and non-video-based and video-based sensors. Among machine learning approaches, classification techniques are adopted extensively across applications, including smart homes with active and assisted living, healthcare, security and surveillance, decision making, tele-immersion, weather forecasting, official tasks, and risk prediction in society. In this paper, we apply three classification algorithms, Sequential Minimal Optimization (SMO), Random Forest (RF), and Simple Logistic (SL), to two HAR datasets, UCI HAR and WISDM, downloaded from the UCI repository. The experiment uses the WEKA tool to evaluate performance with the metrics Kappa statistic, relative absolute error, mean absolute error, ROC Area, and PRC Area under 10-fold cross-validation. We also provide a comparative analysis of the classification algorithms on the two datasets by calculating accuracy together with precision, recall, and F-measure. In the experimental results, all three algorithms achieve nearly the same accuracy of 98% on the UCI HAR dataset. On the WISDM dataset, the RF algorithm achieves an accuracy of 90.69%, better than the others.
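
The 10-fold cross-validation mentioned in this abstract can be sketched as follows. This is a minimal, unstratified version written in plain Python under stated assumptions: the helper name `k_fold_indices`, the seed, and the sample count are hypothetical, and a tool such as WEKA may partition the data differently (for example, by stratifying on the class label).

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical dataset of 25 samples, split into 10 folds.
splits = list(k_fold_indices(n_samples=25, k=10))
```

Each sample lands in the test set of exactly one fold, so every instance is used for evaluation once and for training in the remaining folds.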

Data Mining Approach Improving Decision-Making Competency along the Business Digital Transformation Journey: A Case Study – Home Appliances after Sales Service

SEEU Review ◽

10.2478/seeur-2021-0008 ◽

2021 ◽

Vol 16 (1) ◽

pp. 45-65

Author(s):

Hyrmet Mydyti

Keyword(s):

Data Mining ◽

Decision Making ◽

Nearest Neighbor ◽

Practical Implication ◽

Absolute Error ◽

Digital Transformation ◽

Classification Algorithms ◽

K Nearest Neighbor ◽

Home Appliances

Data mining, as an essential part of artificial intelligence, is a powerful digital technology that enables businesses to predict future trends, ease the decision-making process, and enhance customer experience along their digital transformation journey. This research offers a practical implication, in the form of a case study, that provides guidance on analyzing information and predicting repairs in the home appliances after-sales services business. The main benefit of this practical comparative study of various classification algorithms, carried out with the Weka tool, is the analysis of information and the prediction of repairs in that business. The algorithms are compared on different parameters, such as the mean absolute error, root mean square error, relative absolute error, root relative squared error, receiver operating characteristic area, accuracy, Matthews correlation coefficient, precision-recall curve, precision, F-measure, recall, and statistical criteria. Five classification algorithms, namely Naive Bayes, J48, random forest, K-Nearest Neighbor, and logistic regression, were applied to the dataset. J48 provided the best accuracy and the lowest error among the examined algorithms when applied to a home appliances after-sales services dataset to predict repairs based on the product guarantee period. The information and results extracted from an after-sales services business using data mining techniques help streamline decision-making and provide reliable predictions, especially for customers, and increase businesses' efficiency along their digital transformation journey.
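
The error measures listed in this abstract can be illustrated with their usual textbook definitions. The sketch below is an assumption-laden example, not the paper's or Weka's exact computation: the function names and the sample values are hypothetical, and the relative measures are taken against a baseline that always predicts the mean of the actual values.

```python
import math


def error_measures(actual, predicted):
    """Return MAE, RMSE, RAE and RRSE for paired numeric values.

    The relative measures are undefined when all actual values are equal.
    """
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Errors of the baseline that always predicts the mean.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae": abs_err / base_abs,
        "rrse": math.sqrt(sq_err / base_sq),
    }


def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical values, chosen only for illustration.
errors = error_measures(actual=[1, 0, 1, 1, 0],
                        predicted=[0.9, 0.2, 0.6, 0.8, 0.4])
mcc_perfect = matthews_corrcoef(tp=5, fp=0, fn=0, tn=5)
```

A relative absolute error below 1 (that is, below 100%) means the predictions beat the mean-only baseline; a perfect classifier has a Matthews correlation coefficient of 1.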

Early Detection of Red Palm Weevil, Rhynchophorus ferrugineus (Olivier), Infestation Using Data Mining

Plants ◽

10.3390/plants10010095 ◽

2021 ◽

Vol 10 (1) ◽

pp. 95

Author(s):

Heba Kurdi ◽

Amal Al-Aldawsari ◽

Isra Al-Turaiki ◽

Abdulrahman S. Aldawood

Keyword(s):

Data Mining ◽

Plant Size ◽

Support Vector ◽

Classification Algorithms ◽

Palm Tree ◽

Rhynchophorus Ferrugineus ◽

Red Palm Weevil ◽

Palm Weevil ◽

Using Data ◽

F Measure

In the past 30 years, the red palm weevil (RPW), Rhynchophorus ferrugineus (Olivier), a pest that is highly destructive to all types of palms, has rapidly spread worldwide. However, detecting infestation with the RPW is highly challenging because symptoms are not visible until the death of the palm tree is inevitable. In addition, the use of automated RPW identification tools to predict infestation is complicated by a lack of RPW datasets. In this study, we assessed the capability of 10 state-of-the-art data mining classification algorithms, Naive Bayes (NB), KSTAR, AdaBoost, bagging, PART, J48 decision tree, multilayer perceptron (MLP), support vector machine (SVM), random forest, and logistic regression, to use plant-size and temperature measurements collected from individual trees to predict RPW infestation in its early stages, before significant damage is caused to the tree. The performance of the classification algorithms was evaluated in terms of accuracy, precision, recall, and F-measure using a real RPW dataset. The experimental results showed that infestations with RPW can be predicted with an accuracy of up to 93%, precision above 87%, recall of 100%, and F-measure greater than 93% using data mining. Additionally, we found that temperature and circumference are the most important features for predicting RPW infestation. However, we strongly call for collecting and aggregating more RPW datasets to run more experiments to validate these results and provide more conclusive findings.
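
As a quick consistency check on the reported figures, assuming the F-measure is the usual balanced harmonic mean of precision $P$ and recall $R$, and taking the stated bounds $P = 0.87$ and $R = 1$ as illustrative inputs rather than the paper's exact values:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.87 \times 1}{0.87 + 1}
  = \frac{1.74}{1.87}
  \approx 0.930
```

Any precision above 87% with 100% recall therefore gives an F-measure above roughly 93%, which agrees with the figures in the abstract.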

A Survey on Major Classification Algorithms and Comparative Analysis of Few Classification Algorithms on Contact Lenses Data Set Using Data Mining Tool

New Trends in Computational Vision and Bio-inspired Computing ◽

10.1007/978-3-030-41862-5_121 ◽

2020 ◽

pp. 1201-1209

Author(s):

Syed Nawaz Pasha ◽

D. Ramesh ◽

Mohammad Sallauddin

Keyword(s):

Data Mining ◽

Comparative Analysis ◽

Contact Lenses ◽

Classification Algorithms ◽

Data Set ◽

Data Mining Tool ◽

Mining Tool ◽

Using Data

Web Graph Clustering for Displays and Navigation of Cyberspace

Web Mining ◽

10.4018/978-1-59140-414-9.ch012 ◽

2011 ◽

pp. 253-275

Author(s):

Xiaodi Huang ◽

Wei Lai

Keyword(s):

Data Mining ◽

Nearest Neighbor ◽

Structural Information ◽

Graph Clustering ◽

Data Sets ◽

K Nearest Neighbor ◽

New Approach ◽

Web Graph

This chapter presents a new approach to clustering graphs, and applies it to Web graph display and navigation. The proposed approach takes advantage of the linkage patterns of graphs, and utilizes an affinity function in conjunction with the k-nearest neighbor. This chapter uses Web graph clustering as an illustrative example, and offers a potentially more applicable method to mine structural information from data sets, with the hope of informing readers of another aspect of data mining and its applications.

Performance Evaluation of Classification Algorithms on Different Data Sets

Indian Journal of Science and Technology ◽

10.17485/ijst/2016/v9i40/99425 ◽

2016 ◽

Vol 9 (40) ◽

Author(s):

Meenu Gupta ◽

Deepak Dahiya

Keyword(s):

Performance Evaluation ◽

Classification Algorithms ◽

Data Sets

Detection of Anomalous Transactions in Mobile Payment Systems

International Journal of Data Analytics ◽

10.4018/ijda.2020070105 ◽

2020 ◽

Vol 1 (2) ◽

pp. 58-66

Author(s):

Ibrar Hussain ◽

Muhammad Asif

Keyword(s):

Data Mining ◽

Comparative Analysis ◽

Economic Activity ◽

Research Study ◽

Mobile Payment ◽

Financial Fraud ◽

Classification Algorithms ◽

Classification Models ◽

Payment Systems ◽

Substantial Impact

Mobile payment systems give smartphone users an easy way to transfer money to each other. This simple way of transferring money has great potential for economic activity. However, fraudulent transactions may occur and can have a substantial impact on the economy of a country. Financial fraud and anomalous transactions can cause losses of billions of dollars annually. Therefore, there is a need to detect anomalous transactions in mobile payment systems to prevent financial fraud. For this research study, a synthetic dataset was generated using the PAYSIM simulator because no realistic dataset was available. The study performed experiments on a financial transactional dataset using eight data mining classification algorithms. The performance of the classification models was measured using the evaluation metrics accuracy, precision, F-score, recall, and specificity. A comparative analysis of the classification models was also performed based on their performance.
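
One reason to report recall and specificity alongside accuracy for anomaly detection can be shown with a trivial baseline. The sketch below is illustrative only: the function name `baseline_report` and the class counts are hypothetical and do not come from the PAYSIM dataset used in the study.

```python
def baseline_report(n_legit, n_fraud):
    """Score a trivial model that labels every transaction as legitimate."""
    tp, fn = 0, n_fraud      # every fraudulent case is missed
    tn, fp = n_legit, 0      # every legitimate case is labelled correctly
    total = n_legit + n_fraud
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical imbalanced data: 990 legitimate and 10 fraudulent transactions.
report = baseline_report(n_legit=990, n_fraud=10)
```

Here the baseline reaches 99% accuracy and perfect specificity while detecting no fraud at all (recall of 0), so accuracy alone would overstate its usefulness.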

COMPARATIVE STUDY OF CLASSIFICATION ALGORITHMS: HOLDOUTS AS ACCURACY ESTIMATION

CogITo Smart Journal ◽

10.31154/cogito.v1i1.2.13-23 ◽

2016 ◽

Vol 1 (1) ◽

pp. 13 ◽

Cited By ~ 1

Author(s):

Debby Erce Sondakh

Keyword(s):

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Decision Rules ◽

Naïve Bayes ◽

Support Vector ◽

Classification Algorithms ◽

K Nearest Neighbor ◽

Accuracy Estimation ◽

F Measure

This study aims to measure and compare the performance of five machine learning-based text classification algorithms, namely decision rules, decision tree, k-nearest neighbor (k-NN), naïve Bayes, and Support Vector Machine (SVM), using multi-class text documents. The comparison is made on the effectiveness of the algorithms, that is, their ability to classify documents into the correct category, using the holdout or percentage split method. The effectiveness measures used are precision, recall, F-measure, and accuracy. The experimental results show that for the naïve Bayes algorithm, the larger the percentage of training documents, the higher the accuracy of the resulting model. The highest accuracy was obtained at a 90/10 split for naïve Bayes, 80/20 for SVM, and 70/30 for decision tree. The experimental results also show that naïve Bayes has the highest effectiveness among the five algorithms tested and the fastest time to build the classification model, 0.02 seconds. The decision tree algorithm can classify text documents with higher accuracy than SVM, but its model building time is slower. In terms of model building time, k-NN is the fastest, but its accuracy is lower.
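
The holdout (percentage split) method described in this abstract can be sketched as a single shuffled split into training and test portions. This is a minimal illustration under stated assumptions: the helper name `holdout_split`, the seed, and the document count are hypothetical, and the study's own tooling may shuffle or split differently.

```python
import random


def holdout_split(items, train_fraction=0.9, seed=0):
    """Shuffle items once and split them into training and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical corpus of 100 document ids, split at the three ratios
# mentioned in the abstract (90/10, 80/20, 70/30).
docs = list(range(100))
sizes = {}
for fraction in (0.9, 0.8, 0.7):
    train, test = holdout_split(docs, train_fraction=fraction)
    sizes[fraction] = (len(train), len(test))
```

Unlike cross-validation, each document is used either for training or for testing, so the estimate depends on the one split that was drawn.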
