COMPARATIVE STUDY OF CLASSIFICATION ALGORITHMS: HOLDOUTS AS ACCURACY ESTIMATION

CogITo Smart Journal ◽

10.31154/cogito.v1i1.2.13-23 ◽

2016 ◽

Vol 1 (1) ◽

pp. 13 ◽

Cited By ~ 1

Author(s):

Debby Erce Sondakh

Keyword(s):

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Decision Rules ◽

Naïve Bayes ◽

Support Vector ◽

Classification Algorithms ◽

K Nearest Neighbor ◽

Accuracy Estimation ◽

F Measure

This study measures and compares the performance of five machine-learning text classification algorithms, namely decision rules, decision tree, k-nearest neighbor (k-NN), naïve Bayes, and Support Vector Machine (SVM), on multi-class text documents. The comparison concerns the algorithms' effectiveness, that is, their ability to assign documents to the correct category, using the holdout (percentage split) method. Effectiveness is measured by precision, recall, F-measure, and accuracy. The experimental results show that, for naïve Bayes, the larger the percentage of training documents, the higher the accuracy of the resulting model. Naïve Bayes reached its highest accuracy at a 90/10 split, SVM at 80/20, and decision tree at 70/30. The results also show that naïve Bayes had the highest effectiveness among the five algorithms tested and the fastest model-building time, 0.02 seconds. The decision tree classified text documents with higher accuracy than SVM, but its model took longer to build. In terms of model-building time, k-NN was the fastest, but its accuracy was lower.
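
The holdout (percentage split) procedure and the four effectiveness measures named in this abstract can be sketched as follows. This is a minimal illustration, not the study's own code: the function names `holdout_split` and `evaluate`, the fixed random seed, and the choice of macro-averaging across classes are all assumptions made for the example.

```python
import random


def holdout_split(items, train_fraction, seed=0):
    """Shuffle a copy of items and split it into (train, test) parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def evaluate(true_labels, predicted_labels):
    """Return accuracy plus macro-averaged precision, recall and F-measure."""
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    precisions, recalls, f_measures = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(f_measure)
    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_measure": sum(f_measures) / n,
    }


# Hypothetical usage: a 90/10 split of 100 document ids.
train_ids, test_ids = holdout_split(range(100), 0.9)
```

Repeating the split with fractions such as 0.7, 0.8, and 0.9 and comparing the resulting metric dictionaries mirrors the 70/30, 80/20, and 90/10 comparison described above.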

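KOMPARASI METODE KLASIFIKASI PADA ANALISIS SENTIMEN USAHA WARALABA BERDASARKAN DATA TWITTER [Comparison of Classification Methods for Sentiment Analysis of Franchise Businesses Based on Twitter Data]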

Jurnal Pilar Nusa Mandiri ◽

10.33480/pilar.v15i2.752 ◽

2019 ◽

Vol 15 (2) ◽

pp. 267-274

Author(s):

Tati Mardiana ◽

Hafiz Syahreva ◽

Tuslaela Tuslaela

Keyword(s):

Neural Network ◽

Support Vector Machine ◽

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Confusion Matrix ◽

Naïve Bayes ◽

Support Vector ◽

K Nearest Neighbor

Franchise businesses in Indonesia are currently relatively attractive. However, many franchise operators also fail. Anyone who wants to start such a business needs to consider public sentiment toward franchises. Sentiment analysis is not easy, however, because Twitter conversations about franchise businesses are numerous and unstructured. This study compares the accuracy of Neural Network, K-Nearest Neighbor, Naïve Bayes, Support Vector Machine, and Decision Tree in extracting attributes from documents or texts containing comments, in order to identify the sentiment they express and classify them as positive or negative comments. The study uses real-time data from tweets on Twitter. The data are first cleaned of noise using Python. Testing with a confusion matrix yielded accuracies of 83% for Neural Network, 52% for K-Nearest Neighbor, 83% for Support Vector Machine, and 81% for Decision Tree. The study shows that Support Vector Machine and Neural Network are the best methods for classifying positive and negative comments about franchise businesses.
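
As a rough illustration of how an accuracy figure is read off a confusion matrix for positive/negative comments, consider the sketch below. The function names and the six example labels are invented for the illustration and are not data from the study.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, labels):
    """Rows are true classes, columns are predicted classes, in `labels` order."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy_from_matrix(matrix):
    """Accuracy is the diagonal (correct predictions) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Made-up labels for six comments, for illustration only.
true_sentiment = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted_sentiment = ["pos", "neg", "neg", "neg", "pos", "pos"]
matrix = confusion_matrix(true_sentiment, predicted_sentiment, ["pos", "neg"])
accuracy = accuracy_from_matrix(matrix)
```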

Real Time Smartphone Data for Prediction of Nomophobia Severity using Supervised Machine Learning

10.21467/proceedings.114.11 ◽

2021

Author(s):

Anshika Arora ◽

Pinaki Chakraborty ◽

M.P.S. Bhatia

Keyword(s):

Machine Learning ◽

Real Time ◽

Undergraduate Students ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Supervised Machine Learning ◽

Support Vector ◽

K Nearest Neighbor ◽

F Measure

Excessive smartphone use throughout the day, with dependence on the device for social interaction, entertainment, and information retrieval, may lead users to develop nomophobia, which makes them feel anxious when a smartphone is unavailable. This study describes the usefulness of real-time smartphone usage data for predicting nomophobia severity with machine learning. Data were collected from 141 undergraduate students, capturing their perceptions of their smartphone use through the Nomophobia Questionnaire (NMP-Q) and their real-time usage patterns through a purpose-built Android application. Supervised machine learning models, including Random Forest, Decision Tree, Support Vector Machines, Naïve Bayes, and K-Nearest Neighbor, were trained on two feature sets: the first comprises only the NMP-Q features, and the second adds real-time smartphone usage features to the NMP-Q features. Model performance was evaluated using F-measure and area under the ROC curve. All models performed better when given the smartphone usage features along with the NMP-Q features. Naïve Bayes outperformed the other models in predicting nomophobia, achieving an F-measure of 0.891 and an ROC area of 0.933.

Integrating synthetic minority oversampling and gradient boosting decision tree for bogie fault diagnosis in rail vehicles

Proceedings of the Institution of Mechanical Engineers Part F Journal of Rail and Rapid Transit ◽

10.1177/0954409718795089 ◽

2018 ◽

Vol 233 (3) ◽

pp. 312-325 ◽

Cited By ~ 11

Author(s):

Linlin Kou ◽

Yong Qin ◽

Xunjun Zhao ◽

Yong Fu

Keyword(s):

Support Vector Machine ◽

Fault Diagnosis ◽

Fault Detection ◽

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbor

Bogies are critical components of a rail vehicle and are important for the safe operation of rail transit. In this study, the authors analyzed real vibration data from the bogies of a railway vehicle, obtained from a Chinese subway company under four different operating conditions. The authors selected 15 feature indexes, covering time-domain, energy, and entropy measures, as well as their correlations. The adaptive synthetic sampling approach–gradient boosting decision tree (ADASYN–GBDT) method is proposed for bogie fault diagnosis. ADASYN–GBDT was compared with three commonly used classifiers (K-nearest neighbor, support vector machine, and Gaussian naïve Bayes), each combined with random forest for feature selection, under different test data sizes. A confusion matrix was used to evaluate the classifiers. With K-nearest neighbor, support vector machine, and Gaussian naïve Bayes, the optimal features must be selected first, whereas the proposed method does not require feature selection. K-nearest neighbor, support vector machine, and Gaussian naïve Bayes produced inaccurate results in multi-class identification. The lowest false detection rates of the proposed ADASYN–GBDT model are 92.95% and 87.81% when the proportion of the test dataset is 0.4 and 0.9, respectively. In addition, the ADASYN–GBDT model can correctly identify a fault, which makes it more practical and suitable for use in railway operations. The entire process (training and testing) finished in 2.4231 s, and the detection procedure took 0.0027 s on average. The results show that the proposed ADASYN–GBDT method satisfies the real-time performance and accuracy requirements of online fault detection. It might therefore aid in the fault detection of bogies.

Multiclass Severity Classification for Software Bugs Using Support Vector Machine, K-Nearest Neighbor, Decision Tree and Naïve Bayes

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9348 ◽

2020 ◽

Vol 17 (11) ◽

pp. 5109-5112

Author(s):

Raj Kumar ◽

Sanjay Singla

Keyword(s):

Decision Tree ◽

Software Development ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

Data Mining Algorithm ◽

K Nearest Neighbor ◽

Software Bugs ◽

The Impact

During software development, almost 30–35 percent of the cost is due to testing. If a bug passes from one phase to later phases undetected, it increases the cost of development and may compromise software quality. Using data mining algorithms for software bug classification is therefore valuable. Bug severity may be categorised into S1, S2, S3, S4 and S5, depending on the impact of the bug. In this paper, multiclass classification of bug severity is performed using SVM, KNN, Decision Tree and Naïve Bayes. These algorithms are compared with respect to accuracy, precision, recall and execution time.

Fake News Detection from Online media using Machine learning Classifiers

Journal of Physics Conference Series ◽

10.1088/1742-6596/2161/1/012027 ◽

2022 ◽

Vol 2161 (1) ◽

pp. 012027

Author(s):

Shalini Pandey ◽

Sankeerthi Prabhakaran ◽

N V Subba Reddy ◽

Dinesh Acharya

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Imbalanced Data ◽

Naïve Bayes ◽

Support Vector ◽

Fake News ◽

K Nearest Neighbor

With advances in technology, news consumption has shifted from print media to social media. Convenience and accessibility are major factors behind this shift. However, the change has brought about a new challenge in the form of “fake news” spreading with little supervision on the net. This paper addresses the challenge with machine learning. K-Nearest Neighbor, Support Vector Machine, Decision Tree, Naïve Bayes, and Logistic Regression classifiers are used to distinguish fake news from real news in a given dataset, and their efficiency is increased by pre-processing the data to handle class imbalance more appropriately. A comparison of how these classifiers work is presented along with the results. In the experiment, the proposed model achieved an accuracy of 89.98% for KNN, 90.46% for Logistic Regression, 86.89% for Naïve Bayes, 73.33% for Decision Tree, and 89.33% for SVM.

Classification methods comparison for customer churn prediction in the telecommunication industry

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.12.001 ◽

2021 ◽

Vol 8 (12) ◽

pp. 1-8

Author(s):

Makruf et al.

Keyword(s):

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Telecommunication Service ◽

Support Vector ◽

Telecommunication Industry ◽

Classification Methods ◽

K Nearest Neighbor ◽

Customer Churn

The need for telecommunication services has increased dramatically in schools, offices, entertainment, and other areas. At the same time, competition between telecommunication companies is getting tougher. Customer churn is one of the areas in which a company can gain competitive advantage. This paper compares several classification methods for predicting whether customers will cancel their subscription to a telecommunication service, highlighting key factors of customer churn. The comparison is non-trivial given the telecommunication industry's urgent need to identify the most appropriate techniques for analyzing customer churn, and it is often of huge commercial value. The results show that Artificial Neural Network (ANN) predicts churn with an accuracy of 79%, Support Vector Machine (SVM) with 78%, Gaussian Naïve Bayes and K-Nearest Neighbor (KNN) with 75%, and Decision Tree with 70%. Moreover, the technique with the highest F-Measure is Gaussian Naïve Bayes at 65%, and the one with the lowest is Decision Tree at 49%. Hence, ANN and Gaussian Naïve Bayes are the two methods most recommended for predicting customer churn in the telecommunication industry.

RB-Bayes algorithm for the prediction of diabetic in Pima Indian dataset

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v9i6.pp4866-4872 ◽

2019 ◽

Vol 9 (6) ◽

pp. 4866

Author(s):

Rajni Rajni ◽

Amandeep Amandeep

Keyword(s):

Nearest Neighbor ◽

Naive Bayes ◽

Early Stage ◽

Human Life ◽

Naïve Bayes ◽

Support Vector ◽

Pima Indians ◽

K Nearest Neighbor ◽

Fast Pace ◽

Bayes Algorithm

Diabetes is a major concern all over the world, and it is increasing at a fast pace. People can avoid diabetes at an early stage without any test. The goal of this paper is to predict, at an early stage, the probability that a person is at risk of diabetes, which could greatly improve quality of life. The datasets are Pima Indians diabetes and Cleveland coronary illness and consist of 768 records. Although a number of solutions are available for extracting information from huge datasets and predicting the possibility of diabetes, their accuracy is far from satisfactory. To achieve the highest accuracy, the zero-probability problem commonly faced by naïve Bayes analysis needs to be addressed suitably. The proposed framework, RB-Bayes, aims to extract the required information with high accuracy while surviving the zero-probability problem, and to compare its accuracy with other methods such as Support Vector Machine, Naive Bayes, and K Nearest Neighbor. We used the mean to handle missing data and calculated the probability for yes (positive) and no (negative); the higher of the two values decides the class of the tuple. It is mostly used in text classification. The outcomes on the Pima Indian diabetes dataset demonstrate that the proposed methodology enhances precision compared with other supervised methods. The accuracy of the proposed methodology on the large dataset is 72.9%.
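
The zero-probability problem mentioned in this abstract can be illustrated with a small categorical naïve Bayes sketch. The abstract does not say how RB-Bayes handles the problem, so the example below swaps in Laplace (add-alpha) smoothing, a common textbook remedy, purely for illustration. The function names, the `alpha` parameter, and the five toy records are assumptions, not material from the paper.

```python
from collections import Counter, defaultdict


def train(rows, labels, alpha=1.0):
    """Count class and per-feature value frequencies for categorical naive Bayes."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    feature_values = defaultdict(set)    # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values, alpha


def class_scores(model, row):
    """Unnormalised P(class) * product of P(value | class), with add-alpha smoothing."""
    class_counts, value_counts, feature_values, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            score *= (count + alpha) / (n + alpha * len(feature_values[i]))
        scores[label] = score
    return scores


def predict(model, row):
    """Pick the class with the higher score, as in the yes/no rule described above."""
    scores = class_scores(model, row)
    return max(scores, key=scores.get)


# Invented toy records: (glucose level, weight category) -> diabetes risk.
rows = [("high", "obese"), ("high", "normal"), ("normal", "obese"),
        ("normal", "normal"), ("normal", "normal")]
labels = ["yes", "yes", "yes", "no", "no"]
unsmoothed = class_scores(train(rows, labels, alpha=0.0), ("high", "obese"))
smoothed = class_scores(train(rows, labels, alpha=1.0), ("high", "obese"))
```

With `alpha=0.0`, the "no" class scores exactly zero because "high" never occurs with "no" in the toy data, so one unseen value wipes out all other evidence. With `alpha=1.0`, the same class keeps a small non-zero score.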

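Prediksi Harga Minyak Kelapa Sawit Dalam Investasi Dengan Membandingkan Algoritma Naïve Bayes, Support Vector Machine dan K-Nearest Neighbor [Predicting Palm Oil Prices for Investment by Comparing the Naïve Bayes, Support Vector Machine, and K-Nearest Neighbor Algorithms]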

IT for Society ◽

10.33021/itfs.v4i1.1181 ◽

2019 ◽

Vol 4 (1)

Author(s):

Deny Haryadi ◽

Rila Mandala

Keyword(s):

Support Vector Machine ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

K Nearest Neighbor

Palm oil prices can rise, fall, or stay flat from day to day, driven by factors such as the prices of other vegetable oils (soybean oil and canola oil), world crude oil prices, and the real exchange rate of the dollar against the currencies of producer countries (rupiah, ringgit, and the Canadian dollar) or of consumer countries (rupee). A reasonably accurate palm oil price prediction is therefore needed so that investors can earn returns in line with their plans. This study compares the accuracy, precision, and recall achieved by Naïve Bayes, Support Vector Machine, and K-Nearest Neighbor on the problem of predicting palm oil prices for investment. In the tests carried out, Support Vector Machine achieved the highest accuracy, precision, and recall, ahead of Naïve Bayes and K-Nearest Neighbor. The highest accuracy in this study was 82.46%, with a highest precision of 86% and a highest recall of 89.06%.

Sentiment Analysis about E-Commerce from Tweets Using Decision Tree, K-Nearest Neighbor, and Naïve Bayes

2018 International Conference on Orange Technologies (ICOT) ◽

10.1109/icot.2018.8705796 ◽

2018 ◽

Cited By ~ 2

Author(s):

Achmad Bayhaqy ◽

Sfenrianto Sfenrianto ◽

Kaman Nainggolan ◽

Emil R. Kaburuan

Keyword(s):

Decision Tree ◽

Sentiment Analysis ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

K Nearest Neighbor

Centroid Based Classifier With TF – IDF – ICF for Classfication of Student’s Complaint at Appliation E-Complaint in Muhammadiyah University of Sidoarjo

JEEE-U (Journal of Electrical and Electronic Engineering-UMSIDA) ◽

10.21070/jeee-u.v1i1.23 ◽

2016 ◽

Vol 1 (1) ◽

pp. 17 ◽

Cited By ~ 1

Author(s):

Mochamad Alfan Rosid ◽

Gunawan Gunawan ◽

Edwin Pramana

Keyword(s):

Text Mining ◽

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

K Nearest Neighbor ◽

Base Classifier

Text mining refers to the process of extracting high-quality information from text. Such information is usually obtained by identifying patterns and trends through means such as statistical pattern learning. One important text mining task is text classification, or categorization. Text categorization currently has a range of methods, including K-Nearest Neighbor, Naïve Bayes, Centroid Based Classifier, and decision tree classification. In this study, student complaints are classified with a centroid based classifier using TF-IDF-ICF features. Classification proceeds in five stages. Complaint data are collected first; preprocessing then prepares the unstructured data for the later stages; the data are next split into training data and test data; the training stage produces the classification model; and the final, testing stage evaluates that model on the test data. The complaints used for testing are taken from the database of the e-complaint application of Universitas Muhammadiyah Sidoarjo. The trials show that classifying complaints with the centroid based classifier and TF-IDF-ICF features achieves a fairly high average accuracy of 79.5%. Accuracy increases as the training data grow, while system efficiency decreases as the training data grow.
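
The training and testing stages described in this abstract might look roughly like the sketch below. The abstract does not give the exact weighting formula, so the smoothed form used here (term frequency times log(1 + N/df) times log(1 + C/cf), where cf is the number of classes containing the term) is an assumption, as are the function names, the cosine-similarity decision rule, and the four toy complaints with their two category labels.

```python
import math
from collections import Counter, defaultdict


def vectorize(tokens, term_weight):
    """Build a unit-length weighted vector; terms unseen in training are ignored."""
    tf = Counter(t for t in tokens if t in term_weight)
    vec = {t: c * term_weight[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def fit(train_docs):
    """train_docs is a list of (tokens, label); returns (term weights, centroids)."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    class_terms = defaultdict(set)
    for tokens, label in train_docs:
        doc_freq.update(set(tokens))
        class_terms[label].update(tokens)
    n_classes = len(class_terms)
    class_freq = Counter(t for terms in class_terms.values() for t in terms)
    # Assumed IDF * ICF weighting: rarer terms and class-specific terms weigh more.
    term_weight = {
        t: math.log(1 + n_docs / doc_freq[t]) * math.log(1 + n_classes / class_freq[t])
        for t in doc_freq
    }
    sums = defaultdict(Counter)
    counts = Counter()
    for tokens, label in train_docs:
        for t, w in vectorize(tokens, term_weight).items():
            sums[label][t] += w
        counts[label] += 1
    # Each centroid is the mean of the class's document vectors.
    centroids = {
        label: {t: w / counts[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }
    return term_weight, centroids


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, model):
    """Assign the label of the centroid most similar to the document vector."""
    term_weight, centroids = model
    vec = vectorize(tokens, term_weight)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Invented toy complaints, already tokenised (the preprocessing stage is skipped).
training = [
    (["wifi", "slow", "library"], "facilities"),
    (["projector", "broken", "classroom"], "facilities"),
    (["grade", "missing", "portal"], "academic"),
    (["schedule", "clash", "grade"], "academic"),
]
model = fit(training)
```

Because every training document contributes to a centroid and each test document is compared against every centroid, training cost in this sketch grows with the training set, which is one possible reading of the accuracy and efficiency trade-off noted at the end of the abstract.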
