Comparison estimating of classification error rate in decision tree: Data mining

Decision trees (DTs) typically split on one variable at a time, so the final partition has boundaries that are parallel to the axes. An observation is misclassified when it falls in a region whose assigned class differs from its own. The misclassification rate of a classification tree is defined as the proportion of observations assigned to the wrong class, whereas the error of a regression tree is defined as the mean squared error. All classification procedures, including decision trees, can produce errors, and in this paper we present two important methods for estimating the misclassification (error) rate of decision trees. We constructed DT models on a training dataset and tested them on an independent test dataset. Several procedures exist for estimating the error rate of tree-structured classifiers, such as K-fold cross-validation and bootstrap estimates. The comparison aims to characterize the performance of the two methods in terms of test error rates on real datasets. The results indicate that 10-fold cross-validation and the bootstrap yield a tree fairly close to the best available, as measured by tree size.
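
The two estimators named in this abstract can be illustrated with a minimal sketch. It is not the authors' procedure: `fit_stump` is a hypothetical one-split stand-in for a full tree learner, labels are assumed to be binary (0/1), and the bootstrap variant shown is the out-of-bag form, which the abstract does not specify.

```python
import random


def error_rate(y_true, y_pred):
    """Proportion of observations assigned to the wrong class."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_stump(X, y):
    """Hypothetical stand-in for a tree learner: one axis-parallel split
    (binary labels 0/1) chosen to minimize the training error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for left in (0, 1):
                pred = [left if row[j] <= thr else 1 - left for row in X]
                err = error_rate(y, pred)
                if best is None or err < best[0]:
                    best = (err, j, thr, left)
    _, j, thr, left = best
    return lambda row: left if row[j] <= thr else 1 - left


def kfold_error(X, y, k=10, seed=0):
    """K-fold cross-validation estimate of the misclassification rate."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    wrong = 0
    for f in range(k):
        test = idx[f::k]                      # every k-th shuffled index
        held = set(test)
        train = [i for i in idx if i not in held]
        model = fit_stump([X[i] for i in train], [y[i] for i in train])
        wrong += sum(model(X[i]) != y[i] for i in test)
    return wrong / len(X)


def bootstrap_error(X, y, n_boot=50, seed=0):
    """Out-of-bag bootstrap estimate: train on a resample drawn with
    replacement, test on the observations left out of it, then average."""
    rng = random.Random(seed)
    n = len(X)
    errs = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        drawn = set(sample)
        oob = [i for i in range(n) if i not in drawn]
        if not oob:                           # no held-out points this round
            continue
        model = fit_stump([X[i] for i in sample], [y[i] for i in sample])
        errs.append(error_rate([y[i] for i in oob],
                               [model(X[i]) for i in oob]))
    return sum(errs) / len(errs)
```

Calling `kfold_error(X, y, k=10)` and `bootstrap_error(X, y)` on the same data gives two estimates of the same test error, which is the kind of comparison the abstract describes.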

Predictor Selection for Bacterial Vaginosis Diagnosis Using Decision Tree and Relief Algorithms

Applied Sciences ◽

10.3390/app10093291 ◽

2020 ◽

Vol 10 (9) ◽

pp. 3291

Author(s):

Jesús F. Pérez-Gómez ◽

Juana Canul-Reich ◽

José Hernández-Torruco ◽

Betania Hernández-Ocaña

Keyword(s):

Feature Selection ◽

Decision Tree ◽

Bacterial Vaginosis ◽

Cross Validation ◽

Performance Comparison ◽

Support Vector ◽

Ongoing Research ◽

Selection For ◽

Comparison Of The Results ◽

Fold Cross Validation

Requiring only a few relevant characteristics from patients when diagnosing bacterial vaginosis is highly useful for physicians because it makes data collection less time consuming. The goal is a dataset of patients who can be diagnosed more accurately using only a subset of informative or relevant features rather than the entire feature set. As such, this is a feature selection (FS) problem. In this work, decision tree and Relief algorithms were used as feature selectors. Experiments were conducted on a real dataset for bacterial vaginosis with 396 instances and 252 features/attributes. The dataset was obtained from universities located in Baltimore and Atlanta. The FS algorithms produced feature rankings, from which the top fifteen features formed a new dataset that was used as input to both support vector machine (SVM) and logistic regression (LR) classifiers. For performance evaluation, averages over 30 runs of 10-fold cross-validation were reported, with balanced accuracy, sensitivity, and specificity as performance measures. The results obtained with all features were compared against those obtained with the top fifteen. The top-ranked attributes were similar to those reported in the literature. This study is part of ongoing research investigating a range of feature selection and classification methods.
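
The three performance measures named above follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function name and the choice of `1` as the positive label are illustrative assumptions, not taken from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity, and balanced accuracy for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)      # true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # balanced accuracy is the mean of the two per-class recalls
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }
```

Balanced accuracy is less sensitive to class imbalance than plain accuracy, which is presumably why it is reported alongside sensitivity and specificity.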

Comparison of LDA and SPRT on Clinical Dataset Classifications

Biomedical Informatics Insights ◽

10.4137/bii.s6935 ◽

2011 ◽

Vol 4 ◽

pp. BII.S6935 ◽

Cited By ~ 2

Author(s):

Chih Lee ◽

Brittany Nkounkou ◽

Chun-Hsi Huang

Keyword(s):

Learning Community ◽

Prediction Accuracy ◽

Cross Validation ◽

Error Rates ◽

Close Relative ◽

Classification Error ◽

Class Label ◽

Normality Assumption ◽

Clinical Dataset ◽

Leave One Out

In this work, we investigate the well-known classification algorithm LDA as well as its close relative SPRT. SPRT affords many theoretical advantages over LDA. It allows specification of desired classification error rates α and β and is expected to be faster in predicting the class label of a new instance. However, SPRT is not as widely used as LDA in the pattern recognition and machine learning community. For this reason, we investigate LDA, SPRT and a modified SPRT (MSPRT) empirically using clinical datasets from Parkinson's disease, colon cancer, and breast cancer. We make the same normality assumption as LDA and propose variants of the two SPRT algorithms based on the order in which the components of an instance are sampled. Leave-one-out cross-validation is used to assess and compare the performance of the methods. The results indicate that two variants, SPRT-ordered and MSPRT-ordered, are superior to LDA in terms of prediction accuracy. Moreover, on average SPRT-ordered and MSPRT-ordered examine fewer components than LDA before arriving at a decision. These advantages imply that SPRT-ordered and MSPRT-ordered are the preferred algorithms over LDA when the normality assumption can be justified for a dataset.
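
To make the role of α and β concrete, here is a minimal sketch of Wald's sequential probability ratio test applied to the components of one instance. It assumes independent Gaussian components with a shared per-component standard deviation, which is a simplification of the LDA setting, and it forces a decision by the sign of the log-likelihood ratio when the components run out. The function and parameter names are hypothetical, and the paper's ordered and modified variants are not reproduced here.

```python
import math


def sprt_classify(x, mu0, mu1, sigma, alpha=0.05, beta=0.05):
    """Return (predicted class, number of components examined)."""
    upper = math.log((1 - beta) / alpha)   # cross it: decide class 1
    lower = math.log(beta / (1 - alpha))   # cross it: decide class 0
    llr = 0.0
    for n, (xi, m0, m1, s) in enumerate(zip(x, mu0, mu1, sigma), start=1):
        # log N(xi; m1, s) - log N(xi; m0, s) for equal variances
        llr += ((xi - m0) ** 2 - (xi - m1) ** 2) / (2 * s ** 2)
        if llr >= upper:
            return 1, n
        if llr <= lower:
            return 0, n
    # Assumption: truncate by the sign of the accumulated ratio.
    return (1 if llr > 0 else 0), len(x)
```

Because the test can stop as soon as a boundary is crossed, it may examine only a few components of an instance, which matches the speed advantage the abstract attributes to SPRT.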

The Impact of Pressure on the Fingerprint Impression: Presentation Attack Detection Scheme

Applied Sciences ◽

10.3390/app11177883 ◽

2021 ◽

Vol 11 (17) ◽

pp. 7883

Author(s):

Anas Husseis ◽

Judith Liu-Jimenez ◽

Raul Sanchez-Reillo

Keyword(s):

Error Rate ◽

Attack Detection ◽

Fingerprint Recognition ◽

Classification Error ◽

Final Decision ◽

Classification Error Rate ◽

Detection Scheme ◽

Bona Fide ◽

The Impact ◽

Presentation Attack Detection

Fingerprint recognition systems have been widely deployed in authentication and verification applications, ranging from personal smartphones to border control systems. Recently, the biometric society has raised concerns about presentation attacks that aim to manipulate the biometric system’s final decision by presenting artificial fingerprint traits to the sensor. In this paper, we propose a presentation attack detection scheme that exploits the natural fingerprint phenomena, and analyzes the dynamic variation of a fingerprint’s impression when the user applies additional pressure during the presentation. For that purpose, we collected a novel dynamic dataset with an instructed acquisition scenario. Two sensing technologies are used in the data collection, thermal and optical. Additionally, we collected attack presentations using seven presentation attack instrument species considering the same acquisition circumstances. The proposed mechanism is evaluated following the directives of the standard ISO/IEC 30107. The comparison between ordinary and pressure presentations shows higher accuracy and generalizability for the latter. The proposed approach demonstrates efficient capability of detecting presentation attacks with low bona fide presentation classification error rate (BPCER) where BPCER is 0% for an optical sensor and 1.66% for a thermal sensor at 5% attack presentation classification error rate (APCER) for both.
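
The operating point quoted above (BPCER at a fixed 5% APCER) can be computed from detector scores as in the sketch below. It assumes a score where higher values mean "more likely bona fide" and a presentation is accepted when its score is at or above the threshold; the score convention and function names are illustrative assumptions, not details from the paper.

```python
def apcer(attack_scores, thr):
    """Share of attack presentations wrongly accepted as bona fide."""
    return sum(s >= thr for s in attack_scores) / len(attack_scores)


def bpcer(bona_fide_scores, thr):
    """Share of bona fide presentations wrongly rejected as attacks."""
    return sum(s < thr for s in bona_fide_scores) / len(bona_fide_scores)


def bpcer_at_apcer(bona_fide_scores, attack_scores, target=0.05):
    """Lowest threshold whose APCER is within the target, and its BPCER."""
    candidates = sorted(set(bona_fide_scores) | set(attack_scores))
    candidates.append(candidates[-1] + 1.0)   # rejects every presentation
    # Ascending thresholds: APCER never rises, BPCER never falls, so the
    # first threshold meeting the target gives the smallest BPCER.
    for thr in candidates:
        if apcer(attack_scores, thr) <= target:
            return bpcer(bona_fide_scores, thr), thr
```

Usage: `bpcer_at_apcer(bona, attacks, target=0.05)` returns the pair `(BPCER, threshold)` for the chosen operating point.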

Perbandingan kinerja metode C4.5 dan Naive Bayes dalam klasifikasi artikel jurnal PGSD berdasarkan mata pelajaran

TEKNO ◽

10.17977/um034v29i1p50-67 ◽

2019 ◽

Vol 29 (1) ◽

pp. 50

Author(s):

Utomo Pujianto ◽

Putri Yuni Ristanti

Keyword(s):

Decision Tree ◽

Cross Validation ◽

Naive Bayes ◽

Naïve Bayes ◽

Fold Cross Validation

Education has standards that serve as a reference for the learning process. The Government has regulated education standards in Indonesia through Article 6 paragraph (1) of Government Regulation of the Republic of Indonesia Number 19 of 2005, which covers the curriculum for general, vocational, and special education at the primary and secondary levels. In line with that regulation, Regulation of the Minister of National Education of the Republic of Indonesia Number 23 of 2006, Article 1 paragraph (2), on Graduate Competency Standards was issued, which among other things contains the SK-KMP (Standar Kompetensi Kelompok Mata Pelajaran, the competency standards for subject groups). These standards serve as a reference for educators and prospective educators, particularly students of education, when creating learning media and journals as core teaching material. The aim of this study is to classify the interest of PGSD students in subject themes according to the SK-KMP using the Naive Bayes and Decision Tree J48 methods. The results can serve as a reference for choosing subject themes in coming years so that they are more varied rather than concentrating on a single subject. The performance of the two methods is compared to determine which one performs better at document classification. The classification algorithms are evaluated with K-fold cross-validation. In tests on 200 journal article titles and abstracts using K-fold cross-validation, the Naive Bayes algorithm achieved an accuracy of 84%, while the Decision Tree J48 algorithm achieved an accuracy of 86%.

Analisis Komparatif Evaluasi Performa Algoritma Klasifikasi pada Readmisi Pasien Diabetes

Jurnal Buana Informatika ◽

10.24002/jbi.v7i4.770 ◽

2016 ◽

Vol 7 (4) ◽

Author(s):

Mochammad Yusa ◽

Ema Utami ◽

Emha T. Luthfi

Keyword(s):

Data Mining ◽

Decision Tree ◽

Cross Validation ◽

Nearest Neighbor ◽

Naive Bayes ◽

Kappa Statistic ◽

Naïve Bayes ◽

Validation Dataset ◽

K Nearest Neighbor ◽

Fold Cross Validation

Abstract. Readmission is associated with quality measures for hospital patients. The many attributes related to diabetic patients, such as medication, ethnicity, race, lifestyle, and age, make the calculation of quality of care complicated. Data mining classification techniques can address this problem. In this paper, three classifiers, Decision Tree, k-Nearest Neighbor (k-NN), and Naive Bayes, are evaluated under various parameter settings using 10-fold cross-validation. Performance is evaluated in terms of accuracy, Mean Absolute Error (MAE), and the Kappa statistic. The selected dataset consists of 47 attributes and 49,735 records. The results show that the k-NN classifier with k=100 performs better in terms of accuracy and the Kappa statistic, while Naive Bayes outperforms the other classifiers in terms of MAE. Keywords: k-NN, naive bayes, diabetes, readmission

Abstract (Indonesian version). Readmission is associated with the calculation of the quality of patient care in hospitals. Differences in the attributes related to diabetic patients, such as the medication process, ethnicity, race, lifestyle, and age, make the quality calculation complicated. Data mining classification techniques can be a solution for this calculation. Classification is one of the data mining techniques that has developed significantly. In this study, Decision Tree, k-Nearest Neighbor (k-NN), and Naive Bayes classification models with various parameter settings are evaluated on accuracy, Mean Absolute Error (MAE), and the Kappa statistic using 10-fold cross-validation. The evaluated dataset has 47 attributes and 49,735 records. The results show that the best accuracy, MAE, and Kappa statistic were obtained from the Naive Bayes model. Keywords: k-NN, naive bayes, diabetes, readmission
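
For readers unfamiliar with the Kappa statistic used above, the following sketch computes Cohen's kappa from predicted and true labels using its standard definition, (observed agreement - chance agreement) / (1 - chance agreement). The function name is illustrative, and the study's own tooling is not specified in the abstract.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Agreement between predictions and labels, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_freq = Counter(y_true)
    pred_freq = Counter(y_pred)
    # Chance agreement: product of the marginal class frequencies.
    expected = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

A kappa of 0 means the classifier agrees with the labels no more often than chance would, so it can rank classifiers differently from raw accuracy on imbalanced data.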

Automatic classification of water samples using an optimized SVM model applied to cyclic voltammetry signals.

Revista Vitae ◽

10.17533/udea.vitae.v26n2a05 ◽

2019 ◽

Vol 26 (2) ◽

pp. 94-103

Author(s):

Hugo Italo Romero ◽

Ivan RAMÍREZ-MORALES ◽

Cinthia ROMERO FLORES

Keyword(s):

Water Samples ◽

Cross Validation ◽

Tap Water ◽

Automatic Classification ◽

Chloride Ions ◽

Bottled Water ◽

Classification Model ◽

Final Decision ◽

Significant Difference ◽

Fold Cross Validation

Background: concern about the quality of water for human consumption has become widespread among the population. The taste of drinking water and some problems associated with it have increased the demand for bottled water. As a result, a large number of companies have shown interest in producing bottled water. Objective: to evaluate a novel automatic classification model that differentiates bottled water from tap water. Methods: the voltammetric technique used a three-electrode setup, and the output current was used for data analysis. From the results of a grid search, six pairs of values with similar results were pre-selected for the parameters σ and C. High values of accuracy, specificity, and sensitivity were achieved on the test dataset. The final decision was made after performing an ANOVA test on 100 repetitions of 5-fold cross-validation; 3000 models were evaluated with the parameter combinations described above for the SVM. Results: the oxidation and reduction peaks of the water samples were observed to be prominent. Absolute values of current (I) increased in the case of public water samples, possibly due to the larger concentration of chloride ions, which contribute more to the conductivity. 5-fold cross-validation test mean specificity resulted in C parameter values greater than 0 and between 0 and 30; a σ value greater than 10 and between 0 and 15 were found for tap water and bottled water, respectively. The combination (σ = 10, C = 30) presented the best results: accuracy 0.988 ± 0.037, specificity 0.973 ± 0.085, and sensitivity 1 ± 0.09. Conclusions: the voltammograms showed higher current values for tap water samples, 9.94e-6 μA, compared to 7.99e-6 μA for bottled water, due to the higher chloride ion concentration in the former. The parameter combination (σ = 10, C = 20) was selected as optimal since there was no significant difference between this and the former.

Perbandingan Decision Tree J48, REPTREE, dan Random Tree dalam Menentukan Prediksi Produksi Minyak Kelapa Sawit Menggunakan Fuzzy Tsukamoto

Jurnal Teknologi Informasi dan Ilmu Komputer ◽

10.25126/jtiik.2021833108 ◽

2021 ◽

Vol 8 (3) ◽

pp. 473

Author(s):

Tundo Tundo ◽

Shofwatul 'Uyun

Keyword(s):

Decision Tree ◽

Decision Analysis ◽

Decision Trees ◽

Palm Oil ◽

Error Rate ◽

Oil Production ◽

Random Tree ◽

Actual Data ◽

Fuzzy Method ◽

Forecasting Error

This study analyzes the J48, REPTree, and Random Tree decision trees combined with Tsukamoto's fuzzy method for determining the amount of palm oil production at the company PT Tapiana Nadenggan, with the aim of finding out which decision tree gives results closest to the actual data. The J48, REPTree, and Random Tree decision trees are used to speed up rule creation without having to consult experts to determine the rules. On the data used, the rule-formation accuracy of the J48 decision tree is 95.2381%, of REPTree 90.4762%, and of Random Tree 95.2381%. The results show that the Tsukamoto fuzzy method using REPTree has a smaller Average Forecasting Error Rate (AFER), 23.17%, than with J48 (24.96%) or Random Tree (36.51%) in predicting the amount of palm oil production. The study therefore suggests that, for decision trees built with the WEKA tools, the highest rule accuracy does not guarantee the best result: in this case REPTree has the lowest rule accuracy, yet its predictions have the lowest error rate compared with J48 and Random Tree.
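
The Average Forecasting Error Rate (AFER) is commonly defined as the mean of |actual - forecast| / actual, expressed as a percentage. A minimal sketch under that assumed definition follows; the abstract does not spell out the formula, and the function name is illustrative.

```python
def afer(actual, forecast):
    """Average Forecasting Error Rate in percent (assumed definition)."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("need two non-empty series of equal length")
    # Relative absolute error per period; actual values must be non-zero.
    ratios = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)
```

Because AFER measures the distance between forecasts and actual production, a rule set with lower classification accuracy can still give the lowest AFER, which is the point this abstract makes about REPTree.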

Classification of vulnerability levels using multivariate biomarkers in schizophrenia: a machine-learning approach

10.21203/rs.3.rs-15842/v2 ◽

2020 ◽

Author(s):

Simona Caldani ◽

François-Benoît Vialatte ◽

Aurélien Baelde ◽

Maria Pia Bucci ◽

Narjes Bendjemaa ◽

...

Keyword(s):

Machine Learning ◽

Error Rate ◽

Error Rates ◽

Supervised Machine Learning ◽

Classification Error ◽

Support Vector ◽

Neurological Soft Signs ◽

Neurodevelopmental Disease ◽

Multimodal Features ◽

Early Intervention In Psychosis

Background: Schizophrenia is a heterogeneous neurodevelopmental disease involving cognitive and motor impairments. Motor dysfunctions, such as eye movements or neurological soft signs, are proposed as endophenotypic markers. Methods: Supervised machine-learning methods (Support Vector Machines) applied to oculomotor performance, measured by comprehensive testing with prosaccade, antisaccade, and memory-guided saccade tasks and smooth pursuit, together with an assessment of neurological soft signs, were used to discriminate patients with schizophrenia (SZ, N=53), full siblings of patients (FS, N=45), and healthy volunteers (C, N=48). 80% of patients were used in a training/validation set and 20% in a test set. The discrimination was measured using the classification error (rate of misclassified patients). Results: The most reliable classification was between C and SZ, with error rates of only 15% and 12% for validation and test, respectively, whereas the SZ vs. FS classification gave the highest error rates (32% in both validation and test). Interestingly, neurological soft signs were selected as the best predictor, together with a combination of measures, for the two classifications C vs. SZ and SZ vs. FS. In addition, memory-guided saccades were consistently selected among the best two multimodal features for the classifications involving the control group (C vs. SZ or FS). Conclusions: Taken together, these results emphasize the importance of neurological soft signs and sensitive oculomotor parameters, especially memory-guided saccades. This classification provides promising avenues for improving early detection of and early intervention in psychosis.

Conceptual Approach to Predict Loan Defaults Using Decision Trees

Advances in Business Information Systems and Analytics - Sentiment Analysis and Knowledge Discovery in Contemporary Business ◽

10.4018/978-1-5225-4999-4.ch009 ◽

2019 ◽

pp. 148-161 ◽

Cited By ~ 1

Author(s):

Syed Muzamil Basha ◽

Dharmendra Singh Rajput ◽

N. Ch. S. N. Iyengar

Keyword(s):

Decision Tree ◽

Decision Trees ◽

Prediction Algorithm ◽

Classification Error ◽

Selection Algorithm ◽

Decision Tree Classifier ◽

Time Data ◽

Conceptual Approach ◽

Tree Classifier ◽

Loan Defaults

In this chapter, the authors show how to build a decision tree from given real-time data. They interpret the output of a decision tree by learning a decision tree classifier using a recursive greedy algorithm. Feature selection is based on classification error using the feature split selection algorithm (FSSA), with all the different possible stopping conditions for splitting. The authors perform prediction with decision trees using the decision tree prediction algorithm (DTPA), followed by multiclass predictions and their probabilities. Finally, they perform the splitting procedure on real-valued continuous inputs using the threshold split selection algorithm (TSSA).
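
Choosing a threshold for a continuous input by classification error can be sketched as follows. This is a common textbook approach (candidate thresholds at midpoints between adjacent distinct values, majority vote on each side) and is offered only as an illustration; it is not claimed to be the chapter's TSSA, and all names are hypothetical.

```python
from collections import Counter


def majority_errors(labels):
    """Mistakes made by predicting the majority class for these labels."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]


def best_threshold(values, labels):
    """Return (threshold, classification error) of the best binary split."""
    distinct = sorted(set(values))
    best_thr, best_err = None, None
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2                    # midpoint candidate
        left = [l for v, l in zip(values, labels) if v < thr]
        right = [l for v, l in zip(values, labels) if v >= thr]
        err = (majority_errors(left) + majority_errors(right)) / len(values)
        if best_err is None or err < best_err:
            best_thr, best_err = thr, err
    return best_thr, best_err
```

If every value is identical there is no candidate split and the function returns `(None, None)`, which a tree builder could treat as a stopping condition.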

The Performance of the Linear Discriminant Function in Nonoptimal Situations and the Estimation of Classification Error Rates: A Review of Recent Findings

Journal of Marketing Research ◽

10.1177/002224377901600309 ◽

1979 ◽

Vol 16 (3) ◽

pp. 370-381 ◽

Cited By ~ 12

Author(s):

William R. Dillon

Keyword(s):

Discriminant Function ◽

Error Rates ◽

Linear Discriminant Function ◽

Alternative Methods ◽

Misclassification Error ◽

Practical Significance ◽

Classification Error ◽

Continuous Variables ◽

Discrete Variables ◽

Linear Discriminant

This article is a review of the results, as are available, on the performance of the linear discriminant function in situations where the assumptions of multivariate normality and equal group dispersion structures are violated. Some new results are discussed for the case of classification using discrete variables, and in the case of both binary and continuous variables. In addition, alternative methods which have been proposed, and evaluated, for estimating misclassification error rates are thoroughly reviewed. In all cases, the material is reviewed in terms of practical significance, with particular emphasis on the conditions unfavorable to the performance of each procedure.
