Classification and Regression Trees

Author(s):  
Johannes Gehrke

The goal of classification and regression is to build a data mining model that can be used for prediction. To construct such a model, we are given a set of training records, each having several attributes. These attributes can be either numerical (for example, age or salary) or categorical (for example, profession or gender). There is one distinguished attribute, the dependent attribute; the other attributes are called predictor attributes. If the dependent attribute is categorical, the problem is a classification problem. If the dependent attribute is numerical, the problem is a regression problem. The resulting model predicts the value of the dependent attribute for a record in which that value is unknown. (We call such a record an unlabeled record.) Classification and regression have a wide range of applications, including scientific experiments, medical diagnosis, fraud detection, credit approval, and target marketing (Hand, 1997). Many classification and regression models have been proposed in the literature. Among the more popular are neural networks, genetic algorithms, Bayesian methods, linear and log-linear models and other statistical methods, decision tables, and tree-structured models, the focus of this chapter (Breiman, Friedman, Olshen, & Stone, 1984). Tree-structured models, so-called decision trees, are easy to understand; they are non-parametric and thus do not rely on assumptions about the data distribution; and they have fast construction methods even for large training datasets (Lim, Loh, & Shih, 2000). Most data mining suites include tools for classification and regression tree construction (Goebel & Gruenwald, 1999).
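As a rough illustration of how a tree-structured classifier picks a split on a numerical predictor attribute, the sketch below scores every candidate threshold by weighted Gini impurity, one commonly used criterion for classification trees. The abstract itself does not name a split criterion, so the choice of Gini, the function names, and the toy age/approval records are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_numeric_split(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`.

    Returns (None, impurity_of_all_labels) if no split improves on the parent.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        # A threshold can only be placed between two distinct values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical training records: predictor attribute "age",
# categorical dependent attribute "approved".
ages = [22, 25, 31, 38, 45, 52]
approved = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_numeric_split(ages, approved)
print(threshold, impurity)
```

A full tree builder would apply this search to every predictor attribute, split the records on the winner, and recurse on each side until a stopping rule is met.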

Animals ◽  
2021 ◽  
Vol 11 (4) ◽  
pp. 1165
Author(s):  
Abdelfattah Selim ◽  
Ameer Megahed ◽  
Sahar Kandeel ◽  
Abdullah D. Alanazi ◽  
Hamdan I. Almohammed

Classification and Regression Tree (CART) analysis is a potentially powerful tool for identifying risk factors associated with contagious caprine pleuropneumonia (CCPP) and the important interactions between them. Our objective was therefore to determine the seroprevalence of CCPP and identify its associated risk factors using CART data mining modeling in the most densely sheep- and goat-populated governorates. A cross-sectional study was conducted on 620 animals (390 sheep, 230 goats) distributed over four governorates in the Nile Delta of Egypt in 2019. Randomly selected sheep and goats from different geographical study areas were serologically tested for CCPP, and information on the animals was obtained from flock men and farm owners. Six variables (geographic location, species, flock size, age, gender, and communal feeding and watering) were used for risk analysis. Multiple stepwise logistic regression and CART modeling were used for data analysis. A total of 124 (20%) serum samples were serologically positive for CCPP. The highest prevalence of CCPP was among older animals (>4 y; 48.7%), animals raised in a flock size ≥200 (100%), and animals with communal feeding and watering (28.2%). Based on logistic regression modeling (area under the curve, AUC = 0.89; 95% CI 0.86 to 0.91), communal feeding and watering showed the highest prevalence odds ratio (POR) for CCPP (POR = 3.7, 95% CI 1.9 to 7.3), followed by age (POR = 2.1, 95% CI 1.6 to 2.8) and flock size (POR = 1.1, 95% CI 1.0 to 1.2). However, the higher-accuracy CART modeling (AUC = 0.92, 95% CI 0.90 to 0.95) showed that a flock size >100 animals is the most important risk factor (importance score = 8.9), followed by age >4 y (5.3) and then communal feeding and watering (3.1). Our results strongly suggest that CCPP is most likely to be found in animals older than 4 y that are raised in a flock of >100 animals with communal feeding and watering. Additionally, sheep seem to have an important role in CCPP epidemiology. The CART data mining modeling showed better accuracy than the traditional logistic regression.


2013 ◽  
Author(s):  
Srimoyee Bhattacharya ◽  
Marko Maucec ◽  
Jeffrey Marc Yarus ◽  
Dwight David Fulton ◽  
Jon Matthew Orth ◽  
...  

2019 ◽  
Vol 3 (2) ◽  
pp. 139-145
Author(s):  
Nurul Indah Prabawati ◽  
Widodo ◽  
Hamidillah Ajie

Student organizations are facilities provided by universities as a forum for developing students' non-academic abilities, interests, and talents. In practice, however, many students who join organizations see their academic performance decline to the point that they cannot graduate on time. Universitas Negeri Jakarta does not yet have a system that can classify the length of study of students who participate in organizations. Before building a decision-making system, research on the accuracy of an algorithm is needed so that the resulting decision system achieves a high level of accuracy. This study uses a data mining algorithm, the Classification and Regression Tree (CART) algorithm. CART is a binary decision tree method. CART was developed to perform classification analysis on response variables that are nominal, ordinal, or continuous. The CART classification method comprises two methods: regression trees and classification trees. Data on students who participate in organizations, both those who graduated on time and those who did not, are processed using the CART algorithm. After classification, accuracy is computed using K-fold cross-validation with K = 5, K = 10, and K = 20. Based on the sample data of students who participate in organizations, the best CART accuracy was obtained when K = 20. The CART algorithm was able to classify the length of study of students who participate in organizations at Universitas Negeri Jakarta. The CART algorithm produced an average accuracy of 80%.
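The K-fold cross-validation procedure mentioned in this abstract (K = 5, 10, and 20) can be sketched in a few lines. This is only an illustration of the evaluation loop, not the study's code: the labels are made-up toy data, and a majority-class baseline stands in for the CART classifier so that the example needs nothing beyond the standard library. All function names are hypothetical.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_baseline(train_labels, n_test):
    """Stand-in model (an assumption): always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n_test

def cross_val_accuracy(labels, k, seed=0):
    """Mean accuracy over k folds, each fold serving once as the test set."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
        predictions = majority_baseline(train_labels, len(test_idx))
        correct = sum(pred == labels[i] for pred, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Hypothetical labels: 30 students graduated on time, 10 did not.
toy_labels = ["on_time"] * 30 + ["late"] * 10
results = {k: cross_val_accuracy(toy_labels, k) for k in (5, 10, 20)}
print(results)
```

To evaluate an actual tree, the baseline would be replaced by a function that fits the tree on the training folds and predicts the held-out fold.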


2017 ◽  
Vol 1 (3) ◽  
pp. 183
Author(s):  
Gede Suwardika

Hepatitis is inflammation of the liver caused by toxins, such as chemicals or drugs, or by infectious agents. Hepatitis lasting less than 6 months is called "acute hepatitis"; hepatitis lasting more than 6 months is called "chronic hepatitis". Hepatitis is usually caused by a virus, most often one of the five hepatitis viruses: A, B, C, D, or E. Hepatitis can also result from other viral infections, such as infectious mononucleosis, yellow fever, and cytomegalovirus infection. The main non-viral causes of hepatitis are alcohol and drugs. In this study, 155 patients were tested, with a response of died or survived. Data mining is applied to this case using one of its techniques, data classification: the available testing data are analyzed and compared with the training data to predict whether a patient died or survived. The classification accuracy between training and testing data was 79.4% with logistic regression analysis and 80% with SVM. Clustering with K-Means and Kernel K-Means produced different clustering accuracies. This indicates that the hepatitis data cluster well. The Kernel K-Means clustering results were then compared with the actual data classified using logistic regression, SVM, and CART; the data resulting from Kernel K-Means had better classification accuracy than the classification results on the actual data.


The winsorize tree is a modified tree derived from the classification and regression tree (CART). It relies on a strategy of handling and accommodating outliers simultaneously in all nodes while generating the subsequent branches of the tree. Normally, the presence of outliers affects the accuracy of most classifiers. We therefore propose the winsorize tree, which is resistant to anomalous data. It preserves the originality of the data while performing the splitting process. In this study, the winsorize tree was compared with other classifiers. The results obtained from five real datasets indicate that, in terms of misclassification rate, the proposed winsorize tree performs as well as or better than the other data mining techniques.
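For readers unfamiliar with the term, the sketch below shows plain winsorization of a single numeric attribute: values in each tail are pulled in to the nearest retained value instead of being discarded. It is not the authors' winsorize tree, whose per-node procedure the abstract does not describe; the function name, the 10% tail fraction, and the sample values are assumptions for illustration only.

```python
def winsorize(values, fraction=0.1):
    """Clamp the lowest and highest `fraction` of values to the nearest kept value.

    The input order is preserved and no record is dropped.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    n = len(values)
    if n == 0:
        return []
    ordered = sorted(values)
    k = int(fraction * n)  # number of values clamped in each tail
    low, high = ordered[k], ordered[n - 1 - k]
    return [min(max(v, low), high) for v in values]

# Hypothetical attribute values with one obvious outlier (100).
raw = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
clamped = winsorize(raw, fraction=0.1)
print(clamped)
```

Because the outlying record is kept with a clamped value rather than removed, the number of training records is unchanged, which matches the general idea of accommodating outliers rather than deleting them.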

