Decision Tree Classification and Model Evaluation for Breast Cancer Survivability: A Data Mining Approach

Abstract. A considerable amount of current work deals with decision trees, especially in survival analysis (Ibrahim et al., 2008). Cancer patient cases are a typical example of data treated with survival analysis. This study discusses the application of data mining to obtain diagnostic results. The classification technique uses information from the medical records of breast cancer patients in Yugoslavia. The problem is addressed through decision tree analysis using the CHAID, Exhaustive CHAID, and CART methods, and the performance of the three decision tree classification methods is compared empirically to identify the best one. The study concludes that CART is the best method for classifying breast cancer patients because it identified the largest number of significant variables, four in total: inv-node, tumor size, deg-malig, and breast part. It also achieved the highest total accuracy, 84.9 percent, and the lowest total error rate, 15.1 percent.
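
The total accuracy and total error rate reported above are complements of each other (84.9 + 15.1 = 100). A minimal sketch of how both figures can be derived from a confusion matrix is given below. The counts are hypothetical and are not the study's data, and the function name `accuracy_and_error` is an assumption made for illustration.

```python
def accuracy_and_error(confusion):
    """Return (accuracy, error_rate) in percent from a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    total = sum(sum(row) for row in confusion)
    # Correct predictions sit on the main diagonal.
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    accuracy = 100.0 * correct / total
    return accuracy, 100.0 - accuracy


# Hypothetical counts for a two-class problem (NOT taken from the paper).
confusion = [[70, 10],
             [5, 15]]
acc, err = accuracy_and_error(confusion)
print(f"accuracy={acc:.1f}%  error={err:.1f}%")
```

Computing the same pair of numbers for each of the three trees on the same records is one simple way to make the comparison the abstract describes.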

Improved differentiation classification of variable precision artificial intelligence higher education management

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-219036 ◽

2021 ◽

pp. 1-10

Author(s):

Chao Dong ◽

Yan Guo

Keyword(s):

Artificial Intelligence ◽

Higher Education ◽

Data Mining ◽

Decision Tree ◽

Classification Accuracy ◽

Attribute Selection ◽

Higher Education Management ◽

Education Management ◽

Decision Tree Classification

The wide application of artificial intelligence technology in various fields has accelerated the search for information hidden in large amounts of data. Data mining methods are expected to support effective research on higher education management, and the decision tree classification algorithm, a data analysis method within data mining, is well suited to the task because of its high classification accuracy, intuitive decision results, and strong generalization ability. To address the sensitivity of data processing and decision tree classification to noisy data, this paper proposes a variable precision rough set attribute selection criterion based on a scale function, which considers both the weighted approximation accuracy of an attribute and its number of attribute values. This improves robustness to noisy data, reduces bias in attribute selection, and improves classification accuracy. In addition, a suppression factor threshold, support, and confidence are introduced in the tree pre-pruning process, which simplifies the tree structure. Comparative experiments on standard data sets show that the improved algorithm outperforms other decision tree algorithms and can effectively realize the differentiated classification of higher education management.
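
The pre-pruning idea mentioned above can be sketched as a stopping check applied before a node is split. This is only an illustration under stated assumptions: the abstract does not define its suppression factor or give threshold values, so the sketch uses support and confidence alone, and the function name `should_stop` and both default thresholds are hypothetical.

```python
from collections import Counter


def should_stop(node_labels, total_samples, min_support=0.05, min_confidence=0.9):
    """Decide whether to stop splitting a node (pre-pruning).

    support    = share of all training samples that reach this node
    confidence = share of the node's samples in its majority class
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    support = len(node_labels) / total_samples
    majority_count = Counter(node_labels).most_common(1)[0][1]
    confidence = majority_count / len(node_labels)
    # Stop when the node is too small to trust or already nearly pure.
    return support < min_support or confidence >= min_confidence


# A node holding 10 of 100 samples, 9 of them in one class: stop splitting.
print(should_stop(["a"] * 9 + ["b"], total_samples=100))
```

Stopping early in this way keeps the tree smaller, which is the simplification effect the abstract attributes to pre-pruning.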

Correlates of Physical Activity Behavior in Adults: A Data Mining Approach

10.21203/rs.2.23726/v2 ◽

2020 ◽

Author(s):

Vahid Farrahi ◽

Maisa Niemelä ◽

Mikko Kärmeniemi ◽

Soile Puhakka ◽

Maarit Kangas ◽

...

Keyword(s):

Physical Activity ◽

Data Mining ◽

Decision Tree ◽

Sitting Time ◽

Accelerometer Data ◽

Relative Importance ◽

Interaction Detection ◽

Data Mining Approach ◽

Input Variables

Abstract Purpose: A data mining approach was applied to establish a multilevel hierarchy predicting physical activity (PA) behavior, and to methodologically identify the correlates of PA behavior. Methods: Cross-sectional data from the population-based Northern Finland Birth Cohort 1966 study, collected in the most recent follow-up at age 46, were used to create a hierarchy using the chi-square automatic interaction detection (CHAID) decision tree technique for predicting PA behavior. PA behavior is defined as active or inactive depending on participants’ activity profiles, which were previously created through a multidimensional (clustering) approach on continuous accelerometer-measured activity intensities in one week. The input variables (predictors) used for decision tree fitting consisted of individual, demographical, psychological, behavioral, environmental, and physical factors. Using generalized linear mixed models, we also analyzed how factors emerging from the model were associated with three PA metrics, including daily time (minutes per day) in sedentary (SED), light PA (LPA), and moderate-to-vigorous PA (MVPA), to assure the relative importance of methodologically identified factors. Results: Of the 4,582 participants with valid accelerometer data at the latest follow-up, 2,701 and 1,881 had active and inactive profiles, respectively. We used a total of 168 factors as input variables to classify these two PA behaviors. Out of these 168 factors, the decision tree selected 36 factors of different domains from which 54 subgroups of participants were formed. The emerging factors from the model explained minutes per day in SED, LPA, and/or MVPA, including body fat percentage (SED: B=26.5, LPA: B=-16.1, and MVPA: B=-11.7), normalized heart rate recovery 60 seconds after exercise (SED: B=-16.1, LPA: B=9.9, and MVPA: B=9.6), average weekday total sitting time (SED: B=34.1, LPA: B=-25.3, and MVPA: B=-5.8), and extravagance score (SED: B=6.3 and LPA: B=-3.7). 
Conclusions: Using data mining, we established a data-driven model composed of 36 different factors of relative importance from empirical data. This model may be used to identify subgroups for multilevel intervention allocation and design. Additionally, this study methodologically discovered an extensive set of factors that can be a basis for additional hypothesis testing in PA correlates research.
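
CHAID selects splits using chi-square tests of independence between a candidate predictor and the outcome. The sketch below computes only the Pearson chi-square statistic for a contingency table. The counts are hypothetical (not from the cohort), the function name `chi_square` is an assumption, and a full CHAID implementation would also merge categories and compare Bonferroni-adjusted p-values, which is not shown here.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table.

    Rows are predictor categories, columns are outcome classes.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if predictor and outcome were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows = two predictor categories,
# columns = (active, inactive).
print(chi_square([[30, 10], [10, 30]]))
```

For tables of the same shape, a larger statistic indicates a stronger association between the predictor and the outcome.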

Data Mining and Ergonomic Evaluation of Firefighter’s Motion Based on Decision Tree Classification Model

Communications in Computer and Information Science - Advanced Research on Computer Science and Information Engineering ◽

10.1007/978-3-642-21411-0_35 ◽

2011 ◽

pp. 212-217 ◽

Cited By ~ 1

Author(s):

Lifang Yang ◽

Tianjiao Zhao

Keyword(s):

Data Mining ◽

Decision Tree ◽

Classification Model ◽

Decision Tree Classification ◽

Ergonomic Evaluation

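PENERAPAN ALGORITMA C4.5 UNTUK PENENTUAN KELAYAKAN PEMBERIAN KREDIT (Studi Kasus : Koperia - Koperasi Warga Komplek Gandaria) [Application of the C4.5 Algorithm for Determining Credit Eligibility (Case Study: Koperia - Koperasi Warga Komplek Gandaria)]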

Jurnal Algoritma, Logika dan Komputasi ◽

10.30813/j-alu.v2i1.1573 ◽

2019 ◽

Vol 2 (1) ◽

Author(s):

Teguh Budi Santoso ◽

Dela Sekardiana

Keyword(s):

Data Mining ◽

Decision Tree ◽

Classification Model ◽

Decision Tree Classification ◽

C4.5 Algorithm ◽

Credit Worthiness ◽

Loan Amount

Credit at KOPERIA (Koperasi Warga Komplek Gandaria) is currently still granted based on an objective process. Cooperative managers often have difficulty determining whether a credit application is feasible, which leads to customers defaulting on their credit installments. This study aims to build a decision tree classification model to determine customers' creditworthiness. The C4.5 algorithm is applied to the following attributes: income, divided into two categories (> 5 million and 3-5 million); balance, divided into three (> 3 million, 1-3 million, and < 1 million); loan amount, divided into three (1-4 months, 5-8 months, and 9-12 months); and requirements, with the values business capital, buying goods, and others. After determining the appropriate root nodes, classification with the C4.5 algorithm achieved an accuracy of 97.5%, which indicates that the algorithm is suitable for determining the feasibility of lending to KOPERIA customers. Keywords: Data Mining, C4.5 Algorithm, loan feasibility
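
C4.5 chooses the attribute for each node, including the root, by gain ratio: information gain divided by the split information of the attribute. The following is a minimal sketch of that calculation. The four records are hypothetical and only borrow the attribute categories named in the abstract; the field names `income`, `balance`, and `eligible` are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, attribute, target):
    """Gain ratio of `attribute` with respect to `target`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    total = len(rows)
    # Weighted entropy of the target after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy([row[target] for row in rows]) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical records (NOT the cooperative's data).
rows = [
    {"income": "> 5 million", "balance": "> 3 million", "eligible": "yes"},
    {"income": "> 5 million", "balance": "< 1 million", "eligible": "yes"},
    {"income": "3-5 million", "balance": "> 3 million", "eligible": "no"},
    {"income": "3-5 million", "balance": "< 1 million", "eligible": "no"},
]
for name in ("income", "balance"):
    print(name, gain_ratio(rows, name, "eligible"))
```

In this toy data, income separates the classes perfectly while balance carries no information, so income would be chosen as the root node.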

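PENERAPAN DATA MINING MENGGUNAKAN ALGORITMA C4.5 TEHADAP PENGARUH PENJUALAN KOPI PADA PT. JPW INDONESIA [Application of Data Mining Using the C4.5 Algorithm to the Influence on Coffee Sales at PT. JPW Indonesia]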

Jurnal Sistem Informasi dan Informatika (Simika) ◽

10.47080/simika.v3i1.836 ◽

2020 ◽

Vol 3 (1) ◽

pp. 40-54

Author(s):

Ikong Ifongki

Keyword(s):

Data Mining ◽

Decision Tree ◽

Decision Rules ◽

Large Data ◽

Added Value ◽

Data Set ◽

Use Of Data ◽

Decision Tree Classification ◽

C4.5 Algorithm

Data mining is a series of processes for extracting added value from a data set in the form of knowledge that could not be found manually. Data mining techniques are expected to surface knowledge previously hidden in the data warehouse so that it becomes valuable information. C4.5 is a widely used decision tree classification algorithm because of its main advantages over other algorithms: it produces decision trees that are easy to interpret, has an acceptable level of accuracy, and handles both discrete and numeric attributes efficiently. The output of the C4.5 algorithm is a decision tree. Like other classification techniques, a decision tree is a structure that divides a large data set into smaller sets of records by applying a series of decision rules, so that with each division the members of the resulting sets become more similar to each other. This case study examines what affects coffee sales by processing 106 records out of 1,087 coffee sales records at PT. JPW Indonesia. The sampled data are calculated manually using Microsoft Excel and with RapidMiner software. The results of the C4.5 calculation show that the Quantity and Price attributes strongly affect coffee sales, so that sales at PT. JPW Indonesia are still often unstable.
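
The description of a decision tree as a series of decision rules that divide records into smaller, more homogeneous sets can be illustrated with a short sketch. The rules, thresholds, and records below are hypothetical and are not the tree learned in the study; only the attribute names Quantity and Price come from the abstract, and the function name `partition` is an assumption.

```python
def partition(records, rules):
    """Assign each record to the first rule whose predicate matches it."""
    leaves = {name: [] for name, _ in rules}
    for record in records:
        for name, predicate in rules:
            if predicate(record):
                leaves[name].append(record)
                break
    return leaves


# Hypothetical rules read from root to leaf; the last rule catches the rest.
rules = [
    ("low quantity", lambda r: r["quantity"] < 10),
    ("high quantity, low price", lambda r: r["price"] < 50000),
    ("high quantity, high price", lambda r: True),
]

# Hypothetical sales records (NOT the company's data).
records = [
    {"quantity": 3, "price": 40000},
    {"quantity": 25, "price": 30000},
    {"quantity": 40, "price": 90000},
    {"quantity": 8, "price": 95000},
]
for leaf, members in partition(records, rules).items():
    print(leaf, len(members))
```

Each leaf ends up holding only the records that satisfy the same chain of rules, which is what makes the resulting subsets more similar internally than the full data set.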

Suspicious E-mail Detection via Decision Tree: A Data Mining Approach

Journal of Computing and Information Technology ◽

10.2498/cit.1000984 ◽

2007 ◽

Vol 15 (2) ◽

pp. 161 ◽

Cited By ~ 5

Author(s):

Appavu Balamurugan ◽

Ramasamy Rajaram

Keyword(s):

Data Mining ◽

Decision Tree ◽

Data Mining Approach ◽

E Mail

Correlates of Physical Activity Behavior in Adults: A Data Mining Approach

10.21203/rs.2.23726/v3 ◽

2020 ◽

Author(s):

Vahid Farrahi ◽

Maisa Niemelä ◽

Mikko Kärmeniemi ◽

Soile Puhakka ◽

Maarit Kangas ◽

...

Keyword(s):

Physical Activity ◽

Data Mining ◽

Decision Tree ◽

Sitting Time ◽

Accelerometer Data ◽

Relative Importance ◽

Interaction Detection ◽

Data Mining Approach ◽

Input Variables

Abstract Purpose: A data mining approach was applied to establish a multilevel hierarchy predicting physical activity (PA) behavior, and to methodologically identify the correlates of PA behavior. Methods: Cross-sectional data from the population-based Northern Finland Birth Cohort 1966 study, collected in the most recent follow-up at age 46, were used to create a hierarchy using the chi-square automatic interaction detection (CHAID) decision tree technique for predicting PA behavior. PA behavior is defined as active or inactive based on machine-learned activity profiles, which were previously created through a multidimensional (clustering) approach on continuous accelerometer-measured activity intensities in one week. The input variables (predictors) used for decision tree fitting consisted of individual, demographical, psychological, behavioral, environmental, and physical factors. Using generalized linear mixed models, we also analyzed how factors emerging from the model were associated with three PA metrics, including daily time (minutes per day) in sedentary (SED), light PA (LPA), and moderate-to-vigorous PA (MVPA), to assure the relative importance of methodologically identified factors. Results: Of the 4,582 participants with valid accelerometer data at the latest follow-up, 2,701 and 1,881 had active and inactive profiles, respectively. We used a total of 168 factors as input variables to classify these two PA behaviors. Out of these 168 factors, the decision tree selected 36 factors of different domains from which 54 subgroups of participants were formed. The emerging factors from the model explained minutes per day in SED, LPA, and/or MVPA, including body fat percentage (SED: B=26.5, LPA: B=-16.1, and MVPA: B=-11.7), normalized heart rate recovery 60 seconds after exercise (SED: B=-16.1, LPA: B=9.9, and MVPA: B=9.6), average weekday total sitting time (SED: B=34.1, LPA: B=-25.3, and MVPA: B=-5.8), and extravagance score (SED: B=6.3 and LPA: B=-3.7).
Conclusions: Using data mining, we established a data-driven model composed of 36 different factors of relative importance from empirical data. This model may be used to identify subgroups for multilevel intervention allocation and design. Additionally, this study methodologically discovered an extensive set of factors that can be a basis for additional hypothesis testing in PA correlates research.

Evaluating the impact of soy compounds on breast cancer using the data mining approach

Food & Function ◽

10.1039/c9fo00976k ◽

2020 ◽

Vol 11 (5) ◽

pp. 4561-4570

Author(s):

Sheng-I Chen ◽

Hsiao-Ting Tseng ◽

Chia-Chien Hsieh

Keyword(s):

Breast Cancer ◽

Data Mining ◽

Cancer Type ◽

Data Mining Approach ◽

The Impact

Accumulating evidence has shown that soy intake is associated with the prevention of cancers. However, the specific soy compound and cancer type should be considered before allocating a precise nutrient intervention.
