c5.0 decision tree
Recently Published Documents


TOTAL DOCUMENTS

40
(FIVE YEARS 21)

H-INDEX

4
(FIVE YEARS 2)

2022 ◽  
Vol 19 (3) ◽  
pp. 2193-2205
Author(s):  
Jian-xue Tian ◽  
Jue Zhang

To overcome the two-class imbalance problem in breast cancer diagnosis, a hybrid method is proposed that combines principal component analysis (PCA) with a boosted C5.0 decision tree algorithm and a penalty factor. PCA is used to reduce the dimension of the feature subset. The boosted C5.0 decision tree algorithm serves as an ensemble classifier. The penalty factor is used to optimize the classification result. To demonstrate the efficiency of the proposed method, it is implemented on biased-representative breast cancer datasets from the University of California Irvine (UCI) machine learning repository. Given the experimental results and further analysis, our proposal is a promising method for breast cancer diagnosis and can be used as an alternative method in class imbalance learning. Indeed, we observe that the feature extraction process has helped us improve diagnostic accuracy. We also demonstrate that the extracted features considering breast cancer issues are essential to high diagnostic accuracy.


Author(s):  
Zohreh Manoochehri ◽  
Sara Manoochehri ◽  
Farzaneh Soltani ◽  
Majid Sadeghifar

Background: Preeclampsia is a type of pregnancy hypertension disorder that has adverse effects on both the mother and the fetus. Despite recent advances in understanding the etiology of preeclampsia, no adequate clinical screening tests have been identified to diagnose the disorder. Objective: We aimed to provide a model based on data mining approaches that can be used as a screening tool to identify patients with this syndrome and also to identify the risk factors associated with it. Materials and Methods: The data used to perform this cross-sectional study were extracted from the clinical records of 726 mothers with preeclampsia and 726 mothers without preeclampsia who were referred to Fatemieh Hospital in Hamadan City during April 2005–March 2015. In this study, six data mining methods were adopted, including logistic regression, k-nearest neighbors, C5.0 decision tree, discriminant analysis, random forest, and support vector machine, and their performance was compared using the criteria of accuracy, sensitivity, and specificity. Results: Underlying condition, age, pregnancy season, and the number of pregnancies were the most important risk factors for diagnosing preeclampsia. The accuracies of the models were as follows: logistic regression (0.713), k-nearest neighbors (0.742), C5.0 decision tree (0.788), discriminant analysis (0.687), random forest (0.758), and support vector machine (0.791). Conclusion: Among the data mining methods employed in this study, support vector machine was the most accurate in predicting preeclampsia. Therefore, this model can be considered as a screening tool to diagnose this disorder. Key words: Preeclampsia, Random forest, C5.0 decision tree, Support vector machine, Logistic regression.
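The three comparison criteria named in this abstract (accuracy, sensitivity, specificity) can be sketched from confusion-matrix counts as follows. This is a minimal illustration, not the authors' code; the function name `screening_metrics` and the counts are hypothetical and are not taken from the study.

```python
def screening_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total       # share of all cases classified correctly
    sensitivity = tp / (tp + fn)       # share of true cases that were detected
    specificity = tn / (tn + fp)       # share of non-cases correctly ruled out
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only; not results from the study.
acc, sens, spec = screening_metrics(tp=80, fn=20, tn=70, fp=30)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

For a screening tool, sensitivity is often weighted more heavily than overall accuracy, since a missed case usually costs more than a false alarm.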


2021 ◽  
Vol 9 ◽  
Author(s):  
Qiaomei Su ◽  
Weiheng Tao ◽  
Shiguang Mei ◽  
Xiaoyuan Zhang ◽  
Kaixin Li ◽  
...  

The main purpose of this study is to establish an effective landslide susceptibility zoning model and test whether underground mined areas and ground collapse in coal mine areas seriously affect the occurrence of landslides. The research area is the Fenxi Coal Mine Area of Shanxi Province, China, where landslide data were investigated by the Shanxi Geological Environment Monitoring Center. Adopting the 5-fold cross-validation method and geostatistical analysis, the datasets of all non-landslides and landslides were randomly divided in 80:20 proportions for training and validating the models. A set of 15 condition factors including terrain, geological, hydrological, land cover, and human engineering activity factors (distance to road, distance to mined area, ground collapse density) were selected as the evaluation indices to construct the susceptibility assessment model. Three machine learning algorithms for landslide susceptibility prediction (LSP), namely C5.0 Decision Tree (C5.0), Random Forest (RF), and Support Vector Machine (SVM), were selected and compared through the Areas under the Receiver Operating Characteristic (ROC) Curves (AUC) and several statistical estimates. The study revealed that for these three models the prediction accuracies range from 83.49 to 99.29% in the training stage and from 62.26 to 73.58% in the validation stage. In the two stages, AUCs are between 0.92 and 0.99 and between 0.71 and 0.80, respectively. Using the Jenks Natural Breaks algorithm, five LSP levels are established as very low, low, medium, high, and very high probability of landslide by dividing the indices of the LSP. Compared with RF and SVM, C5.0 is considered better in the five categories according to the quantities and distribution of the landslides and their area percentage for different LSP zones. Four factors (distance to road, lithology, profile curvature, and ground collapse density) are the most suitable condition factors for LSP. The distance to mined area factor has a medium contribution and plays an obvious role in the occurrence of landslides in all the models. The result reveals that C5.0 possesses better prediction efficiency than RF and SVM, and that underground mined areas and ground collapse significantly affect the occurrence of landslides in the Fenxi Coal Mine Area.
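The abstract mentions both 5-fold cross-validation and a random 80:20 division but does not say how the two were combined. One reading, assumed here, is that each of the five folds holds out 20% of the samples for validation and trains on the remaining 80%. The sketch below illustrates that reading only; the function name `k_fold_indices`, the seed, and the sample count of 100 are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and return k (training, validation) index pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


# With k=5, every split is 80% training and 20% validation.
splits = k_fold_indices(100, k=5)
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each sample appears in exactly one validation fold, so averaged validation scores cover the whole dataset once.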


2021 ◽  
Vol 12 (1) ◽  
pp. 11-21
Author(s):  
Senthil Kumar Seethapathy ◽  
C.Naveeth Babu

Data mining uses sophisticated data analysis tools to discover previously unidentified, suitable patterns and relationships in very large data sets. Data mining tools can incorporate statistical models, machine learning methods such as neural networks or decision trees, and mathematical algorithms. As a result, data mining comprises more than collecting and managing data; it also performs analysis and prediction. The main objective of data mining is to identify valid, potentially useful, novel, and understandable correlations and patterns in existing data. Finding and analyzing useful patterns in data is known by different names (e.g., knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing). The term data mining is mainly used by statisticians, database researchers, and the business community.


2021 ◽  
Author(s):  
Meng Ji ◽  
Pierrette Bouillon

BACKGROUND Linguistic accessibility has an important impact on the reception and utilization of translated health resources among multicultural and multilingual populations. Linguistic understandability of health translation has been under-studied. OBJECTIVE Our study aimed to develop novel machine learning models for the study of the linguistic accessibility of health translations, comparing Chinese translations of the World Health Organization health materials with original Chinese health resources developed by the Chinese health authorities. METHODS Using natural language processing tools for the assessment of the readability of Chinese materials, we explored and compared the readability of Chinese health translations from the World Health Organization with original Chinese materials from China Centre for Disease Control and Prevention. RESULTS Pairwise adjusted t-tests showed that three new machine learning models achieved statistically significant improvement over the baseline logistic regression in terms of AUC: C5.0 decision tree (p=0.000, 95% CI: -0.249, -0.152), random forest (p=0.000, 95% CI: 0.139, 0.239) and XGBoost Tree (p=0.000, 95% CI: 0.099, 0.193). There was, however, no significant difference between C5.0 decision tree and random forest (p=0.513). Extreme gradient boost tree was the best model, having achieved statistically significant improvement over the C5.0 model (p=0.003) and the Random Forest model (p=0.006) at the Bonferroni-adjusted p-value threshold of 0.008. CONCLUSIONS The development of machine learning algorithms significantly improved the accuracy and reliability of current approaches to the evaluation of the linguistic accessibility of Chinese health information, especially Chinese health translations in relation to original health resources. Although the new algorithms developed were based on Chinese health resources, they can be adapted for other languages to advance current research in accessible health translation, communication, and promotion.
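The Bonferroni threshold of 0.008 is consistent with dividing a conventional significance level by the number of pairwise comparisons among four models. The abstract does not state either number, so both the level of 0.05 and the count of six comparisons are assumptions in this sketch; the p-values are the ones reported in the abstract.

```python
from math import comb

n_models = 4                         # logistic regression, C5.0, random forest, XGBoost
n_comparisons = comb(n_models, 2)    # all unordered pairs of models
alpha = 0.05                         # assumed family-wise significance level
threshold = alpha / n_comparisons    # Bonferroni-adjusted per-comparison threshold

# p-values as reported in the abstract for three of the pairwise comparisons.
reported = {
    "XGBoost vs C5.0": 0.003,
    "XGBoost vs random forest": 0.006,
    "C5.0 vs random forest": 0.513,
}
significant = {name: p < threshold for name, p in reported.items()}

print(f"comparisons={n_comparisons} threshold={threshold:.4f}")
for name, flag in significant.items():
    print(name, "significant" if flag else "not significant")
```

Under these assumptions the adjusted threshold is about 0.0083, which rounds to the reported 0.008 and reproduces the abstract's pattern of significant and non-significant differences.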


2021 ◽  
Author(s):  
Christine Ji

BACKGROUND Linguistic accessibility has an important impact on the reception and utilisation of translated health resources among multicultural and multilingual populations. Linguistic understandability of health translation has been under-studied. OBJECTIVE Our study aimed to develop novel machine learning models for the study of the linguistic accessibility of health translations, comparing Chinese translations of the World Health Organisation health materials with original Chinese health resources developed by the Chinese health authorities. METHODS Using natural language processing tools for the assessment of the readability of Chinese materials, we explored and compared the readability of Chinese health translations from the World Health Organisation with original Chinese materials from China Centre for Disease Control and Prevention. RESULTS Pairwise adjusted t-tests showed that three new machine learning models achieved statistically significant improvement over the baseline logistic regression in terms of AUC: C5.0 decision tree (p=0.000, 95% CI: -0.249, -0.152), random forest (p=0.000, 95% CI: 0.139, 0.239) and XGBoost Tree (p=0.000, 95% CI: 0.099, 0.193). There was, however, no significant difference between C5.0 decision tree and random forest (p=0.513). Extreme gradient boost tree was the best model, having achieved statistically significant improvement over the C5.0 model (p=0.003) and the Random Forest model (p=0.006) at the Bonferroni-adjusted p-value threshold of 0.008. CONCLUSIONS The development of machine learning algorithms significantly improved the accuracy and reliability of current approaches to the evaluation of the linguistic accessibility of Chinese health information, especially Chinese health translations in relation to original health resources. Although the new algorithms developed were based on Chinese health resources, they can be adapted for other languages to advance current research in accessible health translation, communication, and promotion.


2021 ◽  
Vol 6 (2) ◽  
pp. 113-119
Author(s):  
Ulfi Saidata Aesyi ◽  
Alfirna Rizqi Lahitani ◽  
Taufaldisatya Wijatama Diwangkara ◽  
Riyanto Tri Kurniawan

The number of active students has also declined at the Faculty of Engineering and Information Technology, Universitas Jenderal Achmad Yani, which greatly affects the profile of study program graduates. A system is therefore needed that can detect early those students who are at risk of dropping out. In this study, the attributes chosen were the student's GPA and percentage of attendance. These attributes are used to classify students who are predicted to drop out. The research data are student data from the Faculty of Engineering and Information Technology, Universitas Jenderal Achmad Yani. This study uses the C5.0 algorithm to build a decision tree for data classification. The C5.0 decision tree built from 304 training records had an error rate of 5%. The accuracy obtained on the 76 test records is 93%.
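To make concrete how a tree might choose a cut point on an attribute such as GPA, the sketch below scores a threshold by gain ratio, the entropy-based criterion used by C4.5, the predecessor of C5.0. Treating C5.0's split selection as equivalent is an assumption here, and the GPA values and labels are hypothetical toy data, not records from the study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` on a numeric attribute."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    split_info = -(w_left * log2(w_left) + w_right * log2(w_right))
    return gain / split_info


# Hypothetical toy data for illustration only.
gpa = [1.8, 2.0, 2.4, 3.0, 3.2, 3.6]
status = ["dropout", "dropout", "dropout", "active", "active", "active"]

for cut in (2.0, 2.4, 3.0):
    print(cut, round(gain_ratio(gpa, status, cut), 3))
```

On this toy data the cut at 2.4 separates the two classes perfectly and therefore scores highest; a tree builder would evaluate candidate cuts on every attribute and pick the best one at each node.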


Author(s):  
Yannick Kiffen ◽  
Francesco Lelli ◽  
Omid Feyli

In this preprint, we introduce a dataset containing students' enrolment applications combined with the related result of their filing procedure. The dataset contains 73 variables. Student candidates, at the time of applying for study, fill in a web form to file the procedure. A committee at Tilburg University reviews each application and decides whether the student is admissible or not. This dataset is suitable for algorithmic studies and has been used in a comparison between the Naïve Bayes and the C5.0 Decision Tree algorithms. They have been used for predicting the decision of the committee in admitting candidates to various bachelor programs. Our analysis shows that, in this particular case, a combination of the approaches outperforms both of them in terms of precision and recall.


IEEE Access ◽  
2021 ◽  
pp. 1-1
Author(s):  
Biaokai Zhu ◽  
Xinyi Hou ◽  
Sanman Liu ◽  
Wanli Ma ◽  
Meiya Dong ◽  
...  
