High Accurate and a Variant of k-fold Cross Validation Technique for Predicting the Decision Tree Classifier Accuracy

Author(s):  
D. Mabuni ◽  
S. Aquter Babu

In machine learning, the data used is often a more important determinant of performance than the logic of the program. Robust, high classification accuracies can be obtained with very large and moderately sized datasets, but not with small or very small ones; in particular, only large training datasets reliably produce robust decision tree classification results. Classification results obtained from a single training and testing dataset pair are not reliable. Cross validation instead uses many random folds of the same dataset for training and validation, so the same algorithm is applied to different pairs of training and validation datasets, which is necessary for reliable and statistically sound results. To overcome the limitation of using only a single training dataset and a single testing dataset, the existing k-fold cross validation technique uses a cross validation plan to obtain improved decision tree classification accuracy. In this paper, a new cross validation technique called prime fold is proposed, thoroughly tested experimentally, and verified on many benchmark UCI machine learning datasets. The prime fold based decision tree classification accuracies obtained in these experiments are far better than those produced by existing techniques for estimating decision tree classification accuracy.
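
The abstract does not spell out the prime fold procedure, so the sketch below only reproduces the conventional baseline it is compared against: stratified k-fold cross-validation of a decision tree on a UCI-style dataset. The dataset and fold count are assumptions for illustration.

```python
# Minimal sketch, assuming a UCI-style tabular dataset: standard stratified
# 10-fold cross-validation of a decision tree (the baseline the paper compares
# its "prime fold" plan against; the prime fold plan itself is not shown here).
from sklearn.datasets import load_iris          # stand-in for a UCI benchmark
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```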

Author(s):  
Yuhong Huang ◽  
Wenben Chen ◽  
Xiaoling Zhang ◽  
Shaofu He ◽  
Nan Shao ◽  
...  

Aim: After neoadjuvant chemotherapy (NACT), the tumor shrinkage pattern is a more reasonable outcome than pathological complete response (pCR) for deciding whether breast-conserving surgery (BCS) is possible. The aim of this article was to establish a machine learning model combining radiomics features from multiparametric MRI (mpMRI) with clinicopathologic characteristics for early prediction of the tumor shrinkage pattern prior to NACT in breast cancer. Materials and Methods: This study included 199 patients with breast cancer who successfully completed NACT and underwent subsequent breast surgery. For each patient, 4,198 radiomics features were extracted from the segmented 3D regions of interest (ROI) in mpMRI sequences, namely T1-weighted dynamic contrast-enhanced imaging (T1-DCE), fat-suppressed T2-weighted imaging (T2WI), and the apparent diffusion coefficient (ADC) map. Feature selection and supervised machine learning algorithms were used to identify predictors of the tumor shrinkage pattern as follows: (1) reducing the feature dimension with ANOVA and the least absolute shrinkage and selection operator (LASSO) with 10-fold cross-validation, (2) splitting the data into training and testing datasets and constructing prediction models with 12 classification algorithms, and (3) assessing model performance with the area under the curve (AUC), accuracy, sensitivity, and specificity. We also compared the most discriminative model across the molecular subtypes of breast cancer. Results: The Multilayer Perceptron (MLP) neural network achieved higher AUC and accuracy than the other classifiers. The radiomics model achieved a mean AUC of 0.975 (accuracy = 0.912) on the training dataset and 0.900 (accuracy = 0.828) on the testing dataset with 30-round 6-fold cross-validation. When clinicopathologic characteristics were incorporated, the mean AUC was 0.985 (accuracy = 0.930) on the training dataset and 0.939 (accuracy = 0.870) on the testing dataset. The model also achieved good AUC on the testing dataset with 30-round 5-fold cross-validation in three molecular subtypes of breast cancer: (1) HR+/HER2–: 0.901 (accuracy = 0.816), (2) HER2+: 0.940 (accuracy = 0.865), and (3) TN: 0.837 (accuracy = 0.811). Conclusions: Our machine learning model combining radiomics features and clinical characteristics is feasible and could provide a tool to predict tumor shrinkage patterns prior to NACT. The prediction model will be valuable in guiding NACT and surgical treatment in breast cancer.
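
A minimal sketch of the reported selection-and-classification chain follows: ANOVA filtering, LASSO-based selection with 10-fold CV, and an MLP evaluated with 30-round 6-fold cross-validation. The placeholder data, the ANOVA `k`, the 30-feature cap, and the MLP settings are assumptions; the study's exact configuration is not given in the abstract.

```python
# Minimal sketch, assuming placeholder data shaped like the study
# (199 patients x 4,198 radiomics features) and binary shrinkage labels.
import numpy as np
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LassoCV
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(199, 4198))                 # placeholder radiomics matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic shrinkage-pattern label

model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=200),                          # ANOVA filter (k assumed)
    SelectFromModel(LassoCV(cv=10, random_state=0),         # LASSO with 10-fold CV
                    threshold=-np.inf, max_features=30),    # keep top 30 (assumed)
    MLPClassifier(max_iter=1000, random_state=0),
)
cv = RepeatedStratifiedKFold(n_splits=6, n_repeats=30, random_state=0)
auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"mean AUC over 30-round 6-fold CV: {auc.mean():.3f}")
```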


Author(s):  
Kui Fang ◽  
Zheqing Dong ◽  
Xiling Chen ◽  
Ji Zhu ◽  
Bing Zhang ◽  
...  

Objectives: A sample with a blood clot may produce an inaccurate outcome in coagulation testing, which may mislead clinicians into making improper clinical decisions. Currently, there is no efficient method to automatically detect clots. This study demonstrates the feasibility of utilizing machine learning (ML) to identify clotted specimens. Methods: The results of coagulation testing for 192 clotted samples and 2,889 no-clot-detected (NCD) samples were retrospectively retrieved from a laboratory information system to form the training dataset and testing dataset. Standard and momentum backpropagation neural networks (BPNNs) were trained and validated on the training dataset with five-fold cross-validation. The predictive performance of the models was then assessed on the testing dataset. Results: Our results demonstrated intrinsic distinctions between the clotted and NCD specimens, reflected in differences in the testing results and in the separation of the two groups in the t-SNE analysis. The standard and momentum BPNNs could identify the sample status (clotted vs. NCD) with areas under the ROC curve of 0.966 (95% CI, 0.958–0.974) and 0.971 (95% CI, 0.9641–0.9784), respectively. Conclusions: We have described the application of ML algorithms to identifying sample status from the results of coagulation testing. This approach provides a proof-of-concept application of ML algorithms for evaluating sample quality and has the potential to facilitate clinical laboratory automation.
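
As a hedged approximation of the two networks, the sketch below uses scikit-learn's MLPClassifier with SGD, toggling the momentum term (0 for "standard", 0.9 for "momentum"), validated with 5-fold CV on ROC AUC. The synthetic imbalanced data merely stands in for the coagulation-test features and clot/NCD labels from the laboratory information system.

```python
# Minimal sketch, assuming MLPClassifier(solver="sgd") as a stand-in for the
# standard and momentum backpropagation networks described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder: ~6% positive class, mimicking 192 clotted vs 2,889 NCD samples.
X, y = make_classification(n_samples=3081, n_features=8, weights=[0.94], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, momentum in [("standard BPNN", 0.0), ("momentum BPNN", 0.9)]:
    net = make_pipeline(StandardScaler(),
                        MLPClassifier(solver="sgd", momentum=momentum,
                                      max_iter=2000, random_state=0))
    auc = cross_val_score(net, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: AUC {auc.mean():.3f}")
```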


Techno Com ◽  
2020 ◽  
Vol 19 (4) ◽  
pp. 353-363
Author(s):  
Mayanda Mega Santoni ◽  
Nurul Chamidah ◽  
Nurhafifah Matondang

Hypertension is a non-communicable disease that can lead to death because it increases the risk of various conditions such as kidney failure, heart failure, and even stroke. The risk of hypertension is driven by several factors, including age, heredity, diet and exercise, and smoking. Artificial intelligence technology, specifically machine learning, has been applied in healthcare, particularly for predicting hypertension. In this study, three machine learning algorithms were implemented: decision tree, naïve Bayes, and artificial neural networks. The data consisted of 274 records obtained from a questionnaire of 26 questions, of which 25 questions are risk-factor variables and one question is the class indicating whether the respondent has a history of hypertension. The data were processed using the KNIME data analytics platform. Before the classification models were built with decision tree, naïve Bayes, and artificial neural network, the data were preprocessed by imputing missing values, oversampling, and normalizing. The data were then split using 5-fold cross validation. The resulting classification models were evaluated using accuracy, recall, and precision. The experimental results show that the artificial neural network performed better than the decision tree and naïve Bayes, with an accuracy of 94.7%, recall of 91.5%, and precision of 97.7%.
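
Outside KNIME, the same workflow can be sketched in Python as below: mean imputation, oversampling (the abstract does not name the method, so SMOTE is assumed), min-max normalization, 5-fold CV, and the three classifiers. The synthetic 274-record table is a placeholder for the questionnaire data.

```python
# Minimal sketch, assuming SMOTE for the unnamed oversampling step and a
# placeholder 274 x 25 risk-factor table with a binary hypertension label.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline      # pipeline that supports samplers
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=274, n_features=25, random_state=0)  # placeholder

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("naive Bayes", GaussianNB()),
                  ("neural network", MLPClassifier(max_iter=2000, random_state=0))]:
    pipe = make_pipeline(SimpleImputer(strategy="mean"),   # missing-value imputation
                         SMOTE(random_state=0),            # oversampling (assumed)
                         MinMaxScaler(), clf)              # normalization + classifier
    scores = cross_validate(pipe, X, y, cv=cv, scoring=("accuracy", "recall", "precision"))
    print(name, {k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test")})
```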


2016 ◽  
Vol 7 (2) ◽  
pp. 48-58 ◽  
Author(s):  
Ivana Herliana W. Jayawardanu ◽  
Seng Hansun

In 2010, 51% of the 39 million cases of blindness worldwide were caused by cataract. In 2013, 1.8% of 1,027,763 Indonesian people surveyed suffered from cataract, and half of them had not yet been treated because they were unaware of the disease. In this research, we therefore built a system that can detect cataract early, as an ophthalmologist would. The system uses the C4.5 algorithm, which takes 150 training records as input and produces a set of rules that serve as decision factors. To test the system, k-fold cross validation was used with k equal to 10. The analysis shows that the system's accuracy is 93.2% for detecting cataract and 80.5% for detecting the type of cataract a patient might suffer from. Index terms: C4.5 algorithm, cataract, k-fold cross validation, machine learning
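
scikit-learn has no C4.5 implementation, so the sketch below uses an entropy-criterion CART tree as a rough stand-in and mirrors the 10-fold cross-validation described above; the dataset is a placeholder for the 150 cataract training records.

```python
# Minimal sketch, assuming an entropy-based CART tree as a C4.5 approximation
# and a placeholder tabular dataset in place of the cataract symptom records.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)  # C4.5-like splits
scores = cross_val_score(tree, X, y, cv=10)
print(f"10-fold accuracy: {scores.mean():.3f}")
```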


2020 ◽  
Vol 25 (40) ◽  
pp. 4296-4302 ◽  
Author(s):  
Yuan Zhang ◽  
Zhenyan Han ◽  
Qian Gao ◽  
Xiaoyi Bai ◽  
Chi Zhang ◽  
...  

Background: β-thalassemia is a common monogenic genetic disease that is very harmful to human health. The disease arises from the deletion of or defects in β-globin, which reduces synthesis of the β-globin chain and leaves a relative excess of α-chains. Inclusion bodies formed from these excess chains deposit on the cell membrane, decreasing the deformability of red blood cells, and the resulting massive destruction of red cells in the spleen gives rise to a group of hereditary haemolytic diseases. Methods: In this work, machine learning algorithms were employed to build a prediction model for inhibitors against K562 cells based on 117 inhibitors and 190 non-inhibitors. Results: The overall accuracy (ACC) of a 10-fold cross-validation test and an independent set test using Adaboost were 83.1% and 78.0%, respectively, surpassing Bayes Net, Random Forest, Random Tree, C4.5, SVM, KNN and Bagging. Conclusion: This study indicated that Adaboost could be applied to build a learning model for the prediction of inhibitors against K562 cells.
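
A minimal sketch of the evaluation protocol follows: AdaBoost scored with a 10-fold cross-validation test plus a held-out independent set. The synthetic descriptor table (307 compounds, roughly the 117/190 class split) and the 80/20 split are assumptions; the study's molecular descriptors are not reproduced here.

```python
# Minimal sketch, assuming a placeholder descriptor matrix for the 117 inhibitors
# and 190 non-inhibitors; scores come from 10-fold CV and an independent test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=307, n_features=50, weights=[0.62], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)
clf = AdaBoostClassifier(random_state=0)
cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean()
test_acc = clf.fit(X_train, y_train).score(X_test, y_test)
print(f"10-fold CV accuracy: {cv_acc:.3f}, independent test accuracy: {test_acc:.3f}")
```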


Author(s):  
Dhilsath Fathima.M ◽  
S. Justin Samuel ◽  
R. Hari Haran

Aim: The aim of this work is to develop an improved and robust machine learning model for predicting Myocardial Infarction (MI), which could have substantial clinical impact. Objectives: This paper explains how to build a machine learning based computer-aided analysis system for early and accurate prediction of MI, using the Framingham Heart Study dataset for validation and evaluation. The proposed computer-aided analysis model will support medical professionals in predicting myocardial infarction proficiently. Methods: The proposed model uses mean imputation to handle missing values in the dataset and then applies principal component analysis (PCA) to extract the optimal features and enhance classifier performance. After PCA, the reduced features are partitioned into a training dataset and a testing dataset: 70% of the data is used to train four well-known classifiers (support vector machine, k-nearest neighbor, logistic regression, and decision tree) and the remaining 30% is used to evaluate the models with the confusion matrix, accuracy, precision, sensitivity, F1-score, and the AUC-ROC curve. Results: The classifier outputs were evaluated with these performance measures. Logistic regression achieved higher accuracy than the k-NN, SVM, and decision tree classifiers, and PCA proved to be an effective feature extraction method for enhancing model performance. Logistic regression also showed better mean accuracy and standard deviation of accuracy than the other three algorithms, and its AUC-ROC score (Figures 4 and 5) was around 70%, higher than those of k-NN and the decision tree. Conclusion: From this analysis, we infer that the proposed machine learning model can act as an optimal decision-making system for predicting acute myocardial infarction at an earlier stage than existing machine learning based prediction models, using heart disease risk factors to decide when to start lifestyle modification and medical treatment to prevent heart disease.
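
The sketch below mirrors the described pipeline: mean imputation, PCA, a 70/30 split, and the four named classifiers compared on accuracy and AUC. The synthetic data and the PCA component count are placeholders; the real run would load the Framingham-style risk-factor table.

```python
# Minimal sketch, assuming a placeholder risk-factor table and an assumed
# PCA component count; the four classifiers match those named in the abstract.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, n_features=15, random_state=0)  # placeholder

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
classifiers = {"logistic regression": LogisticRegression(max_iter=1000),
               "k-NN": KNeighborsClassifier(),
               "SVM": SVC(probability=True),
               "decision tree": DecisionTreeClassifier(random_state=0)}
for name, clf in classifiers.items():
    pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(),
                         PCA(n_components=10), clf)          # component count assumed
    pipe.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
    print(f"{name}: accuracy {pipe.score(X_te, y_te):.3f}, AUC {auc:.3f}")
```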


2021 ◽  
Vol 11 (1) ◽  
pp. 450
Author(s):  
Jinfu Liu ◽  
Mingliang Bai ◽  
Na Jiang ◽  
Ran Cheng ◽  
Xianling Li ◽  
...  

Multi-classifiers are widely applied in many practical problems, but the features that significantly discriminate a certain class from the others are often deleted during the feature selection process of multi-classifiers, which seriously decreases generalization ability. This paper refers to this phenomenon as interclass interference in multi-class problems and analyzes its cause in detail. It then summarizes three interclass interference suppression methods, based on all features, on one-class classifiers, and on binary classifiers, and compares their effects on interclass interference via 10-fold cross-validation experiments on 14 UCI datasets. The experiments show that the method based on binary classifiers can suppress interclass interference efficiently and obtains the best classification accuracy of the three methods. Further experiments compared the suppression effects of two binary-classifier strategies, the one-versus-one method and the one-versus-all method. The results show that the one-versus-one method achieves a better suppression effect on interclass interference and better classification accuracy. By proposing the concept of interclass interference and studying its suppression methods, this paper significantly improves the generalization ability of multi-classifiers.
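
A minimal sketch of the comparison between the two binary-decomposition strategies discussed above follows, using 10-fold CV on a multi-class UCI-style dataset; the base classifier and dataset are assumptions, and any binary learner could be substituted.

```python
# Minimal sketch, assuming logistic regression as the base binary classifier:
# one-versus-one vs one-versus-all decomposition compared with 10-fold CV.
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
base = LogisticRegression(max_iter=1000)
for name, wrapper in [("one-versus-one", OneVsOneClassifier(base)),
                      ("one-versus-all", OneVsRestClassifier(base))]:
    model = make_pipeline(StandardScaler(), wrapper)
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: 10-fold accuracy {acc:.3f}")
```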


2021 ◽  
pp. 1-10
Author(s):  
Chao Dong ◽  
Yan Guo

The wide application of artificial intelligence technology in various fields has accelerated the exploration of the information hidden in large amounts of data, and data mining methods are increasingly used to study higher education management. The decision tree classification algorithm, a data analysis method in data mining, is well suited to this task thanks to its high classification accuracy, intuitive decision results, and strong generalization ability. To address the sensitivity of data processing and decision tree classification to noisy data, this paper proposes corresponding improvements: a variable precision rough set attribute selection criterion based on a scale function, which considers both the weighted approximation accuracy and the number of attribute values. This improves robustness to noisy data, reduces bias in attribute selection, and improves classification accuracy. At the same time, a suppression factor threshold, support, and confidence are introduced into the tree pre-pruning process, which simplifies the tree structure. Comparative experiments on standard datasets show that the improved algorithm proposed in this paper outperforms other decision tree algorithms and can effectively realize differentiated classification for higher education management.
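
The paper's rough-set attribute selection and suppression-factor pre-pruning are not available as off-the-shelf components, so the sketch below is only a loose analogue of the pre-pruning idea: threshold-based early stopping of a decision tree in scikit-learn, with min_samples_split as a support-like cut and min_impurity_decrease as a rough stand-in for a confidence cut. The threshold values are arbitrary assumptions.

```python
# Loose analogue only (not the paper's algorithm): generic threshold-based
# pre-pruning of a decision tree, evaluated with 10-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
pruned = DecisionTreeClassifier(min_samples_split=20,        # support-style threshold
                                min_impurity_decrease=0.01,  # confidence-style threshold
                                random_state=0)
acc = cross_val_score(pruned, X, y, cv=10).mean()
print(f"pre-pruned tree, 10-fold accuracy: {acc:.3f}")
```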


2021 ◽  
Vol 102 ◽  
pp. 04004
Author(s):  
Jesse Jeremiah Tanimu ◽  
Mohamed Hamada ◽  
Mohammed Hassan ◽  
Saratu Yusuf Ilu

With the advent of new technologies in the medical field, huge amounts of cancer data have been collected and are readily accessible to the medical research community. Over the years, researchers have employed advanced data mining and machine learning techniques to develop better models that can analyze datasets to extract patterns, ideas, and hidden knowledge. The mined information can be used to support decision making in diagnostic processes. These techniques can effectively predict future outcomes of certain diseases and can discover and identify patterns and relationships within complex datasets. In this research, a predictive model for the outcome of patients' cervical cancer results has been developed, given risk patterns from individual medical records and preliminary screening tests. This work presents a decision tree (DT) classification algorithm and shows the benefit of feature selection, using the recursive feature elimination technique for dimensionality reduction, in improving the accuracy, sensitivity, and specificity of the model. The dataset employed here suffers from missing values and is highly imbalanced, so a combination of under- and oversampling techniques called SMOTETomek was employed. A comparative analysis of the proposed model was performed to show the effect of feature selection and class-imbalance handling on the classifier's accuracy, sensitivity, and specificity. The DT with the selected features and SMOTETomek gave better results, with an accuracy of 98%, sensitivity of 100%, and specificity of 97%. The decision tree classifier is shown to perform very well when the features are reduced and the class imbalance problem is addressed.
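
A minimal sketch of the described approach follows: recursive feature elimination with a decision tree, SMOTETomek resampling for the class imbalance, and accuracy, sensitivity, and specificity computed on a held-out split. The imbalanced synthetic table and the number of selected features are assumptions standing in for the cervical-cancer risk data.

```python
# Minimal sketch, assuming a placeholder imbalanced risk-factor table; SMOTETomek
# is applied only to the training data because it sits inside the fitted pipeline.
from imblearn.combine import SMOTETomek
from imblearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=30, weights=[0.9], random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
model = make_pipeline(
    SMOTETomek(random_state=0),                                            # resampling
    RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=10),  # RFE (count assumed)
    DecisionTreeClassifier(random_state=0),
)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
print(f"accuracy:    {model.score(X_te, y_te):.3f}")
print(f"sensitivity: {recall_score(y_te, y_pred):.3f}")
print(f"specificity: {recall_score(y_te, y_pred, pos_label=0):.3f}")
```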


Author(s):  
Mr. Bhavar Shivam S.

Today we do many things online, from shopping to sharing data on social networking sites. Social networking (SNS) can help relieve stress and depression by letting people share their thoughts, so emotion detection has become a hot topic. But analyzing emotions on an SNS like Twitter is difficult: it generates hundreds of thousands of tweets each day, and it is impossible for a human being to read them all and decide the emotion behind each one. To help understand the texts on an SNS site, we designed a project that tracks tweets and predicts the sentiment behind them, whether positive or negative. This is achieved by integrating an SNS with NLP and machine learning. We use Twitter as the SNS because it generates a large amount of data that is freely accessible through an API. First, we enter a keyword and fetch tweets from Twitter. Stop words are then removed from these tweets using the NLTK stop words database. The tweets are then passed through POS tagging, keeping only the relevant grammatical word forms and removing the rest. Next, we create a training dataset with two classes, positive and negative, and train an SVM algorithm on it. Each tweet is then passed to the SVM as the testing dataset, which classifies it into one of the two classes, positive or negative. Thus, our application helps recognize the emotion behind a tweet.
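
A minimal sketch of the text pipeline follows: NLTK stop-word removal, POS tagging to keep sentiment-bearing word classes, and a linear SVM over TF-IDF features. Fetching tweets through the Twitter API is omitted; the tiny hand-labelled lists, the kept POS tags, and the TF-IDF representation are illustrative assumptions, not the author's exact setup.

```python
# Minimal sketch, assuming placeholder "tweets" instead of Twitter API calls.
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

nltk.download("stopwords", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)  # newer NLTK resource name

train_texts = ["I love this phone, great battery", "Awful service, very disappointed",
               "Fantastic update, works smoothly", "Terrible app, keeps crashing"]
train_labels = ["positive", "negative", "positive", "negative"]   # hand-labelled placeholder

stop = set(stopwords.words("english"))
KEEP = ("JJ", "RB", "VB", "NN")   # adjectives, adverbs, verbs, nouns (assumed choice)

def clean(text):
    # drop stop words and non-alphabetic tokens, then keep selected POS classes
    words = [w for w in text.lower().split() if w.isalpha() and w not in stop]
    return " ".join(w for w, tag in nltk.pos_tag(words) if tag.startswith(KEEP))

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit([clean(t) for t in train_texts], train_labels)
print(model.predict([clean("the new release is great and fast")]))
```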

