Use of a Recursive-Rule eXtraction algorithm with J48graft to achieve highly accurate and concise rule extraction from a large breast cancer dataset

Breast cancer is the second most leading cancer occurring in women compared to all other cancers. Around 1.1 million cases were recorded in 2004. Observed rates of this cancer increase with industrialization and urbanization and also with facilities for early detection. It remains much more common in high-income countries but is now increasing rapidly in middle- and low-income countries including within Africa, much of Asia, and Latin America. Breast cancer is fatal in under half of all cases and is the leading cause of death from cancer in women, accounting for 16% of all cancer deaths worldwide. The objective of this research paper is to present a report on breast cancer where we took advantage of those available technological advancements to develop prediction models for breast cancer survivability. We used three popular data mining algorithms (Naïve Bayes, RBF Network, J48) to develop the prediction models using a large dataset (683 breast cancer cases). We also used 10-fold cross-validation methods to measure the unbiased estimate of the three prediction models for performance comparison purposes. The results (based on average accuracy Breast Cancer dataset) indicated that the Naïve Bayes is the best predictor with 97.36% accuracy on the holdout sample (this prediction accuracy is better than any reported in the literature), RBF Network came out to be the second with 96.77% accuracy, J48 came out third with 93.41% accuracy.

Download Full-text

PERFORMANCE ANALYSIS OF BREAST CANCER CLASSIFICATION USING DECISION TREE CLASSIFIERS

International Journal of Current Pharmaceutical Research ◽

10.22159/ijcpr.2017v9i2.17383 ◽

2017 ◽

Vol 9 (2) ◽

pp. 19 ◽

Cited By ~ 6

Author(s):

P. Hamsagayathri ◽

P. Sampath

Keyword(s):

Breast Cancer ◽

Decision Tree ◽

Ductal Carcinoma ◽

Research Work ◽

The United States ◽

Breast Cancer Dataset ◽

Decision Tree Classifier ◽

Cancer Dataset ◽

Term Survival ◽

Tree Classifier

Breast cancer is one of the dangerous cancers among world’s women above 35 y. The breast is made up of lobules that secrete milk and thin milk ducts to carry milk from lobules to the nipple. Breast cancer mostly occurs either in lobules or in milk ducts. The most common type of breast cancer is ductal carcinoma where it starts from ducts and spreads across the lobules and surrounding tissues. According to the medical survey, each year there are about 125.0 per 100,000 new cases of breast cancer are diagnosed and 21.5 per 100,000 women due to this disease in the United States. Also, 246,660 new cases of women with cancer are estimated for the year 2016. Early diagnosis of breast cancer is a key factor for long-term survival of cancer patients. Classification plays an important role in breast cancer detection and used by researchers to analyse and classify the medical data. In this research work, priority-based decision tree classifier algorithm has been implemented for Wisconsin Breast cancer dataset. This paper analyzes the different decision tree classifier algorithms for Wisconsin original, diagnostic and prognostic dataset using WEKA software. The performance of the classifiers are evaluated against the parameters like accuracy, Kappa statistic, Entropy, RMSE, TP Rate, FP Rate, Precision, Recall, F-Measure, ROC, Specificity, Sensitivity.

Download Full-text

Breast Cancer Prediction using Machine Learning

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d8292.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 4879-4881

Keyword(s):

Breast Cancer ◽

Random Forest ◽

Data Science ◽

Breast Cancer Dataset ◽

Random Forest Algorithm ◽

Medical Field ◽

Cancer Dataset ◽

Cancer Prediction ◽

Time Consumption ◽

Simulated Environment

One of the most dreadful disease is breast cancer and it has a potential cause for death in women. Every year, death rate increases drastically due to breast cancer. An effective way to classify data is through classification or data mining. This becomes very handy, especially in the medical field where diagnosis and analysis are done through these techniques. Wisconsin Breast cancer dataset is used to perform a comparison between SVM, Logistic Regression, Naïve Bayes and Random Forest. Evaluating the correctness in classifying data based on accuracy and time consumption is used to determine the efficiency of the algorithms, which is the main objective. Based on the result of performed experiments, the Random Forest algorithm shows the highest accuracy (99.76%) with the least error rate. ANACONDA Data Science Platform is used to execute all the experiments in a simulated environment.

Download Full-text

Classifications of Breast Cancer Diagnosis using Machine Learning

International Journal of Computers ◽

10.46300/9108.2020.14.13 ◽

2020 ◽

Vol 14 ◽

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Random Forest ◽

Breast Cancer Diagnosis ◽

Performance Comparison ◽

Support Vector ◽

Breast Cancer Dataset ◽

K Nearest Neighbors ◽

Cancer Dataset ◽

Machine Learning Classification

Breast Cancer (BC) is amongst the most common and leading causes of deaths in women throughout the world. Recently, classification and data analysis tools are being widely used in the medical field for diagnosis, prognosis and decision making to help lower down the risks of people dying or suffering from diseases. Advanced machine learning methods have proven to give hope for patients as this has helped the doctors in early detection of diseases like Breast Cancer that can be fatal, in support with providing accurate outcomes. However, the results highly depend on the techniques used for feature selection and classification which will produce a strong machine learning model. In this paper, a performance comparison is conducted using four classifiers which are Multilayer Perceptron (MLP), Support Vector Machine (SVM), K-Nearest Neighbors (KNN) and Random Forest on the Wisconsin Breast Cancer dataset to spot the most effective predictors. The main goal is to apply best machine learning classification methods to predict the Breast Cancer as benign or malignant using terms such as accuracy, f-measure, precision and recall. Experimental results show that Random forest is proven to achieve the highest accuracy of 99.26% on this dataset and features, while SVM and KNN show 97.78% and 97.04% accuracy respectively. MLP shows the least accuracy of 94.07%. All the experiments are conducted using RStudio as the data mining tool platform.

Download Full-text

Comparison of Imputation Methods on Retrospective Breast Cancer Data in Tanzania: A Case Study of Muhimbili and Ocean Road Hospitals

10.21203/rs.3.rs-820770/v1 ◽

2021 ◽

Author(s):

Rahibu A. Abassi ◽

Amina S. Msengwa ◽

Rocky R. J. Akarro

Keyword(s):

Breast Cancer ◽

Logistic Regression ◽

Missing Data ◽

Binary Logistic Regression ◽

Breast Cancer Dataset ◽

Cancer Dataset ◽

Imputation Methods ◽

Multiple Imputations ◽

Predictive Mean Matching ◽

Mean Square Errors

Abstract Background Clinical data are at risk of having missing or incomplete values for several reasons including patients’ failure to attend clinical measurements, wrong interpretations of measurements, and measurement recorder’s defects. Missing data can significantly affect the analysis and results might be doubtful due to bias caused by omission of missed observation during statistical analysis especially if a dataset is considerably small. The objective of this study is to compare several imputation methods in terms of efficiency in filling-in the missing data so as to increase the prediction and classification accuracy in breast cancer dataset. Methods Five imputation methods namely series mean, k-nearest neighbour, hot deck, predictive mean matching, and multiple imputations were applied to replace the missing values to the real breast cancer dataset. The efficiency of imputation methods was compared by using the Root Mean Square Errors and Mean Absolute Errors to obtain a suitable complete dataset. Binary logistic regression and linear discrimination classifiers were applied to the imputed dataset to compare their efficacy on classification and discrimination. Results The evaluation of imputation methods revealed that the predictive mean matching method was better off compared to other imputation methods. In addition, the binary logistic regression and linear discriminant analyses yield almost similar values on overall classification rates, sensitivity and specificity. Conclusion The predictive mean matching imputation showed higher accuracy in estimating and replacing missing/incomplete data values in a real breast cancer dataset under the study. It is a more effective and good method to handle missing data in this scenario. We recommend to replace missing data by using predictive mean matching since it is a plausible approach toward multiple imputations for numerical variables, as it improves estimation and prediction accuracy over the use complete-case analysis especially when percentage of missing data is not very small.

Download Full-text

HIOC: a hybrid imputation method to predict missing values in medical datasets

International Journal of Intelligent Computing and Cybernetics ◽

10.1108/ijicc-03-2021-0042 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Pooja Rani ◽

Rajneesh Kumar ◽

Anurag Jain

Keyword(s):

Breast Cancer ◽

Heart Disease ◽

Missing Values ◽

Imputation Method ◽

Support Vector ◽

Correct Prediction ◽

Breast Cancer Dataset ◽

Cancer Dataset ◽

Content Type ◽

Imputation Methods

PurposeDecision support systems developed using machine learning classifiers have become a valuable tool in predicting various diseases. However, the performance of these systems is adversely affected by the missing values in medical datasets. Imputation methods are used to predict these missing values. In this paper, a new imputation method called hybrid imputation optimized by the classifier (HIOC) is proposed to predict missing values efficiently.Design/methodology/approachThe proposed HIOC is developed by using a classifier to combine multivariate imputation by chained equations (MICE), K nearest neighbor (KNN), mean and mode imputation methods in an optimum way. Performance of HIOC has been compared to MICE, KNN, and mean and mode methods. Four classifiers support vector machine (SVM), naive Bayes (NB), random forest (RF) and decision tree (DT) have been used to evaluate the performance of imputation methods.FindingsThe results show that HIOC performed efficiently even with a high rate of missing values. It had reduced root mean square error (RMSE) up to 17.32% in the heart disease dataset and 34.73% in the breast cancer dataset. Correct prediction of missing values improved the accuracy of the classifiers in predicting diseases. It increased classification accuracy up to 18.61% in the heart disease dataset and 6.20% in the breast cancer dataset.Originality/valueThe proposed HIOC is a new hybrid imputation method that can efficiently predict missing values in any medical dataset.

Download Full-text