Entropy Based k Nearest Neighbor Pattern Classification (EbkNN): En-route to Achieving a High Accuracy in Breast Cancer Diagnosis

2020 ◽  
Vol 8 (6) ◽  
Author(s):  
Pushpam Sinha ◽  
Ankita Sinha

Entropy based k-Nearest Neighbor pattern classification (EbkNN) is a variation of the conventional k-Nearest Neighbor rule of pattern classification that optimizes the value of k separately for each test data point based on calculations of entropy. The formula for entropy used in EbkNN is the one popularly defined in information theory for a set of n different types of information (classes) attached to a total of m objects (data points), with each object described by f features. In EbkNN, the value of k chosen for discriminating a given test data point is the one for which the entropy is the least non-zero value. The other rules of conventional kNN are retained in EbkNN. It is concluded that EbkNN works best for binary classification; it is computationally prohibitive to use EbkNN for discriminating the data points of a test dataset into more than two classes. The biggest advantage of EbkNN over conventional kNN is that a single run of the EbkNN algorithm yields the optimum classification of the test data, whereas the conventional kNN algorithm has to be run separately for each value in a selected range of k, from which the optimum k is then chosen. We also tested our EbkNN method on the WDBC (Wisconsin Diagnostic Breast Cancer) dataset. There are 569 instances in this dataset, and we made a random choice of 290 instances as the training dataset and used the remaining 279 instances as the test dataset. We obtained an exceptionally remarkable result with the EbkNN method: accuracy close to 100%, better than the results reported by most other researchers who have worked on the WDBC dataset.
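A minimal sketch of the entropy-based k selection described in this abstract is given below. It is an approximation built only from the description above (Shannon entropy of the class distribution among the k nearest neighbours, with the k giving the least non-zero entropy selected); the authors' exact formula, tie-breaking rules, and the fallback used when every candidate k gives zero entropy may differ.

```python
# Sketch of EbkNN-style prediction for one test point (binary case assumed).
import numpy as np
from collections import Counter

def class_entropy(labels):
    """Shannon entropy of the class distribution among a set of labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def ebknn_predict(X_train, y_train, x_test, k_max=15):
    """Choose the k whose neighbourhood has the least non-zero entropy,
    then return the majority class among those k nearest neighbours."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    order = np.argsort(dists)
    best_k, best_entropy = None, np.inf
    for k in range(1, min(k_max, len(X_train)) + 1):
        h = class_entropy(y_train[order[:k]])
        if 0.0 < h < best_entropy:
            best_k, best_entropy = k, h
    if best_k is None:      # every candidate k was unanimous: fall back to k = 1
        best_k = 1
    return Counter(y_train[order[:best_k]]).most_common(1)[0][0]
```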

2021 ◽  
Vol 5 (3) ◽  
pp. 1038
Author(s):  
Indra Rukmana ◽  
Arvin Rasheda ◽  
Faiz Fathulhuda ◽  
Muh Rizky Cahyadi ◽  
Fitriyani Fitriyani

This research focuses on evaluating the performance of three classification algorithms, namely Naïve Bayes, the J48 decision tree, and K-Nearest Neighbor. Speed and percentage accuracy are the benchmarks for algorithm performance in this study. The study uses the Breast Cancer and Thoracic Surgery datasets, downloaded from the UCI Machine Learning Repository website, and the classification tests were carried out with Weka version 3.8.5. The results show that the J48 decision tree algorithm has the best accuracy, namely 75.6% in cross-validation test mode for the Breast Cancer dataset and 84.5% for the Thoracic Surgery dataset.
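The study above was run in Weka; a rough scikit-learn analogue of the same cross-validated comparison is sketched below. DecisionTreeClassifier with the entropy criterion stands in for J48, the built-in WDBC data is only a placeholder for the UCI Breast Cancer and Thoracic Surgery files, and the hyper-parameters are defaults rather than the paper's settings.

```python
# 10-fold cross-validated accuracy for the three classifiers compared above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset
models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree (J48-like)": DecisionTreeClassifier(criterion="entropy"),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```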


2020 ◽  
Vol 12 (4) ◽  
pp. 151-159
Author(s):  
Irma Handayani ◽  
Ikrimach Ikrimach

In the medical field there are many records of disease sufferers, one of which is data on breast cancer. The process of extracting information from previously unknown data is known as data mining. Data mining uses pattern recognition techniques such as statistics and mathematics to find patterns in old data or cases. One of the main roles of data mining is classification. In a classification dataset there is one objective attribute, also called the label attribute; its value for new data is predicted on the basis of the other attributes observed in past data. The number of attributes can affect the performance of an algorithm, and if the classification process turns out to be inaccurate, the researcher needs to recheck each previous stage to look for errors. The best algorithm for one type of data is not necessarily good for another. For this reason, the K-Nearest Neighbor and Naïve Bayes algorithms are used as a solution to this problem. The research method was to prepare data from the breast cancer dataset, conduct training and testing on the data, and then perform a comparative analysis. The research target is to determine the better algorithm for classifying breast cancer, so that it can be predicted from the existing parameters which patients have malignant and which have benign breast cancer. This pattern can be used as a diagnostic measure so that the disease can be detected earlier, which is expected to reduce the mortality rate from breast cancer. In this comparison, the method produces an accuracy of 95.79% for K-Nearest Neighbor and 93.39% for Naïve Bayes.


Author(s):  
Gaurav Singh

Breast cancer is a prevalent cause of death and the most widespread type of cancer among women worldwide. The prime objective of this paper is to create a model for predicting breast cancer using various machine learning classification algorithms, namely k-Nearest Neighbor (kNN), Support Vector Machine (SVM), Logistic Regression (LR), and Gaussian Naive Bayes (NB), and furthermore to assess and compare the performance of the various classifiers in terms of accuracy, precision, recall, F1-score, and Jaccard index. The breast cancer dataset is publicly available on the UCI Machine Learning Repository; in the implementation phase the dataset is partitioned into 80% for training and 20% for testing, and the machine learning algorithms are then applied. k-Nearest Neighbors achieved a significant performance with respect to all parameters.
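A hedged sketch of this evaluation protocol follows: an 80/20 split of the scikit-learn copy of the UCI breast cancer data and the five reported metrics for each classifier. The paper's exact preprocessing and hyper-parameters are not specified, so defaults (and a fixed random seed) are assumed here.

```python
# 80/20 split, four classifiers, five metrics (accuracy, precision, recall,
# F1, Jaccard) as described in the abstract above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, jaccard_score)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

classifiers = {
    "kNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "LR": LogisticRegression(max_iter=5000),
    "NB": GaussianNB(),
}
for name, clf in classifiers.items():
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    print(name,
          f"acc={accuracy_score(y_test, y_pred):.3f}",
          f"prec={precision_score(y_test, y_pred):.3f}",
          f"rec={recall_score(y_test, y_pred):.3f}",
          f"f1={f1_score(y_test, y_pred):.3f}",
          f"jaccard={jaccard_score(y_test, y_pred):.3f}")
```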


2019 ◽  
Vol 3 (3) ◽  
pp. 458-469
Author(s):  
Azminuddin I. S. Azis ◽  
Irma Surya Kumala Idris ◽  
Budy Santoso ◽  
Yasin Aril Mustofa

Breast cancer is the most common cancer found in women, and its death rate is still in second place among cancers. The high accuracy reported by the machine learning approaches proposed in related studies is often achieved; however, without efficient pre-processing, the proposed breast cancer prediction models remain in question. Therefore, this research aims to improve the accuracy of machine learning methods through pre-processing that is more efficient for breast cancer prediction: missing value replacement, data transformation, smoothing of noisy data, feature selection / attribute weighting, data validation, and unbalanced class reduction. The results of this study propose several approaches: C4.5 with Z-score normalization and a Genetic Algorithm for the Breast Cancer Dataset with 77.27% accuracy, 7-Nearest Neighbor with Min-Max normalization and Particle Swarm Optimization for the Wisconsin Breast Cancer Dataset - Original with 97.85% accuracy, an Artificial Neural Network with Z-score normalization and Forward Selection for the Wisconsin Breast Cancer Dataset - Diagnostic with 98.24% accuracy, and 11-Nearest Neighbor with Min-Max normalization and Particle Swarm Optimization for the Wisconsin Breast Cancer Dataset - Prognostic with 83.33% accuracy. The performance of these approaches is better than that of standard machine learning methods and of the methods proposed by the best of the previous related studies.
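A simplified sketch of one of the proposed pipelines (7-Nearest Neighbor with Min-Max normalization) is shown below. The paper uses Particle Swarm Optimization for attribute weighting; SelectKBest is used here only as a simpler stand-in for that step, and the built-in WDBC data substitutes for the Wisconsin Breast Cancer Dataset - Original.

```python
# Min-Max scaling -> feature selection -> 7-NN, evaluated with 10-fold CV.
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)    # placeholder dataset
pipe = Pipeline([
    ("scale", MinMaxScaler()),                 # Min-Max normalization
    ("select", SelectKBest(f_classif, k=10)),  # crude substitute for PSO weighting
    ("knn", KNeighborsClassifier(n_neighbors=7)),
])
print("10-fold CV accuracy:", cross_val_score(pipe, X, y, cv=10).mean())
```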


Data mining is the essential step that identifies hidden patterns in large repositories, and medical diagnosis has become a major area of current data mining research. Machine learning techniques use statistical methods to enable machines to improve with experience and identify hidden patterns in data; they include regression algorithms, clustering algorithms, classification algorithms, neural networks (ANN, CNN, deep learning), recommender system algorithms, Apriori algorithms, page ranking algorithms, text search, and natural language processing (NLP). However, due to a lack of evaluation, these algorithms have been unsuccessful in finding a better classifier for images to estimate classification accuracy in medical image processing. Classification is a supervised learning technique that predicts the class of an unknown object; its main purpose is to identify an unknown class by consulting the characteristics of neighboring classes. Clustering can be regarded as unsupervised learning, as it labels objects on the basis of similar characteristics without consulting class labels. The main principle of clustering is to measure distances (nearby and far away) based on similarities and dissimilarities, group the objects accordingly, and hence identify outliers (objects that lie far away from the rest). Feature extraction, or variable selection, is a method of obtaining a subset of relevant characteristics from a large dataset. Too many features of a class may affect the accuracy of a classifier; therefore, feature extraction can be used to eliminate irrelevant attributes and increase the classifier's accuracy. In this paper we performed an induction to increase classifier accuracy by applying mining techniques in the WEKA tool. A Breast Cancer dataset was chosen from the learning repository, and an experimental analysis was conducted with WEKA on the training dataset by applying the Naïve Bayes, BayesNet, PART, ZeroR, J48, and Random Forest techniques to the Wisconsin Breast Cancer data. Finally, the best classifier, i.e. the one with the highest accuracy, is presented.
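The claim above that removing irrelevant attributes can raise classifier accuracy is easy to illustrate. The paper itself runs its classifiers in WEKA; the sketch below is a scikit-learn approximation that compares one of the listed classifiers (Random Forest) with and without a simple feature-selection step, with the built-in WDBC data standing in for the Wisconsin file.

```python
# Cross-validated accuracy of Random Forest on all features vs. the top 10
# features selected by mutual information.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
full = RandomForestClassifier(n_estimators=100, random_state=0)
reduced = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=10)),  # drop weak attributes
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
print("all features :", cross_val_score(full, X, y, cv=10).mean())
print("top-10 only  :", cross_val_score(reduced, X, y, cv=10).mean())
```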


Mekatronika ◽  
2020 ◽  
Vol 2 (2) ◽  
pp. 1-12
Author(s):  
Muhammad Nur Aiman Shapiee ◽  
Muhammad Ar Rahim Ibrahim ◽  
Muhammad Amirul Abdullah ◽  
Rabiu Muazu Musa ◽  
Noor Azuan Abu Osman ◽  
...  

The skateboarding scene has arrived at new statures, particularly with its first appearance at the now delayed Tokyo Summer Olympic Games. Hence, attributable to the size of the game in such competitive games, progressed creative appraisal approaches have progressively increased due consideration by pertinent partners, particularly with the enthusiasm of a more goal-based assessment. This study purposes for classifying skateboarding tricks, specifically Frontside 180, Kickflip, Ollie, Nollie Front Shove-it, and Pop Shove-it over the integration of image processing, Trasnfer Learning (TL) to feature extraction enhanced with tradisional Machine Learning (ML) classifier. A male skateboarder performed five tricks every sort of trick consistently and the YI Action camera captured the movement by a range of 1.26 m. Then, the image dataset were features built and extricated by means of  three TL models, and afterward in this manner arranged to utilize by k-Nearest Neighbor (k-NN) classifier. The perception via the initial experiments showed, the MobileNet, NASNetMobile, and NASNetLarge coupled with optimized k-NN classifiers attain a classification accuracy (CA) of 95%, 92% and 90%, respectively on the test dataset. Besides, the result evident from the robustness evaluation showed the MobileNet+k-NN pipeline is more robust as it could provide a decent average CA than other pipelines. It would be demonstrated that the suggested study could characterize the skateboard tricks sufficiently and could, over the long haul, uphold judges decided for giving progressively objective-based decision.
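A minimal sketch of the best-performing pipeline described above (TL feature extraction with MobileNet followed by a k-NN classifier) is given below. The trick frames and labels here are random placeholders for the real camera footage, and the authors' preprocessing, hyper-parameter optimization, and evaluation protocol may differ.

```python
# MobileNet as a frozen feature extractor + k-NN classifier on placeholder data.
import numpy as np
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.applications.mobilenet import preprocess_input
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: 50 frames of 224x224 RGB, 10 frames per trick class.
images = np.random.randint(0, 256, size=(50, 224, 224, 3)).astype("float32")
labels = np.repeat(np.arange(5), 10)

# MobileNet with ImageNet weights and the classification head removed;
# global average pooling yields one feature vector per frame.
backbone = MobileNet(weights="imagenet", include_top=False, pooling="avg")
features = backbone.predict(preprocess_input(images))

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```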


2018 ◽  
Vol 12 (2) ◽  
pp. 119-126 ◽  
Author(s):  
Vikas Chaurasia ◽  
Saurabh Pal ◽  
BB Tiwari

Breast cancer is the second most common cancer occurring in women compared with all other cancers; around 1.1 million cases were recorded in 2004. Observed rates of this cancer increase with industrialization and urbanization and also with facilities for early detection. It remains much more common in high-income countries but is now increasing rapidly in middle- and low-income countries, including within Africa, much of Asia, and Latin America. Breast cancer is fatal in under half of all cases and is the leading cause of death from cancer in women, accounting for 16% of all cancer deaths worldwide. The objective of this research paper is to present a report on breast cancer in which we take advantage of available technological advancements to develop prediction models for breast cancer survivability. We used three popular data mining algorithms (Naïve Bayes, RBF Network, J48) to develop the prediction models using a large dataset (683 breast cancer cases). We also used 10-fold cross-validation to obtain an unbiased estimate of the performance of the three prediction models for comparison purposes. The results (based on average accuracy on the Breast Cancer dataset) indicated that Naïve Bayes is the best predictor with 97.36% accuracy on the holdout sample (a prediction accuracy better than any reported in the literature), the RBF Network came out second with 96.77% accuracy, and J48 came out third with 93.41% accuracy.


Author(s):  
P. Hamsagayathri ◽  
P. Sampath

Breast cancer is one of the most dangerous cancers among women above 35 years of age worldwide. The breast is made up of lobules that secrete milk and thin milk ducts that carry milk from the lobules to the nipple. Breast cancer mostly occurs either in the lobules or in the milk ducts. The most common type of breast cancer is ductal carcinoma, which starts in the ducts and spreads across the lobules and surrounding tissues. According to medical surveys, each year about 125.0 new cases of breast cancer per 100,000 women are diagnosed, and 21.5 per 100,000 women die of this disease in the United States. Also, 246,660 new cases of women with cancer were estimated for the year 2016. Early diagnosis of breast cancer is a key factor for the long-term survival of cancer patients. Classification plays an important role in breast cancer detection and is used by researchers to analyse and classify medical data. In this research work, a priority-based decision tree classifier algorithm has been implemented for the Wisconsin Breast Cancer dataset. This paper analyzes different decision tree classifier algorithms for the Wisconsin original, diagnostic, and prognostic datasets using WEKA software. The performance of the classifiers is evaluated against parameters such as accuracy, Kappa statistic, entropy, RMSE, TP rate, FP rate, precision, recall, F-measure, ROC, specificity, and sensitivity.
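The priority-based decision tree itself is not reproduced here; the sketch below only illustrates, with a plain scikit-learn decision tree on the WDBC data, how a classifier can be scored on several of the metrics listed above (accuracy, Kappa statistic, sensitivity, specificity). The paper's own experiments were carried out in WEKA.

```python
# Decision tree on WDBC, scored on accuracy, Kappa, sensitivity and specificity.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
y_pred = DecisionTreeClassifier(random_state=1).fit(X_train, y_train).predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("accuracy   :", accuracy_score(y_test, y_pred))
print("Kappa      :", cohen_kappa_score(y_test, y_pred))
print("sensitivity:", tp / (tp + fn))   # TP rate / recall
print("specificity:", tn / (tn + fp))   # TN rate
```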


2019 ◽  
Vol 8 (4) ◽  
pp. 4879-4881

One of the most dreadful diseases is breast cancer, and it is a potential cause of death in women; every year the death rate due to breast cancer increases drastically. An effective way to handle such data is through classification or data mining, which becomes very handy, especially in the medical field, where diagnosis and analysis are done through these techniques. The Wisconsin Breast Cancer dataset is used to perform a comparison between SVM, Logistic Regression, Naïve Bayes, and Random Forest. The main objective is to determine the efficiency of the algorithms by evaluating the correctness of classification in terms of accuracy and time consumption. Based on the results of the performed experiments, the Random Forest algorithm shows the highest accuracy (99.76%) with the least error rate. The ANACONDA Data Science Platform is used to execute all the experiments in a simulated environment.
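A hedged sketch of the comparison described above (accuracy and time consumption for SVM, Logistic Regression, Naïve Bayes, and Random Forest) follows. The paper's exact train/test protocol is not stated, so a simple hold-out split with default hyper-parameters is assumed.

```python
# Accuracy and wall-clock fit/score time for the four compared classifiers.
import time
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
for name, clf in {"SVM": SVC(), "LR": LogisticRegression(max_iter=5000),
                  "NB": GaussianNB(), "RF": RandomForestClassifier()}.items():
    start = time.perf_counter()
    acc = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: accuracy={acc:.4f}, time={time.perf_counter() - start:.3f}s")
```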


Data mining usually refers to the discovery of specific patterns or the analysis of data from a large dataset. Classification is an efficient data mining technique in which the classes into which data are classified are predefined using existing datasets. Classifying medical records in terms of their symptoms using computerized methods, and storing the predicted information in digital format, is of great importance in the diagnosis of various diseases in the medical field. This paper concentrates on finding the algorithm with the highest accuracy so that a cost-effective algorithm can be identified. The data mining classification algorithms are compared on their accuracy in finding the correct data according to the diagnosis report and on their execution rate, to identify how fast the records are classified. The classification algorithms used in this study are the Naive Bayes classifier, the C4.5 tree classifier, and K-Nearest Neighbor (KNN), with the aim of predicting which algorithm is best suited for classifying any kind of medical dataset. Datasets such as Breast Cancer, Iris, and Hypothyroid are used to determine which of the three algorithms classifies them with the highest accuracy in finding the records of patients with the particular health problems. The experimental results, presented in the form of tables and graphs, show the performance and the importance of the Naïve Bayes, C4.5, and K-Nearest Neighbor algorithms. From the performance outcomes of the three algorithms, C4.5 performs considerably better than Naïve Bayes and K-Nearest Neighbor.
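A rough sketch of the multi-dataset comparison described above is shown below, using the built-in Breast Cancer and Iris data (the Hypothyroid set is not bundled with scikit-learn and would need to be loaded separately). DecisionTreeClassifier with the entropy criterion stands in for C4.5, and the timing is only an illustration of measuring execution rate, not the paper's protocol.

```python
# Cross-validated accuracy and execution time for three classifiers on two datasets.
import time
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

datasets = {"Breast Cancer": load_breast_cancer(return_X_y=True),
            "Iris": load_iris(return_X_y=True)}
models = {"Naive Bayes": GaussianNB(),
          "C4.5-like tree": DecisionTreeClassifier(criterion="entropy"),
          "KNN": KNeighborsClassifier()}
for dname, (X, y) in datasets.items():
    for mname, model in models.items():
        start = time.perf_counter()
        acc = cross_val_score(model, X, y, cv=10).mean()
        print(f"{dname:13s} | {mname:14s} | acc={acc:.3f} "
              f"| time={time.perf_counter() - start:.2f}s")
```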

