A Dataset Centric Feature Selection and Stacked Model to Detect Breast Cancer

The World Health Organisation has declared breast cancer (BC) the most frequent cancer among women, accounting for 15 percent of all cancer deaths. Accurate prediction is of the utmost importance, as it not only prevents deaths but also avoids mistreatment. Conventional diagnosis includes estimating the tumor size as a sign of possible cancer. Machine learning (ML) techniques have proven effective at predicting disease. However, ML methods have been method centric rather than dataset centric. In this paper, the authors introduce a dataset centric approach (DCA) that deploys a genetic algorithm (GA) to identify the features and an ensemble learning classifier to predict using the right features. AdaBoost is one such approach: it trains the model by assigning weights to individual records rather than only experimenting with dataset splits and performing hyper-parameter optimization. The authors simulate the results by varying the base classifiers, i.e., logistic regression (LR), decision tree (DT), support vector machine (SVM), naive Bayes (NB), and random forest (RF), with 10-fold cross-validation and different training and testing splits of the dataset. The proposed DCA model with RF and 10-fold cross-validation demonstrated its potential with almost 100% classification performance, which, according to the authors, no earlier research had reported. The DCA satisfies the underlying principles of data mining: the principle of parsimony, the principle of inclusion, the principle of discrimination, and the principle of optimality. The DCA is a democratic and unbiased ensemble approach: it lets all features and methods compete at the start, then filters out the most reliable chain (of steps and combinations) that gives the highest accuracy. With fewer features and splits of 50-50, 66-34, and 10-fold cross-validation, the Stacked model achieves 97% accuracy.
These values and the reduction in features improve upon prior research. Further, the proposed classifier is compared with several state-of-the-art machine-learning classifiers, namely random forest, naive Bayes, support-vector machine with radial basis function kernel, and decision tree. To test the classifiers, different performance metrics have been employed (accuracy, detection rate, sensitivity, specificity, receiver operating characteristic, and area under the curve), along with statistical tests such as the Wilcoxon signed-rank test and the kappa statistic, to check the strength of the proposed DCA classifier. Various splits of training and testing data, namely 50–50%, 66–34%, 80–20%, and 10-fold cross-validation, have been incorporated in this research to test the credibility of the classification models in handling unbalanced data. Finally, the proposed DCA model demonstrated its potential with almost 100% performance in the classification results. The results have also been compared with other research on the same dataset, where the proposed classifiers were found to be best across all performance dimensions.
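
Several of the metrics named in this abstract (accuracy, sensitivity, specificity, and the kappa statistic) can be derived from a binary confusion matrix. The following is a minimal sketch in plain Python, not the authors' code; the function name `binary_metrics` and the example counts are hypothetical, and it assumes both classes appear at least once so that no denominator is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, sensitivity, specificity and Cohen's kappa.

    tp, fp, fn, tn are the four confusion-matrix counts for a binary
    classifier (illustrative helper, not taken from the paper).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate

    # Agreement expected by chance, from the row and column marginals.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_pos + p_neg
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for name, value in binary_metrics(tp=40, fp=5, fn=10, tn=45).items():
        print(f"{name}: {value:.3f}")
```

Kappa is useful alongside accuracy on unbalanced data because it discounts the agreement a classifier would reach by chance.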

Prediction of Breast Cancer Using Machine Learning

Recent Advances in Computer Science and Communications ◽

10.2174/2213275912666190617160834 ◽

2020 ◽

Vol 13 (5) ◽

pp. 901-908

Author(s):

Somil Jain ◽

Puneet Kumar

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Prediction Accuracy ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

Classification Algorithms ◽

Breast Cancer Dataset

Background: Breast cancer is one of the diseases that cause a large number of deaths every year across the globe; early detection and diagnosis of this disease is a challenging task, but it is needed to reduce the number of deaths. Nowadays, various machine learning and data mining techniques are used for medical diagnosis and have proven their mettle in predicting chronic diseases such as cancer, which can save the lives of patients suffering from them. The major concern of this study is to determine the prediction accuracy of classification algorithms such as Support Vector Machine, J48, Naïve Bayes and Random Forest and to suggest the best algorithm. Objective: The objective of this study is to assess the prediction accuracy of the classification algorithms in terms of efficiency and effectiveness. Methods: This paper provides a detailed analysis of classification algorithms such as Support Vector Machine, J48, Naïve Bayes and Random Forest in terms of their prediction accuracy, applying 10-fold cross-validation to the Wisconsin Diagnostic Breast Cancer dataset using the open-source WEKA tool. Results: Support Vector Machine achieved the highest prediction accuracy, 97.89%, with a low error rate of 0.14%. Conclusion: This paper provides a clear view of the performance of the classification algorithms in terms of their predictive ability, which can help medical practitioners diagnose chronic diseases such as breast cancer effectively.
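
The 10-fold cross-validation procedure mentioned in the Methods can be sketched as follows. This is a plain-Python illustration rather than WEKA's implementation: the helper names `k_fold_indices` and `cross_val_accuracy` are hypothetical, the folds are not stratified, and the `fit` and `predict` callables stand in for whichever classifier is being evaluated.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx


def cross_val_accuracy(fit, predict, X, y, k=10, seed=0):
    """Average test accuracy over k folds.

    fit(X_train, y_train) returns a model; predict(model, x) returns a label.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)
```

Each sample is used for testing exactly once and for training in the other nine folds, so the averaged score depends less on one particular split than a single hold-out estimate does.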

Performance Analysis of Supervised Machine Learning Algorithms on Medical Dataset

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f7908.038620 ◽

2020 ◽

Vol 8 (6) ◽

pp. 1637-1642

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Decision Tree ◽

Naive Bayes ◽

Naïve Bayes ◽

Learning System ◽

Supervised Machine Learning ◽

Support Vector ◽

Heart Problem

Machine learning (ML) algorithms are designed to make predictions based on features. With the help of machine learning, a system can automatically learn and improve from experience. Machine learning is a branch of artificial intelligence. It is broadly categorized into two types: supervised and unsupervised. Supervised ML performs classification and unsupervised ML performs clustering. At present, machine learning is used in various areas, such as biometric recognition, handwriting recognition, and medical diagnosis. In the medical field, machine learning plays an important role in identifying diseases based on a patient's features. Presently, doctors use software applications based on machine learning algorithms to diagnose various conditions, such as cancer, cardiac arrest, and many more. In this paper we used an ensemble learning method to predict heart problems. Our study describes the performance of ML algorithms by comparing various evaluation parameters such as F-measure, recall, ROC, precision, and accuracy. The study was carried out with various combinations of ML classifiers, such as Decision Tree (DT), Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF), to predict heart problems. The results showed that combining two ML algorithms, DT with NB, achieved 81.1% accuracy. In addition, the Support Vector Machine (SVM), Decision Tree, Naïve Bayes, and Random Forest models were also trained and tested individually.
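
Three of the evaluation parameters listed in this abstract (precision, recall, and F-measure) can be computed directly from true and predicted labels. The sketch below is a plain-Python illustration under the assumption of a binary task; the function name `precision_recall_f1` and the sample labels are hypothetical, not taken from the paper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    # Fall back to 0.0 when a ratio is undefined (no predicted or no true positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up labels, for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
    print(precision_recall_f1(y_true, y_pred))
```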

Recognition of gasoline in fire debris using machine learning: Part I, Application of Random Forest, Gradient Boosting, Support Vector Machine and Naïve Bayes

Forensic Science International ◽

10.1016/j.forsciint.2021.111146 ◽

2021 ◽

pp. 111146 ◽

Cited By ~ 1

Author(s):

C. Bogdal ◽

R. Schellenberg ◽

O. Höpli ◽

M. Bovens ◽

M. Lory

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Naive Bayes ◽

Naïve Bayes ◽

Gradient Boosting ◽

Support Vector ◽

Fire Debris ◽

In Fire

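Komparasi Algoritma Naive Bayes, Decision Tree dan Support Vector Machine untuk Prediksi Penyakit Kanker Payudara [Comparison of the Naive Bayes, Decision Tree, and Support Vector Machine Algorithms for Breast Cancer Prediction]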

Jurnal Teknik Komputer ◽

10.31294/jtk.v7i1.9191 ◽

2021 ◽

Vol 7 (1) ◽

pp. 51-54

Author(s):

Lusa Indah Prahartiwi ◽

Wulan Dari

Keyword(s):

Breast Cancer ◽

Data Mining ◽

Support Vector Machine ◽

Decision Tree ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector

Breast cancer is the most common cancer in women worldwide, accounting for 25.4% of all new cases diagnosed in 2018. Cancer is a large group of diseases that can start in almost any organ or tissue of the body when abnormal cells grow uncontrollably, go beyond their usual boundaries to invade adjoining parts of the body, and/or spread to other organs. Breast cancer can be predicted using data mining. Data mining can discover meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technology as well as statistical and mathematical techniques. This study compares the performance of the Naive Bayes, Decision Tree, and Support Vector Machine algorithms for predicting breast cancer. The dataset used is the secondary Breast Cancer Coimbra data taken from the UCI Repository. The results show that the Support Vector Machine algorithm yields the highest accuracy, 74.29%, compared with the Naive Bayes and Decision Tree algorithms.

Classification Breast Cancer Revisited with Machine Learning

International Journal on Data Science ◽

10.18517/ijods.1.1.42-50.2020 ◽

2020 ◽

Vol 1 (1) ◽

pp. 42-50

Author(s):

Hanna Arini Parhusip ◽

Bambang Susanto ◽

Lilik Linawati ◽

Suryasatriya Trihandaru ◽

Yohanes Sardjono ◽

...

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Random Forest ◽

Naive Bayes ◽

Naïve Bayes ◽

Machine Learning Algorithms ◽

Support Vector ◽

Random Forest Algorithm ◽

K Nearest Neighbor ◽

Cancer Data

The article studies several machine learning algorithms applied to breast cancer data with 33 features from 569 samples. The purpose of this research is to find the best algorithm for classifying breast cancer. The features may have very different scales and ranges, so the data are transformed before they are classified. The classification methods used are logistic regression, k-nearest neighbor, naive Bayes classifier, support vector machine, decision tree, and random forest. The original and the transformed data are classified with a test set size of 0.3. The SVM and naive Bayes algorithms show no improvement in accuracy, while random forest gives the best accuracy of all. The test set size is therefore reduced to 0.25, which improves all algorithms on the transformed data. However, random forest still gives the best accuracy.
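
The abstract does not say which transformation was applied, so the sketch below assumes z-score standardization as one common way to put features with different ranges on a comparable scale, combined with a hold-out split at a chosen test size. All names (`train_test_split`, `fit_scaler`, `transform`) are hypothetical plain-Python helpers, not the authors' code.

```python
import random
from statistics import mean, pstdev


def train_test_split(rows, labels, test_size=0.3, seed=0):
    """Shuffle and split rows/labels into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


def fit_scaler(train_rows):
    """Compute per-feature mean and standard deviation on training data only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; use 1.0 to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply z-score standardization using previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
```

Fitting the scaler on the training part and reusing its statistics on the test part keeps information from the test samples out of the preprocessing step.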

Weka, áreas de aplicación y sus algoritmos: una revisión sistemática de literatura [Weka, Its Application Areas and Algorithms: A Systematic Literature Review]

Revista Científica ECOCIENCIA ◽

10.21855/ecociencia.50.153 ◽

2018 ◽

Vol 5 ◽

pp. 1-26

Author(s):

Marcos Antonio Espinoza Mina

Keyword(s):

Support Vector Machine ◽

Random Forest ◽

Decision Tree ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

Sequential Minimal Optimization ◽

Revisión Sistemática ◽

J48 Decision Tree

Large amounts of data are currently generated and stored on digital devices whose storage capacity also grows every day thanks to technological advances. Much of this data is not adequately structured, which makes it difficult to exploit. Given these realities and difficulties, it is necessary to use automated techniques that make it possible to reduce, analyze, and use data efficiently. A variety of methodologies, programs, and add-ons have been developed and applied specifically to the dataset at hand; notable among them are software tools that implement machine learning and data mining techniques, one of which is Weka. This work presents a systematic literature review whose objective was to identify the fields or areas of application of Weka and the most widely used algorithms of this program. The results show that Weka is being used in fields such as computer science, medicine, education, and agriculture. The most widely used algorithms, depending on the type of data and the purpose of the evaluation, are Naïve Bayes, J48, Decision Tree, Random Forest, Support Vector Machine (SVM), and Sequential Minimal Optimization (SMO).

KLASIFIKASI SMS SPAM MENGGUNAKAN SUPPORT VECTOR MACHINE [SMS Spam Classification Using Support Vector Machine]

Jurnal Pilar Nusa Mandiri ◽

10.33480/pilar.v15i2.693 ◽

2019 ◽

Vol 15 (2) ◽

pp. 275-280

Author(s):

Agus Setiyono ◽

Hilman F Pardede

Keyword(s):

Data Mining ◽

Support Vector Machine ◽

Decision Tree ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

Spam Detection ◽

Support Vector Machine Algorithm ◽

Data Mining Techniques ◽

To Receive

It is now common for a cellphone to receive spam messages. The large number of received messages makes it difficult for humans to classify them as spam or not spam. One way to overcome this problem is to use data mining for automatic classification. In this paper, we investigate various data mining techniques, namely Support Vector Machine, Multinomial Naïve Bayes, and Decision Tree, for automatic spam detection. Our experimental results show that Support Vector Machine is the best of the three evaluated algorithms. Support Vector Machine achieves 98.33% accuracy, while Multinomial Naïve Bayes achieves 98.13% and Decision Tree 97.10%.
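
Of the three techniques compared here, Multinomial Naïve Bayes is compact enough to illustrate in a few lines. The following is a minimal plain-Python sketch with Laplace smoothing, not the authors' implementation; the function names, the whitespace tokenizer, and the toy messages are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(messages, labels, alpha=1.0):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(messages, labels):
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict_multinomial_nb(model, text):
    """Return the class with the highest log posterior for `text`."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + alpha * vocab_size
        score = math.log(n_docs / total_docs)  # log prior
        for token in text.lower().split():
            if token in model["vocab"]:  # skip words never seen in training
                score += math.log((counts[token] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Toy training data, for illustration only.
    msgs = ["win free prize now", "free cash win", "meeting at noon", "see you at lunch"]
    tags = ["spam", "spam", "ham", "ham"]
    nb = train_multinomial_nb(msgs, tags)
    print(predict_multinomial_nb(nb, "free prize"))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.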

Predicting Student’s Performance Using Machine Learning Algorithm

International Journal of Advanced Research in Science, Communication and Technology ◽

10.48175/ijarsct-1209 ◽

2021 ◽

pp. 53-58

Author(s):

Sheela Rani P ◽

Dhivya S ◽

Dharshini Priya M ◽

Dharmila Chowdary A

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Prediction Model ◽

Naive Bayes ◽

Learning Algorithm ◽

Naïve Bayes ◽

Machine Learning Algorithms ◽

Support Vector ◽

Learning Approaches ◽

K Nearest Neighbors

Machine learning is a new analysis discipline that uses knowledge to improve learning, optimizing the training method and developing the environment within which learning happens. There are two sorts of machine learning approaches, supervised and unsupervised, that are used to extract the knowledge that helps decision-makers take the correct intervention in the future. This paper introduces a prediction model for the factors that influence students' academic performance, using supervised machine learning algorithms such as support vector machine, KNN (k-nearest neighbors), Naïve Bayes, and logistic regression. The results of the various algorithms are compared, and it is shown that the support vector machine and Naïve Bayes perform well, achieving improved accuracy compared with the other algorithms. The final prediction model in this paper may have fairly high prediction accuracy. The objective is not just to predict students' future performance but also to provide the best technique for finding the most impactful features that influence students while studying.

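Tinjauan Algoritma RoI (Region of Interest) Dengan Metode Pengambangan Otsu Dan Klasterisasi K-Mean; Hasil Dan Tantangannya [A Review of RoI (Region of Interest) Algorithms Using Otsu Thresholding and K-Means Clustering: Results and Challenges]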

Informatik : Jurnal Ilmu Komputer ◽

10.52958/iftk.v16i2.1961 ◽

2020 ◽

Vol 16 (2) ◽

pp. 75

Author(s):

Didit Widiyanto

Keyword(s):

Support Vector Machine ◽

Decision Tree ◽

Naive Bayes ◽

Region Of Interest ◽

Naïve Bayes ◽

Support Vector ◽

Gray Level

The accuracy of an image classification is determined by the classifier. Although the RoI (Region of Interest) does not directly determine accuracy, it determines the scope of the image classification. Three algorithms can be used as RoI algorithms: Balanced Histogram Thresholding (BHT), the Otsu algorithm, and the K-Means clustering algorithm. This paper reviews the Otsu algorithm and the K-Means clustering algorithm as used by five researchers. Of the five researchers, three applied the Otsu algorithm and two applied the K-Means algorithm as the RoI algorithm. After the RoI operation, all five researchers applied the GLCM (Gray Level Co-occurrence Matrix) algorithm as the texture feature extractor. The extracted features were classified using various classifiers, including SVM (Support Vector Machine), Naive Bayes, and Decision Tree. Finally, comparing the results of the five researchers, the highest accuracy obtained was 100%, with an SVM classifier using the Otsu algorithm as the RoI algorithm, and the lowest accuracy was 52%, using the Otsu algorithm on the S channel of the HSV (Hue, Saturation, Value) image.
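
The Otsu algorithm named in this abstract chooses the gray-level threshold that maximizes the between-class variance of background and foreground pixels. The sketch below is a plain-Python illustration on a flat list of integer gray levels, not the code of any reviewed study; the function name `otsu_threshold` and the synthetic pixel values are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t are treated as background, the rest as foreground.
    `pixels` is a flat iterable of integers in the range [0, levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(levels):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already background
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


if __name__ == "__main__":
    # Synthetic two-mode image, for illustration only.
    image = [10] * 50 + [200] * 50
    t = otsu_threshold(image)
    roi_mask = [p > t for p in image]  # True marks candidate RoI pixels
    print(t, sum(roi_mask))
```

The resulting mask is what would then be passed on to a texture feature extractor such as GLCM.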

Preliminary Screening of COVID-19 Infection Employing Machine Learning Techniques From Simple Blood Profile

International Journal of Quantitative Structure-Property Relationships ◽

10.4018/ijqspr.2021070103 ◽

2021 ◽

Vol 6 (3) ◽

pp. 35-47

Author(s):

Anirudh Reddy Cingireddy ◽

Robin Ghosh ◽

Supratik Kar ◽

Venkata Melapu ◽

Sravanthi Joginipeli ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Naive Bayes ◽

Albert Einstein ◽

Naïve Bayes ◽

Machine Learning Techniques ◽

Support Vector ◽

Blood Profile ◽

Molecular Tests ◽

Large Populations

Frequent testing of the entire population would help to identify individuals with active COVID-19 and allow us to identify concealed carriers. Molecular tests, antigen tests, and antibody tests are widely used to confirm COVID-19 in the population. Molecular tests such as the real-time reverse transcription-polymerase chain reaction (rRT-PCR) test take from a minimum of 3 hours to a maximum of 4 days to return results. To overcome this issue, the authors suggest using machine learning and data mining tools to filter large populations at a preliminary level. The ML tools could reduce the testing population size by 20 to 30%. In this study, they used a subset of features from full blood profiles drawn from patients at the Israelita Albert Einstein hospital in Brazil. They used classification models, namely KNN, logistic regression, XGBoost, naive Bayes, decision tree, random forest, support vector machine, and multilayer perceptron, with k-fold cross-validation to validate the models. Naïve Bayes, KNN, and random forest stand out as the most predictive, with 88% accuracy each.
