Dimensionality Reduction using PCA and K-Means Clustering for Breast Cancer Prediction

Breast cancer is one of the leading causes of death among women. Predicting breast cancer at an early stage improves the chance of a cure. This requires a prediction tool that can classify a breast tumor as either malignant (harmful) or benign (harmless). In this paper, two machine learning algorithms, Support Vector Machine and Extreme Gradient Boosting, are compared for classification. Before classification, the number of attributes in the raw data is reduced by extracting features with Principal Component Analysis. A clustering method, K-Means, is also used for dimensionality reduction alongside Principal Component Analysis. The paper compares four models, formed by combining the two dimensionality reduction methods with the two classifiers, on the Wisconsin Breast Cancer Dataset. The comparison uses accuracy, sensitivity, and specificity computed from the confusion matrices. The experimental results indicate that K-Means, which is not usually used for dimensionality reduction, can perform well compared to the popular Principal Component Analysis.
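The three evaluation metrics named in this abstract can all be read off a binary confusion matrix. The following is a minimal sketch in plain Python, not the authors' code: the label encoding (1 = malignant, 0 = benign), the function names, and the example predictions are all assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a binary problem."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity from the confusion counts.

    Assumes both classes occur in y_true; otherwise a denominator is zero.
    """
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),  # share of malignant cases detected
        "specificity": tn / (tn + fp),  # share of benign cases cleared
    }


# Hypothetical labels and predictions (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Reporting sensitivity and specificity next to accuracy matters for a screening task like this one, because a missed malignant tumor and a false alarm on a benign one carry very different costs.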

Physical-oriented and machine learning-based emission modeling in a diesel compression ignition engine: Dimensionality reduction and regression

International Journal of Engine Research, 2022, pp. 146808742110707. DOI: 10.1177/14680874211070736

Author(s): Aran Mohammad, Reza Rezaei, Christopher Hayduk, Thaddaeus Delebinski, Saeid Shahpouri, ...

Keyword(s): Principal Component Analysis, Support Vector Machine, Factor Analysis, Dimensionality Reduction, Principal Component, Component Analysis, Data Driven, Support Vector, Emission Models, Emission Modeling

The development of internal combustion engines is shaped by exhaust gas emissions legislation and the drive to increase performance. This calls for engine-out emission models that can be used to optimize engines for real driving emission controls. The predictive capability of physical and data-driven engine-out emission models depends on the system inputs, which are specified by the user, and accuracy can improve as the number of inputs increases. At the same time, irrelevant inputs become more likely; these have a weak functional relation to the emissions and can lead to overfitting. Data-driven methods can be used to detect such irrelevant and redundant inputs. In this work, thermodynamic states are modeled based on 772 stationary test bench measurements from a commercial vehicle diesel engine. Afterward, 37 measured and modeled variables are fed into a data-driven dimensionality reduction. For this purpose, supervised learning approaches, such as lasso regression and the linear support vector machine, and unsupervised learning methods, such as principal component analysis and factor analysis, are applied to select and extract the relevant features. The selected and extracted features are used for regression by the support vector machine and the feedforward neural network to model the NOx, CO, HC, and soot emissions. This makes it possible to evaluate the modeling accuracy that results from the dimensionality reduction. Using the methods in this work, the 37 variables are reduced to 25, 22, 11, and 16 inputs for NOx, CO, HC, and soot emission modeling, respectively, while maintaining accuracy. The features selected by the lasso algorithm allow the regression models to learn more accurately than the features extracted through principal component analysis and factor analysis. The resulting test errors RMSETe for modeling NOx, CO, HC, and soot emissions are 19.22 ppm, 6.46 ppm, 1.29 ppm, and 0.06 FSN, respectively.
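To illustrate how lasso regression can drop inputs with a weak functional relation to the target, here is a minimal sketch of lasso fitted by cyclic coordinate descent in plain Python. It is not the authors' implementation: the solver, the penalty `alpha`, the function names, and the tiny synthetic dataset are assumptions, and the inputs are assumed to be centered so that no intercept is needed.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw for the initial w = 0
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero column carries no information
            # Correlation of input j with the residual that ignores input j.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_w
    return w


# Hypothetical data: inputs 0 and 1 drive the target, input 2 barely matters.
a = [1, 1, 1, 1, -1, -1, -1, -1]
b = [1, 1, -1, -1, 1, 1, -1, -1]
c = [1, -1, 1, -1, 1, -1, 1, -1]
X = [[float(ai), float(bi), float(ci)] for ai, bi, ci in zip(a, b, c)]
y = [3.0 * ai - 2.0 * bi + 0.05 * ci for ai, bi, ci in zip(a, b, c)]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, w in enumerate(weights) if w != 0.0]
```

In this sketch the weight of the third input is shrunk to exactly zero, so only the first two inputs are selected. Selecting features this way keeps the original, physically interpretable variables, whereas principal component analysis and factor analysis replace them with linear combinations.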

Raman Microspectral Study and Classification of the Pathological Evolution of Breast Cancer Using Both Principal Component Analysis-Linear Discriminant Analysis and Principal Component Analysis-Support Vector Machine

Journal of Spectroscopy, 2021, Vol 2021, pp. 1-11. DOI: 10.1155/2021/5572782

Author(s): Heping Li, Yu Ren, Fan Yu, Dongliang Song, Lizhe Zhu, ...

Keyword(s): Breast Cancer, Principal Component Analysis, Support Vector Machine, Discriminant Analysis, Linear Discriminant Analysis, Ductal Carcinoma, Principal Component, Component Analysis, Support Vector, Linear Discriminant

To improve the reliability of Raman-based tumor detection and analytical methodologies, an ex vivo Raman spectral investigation was conducted to identify distinct compositional information of healthy (H), ductal carcinoma in situ (DCIS), and invasive ductal carcinoma (IDC) tissue. Principal component analysis-linear discriminant analysis (PCA-LDA) and principal component analysis-support vector machine (PCA-SVM) models were then constructed to distinguish spectral features among the tissue groups. Spectral analysis highlighted differences in the levels of unsaturated and saturated lipids, carotenoids, protein, and nucleic acid between healthy and cancerous tissue, and variations in the levels of nucleic acid, protein, and phenylalanine between DCIS and IDC. Both classification models were found to be extremely efficient at discriminating tissue pathological types, with 99% accuracy for PCA-LDA and 100%, 100%, and 96.7% for PCA-SVM with linear, polynomial, and radial basis function (RBF) kernels, respectively, while the PCA-SVM algorithm greatly simplified the calculation without sacrificing performance. The present study demonstrates that Raman spectroscopy combined with multivariate analysis has considerable potential for improving the efficiency and performance of breast cancer diagnosis.

Data Mining and Principal Component Analysis on Coimbra Breast Cancer Dataset

Proceedings of Intelligent Computing and Technologies Conference, 2021. DOI: 10.21467/proceedings.115.5

Author(s): Anupam Sen

Keyword(s): Breast Cancer, Machine Learning, Principal Component Analysis, Principal Component, Component Analysis, Machine Learning Algorithms, Machine Learning Techniques, Breast Cancer Dataset, Analysis Tool, Machine Learning Classification

Machine Learning (ML) techniques play an important role in the medical field. Early diagnosis is required to improve the treatment of carcinoma. In this analysis, the Breast Cancer Coimbra dataset (BCCD), with ten predictors, is analyzed to classify carcinoma. Feature selection methods and machine learning algorithms are applied to the dataset, which comes from the UCI repository. The WEKA (“Waikato Environment for Knowledge Analysis”) tool is used for the machine learning techniques, and Principal Component Analysis (PCA) is used for feature extraction. Different machine learning classification algorithms are applied through WEKA, such as Glmnet, Gbm, ada Boosting, Adabag Boosting, C50, Cforest, DcSVM, fnn, Ksvm, and Node Harvest. They are compared on accuracy and on values such as the Kappa statistic, Mean Absolute Error (MAE), and Root Mean Square Error (RMSE). The 10-fold cross-validation method is used for training, testing, and validation.
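The evaluation protocol described here (10-fold cross-validation scored with the Kappa statistic, MAE, and RMSE) can be sketched in plain Python as follows. This is an illustration rather than WEKA's implementation: the contiguous, unshuffled fold layout, the function names, the sample count of 25, and the example labels are all assumptions.

```python
import math
from collections import Counter


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds, without shuffling."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical sample count, used only to show the fold layout.
folds = list(k_fold_indices(25, k=10))

# Hypothetical labels and predictions for one evaluation.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
kappa = cohen_kappa(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_square_error(y_true, y_pred)
```

Each sample lands in exactly one test fold, so every observation is used for testing once and for training in the other nine rounds. Kappa is a useful complement to accuracy on small datasets because it discounts the agreement a classifier would reach by chance.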

Comparative analysis and prediction of nucleosome positioning using integrative feature representation and machine learning algorithms

BMC Bioinformatics, 2021, Vol 22 (S6). DOI: 10.1186/s12859-021-04006-w

Author(s): Guo-Sheng Han, Qi Li, Ying Li

Keyword(s): Principal Component Analysis, Comparative Analysis, DNA Sequence, Principal Component, Nucleosome Positioning, Component Analysis, Feature Representation, Support Vector, Prediction Quality, Extreme Gradient Boosting

Background: Nucleosomes play an important role in genome expression, DNA replication, DNA repair, and transcription. Research on nucleosome positioning has therefore received extensive attention. Considering the diversity of DNA sequence representation methods, we integrated multiple features to analyze their effect on nucleosome positioning analysis. This process can also deepen our understanding of the theoretical analysis of nucleosome positioning.

Results: We used frequency chaos game representation (FCGR) to construct DNA sequence features, integrated it with other features, and applied the principal component analysis (PCA) algorithm. Support vector machine (SVM), extreme learning machine (ELM), extreme gradient boosting (XGBoost), multilayer perceptron (MLP), and convolutional neural networks (CNN) were each used as predictors for nucleosome positioning. The prediction quality of the integrated feature vector is significantly superior to that of a single feature. After PCA was used to reduce the feature dimension, the prediction quality on the H. sapiens dataset improved significantly.

Conclusions: Comparative analysis and prediction on the H. sapiens, C. elegans, D. melanogaster, and S. cerevisiae datasets demonstrate that applying FCGR to nucleosome positioning is feasible, and that integrative feature representation performs better.
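As a rough illustration of frequency chaos game representation, the sketch below counts the k-mers of a DNA sequence on a 2^k x 2^k grid and flattens the normalized counts into a feature vector. It is not the authors' pipeline: the assignment of nucleotides to corners varies between publications and is an assumption here, as are the function names and the example sequence.

```python
# Corner of the unit square assigned to each nucleotide (assumed convention).
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Map a k-mer to (row, col) on a 2^k x 2^k grid.

    Later nucleotides set the higher-order bits, mirroring the chaos game,
    where the most recent move decides the coarsest quadrant.
    """
    row = col = 0
    for i, base in enumerate(kmer):
        bit_x, bit_y = CORNER[base]
        col += bit_x << i
        row += bit_y << i
    return row, col


def fcgr(sequence, k):
    """Return the flattened grid of k-mer frequencies (length 4^k)."""
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    total = 0
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if any(base not in CORNER for base in kmer):
            continue  # skip k-mers with ambiguous symbols such as N
        row, col = kmer_to_cell(kmer)
        grid[row][col] += 1
        total += 1
    if total == 0:
        return [0.0] * (size * size)
    return [count / total for grid_row in grid for count in grid_row]


# Hypothetical sequence: nine overlapping 2-mers, three of them "AC".
features = fcgr("ACGTACGTAC", k=2)
```

Because the vector has 4^k entries, its dimension grows quickly with k. This is one reason a reduction step such as PCA is a natural companion to FCGR features.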

Auxiliary Diagnosis of Breast Cancer Based on Kernel Principal Component Analysis Support Vector Machine

Hans Journal of Data Mining, 2018, Vol 08 (03), pp. 89-95. DOI: 10.12677/hjdm.2018.83010

Author(s): 珂珂邓

Keyword(s): Breast Cancer, Principal Component Analysis, Support Vector Machine, Principal Component, Component Analysis, Kernel Principal Component Analysis, Support Vector

Breast Cancer Recognition by Support Vector Machine Combined with Daubechies Wavelet Transform and Principal Component Analysis

Proceedings of the International Conference on ISMAC in Computational Vision and Bio-Engineering 2018 (ISMAC-CVB) - Lecture Notes in Computational Vision and Biomechanics, 2019, pp. 1921-1930. DOI: 10.1007/978-3-030-00665-5_177. Cited By ~ 1

Author(s): Fangyuan Liu, Mackenzie Brown

Keyword(s): Breast Cancer, Principal Component Analysis, Support Vector Machine, Wavelet Transform, Principal Component, Component Analysis, Support Vector, Daubechies Wavelet

Retail Market Segmentation and Forecasting Using XGBoost and Principal Component Analysis

METHOMIKA: Jurnal Manajemen Informatika dan Komputerisasi Akuntansi, 2021, Vol 5 (1), pp. 42-47. DOI: 10.46880/jmika.vol5no1.pp42-47

Author(s): Rimbun Siringoringo, Resianta Perangin-angin, Mufria J. Purba

Keyword(s): Principal Component Analysis, Confusion Matrix, Principal Component, ROC Curves, Component Analysis, Gradient Boosting, Retail Sector, Online Retail, Goods And Services, Extreme Gradient Boosting

The growth of the online retail market in Indonesia is an excellent business opportunity. This growth is predicted to continue as internet penetration increases. With greater exposure to brands, products, and offerings, consumers become smarter and wiser in their purchasing decisions. Offering goods and services that match the tastes and behavior of consumers is very important for business continuity. The models developed so far fall into two major groups: time series approaches and machine learning. In this study, segmentation and forecasting of online retail sales were carried out using extreme gradient boosting (XGBoost). The data is an online retail dataset obtained from the UCI repository. The k-means clustering (KMC) method is applied to determine the target, or data class. Principal component analysis (PCA) is applied to reduce the data dimensions by eliminating irrelevant features. Model evaluation is based on a confusion matrix and the macro-average ROC curve. The results show that XGBoost classifies the retail data well, as reflected in the confusion matrix metrics and ROC curves.
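The step in which k-means clustering supplies the target class can be sketched as follows: cluster the unlabeled records, then treat each record's cluster index as its class label for a downstream classifier. This is a minimal plain-Python version of Lloyd's algorithm, not the authors' code. The initialization from the first k points, the function names, and the two-feature example records are assumptions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, n_iter=100):
    """Lloyd's algorithm; returns (labels, centroids).

    Centroids start at the first k points, a naive deterministic choice
    made here only to keep the sketch reproducible.
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical customer records, e.g. (purchase frequency, basket size).
records = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
segment_labels, segment_centres = k_means(records, k=2)
```

The resulting `segment_labels` would then serve as the class column when training a supervised model. Because the labels come from the clustering itself, the classifier's measured accuracy reflects how well it reproduces the segments, not an independent ground truth.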

Comparative Analysis of Machine Learning Techniques with Principal Component Analysis on Kidney and Heart Disease

Turkish Journal of Computer and Mathematics Education (TURCOMAT), 2021, Vol 12 (2), pp. 1564-1572. DOI: 10.17762/turcomat.v12i2.1433

Author(s): Reena Chandra et al.

Keyword(s): Machine Learning, Chronic Kidney Disease, Principal Component Analysis, Logistic Regression, Heart Disease, Kidney Disease, Dimensionality Reduction, Principal Component, Component Analysis, Support Vector

Detecting disease at an early stage is especially challenging. Datasets for different diseases are available online, each with a different number of features. Many dimensionality reduction and feature extraction techniques are now used to reduce the number of features in a dataset and to find the most appropriate ones. This paper explores how the performance of different machine learning models changes when Principal Component Analysis is used for dimensionality reduction on datasets for chronic kidney disease and cardiovascular disease. The authors apply Logistic Regression, K Nearest Neighbour, Naïve Bayes, Support Vector Machine, and Random Forest models to the datasets and compare the performance of each model with and without PCA. A key challenge in data mining and machine learning is building accurate and computationally efficient classifiers for medical applications. With an accuracy of 100% for chronic kidney disease and 85% for heart disease, the KNN classifier and logistic regression proved to be the best prediction methods for kidney and heart disease, respectively.

A data-driven principal component analysis-support vector machine approach for breast cancer diagnosis: Comparison and application

Transactions of the Institute of Measurement and Control, 2019, Vol 42 (7), pp. 1301-1312. DOI: 10.1177/0142331219889221

Author(s): Wen Wu, Shah Faisal

Keyword(s): Breast Cancer, Principal Component Analysis, Support Vector Machine, Principal Component, Component Analysis, Data Driven, Recursive Feature Elimination, Support Vector, Cancer Data, Mechanism Model

In recent years, with the development of artificial intelligence, data-driven methodologies have been widely studied in fault diagnosis and detection, because the growing complexity of modern systems makes mechanism model information difficult to obtain. In health monitoring especially, a mechanism model is very difficult to derive. Existing challenges, such as large amounts of data, high data dimension, and strong noise interference, make data-driven approaches more suitable. To address these problems, we present a principal component analysis-support vector machine (PCA-SVM) method with different kernels to reduce the data dimension, and two sets of breast cancer data are used to verify the method. Additionally, support vector machine-recursive feature elimination (SVM-RFE), the original SVM with different kernels, PCA, and modified PCA (MPCA) methods are applied to diagnose malignant cancer for comparison with PCA-SVM. In experiments on the two breast cancer datasets obtained from the University of Wisconsin Hospital, PCA-SVM with a radial basis function (RBF) kernel performs better than the other methods. Finally, PCA-SVM in this study uses only six principal components and obtains better accuracy (97.19%) than most previous studies.
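The dimension-reduction half of a PCA-SVM pipeline can be sketched in plain Python as below: center the data, extract the leading principal components, and keep only their scores as classifier inputs. This is an assumption-laden illustration, not the authors' method. It uses power iteration with deflation where a production library would use an SVD, the function names and the synthetic data are hypothetical, and the SVM stage is omitted.

```python
import math


def center_columns(X):
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def covariance_matrix(Xc):
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]


def power_iteration(C, n_iter=500):
    """Leading eigenpair of a symmetric PSD matrix (assumes the arbitrary
    start vector is not orthogonal to the leading eigenvector)."""
    p = len(C)
    v = [float(i + 1) for i in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = [sum(c * x for c, x in zip(row, v)) for row in C]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            break  # remaining variance is numerically zero
        v = [x / norm for x in w]
    Cv = [sum(c * x for c, x in zip(row, v)) for row in C]
    return sum(a * b for a, b in zip(v, Cv)), v


def pca(X, n_components):
    """Return (scores, explained_variance_ratios) for the top components."""
    Xc = center_columns(X)
    C = covariance_matrix(Xc)
    p = len(C)
    total_variance = sum(C[i][i] for i in range(p))
    components, ratios = [], []
    for _ in range(n_components):
        eigenvalue, v = power_iteration(C)
        components.append(v)
        ratios.append(eigenvalue / total_variance)
        # Deflation: remove the variance already explained by v.
        C = [[C[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
             for i in range(p)]
    scores = [[sum(a * b for a, b in zip(row, comp)) for comp in components]
              for row in Xc]
    return scores, ratios


# Hypothetical data: three uncorrelated inputs with variances 72/7, 32/7, 8/7.
a = [1, 1, 1, 1, -1, -1, -1, -1]
b = [1, 1, -1, -1, 1, 1, -1, -1]
c = [1, -1, 1, -1, 1, -1, 1, -1]
X = [[3.0 * ai, 2.0 * bi, 1.0 * ci] for ai, bi, ci in zip(a, b, c)]
scores, ratios = pca(X, n_components=2)
```

Summing the explained-variance ratios shows how much information the retained components carry, which is the usual basis for choosing a component count such as the six reported above. The `scores` rows would then be passed to the classifier in place of the original inputs.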
