scholarly journals Machine Learning With K-Means Dimensional Reduction for Predicting Survival Outcomes in Patients With Breast Cancer

2018 ◽  
Vol 17 ◽  
pp. 117693511881021 ◽  
Author(s):  
Melissa Zhao ◽  
Yushi Tang ◽  
Hyunkyung Kim ◽  
Kohei Hasegawa

Objective: Despite existing prognostic markers, breast cancer prognosis remains a difficult subject due to the complex relationships between many contributing factors and survival. This study seeks to integrate multiple clinicopathological and genomic factors with dimensional reduction across machine learning algorithms to compare survival predictions. Methods: This is a secondary analysis of the data from a prospective cohort study of female patients with breast cancer enrolled in the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). We constructed a series of predictive models: ensemble models (Gradient Boosting and Random Forest), support vector machine (SVM), and artificial neural networks (ANN) for 5-year survival based on clinicopathological and gene expression data after K-means clustering with K-nearest-neighbor (KNN) classification. Model performance was evaluated by receiver operating characteristic (ROC) curve, accuracy, and calibration slope (CS). Model stability was assessed over 10 random runs in terms of ROC, accuracy, CS, and variable importance. Results: The analytic cohort is composed of 1874 patients with breast cancer. Overall, the median age was 62 years; the 5-year survival rate was 75%. ROC and accuracy were not significantly different between models (ROC and accuracy around 0.67 and 0.72 across models, respectively). However, ensemble methods resulted in better fit (CS) with stable measures of variable importance across 10 random training/validation splits. K-means clustering of gene expression profiles on training data points along with KNN classification of validation data points was a robust method of dimensional reduction. Furthermore, the gene expression cluster with the highest mortality risk was an influential factor in model prediction. Conclusions: Using machine learning methods to construct predictive models for 5-year survival in patients with breast cancer, we demonstrated discrimination ability across models with new insight into the stability and utility of dimensional reduction on genomic features in breast cancer survival prediction.

2021 ◽  
Vol 11 (2) ◽  
pp. 61
Author(s):  
Jiande Wu ◽  
Chindo Hicks

Background: Breast cancer is a heterogeneous disease defined by molecular types and subtypes. Advances in genomic research have enabled use of precision medicine in clinical management of breast cancer. A critical unmet medical need is distinguishing triple negative breast cancer, the most aggressive and lethal form of breast cancer, from non-triple negative breast cancer. Here we propose use of a machine learning (ML) approach for classification of triple negative breast cancer and non-triple negative breast cancer patients using gene expression data. Methods: We performed analysis of RNA-Sequence data from 110 triple negative and 992 non-triple negative breast cancer tumor samples from The Cancer Genome Atlas to select the features (genes) used in the development and validation of the classification models. We evaluated four different classification models including Support Vector Machines, K-nearest neighbor, Naïve Bayes and Decision tree using features selected at different threshold levels to train the models for classifying the two types of breast cancer. For performance evaluation and validation, the proposed methods were applied to independent gene expression datasets. Results: Among the four ML algorithms evaluated, the Support Vector Machine algorithm was able to classify breast cancer more accurately into triple negative and non-triple negative breast cancer and had less misclassification errors than the other three algorithms evaluated. Conclusions: The prediction results show that ML algorithms are efficient and can be used for classification of breast cancer into triple negative and non-triple negative breast cancer types.


2021 ◽  
pp. 1063293X2199180
Author(s):  
Babymol Kurian ◽  
VL Jyothi

A wide reach on cancer prediction and detection using Next Generation Sequencing (NGS) by the application of artificial intelligence is highly appreciated in the current scenario of the medical field. Next generation sequences were extracted from NCBI (National Centre for Biotechnology Information) gene repository. Sequences of normal Homo sapiens (Class 1), BRCA1 (Class 2) and BRCA2 (Class 3) were extracted for Machine Learning (ML) purpose. The total volume of datasets extracted for the process were 1580 in number under four categories of 50, 100, 150 and 200 sequences. The breast cancer prediction process was carried out in three major steps such as feature extraction, machine learning classification and performance evaluation. The features were extracted with sequences as input. Ten features of DNA sequences such as ORF (Open Reading Frame) count, individual nucleobase average count of A, T, C, G, AT and GC-content, AT/GC composition, G-quadruplex occurrence, MR (Mutation Rate) were extracted from three types of sequences for the classification process. The sequence type was also included as a target variable to the feature set with values 0, 1 and 2 for classes 1, 2 and 3 respectively. Nine various supervised machine learning techniques like LR (Logistic Regression statistical model), LDA (Linear Discriminant analysis model), k-NN (k nearest neighbours’ algorithm), DT (Decision tree technique), NB (Naive Bayes classifier), SVM (Support-Vector Machine algorithm), RF (Random Forest learning algorithm), AdaBoost (AB) and Gradient Boosting (GB) were employed on four various categories of datasets. Of all supervised models, decision tree machine learning technique performed most with maximum accuracy in classification of 94.03%. Classification model performance was evaluated using precision, recall, F1-score and support values wherein F1-score was most similar to the classification accuracy.


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 2124 ◽  
Author(s):  
Iman Rezaeian ◽  
Eliseos J. Mucaki ◽  
Katherina Baranova ◽  
Huy Q. Pham ◽  
Dimo Angelov ◽  
...  

Genomic aberrations and gene expression-defined subtypes in the large METABRIC patient cohort have been used to stratify and predict survival. The present study used normalized gene expression signatures of paclitaxel drug response to predict outcome for different survival times in METABRIC patients receiving hormone (HT) and, in some cases, chemotherapy (CT) agents. This machine learning method, which distinguishes sensitivity vs. resistance in breast cancer cell lines and validates predictions in patients, was also used to derive gene signatures of other HT  (tamoxifen) and CT agents (methotrexate, epirubicin, doxorubicin, and 5-fluorouracil) used in METABRIC. Paclitaxel gene signatures exhibited the best performance, however the other agents also predicted survival with acceptable accuracies. A support vector machine (SVM) model of paclitaxel response containing the ABCB1, ABCB11, ABCC1, ABCC10, BAD, BBC3, BCL2, BCL2L1, BMF, CYP2C8, CYP3A4, MAP2, MAP4, MAPT, NR1I2, SLCO1B3, TUBB1, TUBB4A, TUBB4B genes was 78.6% accurate in 84 patients treated with both HT and CT (median survival ≥ 4.4 yr). Accuracy was lower (73.4%) in 304 untreated patients. The performance of other machine learning approaches were also evaluated at different survival thresholds. Minimum redundancy maximum relevance feature selection of a paclitaxel-based SVM classifier based on expression of ABCB11, ABCC1, BAD, BBC3 and BCL2L1 was 79% accurate in 53 CT patients. A random forest (RF) classifier produced a gene signature (ABCB11, ABCC1, BAD, BCL2, CYP2C8, CYP3A4, MAP4, MAPT, NR1I2, TUBB1, GBP1, OPRK1) that predicted >3 year survival with 82.4% accuracy in 420 HT patients. A similar RF gene signature showed 79.6% accuracy in 504 patients treated with CT and/or HT. These results suggest that tumor gene expression signatures refined by machine learning techniques can be useful for predicting survival after drug therapies.


F1000Research ◽  
2017 ◽  
Vol 5 ◽  
pp. 2124 ◽  
Author(s):  
Iman Rezaeian ◽  
Eliseos J. Mucaki ◽  
Katherina Baranova ◽  
Huy Q. Pham ◽  
Dimo Angelov ◽  
...  

Genomic aberrations and gene expression-defined subtypes in the large METABRIC patient cohort have been used to stratify and predict survival. The present study used normalized gene expression signatures of paclitaxel drug response to predict outcome for different survival times in METABRIC patients receiving hormone (HT) and, in some cases, chemotherapy (CT) agents. This machine learning method, which distinguishes sensitivity vs. resistance in breast cancer cell lines and validates predictions in patients; was also used to derive gene signatures of other HT  (tamoxifen) and CT agents (methotrexate, epirubicin, doxorubicin, and 5-fluorouracil) used in METABRIC. Paclitaxel gene signatures exhibited the best performance, however the other agents also predicted survival with acceptable accuracies. A support vector machine (SVM) model of paclitaxel response containing genes ABCB1, ABCB11, ABCC1, ABCC10, BAD, BBC3, BCL2, BCL2L1, BMF, CYP2C8, CYP3A4, MAP2, MAP4, MAPT, NR1I2, SLCO1B3, TUBB1, TUBB4A, and TUBB4B was 78.6% accurate in predicting survival of 84 patients treated with both HT and CT (median survival ≥ 4.4 yr). Accuracy was lower (73.4%) in 304 untreated patients. The performance of other machine learning approaches was also evaluated at different survival thresholds. Minimum redundancy maximum relevance feature selection of a paclitaxel-based SVM classifier based on expression of genes BCL2L1, BBC3, FGF2, FN1, and TWIST1 was 81.1% accurate in 53 CT patients. In addition, a random forest (RF) classifier using a gene signature (ABCB1, ABCB11, ABCC1, ABCC10, BAD, BBC3, BCL2, BCL2L1, BMF, CYP2C8, CYP3A4, MAP2, MAP4, MAPT, NR1I2,SLCO1B3, TUBB1, TUBB4A, and TUBB4B) predicted >3-year survival with 85.5% accuracy in 420 HT patients. A similar RF gene signature showed 82.7% accuracy in 504 patients treated with CT and/or HT. These results suggest that tumor gene expression signatures refined by machine learning techniques can be useful for predicting survival after drug therapies.


2021 ◽  
Vol 30 (1) ◽  
pp. 998-1013
Author(s):  
Shubham Vashisth ◽  
Ishika Dhall ◽  
Garima Aggarwal

Abstract The rapid pace of development over the last few decades in the domain of machine learning mirrors the advances made in the field of quantum computing. It is natural to ask whether the conventional machine learning algorithms could be optimized using the present-day noisy intermediate-scale quantum technology. There are certain computational limitations while training a machine learning model on a classical computer. Using quantum computation, it is possible to surpass these limitations and carry out such calculations in an optimized manner. This study illustrates the working of the quantum support vector machine classification model which guarantees an exponential speed-up over its typical alternatives. This research uses the quantum SVM model to solve the classification task of a malignant breast cancer diagnosis. This study also demonstrates a comparative analysis of distinct forms of SVM algorithms concerning their time complexity and performances on standard evaluation metrics, namely accuracy, precision, recall, and F1-score, to exemplify the supremacy of quantum SVM over its conventional variants.


F1000Research ◽  
2017 ◽  
Vol 5 ◽  
pp. 2124 ◽  
Author(s):  
Eliseos J. Mucaki ◽  
Katherina Baranova ◽  
Huy Q. Pham ◽  
Iman Rezaeian ◽  
Dimo Angelov ◽  
...  

Genomic aberrations and gene expression-defined subtypes in the large METABRIC patient cohort have been used to stratify and predict survival. The present study used normalized gene expression signatures of paclitaxel drug response to predict outcome for different survival times in METABRIC patients receiving hormone (HT) and, in some cases, chemotherapy (CT) agents. This machine learning method, which distinguishes sensitivity vs. resistance in breast cancer cell lines and validates predictions in patients; was also used to derive gene signatures of other HT  (tamoxifen) and CT agents (methotrexate, epirubicin, doxorubicin, and 5-fluorouracil) used in METABRIC. Paclitaxel gene signatures exhibited the best performance, however the other agents also predicted survival with acceptable accuracies. A support vector machine (SVM) model of paclitaxel response containing genes ABCB1, ABCB11, ABCC1, ABCC10, BAD, BBC3, BCL2, BCL2L1, BMF, CYP2C8, CYP3A4, MAP2, MAP4, MAPT, NR1I2, SLCO1B3, TUBB1, TUBB4A, and TUBB4B was 78.6% accurate in predicting survival of 84 patients treated with both HT and CT (median survival ≥ 4.4 yr). Accuracy was lower (73.4%) in 304 untreated patients. The performance of other machine learning approaches was also evaluated at different survival thresholds. Minimum redundancy maximum relevance feature selection of a paclitaxel-based SVM classifier based on expression of genes BCL2L1, BBC3, FGF2, FN1, and TWIST1 was 81.1% accurate in 53 CT patients. In addition, a random forest (RF) classifier using a gene signature (ABCB1, ABCB11, ABCC1, ABCC10, BAD, BBC3, BCL2, BCL2L1, BMF, CYP2C8, CYP3A4, MAP2, MAP4, MAPT, NR1I2,SLCO1B3, TUBB1, TUBB4A, and TUBB4B) predicted >3-year survival with 85.5% accuracy in 420 HT patients. A similar RF gene signature showed 82.7% accuracy in 504 patients treated with CT and/or HT. These results suggest that tumor gene expression signatures refined by machine learning techniques can be useful for predicting survival after drug therapies.


The first step in diagnosis of a breast cancer is the identification of the disease. Early detection of the breast cancer is significant to reduce the mortality rate due to breast cancer. Machine learning algorithms can be used in identification of the breast cancer. The supervised machine learning algorithms such as Support Vector Machine (SVM) and the Decision Tree are widely used in classification problems, such as the identification of breast cancer. In this study, a machine learning model is proposed by employing learning algorithms namely, the support vector machine and decision tree. The kaggle data repository consisting of 569 observations of malignant and benign observations is used to develop the proposed model. Finally, the model is evaluated using accuracy, confusion matrix precision and recall as metrics for evaluation of performance on the test set. The analysis result showed that, the support vector machine (SVM) has better accuracy and less number of misclassification rate and better precision than the decision tree algorithm. The average accuracy of the support vector machine (SVM) is 91.92 % and that of the decision tree classification model is 87.12 %.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Arturo Moncada-Torres ◽  
Marissa C. van Maaren ◽  
Mathijs P. Hendriks ◽  
Sabine Siesling ◽  
Gijs Geleijnse

AbstractCox Proportional Hazards (CPH) analysis is the standard for survival analysis in oncology. Recently, several machine learning (ML) techniques have been adapted for this task. Although they have shown to yield results at least as good as classical methods, they are often disregarded because of their lack of transparency and little to no explainability, which are key for their adoption in clinical settings. In this paper, we used data from the Netherlands Cancer Registry of 36,658 non-metastatic breast cancer patients to compare the performance of CPH with ML techniques (Random Survival Forests, Survival Support Vector Machines, and Extreme Gradient Boosting [XGB]) in predicting survival using the $$c$$ c -index. We demonstrated that in our dataset, ML-based models can perform at least as good as the classical CPH regression ($$c$$ c -index $$\sim \,0.63$$ ∼ 0.63 ), and in the case of XGB even better ($$c$$ c -index $$\sim 0.73$$ ∼ 0.73 ). Furthermore, we used Shapley Additive Explanation (SHAP) values to explain the models’ predictions. We concluded that the difference in performance can be attributed to XGB’s ability to model nonlinearities and complex interactions. We also investigated the impact of specific features on the models’ predictions as well as their corresponding insights. Lastly, we showed that explainable ML can generate explicit knowledge of how models make their predictions, which is crucial in increasing the trust and adoption of innovative ML techniques in oncology and healthcare overall.


2012 ◽  
Vol 2012 ◽  
pp. 1-10 ◽  
Author(s):  
Lev V. Utkin

A fuzzy classification model is studied in the paper. It is based on the contaminated (robust) model which produces fuzzy expected risk measures characterizing classification errors. Optimal classification parameters of the models are derived by minimizing the fuzzy expected risk. It is shown that an algorithm for computing the classification parameters is reduced to a set of standard support vector machine tasks with weighted data points. Experimental results with synthetic data illustrate the proposed fuzzy model.


2020 ◽  
Vol 14 ◽  

Breast Cancer (BC) is amongst the most common and leading causes of deaths in women throughout the world. Recently, classification and data analysis tools are being widely used in the medical field for diagnosis, prognosis and decision making to help lower down the risks of people dying or suffering from diseases. Advanced machine learning methods have proven to give hope for patients as this has helped the doctors in early detection of diseases like Breast Cancer that can be fatal, in support with providing accurate outcomes. However, the results highly depend on the techniques used for feature selection and classification which will produce a strong machine learning model. In this paper, a performance comparison is conducted using four classifiers which are Multilayer Perceptron (MLP), Support Vector Machine (SVM), K-Nearest Neighbors (KNN) and Random Forest on the Wisconsin Breast Cancer dataset to spot the most effective predictors. The main goal is to apply best machine learning classification methods to predict the Breast Cancer as benign or malignant using terms such as accuracy, f-measure, precision and recall. Experimental results show that Random forest is proven to achieve the highest accuracy of 99.26% on this dataset and features, while SVM and KNN show 97.78% and 97.04% accuracy respectively. MLP shows the least accuracy of 94.07%. All the experiments are conducted using RStudio as the data mining tool platform.


Sign in / Sign up

Export Citation Format

Share Document