Performance of a semi-automatic machine learning method for discriminating HER2 2+ status of breast cancers based on DCE-MRI (Preprint)

BACKGROUND The amplification status of human epidermal growth factor receptor 2 (HER2) 2+ breast cancer is currently tested by fluorescence in situ hybridization (FISH). However, FISH is expensive, time-consuming, and requires off-site testing. Low-cost, accurate surrogate measures for formal genetic analysis are urgently needed. In addition, machine learning is widely used for its ability to decipher complicated connections between medical image features and gene expression status. OBJECTIVE To investigate the potential association between texture features extracted from dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) and the HER2 2+ amplification status of breast cancer. METHODS A total of 92 patients with HER2 2+ breast cancer who underwent 3T MRI and FISH testing in 2018 were retrospectively selected, including 52 HER2 2+ positive and 40 negative cases. The lesion area was delineated semi-automatically with MATLAB, and a total of 307 texture features were extracted independently from precontrast, postcontrast, and subtraction images. The Student’s t-test or Mann-Whitney U test was performed to identify features that differed significantly between HER2 2+ amplification statuses. Principal component analysis was used to eliminate feature correlations. Three machine learning classifiers (logistic regression, quadratic discriminant analysis, and support vector machine (SVM)) were used with leave-one-out cross-validation to establish classification models of HER2 2+ amplification status. Classification performance was evaluated by receiver operating characteristic (ROC) analysis. RESULTS Texture features calculated from subtraction images showed more promising results than those obtained from pre- and postcontrast images. The SVM model based on features from subtraction images achieved the best performance, with an area under the ROC curve of 0.890, sensitivity of 80.77%, specificity of 85.00%, and accuracy of 82.61%. CONCLUSIONS To a certain extent, texture features of breast cancer extracted from DCE-MRI are associated with HER2 2+ amplification status. Additional studies are necessary to confirm these preliminary findings.
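The reported sensitivity, specificity, and accuracy can be checked against the cohort sizes with a few lines of arithmetic. The sketch below assumes the 52 positive cases are the FISH-amplified class; the true-positive and true-negative counts (42 and 34) are not stated in the abstract and are inferred here as the integers that reproduce the reported percentages.

```python
# Cohort sizes come from the abstract; tp and tn are inferred (assumption),
# chosen as the integer counts that reproduce the reported percentages.
n_pos, n_neg = 52, 40   # HER2 2+ positive and negative cases
tp, tn = 42, 34         # inferred, not reported in the abstract


def pct(numerator, denominator):
    """Return a percentage rounded to two decimals."""
    return round(100.0 * numerator / denominator, 2)


sensitivity = pct(tp, n_pos)               # true positives / all positives
specificity = pct(tn, n_neg)               # true negatives / all negatives
accuracy = pct(tp + tn, n_pos + n_neg)     # correct calls / all cases

print(sensitivity, specificity, accuracy)
```

Under these assumed counts the three values agree with the 80.77%, 85.00%, and 82.61% quoted in the results.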


Validation of miRNAs as Breast Cancer Biomarkers with a Machine Learning Approach

Cancers ◽

10.3390/cancers11030431 ◽

2019 ◽

Vol 11 (3) ◽

pp. 431 ◽

Cited By ~ 11

Author(s):

Oneeb Rehman ◽

Hanqi Zhuang ◽

Ali Muhamed Ali ◽

Ali Ibrahim ◽

Zhongwei Li

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Information Gain ◽

Support Vector ◽

Learning Approach ◽

Breast Cancers ◽

Functional Studies ◽

Normal Tissues ◽

Machine Learning Approach ◽

Chi Squared

Certain small noncoding microRNAs (miRNAs) are differentially expressed in normal tissues and cancers, which makes them strong candidates for cancer biomarkers. Previously, a selected subset of miRNAs was experimentally verified to be linked to breast cancer. In this paper, we validated the importance of these miRNAs using a machine learning approach on miRNA expression data. We performed feature selection on the set of these relevant miRNAs, using Information Gain (IG), Chi-Squared (CHI2), and Least Absolute Shrinkage and Selection Operator (LASSO), to rank them by importance. We then performed cancer classification with these miRNAs as features using Random Forest (RF) and Support Vector Machine (SVM) classifiers. Our results demonstrated that the miRNAs ranked higher by our analysis yielded higher classifier performance. Performance decreased as the rank of the miRNA decreased, confirming that these miRNAs have different degrees of importance as biomarkers. Furthermore, we discovered that using a minimum of three miRNAs as biomarkers for breast cancers can be as effective as using the entire set of 1800 miRNAs. This work suggests that machine learning is a useful tool for functional studies of miRNAs for cancer detection and diagnosis.
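As an illustration of how Information Gain can rank features, the following minimal sketch computes the reduction in label entropy obtained by splitting on a discretized feature. The toy data, the "high"/"low" discretization, and the names `mirna_a` and `mirna_b` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: expression levels already discretized to high/low.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
features = {
    "mirna_a": ["high", "high", "high", "low", "low", "low"],  # separates the classes
    "mirna_b": ["high", "low", "high", "low", "high", "low"],  # weakly informative
}

scores = {name: information_gain(values, labels) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A feature that separates the classes perfectly attains the full label entropy (1 bit here), so it ranks first; real expression data would first need a discretization step, which is a design choice not described in the abstract.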


Machine Learning and Feature Selection Methods for EGFR Mutation Status Prediction in Lung Cancer

Applied Sciences ◽

10.3390/app11073273 ◽

2021 ◽

Vol 11 (7) ◽

pp. 3273

Author(s):

Joana Morgado ◽

Tania Pereira ◽

Francisco Silva ◽

Cláudia Freitas ◽

Eduardo Negrão ◽

...

Keyword(s):

Machine Learning ◽

Lung Cancer ◽

Feature Selection ◽

EGFR Mutation ◽

Feature Selection Method ◽

Principal Component ◽

Image Features ◽

Support Vector ◽

Selection Methods ◽

Mutation Status

The evolution of personalized medicine has changed the therapeutic strategy from classical chemotherapy and radiotherapy to targeted therapy based on genetic modifications. Although biopsy is the traditional method to genetically characterize lung cancer tumors, it is an invasive and painful procedure for the patient. Nodule image features extracted from computed tomography (CT) scans have been used to create machine learning models that predict gene mutation status in a noninvasive, fast, and easy-to-use manner. However, recent studies have shown that radiomic features extracted from an extended region of interest (ROI) beyond the tumor might be more relevant for predicting the mutation status in lung cancer, and consequently may be used to significantly decrease the mortality rate of patients battling this condition. In this work, we investigated the relation between image phenotypes and the mutation status of Epidermal Growth Factor Receptor (EGFR), the most frequently mutated gene in lung cancer with several approved targeted therapies, using radiomic features extracted from the lung containing the nodule. A variety of linear, nonlinear, and ensemble predictive classification models, along with several feature selection methods, were used to classify the binary outcome of wild-type or mutant EGFR status. The results show that a comprehensive approach using an ROI that included the lung with the nodule can capture relevant information and predict the EGFR mutation status with increased performance compared to local nodule analyses. Linear Support Vector Machine, Elastic Net, and Logistic Regression, combined with the Principal Component Analysis feature selection method implemented with 70% of the variance in the feature set, were the best-performing classifiers, reaching Area Under the Curve (AUC) values ranging from 0.725 to 0.737. This holistic approach indicates that information from more extensive regions of the lung containing the nodule allows a more complete lung cancer characterization and should be considered in future radiogenomic studies.
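One plausible reading of "70% of the variance" is that principal components are kept, in order of decreasing variance, until their cumulative share reaches 0.70. The sketch below shows only that selection step; the eigenvalues are hypothetical, and the function name `components_for_variance` is introduced here for illustration.

```python
def components_for_variance(eigenvalues, threshold=0.70):
    """Smallest number of leading components whose cumulative variance
    share reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Hypothetical component variances (eigenvalues of the feature covariance matrix).
eigenvalues = [4.0, 2.0, 1.5, 1.0, 0.8, 0.7]
k = components_for_variance(eigenvalues)
print(k)  # number of components kept at the 70% threshold
```

With these made-up values the first two components explain 60% of the variance and the first three explain 75%, so three components are kept.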


Identifying Wood Based on Near-Infrared Spectra and Four Gray-Level Co-Occurrence Matrix Texture Features

Forests ◽

10.3390/f12111527 ◽

2021 ◽

Vol 12 (11) ◽

pp. 1527

Author(s):

Xi Pan ◽

Kang Li ◽

Zhangjing Chen ◽

Zhong Yang

Keyword(s):

Near Infrared ◽

Spatial Clustering ◽

Texture Feature ◽

Principal Component ◽

Texture Features ◽

Image Features ◽

Identification Accuracy ◽

Support Vector ◽

Gray Level ◽

Wood Identification

Identifying wood accurately and rapidly is one of the best ways to prevent fake and adulterated forestry products. Wood identification traditionally relies heavily on specialists who spend extensive time in the laboratory. A new method is proposed that combines near-infrared (NIR) spectra at wavelengths of 780–2300 nm with gray-level co-occurrence matrix (GLCM) texture features to identify timbers accurately and rapidly. The NIR spectral features were determined by principal component analysis (PCA), and the digital image features extracted with the GLCM were used to create a support vector machine (SVM) model to identify the timbers. The results from fusing features of raw spectra with four GLCM features of 25 timbers showed that the identification accuracy of the model was 99.43%. A comparative analysis of sample anisotropy and heterogeneity revealed that the transverse surface carried more identifying characteristics than the tangential and radial surfaces. Furthermore, the pre-processed NIR bands of 780–1100 nm (short-wavelength) and 1100–2300 nm realized high identification accuracies of 99.43% and 100%, respectively. The four GLCM features were effective for improving identification accuracy by improving the spatial clustering of the data.
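To make the GLCM idea concrete, here is a small sketch that counts co-occurring gray levels for one pixel offset and derives four common texture statistics. The abstract does not say which four GLCM features were used, so contrast, energy, homogeneity, and entropy are an assumption, as are the 4-level toy image and the horizontal offset.

```python
from math import log2


def glcm(image, dx=1, dy=0, levels=4):
    """Normalized, non-symmetric co-occurrence matrix for the offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Four common GLCM statistics (an assumed choice of features)."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "energy": sum(v * v for _, _, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "entropy": -sum(v * log2(v) for _, _, v in cells if v > 0),
    }


# Toy 4x4 image quantized to gray levels 0..3.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
features = texture_features(glcm(image))
print(features)
```

In a fusion setup like the one described, such statistics would be appended to the spectral feature vector before training the classifier; how the paper scales or weights the two feature types is not stated in the abstract.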


Hyperparameter Tuning and Pipeline Optimization via Grid Search Method and Tree-Based AutoML in Breast Cancer Prediction

Journal of Personalized Medicine ◽

10.3390/jpm11100978 ◽

2021 ◽

Vol 11 (10) ◽

pp. 978

Author(s):

Siti Fairuz Mat Radzi ◽

Muhammad Khalis Abdul Karim ◽

M Iqbal Saripan ◽

Mohd Amiruddin Abdul Rahman ◽

Iza Nurzawani Che Isa ◽

...

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Model Selection ◽

Principal Component ◽

Receiver Operating Curve ◽

Support Vector ◽

Grid Search ◽

Breast Cancer Data ◽

Data Set ◽

Cancer Data

Automated machine learning (AutoML) has been recognized as a powerful tool for building systems that automate the design of machine learning (ML) pipelines and optimize model selection. In this study, we present a tree-based pipeline optimization tool (TPOT) as a method for determining ML models with significant performance and less complex breast cancer diagnostic pipelines. Pre-processors and ML models are defined as expression trees, and pipelines are optimized by genetic programming (GP), a stochastic search method. Radiomics features have been presented as a guide for ML pipeline selection from the breast cancer data set based on TPOT. Breast cancer data were used in a comparative analysis of the TPOT-generated ML pipelines against selected ML classifiers optimized by a grid search approach. The pipeline combining principal component analysis (PCA) with random forest (RF) classification proved to be the most reliable and the least complex. The TPOT model selection technique exceeded the performance of grid search (GS) optimization. The RF classifier showed an outstanding outcome amongst the models in combination with only two pre-processors, with a precision of 0.83. The grid search optimized for support vector machine (SVM) classifiers generated a difference of 12% in comparison, while the other two classifiers, naïve Bayes (NB) and artificial neural network multilayer perceptron (ANN-MLP), generated a difference of almost 39%. The method’s performance was evaluated based on sensitivity, specificity, accuracy, precision, and receiver operating characteristic (ROC) curve analysis.


Multitask fMRI and machine learning approach improve prediction of differential brain activity pattern in patients with insomnia disorder

Scientific Reports ◽

10.1038/s41598-021-88845-w ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Mi Hyun Lee ◽

Nambeom Kim ◽

Jaeeun Yoo ◽

Hang-Keun Kim ◽

Young-Don Son ◽

...

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Brain Activity ◽

Inferior Frontal Gyrus ◽

Principal Component ◽

Classification Performance ◽

Support Vector ◽

Spatial Covariance ◽

Single Task ◽

BOLD Responses

We investigated the differential spatial covariance pattern of blood oxygen level-dependent (BOLD) responses to single-task and multitask functional magnetic resonance imaging (fMRI) between patients with psychophysiological insomnia (PI) and healthy controls (HCs), and evaluated features generated by principal component analysis (PCA) for discriminating PI from HC, compared with features generated from BOLD responses to single-task fMRI, using machine learning methods. In 19 patients with PI and 21 HCs, the mean beta value for each region of interest (ROIbval) was calculated with three contrast images (i.e., sleep-related picture, sleep-related sound, and Stroop stimuli). We performed discrimination analysis and compared the results with those for features generated from BOLD responses to single-task fMRI. We applied support vector machine analysis with a least absolute shrinkage and selection operator to evaluate five performance metrics: accuracy, recall, precision, specificity, and F2. Principal component features showed the best classification performance on all metrics compared with BOLD responses to single-task fMRI. Bilateral inferior frontal gyrus (orbital), right calcarine cortex, right lingual gyrus, left inferior occipital gyrus, and left inferior temporal gyrus were identified as the most salient areas by feature selection. Our approach showed better performance in discriminating patients with PI from HCs than single-task fMRI.
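The five metrics named above all follow from the four confusion-matrix counts. The sketch below uses the standard F-beta definition with beta = 2, which weights recall more heavily than precision; the counts are invented for illustration (chosen only to match group sizes of 19 and 21) and are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn, beta=2.0):
    """Accuracy, recall, precision, specificity, and F-beta from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f_beta": f_beta,
    }


# Hypothetical counts: 19 patients (tp + fn) and 21 controls (tn + fp).
metrics = classification_metrics(tp=16, fp=3, tn=18, fn=3)
print(metrics)
```

When precision and recall happen to be equal, as in this made-up example, F2 equals both of them; the metric only diverges from F1 when the two differ.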


Computer Aided Breast Cancer Detection Using Ensembling of Texture and Statistical Image Features

Sensors ◽

10.3390/s21113628 ◽

2021 ◽

Vol 21 (11) ◽

pp. 3628

Author(s):

Soumya Deep Roy ◽

Soham Das ◽

Devroop Kar ◽

Friedhelm Schwenker ◽

Ram Sarkar

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Ductal Carcinoma ◽

Pearson Correlation ◽

Treatment Protocol ◽

Texture Features ◽

Breast Cancer Diagnosis ◽

Image Features ◽

Breast Cancer Dataset ◽

Cancer Dataset

Breast cancer, like most forms of cancer, is a fatal disease that claims more than half a million lives every year. In 2020, breast cancer overtook lung cancer as the most commonly diagnosed form of cancer. Although the disease is extremely deadly, the survival rate and longevity increase substantially with early detection and diagnosis. The treatment protocol also varies with the stage of breast cancer. Diagnosis is typically done using histopathological slides, from which it is possible to determine whether the tissue is in the Ductal Carcinoma In Situ (DCIS) stage, in which the cancerous cells have not spread into the encompassing breast tissue, or in the Invasive Ductal Carcinoma (IDC) stage, wherein the cells have penetrated into the neighboring tissues. IDC detection is extremely time-consuming and challenging for physicians. Hence, this can be modeled as an image classification task where pattern recognition and machine learning can be used to aid doctors and medical practitioners in making such crucial decisions. In the present paper, we use an IDC Breast Cancer dataset that contains 277,524 images (78,786 IDC-positive and 198,738 IDC-negative) to classify the images into IDC(+) and IDC(-). To that end, we use feature extractors, including textural features, such as SIFT, SURF and ORB, and statistical features, such as Haralick texture features. These features are then combined to yield a dataset of 782 features. These features are ensembled by stacking using various Machine Learning classifiers, such as Random Forest, Extra Trees, XGBoost, AdaBoost, CatBoost and Multi Layer Perceptron, followed by feature selection using the Pearson Correlation Coefficient to yield a dataset with four features that are then used for classification. From our experimental results, we found that CatBoost yielded the highest accuracy (92.55%), which is on par with other state-of-the-art results, most of which employ Deep Learning architectures. The source code is available in the GitHub repository.
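The abstract does not spell out how the Pearson Correlation Coefficient drives the selection, so the following is only one common variant: a greedy redundancy filter that keeps a feature when its absolute correlation with every already-kept feature stays below a threshold. The 0.9 threshold, the feature names, and the values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def drop_correlated(features, threshold=0.9):
    """Greedy filter: keep a feature only if |r| with every kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns; f2 is a scaled copy of f1 and should be dropped.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(features)
print(kept)
```

An alternative reading would rank features by their correlation with the class label instead; either way the coefficient itself is computed as in `pearson` above.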


Performance and efficiency of machine learning algorithms for analyzing rectangular biomedical data

10.1101/2020.09.13.295592 ◽

2020 ◽

Author(s):

Fei Deng ◽

Jibing Huang ◽

Xiaoling Yuan ◽

Chao Cheng ◽

Lanjing Zhang

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Dimension Reduction ◽

Cause Of Death ◽

Information Gain ◽

Machine Learning Algorithms ◽

Support Vector ◽

Biomedical Data ◽

Breast Cancers ◽

Similar Accuracy

Most biomedical datasets, including those of omics, population studies and surveys, are rectangular in shape and have few missing data. Recently, their sample sizes have grown significantly. Rigorous analyses of these large datasets demand considerably more efficient and more accurate algorithms. Machine learning (ML) algorithms have been used to classify outcomes in biomedical datasets, including random forests (RF), decision tree (DT), artificial neural networks (ANN) and support vector machine (SVM). However, their performance and efficiency in classifying multi-category outcomes in rectangular data are poorly understood. Therefore, we aimed to compare these metrics among the 4 ML algorithms. As an example, we created a large rectangular dataset using the female breast cancers in the Surveillance, Epidemiology, and End Results-18 (SEER-18) database that were diagnosed in 2004 and followed up until December 2016. The outcome was the 6-category cause of death, namely alive, non-breast cancer, breast cancer, cardiovascular disease, infection and other cause. We included 58 dichotomized features from ~53,000 patients. All analyses were performed using MATLAB (version 2018a) and 10-fold cross-validation. The accuracy in classifying the 6-category cause of death with DT, RF, ANN and SVM was 72.68%, 72.66%, 70.01% and 71.85%, respectively. Based on the information entropy and information gain of feature values, we optimized dimension reduction (i.e., reducing the number of features in models). We found that 22 or more features were required to maintain similar accuracy, while the running time decreased from 440s for 58 features to 90s for 22 features in RF, from 70s to 40s in ANN and from 440s to 80s in SVM. In summary, we show here that RF, DT, ANN and SVM had similar accuracy for classifying multi-category outcomes in this large rectangular dataset. Dimension reduction based on information gain will significantly increase a model’s efficiency while maintaining classification accuracy.
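The 10-fold cross-validation mentioned above partitions the samples into ten folds and uses each fold once as the test set. The study ran its analyses in MATLAB; the sketch below is a language-neutral illustration in Python with a hypothetical helper name, and it omits the shuffling or stratification that a real analysis would normally add.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
for train, test in folds:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so averaging the per-fold accuracy uses every sample for evaluation exactly once.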


Automated selection of mid-height intervertebral disc slice in traverse lumbar spine MRI using a combination of deep learning feature and machine learning classifier

PLoS ONE ◽

10.1371/journal.pone.0261659 ◽

2022 ◽

Vol 17 (1) ◽

pp. e0261659

Author(s):

Friska Natalia ◽

Julio Christian Young ◽

Nunik Afriliana ◽

Hira Meidia ◽

Reyhan Eddy Yunus ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Lumbar Spine ◽

Intervertebral Disc ◽

Classification Performance ◽

Image Features ◽

Gaussian Kernel ◽

Support Vector ◽

Wide Range ◽

Spine MRI

Abnormalities and defects that can cause lumbar spinal stenosis often occur in the Intervertebral Disc (IVD) of the patient’s lumbar spine. Their automatic detection and classification require applying an image analysis algorithm to suitable images, such as mid-sagittal images or traverse mid-height intervertebral disc slices, as inputs. Hence, the process of selecting and separating these images from other medical images in the patient’s set of scans is necessary. However, technological progress in automating this process still lags behind other areas of medical image classification research. In this paper, we report the results of our investigation into the suitability and performance of different machine learning approaches for automatically selecting the best traverse plane that cuts closest to the half-height of an IVD from a database of lumbar spine MRI images. This study considers image features extracted using eleven different pre-trained Deep Convolutional Neural Network (DCNN) models. We investigate the effectiveness of three dimensionality-reduction techniques and three feature-selection techniques on the classification performance. We also investigate the performance of five different Machine Learning (ML) algorithms and three Fully Connected (FC) neural network learning optimizers, which are used to train an image classifier with hyperparameter optimization over a wide range of hyperparameter options and values. The different combinations of methods are tested on a publicly available lumbar spine MRI dataset consisting of MRI studies of 515 patients with symptomatic back pain. Our experiment shows that applying the Support Vector Machine algorithm with a short Gaussian kernel to full-length image features extracted using a pre-trained DenseNet201 model is the best approach. This approach gives a minimum per-class classification performance of around 0.88 when measured using the precision and recall metrics. The median performance measured using the precision metric ranges from 0.95 to 0.99, whereas that using the recall metric ranges from 0.93 to 1.0. When only the L3/L4, L4/L5, and L5/S1 classes are considered, the minimum F1-Scores range from 0.93 to 0.95, whereas the median F1-Scores range from 0.97 to 0.99.
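For readers unfamiliar with the Gaussian (RBF) kernel used by the Support Vector Machine, the sketch below evaluates it for two feature vectors. The `gamma` values and the vectors are hypothetical, and the abstract does not define what "short" means for the kernel; a larger `gamma` (narrower kernel) is merely one possible interpretation.

```python
from math import exp


def gaussian_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)


# Hypothetical 2-D feature vectors; real DCNN features have far more dimensions.
u = [0.0, 0.0]
v = [1.0, 1.0]

print(gaussian_kernel(u, u))             # identical vectors give the maximum similarity
print(gaussian_kernel(u, v))             # similarity decays with distance
print(gaussian_kernel(u, v, gamma=5.0))  # a narrower kernel decays faster
```

The kernel value is always in (0, 1], and the choice of `gamma` controls how quickly similarity falls off, which is why it is usually tuned during hyperparameter optimization.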


Breast Cancer Detection in the IOT Health Environment Using Modified Recursive Feature Selection

Wireless Communications and Mobile Computing ◽

10.1155/2019/5176705 ◽

2019 ◽

Vol 2019 ◽

pp. 1-19 ◽

Cited By ~ 6

Author(s):

Muhammad Hammad Memon ◽

Jian Ping Li ◽

Amin Ul Haq ◽

Muhammad Hunain Memon ◽

Wang Zhou

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Feature Selection ◽

Correlation Coefficient ◽

Classification Performance ◽

Experimental Results ◽

Support Vector ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Early Stages

Accurate and efficient diagnosis of breast cancer in its early stages is essential for treatment and recovery in the IoT healthcare environment. The Internet of Things has transformed daily life over the last few years and, through the emerging role of artificial intelligence and data mining techniques, provides a way to analyze both real-time and historical data. Current state-of-the-art methods do not effectively diagnose breast cancer in the early stages, and many women suffer from this dangerous disease. Thus, the early detection of breast cancer poses a great challenge for medical experts and researchers. To solve the problem of early-stage detection of breast cancer, we propose a machine learning-based diagnostic system that effectively classifies cases as malignant or benign in the IoT environment. In the proposed system, a support vector machine (SVM) classifier is used to classify the malignant and benign cases. To improve the classification performance of the system, we used a recursive feature selection algorithm to select more suitable features from the breast cancer dataset. A training/testing split is applied to train and test the classifier for the best predictive model. Additionally, the classifier performance was checked using performance evaluation metrics such as classification accuracy, specificity, sensitivity, Matthews’s correlation coefficient, F1-score, and execution time. To test the proposed method, the “Wisconsin Diagnostic Breast Cancer” dataset was used in this study. The experimental results demonstrate that the recursive feature selection algorithm selects the best subset of features, and the SVM classifier achieved optimal classification performance on this best subset of features. The SVM with a linear kernel achieved high classification accuracy (99%), specificity (99%), and sensitivity (98%), and the Matthews’s correlation coefficient is 99%. From these experimental results, we concluded that the proposed system performs excellently because of the more appropriate features selected by the recursive feature selection algorithm. Furthermore, we suggest this proposed system for effective and efficient early-stage diagnosis of breast cancer. Thus, through this system, recovery and treatment will be more effective for breast cancer. Lastly, the implementation of the proposed system is very reliable in all aspects of IoT healthcare for breast cancer.
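Of the metrics listed, the Matthews correlation coefficient is the least familiar; it is computed from all four confusion-matrix cells and ranges from -1 to 1, so a value quoted as 99% corresponds to 0.99. The sketch below uses the standard formula with invented counts that are not taken from the paper.

```python
from math import sqrt


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: return 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for a malignant-versus-benign classifier.
mcc = matthews_corrcoef(tp=45, fp=5, tn=40, fn=10)
print(round(mcc, 3))
```

Unlike accuracy, this coefficient stays informative when the two classes are imbalanced, which is one reason it is often reported alongside sensitivity and specificity.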


Binary Spectrum Feature for Improved Classifier Performance

10.36227/techrxiv.12993122 ◽

2020 ◽

Author(s):

Nalika Ulapane ◽

Karthick Thiyagarajan ◽

Sarath Kodagoda

Keyword(s):

Machine Learning ◽

Classification Performance ◽

Feature Reduction ◽

Sensor Data ◽

Machine Learning Techniques ◽

Support Vector ◽

SVM Classifier ◽

Monitoring Task ◽

Classifier Performance ◽

Spectrum Feature

Classification has become a vital task in modern machine learning and Artificial Intelligence applications, including smart sensing. Numerous machine learning techniques are available to perform classification. Similarly, numerous practices, such as feature selection (i.e., selection of a subset of descriptor variables that optimally describe the output), are available to improve classifier performance. In this paper, we consider the case of a given supervised learning classification task that has to be performed making use of continuous-valued features. It is assumed that an optimal subset of features has already been selected. Therefore, no further feature reduction, or feature addition, is to be carried out. Then, we attempt to improve the classification performance by passing the given feature set through a transformation that produces a new feature set which we have named the “Binary Spectrum”. Via a case study example done on some Pulsed Eddy Current sensor data captured from an infrastructure monitoring task, we demonstrate how the classification accuracy of a Support Vector Machine (SVM) classifier increases through the use of this Binary Spectrum feature, indicating the feature transformation’s potential for broader usage.
