On the Interpretability of Machine Learning Models and Experimental Feature Selection in Case of Multicollinear Data

Franc Drobnič; Andrej Kos; Matevž Pustišek

doi:10.3390/electronics9050761

Electronics ◽

10.3390/electronics9050761 ◽

2020 ◽

Vol 9 (5) ◽

pp. 761

Author(s):

Franc Drobnič ◽

Andrej Kos ◽

Matevž Pustišek

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forests ◽

Experimental Approach ◽

Feature Selection Method ◽

Model Quality ◽

Human In The Loop ◽

Model Interpretation ◽

Small Feature ◽

Original Feature

In the field of machine learning, a considerable amount of research is devoted to the interpretability of models and their decisions. Interpretability is often at odds with model quality. Random Forests are among the best-performing machine learning techniques, but they operate as a “black box”. Among the quantifiable approaches to model interpretation are measures of association between predictors and response. For Random Forests, this approach usually consists of calculating the model’s feature importances. Known methods, including the built-in one, are less suitable in settings with strong multicollinearity of features. Therefore, we propose an experimental approach to the feature selection task: a greedy forward feature selection method with a least-trees-used criterion. It yields a set of the most informative features that can be used in a machine learning (ML) training process with prediction quality similar to that of the original feature set. We verify the results of the proposed method on two known datasets, one with small feature multicollinearity and another with large feature multicollinearity. The proposed method also allows a domain expert to help select among equally important features, which is known as the human-in-the-loop approach.
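
The greedy forward selection idea in this abstract can be sketched generically. The sketch below is not the authors' implementation: the abstract does not define the least-trees-used criterion, so the callback `score_fn`, the tolerance `min_gain`, and the toy scores are hypothetical stand-ins. In a Random Forest setting, `score_fn` would presumably train a forest on the candidate subset and return some quality measure.

```python
from typing import Callable, List, Sequence


def greedy_forward_select(
    features: Sequence[str],
    score_fn: Callable[[List[str]], float],
    min_gain: float = 1e-9,
) -> List[str]:
    """Add, one at a time, the feature that most improves score_fn.

    Stops when no remaining feature improves the score by more than min_gain.
    score_fn is a hypothetical callback (higher is better).
    """
    selected: List[str] = []
    best_score = score_fn(selected)
    remaining = list(features)
    while remaining:
        # Score every candidate extension of the current subset.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored)
        if top_score - best_score <= min_gain:
            break  # no candidate helps enough; stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


# Toy score (assumption): "a" and "a_copy" carry the same information
# (perfect collinearity), "b" carries independent information, "noise" none.
def toy_score(subset: List[str]) -> float:
    info = set()
    for f in subset:
        if f in ("a", "a_copy"):
            info.add("A")
        elif f == "b":
            info.add("B")
    return 0.6 * ("A" in info) + 0.3 * ("B" in info)


chosen = greedy_forward_select(["a", "a_copy", "b", "noise"], toy_score)
print(chosen)
```

Only one of the two collinear features is selected, because the second adds no gain once the first is in the set. The tie between them is broken here by name order; this is the point at which a domain expert could choose instead, in the human-in-the-loop sense the abstract describes.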

Random Forests Followed by Computed ABC Analysis as a Feature Selection Method for Machine Learning in Biomedical Data

Studies in Classification, Data Analysis, and Knowledge Organization - Advanced Studies in Classification and Data Science ◽

10.1007/978-981-15-3311-2_5 ◽

2020 ◽

pp. 57-69

Author(s):

Jörn Lötsch ◽

Alfred Ultsch

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forests ◽

Feature Selection Method ◽

Selection Method ◽

Biomedical Data ◽

ABC Analysis

An Improved Machine Learning-Based Employees Attrition Prediction Framework with Emphasis on Feature Selection

Mathematics ◽

10.3390/math9111226 ◽

2021 ◽

Vol 9 (11) ◽

pp. 1226

Author(s):

Saeed Najafi-Zangeneh ◽

Naser Shams-Gharneh ◽

Ali Arjomandi-Nezhad ◽

Sarfaraz Hashemkhani Zolfani

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Standard Deviation ◽

Analytical Formula ◽

Feature Selection Method ◽

Selection Method ◽

Performance Measure ◽

Learning Approaches ◽

Training Costs ◽

Professional Employees

Companies always seek ways to make their professional employees stay with them to reduce extra recruiting and training costs. Predicting whether a particular employee may leave or not will help the company to make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula. Therefore, machine learning approaches are the best tools for this aim. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction. An IBM HR dataset is chosen as the case study. Since there are several features in the dataset, the “max-out” feature selection method is proposed for dimension reduction in the pre-processing stage. This method is implemented for the IBM HR dataset. The coefficient of each feature in the logistic regression model shows the importance of the feature in attrition prediction. The results show improvement in the F1-score performance measure due to the “max-out” feature selection method. Finally, the validity of parameters is checked by training the model for multiple bootstrap datasets. Then, the average and standard deviation of parameters are analyzed to check the confidence value of the model’s parameters and their stability. The small standard deviation of parameters indicates that the model is stable and is more likely to generalize well.
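
The bootstrap stability check described above (refit on resampled datasets, then inspect the mean and standard deviation of the parameters) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's framework: a one-feature least-squares slope stands in for the logistic regression coefficients to keep the example dependency-free, and the data, `n_boot`, and seeds are made up.

```python
import random
import statistics


def slope(xs, ys):
    """Ordinary least-squares slope for one feature (stand-in for a model coefficient)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_coefficients(xs, ys, n_boot=200, seed=0):
    """Refit on n_boot resamples drawn with replacement; return the fitted slopes."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: slope undefined
        slopes.append(slope(bx, by))
    return slopes


# Synthetic data (assumption): y = 2x plus small Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]

coefs = bootstrap_coefficients(xs, ys)
print(f"mean={statistics.fmean(coefs):.3f}  std={statistics.stdev(coefs):.3f}")
```

A small standard deviation across resamples is the signal the abstract reads as parameter stability.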

NIMG-46. RADIOGENOMIC FEATURES PREDICT CLINICALLY RELEVANT GENOME-WIDE ALTERATION SIGNATURES IN GLIOBLASTOMA

Neuro-Oncology ◽

10.1093/neuonc/noaa215.659 ◽

2020 ◽

Vol 22 (Supplement_2) ◽

pp. ii158-ii158

Author(s):

Nicholas Nuechterlein ◽

Beibin Li ◽

James Fink ◽

David Haynor ◽

Eric Holland ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Molecular Subtypes ◽

Feature Selection Method ◽

Selection Method ◽

Versus Group ◽

MRI Features ◽

Genome Wide ◽

Group 2 ◽

Group 1

Background: Previously, we have shown that combined whole-exome sequencing (WES) and genome-wide somatic copy number alteration (SCNA) information can separate IDH1/2-wildtype glioblastoma into two prognostic molecular subtypes (Group 1 and Group 2) and that these subtypes cannot be distinguished by epigenetic or clinical features. However, the potential for radiographic features to discriminate between these molecular subtypes has not been established.

Methods: Radiogenomic features (n=35,400) were extracted from 46 multiparametric, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive, all of whom have corresponding WES and SCNA data in The Cancer Genome Atlas. We developed a novel feature selection method that leverages the structure of extracted radiogenomic MRI features to mitigate the dimensionality challenge posed by the disparity between the number of features and patients in our cohort. Seven traditional machine learning classifiers were trained to distinguish Group 1 versus Group 2 using our feature selection method. Our feature selection was compared to lasso feature selection, recursive feature elimination, and variance thresholding.

Results: We are able to classify Group 1 versus Group 2 glioblastomas with a cross-validated area under the curve (AUC) score of 0.82 using ridge logistic regression and our proposed feature selection method, which reduces the size of our feature set from 35,400 to 288. An interrogation of the selected features suggests that features describing contours in the T2 abnormality region on the FLAIR MRI modality may best distinguish these two groups from one another.

Conclusions: We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups. This algorithm may be applied to future prospective studies to assess the utility of MRI as a surrogate for costly prognostic genomic studies.
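
Of the baselines named in this abstract, variance thresholding is the simplest to illustrate. The sketch below shows that baseline only; it is not the authors' novel selection method, which the abstract does not specify. The function name, the threshold, and the small feature matrix are assumptions made for illustration.

```python
import statistics


def variance_threshold(rows, threshold=0.0):
    """Return indices of columns whose population variance exceeds threshold."""
    n_cols = len(rows[0])
    keep = []
    for j in range(n_cols):
        column = [row[j] for row in rows]
        if statistics.pvariance(column) > threshold:
            keep.append(j)
    return keep


# Made-up 4 x 3 feature matrix: column 1 is constant and carries no signal.
X = [
    [0.1, 5.0, 10.0],
    [0.4, 5.0, 12.0],
    [0.2, 5.0, 9.0],
    [0.3, 5.0, 11.0],
]
kept = variance_threshold(X, threshold=0.0)
print(kept)  # [0, 2]
```

Variance thresholding ignores the labels entirely, which is one reason it serves as a baseline rather than a competitive method when features vastly outnumber patients.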

Radiogenomic modeling predicts survival-associated prognostic groups in glioblastoma

Neuro-Oncology Advances ◽

10.1093/noajnl/vdab004 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Nicholas Nuechterlein ◽

Beibin Li ◽

Abdullah Feroze ◽

Eric C Holland ◽

Linda Shapiro ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Molecular Subtypes ◽

Feature Selection Method ◽

Area Under The Curve ◽

Selection Method ◽

Recursive Feature Elimination ◽

Signal Abnormality ◽

MRI Features ◽

MRI Scans

Background: Combined whole-exome sequencing (WES) and somatic copy number alteration (SCNA) information can separate isocitrate dehydrogenase (IDH)1/2-wildtype glioblastoma into two prognostic molecular subtypes, which cannot be distinguished by epigenetic or clinical features. The potential for radiographic features to discriminate between these molecular subtypes has yet to be established.

Methods: Radiologic features (n = 35 340) were extracted from 46 multisequence, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive (TCIA), all of whom have corresponding WES/SCNA data. We developed a novel feature selection method that leverages the structure of extracted MRI features to mitigate the dimensionality challenge posed by the disparity between a large number of features and the limited patients in our cohort. Six traditional machine learning classifiers were trained to distinguish molecular subtypes using our feature selection method, which was compared to least absolute shrinkage and selection operator (LASSO) feature selection, recursive feature elimination, and variance thresholding.

Results: We were able to classify glioblastomas into two prognostic subgroups with a cross-validated area under the curve score of 0.80 (±0.03) using ridge logistic regression on the 15-dimensional principal component analysis (PCA) embedding of the features selected by our novel feature selection method. An interrogation of the selected features suggested that features describing contours in the T2 signal abnormality region on the T2-weighted fluid-attenuated inversion recovery (FLAIR) MRI sequence may best distinguish these two groups from one another.

Conclusions: We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups.
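
The area under the curve reported above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes that pairwise form directly; it is a generic illustration with made-up labels and scores, not the study's evaluation code, and it omits the cross-validation the abstract mentions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and classifier scores for illustration.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(y_true, y_score)
print(round(auc, 3))  # 0.889
```

Eight of the nine positive-negative pairs are ordered correctly here, giving 8/9. The quadratic pairwise loop is fine for a cohort of 46 patients; larger datasets would normally use a rank-based computation.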

New feature selection method based on neural network and machine learning

2016 IEEE International Multidisciplinary Conference on Engineering Technology (IMCET) ◽

10.1109/imcet.2016.7777431 ◽

2016 ◽

Cited By ~ 4

Author(s):

Nicole Challita ◽

Mohamad Khalil ◽

Pierre Beauseroy

Keyword(s):

Neural Network ◽

Machine Learning ◽

Feature Selection ◽

Feature Selection Method ◽

Selection Method ◽

New Feature

A Machine Learning Framework for Intrusion Detection System in IoT Networks Using an Ensemble Feature Selection Method

10.1109/iemcon53756.2021.9623082 ◽

2021 ◽

Author(s):

Ge Guo

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Intrusion Detection ◽

Intrusion Detection System ◽

Detection System ◽

Feature Selection Method ◽

Selection Method ◽

Learning Framework

A new feature selection method based on machine learning technique for air quality dataset

Journal of Statistics and Management Systems ◽

10.1080/09720510.2019.1609726 ◽

2019 ◽

Vol 22 (4) ◽

pp. 697-705 ◽

Cited By ~ 9

Author(s):

Jasleen Kaur Sethi ◽

Mamta Mittal

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Air Quality ◽

Feature Selection Method ◽

Selection Method ◽

Machine Learning Technique ◽

Learning Technique ◽

New Feature

A Hybrid Improved Ant Colony Optimization and Random Forests Feature Selection Method for Microarray Data

2009 Fifth International Joint Conference on INC, IMS and IDC ◽

10.1109/ncm.2009.66 ◽

2009 ◽

Cited By ~ 4

Author(s):

Wen Xiong ◽

Cong Wang

Keyword(s):

Feature Selection ◽

Ant Colony Optimization ◽

Random Forests ◽

Microarray Data ◽

Feature Selection Method ◽

Selection Method ◽

Ant Colony

Diagnostic Performance of 2D and 3D T2WI-Based Radiomics Features With Machine Learning Algorithms to Distinguish Solid Solitary Pulmonary Lesion

Frontiers in Oncology ◽

10.3389/fonc.2021.683587 ◽

2021 ◽

Vol 11 ◽

Author(s):

Qi Wan ◽

Jiaxuan Zhou ◽

Xiaoying Xia ◽

Jianfeng Hu ◽

Peng Wang ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Diagnostic Performance ◽

Feature Selection Method ◽

Machine Learning Algorithms ◽

Support Vector ◽

Learning Approaches ◽

Selection Methods ◽

Linear Discriminant ◽

2D And 3D

Objective: To evaluate the performance of 2D and 3D radiomics features with different machine learning approaches to classify solid solitary pulmonary lesions (SPLs) based on magnetic resonance (MR) T2-weighted imaging (T2WI).

Material and Methods: A total of 132 patients with pathologically confirmed SPLs were examined and randomly divided into training (n = 92) and test datasets (n = 40). A total of 1692 3D and 1231 2D radiomics features per patient were extracted. Both radiomics features and clinical data were evaluated. A total of 1260 classification models, comprising 3 normalization methods, 2 dimension reduction algorithms, 3 feature selection methods, and 10 classifiers with 7 different feature numbers (confined to 3–9), were compared. Ten-fold cross-validation on the training dataset was applied to choose the candidate final model. The area under the receiver operating characteristic curve (AUC), precision-recall plot, and Matthews Correlation Coefficient (MCC) were used to evaluate the performance of machine learning approaches.

Results: The 3D features were significantly superior to 2D features, showing many more machine learning combinations with AUC greater than 0.7 in both validation and test groups (129 vs. 11). The feature selection methods Analysis of Variance (ANOVA) and Recursive Feature Elimination (RFE) and the classifiers Logistic Regression (LR), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), and Gaussian Process (GP) had relatively better performance. The best performance of 3D radiomics features in the test dataset (AUC = 0.824, AUC-PR = 0.927, MCC = 0.514) was higher than that of 2D features (AUC = 0.740, AUC-PR = 0.846, MCC = 0.404). The joint 3D and 2D features (AUC = 0.813, AUC-PR = 0.926, MCC = 0.563) showed similar results to 3D features. Incorporating clinical features with 3D and 2D radiomics features slightly improved the AUC to 0.836 (AUC-PR = 0.918, MCC = 0.620) and 0.780 (AUC-PR = 0.900, MCC = 0.574), respectively.

Conclusions: After algorithm optimization, 2D feature-based radiomics models yield favorable results in differentiating malignant and benign SPLs, but 3D features are still preferred because of the availability of more machine learning algorithmic combinations with better performance. The feature selection methods ANOVA and RFE and the classifiers LR, LDA, SVM, and GP are more likely to demonstrate better diagnostic performance for 3D features in the current study.
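
The count of 1260 classification models follows from taking every combination of the pipeline options listed in the methods. The sketch below reproduces that arithmetic; the option labels are placeholders, since the abstract gives only the number of choices at each stage.

```python
from itertools import product

# Placeholder labels (assumption): only the counts come from the abstract.
normalizations = [f"norm_{i}" for i in range(3)]
reductions = [f"reduce_{i}" for i in range(2)]
selectors = [f"select_{i}" for i in range(3)]
classifiers = [f"clf_{i}" for i in range(10)]
feature_counts = list(range(3, 10))  # 3 through 9 inclusive: 7 values

# Full Cartesian grid: one entry per candidate pipeline configuration.
pipelines = list(
    product(normalizations, reductions, selectors, classifiers, feature_counts)
)
print(len(pipelines))  # 1260
```

That is 3 × 2 × 3 × 10 × 7 = 1260, matching the figure reported in the abstract.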

Interpretability and Class Imbalance in Prediction Models for Pain Volatility in Manage My Pain App Users: Analysis Using Feature Selection and Majority Voting Methods

JMIR Medical Informatics ◽

10.2196/15601 ◽

2019 ◽

Vol 7 (4) ◽

pp. e15601 ◽

Cited By ~ 1

Author(s):

Quazi Abidur Rahman ◽

Tahir Janmohamed ◽

Hance Clarke ◽

Paul Ritvo ◽

Jane Heffernan ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Feature Selection ◽

Random Forests ◽

Prediction Models ◽

Class Imbalance ◽

Majority Voting ◽

Selection Methods ◽

Logistic Regression Models ◽

High Volatility

Background: Pain volatility is an important factor in chronic pain experience and adaptation. Previously, we employed machine-learning methods to define and predict pain volatility levels from users of the Manage My Pain app. Reducing the number of features is important to help increase interpretability of such prediction models. Prediction results also need to be consolidated from multiple random subsamples to address the class imbalance issue.

Objective: This study aimed to: (1) increase the interpretability of previously developed pain volatility models by identifying the most important features that distinguish high from low volatility users; and (2) consolidate prediction results from models derived from multiple random subsamples while addressing the class imbalance issue.

Methods: A total of 132 features were extracted from the first month of app use to develop machine learning–based models for predicting pain volatility at the sixth month of app use. Three feature selection methods were applied to identify features that were significantly better predictors than other members of the large features set used for developing the prediction models: (1) Gini impurity criterion; (2) information gain criterion; and (3) Boruta. We then combined the three groups of important features determined by these algorithms to produce the final list of important features. Three machine learning methods were then employed to conduct prediction experiments using the selected important features: (1) logistic regression with ridge estimators; (2) logistic regression with least absolute shrinkage and selection operator; and (3) random forests. Multiple random under-sampling of the majority class was conducted to address class imbalance in the dataset. Subsequently, a majority voting approach was employed to consolidate prediction results from these multiple subsamples. The total number of users included in this study was 879, with a total number of 391,255 pain records.

Results: A threshold of 1.6 was established using clustering methods to differentiate between 2 classes: low volatility (n=694) and high volatility (n=185). The overall prediction accuracy is approximately 70% for both random forests and logistic regression models when using 132 features. Overall, 9 important features were identified using 3 feature selection methods. Of these 9 features, 2 are from the app use category and the other 7 are related to pain statistics. After consolidating models that were developed using random subsamples by majority voting, logistic regression models performed equally well using 132 or 9 features. Random forests performed better than logistic regression methods in predicting the high volatility class. The consolidated accuracy of random forests does not drop significantly (601/879; 68.4% vs 618/879; 70.3%) when only 9 important features are included in the prediction model.

Conclusions: We employed feature selection methods to identify important features in predicting future pain volatility. To address class imbalance, we consolidated models that were developed using multiple random subsamples by majority voting. Reducing the number of features did not result in a significant decrease in the consolidated prediction accuracy.
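
The two imbalance-handling steps described above, random under-sampling of the majority class and majority voting across the resulting models, can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the helper names and the three prediction lists are hypothetical, and only the class sizes (694 and 185) come from the abstract.

```python
import random
from collections import Counter


def undersample(labels, rng):
    """Return sorted indices of a balanced subsample.

    Every class is reduced to the size of the smallest class by random draw.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    minority_size = min(len(idx) for idx in by_class.values())
    chosen = []
    for idx in by_class.values():
        chosen.extend(rng.sample(idx, minority_size))
    return sorted(chosen)


def majority_vote(predictions_per_model):
    """Combine per-model prediction lists into one list by per-item majority."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Class sizes from the abstract: 694 low-volatility and 185 high-volatility users.
labels = ["low"] * 694 + ["high"] * 185
rng = random.Random(0)
subsample = undersample(labels, rng)
print(len(subsample))  # 370

# Hypothetical predictions from three models, each trained on a different subsample.
votes = [
    ["high", "low", "low"],
    ["high", "high", "low"],
    ["low", "high", "low"],
]
print(majority_vote(votes))  # ['high', 'high', 'low']
```

Using an odd number of models avoids tied votes in the two-class case; with an even number, a tie-breaking rule would need to be chosen explicitly.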
