Inflated prediction accuracy of neuropsychiatric biomarkers caused by data leakage in feature selection

Abstract: In recent years, machine learning techniques have been frequently applied to uncover neuropsychiatric biomarkers, with the aim of accurately diagnosing neuropsychiatric diseases and predicting treatment prognosis. However, many studies either did not perform cross-validation (CV) when using machine learning techniques or performed CV incorrectly, leading to significantly biased results due to overfitting. The aim of this study is to investigate the impact of CV on the prediction performance of neuropsychiatric biomarkers, in particular when feature selection is performed on high-dimensional features. To this end, we evaluated prediction performance using both simulation data and actual electroencephalography (EEG) data. For both the simulation and the actual EEG data, the overall prediction accuracies obtained when feature selection was performed outside of CV were considerably higher than those obtained when feature selection was performed within CV. The difference between the prediction accuracies of the two feature selection approaches can be regarded as the amount of overfitting due to selection bias. Our results indicate the importance of using CV correctly to avoid biased estimates of the prediction performance of neuropsychiatric biomarkers.
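The selection bias described in this abstract can be reproduced on pure noise. The following is a minimal sketch, not the authors' simulation: it assumes Gaussian noise features unrelated to the labels, a simple mean-difference filter for feature selection, and a nearest-centroid classifier, all chosen here purely for illustration. The function names (`make_noise_data`, `select_top_features`, `cv_accuracy`) and the sizes (40 samples, 500 features, 10 selected features, 5 folds) are hypothetical.

```python
import random

def make_noise_data(n_samples, n_features, rng):
    """Features are pure Gaussian noise; labels alternate 0/1 and carry no signal."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def select_top_features(X, y, rows, k):
    """Rank features by |mean(class 1) - mean(class 0)| using only the given rows."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    def score(j):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        return abs(mean_pos - mean_neg)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def nearest_centroid_accuracy(X, y, train, test, feats):
    """Fit class centroids on the training rows, then score the test rows."""
    centroids = {}
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        centroids[label] = [sum(X[i][j] for i in members) / len(members) for j in feats]
    correct = 0
    for i in test:
        dists = {label: sum((X[i][j] - c) ** 2 for j, c in zip(feats, centroids[label]))
                 for label in (0, 1)}
        correct += int(min(dists, key=dists.get) == y[i])
    return correct / len(test)

def cv_accuracy(X, y, k_feats, n_folds, rng, select_inside_cv):
    """K-fold CV accuracy with feature selection done inside or outside the folds."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    # Outside-CV selection sees every sample, including future test samples (leakage).
    leaky_feats = None if select_inside_cv else select_top_features(X, y, idx, k_feats)
    fold_accs = []
    for f, test in enumerate(folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        if select_inside_cv:
            feats = select_top_features(X, y, train, k_feats)  # training rows only
        else:
            feats = leaky_feats
        fold_accs.append(nearest_centroid_accuracy(X, y, train, test, feats))
    return sum(fold_accs) / len(fold_accs)

data_rng = random.Random(0)
outside, inside = [], []
for rep in range(5):  # average over a few noise datasets to reduce variance
    X, y = make_noise_data(40, 500, data_rng)
    outside.append(cv_accuracy(X, y, 10, 5, random.Random(rep), select_inside_cv=False))
    inside.append(cv_accuracy(X, y, 10, 5, random.Random(rep), select_inside_cv=True))

acc_outside = sum(outside) / len(outside)
acc_inside = sum(inside) / len(inside)
print(f"selection outside CV: {acc_outside:.2f}")
print(f"selection inside CV:  {acc_inside:.2f}")
```

Because the features carry no signal, an unbiased estimate should sit near chance (0.5). Any excess in the outside-CV estimate is attributable to the held-out samples having influenced which features were selected.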

An efficient feature selection method for classification in health care systems using machine learning techniques

2011 3rd International Conference on Electronics Computer Technology ◽

10.1109/icectech.2011.5941891 ◽

2011 ◽

Cited By ~ 7

Author(s):

K Selvakuberan ◽

D Kayathiri ◽

B Harini ◽

M Indra Devi

Keyword(s):

Machine Learning ◽

Health Care ◽

Feature Selection ◽

Health Care Systems ◽

Feature Selection Method ◽

Selection Method ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Care Systems

Prediction of Alzheimer’s disease (AD) Using Machine Learning Techniques with Boruta Algorithm as Feature Selection Method

Journal of Physics Conference Series ◽

10.1088/1742-6596/1372/1/012065 ◽

2019 ◽

Vol 1372 ◽

pp. 012065

Author(s):

Lee Kuok Leong ◽

Azian Azamimi Abdullah

Keyword(s):

Machine Learning ◽

Alzheimer's Disease ◽

Feature Selection ◽

Feature Selection Method ◽

Selection Method ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Boruta Algorithm

Osteoporosis Detection Using Machine Learning Techniques and Feature Selection

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213014500146 ◽

2014 ◽

Vol 23 (05) ◽

pp. 1450014 ◽

Cited By ~ 8

Author(s):

Theodoros Iliou ◽

Christos-Nikolaos Anagnostopoulos ◽

George Anastassopoulos

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Bone Densitometry ◽

Feature Selection Method ◽

Machine Learning Techniques ◽

Mineral Density ◽

Learning Techniques ◽

Increased Risk ◽

Osteoporosis Risk ◽

Fold Cross Validation

Osteoporosis is a bone disease that leads to an increased risk of fracture and is characterized by low bone mineral density and micro-architectural deterioration of bone tissue. In this article, the dataset consists of 3426 subjects (1083 pathological and 2343 healthy cases) whose diagnosis was based on laboratory and osteal bone densitometry examination. In all cases, four diagnostic factors for osteoporosis risk prediction (age, sex, height, and weight) were stored for later evaluation with the selected classifiers. To categorize subjects into two classes (osteoporosis, non-osteoporosis), twenty machine learning techniques were assessed, chosen for their popularity and frequency in biomedical engineering problems. All classifiers were evaluated using the well-known 10-fold cross-validation method, and the results are reported analytically. In addition, a feature selection method identified that similar performance could be achieved using only two diagnostic factors (age and weight). The aim of the proposed exhaustive methodology is to assist therapists in osteoporosis prediction, avoiding unnecessary further testing with bone densitometry.

Heart Disease Prediction Based on an Optimal Feature Selection Method using Autoencoder

International Journal of Scientific Research in Science and Technology ◽

10.32628/ijsrst20748 ◽

2020 ◽

pp. 25-38

Author(s):

Azhar M. A. ◽

Princy Ann Thomas

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Feature Selection Method ◽

Classification Problem ◽

Machine Learning Techniques ◽

Classification Problems ◽

Process Data ◽

Integration Algorithm ◽

Learning Techniques ◽

Hybrid Classification

Heart failure is a common disease that can lead to dangerous situations. Healthcare systems hold a large amount of data, but successful analysis methods for finding connections and patterns in that data have been lacking. Machine learning methods can help remedy this and give better insight into the classification problem. In many classification problems, the huge size of the data makes it difficult to learn good classifiers before unwanted features are removed. In our work, we used an artificial neural network-based autoencoder for effective feature selection. The aim of feature selection is to improve prediction performance and provide a better understanding of the process data. A hybrid classification method with a dynamic integration algorithm aims to find optimal features by applying machine learning techniques, thereby improving performance in the prediction of cardiovascular disease.

Decoding the Neural Signatures of Valence and Arousal From Portable EEG Headset

10.1101/2021.07.23.453533 ◽

2021 ◽

Author(s):

Nikhil Garg ◽

Rohit Garg ◽

Parrivesh NS ◽

Apoorv Anand ◽

V.A.S. Abhinav ◽

...

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Feature Selection ◽

Emotion Recognition ◽

Machine Learning Techniques ◽

Emotion Classification ◽

Learning Techniques ◽

Eeg Data ◽

Valence And Arousal ◽

Mean Square Errors

This paper focuses on classifying emotions on the valence-arousal plane using various feature extraction, feature selection, and machine learning techniques. Emotion classification using EEG data and machine learning techniques has been on the rise in the recent past. We evaluate different feature extraction and feature selection techniques, and propose the optimal set of features and electrodes for emotion recognition. Images from the OASIS image dataset were used to elicit valence and arousal emotions, and the EEG data was recorded using the Emotiv Epoc X mobile EEG headset. The analysis is additionally carried out on two publicly available datasets, DEAP and DREAMER. We propose a novel feature ranking technique and an incremental learning approach to analyze the dependence of performance on the number of participants. Leave-one-out cross-validation was carried out to identify subject bias in emotion elicitation patterns. The importance of different electrode locations was calculated, which could be used for designing a headset for emotion recognition. Our study achieved root mean square errors of less than 0.75 on DREAMER, 1.76 on DEAP, and 2.39 on our dataset.

An Improved Machine Learning-Based Employees Attrition Prediction Framework with Emphasis on Feature Selection

Mathematics ◽

10.3390/math9111226 ◽

2021 ◽

Vol 9 (11) ◽

pp. 1226

Author(s):

Saeed Najafi-Zangeneh ◽

Naser Shams-Gharneh ◽

Ali Arjomandi-Nezhad ◽

Sarfaraz Hashemkhani Zolfani

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Standard Deviation ◽

Analytical Formula ◽

Feature Selection Method ◽

Selection Method ◽

Performance Measure ◽

Learning Approaches ◽

Training Costs ◽

Professional Employees

Companies always seek ways to make their professional employees stay with them to reduce extra recruiting and training costs. Predicting whether a particular employee may leave or not will help the company to make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula. Therefore, machine learning approaches are the best tools for this aim. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction. An IBM HR dataset is chosen as the case study. Since there are several features in the dataset, the “max-out” feature selection method is proposed for dimension reduction in the pre-processing stage. This method is implemented for the IBM HR dataset. The coefficient of each feature in the logistic regression model shows the importance of the feature in attrition prediction. The results show improvement in the F1-score performance measure due to the “max-out” feature selection method. Finally, the validity of parameters is checked by training the model for multiple bootstrap datasets. Then, the average and standard deviation of parameters are analyzed to check the confidence value of the model’s parameters and their stability. The small standard deviation of parameters indicates that the model is stable and is more likely to generalize well.
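The bootstrap stability check described in this abstract can be sketched in a few lines. This is an illustrative example under stated assumptions, not the paper's implementation: it uses synthetic one-feature data rather than the IBM HR dataset, a hand-written gradient-descent logistic regression (`fit_logistic`, a hypothetical helper), and arbitrary settings (150 samples, 30 bootstrap resamples, a true coefficient of 1.5).

```python
import math
import random
import statistics

def fit_logistic(xs, ys, steps=200, lr=0.5):
    """Fit p(y=1|x) = sigmoid(w*x + b) by plain gradient descent; returns (w, b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

rng = random.Random(0)
TRUE_W = 1.5  # assumed coefficient used to generate the synthetic labels

# Synthetic data: one standard-normal feature, labels drawn from a logistic model.
xs = [rng.gauss(0.0, 1.0) for _ in range(150)]
ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-TRUE_W * x)) else 0 for x in xs]

# Refit the model on resamples drawn with replacement and collect the coefficient.
boot_w = []
for _ in range(30):
    sample = [rng.randrange(len(xs)) for _ in xs]
    w, _b = fit_logistic([xs[i] for i in sample], [ys[i] for i in sample])
    boot_w.append(w)

w_mean = statistics.mean(boot_w)
w_std = statistics.stdev(boot_w)
print(f"coefficient mean = {w_mean:.2f}, std = {w_std:.2f}")
```

A standard deviation that is small relative to the mean suggests the fitted coefficient does not depend strongly on which samples happened to be drawn, which is the sense of stability the abstract refers to.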

Evaluation of machine learning techniques to classify code comprehension based on developers' EEG data

Proceedings of the 19th Brazilian Symposium on Human Factors in Computing Systems ◽

10.1145/3424953.3426481 ◽

2020 ◽

Author(s):

Lucian José Gonçales ◽

Kleinner Farias ◽

Lucas Silveira Kupssinskü ◽

Matheus Segalotto

Keyword(s):

Machine Learning ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Eeg Data

NIMG-46. RADIOGENOMIC FEATURES PREDICT CLINICALLY RELEVANT GENOME-WIDE ALTERATION SIGNATURES IN GLIOBLASTOMA

Neuro-Oncology ◽

10.1093/neuonc/noaa215.659 ◽

2020 ◽

Vol 22 (Supplement_2) ◽

pp. ii158-ii158

Author(s):

Nicholas Nuechterlein ◽

Beibin Li ◽

James Fink ◽

David Haynor ◽

Eric Holland ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Molecular Subtypes ◽

Feature Selection Method ◽

Selection Method ◽

Versus Group ◽

Mri Features ◽

Genome Wide ◽

Group 2 ◽

Group 1

Abstract. Background: Previously, we have shown that combined whole-exome sequencing (WES) and genome-wide somatic copy number alteration (SCNA) information can separate IDH1/2-wildtype glioblastoma into two prognostic molecular subtypes (Group 1 and Group 2) and that these subtypes cannot be distinguished by epigenetic or clinical features. However, the potential for radiographic features to discriminate between these molecular subtypes has not been established. Methods: Radiogenomic features (n=35,400) were extracted from 46 multiparametric, pre-operative magnetic resonance imaging (MRI) of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive, all of whom have corresponding WES and SCNA data in The Cancer Genome Atlas. We developed a novel feature selection method that leverages the structure of extracted radiogenomic MRI features to mitigate the dimensionality challenge posed by the disparity between the number of features and patients in our cohort. Seven traditional machine learning classifiers were trained to distinguish Group 1 versus Group 2 using our feature selection method. Our feature selection was compared to lasso feature selection, recursive feature elimination, and variance thresholding. Results: We are able to classify Group 1 versus Group 2 glioblastomas with a cross-validated area under the curve (AUC) score of 0.82 using ridge logistic regression and our proposed feature selection method, which reduces the size of our feature set from 35,400 to 288. An interrogation of the selected features suggests that features describing contours in the T2 abnormality region on the FLAIR MRI modality may best distinguish these two groups from one another. Conclusions: We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups. This algorithm may be applied to future prospective studies to assess the utility of MRI as a surrogate for costly prognostic genomic studies.

Radiogenomic modeling predicts survival-associated prognostic groups in glioblastoma

Neuro-Oncology Advances ◽

10.1093/noajnl/vdab004 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Nicholas Nuechterlein ◽

Beibin Li ◽

Abdullah Feroze ◽

Eric C Holland ◽

Linda Shapiro ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Molecular Subtypes ◽

Feature Selection Method ◽

Area Under The Curve ◽

Selection Method ◽

Recursive Feature Elimination ◽

Signal Abnormality ◽

Mri Features ◽

Mri Scans

Abstract. Background: Combined whole-exome sequencing (WES) and somatic copy number alteration (SCNA) information can separate isocitrate dehydrogenase (IDH)1/2-wildtype glioblastoma into two prognostic molecular subtypes, which cannot be distinguished by epigenetic or clinical features. The potential for radiographic features to discriminate between these molecular subtypes has yet to be established. Methods: Radiologic features (n = 35 340) were extracted from 46 multisequence, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive (TCIA), all of whom have corresponding WES/SCNA data. We developed a novel feature selection method that leverages the structure of extracted MRI features to mitigate the dimensionality challenge posed by the disparity between a large number of features and the limited patients in our cohort. Six traditional machine learning classifiers were trained to distinguish molecular subtypes using our feature selection method, which was compared to least absolute shrinkage and selection operator (LASSO) feature selection, recursive feature elimination, and variance thresholding. Results: We were able to classify glioblastomas into two prognostic subgroups with a cross-validated area under the curve score of 0.80 (±0.03) using ridge logistic regression on the 15-dimensional principal component analysis (PCA) embedding of the features selected by our novel feature selection method. An interrogation of the selected features suggested that features describing contours in the T2 signal abnormality region on the T2-weighted fluid-attenuated inversion recovery (FLAIR) MRI sequence may best distinguish these two groups from one another. Conclusions: We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups.

FSDroid:- A feature selection technique to detect malware from Android using Machine Learning Techniques

Multimedia Tools and Applications ◽

10.1007/s11042-020-10367-w ◽

2021 ◽

Author(s):

Arvind Mahindru ◽

A.L. Sangal

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Machine Learning Techniques ◽

Feature Selection Technique ◽

Selection Technique ◽

Learning Techniques
