Measuring the bias of incorrect application of feature selection when using cross-validation in radiomics

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Aydin Demircioğlu

Abstract Background Many studies in radiomics use feature selection methods to identify the most predictive features, and employ cross-validation to estimate the performance of the developed models. However, if feature selection is performed before the cross-validation, data leakage can occur and bias the results. To measure the extent of this bias, we collected ten publicly available radiomics datasets and conducted two experiments. First, models were developed by incorrectly applying feature selection prior to cross-validation. Then, the same experiment was conducted by correctly applying feature selection within cross-validation, separately in each fold. The resulting models were evaluated against each other in terms of AUC-ROC, AUC-F1, and accuracy. Results Applying feature selection incorrectly prior to cross-validation produced a bias of up to 0.15 in AUC-ROC, 0.29 in AUC-F1, and 0.17 in accuracy. Conclusions Incorrect application of feature selection and cross-validation can lead to highly biased results for radiomics datasets.
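The leakage described above can be reproduced in a few lines. The sketch below, using scikit-learn on synthetic pure-noise data (not the study's radiomics datasets or models), compares the incorrect protocol, where features are selected on the full dataset before cross-validation, with the correct protocol, where selection is refit inside each training fold:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure-noise data: many features, few samples, no real signal -- a worst
# case for leakage (radiomics data are similarly high-dimensional).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))
y = rng.integers(0, 2, size=60)

# Incorrect: feature selection sees all labels before cross-validation,
# so information from the held-out folds leaks into the selected features.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
auc_leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y,
                            cv=5, scoring="roc_auc").mean()

# Correct: selection is refit on each training fold via a Pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=10),
                     LogisticRegression(max_iter=1000))
auc_correct = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

# On pure noise the correct protocol scores near chance (AUC ~0.5),
# while the leaky protocol reports a higher, optimistically biased AUC.
print(f"leaky AUC:   {auc_leaky:.3f}")
print(f"correct AUC: {auc_correct:.3f}")
```

Because the labels are random here, any gap between the two AUCs is pure bias, which is exactly the quantity the study measures on real radiomics data.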

2020 ◽  
Vol 37 (4) ◽  
pp. 563-569
Author(s):  
Dželila Mehanović ◽  
Jasmin Kevrić

Security is one of the most pressing topics in the online world, and lists of security threats are constantly updated. One of those threats is phishing websites. In this work, we address the problem of phishing website classification. Three classifiers were used: K-Nearest Neighbor, Decision Tree, and Random Forest, together with the feature selection methods from Weka. The achieved accuracy was 100%, and the number of features was reduced to seven. Moreover, reducing the number of features also reduced the time needed to build the models: the time for Random Forest decreased from the initial 2.88 s and 3.05 s (for percentage split and 10-fold cross-validation, respectively) to 0.02 s and 0.16 s.
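A scikit-learn analogue of the workflow described above might look like the following sketch. The data are a synthetic stand-in for the phishing dataset, and the particular filter (mutual information) is an assumption rather than the Weka method used in the paper; selection is done inside each fold to avoid the leakage pitfall described in the first abstract:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a phishing-website dataset.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=7,
                           random_state=0)

def timed_cv_accuracy(model):
    """10-fold CV accuracy plus total fit-and-score wall time."""
    start = time.perf_counter()
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    return acc, time.perf_counter() - start

acc_full, t_full = timed_cv_accuracy(RandomForestClassifier(random_state=0))

# Select seven features inside each training fold, mirroring the
# reduction to seven features reported above.
pipe = make_pipeline(SelectKBest(mutual_info_classif, k=7),
                     RandomForestClassifier(random_state=0))
acc_sel, t_sel = timed_cv_accuracy(pipe)

print(f"all 30 features: acc={acc_full:.3f}, time={t_full:.2f}s")
print(f"top 7 features : acc={acc_sel:.3f}, time={t_sel:.2f}s")
```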


2020 ◽  
Author(s):  
Jan-Hendrik Malles ◽  
Ben Marzeion

Abstract. Negative glacier mass balances in most of Earth's glacierized regions contribute roughly one quarter of the currently observed rate of sea-level rise, and have likely contributed an even larger fraction during the 20th century. The distant past and future of glaciers' mass balances, and hence their contribution to sea-level rise, can only be calculated using numerical models. Since models, independent of their complexity, always rely on some form of parameterization and on a choice of boundary conditions, a need for optimization arises. In this work, a model for computing monthly mass balances of glaciers on the global scale was forced with nine different data sets of near-surface air temperature and precipitation anomalies, as well as with their mean and median, leading to a total of eleven different forcing data sets. Five global parameters of the model's mass balance equations were varied systematically, within physically plausible ranges, for each forcing data set. We then identified optimal parameter combinations by cross-validating the model results against in-situ mass balance observations, using three criteria: model bias, temporal correlation, and the ratio between the observed and modeled temporal standard deviation of specific mass balances. The goal is to better constrain the glaciers' 20th century sea-level budget contribution and its uncertainty. We find that the disagreement between the different ensemble members is often larger than the uncertainties obtained via cross-validation, particularly in times and places where few or no validation data are available, such as the first half of the 20th century. We show that the reason for this is that the availability of mass balance observations often coincides with less uncertainty in the forcing data, so that the cross-validation procedure does not capture the true out-of-sample uncertainty of the glacier model. Therefore, ensemble spread is introduced as an additional estimate of reconstruction uncertainty, increasing the total uncertainty compared to the model uncertainty obtained in the cross-validation. Our ensemble mean estimate indicates a sea-level contribution by global glaciers (excluding the Antarctic periphery) for 1901–2018 of 76.2 ± 5.9 mm sea-level equivalent (SLE), or 0.65 ± 0.05 mm SLE yr⁻¹.
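The three cross-validation criteria named above can be written out explicitly. The function below is an illustrative NumPy formulation; the variable names, the example values, and the exact conventions (e.g. the sign of the bias) are our assumptions, not taken from the paper:

```python
import numpy as np

def validation_criteria(observed, modeled):
    """Return (bias, temporal correlation, std ratio) for one series of
    annual specific mass balances (m w.e. yr-1)."""
    observed = np.asarray(observed, dtype=float)
    modeled = np.asarray(modeled, dtype=float)
    bias = np.mean(modeled - observed)          # mean model minus obs
    corr = np.corrcoef(observed, modeled)[0, 1]  # temporal correlation
    # Ratio of observed to modeled interannual variability; ideally 1.
    std_ratio = np.std(observed, ddof=1) / np.std(modeled, ddof=1)
    return bias, corr, std_ratio

# Hypothetical five-year series of in-situ and modeled balances.
obs = [-0.4, -0.7, -0.2, -0.9, -0.5]
mod = [-0.3, -0.6, -0.3, -0.8, -0.6]
bias, corr, ratio = validation_criteria(obs, mod)
print(f"bias={bias:.3f}, corr={corr:.3f}, std ratio={ratio:.3f}")
```

An ideal parameter combination would drive the bias toward zero and both the correlation and the standard-deviation ratio toward one.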


Author(s):  
Fatemeh Alighardashi ◽  
Mohammad Ali Zare Chahooki

Improving software product quality through periodic testing before release is one of the most expensive activities in software projects. Because the resources available for testing modules are limited, it is important to identify fault-prone modules and direct the testing resources toward fault prediction in these modules. Software fault predictors based on machine learning algorithms are effective tools for identifying fault-prone modules, and extensive studies have been conducted in this field to find the connection between the features of software modules and their fault-proneness. Some of the features used by predictive algorithms are ineffective and reduce the accuracy of the prediction process, so feature selection methods are widely used to increase the performance of prediction models for fault-prone modules. In this study, we propose a feature selection method for effective selection of features by combining several filter feature selection methods into a fused weighted filter method. The proposed method improves both the convergence rate of feature selection and the prediction accuracy. The results obtained on ten datasets from NASA and PROMISE indicate the effectiveness of the proposed method in improving the accuracy and convergence of software fault prediction.
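Fusing several filter methods into one weighted score can be sketched as follows. The particular filters (chi-squared, ANOVA F, mutual information), their weights, and the min-max normalization are illustrative choices under our own assumptions, not the paper's exact fused weighted filter:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a software-metrics dataset.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative features

def minmax(scores):
    """Rescale one filter's scores to [0, 1] so filters are comparable."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min())

# One relevance-score vector per filter method (higher = more relevant).
score_vectors = [chi2(X, y)[0],
                 f_classif(X, y)[0],
                 mutual_info_classif(X, y, random_state=0)]
weights = [0.3, 0.3, 0.4]  # illustrative filter weights summing to 1

# Fused score: weighted sum of the normalized per-filter scores.
fused = sum(w * minmax(s) for w, s in zip(weights, score_vectors))
top5 = np.argsort(fused)[::-1][:5]
print("top-ranked features:", top5)
```

A single fused ranking like this lets downstream model training use one agreed-upon feature ordering instead of reconciling several conflicting filter outputs.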


Sensors ◽  
2021 ◽  
Vol 21 (2) ◽  
pp. 591
Author(s):  
Manasavee Lohvithee ◽  
Wenjuan Sun ◽  
Stephane Chretien ◽  
Manuchehr Soleimani

In this paper, a computer-aided training method for hyperparameter selection in limited-data X-ray computed tomography (XCT) reconstruction is proposed. The proposed method employs the ant colony optimisation (ACO) approach to assist in hyperparameter selection for the adaptive-weighted projection-controlled steepest descent (AwPCSD) algorithm, a total-variation (TV) based regularisation algorithm. During the implementation, a colony of artificial ants swarms through the AwPCSD algorithm. Each ant chooses a set of hyperparameters required for its iterative CT reconstruction, and a correlation coefficient (CC) score is assigned to the reconstructed image by comparison with the reference image. A colony of ants in one generation leaves pheromone along its chosen path, which represents a choice of hyperparameters; a higher score means stronger pheromone and a higher probability of attracting more ants in the next generations. At the end of the implementation, the hyperparameter configuration with the highest score is chosen as the optimal set of hyperparameters. In the experimental results section, reconstruction using hyperparameters from the proposed method was compared with results from three other cases: the conjugate gradient least squares (CGLS) algorithm, the AwPCSD algorithm using an arbitrary set of hyperparameters, and the cross-validation method. The experiments showed that the results from the proposed method were superior to those of the CGLS algorithm and of the AwPCSD algorithm with arbitrary hyperparameters. Although the results of the ACO algorithm were slightly inferior to those of the cross-validation method as measured by the quantitative metrics, the ACO algorithm was over 10 times faster than cross-validation. The optimal set of hyperparameters from the proposed method was also robust against increased noise in the data and is applicable to different imaging samples in a similar context.
The ACO approach in the proposed method was able to identify optimal hyperparameter values for a dataset and, as a result, produced a good-quality reconstructed image from a limited number of projections. The proposed method thus solves the problem of hyperparameter selection, a major challenge in implementing TV-based reconstruction algorithms.
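The pheromone mechanism described above can be sketched generically. In the toy example below, the scoring function is a stand-in for the CC score of an AwPCSD reconstruction (far too heavy to inline here), and the hyperparameter grid, colony size, deposit rule, and evaporation rate are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical discrete grid of two hyperparameters to tune.
grid = {"lam": [0.01, 0.1, 1.0], "step": [0.5, 1.0, 2.0]}

def score(params):
    # Stand-in objective peaking at lam=0.1, step=1.0; in the paper this
    # would be the CC score of a reconstruction vs. the reference image.
    return -abs(params["lam"] - 0.1) - abs(params["step"] - 1.0)

pheromone = {k: np.ones(len(v)) for k, v in grid.items()}
best, best_score = None, -np.inf

for generation in range(20):
    for ant in range(10):
        # Each ant picks one value per hyperparameter with probability
        # proportional to the pheromone on that choice.
        idx = {k: rng.choice(len(v), p=pheromone[k] / pheromone[k].sum())
               for k, v in grid.items()}
        params = {k: grid[k][i] for k, i in idx.items()}
        s = score(params)
        if s > best_score:
            best, best_score = params, s
        # Deposit pheromone on the chosen path; higher scores deposit more.
        for k, i in idx.items():
            pheromone[k][i] += np.exp(s)
    for k in pheromone:
        pheromone[k] *= 0.9  # evaporation: early choices can be unlearned

print("best hyperparameters:", best)
```

Over the generations the pheromone concentrates on the high-scoring values, so later ants spend most of their (expensive) reconstruction evaluations near the optimum rather than sampling the grid uniformly.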


2021 ◽  
Vol 15 (4) ◽  
pp. 1-46
Author(s):  
Kui Yu ◽  
Lin Liu ◽  
Jiuyong Li

In this article, we aim to develop a unified view of causal and non-causal feature selection methods, filling a gap in research on the relation between the two types of methods. Based on the Bayesian network framework and information theory, we first show that causal and non-causal feature selection methods share the same objective: to find the Markov blanket of a class attribute, the theoretically optimal feature set for classification. We then examine the assumptions made by causal and non-causal feature selection methods when searching for the optimal feature set, and unify these assumptions by mapping them to restrictions on the structure of the Bayesian network model of the studied problem. We further analyze in detail how the structural assumptions lead to the different levels of approximation employed by the methods in their search, which in turn result in approximations, with respect to the optimal feature set, in the feature sets the methods find. With the unified view, we can interpret the output of non-causal methods from a causal perspective and derive error bounds for both types of methods. Finally, we present a practical understanding of the relation between causal and non-causal methods through extensive experiments with synthetic data and various types of real-world data.
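The Markov blanket named above consists of a node's parents, its children, and its children's other parents (spouses). A minimal sketch, assuming a toy Bayesian network given as an edge list:

```python
def markov_blanket(edges, target):
    """Markov blanket of `target` in a DAG given as (parent, child)
    edges: its parents, children, and the children's other parents."""
    parents = {u for u, v in edges if v == target}
    children = {v for u, v in edges if u == target}
    spouses = {u for u, v in edges if v in children and u != target}
    return parents | children | spouses

# Toy network: A -> C <- B, C -> D <- E, F -> A; "C" is the class.
dag = [("A", "C"), ("B", "C"), ("C", "D"), ("E", "D"), ("F", "A")]
print(sorted(markov_blanket(dag, "C")))  # → ['A', 'B', 'D', 'E']
```

Note that F, although an ancestor of the class, is correctly excluded: conditioned on the blanket {A, B, D, E}, the class attribute is independent of every remaining variable, which is why the blanket is the theoretically optimal feature set.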


Author(s):  
B. Venkatesh ◽  
J. Anuradha

In microarray data, it is difficult to achieve high classification accuracy due to high dimensionality and the presence of irrelevant and noisy features; such data also contain many gene expression values but few samples. To increase classification accuracy and the processing speed of the model, an optimal number of features needs to be extracted, which can be achieved by applying a feature selection method. In this paper, we propose a hybrid ensemble feature selection method with two phases, a filter phase and a wrapper phase. In the filter phase, an ensemble technique aggregates the feature ranks of the Relief, minimum Redundancy Maximum Relevance (mRMR), and Feature Correlation (FC) filter feature selection methods, using fuzzy Gaussian membership function ordering to aggregate the ranks. In the wrapper phase, Improved Binary Particle Swarm Optimization (IBPSO) selects the optimal features, with an RBF kernel-based Support Vector Machine (SVM) classifier as the evaluator. The performance of the proposed model is compared with state-of-the-art feature selection methods on five benchmark datasets, using accuracy, recall, precision, and F1-score as performance metrics. The experimental results show that the proposed method outperforms the other feature selection methods.
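The wrapper phase builds on binary particle swarm optimization. The sketch below shows only the standard BPSO mechanism with an RBF-SVM evaluator on synthetic data; it does not include the paper's specific improvements (the "I" in IBPSO), and all swarm parameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for a (much larger) microarray dataset.
X, y = make_classification(n_samples=100, n_features=15, n_informative=4,
                           random_state=0)

def fitness(bits):
    """CV accuracy of an RBF-SVM on the feature subset encoded by bits."""
    mask = bits.astype(bool)
    if not mask.any():
        return 0.0  # an empty feature subset cannot be evaluated
    return cross_val_score(SVC(kernel="rbf"), X[:, mask], y, cv=3).mean()

n_particles, n_feat = 8, X.shape[1]
pos = rng.integers(0, 2, (n_particles, n_feat))   # one bit per feature
vel = rng.normal(size=(n_particles, n_feat))
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for it in range(10):
    r1, r2 = rng.random((2, n_particles, n_feat))
    # Velocities drift toward each particle's best and the global best.
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    # A sigmoid transfer function turns velocities into bit probabilities.
    prob = 1.0 / (1.0 + np.exp(-vel))
    pos = (rng.random((n_particles, n_feat)) < prob).astype(int)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved] = pos[improved]
    pbest_fit[improved] = fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", np.flatnonzero(gbest))
print("cv accuracy:", round(pbest_fit.max(), 3))
```

In the hybrid scheme described above, such a wrapper would search only over the features already shortlisted by the filter phase, which keeps the number of expensive SVM evaluations manageable on gene-expression data.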

