Consensus features nested cross-validation

Saeid Parvandeh; Hung-Wen Yeh; Martin P Paulus; Brett A McKinney

doi:10.1093/bioinformatics/btaa046

Consensus features nested cross-validation

Bioinformatics ◽

10.1093/bioinformatics/btaa046 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3093-3098 ◽

Cited By ~ 2

Author(s):

Saeid Parvandeh ◽

Hung-Wen Yeh ◽

Martin P Paulus ◽

Brett A McKinney

Keyword(s):

Feature Selection ◽

Classification Accuracy ◽

Cross Validation ◽

Differential Privacy ◽

Simulated Data ◽

Classification Model ◽

Supplementary Information ◽

Similar Accuracy ◽

Main Effects ◽

Feature Stability

Abstract Summary Feature selection can improve the accuracy of machine-learning models, but appropriate steps must be taken to avoid overfitting. Nested cross-validation (nCV) is a common approach that chooses the classification model and features to represent a given outer fold based on features that give the maximum inner-fold accuracy. Differential privacy is a related technique to avoid overfitting that uses a privacy-preserving noise mechanism to identify features that are stable between training and holdout sets. We develop consensus nested cross-validation (cnCV) that combines the idea of feature stability from differential privacy with nCV. Feature selection is applied in each inner fold and the consensus of top features across folds is used as a measure of feature stability or reliability instead of classification accuracy, which is used in standard nCV. We use simulated data with main effects, correlation and interactions to compare the classification accuracy and feature selection performance of the new cnCV with standard nCV, Elastic Net optimized by cross-validation, differential privacy and private evaporative cooling (pEC). We also compare these methods using real RNA-seq data from a study of major depressive disorder. The cnCV method has similar training and validation accuracy to nCV, but cnCV has much shorter run times because it does not construct classifiers in the inner folds. The cnCV method chooses a more parsimonious set of features with fewer false positives than nCV. The cnCV method has similar accuracy to pEC and cnCV selects stable features between folds without the need to specify a privacy threshold. We show that cnCV is an effective and efficient approach for combining feature selection with classification. Availability and implementation Code available at https://github.com/insilico/cncv. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Consensus Features Nested Cross-Validation

10.1101/2019.12.31.891895 ◽

2020 ◽

Author(s):

Saeid Parvandeh ◽

Hung-Wen Yeh ◽

Martin P. Paulus ◽

Brett A. McKinney

Keyword(s):

Feature Selection ◽

Classification Accuracy ◽

Cross Validation ◽

Differential Privacy ◽

Simulated Data ◽

Classification Model ◽

Supplementary Information ◽

Similar Accuracy ◽

Main Effects ◽

Feature Stability

AbstractMotivationFeature selection can improve the accuracy of machine learning models, but appropriate steps must be taken to avoid overfitting. Nested cross-validation (nCV) is a common approach that chooses the classification model and features to represent a given outer fold based on features that give the maximum inner-fold accuracy. Differential privacy is a related technique to avoid overfitting that uses a privacy preserving noise mechanism to identify features that are stable between training and holdout sets.MethodsWe develop consensus nested CV (cnCV) that combines the idea of feature stability from differential privacy with nested CV. Feature selection is applied in each inner fold and the consensus of top features across folds is a used as a measure of feature stability or reliability instead of classification accuracy, which is used in standard nCV. We use simulated data with main effects, correlation, and interactions to compare the classification accuracy and feature selection performance of the new cnCV with standard nCV, Elastic Net optimized by CV, differential privacy, and private Evaporative Cooling (pEC). We also compare these methods using real RNA-Seq data from a study of major depressive disorder.ResultsThe cnCV method has similar training and validation accuracy to nCV, but cnCV has much shorter run times because it does not construct classifiers in the inner folds. The cnCV method chooses a more parsimonious set of features with fewer false positives than nCV. The cnCV method has similar accuracy to pEC and cnCV selects stable features between folds without the need to specify a privacy threshold. We show that cnCV is an effective and efficient approach for combining feature selection with classification.AvailabilityCode available at https://github.com/insilico/[email protected] information:

Download Full-text

Nested and Repeated Cross Validation for Classification Model With High-Dimensional Data

Revista Colombiana de Estadística ◽

10.15446/rce.v43n1.80000 ◽

2020 ◽

Vol 43 (1) ◽

pp. 103-125

Author(s):

Yi Zhong ◽

Jianghua He ◽

Prabhakar Chalise

Keyword(s):

Feature Selection ◽

Cross Validation ◽

Predictive Accuracy ◽

Simulated Data ◽

Classification Model ◽

High Dimensional ◽

Clinical Settings ◽

Feature Subset ◽

Validation Method ◽

High Dimensional Datasets

With the advent of high throughput technologies, the high-dimensional datasets are increasingly available. This has not only opened up new insight into biological systems but also posed analytical challenges. One important problem is the selection of informative feature-subset and prediction of the future outcome. It is crucial that models are not overfitted and give accurate results with new data. In addition, reliable identification of informative features with high predictive power (feature selection) is of interests in clinical settings. We propose a two-step framework for feature selection and classification model construction, which utilizes a nested and repeated cross-validation method. We evaluated our approach using both simulated data and two publicly available gene expression datasets. The proposed method showed comparatively better predictive accuracy for new cases than the standard cross-validation method.

Download Full-text

A New Hybrid Feature Subset Selection Framework Based on Binary Genetic Algorithm and Information Theory

International Journal of Computational Intelligence and Applications ◽

10.1142/s1469026819500202 ◽

2019 ◽

Vol 18 (03) ◽

pp. 1950020 ◽

Cited By ~ 13

Author(s):

Alok Kumar Shukla ◽

Pradeep Singh ◽

Manu Vardhan

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Classification Accuracy ◽

B Cell Lymphoma ◽

Feature Subset Selection ◽

Classification Model ◽

Significant Feature ◽

Support Vector ◽

Feature Subset ◽

Binary Genetic Algorithm

The explosion of the high-dimensional dataset in the scientific repository has been encouraging interdisciplinary research on data mining, pattern recognition and bioinformatics. The fundamental problem of the individual Feature Selection (FS) method is extracting informative features for classification model and to seek for the malignant disease at low computational cost. In addition, existing FS approaches overlook the fact that for a given cardinality, there can be several subsets with similar information. This paper introduces a novel hybrid FS algorithm, called Filter-Wrapper Feature Selection (FWFS) for a classification problem and also addresses the limitations of existing methods. In the proposed model, the front-end filter ranking method as Conditional Mutual Information Maximization (CMIM) selects the high ranked feature subset while the succeeding method as Binary Genetic Algorithm (BGA) accelerates the search in identifying the significant feature subsets. One of the merits of the proposed method is that, unlike an exhaustive method, it speeds up the FS procedure without lancing of classification accuracy on reduced dataset when a learning model is applied to the selected subsets of features. The efficacy of the proposed (FWFS) method is examined by Naive Bayes (NB) classifier which works as a fitness function. The effectiveness of the selected feature subset is evaluated using numerous classifiers on five biological datasets and five UCI datasets of a varied dimensionality and number of instances. The experimental results emphasize that the proposed method provides additional support to the significant reduction of the features and outperforms the existing methods. For microarray data-sets, we found the lowest classification accuracy is 61.24% on SRBCT dataset and highest accuracy is 99.32% on Diffuse large B-cell lymphoma (DLBCL). In UCI datasets, the lowest classification accuracy is 40.04% on the Lymphography using k-nearest neighbor (k-NN) and highest classification accuracy is 99.05% on the ionosphere using support vector machine (SVM).

Download Full-text

The CSP-Based New Features Plus Non-Convex Log Sparse Feature Selection for Motor Imagery EEG Classification

Sensors ◽

10.3390/s20174749 ◽

2020 ◽

Vol 20 (17) ◽

pp. 4749

Author(s):

Shaorong Zhang ◽

Zhibin Zhu ◽

Benxin Zhang ◽

Bao Feng ◽

Tianyou Yu ◽

...

Keyword(s):

Feature Extraction ◽

Feature Selection ◽

Motor Imagery ◽

Classification Accuracy ◽

Feature Selection Method ◽

Extraction Methods ◽

Extraction Time ◽

Classification Model ◽

Discrete Wavelet ◽

New Feature

The common spatial pattern (CSP) is a very effective feature extraction method in motor imagery based brain computer interface (BCI), but its performance depends on the selection of the optimal frequency band. Although a lot of research works have been proposed to improve CSP, most of these works have the problems of large computation costs and long feature extraction time. To this end, three new feature extraction methods based on CSP and a new feature selection method based on non-convex log regularization are proposed in this paper. Firstly, EEG signals are spatially filtered by CSP, and then three new feature extraction methods are proposed. We called them CSP-wavelet, CSP-WPD and CSP-FB, respectively. For CSP-Wavelet and CSP-WPD, the discrete wavelet transform (DWT) or wavelet packet decomposition (WPD) is used to decompose the spatially filtered signals, and then the energy and standard deviation of the wavelet coefficients are extracted as features. For CSP-FB, the spatially filtered signals are filtered into multiple bands by a filter bank (FB), and then the logarithm of variances of each band are extracted as features. Secondly, a sparse optimization method regularized with a non-convex log function is proposed for the feature selection, which we called LOG, and an optimization algorithm for LOG is given. Finally, ensemble learning is used for secondary feature selection and classification model construction. Combing feature extraction and feature selection methods, a total of three new EEG decoding methods are obtained, namely CSP-Wavelet+LOG, CSP-WPD+LOG, and CSP-FB+LOG. Four public motor imagery datasets are used to verify the performance of the proposed methods. Compared to existing methods, the proposed methods achieved the highest average classification accuracy of 88.86, 83.40, 81.53, and 80.83 in datasets 1–4, respectively. The feature extraction time of CSP-FB is the shortest. The experimental results show that the proposed methods can effectively improve the classification accuracy and reduce the feature extraction time. With comprehensive consideration of classification accuracy and feature extraction time, CSP-FB+LOG has the best performance and can be used for the real-time BCI system.

Download Full-text

SU-CCE: A Novel Feature Selection Approach for Reducing High Dimensionality

10.3233/apc210196 ◽

2021 ◽

Author(s):

A B Pawar ◽

M A Jawale ◽

Ravi Kumar Tirandasu ◽

Saiprasad Potharaju

Keyword(s):

Feature Selection ◽

Classification Accuracy ◽

Feature Space ◽

Microarray Dataset ◽

Classification Model ◽

High Dimensionality ◽

High Dimensional ◽

Selection Approach ◽

Feature Selection Approach ◽

Careful Investigation

High dimensionality is the serious issue in the preprocessing of data mining. Having large number of features in the dataset leads to several complications for classifying an unknown instance. In a initial dataspace there may be redundant and irrelevant features present, which leads to high memory consumption, and confuse the learning model created with those properties of features. Always it is advisable to select the best features and generate the classification model for better accuracy. In this research, we proposed a novel feature selection approach and Symmetrical uncertainty and Correlation Coefficient (SU-CCE) for reducing the high dimensional feature space and increasing the classification accuracy. The experiment is performed on colon cancer microarray dataset which has 2000 features. The proposed method derived 38 best features from it. To measure the strength of proposed method, top 38 features extracted by 4 traditional filter-based methods are compared with various classifiers. After careful investigation of result, the proposed approach is competing with most of the traditional methods.

Download Full-text

binomialRF: Interpretable combinatoric efficiency of random forests to identify biomarker interactions

10.1101/681973 ◽

2019 ◽

Author(s):

Samir Rachid Zaim ◽

Colleen Kenost ◽

Joanne Berghout ◽

Wesley Chiu ◽

Liam Wilson ◽

...

Keyword(s):

Feature Selection ◽

Binomial Distribution ◽

Molecular Mechanisms ◽

Hypothesis Test ◽

Supplementary Information ◽

Biomarker Detection ◽

Test Statistic ◽

Link Type ◽

Correlated Binomial ◽

Main Effects

AbstractBackgroundIn this era of data science-driven bioinformatics, machine learning research has focused on feature selection as users want more interpretation and post-hoc analyses for biomarker detection. However, when there are more features (i.e., transcript) than samples (i.e., mice or human samples) in a study, this poses major statistical challenges in biomarker detection tasks as traditional statistical techniques are underpowered in high dimension. Second and third order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest1 (RF) classifiers are widely used2–7 due to their flexibility, powerful performance, and robustness to “P predictors ≫ subjects N” difficulties and their ability to rank features. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.MethodsbinomialRF treats each tree in a RF as a correlated but exchangeable binary trial. It determines importance by constructing a test statistic based on a feature’s selection frequency to compute its rank, nominal p-value, and multiplicity-adjusted q-value using a one-sided hypothesis test with a correlated binomial distribution. A distributional adjustment addresses the co-dependencies among trees as these trees subsample from the same dataset. The proposed algorithm efficiently identifies multiway nonlinear interactions by generalizing the test statistic to count sub-trees.ResultsIn simulations and in the Madelon benchmark datasets studies, binomialRF showed computational gains (up to 30 to 600 times faster) while maintaining competitive variable precision and recall in identifying biomarkers’ main effects and interactions. In two clinical studies, the binomialRF algorithm prioritizes previously-published relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as with their statistical interactions alone.ConclusionbinomialRF extends upon previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers’ main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide path-way-level feature selection from gene expression input data.AvailabilityGithub: https://github.com/SamirRachidZaim/binomialRFSupplementary informationSupplementary analyses and results are available at https://github.com/SamirRachidZaim/binomialRF_simulationStudy

Download Full-text

Ensemble Feature Selection for Breast Cancer Classification using Microarray Data

INTELIGENCIA ARTIFICIAL ◽

10.4114/intartif.vol23iss65pp100-114 ◽

2020 ◽

Vol 23 (65) ◽

pp. 100-114

Author(s):

Supoj Hengpraprohm ◽

Suwimol Jungjit

Keyword(s):

Breast Cancer ◽

Feature Selection ◽

Cross Validation ◽

Microarray Dataset ◽

Classification Model ◽

Cancer Data ◽

Evaluation Functions ◽

Selection For ◽

Feature Selection Approach ◽

Fold Cross Validation

For breast cancer data classification, we propose an ensemble filter feature selection approach named ‘EnSNR’. Entropy and SNR evaluation functions are used to find the features (genes) for the EnSNR subset. A Genetic Algorithm (GA) generates the classification ‘model’. The efficiency of the ‘model’ is validated using 10-Fold Cross-Validation re-sampling. The Microarray dataset used in our experiments contains 50,739 genes for each of 32 patients. When our proposed ‘EnSNR’ subset of features is used; as well as giving an enhanced degree of prediction accuracy and reducing the number of irrelevant features (genes), there is also a small saving of computer processing time.

Download Full-text

Nested cross-validation with ensemble feature selection and classification model for high-dimensional biological data

Communications in Statistics - Simulation and Computation ◽

10.1080/03610918.2020.1850790 ◽

2020 ◽

pp. 1-18

Author(s):

Yi Zhong ◽

Prabhakar Chalise ◽

Jianghua He

Keyword(s):

Feature Selection ◽

Cross Validation ◽

Biological Data ◽

Classification Model ◽

High Dimensional

Download Full-text

An Integrated Feature Selection Algorithm for Cancer Classification using Gene Expression Data

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666181220124756 ◽

2019 ◽

Vol 21 (9) ◽

pp. 631-645 ◽

Cited By ~ 5

Author(s):

Saeed Ahmed ◽

Muhammad Kabir ◽

Zakir Ali ◽

Muhammad Arif ◽

Farman Ali ◽

...

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Classification Accuracy ◽

Early Stage ◽

Small Sample Size ◽

Feature Selection Method ◽

Small Sample ◽

Expression Data ◽

Base Function

Aim and Objective: Cancer is a dangerous disease worldwide, caused by somatic mutations in the genome. Diagnosis of this deadly disease at an early stage is exceptionally new clinical application of microarray data. In DNA microarray technology, gene expression data have a high dimension with small sample size. Therefore, the development of efficient and robust feature selection methods is indispensable that identify a small set of genes to achieve better classification performance. Materials and Methods: In this study, we developed a hybrid feature selection method that integrates correlation-based feature selection (CFS) and Multi-Objective Evolutionary Algorithm (MOEA) approaches which select the highly informative genes. The hybrid model with Redial base function neural network (RBFNN) classifier has been evaluated on 11 benchmark gene expression datasets by employing a 10-fold cross-validation test. Results: The experimental results are compared with seven conventional-based feature selection and other methods in the literature, which shows that our approach owned the obvious merits in the aspect of classification accuracy ratio and some genes selected by extensive comparing with other methods. Conclusion: Our proposed CFS-MOEA algorithm attained up to 100% classification accuracy for six out of eleven datasets with a minimal sized predictive gene subset.

Download Full-text

A fuzzy gaussian rank aggregation ensemble feature selection method for microarray data

International Journal of Knowledge-based and Intelligent Engineering Systems ◽

10.3233/kes-190134 ◽

2021 ◽

Vol 24 (4) ◽

pp. 289-301

Author(s):

B. Venkatesh ◽

J. Anuradha

Keyword(s):

Feature Selection ◽

Microarray Data ◽

Classification Accuracy ◽

Performance Metrics ◽

Feature Selection Method ◽

Selection Method ◽

Support Vector ◽

Svm Classifier ◽

Binary Particle Swarm Optimization ◽

Selection Methods

In Microarray Data, it is complicated to achieve more classification accuracy due to the presence of high dimensions, irrelevant and noisy data. And also It had more gene expression data and fewer samples. To increase the classification accuracy and the processing speed of the model, an optimal number of features need to extract, this can be achieved by applying the feature selection method. In this paper, we propose a hybrid ensemble feature selection method. The proposed method has two phases, filter and wrapper phase in filter phase ensemble technique is used for aggregating the feature ranks of the Relief, minimum redundancy Maximum Relevance (mRMR), and Feature Correlation (FC) filter feature selection methods. This paper uses the Fuzzy Gaussian membership function ordering for aggregating the ranks. In wrapper phase, Improved Binary Particle Swarm Optimization (IBPSO) is used for selecting the optimal features, and the RBF Kernel-based Support Vector Machine (SVM) classifier is used as an evaluator. The performance of the proposed model are compared with state of art feature selection methods using five benchmark datasets. For evaluation various performance metrics such as Accuracy, Recall, Precision, and F1-Score are used. Furthermore, the experimental results show that the performance of the proposed method outperforms the other feature selection methods.

Download Full-text