Nested cross-validation with ensemble feature selection and classification model for high-dimensional biological data

Author(s):  
Yi Zhong ◽  
Prabhakar Chalise ◽  
Jianghua He
2020 ◽  
Vol 43 (1) ◽  
pp. 103-125

With the advent of high-throughput technologies, high-dimensional datasets are increasingly available. This has not only opened new insights into biological systems but also posed analytical challenges. One important problem is the selection of an informative feature subset and the prediction of future outcomes. It is crucial that models are not overfitted and give accurate results with new data. In addition, reliable identification of informative features with high predictive power (feature selection) is of interest in clinical settings. We propose a two-step framework for feature selection and classification model construction that uses a nested and repeated cross-validation method. We evaluated our approach using both simulated data and two publicly available gene expression datasets. The proposed method showed better predictive accuracy for new cases than the standard cross-validation method.
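The layout the authors describe can be sketched in miniature. This is a minimal pure-Python skeleton of nested cross-validation, not the authors' implementation: `select_features` and `fit_score` are hypothetical callables standing in for the paper's ensemble feature selection and classifier-fitting steps.

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k shuffled, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv(X, y, select_features, fit_score, outer_k=5, inner_k=3):
    """Outer folds estimate generalization error; feature selection and
    model tuning happen only on inner training data, never on the
    held-out outer test fold, which prevents selection bias."""
    outer_scores = []
    for test_idx in kfold_indices(len(X), outer_k):
        test_set = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in test_set]
        # Step 1: choose a feature subset via inner CV on training data only.
        feats = select_features([X[i] for i in train_idx],
                                [y[i] for i in train_idx], inner_k)
        # Step 2: fit on the outer-train split, score on the outer-test fold.
        outer_scores.append(fit_score(feats, train_idx, test_idx))
    return sum(outer_scores) / len(outer_scores)
```

The key property is that the outer test fold never influences which features are chosen, so the averaged outer score is an honest estimate of performance on new cases.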


2021 ◽  
Author(s):  
A B Pawar ◽  
M A Jawale ◽  
Ravi Kumar Tirandasu ◽  
Saiprasad Potharaju

High dimensionality is a serious issue in data mining preprocessing. A large number of features in a dataset leads to several complications when classifying an unknown instance. The initial data space may contain redundant and irrelevant features, which increase memory consumption and confuse the learning model built on them. It is therefore advisable to select the best features before generating the classification model, for better accuracy. In this research, we propose a novel feature selection approach based on Symmetrical Uncertainty and the Correlation Coefficient (SU-CCE) for reducing the high-dimensional feature space and increasing classification accuracy. The experiment is performed on a colon cancer microarray dataset with 2000 features, from which the proposed method derives the 38 best features. To measure the strength of the proposed method, the top 38 features extracted by four traditional filter-based methods are compared across various classifiers. Careful investigation of the results shows that the proposed approach is competitive with most of the traditional methods.
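Symmetrical uncertainty, the information-theoretic half of the SU-CCE approach, has a standard definition: mutual information normalized by the two variables' entropies. A minimal sketch for discrete features (the paper's exact combination with the correlation coefficient is not reproduced here):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a discrete variable, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), bounded in [0, 1].
    I(X; Y) = H(X) + H(Y) - H(X, Y) is the mutual information, so SU is 1
    when the feature fully determines the class and 0 when independent."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    hxy = entropy(list(zip(x, y)))
    return 2.0 * (hx + hy - hxy) / (hx + hy)
```

A filter method ranks each of the 2000 genes by SU against the class label and keeps the top-scoring subset, which is how a 38-feature subset can be carved out of the full space.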


2020 ◽  
Vol 23 (65) ◽  
pp. 100-114
Author(s):  
Supoj Hengpraprohm ◽  
Suwimol Jungjit

For breast cancer data classification, we propose an ensemble filter feature selection approach named 'EnSNR'. Entropy and signal-to-noise ratio (SNR) evaluation functions are used to find the features (genes) for the EnSNR subset. A Genetic Algorithm (GA) generates the classification model, whose efficiency is validated using 10-fold cross-validation resampling. The microarray dataset used in our experiments contains 50,739 genes for each of 32 patients. When our proposed EnSNR subset of features is used, it not only enhances prediction accuracy and reduces the number of irrelevant features (genes) but also yields a small saving in computer processing time.
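The SNR half of such an ensemble filter is commonly the Golub-style signal-to-noise score for two-class gene expression data. A minimal sketch, assuming that scoring (the paper's exact EnSNR combination with entropy is not reproduced):

```python
from statistics import mean, stdev

def snr_score(values, labels):
    """Golub-style signal-to-noise ratio for one gene across two classes:
    |mu0 - mu1| / (sigma0 + sigma1). Large values mean the gene's
    expression separates the classes well relative to its spread."""
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    return abs(mean(g0) - mean(g1)) / (stdev(g0) + stdev(g1))

def top_k_by_snr(gene_columns, labels, k):
    """Rank gene columns by SNR and keep the k highest-scoring ones."""
    order = sorted(range(len(gene_columns)),
                   key=lambda j: snr_score(gene_columns[j], labels),
                   reverse=True)
    return order[:k]
```

An ensemble filter would intersect or merge this SNR ranking with an entropy-based ranking before handing the reduced gene subset to the GA-built classifier.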


2020 ◽  
Vol 25 (6) ◽  
pp. 729-735
Author(s):  
Venkata Rao Maddumala ◽  
Arunkumar R

This paper presents a feature extraction technique for multimedia, which has become a challenging task with the growth of big data. Analyzing and extracting valuable features from high-dimensional datasets pushes the limits of statistical methods and strategies, and conventional techniques generally perform poorly on such data. Small sample size has always been an issue in statistical testing, and it is aggravated in high-dimensional data, where the number of features equals or exceeds the number of samples. The power of any statistical test is directly related to its ability to reject a false null hypothesis, and sample size is a major factor in controlling the error probabilities needed to draw valid conclusions. Thus, one of the effective ways of handling high-dimensional datasets is to reduce their dimension through feature selection and extraction, so that valid and accurate analysis becomes practical. Clustering is the act of finding hidden or similar patterns in data; it is one of the most widely used techniques for discovering useful features, in which a weight is assigned to each feature without predefining the classes. In any feature selection and extraction procedure, the three main concerns are statistical accuracy, model interpretability, and computational complexity, and for any classification model it is important that none of these three is compromised. In this manuscript, a Weight Based Feature Extraction Model on Multifaceted Multimedia Big Data (WbFEM-MMB) is proposed, which extracts useful features from videos. The feature extraction strategy uses frequency-domain features from the Discrete Cosine Transform (DCT), and features are then extracted using a pre-trained Convolutional Neural Network (CNN).
The proposed method is compared with traditional methods, and the results show that it achieves better performance and accuracy in extracting features from multifaceted multimedia data.
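The DCT step of such a pipeline is well defined even in isolation: low-index DCT coefficients concentrate most of a signal's energy, so truncating the transform yields a compact frequency-domain feature vector. A minimal 1-D DCT-II sketch (the paper's full video/CNN pipeline is far larger and is not reproduced here):

```python
from math import cos, pi, sqrt

def dct_ii(x):
    """Orthonormal 1-D DCT-II. Coefficient k measures how much of the
    signal varies at frequency k; k = 0 captures the mean level."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * cos(pi * (i + 0.5) * k / n) for i in range(n))
        scale = sqrt(1 / n) if k == 0 else sqrt(2 / n)
        out.append(scale * s)
    return out

def dct_features(signal, keep):
    """Keep only the first `keep` coefficients as a compact feature vector."""
    return dct_ii(signal)[:keep]
```

For image or video blocks the same idea is applied in 2-D (rows then columns), and the truncated coefficient block is what gets passed downstream as a feature.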


2012 ◽  
Vol 28 (21) ◽  
pp. 2834-2842 ◽  
Author(s):  
Haitian Wang ◽  
Shaw-Hwa Lo ◽  
Tian Zheng ◽  
Inchi Hu

2018 ◽  
Vol 30 (7) ◽  
pp. 1352-1365 ◽  
Author(s):  
Makoto Yamada ◽  
Jiliang Tang ◽  
Jose Lugo-Martinez ◽  
Ermin Hodzic ◽  
Raunak Shrestha ◽  
...  

2021 ◽  
Author(s):  
Peng-fei Ke ◽  
Dong-sheng Xiong ◽  
Jia-hui Li ◽  
Shi-jia Li ◽  
Jie Song ◽  
...  

Abstract Finding effective and objective biomarkers to inform the diagnosis of schizophrenia is of great importance yet remains challenging, and there has been relatively little work on using multiple types of biological data for this purpose. In this cross-sectional study, we extracted multiple features from three types of biological data: gut microbiota data, blood data, and electroencephalogram data. An integrated machine learning framework, consisting of five classifiers, three feature selection algorithms, and four cross-validation methods, was then used to discriminate patients with schizophrenia from healthy controls. Our results showed that the classifier using multi-biological data performed better than classifiers using any single biological data type, with 91.7% accuracy and 96.5% AUC. The most discriminative features (top 5%) for the classification included gut microbiota features (Lactobacillus, Haemophilus, and Prevotella), blood features (superoxide dismutase, monocyte-lymphocyte ratio, and neutrophil), and electroencephalogram features (nodal local efficiency, nodal efficiency, and nodal shortest path length in the temporal and frontal-parietal areas). The proposed integrated framework may help in understanding the pathophysiology of schizophrenia and in developing biomarkers for it from multi-biological data.
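The multi-biological design rests on one mechanical step: per-feature standardization within each modality followed by concatenation per subject, so that microbiota, blood, and EEG features on very different scales can feed one classifier. A minimal sketch of that fusion step (an assumption about the preprocessing, not the authors' code):

```python
from statistics import mean, stdev

def zscore(column):
    """Standardize one feature column to zero mean and unit variance."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column] if s > 0 else [0.0] * len(column)

def standardize(matrix):
    """Apply zscore to every feature column of a subjects-by-features matrix."""
    zcols = [zscore(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*zcols)]

def fuse_modalities(*modalities):
    """Standardize each modality, then concatenate features per subject
    (rows must be aligned: row i is subject i in every modality)."""
    std = [standardize(m) for m in modalities]
    fused = []
    for rows in zip(*std):
        vec = []
        for row in rows:
            vec.extend(row)
        fused.append(vec)
    return fused
```

The fused vectors can then be fed to any of the classifier/feature-selection/CV combinations the framework evaluates.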


2020 ◽  
Vol 36 (10) ◽  
pp. 3093-3098 ◽  
Author(s):  
Saeid Parvandeh ◽  
Hung-Wen Yeh ◽  
Martin P Paulus ◽  
Brett A McKinney

Abstract Feature selection can improve the accuracy of machine-learning models, but appropriate steps must be taken to avoid overfitting. Nested cross-validation (nCV) is a common approach that chooses the classification model and features to represent a given outer fold based on features that give the maximum inner-fold accuracy. Differential privacy is a related technique to avoid overfitting that uses a privacy-preserving noise mechanism to identify features that are stable between training and holdout sets. We develop consensus nested cross-validation (cnCV) that combines the idea of feature stability from differential privacy with nCV. Feature selection is applied in each inner fold and the consensus of top features across folds is used as a measure of feature stability or reliability instead of classification accuracy, which is used in standard nCV. We use simulated data with main effects, correlation and interactions to compare the classification accuracy and feature selection performance of the new cnCV with standard nCV, Elastic Net optimized by cross-validation, differential privacy and private evaporative cooling (pEC). We also compare these methods using real RNA-seq data from a study of major depressive disorder. The cnCV method has similar training and validation accuracy to nCV, but cnCV has much shorter run times because it does not construct classifiers in the inner folds. The cnCV method chooses a more parsimonious set of features with fewer false positives than nCV. The cnCV method has similar accuracy to pEC and cnCV selects stable features between folds without the need to specify a privacy threshold. We show that cnCV is an effective and efficient approach for combining feature selection with classification. Availability and implementation: Code available at https://github.com/insilico/cncv. Supplementary information: Supplementary data are available at Bioinformatics online.
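The consensus step that distinguishes cnCV from standard nCV is simple to state: each inner fold produces a feature ranking, and the features kept are those that appear in every fold's top-k list, with no inner classifiers trained at all. A minimal sketch of that consensus rule (a simplified reading of the description above; see the linked repository for the authors' implementation):

```python
from collections import Counter

def consensus_features(inner_fold_rankings, top_k):
    """cnCV-style consensus: keep the features that appear in the top-k
    list of EVERY inner fold. Stability across folds replaces inner-fold
    classification accuracy as the selection criterion, so no classifier
    is ever fit inside the inner loop."""
    counts = Counter()
    for ranking in inner_fold_rankings:
        counts.update(ranking[:top_k])
    n_folds = len(inner_fold_rankings)
    return sorted(f for f, c in counts.items() if c == n_folds)
```

Because only rankings are intersected, the inner loop's cost drops from fitting classifiers to running one filter per fold, which is the source of the shorter run times reported above.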


2013 ◽  
Author(s):  
Natapol Pornputtapong ◽  
Amporn Atsawarungruangkit ◽  
Kawee Numpacharoen
