Nested and Repeated Cross Validation for Classification Model With High-Dimensional Data

2020 ◽  
Vol 43 (1) ◽  
pp. 103-125
Author(s):  
Yi Zhong ◽  
Jianghua He ◽  
Prabhakar Chalise

With the advent of high-throughput technologies, high-dimensional datasets are increasingly available. This has not only opened up new insights into biological systems but also posed analytical challenges. One important problem is the selection of an informative feature subset and the prediction of future outcomes. It is crucial that models are not overfitted and give accurate results on new data. In addition, reliable identification of informative features with high predictive power (feature selection) is of interest in clinical settings. We propose a two-step framework for feature selection and classification model construction that utilizes a nested and repeated cross-validation method. We evaluated our approach using both simulated data and two publicly available gene expression datasets. The proposed method showed comparatively better predictive accuracy on new cases than the standard cross-validation method.
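A rough sketch of repeated nested cross-validation for joint feature selection and accuracy estimation, assuming scikit-learn with a univariate F-test filter and logistic regression as stand-ins for the authors' unspecified selector and classifier (the inner loop tunes the number of features; the outer loop estimates the accuracy of the whole select-then-fit procedure; the repeats average over random fold splits):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline

def make_pipe(k):
    # Selection happens inside the pipeline, so it is refit per fold
    # and never sees the held-out samples.
    return Pipeline([("sel", SelectKBest(f_classif, k=k)),
                     ("clf", LogisticRegression(max_iter=1000))])

def repeated_nested_cv(X, y, n_repeats=5, n_outer=5, n_inner=3, k_grid=(10, 50, 100)):
    accs = []
    for rep in range(n_repeats):
        outer = StratifiedKFold(n_outer, shuffle=True, random_state=rep)
        for tr, te in outer.split(X, y):
            inner = StratifiedKFold(n_inner, shuffle=True, random_state=rep)
            best_k, best_acc = None, -np.inf
            for k in k_grid:  # inner loop: choose k on the outer-training data only
                scores = []
                for itr, ite in inner.split(X[tr], y[tr]):
                    pipe = make_pipe(k).fit(X[tr][itr], y[tr][itr])
                    scores.append(pipe.score(X[tr][ite], y[tr][ite]))
                if np.mean(scores) > best_acc:
                    best_acc, best_k = np.mean(scores), k
            # Refit with the winning k and score once on the untouched outer fold
            accs.append(make_pipe(best_k).fit(X[tr], y[tr]).score(X[te], y[te]))
    return np.mean(accs)

X, y = make_classification(n_samples=100, n_features=500, n_informative=10, random_state=0)
print(repeated_nested_cv(X, y))
```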

2020 ◽  
Vol 25 (6) ◽  
pp. 729-735
Author(s):  
Venkata Rao Maddumala ◽  
Arunkumar R

This paper presents a technique for feature extraction from multimedia, a challenging task when handling big data. Analyzing and extracting valuable features from high-dimensional datasets pushes the limits of statistical methods and strategies. Conventional techniques generally perform poorly on high-dimensional datasets. Small sample size has always been an issue in statistical tests, and it is aggravated in high-dimensional data because the number of features equals or exceeds the number of samples. The power of any statistical test is directly proportional to its ability to reject a null hypothesis, and sample size is a major factor in controlling the error probabilities needed to draw valid conclusions. Thus, one of the effective ways of handling high-dimensional datasets is to reduce their dimension through feature selection and extraction, so that valid and accurate analysis becomes practical. Clustering is the task of finding hidden or similar patterns in data. It is one of the most widely recognized techniques for learning useful features, in which a weight is assigned to each feature without predefining the classes. In any feature selection and extraction procedure, the three main considerations are statistical accuracy, model interpretability, and computational complexity. For any classification model, it is important to ensure that none of these three aspects is compromised. In this manuscript, a Weight Based Feature Extraction Model on Multifaceted Multimedia Big Data (WbFEM-MMB) is proposed, which extracts useful features from videos. The feature extraction strategy uses discrete cosine transform (DCT) features together with features extracted by a pre-trained Convolutional Neural Network (CNN). The proposed method is compared with traditional methods, and the results show that it achieves better performance and accuracy in extracting features from multifaceted multimedia data.
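The WbFEM-MMB pipeline itself is not reproduced here; as a minimal sketch of the DCT side of such a scheme, assuming SciPy, toy grayscale frames, and feature variance as a hypothetical stand-in for the paper's weighting rule:

```python
import numpy as np
from scipy.fft import dctn

def dct_features(frames, k=8):
    """Per-frame 2D DCT; keep the k x k low-frequency corner as a compact descriptor."""
    return np.asarray([dctn(f, norm="ortho")[:k, :k].ravel() for f in frames])

def weight_and_select(feats, top=32):
    """Assign each feature a weight (here: its variance across frames,
    a stand-in for the paper's weighting scheme) and keep the top ones."""
    w = feats.var(axis=0)
    idx = np.argsort(w)[::-1][:top]
    return feats[:, idx], w[idx]

# Toy usage: ten random 64x64 grayscale "frames"
frames = np.random.default_rng(0).random((10, 64, 64))
selected, weights = weight_and_select(dct_features(frames))
print(selected.shape)  # (10, 32)
```

A real pipeline would concatenate these DCT descriptors with CNN features before weighting, but the weighting-then-selection step has the same shape.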


2020 ◽  
Vol 36 (10) ◽  
pp. 3093-3098 ◽  
Author(s):  
Saeid Parvandeh ◽  
Hung-Wen Yeh ◽  
Martin P Paulus ◽  
Brett A McKinney

Abstract. Feature selection can improve the accuracy of machine-learning models, but appropriate steps must be taken to avoid overfitting. Nested cross-validation (nCV) is a common approach that chooses the classification model and features to represent a given outer fold based on features that give the maximum inner-fold accuracy. Differential privacy is a related technique to avoid overfitting that uses a privacy-preserving noise mechanism to identify features that are stable between training and holdout sets. We develop consensus nested cross-validation (cnCV), which combines the idea of feature stability from differential privacy with nCV. Feature selection is applied in each inner fold, and the consensus of top features across folds is used as a measure of feature stability or reliability instead of the classification accuracy used in standard nCV. We use simulated data with main effects, correlation and interactions to compare the classification accuracy and feature selection performance of the new cnCV with standard nCV, Elastic Net optimized by cross-validation, differential privacy and private evaporative cooling (pEC). We also compare these methods using real RNA-seq data from a study of major depressive disorder. The cnCV method has similar training and validation accuracy to nCV, but much shorter run times because it does not construct classifiers in the inner folds. It chooses a more parsimonious set of features with fewer false positives than nCV, has similar accuracy to pEC, and selects stable features between folds without the need to specify a privacy threshold. We show that cnCV is an effective and efficient approach for combining feature selection with classification.
Availability and implementation: Code available at https://github.com/insilico/cncv.
Supplementary information: Supplementary data are available at Bioinformatics online.
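A minimal sketch of the consensus step, assuming a univariate F-test as the inner-fold feature scorer in place of the Relief-based scores the authors use; note that no classifiers are trained in the inner folds:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.model_selection import StratifiedKFold

def consensus_ncv_features(X, y, n_outer=5, n_inner=5, top_k=50, seed=0):
    outer = StratifiedKFold(n_outer, shuffle=True, random_state=seed)
    outer_sets = []
    for tr, _ in outer.split(X, y):
        counts = Counter()
        inner = StratifiedKFold(n_inner, shuffle=True, random_state=seed)
        for itr, _ in inner.split(X[tr], y[tr]):
            scores, _ = f_classif(X[tr][itr], y[tr][itr])
            counts.update(np.argsort(scores)[::-1][:top_k])  # top-k per inner fold
        # Keep features that reach the top-k in every inner fold (the consensus)
        outer_sets.append({f for f, c in counts.items() if c == n_inner})
    # Final set: features stable across all outer folds as well
    return set.intersection(*outer_sets)

X, y = make_classification(n_samples=100, n_features=500, n_informative=10, random_state=0)
print(sorted(consensus_ncv_features(X, y)))
```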


Algorithms ◽  
2021 ◽  
Vol 14 (11) ◽  
pp. 324
Author(s):  
Yuanzi Zhang ◽  
Jing Wang ◽  
Xiaolin Li ◽  
Shiguo Huang ◽  
Xiuli Wang

There are generally many redundant and irrelevant features in high-dimensional datasets, which degrade classification performance and extend execution time. To tackle this problem, feature selection techniques are used to screen out redundant and irrelevant features. The artificial bee colony (ABC) algorithm is a popular meta-heuristic algorithm with high exploration but low exploitation capacity. To balance these two capacities, a novel ABC framework is proposed in this paper. Specifically, the solutions are first updated in the employed bee phase, which retains the original exploration ability so that the algorithm can explore the solution space extensively. Then, the solutions are modified in the onlooker bee phase by the updating mechanism of an algorithm with strong exploitation ability. Finally, we remove the scout bee phase from the framework, which both curbs the excessive exploration and speeds up the algorithm. To verify this idea, the operators of the grey wolf optimization (GWO) algorithm and the whale optimization algorithm (WOA) are introduced into the framework to enhance the exploitation capability of the onlooker bees, yielding two algorithms named BABCGWO and BABCWOA, respectively. On 12 high-dimensional datasets, these two algorithms outperform four state-of-the-art feature selection algorithms in terms of classification error rate, feature subset size and execution speed.
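A compact sketch of the framework's skeleton, with scikit-learn KNN cross-validation accuracy as the fitness, a sigmoid transfer function for binarization, and a simplified move-toward-best onlooker update standing in for the GWO/WOA operators; all of these concrete choices are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def binarize(v):
    """Sigmoid transfer: map a continuous position to a 0/1 feature mask."""
    p = 1 / (1 + np.exp(-np.clip(v, -10, 10)))
    return (rng.random(v.size) < p).astype(int)

def fitness(mask, X, y):
    if mask.sum() == 0:
        return -np.inf
    acc = cross_val_score(KNeighborsClassifier(3), X[:, mask.astype(bool)], y, cv=3).mean()
    return acc - 0.01 * mask.mean()  # small penalty on subset size

def hybrid_abc(X, y, n_bees=10, n_iter=10):
    d = X.shape[1]
    pos = rng.normal(size=(n_bees, d))
    masks = np.array([binarize(p) for p in pos])
    fits = np.array([fitness(m, X, y) for m in masks])
    for _ in range(n_iter):
        # Employed-bee phase: classic ABC neighbour search (keeps exploration)
        for i in range(n_bees):
            j = rng.integers(n_bees)
            cand = pos[i] + rng.uniform(-1, 1, d) * (pos[i] - pos[j])
            m = binarize(cand)
            f = fitness(m, X, y)
            if f > fits[i]:
                pos[i], masks[i], fits[i] = cand, m, f
        # Onlooker phase: exploit around the current best (simplified stand-in
        # for the GWO/WOA update rules the paper plugs in here)
        best = pos[np.argmax(fits)]
        for i in range(n_bees):
            cand = best - rng.uniform(0, 2, d) * np.abs(best - pos[i])
            m = binarize(cand)
            f = fitness(m, X, y)
            if f > fits[i]:
                pos[i], masks[i], fits[i] = cand, m, f
        # Scout-bee phase deliberately removed, as in the proposed framework
    return masks[np.argmax(fits)]

X, y = make_classification(n_samples=80, n_features=30, n_informative=5, random_state=0)
print("features kept:", hybrid_abc(X, y).sum())
```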


2015 ◽  
Vol 2015 ◽  
pp. 1-13 ◽  
Author(s):  
Jin-Jia Wang ◽  
Fang Xue ◽  
Hui Li

Feature extraction and classification of EEG signals are core parts of brain-computer interfaces (BCIs). Due to the high dimension of the EEG feature vector, an effective feature selection algorithm has become an integral part of research studies. In this paper, we present a new method based on a wrapped Sparse Group Lasso for channel and feature selection of fused EEG signals. High-dimensional fused features are first obtained, including power spectrum, time-domain statistics, AR model and wavelet coefficient features extracted from the preprocessed EEG signals. The wrapped channel and feature selection method is then applied, which uses a logistic regression model with a Sparse Group Lasso penalty. The model is fitted on the training data, and parameters are estimated by a modified blockwise coordinate descent and coordinate gradient descent method. The best parameters and feature subset are selected using 10-fold cross-validation. Finally, the test data are classified using the trained model. Compared with existing channel and feature selection methods, the results show that the proposed method is more suitable, more stable and faster for high-dimensional feature fusion. It can simultaneously achieve channel and feature selection with a lower error rate. The test accuracy on data from the international BCI Competition IV reached 84.72%.
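A simplified sketch of the core penalized model, using proximal gradient steps in place of the modified blockwise coordinate descent described above; `groups` maps each feature to its channel, so a whole channel drops out when its group norm is shrunk to zero (all names and hyperparameters here are illustrative):

```python
import numpy as np

def sgl_logistic(X, y, groups, lam1=0.05, lam2=0.05, lr=0.1, n_iter=500):
    """Logistic regression with a sparse group lasso penalty:
    lam1 * ||w||_1 + lam2 * sum_g ||w_g||_2, fitted by proximal gradient."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y)) / n                          # gradient step on logistic loss
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam1, 0)  # elementwise soft-threshold
        for g in np.unique(groups):                            # groupwise shrinkage
            idx = groups == g
            norm = np.linalg.norm(w[idx])
            if norm > 0:
                w[idx] *= max(0.0, 1 - lr * lam2 / norm)
    return w

# Toy usage: 8 "channels" x 5 features each; only channel 0 carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = (X[:, :5].sum(axis=1) > 0).astype(float)
groups = np.repeat(np.arange(8), 5)
w = sgl_logistic(X, y, groups)
print("channels kept:", np.unique(groups[w != 0]))
```

The elementwise soft-threshold followed by groupwise shrinkage is the closed-form proximal operator of the combined penalty, which is what makes simultaneous channel and feature selection possible.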


Information ◽  
2019 ◽  
Vol 10 (6) ◽  
pp. 187
Author(s):  
Rattanawadee Panthong ◽  
Anongnart Srivihok

Liver cancer data typically form large multidimensional datasets. A dataset with a huge number of features and multiple classes may contain many features that are irrelevant to pattern classification in machine learning. Hence, feature selection improves the performance of the classification model and helps achieve maximum classification accuracy. The aims of the present study were to find the best feature subset and to evaluate the classification performance of the predictive model. This paper proposes a hybrid feature selection approach combining information gain and sequential forward selection based on the class-dependent technique (IGSFS-CD) for the liver cancer classification model. Two different classifiers (decision tree and naïve Bayes) were used to evaluate feature subsets. The liver cancer datasets were obtained from the Cancer Hospital Thailand database. Three ensemble methods (ensemble classifiers, bagging and AdaBoost) were applied to improve the performance of classification. The IGSFS-CD method achieved a good accuracy of 78.36% (sensitivity 0.7841, specificity 0.9159) on LC_dataset-I, while LC_dataset-II delivered the best performance with an accuracy of 84.82% (sensitivity 0.8481, specificity 0.9437). The IGSFS-CD method achieved better classification performance than the class-independent method, and the best feature subset selection helped reduce the complexity of the predictive model.
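A rough sketch of the filter-plus-forward-search idea, assuming scikit-learn's mutual information as the information-gain filter and naive Bayes as the evaluator; the class-dependent variant in the paper repeats the search per class, which is omitted here for brevity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def ig_sfs(X, y, n_candidates=20, max_features=10):
    # Filter stage: rank features by information gain (mutual information)
    ranked = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1][:n_candidates]
    selected, best = [], -np.inf
    # Wrapper stage: greedily add the candidate that improves CV accuracy most
    while len(selected) < max_features:
        scores = {f: cross_val_score(GaussianNB(), X[:, selected + [f]], y, cv=5).mean()
                  for f in ranked if f not in selected}
        f, s = max(scores.items(), key=lambda kv: kv[1])
        if s <= best:  # stop when no candidate improves the score
            break
        selected.append(f)
        best = s
    return selected, best

X, y = make_classification(n_samples=150, n_features=100, n_informative=8, random_state=0)
feats, acc = ig_sfs(X, y)
print(feats, round(acc, 3))
```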


2020 ◽  
Vol 2020 ◽  
pp. 1-14 ◽  
Author(s):  
Yong Liu ◽  
Shenggen Ju ◽  
Junfeng Wang ◽  
Chong Su

Feature selection methods are designed to select representative feature subsets from the original feature set by evaluating feature relevance in different ways, with the aim of reducing the dimension of the features while maintaining the predictive accuracy of a classifier. In this study, we propose a feature selection method for text classification based on independent feature space search. First, a relative document-term frequency difference (RDTFD) method is proposed to divide the features in all text documents into two independent feature sets according to the features' ability to discriminate positive and negative samples. This serves two important functions: it strengthens the correlation between features and classes while reducing the correlation among the features themselves, and it reduces the search range of the feature space while maintaining appropriate feature redundancy. Second, a feature search strategy is used to find the optimal feature subset in each independent feature space, which can improve the performance of text classification. Finally, in experiments conducted on six benchmark corpora, the results show that the RDTFD method based on independent feature space search is more robust than the other feature selection methods.
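A minimal sketch of the feature-space split, assuming a binary document-term matrix and a plain document-frequency difference as the score; the exact RDTFD formula in the paper may differ:

```python
import numpy as np

def rdtfd_split(X, y):
    """Assign each term to the feature space of the class it discriminates.
    X: binary document-term matrix (n_docs x n_terms); y in {0, 1}."""
    df_pos = X[y == 1].mean(axis=0)  # fraction of positive docs containing each term
    df_neg = X[y == 0].mean(axis=0)
    diff = df_pos - df_neg
    pos_space = np.where(diff > 0)[0]  # terms that favour the positive class
    neg_space = np.where(diff < 0)[0]  # terms that favour the negative class
    return pos_space, neg_space

# Toy usage: the search stage would then look for subsets inside each space separately
rng = np.random.default_rng(0)
X = (rng.random((200, 1000)) < 0.1).astype(int)
y = rng.integers(0, 2, 200)
pos_space, neg_space = rdtfd_split(X, y)
print(len(pos_space), len(neg_space))
```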


Author(s):  
Alok Kumar Shukla ◽  
Pradeep Singh ◽  
Manu Vardhan

The explosion of high-dimensional datasets in scientific repositories has encouraged interdisciplinary research in data mining, pattern recognition and bioinformatics. The fundamental problem for an individual feature selection (FS) method is to extract informative features for a classification model, for instance to detect malignant disease, at low computational cost. In addition, existing FS approaches overlook the fact that, for a given cardinality, there can be several subsets with similar information. This paper introduces a novel hybrid FS algorithm, called Filter-Wrapper Feature Selection (FWFS), for classification problems, and also addresses the limitations of existing methods. In the proposed model, the front-end filter ranking method, Conditional Mutual Information Maximization (CMIM), selects a high-ranked feature subset, while the succeeding method, a Binary Genetic Algorithm (BGA), accelerates the search for significant feature subsets. One merit of the proposed method is that, unlike an exhaustive method, it speeds up the FS procedure without sacrificing classification accuracy on the reduced dataset when a learning model is applied to the selected subsets of features. The efficacy of the proposed FWFS method is examined with a Naive Bayes (NB) classifier, which serves as the fitness function. The effectiveness of the selected feature subsets is evaluated using numerous classifiers on five biological datasets and five UCI datasets of varied dimensionality and numbers of instances. The experimental results emphasize that the proposed method significantly reduces the number of features and outperforms the existing methods. For the microarray datasets, the lowest classification accuracy is 61.24% on the SRBCT dataset and the highest is 99.32% on diffuse large B-cell lymphoma (DLBCL). For the UCI datasets, the lowest classification accuracy is 40.04% on Lymphography using k-nearest neighbor (k-NN) and the highest is 99.05% on Ionosphere using a support vector machine (SVM).
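A toy sketch of the filter-wrapper pipeline, with scikit-learn's mutual information standing in for CMIM (true CMIM also conditions on already-picked features) and a small binary GA scored by NB cross-validation accuracy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def fwfs(X, y, n_filtered=30, pop=20, gens=15):
    # Filter stage (stand-in for CMIM): keep the top-ranked features
    keep = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1][:n_filtered]
    Xf = X[:, keep]

    def fit_fn(mask):  # NB accuracy as the GA fitness, as in the paper
        if mask.sum() == 0:
            return 0.0
        return cross_val_score(GaussianNB(), Xf[:, mask.astype(bool)], y, cv=3).mean()

    popn = rng.integers(0, 2, (pop, n_filtered))
    for _ in range(gens):
        fits = np.array([fit_fn(m) for m in popn])
        elite = popn[np.argsort(fits)[::-1][: pop // 2]]  # truncation selection
        children = []
        for _ in range(pop - len(elite)):
            a, b = elite[rng.integers(len(elite), size=2)]
            cut = rng.integers(1, n_filtered)             # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_filtered) < 0.05          # bitwise mutation
            children.append(np.where(flip, 1 - child, child))
        popn = np.vstack([elite, children])
    fits = np.array([fit_fn(m) for m in popn])
    best = popn[np.argmax(fits)]
    return keep[best.astype(bool)], fits.max()

X, y = make_classification(n_samples=120, n_features=200, n_informative=6, random_state=0)
subset, acc = fwfs(X, y)
print(len(subset), round(acc, 3))
```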

