Applying Permutation Tests for Assessing the Statistical Significance of Wrapper Based Feature Selection

Author(s):  
Antti Airola ◽  
Tapio Pahikkala ◽  
Jorma Boberg ◽  
Tapio Salakoski
2018 ◽  
Author(s):  
Neo Christopher Chung

Single cell RNA sequencing (scRNA-seq) allows us to dissect transcriptional heterogeneity arising from cellular types, spatio-temporal contexts, and environmental stimuli. Cell identities of samples derived from heterogeneous subpopulations are routinely determined by clustering of scRNA-seq data. Computational cell identities are then used in downstream analysis, feature selection, and visualization. However, how can we examine if cell identities are accurately inferred? To this end, we introduce non-parametric methods to evaluate cell identities by testing cluster memberships of single cell samples in an unsupervised manner. We propose posterior inclusion probabilities for cluster memberships to select and visualize samples relevant to subpopulations. Beyond simulation studies, we examined two scRNA-seq datasets: a mixture of Jurkat and 293T cells and a large family of peripheral blood mononuclear cells. We demonstrated probabilistic feature selection and improved t-SNE visualization. By learning uncertainty in clustering, the proposed methods enable rigorous testing of cell identities in scRNA-seq.
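As a minimal sketch of the underlying idea (not the authors' exact statistic), testing one sample's cluster membership by permutation might look like the following: the sample's mean distance to its assigned cluster is compared against a null distribution built by randomly reassigning cluster labels.

```python
import numpy as np

def membership_pvalue(X, labels, i, n_perm=1000, rng=None):
    """Permutation p-value for sample i's cluster membership.

    Illustrative stand-in: compares sample i's mean distance to its
    assigned cluster against distances under randomly permuted labels.
    """
    rng = np.random.default_rng(rng)
    own = labels[i]
    idx = np.arange(len(labels))
    # Distances from sample i to every sample (itself excluded by mask).
    dists = np.linalg.norm(X - X[i], axis=1)
    observed = dists[(labels == own) & (idx != i)].mean()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(labels)
        pmask = (perm == own) & (idx != i)
        if pmask.any() and dists[pmask].mean() <= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)
```

A well-clustered sample yields a small p-value, since random label reassignments rarely produce an equally tight neighborhood.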


2019 ◽  
Author(s):  
Marshall A. Taylor

Coefficient plots are a popular tool for visualizing regression estimates. The appeal of these plots is that they visualize confidence intervals around the estimates and generally center the plot around zero, meaning that any estimate that crosses zero is statistically nonsignificant at least at the alpha level around which the confidence intervals are constructed. For models with statistical significance levels determined via randomization models of inference and for which there is no standard error or confidence intervals for the estimate itself, these plots appear less useful. In this paper, I illustrate a variant of the coefficient plot for regression models with p-values constructed using permutation tests. These visualizations plot each estimate's p-value and its associated confidence interval in relation to a specified alpha level. These plots can help the analyst interpret and report both the statistical and substantive significance of their models. Illustrations are provided using a nonprobability sample of activists and participants at a 1962 anti-Communism school.
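One concrete ingredient of such plots: a permutation p-value estimated from B permutations is a binomial proportion, so an exact (Clopper-Pearson) interval can be drawn around it. A minimal sketch, not necessarily Taylor's exact construction:

```python
from scipy.stats import beta

def perm_pvalue_ci(k, B, level=0.95):
    """Point estimate and exact Clopper-Pearson interval for a
    permutation p-value.

    k: number of permuted statistics at least as extreme as observed
    B: number of permutations
    Illustrative sketch; the point estimate uses the standard
    add-one correction (k + 1) / (B + 1).
    """
    a = 1 - level
    # Clopper-Pearson bounds for k successes out of B trials.
    lo = beta.ppf(a / 2, k, B - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - a / 2, k + 1, B - k) if k < B else 1.0
    return (k + 1) / (B + 1), lo, hi
```

Plotting these intervals against a horizontal line at the chosen alpha level then shows whether an estimate's p-value is distinguishable from the significance threshold.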


Author(s):  
MARCUS LIWICKI ◽  
HORST BUNKE

In this paper, we describe feature selection experiments for online handwriting recognition. We investigated a set of 25 online and pseudo-offline features to find out which features are important and which features may be redundant. To analyze the saliency of the features, we applied a sequential forward and a sequential backward search on the feature set. A hidden Markov model and a neural network based recognizer have been used as recognition engines. In our experiments, we obtained interesting results. Using a set of only five features, we achieved a performance similar to that of the reference system that uses all 25 features. The five selected features have a low correlation and have been the top choices during the first iterations of the forward search with both recognizers. Furthermore, for both recognizers, subsets have been identified that outperform the reference system with statistical significance. In order to assess the results more rigorously, we have compared our recognizer with the widely used commercial recognizer from Microsoft.
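Sequential forward search of the kind applied here can be sketched generically (illustrative only; `score` stands in for a recognizer's validation performance, which is outside this snippet):

```python
def forward_select(features, score, k):
    """Greedy sequential forward search: repeatedly add the feature
    whose inclusion most improves score(subset), up to k features.

    score: callable taking a list of features and returning a number
    (e.g. recognition accuracy of an HMM or neural network recognizer).
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # Pick the candidate that maximizes the score when added.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

A sequential backward search is the mirror image: start from the full set and repeatedly drop the feature whose removal hurts the score least.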


2021 ◽  
pp. 1-10 ◽  
Author(s):  
Achin Jain ◽  
Vanita Jain

This paper presents a Hybrid Feature Selection Technique for Sentiment Classification. We have used a Genetic Algorithm and a combination of existing Feature Selection methods, namely: Information Gain (IG), CHI Square (CHI), and GINI Index (GINI). First, we have obtained features from the three selection approaches mentioned above and then performed the UNION SET Operation to extract the reduced feature set. Then, a Genetic Algorithm is applied to optimize the feature set further. This paper also presents an Ensemble Approach based on the error rate obtained on different domain datasets. To test our proposed Hybrid Feature Selection and Ensemble Classification approach, we have considered four Support Vector Machine (SVM) classifier variants. We have used UCI ML Datasets of three domains, namely: IMDB Movie Review, Amazon Product Review, and Yelp Restaurant Reviews. The experimental results show that our proposed approach performed best on all three domain datasets. Further, we also present a t-test for statistical significance between classifiers, and comparison is also done based on Precision, Recall, F1-Score, AUC, and model execution time.
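The UNION SET step can be sketched as follows (an illustrative reading of the description; the rankings and `k` are hypothetical inputs, standing in for the IG, CHI, and GINI orderings):

```python
def union_of_selections(rankings, k):
    """Union of the top-k features from several selection methods.

    rankings: list of feature rankings (each a list of feature
    indices, best first), e.g. from IG, CHI, and GINI scoring.
    Returns the sorted union, i.e. the reduced feature set that a
    genetic algorithm could then optimize further.
    """
    reduced = set()
    for ranking in rankings:
        reduced |= set(ranking[:k])
    return sorted(reduced)
```

The union is deliberately permissive: a feature kept by any one scorer survives, and the subsequent genetic-algorithm stage is left to prune redundancy.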


2018 ◽  
Vol 159 ◽  
pp. 01053 ◽  
Author(s):  
Bayu Adhi Tama ◽  
Kyung-Hyune Rhee

The most challenging research topic in the field of intrusion detection systems (IDS) is anomaly detection. It is able to reveal any peculiar activities in the network by contrasting them with normal patterns. This paper proposes an efficient random forest (RF) model with particle swarm optimization (PSO)-based feature selection for IDS. The performance of the model is evaluated on a well-known benchmark dataset, i.e. NSL-KDD, in terms of accuracy, precision, recall, and false alarm rate (FAR) metrics. Furthermore, we evaluate the significance of differences between the proposed model and other classifiers, i.e. rotation forest (RoF) and deep neural network (DNN), using a statistical significance test. Based on the statistical tests, the proposed model significantly outperforms the other classifiers involved in the experiment.
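As a rough illustration of the PSO side of such a pipeline (not the authors' implementation; the inertia and acceleration constants below are arbitrary, and `eval_subset` is a hypothetical hook that would wrap, e.g., random forest validation accuracy), a minimal binary PSO over feature masks could look like:

```python
import numpy as np

def pso_feature_select(eval_subset, n_feat, n_particles=10, n_iter=20, rng=0):
    """Minimal binary PSO for feature selection (illustrative sketch).

    Velocities pass through a sigmoid transfer function to give
    per-bit probabilities of selecting each feature.
    """
    rng = np.random.default_rng(rng)
    pos = rng.integers(0, 2, (n_particles, n_feat))
    vel = rng.normal(0, 1, (n_particles, n_feat))
    pbest = pos.copy()
    pbest_fit = np.array([eval_subset(p) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, n_feat))
        # Standard velocity update: inertia + cognitive + social terms.
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        prob = 1.0 / (1.0 + np.exp(-vel))  # sigmoid transfer
        pos = (rng.random((n_particles, n_feat)) < prob).astype(int)
        fit = np.array([eval_subset(p) for p in pos])
        improved = fit > pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest
```

The returned mask picks the feature subset with the best fitness seen; in the paper's setting the fitness would be the RF model's detection performance on NSL-KDD.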


2019 ◽  
Vol 2019 ◽  
pp. 1-18 ◽ 
Author(s):  
Hao Guo ◽  
Yao Li ◽  
Godfred Kim Mensah ◽  
Yong Xu ◽  
Junjie Chen ◽  
...  

In recent years, functional brain network topological features have been widely used as classification features. Previous studies have found that network node scale differences caused by different network parcellation definitions significantly affect the structure of the constructed network and its topological properties. However, it remains unclear how network scale differences affect classification accuracy, the performance of classification features, and the effectiveness of a P-value-based feature selection strategy in machine learning methods. This study used five scale parcellations, involving 90, 256, 497, 1003, and 1501 nodes. Three local properties of resting-state functional brain networks were selected (degree, betweenness centrality, and nodal efficiency), and the support vector machine method was used to construct classifiers to identify patients with major depressive disorder. We analyzed the impact of the five scales on classification accuracy. In addition, the effectiveness and redundancy of features obtained by the different scale parcellations were compared. Finally, traditional statistical significance (P value) was verified as a feature selection criterion. The results showed that the feature effectiveness of different scales was similar; in other words, parcellation with more regions did not provide more effective discriminative features. Nevertheless, parcellation with more regions did provide a greater quantity of discriminative features, which led to an improvement in classification accuracy. However, due to the close distance between brain regions, the redundancy of parcellation with more regions was also greater. The traditional P-value feature selection strategy is feasible at different scales, but our analysis showed that the traditional P<0.05 threshold was too strict for feature selection.
This study provides an important reference for the selection of network scales when applying topological properties of brain networks to machine learning methods.
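The P-value feature selection strategy discussed above can be sketched as a simple two-sample t-test filter (illustrative only; the study's actual pipeline and statistics may differ):

```python
import numpy as np
from scipy.stats import ttest_ind

def pvalue_filter(X, y, alpha=0.05):
    """Keep features whose two-sample t-test p-value between the two
    groups (e.g. patients vs. controls) falls below alpha.

    X: samples-by-features matrix of network properties;
    y: binary group labels. Returns selected feature indices.
    """
    pvals = np.array([ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue
                      for j in range(X.shape[1])])
    return np.flatnonzero(pvals < alpha)
```

Relaxing `alpha` above 0.05 corresponds to the paper's observation that the conventional threshold can be too strict, discarding features that still carry discriminative signal.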




2018 ◽  
Author(s):  
Trang T. Le ◽  
Ryan J. Urbanowicz ◽  
Jason H. Moore ◽  
Brett A. McKinney

Motivation: Relief is a family of machine learning algorithms that uses nearest neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features.
Methods: We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.
Results: We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. We apply STIR to real RNA-Seq data from a study of major depressive disorder and discuss STIR's straightforward extension to genome-wide association studies.
Availability: Code and data available at http://insilico.utulsa.edu/software/
Contact: [email protected]
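The pseudo t-test itself is specific to STIR, but the general idea of attaching significance to Relief-style scores can be sketched with a plain permutation test. This is an illustrative stand-in, not the STIR estimator, and the univariate weight below is a simplification of true Relief, which finds nearest neighbors in the full feature space:

```python
import numpy as np

def relief_weight(X, y, j):
    """Simplified, univariate Relief-style weight for feature j:
    mean of (nearest-miss distance - nearest-hit distance) per sample."""
    n = len(y)
    idx = np.arange(n)
    w = 0.0
    for i in range(n):
        d = np.abs(X[:, j] - X[i, j])
        hits = d[(y == y[i]) & (idx != i)]    # same-class neighbors
        misses = d[y != y[i]]                 # other-class neighbors
        w += misses.min() - hits.min()
    return w / n

def relief_pvalue(X, y, j, n_perm=200, rng=0):
    """Permutation p-value for the weight: shuffle class labels to
    build a null distribution of weights (stand-in for a pseudo t-test)."""
    rng = np.random.default_rng(rng)
    obs = relief_weight(X, y, j)
    null = [relief_weight(X, rng.permutation(y), j) for _ in range(n_perm)]
    return (1 + sum(w >= obs for w in null)) / (n_perm + 1)
```

A feature that separates cases from controls yields a large observed weight relative to the permutation null, and hence a small p-value; the resulting p-values can then be adjusted for multiple testing across features.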

