Applying Permutation Tests for Assessing the Statistical Significance of Wrapper Based Feature Selection

Author(s):  
Antti Airola ◽  
Tapio Pahikkala ◽  
Jorma Boberg ◽  
Tapio Salakoski
2018 ◽  
Author(s):  
Neo Christopher Chung

Single cell RNA sequencing (scRNA-seq) allows us to dissect transcriptional heterogeneity arising from cellular types, spatio-temporal contexts, and environmental stimuli. Cell identities of samples derived from heterogeneous subpopulations are routinely determined by clustering of scRNA-seq data. Computational cell identities are then used in downstream analysis, feature selection, and visualization. However, how can we examine if cell identities are accurately inferred? To this end, we introduce non-parametric methods to evaluate cell identities by testing cluster memberships of single cell samples in an unsupervised manner. We propose posterior inclusion probabilities for cluster memberships to select and visualize samples relevant to subpopulations. Beyond simulation studies, we examined two scRNA-seq datasets: a mixture of Jurkat and 293T cells and a large family of peripheral blood mononuclear cells. We demonstrated probabilistic feature selection and improved t-SNE visualization. By learning uncertainty in clustering, the proposed methods enable rigorous testing of cell identities in scRNA-seq.
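As a minimal sketch of the underlying idea (not the authors' exact statistic), testing one sample's cluster membership by permutation might look like the following: the sample's mean distance to its assigned cluster is compared against a null distribution built by randomly reassigning cluster labels.

```python
import numpy as np

def membership_pvalue(X, labels, i, n_perm=1000, rng=None):
    """Permutation p-value for sample i's cluster membership.

    Illustrative stand-in: compares sample i's mean distance to its
    assigned cluster against distances under randomly permuted labels.
    """
    rng = np.random.default_rng(rng)
    own = labels[i]
    idx = np.arange(len(labels))
    # Distances from sample i to every sample (itself excluded by mask).
    dists = np.linalg.norm(X - X[i], axis=1)
    observed = dists[(labels == own) & (idx != i)].mean()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(labels)
        pmask = (perm == own) & (idx != i)
        if pmask.any() and dists[pmask].mean() <= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)
```

A well-clustered sample yields a small p-value, since random label reassignments rarely produce an equally tight neighborhood.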


2019 ◽  
Author(s):  
Marshall A. Taylor

Coefficient plots are a popular tool for visualizing regression estimates. The appeal of these plots is that they visualize confidence intervals around the estimates and generally center the plot around zero, meaning that any estimate that crosses zero is statistically nonsignificant at least at the alpha level around which the confidence intervals are constructed. For models with statistical significance levels determined via randomization models of inference and for which there is no standard error or confidence intervals for the estimate itself, these plots appear less useful. In this paper, I illustrate a variant of the coefficient plot for regression models with p-values constructed using permutation tests. These visualizations plot each estimate's p-value and its associated confidence interval in relation to a specified alpha level. These plots can help the analyst interpret and report both the statistical and substantive significance of their models. Illustrations are provided using a nonprobability sample of activists and participants at a 1962 anti-Communism school.
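One concrete ingredient of such plots: a permutation p-value estimated from B permutations is a binomial proportion, so an exact (Clopper-Pearson) interval can be drawn around it. A minimal sketch, not necessarily Taylor's exact construction:

```python
from scipy.stats import beta

def perm_pvalue_ci(k, B, level=0.95):
    """Point estimate and exact Clopper-Pearson interval for a
    permutation p-value.

    k: number of permuted statistics at least as extreme as observed
    B: number of permutations
    Illustrative sketch; the point estimate uses the standard
    add-one correction (k + 1) / (B + 1).
    """
    a = 1 - level
    # Clopper-Pearson bounds for k successes out of B trials.
    lo = beta.ppf(a / 2, k, B - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - a / 2, k + 1, B - k) if k < B else 1.0
    return (k + 1) / (B + 1), lo, hi
```

Plotting these intervals against a horizontal line at the chosen alpha level then shows whether an estimate's p-value is distinguishable from the significance threshold.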


Author(s):  
MARCUS LIWICKI ◽  
HORST BUNKE

In this paper, we describe feature selection experiments for online handwriting recognition. We investigated a set of 25 online and pseudo-offline features to find out which features are important and which features may be redundant. To analyze the saliency of the features, we applied a sequential forward and a sequential backward search on the feature set. A hidden Markov model and a neural network based recognizer have been used as recognition engines. In our experiments, we obtained interesting results. Using a set of only five features, we achieved a performance similar to that of the reference system that uses all 25 features. The five selected features have a low correlation and have been the top choices during the first iterations of the forward search with both recognizers. Furthermore, for both recognizers, subsets have been identified that outperform the reference system with statistical significance. In order to assess the results more rigorously, we have compared our recognizer with the widely used commercial recognizer from Microsoft.
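Sequential forward search of the kind applied here can be sketched generically (illustrative only; `score` stands in for a recognizer's validation performance, which is outside this snippet):

```python
def forward_select(features, score, k):
    """Greedy sequential forward search: repeatedly add the feature
    whose inclusion most improves score(subset), up to k features.

    score: callable taking a list of features and returning a number
    (e.g. recognition accuracy of an HMM or neural network recognizer).
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # Pick the candidate that maximizes the score when added.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

A sequential backward search is the mirror image: start from the full set and repeatedly drop the feature whose removal hurts the score least.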


2021 ◽  
pp. 1-10 ◽  
Author(s):  
Achin Jain ◽  
Vanita Jain

This paper presents a Hybrid Feature Selection Technique for Sentiment Classification. We have used a Genetic Algorithm and a combination of existing Feature Selection methods, namely: Information Gain (IG), CHI Square (CHI), and GINI Index (GINI). First, we have obtained features from the three selection approaches mentioned above and then performed the UNION SET Operation to extract the reduced feature set. Then, a Genetic Algorithm is applied to optimize the feature set further. This paper also presents an Ensemble Approach based on the error rate obtained on different domain datasets. To test our proposed Hybrid Feature Selection and Ensemble Classification approach, we have considered four Support Vector Machine (SVM) classifier variants. We have used UCI ML Datasets of three domains, namely: IMDB Movie Review, Amazon Product Review, and Yelp Restaurant Reviews. The experimental results show that our proposed approach performed best on all three domain datasets. Further, we also present a t-test for statistical significance between classifiers, and comparison is also done based on Precision, Recall, F1-Score, AUC, and model execution time.
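The UNION SET step can be sketched as follows (an illustrative reading of the description; the rankings and `k` are hypothetical inputs, standing in for the IG, CHI, and GINI orderings):

```python
def union_of_selections(rankings, k):
    """Union of the top-k features from several selection methods.

    rankings: list of feature rankings (each a list of feature
    indices, best first), e.g. from IG, CHI, and GINI scoring.
    Returns the sorted union, i.e. the reduced feature set that a
    genetic algorithm could then optimize further.
    """
    reduced = set()
    for ranking in rankings:
        reduced |= set(ranking[:k])
    return sorted(reduced)
```

The union is deliberately permissive: a feature kept by any one scorer survives, and the subsequent genetic-algorithm stage is left to prune redundancy.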


2018 ◽  
Vol 159 ◽  
pp. 01053 ◽  
Author(s):  
Bayu Adhi Tama ◽  
Kyung-Hyune Rhee

The most challenging research topic in the field of intrusion detection systems (IDS) is anomaly detection. It is able to reveal any peculiar activities in the network by contrasting them with normal patterns. This paper proposes an efficient random forest (RF) model with particle swarm optimization (PSO)-based feature selection for IDS. The performance of the model is evaluated on a well-known benchmark dataset, i.e. NSL-KDD, in terms of accuracy, precision, recall, and false alarm rate (FAR) metrics. Furthermore, we evaluate the significance of differences between the proposed model and other classifiers, i.e. rotation forest (RoF) and deep neural network (DNN), using a statistical significance test. Based on the statistical tests, the proposed model significantly outperforms the other classifiers involved in the experiment.
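As a rough illustration of the PSO side of such a pipeline (not the authors' implementation; the inertia and acceleration constants below are arbitrary, and `eval_subset` is a hypothetical hook that would wrap, e.g., random forest validation accuracy), a minimal binary PSO over feature masks could look like:

```python
import numpy as np

def pso_feature_select(eval_subset, n_feat, n_particles=10, n_iter=20, rng=0):
    """Minimal binary PSO for feature selection (illustrative sketch).

    Velocities pass through a sigmoid transfer function to give
    per-bit probabilities of selecting each feature.
    """
    rng = np.random.default_rng(rng)
    pos = rng.integers(0, 2, (n_particles, n_feat))
    vel = rng.normal(0, 1, (n_particles, n_feat))
    pbest = pos.copy()
    pbest_fit = np.array([eval_subset(p) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, n_feat))
        # Standard velocity update: inertia + cognitive + social terms.
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        prob = 1.0 / (1.0 + np.exp(-vel))  # sigmoid transfer
        pos = (rng.random((n_particles, n_feat)) < prob).astype(int)
        fit = np.array([eval_subset(p) for p in pos])
        improved = fit > pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest
```

The returned mask picks the feature subset with the best fitness seen; in the paper's setting the fitness would be the RF model's detection performance on NSL-KDD.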


2019 ◽  
Vol 2019 ◽  
pp. 1-18 ◽ 
Author(s):  
Hao Guo ◽  
Yao Li ◽  
Godfred Kim Mensah ◽  
Yong Xu ◽  
Junjie Chen ◽  
...  

In recent years, functional brain network topological features have been widely used as classification features. Previous studies have found that network node scale differences caused by different network parcellation definitions significantly affect the structure of the constructed network and its topological properties. However, it remains unclear how network scale differences affect classification accuracy, the performance of classification features, and the effectiveness of a P-value-based feature selection strategy in machine learning methods. This study used five scale parcellations, involving 90, 256, 497, 1003, and 1501 nodes. Three local properties of resting-state functional brain networks were selected (degree, betweenness centrality, and nodal efficiency), and the support vector machine method was used to construct classifiers to identify patients with major depressive disorder. We analyzed the impact of the five scales on classification accuracy. In addition, the effectiveness and redundancy of features obtained by the different scale parcellations were compared. Finally, traditional statistical significance (P value) was verified as a feature selection criterion. The results showed that the feature effectiveness of different scales was similar; in other words, parcellation with more regions did not provide more effective discriminative features. Nevertheless, parcellation with more regions did provide a greater quantity of discriminative features, which led to an improvement in classification accuracy. However, due to the close distance between brain regions, the redundancy of parcellation with more regions was also greater. The traditional P-value feature selection strategy is feasible at different scales, but our analysis showed that the traditional P<0.05 threshold was too strict for feature selection.
This study provides an important reference for the selection of network scales when applying topological properties of brain networks to machine learning methods.
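The P-value feature selection strategy discussed above can be sketched as a simple two-sample t-test filter (illustrative only; the study's actual pipeline and statistics may differ):

```python
import numpy as np
from scipy.stats import ttest_ind

def pvalue_filter(X, y, alpha=0.05):
    """Keep features whose two-sample t-test p-value between the two
    groups (e.g. patients vs. controls) falls below alpha.

    X: samples-by-features matrix of network properties;
    y: binary group labels. Returns selected feature indices.
    """
    pvals = np.array([ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue
                      for j in range(X.shape[1])])
    return np.flatnonzero(pvals < alpha)
```

Relaxing `alpha` above 0.05 corresponds to the paper's observation that the conventional threshold can be too strict, discarding features that still carry discriminative signal.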




2018 ◽  
Author(s):  
Trang T. Le ◽  
Ryan J. Urbanowicz ◽  
Jason H. Moore ◽  
Brett A. McKinney

Motivation: Relief is a family of machine learning algorithms that uses nearest neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features.
Methods: We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.
Results: We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. We apply STIR to real RNA-Seq data from a study of major depressive disorder and discuss STIR's straightforward extension to genome-wide association studies.
Availability: Code and data available at http://insilico.utulsa.edu/software/
Contact: [email protected]
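The pseudo t-test itself is specific to STIR, but the general idea of attaching significance to Relief-style scores can be sketched with a plain permutation test. This is an illustrative stand-in, not the STIR estimator, and the univariate weight below is a simplification of true Relief, which finds nearest neighbors in the full feature space:

```python
import numpy as np

def relief_weight(X, y, j):
    """Simplified, univariate Relief-style weight for feature j:
    mean of (nearest-miss distance - nearest-hit distance) per sample."""
    n = len(y)
    idx = np.arange(n)
    w = 0.0
    for i in range(n):
        d = np.abs(X[:, j] - X[i, j])
        hits = d[(y == y[i]) & (idx != i)]    # same-class neighbors
        misses = d[y != y[i]]                 # other-class neighbors
        w += misses.min() - hits.min()
    return w / n

def relief_pvalue(X, y, j, n_perm=200, rng=0):
    """Permutation p-value for the weight: shuffle class labels to
    build a null distribution of weights (stand-in for a pseudo t-test)."""
    rng = np.random.default_rng(rng)
    obs = relief_weight(X, y, j)
    null = [relief_weight(X, rng.permutation(y), j) for _ in range(n_perm)]
    return (1 + sum(w >= obs for w in null)) / (n_perm + 1)
```

A feature that separates cases from controls yields a large observed weight relative to the permutation null, and hence a small p-value; the resulting p-values can then be adjusted for multiple testing across features.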

