Scalable and Flexible Unsupervised Feature Selection

Recently, graph-based unsupervised feature selection algorithms (GUFS) have been shown to efficiently handle prevalent high-dimensional unlabeled data. One common drawback associated with existing graph-based approaches is that they tend to be time-consuming and in need of large storage, especially when faced with the increasing size of data. Research has started using anchors to accelerate graph-based learning model for feature selection, while the hard linear constraint between the data matrix and the lower-dimensional representation is usually overstrict in many applications. In this letter, we propose a flexible linearization model with anchor graph and [Formula: see text]-norm regularization, which can deal with large-scale data sets and improve the performance of the existing anchor-based method. In addition, the anchor-based graph Laplacian is constructed to characterize the manifold embedding structure by means of a parameter-free adaptive neighbor assignment strategy. An efficient iterative algorithm is developed to address the optimization problem, and we also prove the convergence of the algorithm. Experiments on several public data sets demonstrate the effectiveness and efficiency of the method we propose.

Download Full-text

Q-Learning with Fisher Score for Feature Selection of Large-Scale Data Sets

Knowledge Science, Engineering and Management - Lecture Notes in Computer Science ◽

10.1007/978-3-030-82147-0_25 ◽

2021 ◽

pp. 306-318

Author(s):

Min Gan ◽

Li Zhang

Keyword(s):

Feature Selection ◽

Large Scale ◽

Data Sets ◽

Fisher Score ◽

Q Learning ◽

Large Scale Data ◽

Scale Data ◽

Large Scale Data Sets ◽

Selection Of

Download Full-text

LLE Score: A New Filter-Based Unsupervised Feature Selection Method Based on Nonlinear Manifold Embedding and Its Application to Image Recognition

IEEE Transactions on Image Processing ◽

10.1109/tip.2017.2733200 ◽

2017 ◽

Vol 26 (11) ◽

pp. 5257-5269 ◽

Cited By ~ 33

Author(s):

Chao Yao ◽

Ya-Feng Liu ◽

Bo Jiang ◽

Jungong Han ◽

Junwei Han

Keyword(s):

Feature Selection ◽

Image Recognition ◽

Feature Selection Method ◽

Selection Method ◽

Unsupervised Feature Selection ◽

Manifold Embedding

Download Full-text

Unsupervised Feature Selection Based on Spectral Clustering with Maximum Relevancy and Minimum Redundancy Approach

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001421500312 ◽

2021 ◽

Vol 35 (11) ◽

pp. 2150031

Author(s):

Bahareh Khozaei ◽

Mahdi Eftekhari

Keyword(s):

Feature Selection ◽

Spectral Clustering ◽

Information Gain ◽

State Of The Art ◽

Nearest Neighbors ◽

Data Sets ◽

Unsupervised Feature Selection ◽

Significant Difference ◽

Cluster A ◽

Novel Approaches

In this paper, two novel approaches for unsupervised feature selection are proposed based on the spectral clustering. In the first proposed method, spectral clustering is employed over the features and the center of clusters is selected as well as their nearest-neighbors. These features have a minimum similarity (redundancy) between themselves since they belong to different clusters. Next, samples of data sets are clustered employing spectral clustering so that to the samples of each cluster a specific pseudo-label is assigned. After that according to the obtained pseudo-labels, the information gain of the features is computed that secures the maximum relevancy. Finally, the intersection of the selected features in the two previous steps is determined that simultaneously guarantees both the maximum relevancy and minimum redundancy. Our second proposed approach is very similar to the first one whose only but significant difference with the first method is that it selects one feature from each cluster and sorts all the features in terms of their relevancy. Then, by appending the selected features to a sorted list and ignoring them for the next step, the algorithm continues with the remaining features until all the features to be appended into the sorted list. Both of our proposed methods are compared with state-of-the-art methods and the obtained results confirm the performance of our proposed approaches especially the second one.

Download Full-text

Feature selection with partition differentiation entropy for large-scale data sets

Information Sciences ◽

10.1016/j.ins.2015.10.002 ◽

2016 ◽

Vol 329 ◽

pp. 690-700 ◽

Cited By ~ 23

Author(s):

Fachao Li ◽

Zan Zhang ◽

Chenxia Jin

Keyword(s):

Feature Selection ◽

Large Scale ◽

Data Sets ◽

Large Scale Data ◽

Scale Data ◽

Large Scale Data Sets

Download Full-text

A novel unsupervised feature selection method for bioinformatics data sets through feature clustering

2008 IEEE International Conference on Granular Computing ◽

10.1109/grc.2008.4664788 ◽

2008 ◽

Cited By ~ 2

Author(s):

Guangrong Li ◽

Xiaohua Hu ◽

Xiajiong Shen ◽

Xin Chen ◽

Zhoujun Li

Keyword(s):

Feature Selection ◽

Feature Selection Method ◽

Selection Method ◽

Data Sets ◽

Feature Clustering ◽

Unsupervised Feature Selection

Download Full-text

Unified Methods for Feature Selection in Large-Scale Genomic Studies with Censored Survival Outcomes

10.1101/2020.02.14.944314 ◽

2020 ◽

Author(s):

Lauren Spirko-Burns ◽

Karthik Devarajan

Keyword(s):

Feature Selection ◽

Large Scale ◽

Proportional Hazards ◽

Genomic Feature ◽

Data Sets ◽

Prognostic Impact ◽

Genomic Features ◽

Special Cases ◽

Information Divergence ◽

Genomic Studies

AbstractOne of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease’s process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous data sets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for each feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods could lead to an under- or over-estimation of the effects. We propose a broad array of marginal screening techniques that aid in feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback-Leibler information divergence and the Yang-Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R2 indices for the PH and PO models that can be interpreted in terms of explained randomness. Lastly, we propose a generalized pseudo-R2 measure that includes PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability between subjects experiencing the event and not experiencing the event according to feature expression. We evaluate the performance of our measures using extensive simulation studies and publicly available data sets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods. The proposed information divergence, R2 and pseudo-R2 measures were implemented in R (www.R-project.org) and code is available upon request.

Download Full-text

An Empirical Study on Initializing Centroid in K-Means Clustering for Feature Selection

International Journal of Software Science and Computational Intelligence ◽

10.4018/ijssci.2021010101 ◽

2021 ◽

Vol 13 (1) ◽

pp. 1-16

Author(s):

Amit Saxena ◽

John Wang ◽

Wutiphol Sintunavarat

Keyword(s):

Data Mining ◽

Feature Selection ◽

Empirical Study ◽

Data Sets ◽

Data Set ◽

Unsupervised Feature Selection ◽

The Impact

One of the main problems in K-means clustering is setting of initial centroids which can cause misclustering of patterns which affects clustering accuracy. Recently, a density and distance-based technique for determining initial centroids has claimed a faster convergence of clusters. Motivated from this key idea, the authors study the impact of initial centroids on clustering accuracy for unsupervised feature selection. Three metrics are used to rank the features of a data set. The centroids of the clusters in the data sets, to be applied in K-means clustering, are initialized randomly as well as by density and distance-based approaches. Extensive experiments are performed on 15 datasets. The main significance of the paper is that the K-means clustering yields higher accuracies in majority of these datasets using proposed density and distance-based approach. As an impact of the paper, with fewer features, a good clustering accuracy can be achieved which can be useful in data mining of data sets with thousands of features.

Download Full-text

COMBINING FEATURE SELECTION WITH EXTRACTION: UNSUPERVISED FEATURE SELECTION BASED ON PRINCIPAL COMPONENT ANALYSIS

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213009000445 ◽

2009 ◽

Vol 18 (06) ◽

pp. 883-904

Author(s):

YUN LI ◽

BAO-LIANG LU ◽

TENG-FEI ZHANG

Keyword(s):

Feature Selection ◽

Principal Components ◽

Nearest Neighbor ◽

Dimensional Space ◽

Image Data ◽

Principal Component ◽

Data Sets ◽

K Nearest Neighbor ◽

Unsupervised Feature Selection ◽

Linear Feature

Principal components analysis (PCA) is a popular linear feature extractor, and widely used in signal processing, face recognition, etc. However, axes of the lower-dimensional space, i.e., principal components, are a set of new variables carrying no clear physical meanings. Thus we propose unsupervised feature selection algorithms based on eigenvectors analysis to identify critical original features for principal component. The presented algorithms are based on k-nearest neighbor rule to find the predominant row components and eight new measures are proposed to compute the correlation between row components in transformation matrix. Experiments are conducted on benchmark data sets and facial image data sets for gender classification to show their superiorities.

Download Full-text

Unsupervised Feature Selection by Pareto Optimization

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013534 ◽

2019 ◽

Vol 33 ◽

pp. 3534-3541 ◽

Cited By ~ 3

Author(s):

Chao Feng ◽

Chao Qian ◽

Ke Tang

Keyword(s):

Feature Selection ◽

Reconstruction Error ◽

Superior Performance ◽

Data Matrix ◽

Feature Transformation ◽

Huge Number ◽

Unsupervised Feature Selection ◽

Column Subset Selection ◽

Approximation Guarantee ◽

Natural Formulation

Dimensionality reduction is often employed to deal with the data with a huge number of features, which can be generally divided into two categories: feature transformation and feature selection. Due to the interpretability, the efficiency during inference and the abundance of unlabeled data, unsupervised feature selection has attracted much attention. In this paper, we consider its natural formulation, column subset selection (CSS), which is to minimize the reconstruction error of a data matrix by selecting a subset of features. We propose an anytime randomized iterative approach POCSS, which minimizes the reconstruction error and the number of selected features simultaneously. Its approximation guarantee is well bounded. Empirical results exhibit the superior performance of POCSS over the state-of-the-art algorithms.

Download Full-text

Unsupervised feature selection for large data sets

Pattern Recognition Letters ◽

10.1016/j.patrec.2019.08.017 ◽

2019 ◽

Vol 128 ◽

pp. 183-189 ◽

Cited By ~ 2

Author(s):

Renato Cordeiro de Amorim

Keyword(s):

Feature Selection ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Unsupervised Feature Selection ◽

Selection For

Download Full-text