IVFS: Simple and Efficient Feature Selection for High Dimensional Topology Preservation

2020
Vol 34 (04)
pp. 4747-4754
Author(s):
Xiaoyun Li
Chenxi Wu
Ping Li

Feature selection is an important tool for dealing with high-dimensional data. In the unsupervised case, many popular algorithms aim at maintaining the structure of the original data. In this paper, we propose a simple and effective feature selection algorithm that enhances sample similarity preservation from a new perspective, topology preservation, represented by persistence diagrams from computational topology. The method is designed upon a unified feature selection framework called IVFS, which is inspired by the random subset method. The scheme is flexible and can handle cases where the problem is analytically intractable. The proposed algorithm preserves the pairwise distances, as well as the topological patterns, of the full data well. We demonstrate that our algorithm provides satisfactory performance under a sharp sub-sampling rate, which supports efficient application of the method to large-scale datasets. Extensive experiments validate the effectiveness of the proposed feature selection scheme.
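To make the random-subset scheme concrete, here is a minimal Python sketch that scores features by how well random feature subsets preserve pairwise distances on sub-sampled data; the paper's persistence-diagram loss is replaced by a plain distance-preservation loss, and all names (`ivfs_scores`, `n_subsets`, `subset_size`) are illustrative assumptions rather than the authors' implementation.

```python
# A minimal sketch, assuming a pairwise-distance loss in place of the
# persistence-diagram loss used in the paper; names are illustrative.
import numpy as np
from scipy.spatial.distance import pdist

def ivfs_scores(X, n_subsets=500, subset_size=None, sample_size=None, seed=0):
    """Average, over random feature subsets, the distance-preservation
    loss of each subset; every feature inherits the losses of the
    subsets it appears in (lower average = better feature)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    subset_size = subset_size or max(1, d // 2)
    sample_size = sample_size or min(n, 256)   # sharp sub-sampling of rows
    loss_sum, counts = np.zeros(d), np.zeros(d)
    for _ in range(n_subsets):
        rows = rng.choice(n, size=sample_size, replace=False)
        cols = rng.choice(d, size=subset_size, replace=False)
        full_dist = pdist(X[rows])                # distances, all features
        sub_dist = pdist(X[np.ix_(rows, cols)])   # distances, subset only
        loss = np.abs(full_dist - sub_dist).mean()
        loss_sum[cols] += loss
        counts[cols] += 1
    return loss_sum / np.maximum(counts, 1)

# keep the k features with the smallest average loss:
# selected = np.argsort(ivfs_scores(X))[:k]
```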

Author(s):
Jundong Li
Jiliang Tang
Huan Liu

Feature selection has been proven effective and efficient in preparing high-dimensional data for data mining and machine learning problems. Since real-world data is usually unlabeled, unsupervised feature selection has received increasing attention in recent years. Without label information, unsupervised feature selection needs alternative criteria to define feature relevance. Recently, data reconstruction error has emerged as a new criterion for unsupervised feature selection; it defines feature relevance as the capability of features to approximate the original data via a reconstruction function. Most existing algorithms in this family assume predefined, linear reconstruction functions. However, the reconstruction function should be data dependent and may not always be linear, especially when the original data is high-dimensional. In this paper, we investigate how to learn the reconstruction function from the data automatically for unsupervised feature selection, and propose a novel reconstruction-based unsupervised feature selection framework, REFS, which embeds the reconstruction function learning process into feature selection. Experiments on various types of real-world datasets demonstrate the effectiveness of the proposed framework REFS.
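For intuition, the following sketch scores feature subsets with the linear reconstruction criterion that REFS generalizes: the data is linearly reconstructed from a candidate subset, and the residual measures subset quality. The greedy loop and function names are illustrative assumptions, not the REFS algorithm, which learns the reconstruction function itself.

```python
# A minimal sketch of the linear reconstruction-error criterion; the greedy
# loop is illustrative, and REFS itself learns the reconstruction function.
import numpy as np

def reconstruction_error(X, subset):
    """Residual of linearly reconstructing all of X from X[:, subset]."""
    S = X[:, subset]
    W, *_ = np.linalg.lstsq(S, X, rcond=None)   # best linear map S @ W ~ X
    return np.linalg.norm(X - S @ W) ** 2

def greedy_select(X, k):
    """Greedily add the feature that most reduces reconstruction error."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining,
                   key=lambda j: reconstruction_error(X, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```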


Sensors
2021
Vol 21 (11)
pp. 3627
Author(s):
Bo Jin
Chunling Fu
Yong Jin
Wei Yang
Shengbin Li
...

Identifying the key genes related to tumors from gene expression data with a large number of features is important for accurately classifying tumors and making specific treatment decisions. In recent years, unsupervised feature selection algorithms have attracted considerable attention in the field of gene selection because they can find the most discriminating subsets of genes, i.e., the latent information in biological data. Recent research also shows that maintaining the important structure of the data is necessary for gene selection. However, most current feature selection methods merely capture the local structure of the original data while ignoring its global structure. We believe that the global and local structures of the original data are equally important, so the selected genes should maintain the essential structure of the original data as far as possible. In this paper, we propose a new, adaptive, unsupervised feature selection scheme that not only reconstructs high-dimensional data into a low-dimensional space under a feature-distance-invariance constraint but also employs the ℓ2,1-norm to endow a projection matrix with the ability to perform gene selection, embedding it into the local manifold structure-learning framework. Moreover, an effective algorithm is developed to solve the optimization problem arising from the proposed scheme. Comparative experiments with several classical schemes on real tumor datasets demonstrate the effectiveness of the proposed method.
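As a sketch of the ℓ2,1-norm ingredient, the snippet below solves a generic regression objective regularized by ‖W‖₂,₁ via standard iterative reweighting and ranks genes by the row norms of W. The embedding target `Y` is assumed given here; in the paper it is learned jointly under the distance-invariance constraint, so this shows only the row-sparsity half of the scheme.

```python
# A minimal sketch, assuming the low-dimensional target Y is given; the
# paper learns it jointly under a distance-invariance constraint.
import numpy as np

def l21_norm(W):
    """||W||_{2,1}: the sum of the l2 norms of the rows of W."""
    return np.linalg.norm(W, axis=1).sum()

def l21_solver(X, Y, lam=1.0, n_iter=30, eps=1e-8):
    """Minimize ||X W - Y||_F^2 + lam * ||W||_{2,1} by the standard
    iterative-reweighting scheme; the l2,1 term drives whole rows of W
    to zero, which is what enables gene (feature) selection."""
    W = np.linalg.lstsq(X, Y, rcond=None)[0]   # warm start
    for _ in range(n_iter):
        D = np.diag(1.0 / (2.0 * np.linalg.norm(W, axis=1) + eps))
        W = np.linalg.solve(X.T @ X + lam * D, X.T @ Y)
    return W

# genes with the largest row norms of W are the ones selected:
# ranked = np.argsort(-np.linalg.norm(l21_solver(X, Y), axis=1))
```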


2021
Vol 26 (1)
pp. 67-77
Author(s):
Siva Sankari Subbiah
Jayakumar Chinnappan

Nowadays, organizations collect huge volumes of data without knowing their usefulness. The rapid development of the Internet helps organizations capture data in many different formats through the Internet of Things (IoT), social media, and other disparate sources. Dataset dimensionality increases day by day at an extraordinary rate, resulting in large-scale, high-dimensional datasets. The present paper reviews the opportunities and challenges of feature selection for processing high-dimensional data with reduced complexity and improved accuracy. In the modern big data world, feature selection plays a significant role in reducing dimensionality and overfitting in the learning process. Researchers have proposed many feature selection methods for obtaining more relevant features, especially from big datasets, that help provide accurate learning results without performance degradation. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, and the challenges of feature selection, and provides a summary of the related research work done by various researchers. As a result, big data analysis with feature selection improves learning accuracy.
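As a small, concrete illustration of the filter-style selection the review covers, the sketch below keeps the top-scoring features of a toy high-dimensional dataset with scikit-learn; the dataset and parameter choices are illustrative only, and a distributed Hadoop/Spark pipeline would follow the same select-then-learn pattern at larger scale.

```python
# A toy filter-style feature selection with scikit-learn; all sizes and
# parameter values are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=1000,
                           n_informative=20, random_state=0)
selector = SelectKBest(mutual_info_classif, k=50).fit(X, y)
X_reduced = selector.transform(X)   # 500 x 50: only top-scoring features
```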


2019
Vol 5
pp. e237
Author(s):
Davide Nardone
Angelo Ciaramella
Antonino Staiano

In this work, we propose a novel feature selection framework called Sparse-Modeling Based Approach for Class Specific Feature Selection (SMBA-CSFS), which simultaneously exploits the ideas of sparse modeling and class-specific feature selection. Feature selection plays a key role in several fields (e.g., computational biology), making it possible to build models with fewer variables which, in turn, are easier to explain, provide valuable insight into the importance of each feature's role, and can speed up experimental validation. Unfortunately, as corroborated by the no-free-lunch theorems, no approach in the literature is universally best at detecting the optimal feature subset for building a final model, so the task remains a challenge. The proposed feature selection procedure follows a two-step approach: (a) a sparse modeling-based learning technique is first used to find the best subset of features for each class of a training set; (b) the discovered feature subsets are then fed to a class-specific feature selection scheme in order to assess the effectiveness of the selected features in classification tasks. To this end, an ensemble of classifiers is built, where each classifier is trained on its own feature subset discovered in the previous phase, and a proper decision rule is adopted to compute the ensemble responses. To evaluate the performance of the proposed method, extensive experiments have been performed on publicly available datasets, in particular from the computational biology field, where feature selection is indispensable: acute lymphoblastic leukemia and acute myeloid leukemia, human carcinomas, human lung carcinomas, diffuse large B-cell lymphoma, and malignant glioma. SMBA-CSFS is able to identify and retrieve the most representative features that maximize classification accuracy. With the top 20 and 80 features, SMBA-CSFS exhibits promising performance compared with its competitors from the literature on all considered datasets, especially those with a higher number of features. Experiments show that the proposed approach may outperform state-of-the-art methods when the number of features is high. For this reason, the introduced approach is well suited to the selection and classification of data with a large number of features and classes.
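A minimal sketch of the two-step, class-specific idea follows, with Lasso standing in for the paper's sparse-modeling step and an averaged-probability rule standing in for its ensemble decision rule; all function names and parameter values are illustrative assumptions.

```python
# A minimal sketch: Lasso replaces the paper's sparse-modeling step and an
# averaged-probability rule replaces its decision rule; names illustrative.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import SVC

def class_specific_subsets(X, y, alpha=0.01, k=20):
    """Step (a): one sparse model per class picks a feature subset."""
    subsets = {}
    for c in np.unique(y):
        target = (y == c).astype(float)              # one-vs-rest target
        coef = Lasso(alpha=alpha).fit(X, target).coef_
        subsets[c] = np.argsort(-np.abs(coef))[:k]   # top-k sparse weights
    return subsets

def fit_ensemble(X, y, subsets):
    """Step (b): one classifier per class-specific feature subset."""
    return {c: SVC(probability=True).fit(X[:, cols], y)
            for c, cols in subsets.items()}

def predict(models, subsets, X):
    """Decision rule: average the class-probability outputs."""
    probs = np.mean([models[c].predict_proba(X[:, subsets[c]])
                     for c in models], axis=0)
    classes = next(iter(models.values())).classes_
    return classes[np.argmax(probs, axis=1)]
```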


Author(s):
Wei Zheng
Xiaofeng Zhu
Yonghua Zhu
Shichao Zhang

Feature selection is an indispensable preprocessing procedure for high-dimensional data analysis, but previous feature selection methods usually ignore sample diversity (i.e., every sample contributes individually to model construction) and have limited ability to deal with incomplete datasets in which some training samples have unobserved data. To address these issues, in this paper we first propose a robust feature selection framework to relieve the influence of outliers, and then introduce an indicator matrix that keeps unobserved data out of the numerical computation of feature selection, so that both our proposed framework and existing feature selection frameworks can conduct feature selection on incomplete data sets. We further propose a new optimization algorithm to optimize the resulting objective function and prove that it converges fast. Experimental results on both real and artificial incomplete data sets demonstrate that our proposed method outperforms the feature selection methods under comparison in terms of clustering performance.
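The indicator-matrix device can be sketched in a few lines: a 0/1 mask keeps unobserved entries out of the objective, so any reconstruction-style feature selection loss can be evaluated on incomplete data. The self-representation matrix `W` and the loss form below are assumptions for illustration, not the authors' exact objective.

```python
# A minimal sketch; the self-representation matrix W and the loss form are
# illustrative, not the authors' exact objective.
import numpy as np

def masked_loss(X, W, M):
    """||M * (X - X W)||_F^2: only observed entries (M == 1) contribute,
    so unobserved data never participates in the numerical computation."""
    R = M * (X - X @ W)
    return (R ** 2).sum()

X_raw = np.array([[1.0, np.nan, 3.0],
                  [4.0, 5.0, np.nan]])
M = (~np.isnan(X_raw)).astype(float)   # indicator matrix: 1 = observed
X = np.nan_to_num(X_raw)               # placeholder zeros are masked out
print(masked_loss(X, np.zeros((3, 3)), M))   # == sum of observed squares
```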


2021
pp. 1-10
Author(s):
Lei Shu
Kun Huang
Wenhao Jiang
Wenming Wu
Hongling Liu

Using real-world data directly in machine learning tasks easily leads to poor generalization, since such data is usually high-dimensional and limited in quantity. By learning low-dimensional representations of high-dimensional data, feature selection can retain the features that are useful for machine learning tasks, and these features in turn allow models to be trained effectively. Feature selection from high-dimensional data is therefore a challenge. To address this issue, in this paper a novel hybrid feature selection approach consisting of an autoencoder and Bayesian methods is proposed. First, Bayesian methods are embedded in the proposed autoencoder as a special hidden layer; this is done to increase precision when selecting non-redundant features. Then, the other hidden layers of the autoencoder are used for non-redundant feature selection. Finally, the proposed method is compared with mainstream feature selection approaches and outperforms them. We find that combining autoencoders with probabilistic correction methods is more meaningful for feature selection than stacking architectures or adding constraints to autoencoders. We also demonstrate that stacked autoencoders are more suitable for large-scale feature selection, whereas sparse autoencoders are beneficial when selecting a smaller number of features. The proposed method also provides a theoretical reference for analyzing the optimality of feature selection.
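A minimal PyTorch sketch of the autoencoder half of the idea is given below: input features are scored by the column norms of the first encoder layer. The Bayesian hidden layer of the paper is omitted and replaced by a plain bottleneck, so this shows the reconstruction-based selection mechanism only; all names and sizes are illustrative.

```python
# A minimal PyTorch sketch; the Bayesian hidden layer is omitted and the
# bottleneck size, epochs, and learning rate are illustrative.
import torch
import torch.nn as nn

class AEFS(nn.Module):
    def __init__(self, d, h=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, h), nn.ReLU())
        self.decoder = nn.Linear(h, d)

    def forward(self, x):
        return self.decoder(self.encoder(x))

    def feature_scores(self):
        # column j of the first-layer weights connects input feature j
        # to the bottleneck; its norm measures the feature's influence
        return self.encoder[0].weight.norm(dim=0)

torch.manual_seed(0)
X = torch.randn(256, 100)                    # toy unlabeled data
model = AEFS(d=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                         # train to reconstruct inputs
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)
    loss.backward()
    opt.step()
selected = torch.topk(model.feature_scores(), k=20).indices
```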


Author(s):
Mingxia Liu
Daoqiang Zhang

As thousands of features are available in many pattern recognition and machine learning applications, feature selection remains an important task for finding the most compact representation of the original data. Although a number of feature selection methods have been developed in the literature, most of them focus on optimizing specific objective functions. In this paper, we first propose a general graph-preserving feature selection framework in which the graphs to be preserved vary in their specific definitions, and show that a number of existing filter-type feature selection algorithms can be unified within this framework. Then, based on the proposed framework, a new filter-type feature selection method called sparsity score (SS) is proposed. This method aims to preserve the structure of a pre-defined ℓ1 graph that is proven robust to data noise. Here, a modified sparse representation based on an ℓ1-norm minimization problem is used to determine the graph adjacency structure and the corresponding affinity weight matrix simultaneously. Furthermore, a variant of SS called supervised SS (SuSS) is also proposed, in which the ℓ1 graph to be preserved is constructed using only data points from the same class. Experimental results on clustering and classification tasks over a series of benchmark data sets show that the proposed methods achieve better performance than conventional filter-type feature selection methods.
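For illustration, here is a minimal sketch of the ℓ1-graph construction together with a Laplacian-score-style feature score computed on it; scikit-learn's Lasso stands in for the paper's exact ℓ1-minimization solver, and the scoring formula is a generic graph-preservation criterion rather than the authors' precise sparsity score.

```python
# A minimal sketch; Lasso stands in for the exact l1 solver and the score
# is a generic graph-preservation (Laplacian-score-style) criterion.
import numpy as np
from sklearn.linear_model import Lasso

def l1_graph(X, alpha=0.05):
    """Sparsely code each sample over all the others; the coefficient
    magnitudes give adjacency and affinity weights simultaneously."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        coef = Lasso(alpha=alpha).fit(X[others].T, X[i]).coef_
        W[i, others] = np.abs(coef)
    return (W + W.T) / 2                     # symmetrize

def sparsity_scores(X, W):
    """Features that vary smoothly over the l1 graph preserve its
    structure; lower scores are better."""
    L = np.diag(W.sum(axis=1)) - W           # graph Laplacian
    F = X - X.mean(axis=0)
    return np.einsum('ni,nm,mi->i', F, L, F) / (F ** 2).sum(axis=0)
```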

