Unsupervised Nonlinear Feature Selection from High-Dimensional Signed Networks

2020 · Vol 34 (04) · pp. 4182-4189
Author(s): Qiang Huang, Tingyu Xia, Huiyan Sun, Makoto Yamada, Yi Chang

With the rapid development of social media services in recent years, relational data are growing explosively. The signed network, which consists of a mixture of positive and negative links, is an effective way to represent the friendly and hostile relations among nodes, which can represent users or items. Because the features associated with a node of a signed network are usually incomplete, noisy, unlabeled, and high-dimensional, feature selection is an important procedure for eliminating irrelevant features. However, existing network-based feature selection methods are linear, which means they can only select features that have a linear dependency on the output values. Moreover, in many social data, most nodes are unlabeled; therefore, selecting features in an unsupervised manner is generally preferred. To this end, in this paper, we propose a nonlinear unsupervised feature selection method for signed networks, called SignedLasso. This method can select a small number of important features with nonlinear input-output associations from high-dimensional data. More specifically, we formulate unsupervised feature selection as a nonlinear feature selection problem with the Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso), which can find a small number of features in a nonlinear manner. Then, we propose the use of a deep learning-based node embedding to represent node similarity without label information and incorporate the node embedding into the HSIC Lasso. Through experiments on two real-world datasets, we show that the proposed algorithm is superior to existing linear unsupervised feature selection methods.
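The abstract leaves out the optimization details, but the published form of HSIC Lasso is standard: vectorize a centered Gram matrix per feature, vectorize a centered Gram matrix of the (pseudo-)output, and solve a non-negative Lasso between them. A minimal Python sketch under that formulation, where `emb` stands in for the paper's deep-learning-based node embedding and the kernel width and regularization strength are illustrative:

    import numpy as np
    from sklearn.linear_model import Lasso

    def centered_gram(x, sigma=1.0):
        # Centered Gaussian Gram matrix; x is a 1-D feature or a 2-D embedding.
        d = x.reshape(len(x), -1)
        sq = ((d[:, None, :] - d[None, :, :]) ** 2).sum(-1)
        K = np.exp(-sq / (2 * sigma ** 2))
        n = len(x)
        H = np.eye(n) - np.ones((n, n)) / n
        return H @ K @ H

    def hsic_lasso(X, emb, lam=1e-3):
        # X: (n, d) node features; emb: (n, m) embedding used as pseudo-output.
        L = centered_gram(emb).ravel()
        Ks = np.column_stack([centered_gram(X[:, j]).ravel()
                              for j in range(X.shape[1])])
        # Non-negative Lasso over vectorized Gram matrices; a nonzero
        # coefficient marks the corresponding feature as selected.
        model = Lasso(alpha=lam, positive=True, fit_intercept=False).fit(Ks, L)
        return np.flatnonzero(model.coef_ > 0)

Each Gram matrix costs O(n^2) memory, so this direct version only suits modest sample sizes; any scalability tricks used in the paper are not described in the abstract.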

2015 · Vol 2015 · pp. 1-10
Author(s): Heyong Wang, Ming Hong

With the rapid development of web applications such as social networks, a large amount of electronic text data has accumulated on the Internet, spurring increasing interest in text mining. Text classification is one of the most important subfields of text mining. In practice, text documents are often represented as a high-dimensional sparse document-term matrix (DTM) before classification. Feature selection is essential for text classification due to the high dimensionality and sparsity of the DTM. An efficient feature selection method is capable of both reducing the dimensions of the DTM and selecting discriminative features for text classification. Laplacian Score (LS) is an unsupervised feature selection method that has been used successfully in areas such as face recognition. However, LS is unable to select discriminative features for text classification or to effectively reduce the sparsity of the DTM. To improve on it, this paper proposes an unsupervised feature selection method named Distance Variance Score (DVS). DVS uses the feature distance contribution (a ratio) to rank the importance of features for text documents, so as to select discriminative features. Experimental results indicate that DVS is able to select discriminative features and reduce the sparsity of the DTM; it is thus considerably more effective than LS.
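The DVS ratio itself is not given in the abstract, but the Laplacian Score baseline it targets is well documented (He et al., 2005) and easy to sketch. A minimal NumPy/scikit-learn version, with the neighborhood size k and heat-kernel width t as illustrative defaults; lower scores mark features that better preserve local structure:

    import numpy as np
    from sklearn.neighbors import kneighbors_graph

    def laplacian_score(X, k=5, t=1.0):
        # Build a kNN similarity graph with heat-kernel weights.
        W = kneighbors_graph(X, k, mode='distance').toarray()
        S = np.where(W > 0, np.exp(-W ** 2 / t), 0.0)
        S = np.maximum(S, S.T)              # symmetrize the graph
        d = S.sum(1)                        # node degrees
        L = np.diag(d) - S                  # graph Laplacian
        scores = []
        for r in range(X.shape[1]):
            f = X[:, r]
            f_t = f - (f @ d) / d.sum()     # remove the trivial constant part
            den = f_t @ (d * f_t)
            scores.append((f_t @ L @ f_t) / den if den > 0 else np.inf)
        return np.array(scores)             # rank ascending to select features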


2021
Author(s): Qi Chen, Mengjie Zhang, Bing Xue

When learning from high-dimensional data for symbolic regression (SR), genetic programming (GP) typically does not generalize well. Feature selection, as a data preprocessing method, can potentially contribute not only to improving the efficiency of learning algorithms but also to enhancing their generalization ability. However, in GP for high-dimensional SR, feature selection before learning is seldom considered. In this paper, we propose a new permutation-based feature selection method for high-dimensional SR using GP. A set of experiments was conducted to investigate the effect of the proposed method on the generalization of GP for high-dimensional SR. The regression results confirm the superior performance of the proposed method over the other examined feature selection methods. Further analysis indicates that the models evolved by the proposed method are more likely to contain only the truly relevant features and have better interpretability.
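The abstract does not detail the permutation scheme, but permutation importance itself is standard: shuffle one feature at a time and measure how much a fitted model's score drops. A minimal scikit-learn sketch with a random forest standing in as the proxy learner (an assumption for illustration; the paper applies the idea as preprocessing for GP):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance

    def permutation_select(X, y, keep=10, seed=0):
        # Fit a proxy model, then rank features by the mean score drop
        # observed when each feature is independently shuffled.
        proxy = RandomForestRegressor(n_estimators=200, random_state=seed)
        proxy.fit(X, y)
        imp = permutation_importance(proxy, X, y, n_repeats=10,
                                     random_state=seed)
        return np.argsort(imp.importances_mean)[::-1][:keep]

The surviving `keep` features would then be fed to the GP learner in place of the full feature set.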


2021
Author(s): Yijun Liu, Qiang Huang, Huiyan Sun, Yi Chang

It is significant but challenging to identify a subset of robust biomarkers that distinguish cancer from normal samples in high-dimensional, imbalanced cancer biological omics data. Although many feature selection methods addressing high dimensionality and class imbalance have been proposed, they rarely account for the fact that the majority class dominates the final decision-making when the dataset is imbalanced, which leads to instability in downstream tasks. Because causal relationships are invariant, causal inference is considered an effective way to improve the performance and stability of machine learning. This paper proposes a Causality-inspired Least Angle Nonlinear Distributed (CLAND) feature selection method consisting of two branches, a class-wise branch and a sample-wise branch, which represent two deconfounding strategies. We compared the performance of CLAND with that of other advanced feature selection methods on transcriptional data from six cancer types with different imbalance ratios. The genes selected by CLAND show superior accuracy, stability, and generalization in downstream classification tasks, indicating potential causality for identifying cancer samples. A review of the literature further confirms that these genes play an essential role in cancer initiation and progression.


Author(s): Hadeel N. Alshaer, Mohammed A. Otair, Laith Abualigah

Feature selection is one of the main problems in the text and data mining domain. This paper presents a comparative study of feature selection methods for Arabic text classification. Five feature selection methods were selected: ICHI square, CHI square, Information Gain, Mutual Information, and Wrapper. They were tested with five classification algorithms: Bayes Net, Naive Bayes, Random Forest, Decision Tree, and Artificial Neural Networks. An Arabic data collection consisting of 9055 documents was used, and the methods were compared on four criteria: Precision, Recall, F-measure, and time to build the model. The results showed that the improved ICHI feature selection achieved the best results on almost all measures in comparison with the other methods.
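The ICHI modification is not described in the abstract, but the plain CHI-square baseline is compact in scikit-learn: build the document-term matrix, keep the top-k terms by chi-square score, and classify. A minimal sketch (k=1000 is illustrative):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def chi2_nb_classifier(k=1000):
        # Document-term counts -> top-k chi-square terms -> Naive Bayes.
        return make_pipeline(CountVectorizer(),
                             SelectKBest(chi2, k=k),
                             MultinomialNB())

    # Usage: clf = chi2_nb_classifier().fit(train_docs, train_labels)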


2021 · Vol 12
Author(s): David Källberg, Linda Vidman, Patrik Rydén

Cancer subtype identification is important to facilitate cancer diagnosis and select effective treatments. Clustering of cancer patients based on high-dimensional RNA-sequencing data can be used to detect novel subtypes, but only a subset of the features (e.g., genes) contains information related to the cancer subtype. Therefore, it is reasonable to assume that the clustering should be based on a set of carefully selected features rather than all features. Several feature selection methods have been proposed, but how and when to use them is still poorly understood. Thirteen feature selection methods were evaluated on four human cancer data sets, all with known subtypes (gold standards) that were used only for evaluation. The methods were characterized by considering the mean expression and standard deviation (SD) of the selected genes, their overlap with other methods, and their clustering performance, obtained by comparing the clustering result with the gold standard using the adjusted Rand index (ARI). The results were compared to a supervised approach as a positive control and to two negative controls in which either a random selection of genes or all genes were included. For all data sets, the best feature selection approach outperformed the negative controls, and for two data sets the gain was substantial, with ARI increasing from (−0.01, 0.39) to (0.66, 0.72), respectively. No feature selection method completely outperformed the others, but using the dip-test statistic to select 1000 genes was overall a good choice. The commonly used approach, in which the genes with the highest SDs are selected, did not perform well in our study.
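The evaluation loop behind these numbers is straightforward to sketch: select genes, cluster the samples, and compare the result to the gold standard with ARI. The version below uses the highest-SD criterion the authors found lacking; swapping that ranking for a dip statistic (e.g., from the `diptest` Python package, an assumption about tooling) would give the selector they recommend:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    def evaluate_selection(expr, true_subtype, n_genes=1000,
                           n_clusters=4, seed=0):
        # expr: (samples, genes) expression matrix; true_subtype: gold labels.
        idx = np.argsort(expr.std(0))[::-1][:n_genes]   # top-SD genes
        labels = KMeans(n_clusters, n_init=10,
                        random_state=seed).fit_predict(expr[:, idx])
        return adjusted_rand_score(true_subtype, labels)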


2021 · Vol 2021 · pp. 1-12
Author(s): Chengyuan Huang

With the rapid development of artificial intelligence in recent years, research on image processing, text mining, and genome informatics has gradually deepened, and the mining of large-scale databases has begun to receive more and more attention. The objects of data mining have also become more complex, and their data dimensions have become higher and higher. Compared with the ultra-high dimensionality of the data, the number of samples available for analysis is often too small, producing high-dimensional small-sample data. Such data suffer severely from the curse of dimensionality during mining. Through feature selection, redundant and noisy features in high-dimensional small-sample data can be effectively eliminated, avoiding the curse of dimensionality and improving the practical efficiency of mining algorithms. However, existing feature selection methods emphasize the classification or clustering performance of the selected features and ignore the stability of the selection results, which leads to unstable feature selection and makes it difficult to obtain real, understandable features. Building on traditional feature selection methods, this paper proposes an ensemble feature selection method, Random Bits Forest Recursive Clustering Eliminate (RBF-RCE), which combines multiple sets of basic classifiers for parallel learning and screens out the best feature classification results; it optimizes the classification performance of traditional feature selection methods and can also improve the stability of feature selection. The paper then analyzes the reasons for the instability of feature selection and introduces a stability measurement method, the Intersection Measurement (IM), to evaluate whether the feature selection process is stable. The effectiveness of the proposed method is verified by experiments on several high-dimensional small-sample data sets.
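The abstract does not give the IM formula; a common instantiation of an intersection-based stability measure is the average pairwise intersection-over-union of the feature subsets selected across resampling runs, sketched here under that assumption:

    import numpy as np
    from itertools import combinations

    def intersection_stability(subsets):
        # subsets: feature-index sets selected on different resamples.
        # Returns a value in [0, 1]; 1.0 means perfectly stable selection.
        sets = [set(s) for s in subsets]
        return float(np.mean([len(a & b) / len(a | b)
                              for a, b in combinations(sets, 2)]))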


Electronics · 2021 · Vol 11 (1) · pp. 114
Author(s): Fitriani Muttakin, Jui-Tang Wang, Mulyanto Mulyanto, Jenq-Shiou Leu

Artificial intelligence, particularly machine learning, is the fastest-growing research trend in educational fields. Machine learning shows impressive performance in many prediction models, including in psychosocial education. Its capability to discover hidden patterns in large datasets encourages researchers to assemble data with high-dimensional features. However, not all features are needed, and in many cases high-dimensional features decrease the performance of machine learning. Feature selection is an appropriate approach to reducing the features so that machine learning works efficiently. Various selection methods have been proposed, but research to determine the essential feature subset in psychosocial education has not been established thus far. This research investigated and proposed a method to determine the best feature selection method in the domain of psychosocial education. We used a multi-criteria decision-making (MCDM) approach, Additive Ratio Assessment (ARAS), to rank seven feature selection methods. The proposed model evaluated each feature selection method on nine criteria drawn from the performance metrics produced by machine learning. The experimental results showed that ARAS is promising for evaluating and recommending the best feature selection method for psychosocial education data, using the teachers' psychosocial risk levels dataset.
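ARAS itself is a small, well-defined computation: append an ideal alternative, ratio-normalize each criterion, weight, and score each alternative by its summed value relative to the ideal. A minimal NumPy sketch (the weights and the benefit/cost mask are analyst-supplied inputs):

    import numpy as np

    def aras_rank(X, weights, benefit):
        # X: (alternatives, criteria); weights sum to 1;
        # benefit: boolean mask, True for benefit criteria, False for cost.
        X = np.asarray(X, float).copy()
        benefit = np.asarray(benefit, bool)
        X[:, ~benefit] = 1.0 / X[:, ~benefit]   # invert cost criteria
        X = np.vstack([X.max(0), X])            # row 0: ideal alternative
        R = X / X.sum(0)                        # ratio normalization
        S = (R * np.asarray(weights)).sum(1)    # optimality function
        return S[1:] / S[0]                     # utility degree vs. ideal

In this study the rows would be the seven feature selection methods and the columns the nine performance criteria; the method with the highest utility degree is the one recommended.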

