Feature selection revisited in the single-cell era

Feature selection (marker gene selection) is widely believed to improve clustering accuracy, and is thus a key component of single-cell clustering pipelines. Existing feature selection methods perform inconsistently across datasets, occasionally even resulting in poorer clustering accuracy than without feature selection. Moreover, existing methods ignore information contained in gene-gene correlations. Here, we introduce DUBStepR (Determining the Underlying Basis using Stepwise Regression), a feature selection algorithm that leverages gene-gene correlations with a novel measure of inhomogeneity in feature space, termed the Density Index (DI). Despite selecting a relatively small number of genes, DUBStepR substantially outperformed existing single-cell feature selection methods across diverse clustering benchmarks. Additionally, DUBStepR was the only method to robustly deconvolve T and NK heterogeneity by identifying disease-associated common and rare cell types and subtypes in PBMCs from rheumatoid arthritis patients. DUBStepR is scalable to over a million cells, and can be straightforwardly applied to other data types such as single-cell ATAC-seq. We propose DUBStepR as a general-purpose feature selection solution for accurately clustering single-cell data.


Triku: a feature selection method based on nearest neighbors for single-cell data

10.1101/2021.02.12.430764 ◽

2021 ◽

Author(s):

Alex M. Ascensión ◽

Olga Ibañez-Solé ◽

Inaki Inza ◽

Ander Izeta ◽

Marcos J. Araúzo-Bravo

Keyword(s):

Feature Selection ◽

Single Cell ◽

Nearest Neighbor ◽

Feature Selection Method ◽

Selection Method ◽

Cell Populations ◽

Neighbor Graph ◽

Gene Sets ◽

Nearest Neighbor Graph ◽

Cell Data

Feature selection is a relevant step in the analysis of single-cell RNA sequencing datasets. Triku is a feature selection method that favours genes defining the main cell populations. It does so by selecting genes expressed by groups of cells that are close in the nearest neighbor graph. Triku efficiently recovers cell populations present in artificial and biological benchmarking datasets, based on mutual information and silhouette coefficient measurements. Additionally, gene sets selected by triku are more likely to be related to relevant Gene Ontology terms, and contain fewer ribosomal and mitochondrial genes. Triku is available at https://gitlab.com/alexmascension/triku.
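The idea of favouring genes whose expression is concentrated among neighbouring cells can be illustrated with a toy score. The sketch below is not triku's actual algorithm; the function names `knn_indices` and `neighborhood_score`, the brute-force neighbor search, and the score itself are assumptions made for illustration. For one gene, it compares how often the neighbors of expressing cells also express the gene against the overall fraction of expressing cells.

```python
import math


def knn_indices(coords, k):
    """For each cell, return the indices of its k nearest other cells (Euclidean)."""
    neighbors = []
    for i, a in enumerate(coords):
        dists = sorted((math.dist(a, b), j) for j, b in enumerate(coords) if j != i)
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def neighborhood_score(expressed, neighbors):
    """Toy locality score for one gene (hypothetical, not the triku statistic).

    expressed: list of bools, True where the cell expresses the gene.
    Returns the mean fraction of expressing neighbors around expressing cells
    minus the overall fraction of expressing cells.
    """
    expressing = [i for i, flag in enumerate(expressed) if flag]
    if not expressing:
        return 0.0
    local = sum(
        sum(expressed[j] for j in neighbors[i]) / len(neighbors[i])
        for i in expressing
    ) / len(expressing)
    background = len(expressing) / len(expressed)
    return local - background


# Two well-separated groups of four cells each (made-up 2D coordinates).
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
neighbors = knn_indices(coords, k=2)

clustered = [True, True, True, True, False, False, False, False]
scattered = [True, False, False, True, True, False, False, True]

score_clustered = neighborhood_score(clustered, neighbors)
score_scattered = neighborhood_score(scattered, neighbors)
print(score_clustered, score_scattered)  # 0.5 -0.5
```

A gene confined to one group scores higher than a gene expressed by the same number of cells spread across both groups, which is the qualitative behaviour the abstract describes.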


A lazy feature selection method for multi-label classification

Intelligent Data Analysis ◽

10.3233/ida-194878 ◽

2021 ◽

Vol 25 (1) ◽

pp. 21-34

Author(s):

Rafael B. Pereira ◽

Alexandre Plastino ◽

Bianca Zadrozny ◽

Luiz H.C. Merschmann

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Feature Selection Method ◽

Selection Method ◽

Video Classification ◽

Classification Problems ◽

Class Label ◽

New Feature ◽

Feature Selection Techniques ◽

Biomolecular Analysis

In many important application domains, such as text categorization, biomolecular analysis, scene or video classification, and medical diagnosis, instances are naturally associated with more than one class label, giving rise to multi-label classification problems. This has led, in recent years, to a substantial amount of research in multi-label classification. More specifically, feature selection methods have been developed to allow the identification of relevant and informative features for multi-label classification. This work presents a new feature selection method based on the lazy feature selection paradigm and specific to the multi-label context. Experimental results show that the proposed technique is competitive with multi-label feature selection techniques currently used in the literature, and is clearly more scalable as the amount of data increases.


DUBStepR: correlation-based feature selection for clustering single-cell RNA sequencing data

10.1101/2020.10.07.330563 ◽

2020 ◽

Author(s):

Bobby Ranjan ◽

Wenjie Sun ◽

Jinyu Park ◽

Ronald Xie ◽

Fatemeh Alipour ◽

...

Keyword(s):

Feature Selection ◽

Single Cell ◽

Gene Selection ◽

Marker Gene ◽

Feature Space ◽

General Purpose ◽

Selection Marker ◽

Selection Methods ◽

Sequencing Data ◽

Data Types

Feature selection (marker gene selection) is widely believed to improve clustering accuracy, and is thus a key component of single-cell clustering pipelines. However, we found that the performance of existing feature selection methods was inconsistent across benchmark datasets, and occasionally even worse than without feature selection. Moreover, existing methods ignored information contained in gene-gene correlations. We therefore developed DUBStepR (Determining the Underlying Basis using Stepwise Regression), a feature selection algorithm that leverages gene-gene correlations with a novel measure of inhomogeneity in feature space, termed the Density Index (DI). Despite selecting a relatively small number of genes, DUBStepR substantially outperformed existing single-cell feature selection methods across diverse clustering benchmarks. In a published scRNA-seq dataset from sorted monocytes, DUBStepR sensitively detected a rare and previously invisible population of contaminating basophils. DUBStepR is scalable to large datasets, and can be straightforwardly applied to other data types such as single-cell ATAC-seq. We propose DUBStepR as a general-purpose feature selection solution for accurately clustering single-cell data.
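To make the notion of gene-gene correlation concrete, the sketch below ranks genes by their strongest absolute Pearson correlation with any other gene. It is only an illustration of using correlations as a selection signal: it does not implement DUBStepR's stepwise regression or the Density Index, and the names `pearson` and `rank_by_correlation` and the toy expression values are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_by_correlation(expr):
    """Rank genes by their maximum absolute correlation with any other gene.

    expr: dict mapping gene name -> list of expression values across cells.
    Returns (genes sorted from most to least correlated, score per gene).
    """
    genes = list(expr)
    scores = {
        g: max((abs(pearson(expr[g], expr[h])) for h in genes if h != g), default=0.0)
        for g in genes
    }
    return sorted(genes, key=lambda g: scores[g], reverse=True), scores


# Made-up expression values for three genes across four cells.
expr = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],    # moves together with geneA
    "geneC": [1, -1, -1, 1],  # uncorrelated with both
}
ranked, scores = rank_by_correlation(expr)
print(ranked[-1])  # geneC
```

Genes that co-vary with others rise to the top, while a gene that correlates with nothing is ranked last; a real method would need further steps to decide which correlated genes to keep.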


A redundancy-removing feature selection algorithm for nominal data

PeerJ Computer Science ◽

10.7717/peerj-cs.24 ◽

2015 ◽

Vol 1 ◽

pp. e24 ◽

Cited By ~ 1

Author(s):

Zhihua Li ◽

Wenqu Gu

Keyword(s):

Feature Selection ◽

Mutual Information ◽

Feature Selection Method ◽

Selection Method ◽

Selection Algorithm ◽

Nominal Data ◽

New Information ◽

New Feature ◽

High Dimensional Datasets ◽

Experimental Comparisons

No order correlation or similarity metric exists in nominal data, and there will always be more redundancy in a nominal dataset, which means that an efficient mutual information-based nominal-data feature selection method is relatively difficult to find. In this paper, a nominal-data feature selection method based on mutual information without data transformation, called the redundancy-removing more relevance less redundancy algorithm, is proposed. By forming several new information-related definitions and the corresponding computational methods, the proposed method can compute the information-related amount of nominal data directly. Furthermore, by creating a new evaluation function that considers both the relevance and the redundancy globally, the new feature selection method can evaluate the importance of each nominal-data feature. Although the presented feature selection method takes commonly used MIFS-like forms, it is capable of handling high-dimensional datasets without expensive computations. We perform extensive experimental comparisons of the proposed algorithm and other methods using three benchmarking nominal datasets with two different classifiers. The experimental results demonstrate the average advantage of the presented algorithm over the well-known NMIFS algorithm in terms of the feature selection and classification accuracy, which indicates that the proposed method has a promising performance.
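For readers unfamiliar with MIFS-like criteria, the following sketch shows the classical greedy form: at each step, pick the feature that maximizes its mutual information with the class minus a penalty for mutual information with features already selected. This is the generic MIFS scheme applied to nominal values, not the redundancy-removing algorithm proposed in the paper; `mutual_information`, `mifs_select`, the `beta` weight, and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information in bits between two nominal sequences of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def mifs_select(features, labels, k, beta=1.0):
    """Greedy MIFS-style selection of k features.

    features: dict mapping feature name -> list of nominal values.
    Score of a candidate f: I(f; labels) - beta * sum of I(f; s) over selected s.
    """
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    remaining = list(features)
    selected = []
    while remaining and len(selected) < k:
        def score(f):
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            )
            return relevance[f] - beta * redundancy

        best = max(remaining, key=score)  # ties resolve to the earliest feature
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy nominal dataset: four classes, eight samples.
labels = [0, 1, 2, 3, 0, 1, 2, 3]
features = {
    "f1": ["lo", "lo", "hi", "hi"] * 2,  # separates classes {0, 1} from {2, 3}
    "f2": ["x", "x", "y", "y"] * 2,      # same partition as f1 (fully redundant)
    "f3": ["p", "q", "p", "q"] * 2,      # separates {0, 2} from {1, 3}
}
chosen = mifs_select(features, labels, k=2)
print(chosen)  # ['f1', 'f3']
```

All three features are equally relevant on their own, but once `f1` is chosen the redundancy penalty removes any benefit from `f2`, so the complementary feature `f3` is selected instead.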


scDIOR: single cell RNA-seq data IO software

BMC Bioinformatics ◽

10.1186/s12859-021-04528-3 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Huijian Feng ◽

Lihui Lin ◽

Jiekai Chen

Keyword(s):

Single Cell ◽

Programming Languages ◽

Large Scale ◽

Developmental Trajectories ◽

Rapid Development ◽

Data Transformation ◽

Rna Seq ◽

Data Types ◽

User Friendly ◽

Cell Data

Background: Single-cell RNA sequencing is becoming a powerful tool to identify cell states, reconstruct developmental trajectories, and deconvolute spatial expression. The rapid development of computational methods promotes insight into heterogeneous single-cell data. An increasing number of tools have been provided for biological analysts, and two programming languages, R and Python, are widely used among researchers. R and Python are complementary, as many methods are implemented specifically in R or Python. However, the use of different platforms causes data sharing and transformation problems, especially for Scanpy, Seurat, and SingleCellExperiment. Currently, there is no efficient and user-friendly software to perform data transformation of single-cell omics between platforms, which makes users spend excessive time on data Input and Output (IO), significantly reducing the efficiency of data analysis. Results: We developed scDIOR for single-cell data transformation between platforms of R and Python based on Hierarchical Data Format Version 5 (HDF5). We have created a data IO ecosystem between three R packages (Seurat, SingleCellExperiment, Monocle) and a Python package (Scanpy). Importantly, scDIOR accommodates a variety of data types across programming languages and platforms in an ultrafast way, including single-cell RNA-seq and spatially resolved transcriptomics data, using only a few lines of code in an IDE or command line interface. For large-scale datasets, users can partially load the needed information, e.g., cell annotation without the gene expression matrices. scDIOR connects the analytical tasks of different platforms, which makes it easy to compare the performance of algorithms between them. Conclusions: scDIOR contains two modules, dior in R and diopy in Python. scDIOR is a versatile and user-friendly tool that implements single-cell data transformation between R and Python rapidly and stably. The software is freely accessible at https://github.com/JiekaiLab/scDIOR.


A Comparative Study on the Feature Selection Techniques for Intrusion Detection System

Asian Journal of Computer Science and Technology ◽

10.51983/ajcst-2019.8.1.2120 ◽

2019 ◽

Vol 8 (1) ◽

pp. 42-47

Author(s):

D. Selvamani ◽

V. Selvi

Keyword(s):

Feature Selection ◽

Intrusion Detection ◽

Comparative Study ◽

Intrusion Detection System ◽

Detection System ◽

Feature Selection Method ◽

Support Vector ◽

Network Intrusion ◽

Chi Square Analysis ◽

Feature Selection Techniques

Intrusion detection systems (IDS) are widely used to secure networks. They are typically deployed alongside other protective security mechanisms, such as access control and verification, as a second line of defense that guards data. Feature selection is employed to reduce the number of features in applications where the data has hundreds of attributes or more. Identifying essential or relevant attributes has become a vital task for using data mining algorithms efficiently in real-world settings. This article presents a comparative study of the Information Gain, Gain Ratio, Symmetrical Uncertainty, and Chi-Square feature selection techniques with different classification methods, namely Artificial Neural Network, Naïve Bayes, and Support Vector Machine. Different performance metrics are used to choose the appropriate feature selection method for better data classification in IDS.
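As an illustration of one of the filter criteria named above, the sketch below computes the Pearson chi-square statistic between a nominal feature and the class label; higher values indicate a stronger association. It is a minimal standalone example, not the experimental setup of the article, and the function name `chi_square` and the feature names `protocol` and `flag` are hypothetical.

```python
from collections import Counter


def chi_square(feature, labels):
    """Pearson chi-square statistic of a nominal feature against class labels.

    Sums (observed - expected)^2 / expected over every cell of the
    feature-by-class contingency table.
    """
    n = len(labels)
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    observed = Counter(zip(feature, labels))
    stat = 0.0
    for f_value, f_count in feature_counts.items():
        for l_value, l_count in label_counts.items():
            expected = f_count * l_count / n
            stat += (observed[(f_value, l_value)] - expected) ** 2 / expected
    return stat


# Made-up connection records: four attacks followed by four normal connections.
labels = ["attack"] * 4 + ["normal"] * 4
protocol = ["tcp"] * 4 + ["udp"] * 4  # perfectly aligned with the class
flag = ["a", "b"] * 4                 # evenly spread across both classes

chi_protocol = chi_square(protocol, labels)
chi_flag = chi_square(flag, labels)
print(chi_protocol, chi_flag)  # 8.0 0.0
```

Ranking features by this statistic and keeping the top few is the usual way such a filter is combined with a downstream classifier.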


Coordinating Computational and Visual Approaches for Interactive Feature Selection and Multivariate Clustering

Information Visualization ◽

10.1057/palgrave.ivs.9500053 ◽

2003 ◽

Vol 2 (4) ◽

pp. 232-246 ◽

Cited By ~ 62

Author(s):

Diansheng Guo

Keyword(s):

Feature Selection ◽

High Dimensional Data ◽

Feature Selection Method ◽

Selection Method ◽

High Dimensional ◽

Cancer Dataset ◽

Data Space ◽

Exploration Environment ◽

Interactive Feature ◽

High Dimensional Datasets

Unknown (and unexpected) multivariate patterns lurking in high-dimensional datasets are often very hard to find. This paper describes a human-centered exploration environment, which incorporates a coordinated suite of computational and visualization methods to explore high-dimensional data for uncovering patterns in multivariate spaces. Specifically, it includes: (1) an interactive feature selection method for identifying potentially interesting, multidimensional subspaces from a high-dimensional data space, (2) an interactive, hierarchical clustering method for searching multivariate clusters of arbitrary shape, and (3) a suite of coordinated visualization and computational components centered around the above two methods to facilitate a human-led exploration. The implemented system is used to analyze a cancer dataset and shows that it is efficient and effective for discovering unknown and unexpected multivariate patterns from high-dimensional data.


Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction

Genome Biology ◽

10.1186/s13059-021-02480-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Wenjing Ma ◽

Kenong Su ◽

Hao Wu

Keyword(s):

Feature Selection ◽

Single Cell ◽

Reference Data ◽

Feature Selection Method ◽

Prediction Method ◽

Real Data ◽

Cell Type ◽

Reference Dataset ◽

Computational Performance ◽

The Impact

Background: Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. Despite all the advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset. Results: In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types in cell type prediction. Next, we focus on how discrepancies between reference and target datasets and data preprocessing, such as imputation and batch effect correction, affect prediction performance. We also investigate the strategies of pooling and purifying reference data. Conclusions: Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and using a multi-layer perceptron (MLP) as the classifier, along with the F-test as the feature selection method. All the code used for our analysis is available on GitHub (https://github.com/marvinquiet/RefConstruction_supervisedCelltyping).
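The F-test recommended above can be sketched as a one-way ANOVA F-statistic computed per gene across the annotated cell types of the reference data, keeping the genes with the largest values. This is a generic illustration rather than the code from the linked repository; the names `anova_f` and `top_genes`, the gene names, and the toy values are assumptions.

```python
from collections import defaultdict


def anova_f(values, labels):
    """One-way ANOVA F-statistic of one gene's expression across labelled groups.

    Assumes at least two groups and more cells than groups.
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    n, g = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {label: sum(vs) / len(vs) for label, vs in groups.items()}
    ss_between = sum(len(vs) * (means[l] - grand_mean) ** 2 for l, vs in groups.items())
    ss_within = sum((v - means[l]) ** 2 for l, vs in groups.items() for v in vs)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (g - 1)) / (ss_within / (n - g))


def top_genes(expr, labels, k):
    """Return the k genes with the highest F-statistic.

    expr: dict mapping gene name -> list of expression values across cells.
    """
    scores = {gene: anova_f(values, labels) for gene, values in expr.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up reference data: three T cells followed by three B cells.
labels = ["T", "T", "T", "B", "B", "B"]
expr = {
    "marker_gene": [5, 6, 7, 0, 1, 2],  # differs strongly between cell types
    "flat_gene": [1, 2, 3, 1, 2, 3],    # same distribution in both cell types
}
f_marker = anova_f(expr["marker_gene"], labels)
f_flat = anova_f(expr["flat_gene"], labels)
print(f_marker, f_flat)  # 37.5 0.0
print(top_genes(expr, labels, k=1))  # ['marker_gene']
```

The selected genes would then be passed to the classifier, for example an MLP as the study suggests.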


A redundancy-removing feature selection algorithm for nominal data

10.7287/peerj.preprints.1184v1 ◽

2015 ◽

Cited By ~ 1

Author(s):

Zhihua Li

Keyword(s):

Feature Selection ◽

Mutual Information ◽

Feature Selection Method ◽

Selection Method ◽

Selection Algorithm ◽

Nominal Data ◽

New Information ◽

New Feature ◽

High Dimensional Datasets ◽

Experimental Comparisons

No order correlation or similarity metric exists in nominal data, and there will always be more redundancy in a nominal dataset, which means that an efficient mutual information-based nominal-data feature selection method is relatively difficult to find. In this paper, a nominal-data feature selection method based on mutual information without data transformation, called the redundancy-removing more relevance less redundancy algorithm, is proposed. By forming several new information-related definitions and the corresponding computational methods, the proposed method can compute the information-related amount of nominal data directly. Furthermore, by creating a new evaluation function that considers both the relevance and the redundancy globally, the new feature selection method can evaluate the importance of each nominal-data feature. Although the presented feature selection method takes commonly used MIFS-like forms, it is capable of handling high-dimensional datasets without expensive computations. We perform extensive experimental comparisons of the proposed algorithm and other methods using three benchmarking nominal datasets with two different classifiers. The experimental results demonstrate the average advantage of the presented algorithm over the well-known NMIFS algorithm in terms of the feature selection and classification accuracy, which indicates that the proposed method has a promising performance.
