Accurate feature selection improves single-cell RNA-seq cell clustering

Briefings in Bioinformatics ◽

10.1093/bib/bbab034 ◽

2021 ◽

Author(s):

Kenong Su ◽

Tianwei Yu ◽

Hao Wu

Keyword(s):

Feature Selection ◽

Single Cell ◽

Expression Patterns ◽

Cell Types ◽

Cell Clustering ◽

Selection Step ◽

Good Set ◽

Different Cell Types ◽

The Impact

Abstract: Cell clustering is one of the most important and commonly performed tasks in single-cell RNA sequencing (scRNA-seq) data analysis. An important step in cell clustering is to select a subset of genes (referred to as ‘features’), whose expression patterns will then be used for downstream clustering. A good set of features should include the ones that distinguish different cell types, and the quality of such a set can have a significant impact on clustering accuracy. All existing scRNA-seq clustering tools include a feature selection step that relies on simple unsupervised feature selection methods, mostly based on the statistical moments of gene-wise expression distributions. In this work, we carefully evaluate the impact of feature selection on cell clustering accuracy. In addition, we develop a feature selection algorithm named FEAture SelecTion (FEAST), which provides more representative features. We apply the method to 12 public scRNA-seq datasets and demonstrate that using features selected by FEAST with existing clustering tools significantly improves clustering accuracy.
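
The abstract above contrasts FEAST with simple unsupervised selection based on statistical moments of gene-wise expression. As a rough illustration of that baseline family only (not of FEAST itself, whose algorithm is not described here), the sketch below ranks genes by the variance-to-mean ratio. The function name `select_by_dispersion`, the choice of the variance-to-mean ratio as the score, and the toy data are assumptions made for this example.

```python
from statistics import mean, pvariance


def select_by_dispersion(expr, n_features):
    """Rank genes by variance / mean and return the top n_features names.

    expr: dict mapping gene name -> list of expression values across cells.
    This is a hypothetical moment-based baseline, not the FEAST algorithm.
    """
    scores = {}
    for gene, values in expr.items():
        mu = mean(values)
        # Genes with zero mean carry no signal; give them the lowest score.
        scores[gene] = pvariance(values) / mu if mu > 0 else 0.0
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_features]


# Toy data (made up): geneA is high in only some cells, geneB is nearly flat.
toy = {
    "geneA": [0, 0, 10, 12, 0, 11],
    "geneB": [5, 5, 6, 5, 5, 6],
    "geneC": [0, 1, 0, 1, 0, 1],
}
top_genes = select_by_dispersion(toy, 2)
```

On this toy input the gene that is high in only a subset of cells ranks first, which is the behaviour such moment-based scores are meant to capture; whether that tracks true cell-type markers in real data is exactly the question the paper evaluates.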

Potential Networks of Nitrogen-Phosphorus-Potassium Channels and Transporters in Arabidopsis Roots at a Single Cell Resolution

Frontiers in Plant Science ◽

10.3389/fpls.2021.689545 ◽

2021 ◽

Vol 12 ◽

Author(s):

Dhondup Lhamo ◽

Sheng Luan

Keyword(s):

Single Cell ◽

Expression Patterns ◽

Cell Types ◽

Long Distance ◽

Specific Expression ◽

Root Cells ◽

Long Distance Transport ◽

Npk Uptake ◽

Transcriptomics Data ◽

Different Cell Types

Nitrogen (N), phosphorus (P), and potassium (K) are three major macronutrients essential for plant life. These nutrients are acquired and transported by several large families of transporters expressed in plant roots. However, it remains largely unknown how these transporters are distributed in different cell-types that work together to transfer the nutrients from the soil to different layers of root cells and eventually reach vasculature for massive flow. Using the single cell transcriptomics data from Arabidopsis roots, we profiled the transcriptional patterns of putative nutrient transporters in different root cell-types. Such analyses identified a number of uncharacterized NPK transporters expressed in the root epidermis to mediate NPK uptake and distribution to the adjacent cells. Some transport genes showed cortex- and endodermis-specific expression to direct the nutrient flow toward the vasculature. For long-distance transport, a variety of transporters were shown to express and potentially function in the xylem and phloem. In the context of subcellular distribution of mineral nutrients, the NPK transporters at subcellular compartments were often found to show ubiquitous expression patterns, which suggests function in house-keeping processes. Overall, these single cell transcriptomic analyses provide working models of nutrient transport from the epidermis across the cortex to the vasculature, which can be further tested experimentally in the future.

Application of HDBSCAN Method for Clustering scRNA-seq Data

Proceedings of the Institute for System Programming of RAS ◽

10.15514/ispras-2020-32(5)-8 ◽

2020 ◽

Vol 32 (5) ◽

pp. 111-120

Author(s):

Maria Andreevna Akimenkova ◽

Anna Anatolyevna Maznina ◽

Anton Yurievich Naumov ◽

Evgeny Andreevich Karpulevich

Keyword(s):

Feature Selection ◽

Dimensionality Reduction ◽

Single Cell ◽

Rna Sequencing ◽

Cell Types ◽

Adjusted Rand Index ◽

Clustering Method ◽

Clustering Problem ◽

Single Cell Rna Sequencing

One of the main tasks in the analysis of single cell RNA sequencing (scRNA-seq) data is the identification of cell types and subtypes, which is usually based on some method of clustering. There are a number of generally accepted approaches to solving the clustering problem, one of which is implemented in the Seurat package. In addition, the quality of clustering is influenced by the use of preprocessing algorithms, such as imputation, dimensionality reduction, and feature selection. In the article, the HDBSCAN hierarchical clustering method is used to cluster scRNA-seq data. For a more complete comparison, experiments were run on two labeled datasets: Zeisel (3005 cells) and Romanov (2881 cells). To compare the quality of clustering, two external metrics were used: Adjusted Rand index and V-measure. The experiments demonstrated a higher quality of clustering by the HDBSCAN method on the Zeisel dataset and a poorer quality on the Romanov dataset.
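
The Adjusted Rand index mentioned above compares a predicted clustering with reference labels while correcting for chance agreement. A minimal standard-library sketch of the usual pair-counting formula follows; the function name and the handling of degenerate inputs (returning 1.0 when both partitions are trivial) are choices made for this example, not details taken from the article.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table of two labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat as perfect agreement

    # Count pairs of items that share a cell, a row, or a column of the table.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial (e.g. one cluster each)
    return (sum_cells - expected) / (max_index - expected)


# Same partition under renamed labels scores 1.0.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that splits every reference cluster scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which items are grouped together, renaming cluster labels does not change the score, which is why it suits comparisons between a clustering tool's output and annotated cell types.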

scRNASeqDB: a database for gene expression profiling in human single cell by RNA-seq

10.1101/104810 ◽

2017 ◽

Cited By ~ 7

Author(s):

Yuan Cao ◽

Junjie Zhu ◽

Guangchun Han ◽

Peilin Jia ◽

Zhongming Zhao

Keyword(s):

Single Cell ◽

Expression Profiles ◽

Cell Types ◽

Or Groups ◽

Cell Transcriptome ◽

Almost All ◽

Single Cell Transcriptome ◽

Different Cell Types ◽

Online Web

Abstract Summary: Single-cell RNA sequencing (scRNA-Seq) is quickly becoming a powerful tool for high-throughput transcriptomic analysis of cell states and dynamics. Both the number and quality of scRNA-Seq datasets have dramatically increased recently. So far, there is no database that comprehensively collects and curates scRNA-Seq data in humans. Here, we present scRNASeqDB, a database that includes almost all the currently available human single cell transcriptome datasets (n = 36) covering 71 human cell lines or types and 8910 samples. Our online web interface allows users to query and visualize expression profiles of the gene(s) of interest, search for genes that are expressed in different cell types or groups, or retrieve differentially expressed genes between cell types or groups. The scRNASeqDB is a valuable resource for single cell transcriptional studies. Availability: The database is available at https://bioinfo.uth.edu/scrnaseqdb/. Contact: [email protected]

Identifying cell types from single-cell data based on similarities and dissimilarities between cells

BMC Bioinformatics ◽

10.1186/s12859-020-03873-z ◽

2021 ◽

Vol 22 (S3) ◽

Author(s):

Yuanyuan Li ◽

Ping Luo ◽

Yi Lu ◽

Fang-Xiang Wu

Keyword(s):

Gene Expression ◽

Single Cell ◽

Spectral Clustering ◽

Incidence Matrix ◽

Expression Patterns ◽

Cell Types ◽

Clustering Method ◽

Different Types ◽

Cell Data ◽

Spectral Clustering Method

Abstract Background: With the development of single-cell sequencing technology, revealing homogeneity and heterogeneity between cells has become a new area of computational systems biology research. However, the clustering of cell types becomes more complex with the mutual penetration between different types of cells and the instability of gene expression. One way of overcoming this problem is to group similar, related single cells together by means of various clustering analysis methods. Although some methods such as spectral clustering can do well in the identification of cell types, they only consider the similarities between cells and ignore the influence of dissimilarities on clustering results. This may limit the performance of most conventional clustering algorithms in identifying clusters, so special methods need to be developed for high-dimensional sparse categorical data. Results: Inspired by the phenomenon that cells of the same type have similar gene expression patterns, while cells of different types show dissimilar gene expression patterns, we improve the existing spectral clustering method for clustering single-cell data so that it is based on both similarities and dissimilarities between cells. The method first measures the similarity/dissimilarity among cells, then constructs the incidence matrix by fusing the similarity matrix with the dissimilarity matrix, and, finally, uses the eigenvalues of the incidence matrix to perform dimensionality reduction and employs the K-means algorithm in the low dimensional space to achieve clustering. The proposed improved spectral clustering method is compared with the conventional spectral clustering method in recognizing cell types on several real single-cell RNA-seq datasets.
Conclusions: In summary, we show that adding intercellular dissimilarity can effectively improve accuracy and achieve robustness, and that the improved spectral clustering method outperforms the traditional spectral clustering method in grouping cells.

TPK: a single-cell clustering algorithm based on novel feature selection genes

Journal of Physics Conference Series ◽

10.1088/1742-6596/1738/1/012078 ◽

2021 ◽

Vol 1738 ◽

pp. 012078

Author(s):

Yaxuan Cui ◽

Kunjie Luo ◽

Zheyu Zhang ◽

Saijia Liu

Keyword(s):

Feature Selection ◽

Single Cell ◽

Clustering Algorithm ◽

Cell Clustering

Optimizing expression quantitative trait locus mapping workflows for single-cell studies

Genome Biology ◽

10.1186/s13059-021-02407-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Anna S. E. Cuomo ◽

Giordano Alvari ◽

Christina B. Azodi ◽

Davis J. McCarthy ◽

Marc Jan Bonder ◽

...

Keyword(s):

Gene Expression ◽

Quantitative Trait Locus ◽

Single Cell ◽

Quantitative Trait ◽

Cell Types ◽

Expression Quantitative Trait Locus ◽

Eqtl Mapping ◽

Trait Locus ◽

The Impact

Abstract Background: Single-cell RNA sequencing (scRNA-seq) has enabled the unbiased, high-throughput quantification of gene expression specific to cell types and states. With the cost of scRNA-seq decreasing and techniques for sample multiplexing improving, population-scale scRNA-seq, and thus single-cell expression quantitative trait locus (sc-eQTL) mapping, is increasingly feasible. Mapping of sc-eQTL provides additional resolution to study the regulatory role of common genetic variants on gene expression across a plethora of cell types and states and promises to improve our understanding of genetic regulation across tissues in both health and disease. Results: While previously established methods for bulk eQTL mapping can, in principle, be applied to sc-eQTL mapping, there are a number of open questions about how best to process scRNA-seq data and adapt bulk methods to optimize sc-eQTL mapping. Here, we evaluate the role of different normalization and aggregation strategies, covariate adjustment techniques, and multiple testing correction methods to establish best practice guidelines. We use both real and simulated datasets across single-cell technologies to systematically assess the impact of these different statistical approaches. Conclusion: We provide recommendations for future single-cell eQTL studies that can yield up to twice as many eQTL discoveries as default approaches ported from bulk studies.

Selecting single cell clustering parameter values using subsampling-based robustness metrics

BMC Bioinformatics ◽

10.1186/s12859-021-03957-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Ryan B. Patterson-Cross ◽

Ariel J. Levine ◽

Vilas Menon

Keyword(s):

Single Cell ◽

Optimal Parameter ◽

Clustering Algorithms ◽

Cell Types ◽

Parameter Selection ◽

Data Set ◽

Biologically Relevant ◽

Cell Clustering ◽

Parameter Values ◽

Robustness Metrics

Abstract Background: Generating and analysing single-cell data has become a widespread approach to examine tissue heterogeneity, and numerous algorithms exist for clustering these datasets to identify putative cell types with shared transcriptomic signatures. However, many of these clustering workflows rely on user-tuned parameter values, tailored to each dataset, to identify a set of biologically relevant clusters. Whereas users often develop their own intuition as to the optimal range of parameters for clustering on each data set, the lack of systematic approaches to identify this range can be daunting to new users of any given workflow. In addition, an optimal parameter set does not guarantee that all clusters are equally well-resolved, given the heterogeneity in transcriptomic signatures in most biological systems. Results: Here, we illustrate a subsampling-based approach (chooseR) that simultaneously guides parameter selection and characterizes cluster robustness. Through bootstrapped iterative clustering across a range of parameters, chooseR was used to select parameter values for two distinct clustering workflows (Seurat and scVI). In each case, chooseR identified parameters that produced biologically relevant clusters from both well-characterized (human PBMC) and complex (mouse spinal cord) datasets. Moreover, it provided a simple “robustness score” for each of these clusters, facilitating the assessment of cluster quality. Conclusion: chooseR is a simple, conceptually understandable tool that can be used flexibly across clustering algorithms, workflows, and datasets to guide clustering parameter selection and characterize cluster robustness.
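
The general idea of subsampling-based robustness described above can be sketched as a co-clustering frequency: cluster many random subsamples and record how often each pair of cells lands in the same cluster when both are drawn. The code below is a loose illustration under stated assumptions and is not the chooseR implementation; the function `coclustering_frequency`, its parameters, the toy threshold clusterer, and the data are all hypothetical.

```python
import random
from itertools import combinations


def coclustering_frequency(data, cluster_fn, n_rounds=50, fraction=0.8, seed=0):
    """Fraction of co-sampled rounds in which each pair shares a cluster.

    data: list of observations; cluster_fn: callable mapping a list of
    observations to a list of cluster labels of the same length.
    """
    rng = random.Random(seed)
    n = len(data)
    size = max(2, int(round(fraction * n)))
    together, sampled = {}, {}
    for _ in range(n_rounds):
        idx = sorted(rng.sample(range(n), size))  # subsample without replacement
        labels = cluster_fn([data[i] for i in idx])
        for (a, la), (b, lb) in combinations(zip(idx, labels), 2):
            sampled[(a, b)] = sampled.get((a, b), 0) + 1
            if la == lb:
                together[(a, b)] = together.get((a, b), 0) + 1
    # Pairs that were never drawn together are simply absent from the result.
    return {pair: together.get(pair, 0) / count for pair, count in sampled.items()}


def threshold_clusters(values):
    """Toy stand-in for a real clustering workflow: split at a fixed cutoff."""
    return [int(v > 5) for v in values]


toy_cells = [0.1, 0.3, 0.2, 9.8, 10.1, 9.9]  # two well-separated groups
freq = coclustering_frequency(toy_cells, threshold_clusters)
```

Averaging these pairwise frequencies within a cluster gives one possible per-cluster stability score; repeating the procedure over a grid of clustering parameters would then allow parameter settings to be compared, which is the role the abstract assigns to the "robustness score".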

Revealing immune responses in the Mycobacterium avium subsp. paratuberculosis-infected THP-1 cells using single cell RNA-sequencing

PLoS ONE ◽

10.1371/journal.pone.0254194 ◽

2021 ◽

Vol 16 (7) ◽

pp. e0254194

Author(s):

Hong-Tae Park ◽

Woo Bin Park ◽

Suji Kim ◽

Jong-Sung Lim ◽

Gyoungju Nah ◽

...

Keyword(s):

Crohn's Disease ◽

Single Cell ◽

Mycobacterium Avium ◽

Expression Patterns ◽

Cell Types ◽

Marker Genes ◽

Specific Marker ◽

Cell Type ◽

Cytokines And Chemokines

Mycobacterium avium subsp. paratuberculosis (MAP) is a causative agent of Johne’s disease, which is a chronic and debilitating disease in ruminants. MAP is also considered to be a possible cause of Crohn’s disease in humans. However, few studies have focused on the interactions between MAP and human macrophages to elucidate the pathogenesis of Crohn’s disease. We sought to determine the initial responses of human THP-1 cells against MAP infection using single-cell RNA-seq analysis. Clustering analysis showed that THP-1 cells were divided into seven different clusters in response to phorbol-12-myristate-13-acetate (PMA) treatment. The characteristics of each cluster were investigated by identifying cluster-specific marker genes. From the results, we found that classically differentiated cells express CD14, CD36, and TLR2, and that this cell type showed the most active responses against MAP infection. The responses included the expression of proinflammatory cytokines and chemokines such as CCL4, CCL3, IL1B, IL8, and CCL20. In addition, the Mreg cell type, a novel cell type differentiated from THP-1 cells, was discovered. Thus, it is suggested that different cell types arise even when the same cell line is treated under the same conditions. Overall, analyzing gene expression patterns via scRNA-seq classification allows a more detailed observation of the response to infection by each cell type.

Infinity Flow: High-throughput single-cell quantification of 100s of proteins using conventional flow cytometry and machine learning

10.1101/2020.06.17.152926 ◽

2020 ◽

Author(s):

Etienne Becht ◽

Daniel Tolstrup ◽

Charles-Antoine Dutertre ◽

Florent Ginhoux ◽

Evan W. Newell ◽

...

Keyword(s):

Machine Learning ◽

Flow Cytometry ◽

Single Cell ◽

Low Cost ◽

Expression Patterns ◽

Cell Types ◽

Cellular Heterogeneity ◽

Supervised Machine Learning ◽

Melanoma Metastasis ◽

Immunologic Research

Abstract: Modern immunologic research increasingly requires high-dimensional analyses in order to understand the complex milieu of cell-types that comprise the tissue microenvironments of disease. To achieve this, we developed Infinity Flow, which combines hundreds of overlapping flow cytometry panels using machine learning to enable the simultaneous analysis of the co-expression patterns of 100s of surface-expressed proteins across millions of individual cells. In this study, we demonstrate that this approach allows comprehensive analysis of the cellular constituency of the steady-state murine lung and identification of novel cellular heterogeneity in the lungs of melanoma metastasis bearing mice. We show that by using supervised machine learning, Infinity Flow enhances the accuracy and depth of clustering or dimensionality reduction algorithms. Infinity Flow is a highly scalable, low-cost and accessible solution to single cell proteomics in complex tissues.

Impact of data preprocessing on cell-type clustering based on single-cell RNA-seq data

BMC Bioinformatics ◽

10.1186/s12859-020-03797-8 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Chunxiang Wang ◽

Xin Gao ◽

Juntao Liu

Keyword(s):

Single Cell ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Data Preprocessing ◽

Cell Types ◽

Rna Seq ◽

Cell Type ◽

Preprocessing Method ◽

Cell Clustering ◽

Cell Gene Expression

Abstract Background: Advances in single-cell RNA-seq technology have led to great opportunities for the quantitative characterization of cell types, and many clustering algorithms have been developed based on single-cell gene expression. However, we found that different data preprocessing methods show quite different effects on clustering algorithms. Moreover, there is no specific preprocessing method that is applicable to all clustering algorithms, and even for the same clustering algorithm, the best preprocessing method depends on the input data. Results: We designed a graph-based algorithm, SC3-e, specifically for discriminating the best data preprocessing method for SC3, which is currently the most widely used clustering algorithm for single cell clustering. When tested on eight frequently used single-cell RNA-seq data sets, SC3-e always accurately selects the best data preprocessing method for SC3 and therefore greatly enhances the clustering performance of SC3. Conclusion: The SC3-e algorithm is practically powerful for discriminating the best data preprocessing method, and therefore largely enhances the performance of cell-type clustering of SC3. It is expected to play a crucial role in the related studies of single-cell clustering, such as the studies of human complex diseases and discoveries of new cell types.