Accurate feature selection improves single-cell RNA-seq cell clustering

Author(s):  
Kenong Su ◽  
Tianwei Yu ◽  
Hao Wu

Abstract Cell clustering is one of the most important and commonly performed tasks in single-cell RNA sequencing (scRNA-seq) data analysis. An important step in cell clustering is to select a subset of genes (referred to as ‘features’) whose expression patterns are then used for downstream clustering. A good set of features should include the genes that distinguish different cell types, and the quality of such a set can have a significant impact on clustering accuracy. All existing scRNA-seq clustering tools include a feature selection step that relies on simple unsupervised feature selection methods, mostly based on the statistical moments of gene-wise expression distributions. In this work, we carefully evaluate the impact of feature selection on cell clustering accuracy. In addition, we develop a feature selection algorithm named FEAture SelecTion (FEAST), which provides more representative features. We apply the method to 12 public scRNA-seq datasets and demonstrate that using features selected by FEAST with existing clustering tools significantly improves clustering accuracy.
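
To make the "statistical moments" baseline concrete, here is a minimal sketch of one common moment-based heuristic: ranking genes by dispersion (variance divided by mean). This is not FEAST, whose procedure the abstract does not describe. The function name `top_dispersion_genes` and the toy count table are assumptions made for illustration only.

```python
from statistics import mean, pvariance


def top_dispersion_genes(counts, k):
    """Return the k genes with the highest variance-to-mean ratio.

    counts: dict mapping a gene name to its expression values across cells.
    This is an illustrative moment-based filter, not the FEAST algorithm.
    """
    scores = {}
    for gene, values in counts.items():
        mu = mean(values)
        if mu == 0:
            continue  # gene never detected, so it carries no signal
        scores[gene] = pvariance(values) / mu
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: four genes measured in six cells.
counts = {
    "geneA": [0, 0, 10, 12, 0, 11],  # on/off pattern, high dispersion
    "geneB": [5, 5, 5, 5, 5, 5],     # constant, zero dispersion
    "geneC": [1, 2, 1, 2, 1, 2],     # mild variation
    "geneD": [0, 0, 0, 0, 0, 0],     # undetected
}
selected = top_dispersion_genes(counts, 2)
print(selected)
```

A filter like this rewards genes that switch on and off between cells, but it has no notion of cell type, which is the limitation the abstract points to.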

2020 ◽  
Vol 32 (5) ◽  
pp. 111-120
Author(s):  
Maria Andreevna Akimenkova ◽  
Anna Anatolyevna Maznina ◽  
Anton Yurievich Naumov ◽  
Evgeny Andreevich Karpulevich

One of the main tasks in the analysis of single cell RNA sequencing (scRNA-seq) data is the identification of cell types and subtypes, which is usually based on some method of clustering. There are a number of generally accepted approaches to the clustering problem, one of which is implemented in the Seurat package. In addition, the quality of clustering is influenced by preprocessing steps such as imputation, dimensionality reduction, and feature selection. In this article, the HDBSCAN hierarchical clustering method is used to cluster scRNA-seq data. For a more complete comparison, experiments were carried out on two labeled datasets: Zeisel (3005 cells) and Romanov (2881 cells). To compare the quality of clustering, two external metrics were used: the Adjusted Rand index and the V-measure. The experiments demonstrated a higher quality of clustering by the HDBSCAN method on the Zeisel dataset and a poorer quality on the Romanov dataset.
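
As an illustration of the first of these external metrics, the following is a minimal standard-library sketch of the Adjusted Rand index, computed from the contingency table of true and predicted labels. It is not the authors' evaluation code. The function name `adjusted_rand_index` and the toy label vectors are assumptions, and the V-measure is not covered here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the partitions trivially agree

    # Pairs of cells grouped together in both labelings, in the true
    # labeling only, and in the predicted labeling only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = in_true * in_pred / total_pairs  # agreement expected by chance
    maximum = (in_true + in_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (both - expected) / (maximum - expected)


# Same partition under different cluster names: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partition that splits every true cluster: worse than chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

The index is 1 for identical partitions regardless of how clusters are named, is close to 0 for random assignments, and can be negative when agreement is below chance.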


2021 ◽  
Vol 12 ◽  
Author(s):  
Dhondup Lhamo ◽  
Sheng Luan

Nitrogen (N), phosphorus (P), and potassium (K) are three major macronutrients essential for plant life. These nutrients are acquired and transported by several large families of transporters expressed in plant roots. However, it remains largely unknown how these transporters are distributed in different cell-types that work together to transfer the nutrients from the soil to different layers of root cells and eventually reach vasculature for massive flow. Using the single cell transcriptomics data from Arabidopsis roots, we profiled the transcriptional patterns of putative nutrient transporters in different root cell-types. Such analyses identified a number of uncharacterized NPK transporters expressed in the root epidermis to mediate NPK uptake and distribution to the adjacent cells. Some transport genes showed cortex- and endodermis-specific expression to direct the nutrient flow toward the vasculature. For long-distance transport, a variety of transporters were shown to express and potentially function in the xylem and phloem. In the context of subcellular distribution of mineral nutrients, the NPK transporters at subcellular compartments were often found to show ubiquitous expression patterns, which suggests function in house-keeping processes. Overall, these single cell transcriptomic analyses provide working models of nutrient transport from the epidermis across the cortex to the vasculature, which can be further tested experimentally in the future.


2017 ◽  
Author(s):  
Yuan Cao ◽  
Junjie Zhu ◽  
Guangchun Han ◽  
Peilin Jia ◽  
Zhongming Zhao

Abstract Summary: Single-cell RNA sequencing (scRNA-Seq) is quickly becoming a powerful tool for high-throughput transcriptomic analysis of cell states and dynamics. Both the number and quality of scRNA-Seq datasets have dramatically increased recently. So far, there is no database that comprehensively collects and curates scRNA-Seq data in humans. Here, we present scRNASeqDB, a database that includes almost all the currently available human single cell transcriptome datasets (n = 36), covering 71 human cell lines or types and 8910 samples. Our online web interface allows users to query and visualize expression profiles of the gene(s) of interest, search for genes that are expressed in different cell types or groups, or retrieve differentially expressed genes between cell types or groups. The scRNASeqDB is a valuable resource for single cell transcriptional studies. Availability: The database is available at https://bioinfo.uth.edu/scrnaseqdb/. Contact: [email protected]


2021 ◽  
Vol 22 (S3) ◽  
Author(s):  
Yuanyuan Li ◽  
Ping Luo ◽  
Yi Lu ◽  
Fang-Xiang Wu

Abstract Background With the development of single-cell sequencing technology, revealing homogeneity and heterogeneity between cells has become a new area of computational systems biology research. However, the clustering of cell types becomes more complex with the mutual penetration between different types of cells and the instability of gene expression. One way of overcoming this problem is to group similar, related single cells together by means of various clustering analysis methods. Although some methods such as spectral clustering can do well in the identification of cell types, they only consider the similarities between cells and ignore the influence of dissimilarities on clustering results. This may limit the performance of most conventional clustering algorithms in identifying clusters, so special methods need to be developed for high-dimensional sparse categorical data. Results Inspired by the observation that cells of the same type have similar gene expression patterns, whereas cells of different types have dissimilar gene expression patterns, we improve the existing spectral clustering method for single-cell data by basing it on both similarities and dissimilarities between cells. The method first measures the similarity and dissimilarity among cells, then constructs the incidence matrix by fusing the similarity matrix with the dissimilarity matrix, and finally uses the eigenvalues of the incidence matrix to perform dimensionality reduction and employs the K-means algorithm in the low-dimensional space to achieve clustering. The proposed improved spectral clustering method is compared with the conventional spectral clustering method in recognizing cell types on several real single-cell RNA-seq datasets.
Conclusions In summary, we show that adding intercellular dissimilarity can effectively improve accuracy and achieve robustness, and that the improved spectral clustering method outperforms the traditional spectral clustering method in grouping cells.


2021 ◽  
Vol 1738 ◽  
pp. 012078
Author(s):  
Yaxuan Cui ◽  
Kunjie Luo ◽  
Zheyu Zhang ◽  
Saijia Liu

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Anna S. E. Cuomo ◽  
Giordano Alvari ◽  
Christina B. Azodi ◽  
Davis J. McCarthy ◽  
Marc Jan Bonder ◽  
...  

Abstract Background Single-cell RNA sequencing (scRNA-seq) has enabled the unbiased, high-throughput quantification of gene expression specific to cell types and states. With the cost of scRNA-seq decreasing and techniques for sample multiplexing improving, population-scale scRNA-seq, and thus single-cell expression quantitative trait locus (sc-eQTL) mapping, is increasingly feasible. Mapping of sc-eQTL provides additional resolution to study the regulatory role of common genetic variants on gene expression across a plethora of cell types and states and promises to improve our understanding of genetic regulation across tissues in both health and disease. Results While previously established methods for bulk eQTL mapping can, in principle, be applied to sc-eQTL mapping, there are a number of open questions about how best to process scRNA-seq data and adapt bulk methods to optimize sc-eQTL mapping. Here, we evaluate the role of different normalization and aggregation strategies, covariate adjustment techniques, and multiple testing correction methods to establish best practice guidelines. We use both real and simulated datasets across single-cell technologies to systematically assess the impact of these different statistical approaches. Conclusion We provide recommendations for future single-cell eQTL studies that can yield up to twice as many eQTL discoveries as default approaches ported from bulk studies.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ryan B. Patterson-Cross ◽  
Ariel J. Levine ◽  
Vilas Menon

Abstract Background Generating and analysing single-cell data has become a widespread approach to examine tissue heterogeneity, and numerous algorithms exist for clustering these datasets to identify putative cell types with shared transcriptomic signatures. However, many of these clustering workflows rely on user-tuned parameter values, tailored to each dataset, to identify a set of biologically relevant clusters. Whereas users often develop their own intuition as to the optimal range of parameters for clustering on each data set, the lack of systematic approaches to identify this range can be daunting to new users of any given workflow. In addition, an optimal parameter set does not guarantee that all clusters are equally well-resolved, given the heterogeneity in transcriptomic signatures in most biological systems. Results Here, we illustrate a subsampling-based approach (chooseR) that simultaneously guides parameter selection and characterizes cluster robustness. Through bootstrapped iterative clustering across a range of parameters, chooseR was used to select parameter values for two distinct clustering workflows (Seurat and scVI). In each case, chooseR identified parameters that produced biologically relevant clusters from both well-characterized (human PBMC) and complex (mouse spinal cord) datasets. Moreover, it provided a simple “robustness score” for each of these clusters, facilitating the assessment of cluster quality. Conclusion chooseR is a simple, conceptually understandable tool that can be used flexibly across clustering algorithms, workflows, and datasets to guide clustering parameter selection and characterize cluster robustness.
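
The idea of scoring robustness through repeated clustering of subsamples can be sketched as follows. This is only an illustration of the general co-clustering-frequency approach, not the chooseR implementation. The function `co_clustering_frequency`, the toy `threshold_cluster` stand-in for a real clustering workflow, and the 80% subsampling without replacement are all assumptions.

```python
import random
from itertools import combinations


def co_clustering_frequency(items, cluster_fn, n_rounds=50, fraction=0.8, seed=0):
    """Fraction of subsamples in which each pair of items shares a cluster.

    cluster_fn takes a list of items and returns a dict item -> cluster label.
    Frequencies near 0 or 1 indicate stable assignments. Intermediate values
    indicate pairs whose grouping depends on which cells were sampled.
    """
    items = sorted(items)
    rng = random.Random(seed)
    together = {pair: 0 for pair in combinations(items, 2)}
    sampled = dict.fromkeys(together, 0)
    size = max(2, int(fraction * len(items)))
    for _ in range(n_rounds):
        subset = sorted(rng.sample(items, size))
        labels = cluster_fn(subset)
        for pair in combinations(subset, 2):
            sampled[pair] += 1
            if labels[pair[0]] == labels[pair[1]]:
                together[pair] += 1
    # Report only pairs that were drawn together at least once.
    return {p: together[p] / sampled[p] for p in together if sampled[p]}


# Hypothetical one-dimensional "expression" values for six cells.
values = {"c1": 0.5, "c2": 1.0, "c3": 1.5, "c4": 8.0, "c5": 9.0, "c6": 9.5}


def threshold_cluster(subset):
    """Toy stand-in for a clustering workflow: split cells at a fixed cutoff."""
    return {cell: int(values[cell] > 5.0) for cell in subset}


freq = co_clustering_frequency(list(values), threshold_cluster)
print(freq)
```

In a real workflow, `cluster_fn` would wrap the clustering tool at one parameter setting, and the procedure would be repeated across settings to compare how stable the resulting clusters are.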


PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254194
Author(s):  
Hong-Tae Park ◽  
Woo Bin Park ◽  
Suji Kim ◽  
Jong-Sung Lim ◽  
Gyoungju Nah ◽  
...  

Mycobacterium avium subsp. paratuberculosis (MAP) is a causative agent of Johne’s disease, which is a chronic and debilitating disease in ruminants. MAP is also considered to be a possible cause of Crohn’s disease in humans. However, few studies have focused on the interactions between MAP and human macrophages to elucidate the pathogenesis of Crohn’s disease. We sought to determine the initial responses of human THP-1 cells against MAP infection using single-cell RNA-seq analysis. Clustering analysis showed that THP-1 cells were divided into seven different clusters in response to phorbol-12-myristate-13-acetate (PMA) treatment. The characteristics of each cluster were investigated by identifying cluster-specific marker genes. From the results, we found that classically differentiated cells express CD14, CD36, and TLR2, and that this cell type showed the most active responses against MAP infection. The responses included the expression of proinflammatory cytokines and chemokines such as CCL4, CCL3, IL1B, IL8, and CCL20. In addition, the Mreg cell type, a novel cell type differentiated from THP-1 cells, was discovered. Thus, it is suggested that different cell types arise even when the same cell line is treated under the same conditions. Overall, analyzing gene expression patterns via scRNA-seq classification allows a more detailed observation of the response to infection by each cell type.


2020 ◽  
Author(s):  
Etienne Becht ◽  
Daniel Tolstrup ◽  
Charles-Antoine Dutertre ◽  
Florent Ginhoux ◽  
Evan W. Newell ◽  
...  

Abstract Modern immunologic research increasingly requires high-dimensional analyses in order to understand the complex milieu of cell types that comprise the tissue microenvironments of disease. To achieve this, we developed Infinity Flow, which combines hundreds of overlapping flow cytometry panels using machine learning to enable the simultaneous analysis of the co-expression patterns of hundreds of surface-expressed proteins across millions of individual cells. In this study, we demonstrate that this approach allows the comprehensive analysis of the cellular constituency of the steady-state murine lung and the identification of novel cellular heterogeneity in the lungs of melanoma metastasis-bearing mice. We show that by using supervised machine learning, Infinity Flow enhances the accuracy and depth of clustering or dimensionality reduction algorithms. Infinity Flow is a highly scalable, low-cost and accessible solution to single cell proteomics in complex tissues.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Chunxiang Wang ◽  
Xin Gao ◽  
Juntao Liu

Abstract Background Advances in single-cell RNA-seq technology have led to great opportunities for the quantitative characterization of cell types, and many clustering algorithms have been developed based on single-cell gene expression. However, we found that different data preprocessing methods show quite different effects on clustering algorithms. Moreover, there is no specific preprocessing method that is applicable to all clustering algorithms, and even for the same clustering algorithm, the best preprocessing method depends on the input data. Results We designed a graph-based algorithm, SC3-e, specifically for discriminating the best data preprocessing method for SC3, which is currently the most widely used clustering algorithm for single cell clustering. When tested on eight frequently used single-cell RNA-seq data sets, SC3-e always accurately selects the best data preprocessing method for SC3 and therefore greatly enhances the clustering performance of SC3. Conclusion The SC3-e algorithm is practically powerful for discriminating the best data preprocessing method, and therefore largely enhances the performance of cell-type clustering of SC3. It is expected to play a crucial role in the related studies of single-cell clustering, such as the studies of human complex diseases and discoveries of new cell types.

