scholarly journals Deep soft K-means clustering with self-training for single-cell RNA sequence data

2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Liang Chen ◽  
Weinan Wang ◽  
Yuyao Zhai ◽  
Minghua Deng

Abstract Single-cell RNA sequencing (scRNA-seq) allows researchers to study cell heterogeneity at the cellular level. A crucial step in analyzing scRNA-seq data is to cluster cells into subpopulations to facilitate subsequent downstream analysis. However, frequent dropout events and increasing size of scRNA-seq data make clustering such high-dimensional, sparse and massive transcriptional expression profiles challenging. Although some existing deep learning-based clustering algorithms for single cells combine dimensionality reduction with clustering, they either ignore the distance and affinity constraints between similar cells or make some additional latent space assumptions like mixture Gaussian distribution, failing to learn cluster-friendly low-dimensional space. Therefore, in this paper, we combine the deep learning technique with the use of a denoising autoencoder to characterize scRNA-seq data while propose a soft self-training K-means algorithm to cluster the cell population in the learned latent space. The self-training procedure can effectively aggregate the similar cells and pursue more cluster-friendly latent space. Our method, called ‘scziDesk’, alternately performs data compression, data reconstruction and soft clustering iteratively, and the results exhibit excellent compatibility and robustness in both simulated and real data. Moreover, our proposed method has perfect scalability in line with cell size on large-scale datasets.

2020 ◽  
Vol 36 (9) ◽  
pp. 2778-2786 ◽  
Author(s):  
Shobana V Stassen ◽  
Dickson M D Siu ◽  
Kelvin C M Lee ◽  
Joshua W K Ho ◽  
Hayden K H So ◽  
...  

Abstract Motivation New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods are inadequately scalable to large datasets and therefore cannot uncover the complex cellular heterogeneity. Results We introduce a highly scalable graph-based clustering algorithm PARC—Phenotyping by Accelerated Refined Community-partitioning—for large-scale, high-dimensional single-cell data (>1 million cells). Using large single-cell flow and mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC consistently outperforms state-of-the-art clustering algorithms without subsampling of cells, including Phenograph, FlowSOM and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell dataset of 1.1 million cells within 13 min, compared with >2 h for the next fastest graph-clustering algorithm. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis. Availability and implementation https://github.com/ShobiStassen/PARC. Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Yue Deng ◽  
Feng Bao ◽  
Qionghai Dai ◽  
Lani F. Wu ◽  
Steven J. Altschuler

Recent advances in large-scale single cell RNA-seq enable fine-grained characterization of phenotypically distinct cellular states within heterogeneous tissues. We present scScope, a scalable deep-learning based approach that can accurately and rapidly identify cell-type composition from millions of noisy single-cell gene-expression profiles.


2020 ◽  
Author(s):  
Noa Liscovitch-Brauer ◽  
Antonino Montalbano ◽  
Jiale Deng ◽  
Alejandro Méndez-Mancilla ◽  
Hans-Hermann Wessels ◽  
...  

AbstractPooled CRISPR screens have been used to identify genes responsible for specific phenotypes and diseases, and, more recently, to connect genetic perturbations with multi-dimensional gene expression profiles. Here, we describe a method to link genome-wide chromatin accessibility to genetic perturbations in single cells. This scalable, cost-effective method combines pooled CRISPR perturbations with a single-cell combinatorial indexing assay for transposase-accessible chromatin (CRISPR-sciATAC). Using a human and mouse species-mixing experiment, we show that CRISPR-sciATAC separates single cells with a low doublet rate. Then, in human myelogenous leukemia cells, we apply CRISPR-sciATAC to target 21 chromatin-related genes that are frequently mutated in cancer and 84 subunits and cofactors of chromatin remodeling complexes, generating chromatin accessibility data for ~30,000 single cells. Using this large-scale atlas, we correlate loss of specific chromatin remodelers with changes in accessibility — globally and at the binding sites of individual transcription factors. For example, we show that loss of the H3K27 methyltransferase EZH2 leads to increased accessibility at heterochromatic regions involved in embryonic development and triggers expression of multiple genes in the HOXA and HOXD clusters. At a subset of regulatory sites, we also analyze dynamic changes in nucleosome spacing upon loss of chromatin remodelers. CRISPR-sciATAC is a high-throughput, low-cost single-cell method that can be applied broadly to study the role of genetic perturbations on chromatin in normal and disease states.


2019 ◽  
Author(s):  
Shobana V. Stassen ◽  
Dickson M. D. Siu ◽  
Kelvin C. M. Lee ◽  
Joshua W. K. Ho ◽  
Hayden K. H. So ◽  
...  

AbstractMotivationNew single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods are inadequately scalable to large datasets and therefore cannot uncover the complex cellular heterogeneity.ResultsWe introduce a highly scalable graph-based clustering algorithm PARC - phenotyping by accelerated refined community-partitioning – for ultralarge-scale, high-dimensional single-cell data (> 1 million cells). Using large single cell mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC consistently outperforms state-of-the-art clustering algorithms without sub-sampling of cells, including Phenograph, FlowSOM, and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single cell data set of 1.1M cells within 13 minutes, compared to >2 hours to the next fastest graph-clustering algorithm, Phenograph. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis.Availability and Implementationhttps://github.com/ShobiStassen/PARC


2019 ◽  
Vol 35 (22) ◽  
pp. 4688-4695 ◽  
Author(s):  
Rui Hou ◽  
Elena Denisenko ◽  
Alistair R R Forrest

Abstract Motivation Single-cell RNA sequencing (scRNA-seq) measures gene expression at the resolution of individual cells. Massively multiplexed single-cell profiling has enabled large-scale transcriptional analyses of thousands of cells in complex tissues. In most cases, the true identity of individual cells is unknown and needs to be inferred from the transcriptomic data. Existing methods typically cluster (group) cells based on similarities of their gene expression profiles and assign the same identity to all cells within each cluster using the averaged expression levels. However, scRNA-seq experiments typically produce low-coverage sequencing data for each cell, which hinders the clustering process. Results We introduce scMatch, which directly annotates single cells by identifying their closest match in large reference datasets. We used this strategy to annotate various single-cell datasets and evaluated the impacts of sequencing depth, similarity metric and reference datasets. We found that scMatch can rapidly and robustly annotate single cells with comparable accuracy to another recent cell annotation tool (SingleR), but that it is quicker and can handle larger reference datasets. We demonstrate how scMatch can handle large customized reference gene expression profiles that combine data from multiple sources, thus empowering researchers to identify cell populations in any complex tissue with the desired precision. Availability and implementation scMatch (Python code) and the FANTOM5 reference dataset are freely available to the research community here https://github.com/forrest-lab/scMatch. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 23 (1) ◽  
Author(s):  
Bhupinder Pal ◽  
Yunshun Chen ◽  
Michael J. G. Milevskiy ◽  
François Vaillant ◽  
Lexie Prokopuk ◽  
...  

Abstract Background Heterogeneity within the mouse mammary epithelium and potential lineage relationships have been recently explored by single-cell RNA profiling. To further understand how cellular diversity changes during mammary ontogeny, we profiled single cells from nine different developmental stages spanning late embryogenesis, early postnatal, prepuberty, adult, mid-pregnancy, late-pregnancy, and post-involution, as well as the transcriptomes of micro-dissected terminal end buds (TEBs) and subtending ducts during puberty. Methods The single cell transcriptomes of 132,599 mammary epithelial cells from 9 different developmental stages were determined on the 10x Genomics Chromium platform, and integrative analyses were performed to compare specific time points. Results The mammary rudiment at E18.5 closely aligned with the basal lineage, while prepubertal epithelial cells exhibited lineage segregation but to a less differentiated state than their adult counterparts. Comparison of micro-dissected TEBs versus ducts showed that luminal cells within TEBs harbored intermediate expression profiles. Ductal basal cells exhibited increased chromatin accessibility of luminal genes compared to their TEB counterparts suggesting that lineage-specific chromatin is established within the subtending ducts during puberty. An integrative analysis of five stages spanning the pregnancy cycle revealed distinct stage-specific profiles and the presence of cycling basal, mixed-lineage, and 'late' alveolar intermediates in pregnancy. Moreover, a number of intermediates were uncovered along the basal-luminal progenitor cell axis, suggesting a continuum of alveolar-restricted progenitor states. Conclusions This extended single cell transcriptome atlas of mouse mammary epithelial cells provides the most complete coverage for mammary epithelial cells during morphogenesis to date. Together with chromatin accessibility analysis of TEB structures, it represents a valuable framework for understanding developmental decisions within the mouse mammary gland.


2020 ◽  
Author(s):  
Feng Tian ◽  
Fan Zhou ◽  
Xiang Li ◽  
Wenping Ma ◽  
Honggui Wu ◽  
...  

SummaryBy circumventing cellular heterogeneity, single cell omics have now been widely utilized for cell typing in human tissues, culminating with the undertaking of human cell atlas aimed at characterizing all human cell types. However, more important are the probing of gene regulatory networks, underlying chromatin architecture and critical transcription factors for each cell type. Here we report the Genomic Architecture of Cells in Tissues (GeACT), a comprehensive genomic data base that collectively address the above needs with the goal of understanding the functional genome in action. GeACT was made possible by our novel single-cell RNA-seq (MALBAC-DT) and ATAC-seq (METATAC) methods of high detectability and precision. We exemplified GeACT by first studying representative organs in human mid-gestation fetus. In particular, correlated gene modules (CGMs) are observed and found to be cell-type-dependent. We linked gene expression profiles to the underlying chromatin states, and found the key transcription factors for representative CGMs.HighlightsGenomic Architecture of Cells in Tissues (GeACT) data for human mid-gestation fetusDetermining correlated gene modules (CGMs) in different cell types by MALBAC-DTMeasuring chromatin open regions in single cells with high detectability by METATACIntegrating transcriptomics and chromatin accessibility to reveal key TFs for a CGM


2021 ◽  
Vol 9 (Suppl 1) ◽  
pp. A12.1-A12
Author(s):  
Y Arjmand Abbassi ◽  
N Fang ◽  
W Zhu ◽  
Y Zhou ◽  
Y Chen ◽  
...  

Recent advances of high-throughput single cell sequencing technologies have greatly improved our understanding of the complex biological systems. Heterogeneous samples such as tumor tissues commonly harbor cancer cell-specific genetic variants and gene expression profiles, both of which have been shown to be related to the mechanisms of disease development, progression, and responses to treatment. Furthermore, stromal and immune cells within tumor microenvironment interact with cancer cells to play important roles in tumor responses to systematic therapy such as immunotherapy or cell therapy. However, most current high-throughput single cell sequencing methods detect only gene expression levels or epigenetics events such as chromatin conformation. The information on important genetic variants including mutation or fusion is not captured. To better understand the mechanisms of tumor responses to systematic therapy, it is essential to decipher the connection between genotype and gene expression patterns of both tumor cells and cells in the tumor microenvironment. We developed FocuSCOPE, a high-throughput multi-omics sequencing solution that can detect both genetic variants and transcriptome from same single cells. FocuSCOPE has been used to successfully perform single cell analysis of both gene expression profiles and point mutations, fusion genes, or intracellular viral sequences from thousands of cells simultaneously, delivering comprehensive insights of tumor and immune cells in tumor microenvironment at single cell resolution.Disclosure InformationY. Arjmand Abbassi: None. N. Fang: None. W. Zhu: None. Y. Zhou: None. Y. Chen: None. U. Deutsch: None.


2016 ◽  
Author(s):  
George Dimitriadis ◽  
Joana Neto ◽  
Adam R. Kampff

AbstractElectrophysiology is entering the era of ‘Big Data’. Multiple probes, each with hundreds to thousands of individual electrodes, are now capable of simultaneously recording from many brain regions. The major challenge confronting these new technologies is transforming the raw data into physiologically meaningful signals, i.e. single unit spikes. Sorting the spike events of individual neurons from a spatiotemporally dense sampling of the extracellular electric field is a problem that has attracted much attention [22, 23], but is still far from solved. Current methods still rely on human input and thus become unfeasible as the size of the data sets grow exponentially.Here we introduce the t-student stochastic neighbor embedding (t-sne) dimensionality reduction method [27] as a visualization tool in the spike sorting process. T-sne embeds the n-dimensional extracellular spikes (n = number of features by which each spike is decomposed) into a low (usually two) dimensional space. We show that such embeddings, even starting from different feature spaces, form obvious clusters of spikes that can be easily visualized and manually delineated with a high degree of precision. We propose that these clusters represent single units and test this assertion by applying our algorithm on labeled data sets both from hybrid [23] and paired juxtacellular/extracellular recordings [15]. We have released a graphical user interface (gui) written in python as a tool for the manual clustering of the t-sne embedded spikes and as a tool for an informed overview and fast manual curration of results from other clustering algorithms. Furthermore, the generated visualizations offer evidence in favor of the use of probes with higher density and smaller electrodes. They also graphically demonstrate the diverse nature of the sorting problem when spikes are recorded with different methods and arise from regions with different background spiking statistics.


2019 ◽  
Author(s):  
Ning Wang ◽  
Andrew E. Teschendorff

AbstractInferring the activity of transcription factors in single cells is a key task to improve our understanding of development and complex genetic diseases. This task is, however, challenging due to the relatively large dropout rate and noisy nature of single-cell RNA-Seq data. Here we present a novel statistical inference framework called SCIRA (Single Cell Inference of Regulatory Activity), which leverages the power of large-scale bulk RNA-Seq datasets to infer high-quality tissue-specific regulatory networks, from which regulatory activity estimates in single cells can be subsequently obtained. We show that SCIRA can correctly infer regulatory activity of transcription factors affected by high technical dropouts. In particular, SCIRA can improve sensitivity by as much as 70% compared to differential expression analysis and current state-of-the-art methods. Importantly, SCIRA can reveal novel regulators of cell-fate in tissue-development, even for cell-types that only make up 5% of the tissue, and can identify key novel tumor suppressor genes in cancer at single cell resolution. In summary, SCIRA will be an invaluable tool for single-cell studies aiming to accurately map activity patterns of key transcription factors during development, and how these are altered in disease.


Sign in / Sign up

Export Citation Format

Share Document