Imputing single-cell RNA-seq data by considering cell heterogeneity and prior expression of dropouts

Journal of Molecular Cell Biology ◽

10.1093/jmcb/mjaa052 ◽

2020 ◽

Author(s):

Lihua Zhang ◽

Shihua Zhang

Keyword(s):

Single Cell ◽

Expression Patterns ◽

Dropout Rate ◽

Population Based ◽

Low Rank ◽

Cell Heterogeneity ◽

RNA Seq ◽

A Cell ◽

Low Dimensional ◽

The Relationship

Abstract Single-cell RNA sequencing (scRNA-seq) provides a powerful tool to determine expression patterns of thousands of individual cells. However, the analysis of scRNA-seq data remains a computational challenge due to the high technical noise such as the presence of dropout events that lead to a large proportion of zeros for expressed genes. Taking into account the cell heterogeneity and the relationship between dropout rate and expected expression level, we present a cell sub-population based bounded low-rank (PBLR) method to impute the dropouts of scRNA-seq data. Through application to both simulated and real scRNA-seq datasets, PBLR is shown to be effective in recovering dropout events, and it can dramatically improve the low-dimensional representation and the recovery of gene‒gene relationships masked by dropout events compared to several state-of-the-art methods. Moreover, PBLR also detects accurate and robust cell sub-populations automatically, shedding light on its flexibility and generality for scRNA-seq data analysis.

PBLR: an accurate single cell RNA-seq data imputation tool considering cell heterogeneity and prior expression level of dropouts

10.1101/379883 ◽

2018 ◽

Cited By ~ 6

Author(s):

Lihua Zhang ◽

Shihua Zhang

Keyword(s):

Data Analysis ◽

Single Cell ◽

Expression Patterns ◽

Dropout Rate ◽

Structural Effect ◽

Cell Heterogeneity ◽

Novel Method ◽

Cell Subpopulations ◽

Low Dimensional ◽

Precise Expression

Abstract Single-cell RNA sequencing (scRNA-seq) provides a powerful tool to determine precise expression patterns of tens of thousands of individual cells and to decipher cell heterogeneity and cell subpopulations. However, scRNA-seq data analysis remains challenging due to various sources of technical noise, e.g., the presence of dropout events (i.e., excess zero counts). Taking into account cell heterogeneity and the structural effect of expression on dropout rate, we propose a novel method named PBLR to accurately impute the dropouts of scRNA-seq data. PBLR is an effective tool to recover dropout events on both simulated and real scRNA-seq datasets, and can dramatically improve low-dimensional representation and recovery of gene-gene relationships masked by dropout events compared to several state-of-the-art methods. Moreover, PBLR also detects accurate and robust cell subpopulations automatically, shedding light on its flexibility and generality for scRNA-seq data analysis.

A probabilistic model-based bi-clustering method for single-cell transcriptomic data analysis

10.1101/181362 ◽

2017 ◽

Cited By ~ 2

Author(s):

Sha Cao ◽

Tao Sheng ◽

Xin Chen ◽

Qin Ma ◽

Chi Zhang

Keyword(s):

Single Cell ◽

Expression Patterns ◽

Cell Types ◽

Computational Techniques ◽

RNA Seq ◽

Transcriptomic Data ◽

Large Numbers ◽

A Cell ◽

Multiple Cell ◽

Cell Data

Abstract We present here novel computational techniques for tackling four problems related to analyses of single-cell RNA-Seq data: (1) a mixture model for coping with multiple cell types in a cell population; (2) a truncated model for handling the unquantifiable errors caused by large numbers of zeros or low-expression values; (3) a bi-clustering technique for detection of sub-populations of cells sharing common expression patterns among subsets of genes; and (4) detection of small cell sub-populations with distinct expression patterns. Through case studies, we demonstrated that these techniques can derive high-resolution information from single-cell data that cannot feasibly be obtained using existing techniques.

ExperimentSubset: an R package to manage subsets of Bioconductor Experiment objects

Bioinformatics ◽

10.1093/bioinformatics/btab179 ◽

2021 ◽

Author(s):

Irzam Sarfraz ◽

Muhammad Asif ◽

Joshua D Campbell

Keyword(s):

Single Cell ◽

R Package ◽

Poor Quality ◽

Data Matrix ◽

Supplementary Information ◽

Data Provenance ◽

RNA Seq ◽

Efficient Management ◽

The Matrix ◽

The Relationship

Abstract Motivation: R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying downstream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of the assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance. Results: To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, ExperimentSubset is a flexible container for the efficient management of subsets. Availability and implementation: The ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset. Supplementary information: Supplementary data are available at Bioinformatics online.
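The idea of keeping subsets as references to a parent assay, rather than as copies, can be illustrated with a minimal Python sketch. This is not the ExperimentSubset API (the package is written in R); the class name `SubsetStore` and its methods are hypothetical and only show how storing row and column indices preserves the link to the parent matrix.

```python
class SubsetStore:
    """Toy container (hypothetical, not the ExperimentSubset API).

    Holds one parent matrix and any number of named subsets. Each subset
    is stored only as row and column indices into the parent, so no copy
    of the assay values is made and the provenance stays explicit.
    """

    def __init__(self, parent):
        self.parent = parent   # list of rows, e.g. genes x cells
        self.subsets = {}      # name -> (row indices, column indices)

    def add_subset(self, name, rows, cols):
        # Record the indices only; the parent matrix is left untouched.
        self.subsets[name] = (list(rows), list(cols))

    def get(self, name):
        # Materialize the subset on demand from the parent matrix.
        rows, cols = self.subsets[name]
        return [[self.parent[r][c] for c in cols] for r in rows]


store = SubsetStore([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
store.add_subset("filtered", rows=[0, 2], cols=[1, 2])
print(store.get("filtered"))
```

Because a subset is just a pair of index lists, its memory cost is small compared with a copied matrix, which is the trade-off the abstract describes.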

Clustering Single-Cell RNA-Seq Data with Regularized Gaussian Graphical Model

Genes ◽

10.3390/genes12020311 ◽

2021 ◽

Vol 12 (2) ◽

pp. 311

Author(s):

Zhenqiu Liu

Keyword(s):

Single Cell ◽

Free Parameter ◽

Graphical Model ◽

Expression Patterns ◽

Information Criterion ◽

Log P ◽

RNA Seq ◽

Clustering Methods ◽

Wide Range ◽

Free Parameters

Single-cell RNA-seq (scRNA-seq) is a powerful tool to measure the expression patterns of individual cells and discover heterogeneity and functional diversity among cell populations. Due to variability, it is challenging to analyze such data efficiently. Many clustering methods have been developed using at least one free parameter. Different choices for free parameters may lead to substantially different visualizations and clusters. Tuning free parameters is also time-consuming. Thus, there is a need for a simple, robust, and efficient clustering method. In this paper, we propose a new regularized Gaussian graphical clustering (RGGC) method for scRNA-seq data. RGGC is based on high-order (partial) correlations and subspace learning, and is robust over a wide range of the regularization parameter λ. Therefore, we can simply set λ=2 or λ=log(p) for AIC (Akaike information criterion) or BIC (Bayesian information criterion), respectively, without cross-validation. Cell subpopulations are discovered by the Louvain community detection algorithm, which determines the number of clusters automatically. There is no free parameter to be tuned with RGGC. When evaluated with simulated and benchmark scRNA-seq data sets against widely used methods, RGGC is computationally efficient and one of the top performers. It can detect inter-sample cell heterogeneity when applied to glioblastoma scRNA-seq data.
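The fixed choice of λ described in this abstract can be written down as a small Python helper. This is only a sketch of the stated rule (λ=2 for AIC, λ=log(p) for BIC); the function name `rggc_penalty` is hypothetical, and the abstract does not define p, so the helper simply takes it as an argument.

```python
import math


def rggc_penalty(criterion, p):
    """Return the regularization parameter named in the abstract.

    Hypothetical helper: 2 for AIC, log(p) for BIC. The meaning of `p`
    is not defined in the abstract, so it is passed through as given.
    """
    if criterion == "AIC":
        return 2.0
    if criterion == "BIC":
        return math.log(p)
    raise ValueError(f"unknown criterion: {criterion!r}")


print(rggc_penalty("AIC", 2000))
print(rggc_penalty("BIC", 2000))
```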

MEDALT: single-cell copy number lineage tracing enabling gene discovery

Genome Biology ◽

10.1186/s13059-021-02291-5 ◽

2021 ◽

Vol 22 (1) ◽

Cited By ~ 2

Author(s):

Fang Wang ◽

Qihan Wang ◽

Vakul Mohanty ◽

Shaoheng Liang ◽

Jinzhuang Dou ◽

...

Keyword(s):

Breast Cancer ◽

Single Cell ◽

Copy Number ◽

Speciation Analysis ◽

Population Based ◽

Lineage Tracing ◽

Breast Cancer Patients ◽

Lineage Tree ◽

History Of ◽

A Cell

Abstract We present a Minimal Event Distance Aneuploidy Lineage Tree (MEDALT) algorithm that infers the evolutionary history of a cell population based on single-cell copy number (SCCN) profiles, and a statistical routine named lineage speciation analysis (LSA), which facilitates discovery of fitness-associated alterations and genes from SCCN lineage trees. MEDALT appears more accurate than phylogenetic approaches in reconstructing copy number lineage. Using data from 20 triple-negative breast cancer patients, our approaches effectively prioritize genes that are essential for breast cancer cell fitness and predict patient survival, including those implicating convergent evolution. The source code of our study is available at https://github.com/KChen-lab/MEDALT.

Single-Cell Transcriptomics Reveals the Expression of Aging- and Senescence-Associated Genes in Distinct Cancer Cell Populations

Cells ◽

10.3390/cells10113126 ◽

2021 ◽

Vol 10 (11) ◽

pp. 3126

Author(s):

Dominik Saul ◽

Robyn Laura Kosinsky

Keyword(s):

Single Cell ◽

Myelogenous Leukemia ◽

Ductal Adenocarcinoma ◽

Cell Populations ◽

RNA Seq ◽

Healthy Control ◽

Transcriptional Changes ◽

The Relationship ◽

Senescence Associated Genes

The human aging process is associated with molecular changes and cellular degeneration, resulting in a significant increase in cancer incidence with age. Despite their potential correlation, the relationship between cancer- and aging-related transcriptional changes is largely unknown. In this study, we aimed to analyze aging-associated transcriptional patterns in publicly available bulk mRNA-seq and single-cell RNA-seq (scRNA-seq) datasets for chronic myelogenous leukemia (CML), colorectal cancer (CRC), hepatocellular carcinoma (HCC), lung cancer (LC), and pancreatic ductal adenocarcinoma (PDAC). Indeed, we detected that various aging/senescence-induced genes (ASIGs) were upregulated in malignant diseases compared to healthy control samples. To elucidate the importance of ASIGs during cell development, pseudotime analyses were performed, which revealed a late enrichment of distinct cancer-specific ASIG signatures. Notably, we were able to demonstrate that all cancer entities analyzed in this study comprised cell populations expressing ASIGs. While only minor correlations were detected between ASIGs and transcriptome-wide changes in PDAC, a high proportion of ASIGs was induced in CML, CRC, HCC, and LC samples. These unique cellular subpopulations could serve as a basis for future studies on the role of aging and senescence in human malignancies.

Leveraging high-powered RNA-Seq datasets to improve inference of regulatory activity in single-cell RNA-Seq data

10.1101/553040 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ning Wang ◽

Andrew E. Teschendorff

Keyword(s):

Transcription Factors ◽

Single Cell ◽

Cell Fate ◽

Regulatory Networks ◽

Large Scale ◽

Single Cells ◽

Differential Expression Analysis ◽

Dropout Rate ◽

RNA Seq ◽

Regulatory Activity

Abstract Inferring the activity of transcription factors in single cells is a key task to improve our understanding of development and complex genetic diseases. This task is, however, challenging due to the relatively large dropout rate and noisy nature of single-cell RNA-Seq data. Here we present a novel statistical inference framework called SCIRA (Single Cell Inference of Regulatory Activity), which leverages the power of large-scale bulk RNA-Seq datasets to infer high-quality tissue-specific regulatory networks, from which regulatory activity estimates in single cells can be subsequently obtained. We show that SCIRA can correctly infer regulatory activity of transcription factors affected by high technical dropouts. In particular, SCIRA can improve sensitivity by as much as 70% compared to differential expression analysis and current state-of-the-art methods. Importantly, SCIRA can reveal novel regulators of cell fate in tissue development, even for cell types that only make up 5% of the tissue, and can identify key novel tumor suppressor genes in cancer at single-cell resolution. In summary, SCIRA will be an invaluable tool for single-cell studies aiming to accurately map activity patterns of key transcription factors during development, and how these are altered in disease.

JIND: Joint Integration and Discrimination for Automated Single-Cell Annotation

10.1101/2020.10.06.327601 ◽

2020 ◽

Author(s):

Mohit Goyal ◽

Guillermo Serrano ◽

Ilan Shomorony ◽

Mikel Hernaez ◽

Idoia Ochoa

Keyword(s):

Single Cell ◽

Cell Types ◽

Marker Genes ◽

Specific Marker ◽

RNA Seq ◽

Batch Effects ◽

Cell Type ◽

Latent Space ◽

Cell Type Specific ◽

Low Dimensional

Abstract Single-cell RNA-seq is a powerful tool in the study of the cellular composition of different tissues and organisms. A key step in the analysis pipeline is the annotation of cell types based on the expression of specific marker genes. Since manual annotation is labor-intensive and does not scale to large datasets, several methods for automated cell-type annotation have been proposed based on supervised learning. However, these methods generally require feature extraction and batch alignment prior to classification, and their performance may become unreliable in the presence of cell types with very similar transcriptomic profiles, such as differentiating cells. We propose JIND, a framework for automated cell-type identification based on neural networks that directly learns a low-dimensional representation (latent code) in which cell types can be reliably determined. To account for batch effects, JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, hence avoiding the need to retrain the model whenever a new dataset becomes available. JIND also learns cell-type-specific confidence thresholds to identify and reject cells that cannot be reliably classified. We show on datasets with and without batch effects that JIND classifies cells more accurately than previously proposed methods while rejecting only a small proportion of cells. Moreover, JIND batch alignment is parallelizable and more than five to six times faster than Seurat integration. Availability: https://github.com/mohit1997/JIND.
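The rejection step mentioned in this abstract can be pictured with a minimal Python sketch. It assumes the classifier outputs one probability per cell type and that a threshold per cell type is already available; the function name, the threshold values, and the "Unassigned" label are hypothetical, and how JIND actually learns its thresholds is not shown here.

```python
def classify_with_rejection(probs, thresholds, reject_label="Unassigned"):
    """Assign the most probable cell type, or reject the cell.

    probs:      {cell_type: predicted probability} for one cell
    thresholds: {cell_type: minimum confidence required for that type}
    Both inputs and the reject label are illustrative assumptions.
    """
    best = max(probs, key=probs.get)
    # Accept only if the winning class clears its own threshold.
    return best if probs[best] >= thresholds[best] else reject_label


thresholds = {"T cell": 0.8, "B cell": 0.6}  # hypothetical values
print(classify_with_rejection({"T cell": 0.90, "B cell": 0.10}, thresholds))
print(classify_with_rejection({"T cell": 0.55, "B cell": 0.45}, thresholds))
```

Using a separate threshold per cell type lets hard-to-separate types demand more confidence than well-separated ones, at the cost of rejecting more cells of those types.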

Ensemble Classification through Random Projections for single-cell RNA-seq data

10.1101/2020.06.24.169136 ◽

2020 ◽

Author(s):

Aristidis G. Vrahatis ◽

Sotiris Tasoulis ◽

Spiros Georgakopoulos ◽

Vassilis Plagianakos

Keyword(s):

Single Cell ◽

Random Projection ◽

Classification Performance ◽

Majority Voting ◽

Ensemble Classification ◽

High Dimensionality ◽

Computational Time ◽

Biomedical Data ◽

RNA Seq ◽

Low Dimensional

Abstract Nowadays, biomedical data are generated at an exponential rate, creating datasets for analysis with ultra-high dimensionality and complexity. This revolution, which has been caused by recent advances in biotechnology, has led to big-data and data-driven computational approaches. An indicative example is the emerging single-cell RNA-sequencing (scRNA-seq) technology, which isolates and measures individual cells. Although scRNA-seq has revolutionized the biotechnology domain, the computational analysis of such data is a major challenge because of their ultra-high dimensionality and complexity. Following this direction, in this work we study the properties, effectiveness and generalization of the recently proposed MRPV algorithm for single-cell RNA-seq data. MRPV is an ensemble classification technique utilizing multiple ultra-low-dimensional Random Projected spaces. A given classifier determines the class for each sample in all independent spaces, while a majority voting scheme defines their predominant class. We show that Random Projection ensembles offer a platform not only for a low computational time analysis but also for enhancing classification performance. The developed methodologies were applied to four real high-dimensional biomedical datasets from single-cell RNA-seq studies and compared against well-known and similar classification tools. Experimental results showed that, based on simplistic tools, we can create a computationally fast, simple, yet effective approach for single-cell RNA-seq data with ultra-high dimensionality.
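The general scheme described in this abstract (several random projections, one classifier per projected space, majority vote across spaces) can be sketched in plain Python. This is not the MRPV implementation: the Gaussian projection matrix, the nearest-centroid base classifier, and all function names and default values are assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def random_projection(dim_in, dim_out, rng):
    """Draw a dim_in x dim_out Gaussian matrix scaled by 1/sqrt(dim_out)."""
    scale = dim_out ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_out)]
            for _ in range(dim_in)]


def project(x, matrix):
    """Map one sample (list of floats) into the low-dimensional space."""
    return [sum(x[i] * matrix[i][j] for i in range(len(x)))
            for j in range(len(matrix[0]))]


def fit_centroids(points, labels):
    """Stand-in base classifier (an assumption): one mean vector per class."""
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    return {y: [sum(col) / len(ps) for col in zip(*ps)]
            for y, ps in groups.items()}


def predict_centroid(centroids, p):
    """Return the class whose centroid is closest in squared distance."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda y: sq_dist(centroids[y], p))


def ensemble_predict(train_x, train_y, test_x, n_spaces=15, dim_out=3, seed=0):
    """Classify in n_spaces independent projections, then majority-vote."""
    rng = random.Random(seed)
    votes = [[] for _ in test_x]
    for _ in range(n_spaces):
        m = random_projection(len(train_x[0]), dim_out, rng)
        centroids = fit_centroids([project(x, m) for x in train_x], train_y)
        for i, x in enumerate(test_x):
            votes[i].append(predict_centroid(centroids, project(x, m)))
    # The predominant class across all projected spaces wins.
    return [Counter(v).most_common(1)[0][0] for v in votes]
```

Each projected space is cheap to work in because it has only `dim_out` dimensions, and voting across independent spaces reduces the chance that one unlucky projection decides the label.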

Improved detection of tumor suppressor events in single-cell RNA-Seq data

npj Genomic Medicine ◽

10.1038/s41525-020-00151-y ◽

2020 ◽

Vol 5 (1) ◽

Author(s):

Andrew E. Teschendorff ◽

Ning Wang

Keyword(s):

Transcription Factors ◽

Tumor Suppressor ◽

Single Cell ◽

Cancer Cells ◽

Dropout Rate ◽

Lung Epithelial Cells ◽

RNA Seq ◽

Regulatory Activity ◽

Tissue Specific ◽

Early Tumor

Abstract Tissue-specific transcription factors are frequently inactivated in cancer. Fully dissecting the heterogeneity of such tumor suppressor events requires single-cell resolution, yet this is challenging because of the high dropout rate. Here we propose a simple yet effective computational strategy called SCIRA to infer regulatory activity of tissue-specific transcription factors at single-cell resolution and use this tool to identify tumor suppressor events in single-cell RNA-Seq cancer studies. We demonstrate that tissue-specific transcription factors are preferentially inactivated in the corresponding cancer cells, suggesting that these are driver events. For many known or suspected tumor suppressors, SCIRA predicts inactivation in single cancer cells where differential expression does not, indicating that SCIRA improves the sensitivity to detect changes in regulatory activity. We identify NKX2-1 and TBX4 inactivation as early tumor suppressor events in normal non-ciliated lung epithelial cells from smokers. In summary, SCIRA can help chart the heterogeneity of tumor suppressor events at single-cell resolution.
