Optimal Gene Filtering for Single-Cell data (OGFSC)—a gene filtering algorithm for single-cell RNA-seq data

Jie Hao; Wei Cao; Jian Huang; Xin Zou; Ze-Guang Han

doi:10.1093/bioinformatics/bty1016

Optimal Gene Filtering for Single-Cell data (OGFSC)—a gene filtering algorithm for single-cell RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/bty1016 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2602-2609 ◽

Cited By ~ 3

Author(s):

Jie Hao ◽

Wei Cao ◽

Jian Huang ◽

Xin Zou ◽

Ze-Guang Han

Keyword(s):

Single Cell ◽

Supplementary Information ◽

Rna Seq ◽

Aging Research ◽

Technical Noise ◽

Transcriptomic Data ◽

Knowledge Based ◽

Gene Filtering ◽

Cell Data ◽

Gene Expression Levels

Abstract Motivation Single-cell transcriptomic data are commonly accompanied by extremely high technical noise due to the low RNA concentrations from individual cells. Precise identification of differentially expressed genes and cell populations are heavily dependent on the effective reduction of technical noise, e.g. by gene filtering. However, there is still no well-established standard in the current approaches of gene filtering. Investigators usually filter out genes based on single fixed threshold, which commonly leads to both over- and under-stringent errors. Results In this study, we propose a novel algorithm, termed as Optimal Gene Filtering for Single-Cell data, to construct a thresholding curve based on gene expression levels and the corresponding variances. We validated our method on multiple single-cell RNA-seq datasets, including simulated and published experimental datasets. The results show that the known signal and known noise are reliably discriminated in the simulated datasets. In addition, the results of seven experimental datasets demonstrate that these cells of the same annotated types are more sharply clustered using our method. Interestingly, when we re-analyze the dataset from an aging research recently published in Science, we find a list of regulated genes which is different from that reported in the original study, because of using different filtering methods. However, the knowledge based on our findings better matches the progression of immunosenescence. In summary, we here provide an alternative opportunity to probe into the true level of technical noise in single-cell transcriptomic data. Availability and implementation https://github.com/XZouProjects/OGFSC.git Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

GPseudoClust: deconvolution of shared pseudo-profiles at single-cell resolution

Bioinformatics ◽

10.1093/bioinformatics/btz778 ◽

2019 ◽

Author(s):

Magdalena E Strauss ◽

Paul D W Kirk ◽

John E Reid ◽

Lorenz Wernisch

Keyword(s):

Single Cell ◽

Time Course ◽

Gene Clusters ◽

Supplementary Information ◽

Rna Seq ◽

Clustering Methods ◽

Novel Approach ◽

Broad Array ◽

Recent Method ◽

Cell Data

Abstract Motivation Many methods have been developed to cluster genes on the basis of their changes in mRNA expression over time, using bulk RNA-seq or microarray data. However, single-cell data may present a particular challenge for these algorithms, since the temporal ordering of cells is not directly observed. One way to address this is to first use pseudotime methods to order the cells, and then apply clustering techniques for time course data. However, pseudotime estimates are subject to high levels of uncertainty, and failing to account for this uncertainty is liable to lead to erroneous and/or over-confident gene clusters. Results The proposed method, GPseudoClust, is a novel approach that jointly infers pseudotemporal ordering and gene clusters, and quantifies the uncertainty in both. GPseudoClust combines a recent method for pseudotime inference with nonparametric Bayesian clustering methods, efficient MCMC sampling, and novel subsampling strategies which aid computation.We consider a broad array of simulated and experimental datasets to demonstrate the effectiveness of GPseudoClust in a range of settings. Availability An implementation is available on GitHub: https://github.com/magStra/nonparametricSummaryPSM and https://github.com/magStra/GPseudoClust. Supplementary Information Supplementary data are available at Bioinformatics online.

Download Full-text

scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R

10.1101/069633 ◽

2016 ◽

Cited By ~ 10

Author(s):

Davis J. McCarthy ◽

Kieran R. Campbell ◽

Aaron T. L. Lun ◽

Quin F. Wills

Keyword(s):

Quality Control ◽

Single Cell ◽

Sequence Data ◽

Supplementary Information ◽

Processing Quality ◽

Rna Seq ◽

Study Gene Expression ◽

Supplementary Material ◽

Downstream Analysis ◽

Cell Data

AbstractMotivationSingle-cell RNA sequencing (scRNA-seq) is increasingly used to study gene expression at the level of individual cells. However, preparing raw sequence data for further analysis is not a straightforward process. Biases, artifacts, and other sources of unwanted variation are present in the data, requiring substantial time and effort to be spent on pre-processing, quality control (QC) and normalisation.ResultsWe have developed the R/Bioconductor package scater to facilitate rigorous pre-processing, quality control, normalisation and visualisation of scRNA-seq data. The package provides a convenient, flexible workflow to process raw sequencing reads into a high-quality expression dataset ready for downstream analysis. scater provides a rich suite of plotting tools for single-cell data and a flexible data structure that is compatible with existing tools and can be used as infrastructure for future software development.AvailabilityThe open-source code, along with installation instructions, vignettes and case studies, is available through Bioconductor at http://bioconductor.org/packages/scater.Supplementary informationSupplementary material is available online at bioRxiv accompanying this manuscript, and all materials required to reproduce the results presented in this paper are available at dx.doi.org/10.5281/zenodo.60139.

Download Full-text

A probabilistic model-based bi-clustering method for single-cell transcriptomic data analysis

10.1101/181362 ◽

2017 ◽

Cited By ~ 2

Author(s):

Sha Cao ◽

Tao Sheng ◽

Xin Chen ◽

Qin Ma ◽

Chi Zhang

Keyword(s):

Single Cell ◽

Expression Patterns ◽

Cell Types ◽

Computational Techniques ◽

Rna Seq ◽

Transcriptomic Data ◽

Large Numbers ◽

A Cell ◽

Multiple Cell ◽

Cell Data

AbstractWe present here novel computational techniques for tackling four problems related to analyses of single-cell RNA-Seq data: (1) a mixture model for coping with multiple cell types in a cell population; (2) a truncated model for handling the unquantifiable errors caused by large numbers of zeros or low-expression values; (3) a bi-clustering technique for detection of sub-populations of cells sharing common expression patterns among subsets of genes; and (4) detection of small cell sub-populations with distinct expression patterns. Through case studies, we demonstrated that these techniques can derive high-resolution information from single-cell data that are not feasible using existing techniques.

Download Full-text

Transferable representations of single-cell transcriptomic data

10.1101/2021.04.13.439707 ◽

2021 ◽

Author(s):

Ethan Weinberger ◽

Su-In Lee

Keyword(s):

Single Cell ◽

Large Scale ◽

Future Research ◽

Rna Seq ◽

Cell Type ◽

Experimental Conditions ◽

Technical Noise ◽

Transcriptomic Data ◽

Low Dimensional ◽

Computational Resources

Advances in single-cell RNA-seq (scRNA-seq) technologies are enabling the construction of large-scale, human-annotated reference cell atlases, creating unprecedented opportunities to accelerate future research. However, effectively leveraging information from these atlases, such as clustering labels or cell type annotations, remains challenging due to substantial technical noise and sparsity in scRNA-seq measurements. To address this problem, we present HD-AE, a deep autoencoder designed to extract integrated low-dimensional representations of scRNA-seq measurements across datasets from different labs and experimental conditions. Unlike previous approaches, HD-AE's representations successfully transfer to new query datasets without needing to retrain the model. Researchers without substantial computational resources or machine learning expertise can thus leverage the robust representations learned by pretrained HD-AE models to compare embeddings of their own data with previously generated sets of reference embeddings.

Download Full-text

ExperimentSubset: an R package to manage subsets of Bioconductor Experiment objects

Bioinformatics ◽

10.1093/bioinformatics/btab179 ◽

2021 ◽

Author(s):

Irzam Sarfraz ◽

Muhammad Asif ◽

Joshua D Campbell

Keyword(s):

Single Cell ◽

R Package ◽

Poor Quality ◽

Data Matrix ◽

Supplementary Information ◽

Data Provenance ◽

Rna Seq ◽

Efficient Management ◽

The Matrix ◽

The Relationship

Abstract Motivation R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying down-stream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance. Results To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, the ExperimentSubset is a flexible container for the efficient management of subsets. Availability and implementation ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and Github: https://github.com/campbio/ExperimentSubset. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

SMILE: Mutual Information Learning for Integration of Single-cell Omics Data

Bioinformatics ◽

10.1093/bioinformatics/btab706 ◽

2021 ◽

Author(s):

Yang Xu ◽

Priyojit Das ◽

Rachel Patton McCord

Keyword(s):

Deep Learning ◽

Mutual Information ◽

Single Cell ◽

Learning Algorithm ◽

Cellular Systems ◽

Supplementary Information ◽

Omics Data ◽

Learning Approaches ◽

Rna Seq ◽

Integrate Data

Abstract Motivation Deep learning approaches have empowered single-cell omics data analysis in many ways and generated new insights from complex cellular systems. As there is an increasing need for single cell omics data to be integrated across sources, types, and features of data, the challenges of integrating single-cell omics data are rising. Here, we present an unsupervised deep learning algorithm that learns discriminative representations for single-cell data via maximizing mutual information, SMILE (Single-cell Mutual Information Learning). Results Using a unique cell-pairing design, SMILE successfully integrates multi-source single-cell transcriptome data, removing batch effects and projecting similar cell types, even from different tissues, into the shared space. SMILE can also integrate data from two or more modalities, such as joint profiling technologies using single-cell ATAC-seq, RNA-seq, DNA methylation, Hi-C, and ChIP data. When paired cells are known, SMILE can integrate data with unmatched feature, such as genes for RNA-seq and genome wide peaks for ATAC-seq. Integrated representations learned from joint profiling technologies can then be used as a framework for comparing independent single source data. Supplementary information Supplementary data are available at Bioinformatics online. The source code of SMILE including analyses of key results in the study can be found at: https://github.com/rpmccordlab/SMILE.

Download Full-text

sc-REnF:An entropy guided robust feature selection for clustering of single-cell rna-seq data

10.1101/2020.10.10.334573 ◽

2020 ◽

Author(s):

Snehalika Lall ◽

Abhik Ghosh ◽

Sumanta Ray ◽

Sanghamitra Bandyopadhyay

Keyword(s):

Single Cell ◽

Gene Selection ◽

Rna Seq ◽

Technical Noise ◽

Marker Selection ◽

Cell Clustering ◽

Typing Methods ◽

Original Application ◽

Downstream Analysis ◽

Cell Typing

ABSTRACTMany single-cell typing methods require pure clustering of cells, which is susceptible towards the technical noise, and heavily dependent on high quality informative genes selected in the preliminary steps of downstream analysis. Techniques for gene selection in single-cell RNA sequencing (scRNA-seq) data are seemingly simple which casts problems with respect to the resolution of (sub-)types detection, marker selection and ultimately impacts towards cell annotation. We introduce sc-REnF, a novel and robust entropy based feature (gene) selection method, which leverages the landmark advantage of ‘Renyi’ and ‘Tsallis’ entropy achieved in their original application, in single cell clustering. Thereby, gene selection is robust and less sensitive towards the technical noise present in the data, producing a pure clustering of cells, beyond classifying independent and unknown sample with utmost accuracy. The corresponding software is available at: https://github.com/Snehalikalall/sc-REnF

Download Full-text

EpiScanpy: integrated single-cell epigenomic analysis

10.1101/648097 ◽

2019 ◽

Cited By ~ 4

Author(s):

Anna Danese ◽

Maria L. Richter ◽

David S. Fischer ◽

Fabian J. Theis ◽

Maria Colomé-Tatché

Keyword(s):

Dna Methylation ◽

Single Cell ◽

Large Scale ◽

Feature Space ◽

Rna Seq ◽

Computational Framework ◽

Learning Techniques ◽

Multiple Feature ◽

The Many ◽

Cell Data

ABSTRACTEpigenetic single-cell measurements reveal a layer of regulatory information not accessible to single-cell transcriptomics, however single-cell-omics analysis tools mainly focus on gene expression data. To address this issue, we present epiScanpy, a computational framework for the analysis of single-cell DNA methylation and single-cell ATAC-seq data. EpiScanpy makes the many existing RNA-seq workflows from scanpy available to large-scale single-cell data from other -omics modalities. We introduce and compare multiple feature space constructions for epigenetic data and show the feasibility of common clustering, dimension reduction and trajectory learning techniques. We benchmark epiScanpy by interrogating different single-cell brain mouse atlases of DNA methylation, ATAC-seq and transcriptomics. We find that differentially methylated and differentially open markers between cell clusters enrich transcriptome-based cell type labels by orthogonal epigenetic information.

Download Full-text

DEsingle for detecting three types of differential expression in single-cell RNA-seq data

10.1101/173997 ◽

2017 ◽

Cited By ~ 1

Author(s):

Zhun Miao ◽

Ke Deng ◽

Xiaowo Wang ◽

Xuegong Zhang

Keyword(s):

Single Cell ◽

Differential Expression ◽

Negative Binomial ◽

Single Cells ◽

R Package ◽

Supplementary Information ◽

Binomial Model ◽

Supplementary Data ◽

Rna Seq ◽

Real Zeros

AbstractSummaryThe excessive amount of zeros in single-cell RNA-seq data include “real” zeros due to the on-off nature of gene transcription in single cells and “dropout” zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package DEsingle which employed Zero-Inflated Negative Binomial model to estimate the proportion of real and dropout zeros and to define and detect 3 types of DE genes in single-cell RNA-seq data with higher accuracy.Availability and ImplementationThe R package DEsingle is freely available at https://github.com/miaozhun/DEsingle and is under Bioconductor’s consideration [email protected] informationSupplementary data are available at bioRxiv online.

Download Full-text

High-throughput single-cell RNA-seq data imputation and characterization with surrogate-assisted automated deep learning

Briefings in Bioinformatics ◽

10.1093/bib/bbab368 ◽

2021 ◽

Author(s):

Xiangtao Li ◽

Shaochuan Li ◽

Lei Huang ◽

Shixiong Zhang ◽

Ka-chun Wong

Keyword(s):

Gene Expression ◽

Neural Networks ◽

Single Cell ◽

Deep Neural Networks ◽

Expression Profiles ◽

Marker Gene ◽

Gene Expression Profiles ◽

Underlying Mechanisms ◽

Cell Data ◽

Gene Expression Levels

Abstract Single-cell RNA sequencing (scRNA-seq) technologies have been heavily developed to probe gene expression profiles at single-cell resolution. Deep imputation methods have been proposed to address the related computational challenges (e.g. the gene sparsity in single-cell data). In particular, the neural architectures of those deep imputation models have been proven to be critical for performance. However, deep imputation architectures are difficult to design and tune for those without rich knowledge of deep neural networks and scRNA-seq. Therefore, Surrogate-assisted Evolutionary Deep Imputation Model (SEDIM) is proposed to automatically design the architectures of deep neural networks for imputing gene expression levels in scRNA-seq data without any manual tuning. Moreover, the proposed SEDIM constructs an offline surrogate model, which can accelerate the computational efficiency of the architectural search. Comprehensive studies show that SEDIM significantly improves the imputation and clustering performance compared with other benchmark methods. In addition, we also extensively explore the performance of SEDIM in other contexts and platforms including mass cytometry and metabolic profiling in a comprehensive manner. Marker gene detection, gene ontology enrichment and pathological analysis are conducted to provide novel insights into cell-type identification and the underlying mechanisms. The source code is available at https://github.com/li-shaochuan/SEDIM.

Download Full-text