Emergent Statistical Laws in Single-Cell Transcriptomic Data

2021 ◽  
Author(s):  
Silvia Lazzardi ◽  
Filippo Valle ◽  
Andrea Mazzolini ◽  
Antonio Scialdone ◽  
Michele Caselle ◽  
...  

Large scale data on single-cell gene expression have the potential to unravel the specific transcriptional programs of different cell types. The structure of these expression datasets suggests a similarity with several other complex systems that can be analogously described through the statistics of their basic building blocks. Transcriptomes of single cells are collections of messenger RNA abundances transcribed from a common set of genes just as books are different collections of words from a shared vocabulary, genomes of different species are specific compositions of genes belonging to evolutionary families, and ecological niches can be described by their species abundances. Following this analogy, we identify several emergent statistical laws in single-cell transcriptomic data closely similar to regularities found in linguistics, ecology or genomics. A simple mathematical framework can be used to analyze the relations between different laws and the possible mechanisms behind their ubiquity. Importantly, treatable statistical models can be useful tools in transcriptomics to disentangle the actual biological variability from general statistical effects present in most component systems and from the consequences of the sampling process inherent to the experimental technique.
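One regularity of this kind in linguistics is Zipf's law, where the frequency of a word falls off roughly as a power of its rank. As a minimal sketch of how the analogous check could be run on a single cell, the snippet below builds a rank-abundance curve and fits its log-log slope. The function names, gene names, and toy counts are all hypothetical, and the least-squares fit is an illustrative choice, not necessarily the estimator used by the authors.

```python
import math

def rank_abundance(counts):
    """Return (rank, relative frequency) pairs, most abundant gene first."""
    total = sum(counts.values())
    ordered = sorted((c for c in counts.values() if c > 0), reverse=True)
    return [(rank, c / total) for rank, c in enumerate(ordered, start=1)]

def zipf_exponent(pairs):
    """Least-squares slope of log(frequency) against log(rank), sign-flipped."""
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx

# Toy cell (hypothetical data), constructed so that count is roughly 1200 / rank.
toy_cell = {f"gene{i}": round(1200 / i) for i in range(1, 21)}
pairs = rank_abundance(toy_cell)
alpha = zipf_exponent(pairs)
print(f"fitted exponent: {alpha:.2f}")
```

An exponent close to 1 on real data would resemble the classic Zipf behaviour of word frequencies; because the toy counts are built to follow 1/rank, the fit here only confirms that the code recovers a known slope.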

2018 ◽  
Author(s):  
Nikos Konstantinides ◽  
Katarina Kapuralin ◽  
Chaimaa Fadil ◽  
Luendreo Barboza ◽  
Rahul Satija ◽  
...  

Summary: Transcription factors regulate the molecular, morphological, and physiological characters of neurons and generate their impressive cell type diversity. To gain insight into general principles that govern how transcription factors regulate cell type diversity, we used large-scale single-cell mRNA sequencing to characterize the extensive cellular diversity in the Drosophila optic lobes. We sequenced 55,000 single optic lobe neurons and glia and assigned them to 52 clusters of transcriptionally distinct single cells. We validated the clustering and annotated many of the clusters using RNA sequencing of characterized FACS-sorted single cell types, as well as marker genes specific to given clusters. To identify transcription factors responsible for inducing specific terminal differentiation features, we used machine learning to generate a ‘random forest’ model. The predictive power of the model was confirmed by showing that two transcription factors expressed specifically in cholinergic (apterous) and glutamatergic (traffic-jam) neurons are necessary for the expression of ChAT and VGlut in many, but not all, cholinergic or glutamatergic neurons, respectively. We used a transcriptome-wide approach to show that the same terminal characters, including but not restricted to neurotransmitter identity, can be regulated by different transcription factors in different cell types, arguing for extensive phenotypic convergence. Our data provide a deep understanding of the developmental and functional specification of a complex brain structure.


2019 ◽  
Vol 21 (4) ◽  
pp. 1209-1223 ◽  
Author(s):  
Raphael Petegrosso ◽  
Zhuliu Li ◽  
Rui Kuang

Abstract: Single-cell RNA sequencing (scRNA-seq) technologies have enabled the large-scale whole-transcriptome profiling of each individual single cell in a cell population. A core analysis of the scRNA-seq transcriptome profiles is to cluster the single cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models, $k$-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of transcripts, highly variable total mRNAs from single cells and ambiguous cell markers in the presence of technical biases and irrelevant confounding biological variations. We review how cell-specific normalization, the imputation of dropouts and dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of single cells. We also introduce more advanced approaches for clustering scRNA-seq transcriptomes in time-series data and multiple cell populations and for detecting rare cell types. Several software packages developed to support the cluster analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency. Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics. Availability: All the source code and data are available at https://github.com/kuanglab/single-cell-review.


2017 ◽  
Author(s):  
Aparna Bhaduri ◽  
Tomasz J. Nowakowski ◽  
Alex A. Pollen ◽  
Arnold R. Kriegstein

Abstract: High-throughput methods for profiling the transcriptomes of single cells have recently emerged as transformative approaches for large-scale population surveys of cellular diversity in heterogeneous primary tissues. Efficient generation of such an atlas will depend on sufficient sampling of the diverse cell types while remaining cost-effective to enable a comprehensive examination of organs, developmental stages, and individuals. To examine the relationship between cell number and transcriptional heterogeneity in the context of unbiased cell type classification, we explicitly explored the population structure of a publicly available 1.3 million cell dataset from the E18.5 mouse brain. We propose a computational framework for inferring the saturation point of cluster discovery in a single-cell mRNA-seq experiment, centered around cluster preservation in downsampled datasets. In addition, we introduce a “complexity index”, which characterizes the heterogeneity of cells in a given dataset. Using Cajal-Retzius cells as an example of a limited-complexity dataset, we explored whether biological distinctions relate to technical clustering. Surprisingly, we found that clustering distinctions carrying biologically interpretable meaning are achieved with far fewer cells (20,000). Together, these findings suggest that most of the biologically interpretable insights from the 1.3 million cells can be recapitulated by analyzing 50,000 randomly selected cells, indicating that instead of profiling few individuals at high “cellular coverage”, the much anticipated cell atlasing studies may instead benefit from profiling more individuals, or many time points at lower cellular coverage.

Recent efforts seek to create a comprehensive cell atlas of the human body [1,2]. Current technology, however, makes it prohibitively expensive to perform analysis of every cell. Therefore, designing effective sampling strategies will be critical to generating a working atlas in an efficient, cost-effective, and streamlined manner. The advent of single-cell and single-nucleus mRNA sequencing (RNAseq) in droplet format [3,4] now enables large-scale sampling of cells from any tissue, and a recently released publicly available dataset of 1.3 million single cells from the E18.5 mouse brain generated with the 10X Chromium [5] provides an opportunity to explore the relationship between population structure and the number of sampled cells necessary to reveal the underlying diversity of cell types. Here, we present a framework for how researchers can evaluate whether a dataset has reached saturation, and we estimate how many cells would be required to generate an atlas of the sample analyzed here. This framework can be applied to any organ or cell type specific atlas for any organism.
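The idea of testing saturation by downsampling can be illustrated with a deliberately simplified sketch. It assumes cluster labels from the full dataset are already available and only asks how many clusters keep at least `min_cells` members in a random subsample; the authors' framework re-examines cluster preservation in downsampled datasets, which this proxy does not do. The function name, the threshold, and the toy label proportions are all hypothetical.

```python
import random
from collections import Counter

def clusters_retained(labels, n_cells, min_cells=10, seed=0):
    """Fraction of full-data clusters with >= min_cells cells in a random subsample."""
    rng = random.Random(seed)            # fixed seed so the sketch is reproducible
    sample = rng.sample(labels, n_cells)
    kept = [c for c, n in Counter(sample).items() if n >= min_cells]
    return len(kept) / len(set(labels))

# Toy population (hypothetical): one common, one intermediate, one rare cell type.
labels = ["common"] * 9000 + ["intermediate"] * 900 + ["rare"] * 100

# Saturation curve: retained fraction as a function of subsample size.
curve = {n: clusters_retained(labels, n) for n in (100, 1000, 10000)}
for n, frac in curve.items():
    print(n, round(frac, 2))
```

In this toy setting the curve reaches 1.0 only once the rare type is sampled deeply enough, which is the kind of plateau a saturation analysis looks for.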


2017 ◽  
Author(s):  
Shujing Lai ◽  
Yang Xu ◽  
Wentao Huang ◽  
Mengmeng Jiang ◽  
Haide Chen ◽  
...  

Summary: The classical hematopoietic hierarchy, which was mainly built with fluorescence-activated cell sorting (FACS) technology, has proved inaccurate in recent studies. Single-cell RNA-seq (scRNA-seq) analysis provides a way to overcome the limits of the FACS-based cell type definition system for the dissection of complex cellular hierarchies. However, large-scale scRNA-seq is constrained by the throughput and cost of traditional methods. Here, we developed Microwell-seq, a high-throughput and low-cost scRNA-seq platform using extremely simple devices. Using Microwell-seq, we constructed a single-cell resolution transcriptome atlas of the human hematopoietic differentiation hierarchy by profiling more than 50,000 single cells throughout the adult human hematopoietic system. We found that the adult human hematopoietic stem and progenitor cell (HSPC) compartment is dominated by progenitors primed with lineage-specific regulators. Our analysis revealed differentiation pathways for each cell type, through which HSPCs directly progress to lineage-biased progenitors before differentiation. We propose a revised adult human hematopoietic hierarchy independent of oligopotent progenitors. Our study also demonstrates the broad applicability of Microwell-seq technology.


2021 ◽  
Author(s):  
Hongru Shen ◽  
Xilin Shen ◽  
Mengyao Feng ◽  
Dan Wu ◽  
Chao Zhang ◽  
...  

Advances in single-cell RNA sequencing have led to an exponential accumulation of single-cell expression data. However, there is still a lack of tools that can integrate this ever-growing body of single-cell expression data. Here, we present iSEEEK, a universal approach for integrating super large-scale single-cell expression data by exploring the expression rankings of top-expressing genes. We developed iSEEEK with 13.7 million single cells. We demonstrated the efficiency of iSEEEK with canonical single-cell downstream tasks on five heterogeneous datasets encompassing human and mouse samples. iSEEEK achieved good clustering performance benchmarked against well-annotated cell labels. In addition, iSEEEK could transfer the knowledge learned from large-scale expression data to a new dataset that was not involved in its development. iSEEEK enables identification of gene-gene interaction networks that are characteristic of specific cell types. Our study presents a simple yet effective method to integrate super large-scale single-cell transcriptomes and would facilitate translational single-cell research from bench to bedside.
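To make the notion of an expression ranking concrete, here is a minimal sketch that turns one cell's expression values into an ordered list of its top-expressing genes. It is not the iSEEEK implementation: the function name, the cutoff `k`, the tie-breaking rule, and the gene names and values are hypothetical and chosen only for illustration.

```python
def top_gene_ranking(expression, k=3):
    """Names of the k most highly expressed genes, highest first.

    Genes with zero expression are dropped; ties are broken alphabetically.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    return [gene for gene, _ in expressed[:k]]

# Two toy cells (hypothetical): cell_b is cell_a sequenced ten times deeper.
cell_a = {"GeneA": 50, "GeneB": 5, "GeneC": 0, "GeneD": 20}
cell_b = {"GeneA": 500, "GeneB": 50, "GeneC": 0, "GeneD": 200}

ranking_a = top_gene_ranking(cell_a)
ranking_b = top_gene_ranking(cell_b)
print(ranking_a, ranking_b)
```

Both toy cells yield the same ordered gene list even though their raw values differ tenfold, which suggests why a rank-based representation could be convenient when combining datasets with different sequencing depths.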


2018 ◽  
Vol 20 (4) ◽  
pp. 1384-1394 ◽  
Author(s):  
Alessandra Dal Molin ◽  
Barbara Di Camillo

Abstract: The sequencing of the transcriptome of single cells, or single-cell RNA-sequencing, has now become the dominant technology for the identification of novel cell types in heterogeneous cell populations or for the study of stochastic gene expression. In recent years, various experimental methods and computational tools for analysing single-cell RNA-sequencing data have been proposed. However, most of them are tailored to different experimental designs or biological questions, and in many cases, their performance has not been benchmarked yet, thus increasing the difficulty for a researcher to choose the optimal single-cell transcriptome sequencing (scRNA-seq) experiment and analysis workflow. In this review, we aim to provide an overview of the currently available experimental and computational methods developed to handle single-cell RNA-sequencing data and, based on their peculiarities, we suggest possible analysis frameworks depending on specific experimental designs. We also propose an evaluation of the challenges, open questions and future perspectives in the field. In particular, we go through the different steps of scRNA-seq experimental protocols such as cell isolation, messenger RNA capture, reverse transcription, amplification and use of quantitative standards such as spike-ins and Unique Molecular Identifiers (UMIs). We then analyse the current methodological challenges related to preprocessing, alignment, quantification, normalization, batch effect correction and methods to control for confounding effects.


2021 ◽  
Author(s):  
Chloe Xueqi Wang ◽  
Lin Zhang ◽  
Bo Wang

The surge of single-cell RNA sequencing technologies provides access to large single-cell RNA-seq datasets at the scale of hundreds of thousands of single cells. Integrative analysis of large-scale scRNA-seq datasets has the potential to reveal de novo cell types as well as to aggregate biological information. However, most existing methods fail to integrate multiple large-scale scRNA-seq datasets in a computationally and memory-efficient way. We hereby propose OCAT, One Cell At a Time, a graph-based method that sparsely encodes single-cell gene expressions to integrate data from multiple sources without most variable gene selection or explicit batch effect correction. We demonstrate that OCAT efficiently integrates multiple scRNA-seq datasets and achieves state-of-the-art performance in cell-type clustering, especially in challenging scenarios of non-overlapping cell types. In addition, OCAT facilitates a variety of downstream analyses, such as gene prioritization, trajectory inference, pseudotime inference and cell inference. OCAT is a unifying tool to simplify and expedite single-cell data analysis.


Author(s):  
Martin Philpott ◽  
Jonathan Watson ◽  
Anjan Thakurta ◽  
Tom Brown ◽  
Tom Brown ◽  
...  

Abstract: Here we describe single-cell corrected long-read sequencing (scCOLOR-seq), which enables error correction of barcode and unique molecular identifier oligonucleotide sequences and permits standalone cDNA nanopore sequencing of single cells. Barcodes and unique molecular identifiers are synthesized using dimeric nucleotide building blocks that allow error detection. We illustrate the use of the method for evaluating barcode assignment accuracy, differential isoform usage in myeloma cell lines, and fusion transcript detection in a sarcoma cell line.


2021 ◽  
Vol 22 (11) ◽  
pp. 5793
Author(s):  
Brianna M. Quinville ◽  
Natalie M. Deschenes ◽  
Alex E. Ryckman ◽  
Jagdeep S. Walia

Sphingolipids are a specialized group of lipids essential to the composition of the plasma membrane of many cell types; however, they are primarily localized within the nervous system. The amphipathic properties of sphingolipids enable their participation in a variety of intricate metabolic pathways. Sphingoid bases are the building blocks for all sphingolipid derivatives, comprising a complex class of lipids. The biosynthesis and catabolism of these lipids play an integral role in small- and large-scale body functions, including participation in membrane domains and signalling; cell proliferation, death, migration, and invasiveness; inflammation; and central nervous system development. Recently, sphingolipids have become the focus of several fields of research in the medical and biological sciences, as these bioactive lipids have been identified as potent signalling and messenger molecules. Sphingolipids are now being exploited as therapeutic targets for several pathologies. Here we present a comprehensive review of the structure and metabolism of sphingolipids and their many functional roles within the cell. In addition, we highlight the role of sphingolipids in several pathologies, including inflammatory disease, cystic fibrosis, cancer, Alzheimer’s and Parkinson’s disease, and lysosomal storage disorders.


2021 ◽  
Author(s):  
Qing Xie ◽  
Chengong Han ◽  
Victor Jin ◽  
Shili Lin

Single-cell Hi-C techniques enable one to study cell-to-cell variability in chromatin interactions. However, single-cell Hi-C (scHi-C) data suffer severely from sparsity, that is, the existence of excess zeros due to insufficient sequencing depth. Complicating things further is the fact that not all zeros are created equal, as some are due to loci truly not interacting because of the underlying biological mechanism (structural zeros), whereas others are indeed due to insufficient sequencing depth (sampling zeros), especially for loci that interact infrequently. Differentiating between structural zeros and sampling zeros is important since correct inference would improve downstream analyses such as clustering and discovery of subtypes. Nevertheless, distinguishing between these two types of zeros has received little attention in the single-cell Hi-C literature, where the issue of sparsity has been addressed mainly as a data quality improvement problem. To fill this gap, in this paper, we propose HiCImpute, a Bayesian hierarchical model that goes beyond data quality improvement by also identifying observed zeros that are in fact structural zeros. HiCImpute takes spatial dependencies of the scHi-C 2D data structure into account while also borrowing information from similar single cells and bulk data, when such are available. Through an extensive set of analyses of synthetic and real data, we demonstrate the ability of HiCImpute to identify structural zeros with high sensitivity and to accurately impute dropout values in sampling zeros. Downstream analyses using data improved by HiCImpute yielded much more accurate clustering of cell types compared to using observed data or data improved by several comparison methods. Most significantly, HiCImpute-improved data has led to the identification of subtypes within each of the excitatory neuronal cells of L4 and L5 in the prefrontal cortex.

