A new bioinformatics tool to recover missing gene expression in single-cell RNA sequencing data

Author(s):  
Jingyi Jessica Li

Abstract Single-cell RNA sequencing (scRNA-seq) is a burgeoning field in which experimental techniques and computational methods have evolved rapidly over the past six years. These technological advances have allowed biomedical researchers to identify new cell types, delineate cell sub-populations, and infer cell differentiation trajectories in various tissue samples. Among the important features extractable from scRNA-seq data, the predominant ones are individual genes’ expression levels in single cells. Most analyses require a preprocessing step that converts an scRNA-seq dataset into a count matrix, where rows correspond to cells (or genes), columns correspond to genes (or cells), and each entry is a count, i.e. the number of sequenced reads or unique molecular identifiers (UMIs) mapped to a gene in a cell. Single-cell count matrices are highly sparse; for example, a typical matrix constructed from a droplet-based dataset may have >90% of counts as zeros.
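
The count-matrix layout and sparsity described in the abstract above can be illustrated with a minimal Python sketch. The matrix values and the helper name `zero_fraction` are invented for illustration and do not come from any dataset or tool; the cells-as-rows orientation is one of the two layouts the abstract allows.

```python
import numpy as np

# Toy count matrix (hypothetical values): rows are cells, columns are genes.
counts = np.array([
    [0, 3, 0, 0, 1],
    [0, 0, 0, 2, 0],
    [5, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

def zero_fraction(matrix: np.ndarray) -> float:
    """Return the fraction of entries in a count matrix that are exactly zero."""
    return float(np.mean(matrix == 0))

sparsity = zero_fraction(counts)
print(f"{sparsity:.0%} of entries are zero")
```

Real droplet-based matrices are usually stored in a sparse format for this reason; the dense array here is only for readability.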

Blood ◽  
2018 ◽  
Vol 132 (Supplement 1) ◽  
pp. 995-995
Author(s):  
Vincent-Philippe Lavallee ◽  
Elham Azizi ◽  
Vaidotas Kiseliovas ◽  
Ignas Masilionis ◽  
Linas Mazutis ◽  
...  

Abstract Introduction: Acute myeloid leukemia (AML) evolution is a multistep process in which cells evolve from hematopoietic stem and progenitor cells (HSPCs) that acquire genetic anomalies, such as chromosomal rearrangements and mutations, which define distinct subgroups. Mutations in Nucleophosmin 1 (NPM1), which occur in ~30% of patients, are the most frequent subgroup-defining mutations in AML and appear to be a late driver event in this disease. Bulk RNA-sequencing studies have identified differentially expressed genes between AML subgroups, but they are uninformative about the composition of cell types populating each sample. Large-scale single-cell RNA sequencing (scRNA-seq) technologies now enable a detailed characterization of intratumoral heterogeneity, and could help to better understand the stepwise evolution from normal to malignant cells. Methods: Twelve primary human AML specimens from MSKCC and Quebec Leukemia Cell Bank, including 8 with NPM1 mutations, were included in this cohort. Cells were subjected to scRNA-seq using 10X Genomics Chromium Single Cell 3' protocols and libraries were sequenced on Illumina HiSeq or NovaSeq platforms. FASTQ files were processed using the SEQC pipeline (Azizi E et al, Cell 2018), resulting in a carefully filtered count matrix of >100,000 single cells (4877 to 11532 cells per sample). Results: Using Euclidean distance metrics and t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization, we explored the phenotypic overlap between samples and showed that leukemia cells from different patients were mostly dissimilar, suggesting inter-sample heterogeneity. However, samples with similar morphology and similar NPM1 mutational status were phenotypically closer (Fig A), as anticipated from bulk RNA-sequencing data (TCGA, NEJM 2013).
We partitioned cells into distinct clusters using Phenograph (Levine J et al, Cell 2015) (Fig B) and measured the diversity of samples per cluster using Shannon's entropy metric, revealing that mature cell types (B/plasma cells, T/NK and erythroid cells, Fig C), presumably excluded from the tumor bulk, are transcriptionally similar across samples. Most notably, the next most diverse cluster (C36), comprising 438 cells from 11/12 samples, contains cells with an HSPC-like phenotype, as suggested by i) highest correlation of the centroid of this cluster with HSC1 (lin-/CD133+/CD34dim) population from sorted bulk RNA-sequencing data (Novershtern N et al, Cell 2011), and ii) marked GSEA enrichment for stem cell signatures (top enrichment: Jaatinen_hematopoeitic_stem_cell_up, NES = 9.04, FDR q-val = 0). To study the extent to which NPM1 or other mutations drive heterogeneity in leukemia populations, we interrogated 3'-derived single-cell sequences for all recurrent mutations in AML and found that the NPM1 gene has unique features (e.g. relatively high single-cell expression and 3' localization) that allow specific identification of mutations in 5 to 34% of cells per mutated sample. To control for the high frequency of false negatives caused by dropouts in scRNA-seq data, we normalized the abundance of mutated vs wild-type cells to provide an estimation of mutation frequency in different cell types (Fig D). As expected, NPM1 mutations were rare in B and T/NK lymphoid cells (also observed using RT-qPCR in sorted populations by Dvorakova D et al, Leuk Lymphoma 2013) and were found in the majority of leukemia and myeloid cells. Interestingly, these mutations were detected at various frequencies in erythroid cells, suggesting that NPM1 mutations are acquired in cells with different lineage commitment in different patients.
Most notably, the HSPC-like cluster C36 also contained a subpopulation of cells that have acquired NPM1 mutations and are transcriptionally different from wild-type cells. Conclusion: This study presents a first comprehensive single-cell map of primary AML, and the first 3'-based interrogation of mutations in single cells. It led to the identification of phenotypically distinct cells presenting an HSPC-like expression profile which were sub-clonally harboring NPM1 mutations, providing the means to identify deregulated genes in these important leukemia subpopulations. Figure. Disclosures Levine: Epizyme: Patents & Royalties; Celgene: Consultancy, Research Funding; Janssen: Consultancy, Honoraria; Isoplexis: Equity Ownership; C4 Therapeutics: Equity Ownership; Prelude: Research Funding; Gilead: Honoraria; Imago: Equity Ownership; Novartis: Consultancy; Roche: Consultancy, Research Funding; Loxo: Consultancy, Equity Ownership; Qiagen: Equity Ownership, Membership on an entity's Board of Directors or advisory committees.
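
The abstract above measures the diversity of samples within each cluster using Shannon's entropy. The following minimal sketch shows how such a measure could be computed from per-sample cell counts in one cluster. The function name, the base-2 logarithm, and the toy counts are assumptions made for illustration; the abstract does not state these details.

```python
import numpy as np

def shannon_entropy(sample_counts) -> float:
    """Shannon entropy (in bits) of the sample composition of one cluster.

    sample_counts[i] is the number of cells in the cluster that come from
    sample i. Higher entropy means the cluster mixes cells from more samples.
    """
    counts = np.asarray(sample_counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0
    # Drop empty samples: the term 0 * log(0) is taken to be 0.
    p = counts[counts > 0] / total
    return float(-(p * np.log2(p)).sum())

# Hypothetical clusters, each described by cell counts from four samples.
mixed_cluster = [25, 25, 25, 25]        # evenly shared across samples
single_sample_cluster = [100, 0, 0, 0]  # specific to one sample

print(shannon_entropy(mixed_cluster))
print(shannon_entropy(single_sample_cluster))
```

Under this reading, a cluster shared evenly by all samples reaches the maximum entropy, while a patient-specific cluster has entropy zero.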


2018 ◽  
Vol 20 (4) ◽  
pp. 1384-1394 ◽  
Author(s):  
Alessandra Dal Molin ◽  
Barbara Di Camillo

Abstract The sequencing of the transcriptome of single cells, or single-cell RNA-sequencing, has now become the dominant technology for the identification of novel cell types in heterogeneous cell populations or for the study of stochastic gene expression. In recent years, various experimental methods and computational tools for analysing single-cell RNA-sequencing data have been proposed. However, most of them are tailored to different experimental designs or biological questions, and in many cases, their performance has not been benchmarked yet, making it harder for researchers to choose the optimal single-cell transcriptome sequencing (scRNA-seq) experiment and analysis workflow. In this review, we aim to provide an overview of the currently available experimental and computational methods developed to handle single-cell RNA-sequencing data and, based on their peculiarities, we suggest possible analysis frameworks depending on specific experimental designs. We also propose an evaluation of challenges, open questions and future perspectives in the field. In particular, we go through the different steps of scRNA-seq experimental protocols such as cell isolation, messenger RNA capture, reverse transcription, amplification and use of quantitative standards such as spike-ins and Unique Molecular Identifiers (UMIs). We then analyse the current methodological challenges related to preprocessing, alignment, quantification, normalization, batch effect correction and methods to control for confounding effects.
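
The abstract above lists normalization among the preprocessing steps without prescribing a method. As one commonly used illustration (an assumption here, not the review's recommendation), the sketch below scales each cell by its total count and applies a log transform. The function name and the scale factor are hypothetical choices.

```python
import numpy as np

def normalize_log_cpm(counts, scale: float = 1e6) -> np.ndarray:
    """Library-size normalization followed by log(1 + x).

    counts: array with cells as rows and genes as columns.
    Each cell is divided by its total count and multiplied by `scale`
    (counts per million by default) before the log transform.
    """
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=1, keepdims=True)
    # Cells with no counts are left as all zeros instead of dividing by zero.
    safe_sizes = np.where(library_sizes == 0, 1.0, library_sizes)
    return np.log1p(counts / safe_sizes * scale)

# Two hypothetical cells with the same composition but a 10-fold depth
# difference, plus one empty cell.
toy_counts = np.array([[1, 1], [10, 10], [0, 0]])
normalized = normalize_log_cpm(toy_counts)
print(normalized)
```

After normalization the first two cells are identical, which is the point of correcting for sequencing depth.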


Author(s):  
Yinlei Hu ◽  
Bin Li ◽  
Falai Chen ◽  
Kun Qu

Abstract Unsupervised clustering is a fundamental step of single-cell RNA sequencing data analysis. This need has inspired several clustering methods to classify cells in single-cell RNA sequencing data. However, accurate prediction of the cell clusters remains a substantial challenge. In this study, we propose a new algorithm for single-cell RNA sequencing data clustering based on Sparse Optimization and low-rank matrix factorization (scSO). We applied our scSO algorithm to analyze multiple benchmark datasets and showed that the cluster number predicted by scSO was close to the number of reference cell types and that most cells were correctly classified. Our scSO algorithm is available at https://github.com/QuKunLab/scSO. Overall, this study demonstrates a potent cell clustering approach that can help researchers distinguish cell types in single-cell RNA sequencing data.


2019 ◽  
Author(s):  
Imad Abugessaisa ◽  
Shuhei Noguchi ◽  
Melissa Cardon ◽  
Akira Hasegawa ◽  
Kazuhide Watanabe ◽  
...  

Abstract Analysis and interpretation of single-cell RNA-sequencing (scRNA-seq) experiments are compromised by the presence of poor-quality cells. For meaningful analyses, such poor-quality cells should be excluded to avoid biases and large variation. However, no clear guidelines exist. We introduce SkewC, a novel quality-assessment method to identify poor-quality single cells in scRNA-seq experiments. The method is based on the assessment of gene coverage for each single cell and its skewness as a quality measure. To validate the method, we investigated the impact of poor-quality cells on downstream analyses and compared biological differences between typical and poor-quality cells. Moreover, we measured the ratio of intergenic expression, suggesting genomic contamination, and foreign organism contamination of single-cell samples. SkewC was tested on 37,993 single cells generated by 15 scRNA-seq protocols. We envision SkewC as an indispensable QC method to be incorporated into scRNA-seq experiments to preclude the possibility of scRNA-seq data misinterpretation.
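
The abstract above uses the skewness of per-cell gene coverage as a quality measure but does not give the formula. The sketch below shows one plausible reading, stated as an assumption and not as SkewC's actual implementation: the coverage profile along the gene body is treated as a distribution over position bins, and its third standardized moment is computed. The function name and the toy profiles are hypothetical.

```python
import numpy as np

def coverage_skewness(coverage) -> float:
    """Skewness of read positions along the gene body, weighted by coverage.

    coverage[i] is the read coverage in position bin i, ordered from the
    5' end to the 3' end. A symmetric profile gives a value near zero;
    a profile concentrated at the 3' end gives a negative value.
    """
    cov = np.asarray(coverage, dtype=float)
    if cov.sum() <= 0:
        raise ValueError("coverage profile must contain some reads")
    positions = np.arange(1, cov.size + 1)
    weights = cov / cov.sum()
    mean = np.sum(weights * positions)
    variance = np.sum(weights * (positions - mean) ** 2)
    if variance == 0:
        raise ValueError("coverage is confined to a single bin")
    third_moment = np.sum(weights * (positions - mean) ** 3)
    return float(third_moment / variance ** 1.5)

# Hypothetical profiles over 100 gene-body bins.
uniform_profile = np.ones(100)
three_prime_biased = np.linspace(0.1, 1.0, 100)  # coverage rises towards 3'

print(coverage_skewness(uniform_profile))
print(coverage_skewness(three_prime_biased))
```

Cells whose skewness departs strongly from that of the bulk of the population would then be candidates for closer inspection; any cutoff is left to the user in this sketch.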


2019 ◽  
Vol 21 (5) ◽  
pp. 1581-1595 ◽  
Author(s):  
Xinlei Zhao ◽  
Shuang Wu ◽  
Nan Fang ◽  
Xiao Sun ◽  
Jue Fan

Abstract Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
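
The abstract above reports that a simple ensemble voting of all tools can improve predictive accuracy. A minimal sketch of per-cell majority voting follows. The function name, the tie-breaking rule (ties become 'unassigned'), and the example labels are assumptions for illustration; the abstract does not specify how votes were combined.

```python
from collections import Counter

def majority_vote(predictions_per_tool, unassigned="unassigned"):
    """Combine per-cell labels from several classifiers by majority vote.

    predictions_per_tool: list of label lists, one list per tool, all of
    the same length (one label per cell, in the same cell order).
    A tie for the top label yields `unassigned` for that cell.
    """
    n_cells = len(predictions_per_tool[0])
    consensus = []
    for i in range(n_cells):
        votes = Counter(tool[i] for tool in predictions_per_tool)
        ranked = votes.most_common()
        top_label, top_count = ranked[0]
        if len(ranked) > 1 and ranked[1][1] == top_count:
            consensus.append(unassigned)
        else:
            consensus.append(top_label)
    return consensus

# Hypothetical predictions from three tools for three cells.
tool_a = ["T cell", "B cell", "NK cell"]
tool_b = ["T cell", "B cell", "monocyte"]
tool_c = ["T cell", "monocyte", "T cell"]

consensus_labels = majority_vote([tool_a, tool_b, tool_c])
print(consensus_labels)
```

The third cell receives three different labels, so this sketch marks it as unassigned instead of picking one arbitrarily.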


Genes ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 240 ◽  
Author(s):  
Prashant N. M. ◽  
Hongyu Liu ◽  
Pavlos Bousounis ◽  
Liam Spurr ◽  
Nawaf Alomran ◽  
...  

With the recent advances in single-cell RNA-sequencing (scRNA-seq) technologies, the estimation of allele expression from single cells is becoming increasingly reliable. Allele expression is both quantitative and dynamic and is an essential component of the genomic interactome. Here, we systematically estimate the allele expression from heterozygous single nucleotide variant (SNV) loci using scRNA-seq data generated on the 10x Genomics Chromium platform. We analyzed 26,640 human adipose-derived mesenchymal stem cells (from three healthy donors), sequenced to an average of 150K sequencing reads per cell (more than 4 billion scRNA-seq reads in total). High-quality SNV calls assessed in our study contained approximately 15% exonic and >50% intronic loci. To analyze the allele expression, we estimated the expressed variant allele fraction (VAF_RNA) from SNV-aware alignments and analyzed its variance and distribution (mono- and bi-allelic) at different minimum sequencing read thresholds. Our analysis shows that when assessing positions covered by a minimum of three unique sequencing reads, over 50% of the heterozygous SNVs show bi-allelic expression, while at a threshold of 10 reads, nearly 90% of the SNVs are bi-allelic. In addition, our analysis demonstrates the feasibility of scVAF_RNA estimation from current scRNA-seq datasets and shows that the 3′-based library generation protocol of 10x Genomics scRNA-seq data can be informative in SNV-based studies, including analyses of transcriptional kinetics.
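
The expressed variant allele fraction mentioned in the abstract above is, in the usual sense of the term, the share of reads at a locus that carry the variant allele, evaluated only where coverage reaches a minimum read threshold. The sketch below illustrates that calculation for one SNV in one cell. The function name and the handling of low-coverage loci (returning None) are assumptions, not the authors' pipeline.

```python
from typing import Optional

def vaf_rna(variant_reads: int, reference_reads: int,
            min_reads: int = 3) -> Optional[float]:
    """Expressed variant allele fraction at one SNV locus in one cell.

    Returns variant_reads / (variant_reads + reference_reads), or None
    when the locus is covered by fewer than `min_reads` reads.
    """
    total = variant_reads + reference_reads
    if total < min_reads:
        return None
    return variant_reads / total

# Hypothetical read counts at three loci.
print(vaf_rna(2, 2))                 # covered by 4 reads, both alleles seen
print(vaf_rna(1, 1))                 # below the default threshold of 3
print(vaf_rna(0, 12, min_reads=10))  # only the reference allele observed
```

Raising `min_reads` from 3 to 10, the two thresholds named in the abstract, trades the number of assessable loci for more stable estimates.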


2021 ◽  
Vol 41 (3) ◽  
pp. 1012-1018
Author(s):  
Jean Acosta ◽  
Daniel Ssozi ◽  
Peter van Galen

The blood system is often represented as a tree-like structure with stem cells that give rise to mature blood cell types through a series of demarcated steps. Although this representation has served as a model of hierarchical tissue organization for decades, single-cell technologies are shedding new light on the abundance of cell type intermediates and the molecular mechanisms that ensure balanced replenishment of differentiated cells. In this Brief Review, we exemplify new insights into blood cell differentiation generated by single-cell RNA sequencing, summarize considerations for the application of this technology, and highlight innovations that are leading the way to understand hematopoiesis at the resolution of single cells. Graphic Abstract: A graphic abstract is available for this article.


2017 ◽  
Author(s):  
Luke Zappia ◽  
Belinda Phipson ◽  
Alicia Oshlack

Abstract As single-cell RNA sequencing technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here we present the Splatter Bioconductor package for simple, reproducible and well-documented simulation of single-cell RNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types or differentiation paths.
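
The abstract above states that the Splat simulation is based on a gamma-Poisson distribution. The sketch below draws counts from a plain gamma-Poisson mixture to show the idea. It is not the Splatter API (Splatter is an R/Bioconductor package) and it omits whatever further components Splat includes; the function name and all parameter values are hypothetical.

```python
import numpy as np

def simulate_gamma_poisson(n_cells: int, n_genes: int,
                           mean_shape: float = 0.6, mean_rate: float = 0.3,
                           dispersion: float = 0.1, seed: int = 0) -> np.ndarray:
    """Simulate a cells-by-genes count matrix from a gamma-Poisson mixture.

    1. Each gene gets a mean expression drawn from a gamma distribution.
    2. Each cell-gene rate is drawn from a gamma distribution centred on
       the gene mean, with variance controlled by `dispersion`.
    3. The observed count is Poisson with that rate.
    """
    rng = np.random.default_rng(seed)
    gene_means = rng.gamma(mean_shape, 1.0 / mean_rate, size=n_genes)
    # Gamma(shape=1/d, scale=mean*d) has expectation equal to the gene mean.
    cell_rates = rng.gamma(1.0 / dispersion, gene_means * dispersion,
                           size=(n_cells, n_genes))
    return rng.poisson(cell_rates)

simulated = simulate_gamma_poisson(n_cells=50, n_genes=20)
print(simulated.shape)
```

Fixing the seed makes the simulated matrix reproducible, in keeping with the reproducibility concern raised in the abstract.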


2018 ◽  
Author(s):  
Aaron T. L. Lun ◽  
Samantha Riesenfeld ◽  
Tallulah Andrews ◽  
Tomas Gomes ◽  
John C. Marioni ◽  
...  

Abstract Droplet-based single-cell RNA sequencing protocols have dramatically increased the throughput and efficiency of single-cell transcriptomics studies. A key computational challenge when processing these data is to distinguish libraries for real cells from empty droplets. Existing methods for cell calling set a minimum threshold on the total unique molecular identifier (UMI) count for each library, which indiscriminately discards cell libraries with low UMI counts. Here, we describe a new statistical method for calling cells from droplet-based data, based on detecting significant deviations from the expression profile of the ambient solution. Using simulations, we demonstrate that our method has greater power than existing approaches for detecting cell libraries with low UMI counts, while controlling the false discovery rate among detected cells. We also apply our method to real data, where we show that the use of our method results in the retention of distinct cell types that would otherwise have been discarded.
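
The baseline that the abstract above criticizes, a minimum threshold on the total UMI count per library, can be sketched in a few lines. This illustrates only that existing approach, not the authors' new statistical method. The function name, the toy counts, and the threshold value are hypothetical.

```python
import numpy as np

def call_cells_by_total_umi(counts, min_total: int) -> np.ndarray:
    """Boolean mask of barcodes whose total UMI count reaches `min_total`.

    counts: array with barcodes (droplet libraries) as rows and genes as
    columns. Barcodes below the threshold are treated as empty droplets.
    """
    totals = np.asarray(counts).sum(axis=1)
    return totals >= min_total

# Hypothetical libraries: a large cell, a near-empty droplet, and a small
# library whose UMIs are concentrated in a single gene.
toy_libraries = np.array([
    [50, 40, 10],
    [1, 0, 1],
    [0, 12, 0],
])

is_cell = call_cells_by_total_umi(toy_libraries, min_total=20)
print(is_cell)
```

The third library is discarded purely because of its size, regardless of how its profile compares with the ambient solution, which is the limitation the abstract describes.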


GigaScience ◽  
2019 ◽  
Vol 8 (10) ◽  
Author(s):  
Yun-Ching Chen ◽  
Abhilash Suresh ◽  
Chingiz Underbayev ◽  
Clare Sun ◽  
Komudi Singh ◽  
...  

Abstract Background: In single-cell RNA-sequencing analysis, clustering cells into groups and differentiating cell groups by differentially expressed (DE) genes are 2 separate steps for investigating cell identity. However, the ability to differentiate between cell groups could be affected by clustering. This interdependency often creates a bottleneck in the analysis pipeline, requiring researchers to repeat these 2 steps multiple times by setting different clustering parameters to identify a set of cell groups that are more differentiated and biologically relevant. Findings: To accelerate this process, we have developed IKAP, an algorithm to identify major cell groups and improve differentiating cell groups by systematically tuning parameters for clustering. We demonstrate that, with default parameters, IKAP successfully identifies major cell types such as T cells, B cells, natural killer cells, and monocytes in 2 peripheral blood mononuclear cell datasets and recovers major cell types in a previously published mouse cortex dataset. These major cell groups identified by IKAP present more distinguishing DE genes compared with cell groups generated by different combinations of clustering parameters. We further show that cell subtypes can be identified by recursively applying IKAP within identified major cell types, thereby delineating cell identities in a multi-layered ontology. Conclusions: By tuning the clustering parameters to identify major cell groups, IKAP greatly improves the automation of single-cell RNA-sequencing analysis to produce distinguishing DE genes and refine cell ontology using single-cell RNA-sequencing data.

