Controlling for confounding effects in single cell RNA sequencing studies using both control and target genes

2016 ◽  
Author(s):  
Mengjie Chen ◽  
Xiang Zhou

Single cell RNA sequencing (scRNAseq) is becoming increasingly popular for unbiased, high-resolution transcriptome analysis of heterogeneous cell populations. Despite its many advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the influence of confounding effects. Controlling for confounding effects in scRNAseq data is thus a crucial step for proper data normalization and accurate downstream analysis. Several recent methodological studies have demonstrated the use of control genes for this purpose in scRNAseq studies: the control genes are used to infer the confounding effects, which are then used to normalize the target genes of primary interest. However, these methods can be suboptimal because they ignore the rich information contained in the target genes. Here, we develop an alternative statistical method, which we refer to as scPLS, for more accurate inference of confounding effects. Our method is based on partial least squares and models control and target genes jointly to better infer and control for confounding effects. To accompany our method, we develop a novel expectation maximization algorithm for scalable inference. Our algorithm is an order of magnitude faster than standard ones, making scPLS applicable to hundreds of cells and hundreds of thousands of genes. With extensive simulations and comparisons with other methods, we demonstrate the effectiveness of scPLS. Finally, we apply scPLS to two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as cell cycle effects.

Author(s):  
Alex M. Ascensión ◽  
Sandra Fuertes-Álvarez ◽  
Olga Ibañez-Solé ◽  
Ander Izeta ◽  
Marcos J. Araúzo-Bravo

2021 ◽  
Vol 12 (2) ◽  
pp. 317-334
Author(s):  
Omar Alaqeeli ◽  
Li Xing ◽  
Xuekui Zhang

The classification tree is a widely used machine learning method. It has multiple implementations as R packages: rpart, ctree, evtree, tree and C5.0. The details of these implementations are not the same, and hence their performance differs from one application to another. We are interested in their performance in the classification of cells using single-cell RNA-sequencing data. In this paper, we conducted a benchmark study using 22 single-cell RNA-sequencing data sets. Using cross-validation, we compared the packages’ prediction performance based on their Precision, Recall, F1-score and Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score and AUC; C5.0 prefers more complex trees; tree is consistently much faster than the others, although its complexity is often higher.
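As a rough illustration of the metrics named in this abstract, the following minimal Python sketch computes Precision, Recall and F1-score for one cell type treated as the positive class. The function name `precision_recall_f1` and the toy labels are hypothetical and are not taken from the benchmark itself, which was run with R packages.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no true positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical cell-type labels for six held-out cells (illustration only).
y_true = ["T", "T", "B", "B", "T", "B"]
y_pred = ["T", "B", "B", "B", "T", "T"]
p, r, f = precision_recall_f1(y_true, y_pred, positive="T")
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In a cross-validated comparison, these quantities would typically be computed on each held-out fold and then averaged across folds.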


Author(s):  
Zilong Zhang ◽  
Feifei Cui ◽  
Chen Lin ◽  
Lingling Zhao ◽  
Chunyu Wang ◽  
...  

Single-cell RNA sequencing (scRNA-seq) has enabled us to study biological questions at the single-cell level. Currently, many analysis tools are available to better utilize these relatively noisy data. In this review, we summarize the most widely used methods for critical downstream analysis steps (i.e. clustering, trajectory inference, cell-type annotation and integrating datasets). The advantages and limitations are comprehensively discussed, and we provide suggestions for choosing proper methods in different situations. We hope this paper will be useful for scRNA-seq data analysts and bioinformatics tool developers.


GigaScience ◽  
2020 ◽  
Vol 9 (10) ◽  
Author(s):  
Mehmet Tekman ◽  
Bérénice Batut ◽  
Alexander Ostrovsky ◽  
Christophe Antoniewski ◽  
Dave Clements ◽  
...  

Background: The vast ecosystem of single-cell RNA-sequencing tools has until recently been plagued by an excess of diverging analysis strategies, inconsistent file formats, and compatibility issues between different software suites. The uptake of 10x Genomics datasets has begun to calm this diversity, and the bioinformatics community leans once more towards the large computing requirements and the statistically driven methods needed to process and understand these ever-growing datasets. Results: Here we outline several Galaxy workflows and learning resources for single-cell RNA-sequencing, with the aim of providing a comprehensive analysis environment paired with a thorough user learning experience that bridges the knowledge gap between the computational methods and the underlying cell biology. The Galaxy reproducible bioinformatics framework provides tools, workflows, and trainings that not only enable users to perform 1-click 10x preprocessing but also empower them to demultiplex raw sequencing from custom tagged and full-length sequencing protocols. The downstream analysis supports a range of high-quality interoperable suites separated into common stages of analysis: inspection, filtering, normalization, confounder removal, and clustering. The teaching resources cover concepts from computer science to cell biology. Access to all resources is provided at the singlecell.usegalaxy.eu portal. Conclusions: The reproducible and training-oriented Galaxy framework provides a sustainable high-performance computing environment for users to run flexible analyses on both 10x and alternative platforms. The tutorials from the Galaxy Training Network along with the frequent training workshops hosted by the Galaxy community provide a means for users to learn, publish, and teach single-cell RNA-sequencing analysis.


2019 ◽  
Vol 21 (5) ◽  
pp. 1581-1595 ◽  
Author(s):  
Xinlei Zhao ◽  
Shuang Wu ◽  
Nan Fang ◽  
Xiao Sun ◽  
Jue Fan

Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable to an increasingly large number of scRNA-seq data sets because of its complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted owing to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
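The "simple ensemble voting" mentioned in this abstract can be sketched as a per-cell majority vote over the labels returned by each tool. The abstract does not specify the exact voting rule, so the tie handling below (ties become 'unassigned'), the function name `majority_vote`, and the toy predictions are assumptions made purely for illustration.

```python
from collections import Counter


def majority_vote(predictions, unassigned="unassigned"):
    """Combine per-tool labels for each cell by simple majority vote.

    predictions: list of label lists, one per tool, all of equal length.
    Ties for the top label are reported as `unassigned` (an assumed rule).
    """
    combined = []
    for labels in zip(*predictions):  # labels given to one cell by all tools
        counts = Counter(labels).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            combined.append(unassigned)
        else:
            combined.append(counts[0][0])
    return combined


# Hypothetical predictions from three classifiers for four cells.
tool_a = ["T cell", "B cell", "NK cell", "T cell"]
tool_b = ["T cell", "B cell", "T cell", "B cell"]
tool_c = ["T cell", "NK cell", "NK cell", "NK cell"]
consensus = majority_vote([tool_a, tool_b, tool_c])
print(consensus)
```

With an odd number of tools, ties arise only when three or more distinct labels compete for a cell, as in the last cell of this toy example.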


2019 ◽  
Vol 122 (4) ◽  
pp. 1291-1296 ◽  
Author(s):  
Djuna von Maydell ◽  
Mehdi Jorfi

Microglia constitute ~10–20% of glial cells in the adult human brain. They are the resident phagocytic immune cells of the central nervous system and play an integral role as first responders during inflammation. Microglia are commonly classified as “HM” (homeostatic), “M1” (classically activated proinflammatory), or “M2” (alternatively activated). Multiple single-cell RNA-sequencing studies suggest that this discrete classification system does not accurately and fully capture the vast heterogeneity of microglial states in the brain. In fact, a recent single-cell RNA-sequencing study showed that microglia exist along a continuous spectrum of states. This spectrum spans heterogeneous populations of homeostatic and neuropathology-associated microglia in both healthy and Alzheimer’s disease (AD) mouse brains. Major risk factors, such as sex, age, and genes, modulate microglial states, suggesting that shifts along the trajectory might play a causal role in AD pathogenesis. This study provides important insight into the cellular mechanisms of AD and underlines the potential of novel cell-based therapies for AD.


2019 ◽  
Author(s):  
Lukas M. Simon ◽  
Fangfang Yan ◽  
Zhongming Zhao

Single cell RNA sequencing (scRNA-seq) unfolds complex transcriptomic data sets into detailed cellular maps. Despite recent success, there is a pressing need for specialized methods tailored towards the functional interpretation of these cellular maps. Here, we present DrivAER, a machine learning approach that scores annotated gene sets based on their relevance to user-specified outcomes such as pseudotemporal ordering or disease status. We demonstrate that DrivAER extracts the key driving pathways and transcription factors that regulate complex biological processes from scRNA-seq data.


2017 ◽  
Author(s):  
Jonathan A. Griffiths ◽  
Arianne C. Richard ◽  
Karsten Bach ◽  
Aaron T.L. Lun ◽  
John C Marioni

Barcode swapping results in the mislabeling of sequencing reads between multiplexed samples on the new patterned flow cell Illumina sequencing machines. This may compromise the validity of numerous genomic assays, especially for single-cell studies where many samples are routinely multiplexed together. The severity and consequences of barcode swapping for single-cell transcriptomic studies remain poorly understood. We have used two statistical approaches to robustly quantify the fraction of swapped reads in each of two plate-based single-cell RNA sequencing datasets. We found that approximately 2.5% of reads were mislabeled between samples on the HiSeq 4000 machine, which is lower than previous reports. We observed no correlation between the swapped fraction of reads and the concentration of free barcode across plates. Furthermore, we have demonstrated that barcode swapping may generate complex but artefactual cell libraries in droplet-based single-cell RNA sequencing studies. To eliminate these artefacts, we have developed an algorithm to exclude individual molecules that have swapped between samples in 10X Genomics experiments, exploiting the combinatorial complexity present in the data. This permits the continued use of cutting-edge sequencing machines for droplet-based experiments while avoiding the confounding effects of barcode swapping.


2021 ◽  
Vol 218 (6) ◽  
Author(s):  
Dev Bhatt ◽  
Boxi Kang ◽  
Deepali Sawant ◽  
Liangtao Zheng ◽  
Kristy Perez ◽  
...  

Single-cell RNA sequencing is a powerful tool to examine cellular heterogeneity, novel markers and target genes, and therapeutic mechanisms in human cancers and animal models. Here, we analyzed single-cell RNA sequencing data of T cells obtained from multiple mouse tumor models by PCA-based subclustering coupled with TCR tracking using the STARTRAC algorithm. This approach revealed various differentiated T cell subsets and activation states, and a correspondence of T cell subsets between human and mouse tumors. STARTRAC analyses demonstrated peripheral T cell subsets that were developmentally connected with tumor-infiltrating CD8+ cells, CD4+ Th1 cells, and T reg cells. In addition, large amounts of paired TCRα/β sequences enabled us to identify a specific enrichment of paired public TCR clones in tumor. Finally, we identified CCR8 as a tumor-associated T reg cell marker that could preferentially deplete tumor-associated T reg cells. We showed that CCR8-depleting antibody treatment provided therapeutic benefit in CT26 tumors and synergized with anti–PD-1 treatment in MC38 and B16F10 tumor models.

