Mapbatch: Conservative Batch Normalization for Single Cell RNA-Sequencing Data Enables Discovery of Rare Cell Populations in a Multiple Myeloma Cohort

Blood ◽  
2021 ◽  
Vol 138 (Supplement 1) ◽  
pp. 2954-2954
Author(s):  
Chern Han Yong ◽  
Shawn Hoon ◽  
Sanjay De Mel ◽  
Stacy Xu ◽  
Jonathan Adam Scolnick ◽  
...  

Abstract Introduction: Many cancers involve the participation of rare cell populations that may only be found in a subset of patients. Single-cell RNA sequencing (scRNA-seq) can identify distinct cell populations across multiple samples, with batch normalization used to reduce processing-based effects between samples. However, aggressive normalization obscures rare cell populations, which may be erroneously grouped with other cell types. There is a need for conservative batch normalization that maintains the biological signal necessary to detect rare cell populations. MapBatch: We designed a batch normalization tool, MapBatch, based on two principles: an autoencoder trained with a single sample learns the underlying gene expression structure of cell types without batch effect; and an ensemble model combines multiple autoencoders, allowing the use of multiple samples for training. Each autoencoder is trained on one sample, learning a projection into the biological space S representing the real expression differences between cells in that sample (Figure 1a, middle). When other samples are projected into S, the projection reduces expression differences orthogonal to S, while preserving differences along S. The reverse projection transforms the data back into gene space at the autoencoder's output, sans expression differences orthogonal to S (Figure 1a, right). Since batch-based technical differences are not represented in S, this transformation selectively removes batch effect between samples, while preserving biological signal. The autoencoder output thus represents normalized expression data, conditioned on the training sample. To incorporate multiple samples into training, MapBatch uses an ensemble of autoencoders, each trained with a single sample (Figure 1b). We train with the minimal number of samples necessary to cover the different cell populations in the dataset.
We implement regularization using dropout and noise layers, and an a priori feature extraction layer using KEGG gene modules. The autoencoders' outputs are concatenated for downstream analysis. For visualization and clustering, we use the top principal components of the concatenated outputs. For differential expression (DE), we perform DE on the gene matrix output by each model, then take the result with the lowest P-value. To test MapBatch, we generated a synthetic dataset based on 7 batches of publicly available PBMC data. For each batch we simulated rare cell populations by selecting one of three cell types to perturb by up- and down-regulating 40 genes in 0.5%-2% of the cells (Figure 1c). We simulated additional batch effect by scaling each gene in each batch with a scaling factor. Upon visualization and clustering, cells grouped largely by batch (Figure 1d). After batch normalization, cells grouped by cell type rather than batch, and all three perturbed cell populations were successfully delineated (Figure 1e). DE between each perturbed population and its mother cells accurately retrieved the perturbed genes, showing that normalization maintained real expression differences (Figure 1e). In contrast, the three other methods tested, Seurat (Stuart et al., 2019), Harmony (Korsunsky et al., 2019), and Liger (Welch et al., 2019), could only derive a subset of the perturbed populations (Figures 1f-h). MapBatch identifies rare populations in multiple myeloma (MM): We used MapBatch to process bone marrow scRNA-seq data from 14 MM samples and 2 healthy controls. After batch normalization, unsupervised clustering identified 20 clusters, which we annotated using MapCell (Koh & Hoon, 2019) (Figures 2a, 2b). We identified 3 small clusters of cells that could not be reliably annotated, comprising less than 1% of total cells and found in only a subset of patients (Figures 2c, 2d).
As validation, we observed that these cells were present in distinct clusters in individual samples using their uncorrected expression data, providing evidence that these clusters were driven neither by batch effect nor by MapBatch (Figure 2e). Conclusion: Batch normalization of scRNA-seq data involves a trade-off between minimizing batch effect and maximizing the remaining biological signal. While most methods lean towards the former, MapBatch maintains more biological signal for downstream analysis, enabling the discovery of cell populations that were previously difficult to find. Figure 1. Disclosures Xu: Proteona Pte Ltd: Current Employment. Scolnick: Proteona Pte Ltd: Current holder of individual stocks in a privately-held company. Huo: Proteona Pte Ltd: Ended employment in the past 24 months. Lovci: Proteona Pte Ltd: Current Employment. Chng: Amgen: Honoraria, Research Funding; Abbvie: Honoraria; Janssen: Honoraria, Research Funding; Novartis: Honoraria; Celgene: Honoraria, Research Funding.
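The projection idea in the MapBatch abstract can be illustrated with a linear stand-in. The sketch below is not the authors' implementation: it swaps the autoencoder for a PCA subspace fitted on one sample (an assumption made for brevity), the function names `fit_subspace` and `project_through` are hypothetical, and the toy batch effect is deliberately chosen to be orthogonal to the learned space S. Under those assumptions, projecting a second sample into S and back removes the batch shift while keeping the variation along S.

```python
import numpy as np

def fit_subspace(train, n_components):
    """Learn an orthonormal basis for S from one sample (cells x genes)."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal directions of the centred training sample.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]

def project_through(x, mean, basis):
    """Project into S and back to gene space, dropping variation orthogonal to S."""
    return (x - mean) @ basis.T @ basis + mean

rng = np.random.default_rng(0)
n_genes = 6
# Toy "biology": cells vary only along two gene programs.
programs = np.zeros((2, n_genes))
programs[0, :2] = 1.0
programs[1, 2:4] = 1.0
sample_a = rng.normal(size=(200, 2)) @ programs
sample_b = rng.normal(size=(200, 2)) @ programs

# Toy batch effect (assumption): a constant shift on a gene outside both programs.
shift = np.zeros(n_genes)
shift[5] = 3.0
sample_b_shifted = sample_b + shift

mean, basis = fit_subspace(sample_a, n_components=2)
normalized_b = project_through(sample_b_shifted, mean, basis)
```

Real batch effects, such as the per-gene scaling used in the synthetic benchmark, are generally not orthogonal to S, which is one reason a nonlinear, regularized autoencoder and an ensemble over several training samples are described in the abstract rather than a single linear projection.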

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Andrea Tangherloni ◽  
Federico Ricciuti ◽  
Daniela Besozzi ◽  
Pietro Liò ◽  
Ana Cvejic

Abstract Background: Single-cell RNA sequencing (scRNA-Seq) experiments are gaining ground to study the molecular processes that drive normal development as well as the onset of different pathologies. Finding an effective and efficient low-dimensional representation of the data is one of the most important steps in the downstream analysis of scRNA-Seq data, as it could provide a better identification of known or putatively novel cell-types. Another step that still poses a challenge is the integration of different scRNA-Seq datasets. Though standard computational pipelines to gain knowledge from scRNA-Seq data exist, a further improvement could be achieved by means of machine learning approaches. Results: Autoencoders (AEs) have been effectively used to capture the non-linearities among gene interactions of scRNA-Seq data, so that the deployment of AE-based tools might represent the way forward in this context. We introduce here scAEspy, a unifying tool that embodies: (1) four of the most advanced AEs, (2) two novel AEs that we developed for this purpose, (3) different loss functions. We show that scAEspy can be coupled with various batch-effect removal tools to integrate data from different scRNA-Seq platforms, in order to better identify the cell-types. We benchmarked scAEspy against the most used batch-effect removal tools, showing that our AE-based strategies outperform the existing solutions. Conclusions: scAEspy is a user-friendly tool that enables using the most recent and promising AEs to analyse scRNA-Seq data by only setting up two user-defined parameters. Thanks to its modularity, scAEspy can be easily extended to accommodate new AEs to further improve the downstream analysis of scRNA-Seq data. Considering the relevant results we achieved, scAEspy can be considered as a starting point to build a more comprehensive toolkit designed to integrate single-cell multi-omics data.


2021 ◽  
Vol 7 (10) ◽  
pp. eabc5464
Author(s):  
Kiya W. Govek ◽  
Emma C. Troisi ◽  
Zhen Miao ◽  
Rachael G. Aubin ◽  
Steven Woodhouse ◽  
...  

Highly multiplexed immunohistochemistry (mIHC) enables the staining and quantification of dozens of antigens in a tissue section with single-cell resolution. However, annotating cell populations that differ little in the profiled antigens or for which the antibody panel does not include specific markers is challenging. To overcome this obstacle, we have developed an approach for enriching mIHC images with single-cell RNA sequencing data, building upon recent experimental procedures for augmenting single-cell transcriptomes with concurrent antigen measurements. Spatially-resolved Transcriptomics via Epitope Anchoring (STvEA) performs transcriptome-guided annotation of highly multiplexed cytometry datasets. It increases the level of detail in histological analyses by enabling the systematic annotation of nuanced cell populations, spatial patterns of transcription, and interactions between cell types. We demonstrate the utility of STvEA by uncovering the architecture of poorly characterized cell types in the murine spleen using published cytometry and mIHC data of this organ.


F1000Research ◽  
2019 ◽  
Vol 7 ◽  
pp. 1306 ◽  
Author(s):  
Clarence K. Mah ◽  
Alexander T. Wenzel ◽  
Edwin F. Juarez ◽  
Thorin Tabor ◽  
Michael M. Reich ◽  
...  

Single-cell RNA sequencing (scRNA-seq) has emerged as a popular method to profile gene expression at the resolution of individual cells. While there have been methods and software specifically developed to analyze scRNA-seq data, they are most accessible to users who program. We have created an scRNA-seq clustering analysis GenePattern Notebook that provides an interactive, easy-to-use interface for data analysis and exploration of scRNA-seq data, without the need to write or view any code. The notebook provides a standard scRNA-seq analysis workflow for pre-processing data, identification of sub-populations of cells by clustering, and exploration of biomarkers to characterize heterogeneous cell populations and delineate cell types.


2021 ◽  
Author(s):  
Hanbyeol Kim ◽  
Joongho Lee ◽  
Keunsoo Kang ◽  
Seokhyun Yoon

Abstract Cell type identification is a key step in the downstream analysis of single-cell RNA-seq experiments. The indispensable information for this is gene expression, which is used to cluster cells, train the model, and set rejection thresholds. The problem is that expression values are subject to batch effects arising from different platforms and preprocessing. We present MarkerCount, which uses the number of markers expressed, regardless of their expression level, to initially identify cell types and then reassigns cell types on a cluster basis. MarkerCount works in both reference-based and marker-based modes: the latter uses only existing lists of markers, while the former requires a pre-annotated dataset to train the model. Its performance was evaluated and compared with existing identifiers, both marker-based and reference-based, that can be customized with publicly available datasets and marker databases. The results show that MarkerCount provides stable performance compared with other reference-based and marker-based cell type identifiers.
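The core idea of counting detected markers rather than using their expression levels can be sketched as follows. This is only an illustration under stated assumptions, not the MarkerCount algorithm itself: the function name `marker_count_scores`, the toy marker lists, and the "fraction of markers detected" score are hypothetical, and the model training, rejection thresholds, and cluster-based reassignment mentioned in the abstract are omitted.

```python
import numpy as np

def marker_count_scores(expr, gene_names, marker_lists):
    """Fraction of each type's markers detected (count > 0) in every cell."""
    index = {g: i for i, g in enumerate(gene_names)}
    detected = expr > 0  # expression level is deliberately ignored
    types = sorted(marker_lists)
    scores = np.zeros((expr.shape[0], len(types)))
    for j, cell_type in enumerate(types):
        cols = [index[g] for g in marker_lists[cell_type] if g in index]
        if cols:
            scores[:, j] = detected[:, cols].mean(axis=1)
    return types, scores

# Toy data (assumption): 3 cells x 4 genes of raw counts.
genes = ["CD3E", "CD4", "CD19", "MS4A1"]
markers = {"B cell": ["CD19", "MS4A1"], "T cell": ["CD3E", "CD4"]}
expr = np.array([
    [5, 1, 0, 0],    # both T-cell markers detected
    [0, 0, 2, 90],   # both B-cell markers detected, at very different levels
    [50, 0, 0, 1],   # ambiguous: one marker of each type
])
types, scores = marker_count_scores(expr, genes, markers)
labels = [types[k] for k in scores.argmax(axis=1)]
```

Because only detection is counted, the second cell scores the same for "B cell" whether MS4A1 has 2 or 90 counts, which is the property that makes such a score less sensitive to platform-dependent scaling. The third cell ties between the two types, the kind of case a rejection threshold or a cluster-level reassignment would have to resolve.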


2021 ◽  
Vol 12 ◽  
Author(s):  
Bin Zou ◽  
Tongda Zhang ◽  
Ruilong Zhou ◽  
Xiaosen Jiang ◽  
Huanming Yang ◽  
...  

It is well recognized that batch effect in single-cell RNA sequencing (scRNA-seq) data remains a major challenge when integrating different datasets. Here, we propose deepMNN, a novel deep learning-based method to correct batch effect in scRNA-seq data. We first search for mutual nearest neighbor (MNN) pairs across different batches in a principal component analysis (PCA) subspace. Subsequently, a batch correction network is constructed by stacking two residual blocks and applied to remove batch effects. The loss function of deepMNN is defined as the sum of a batch loss and a weighted regularization loss. The batch loss measures the distance between the cells of each MNN pair in the PCA subspace, while the regularization loss keeps the output of the network similar to the input. The experimental results showed that deepMNN can successfully remove batch effects across datasets with identical cell types, datasets with non-identical cell types, datasets with multiple batches, and large-scale datasets. We compared the performance of deepMNN with state-of-the-art batch correction methods, including the widely used methods Harmony, Scanorama, and Seurat V4 as well as the recently developed deep learning-based methods MMD-ResNet and scGen. The results demonstrated that deepMNN achieved better or comparable performance in terms of both qualitative analysis using uniform manifold approximation and projection (UMAP) plots and quantitative metrics such as batch and cell entropies, ARI F1 score, and ASW F1 score under various scenarios. Additionally, deepMNN allowed scRNA-seq datasets with multiple batches to be integrated in one step. Furthermore, deepMNN ran much faster than the other methods on large-scale datasets. These characteristics give deepMNN the potential to become a new choice for large-scale single-cell gene expression data analysis.
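The two ingredients named in the deepMNN abstract, MNN pairs and a loss that sums a batch term and a weighted regularization term, can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the published implementation: the function names are hypothetical, Euclidean distance and mean aggregation are assumed, the PCA step and the residual network are omitted, and the "corrected" output is produced by hand rather than by a trained model.

```python
import numpy as np

def mnn_pairs(a, b, k):
    """Index pairs (i, j) where a[i] and b[j] are each among the other's k nearest."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    nn_ab = np.argsort(d, axis=1)[:, :k]    # for each cell in a, nearest cells in b
    nn_ba = np.argsort(d, axis=0)[:k, :].T  # for each cell in b, nearest cells in a
    return [(i, int(j)) for i in range(len(a)) for j in nn_ab[i] if i in nn_ba[j]]

def total_loss(out_a, out_b, in_a, in_b, pairs, weight):
    """Batch loss over MNN pairs plus a weighted penalty for altering the input."""
    i, j = np.array(pairs).T
    batch = np.linalg.norm(out_a[i] - out_b[j], axis=1).mean()
    reg = (np.linalg.norm(out_a - in_a, axis=1).mean()
           + np.linalg.norm(out_b - in_b, axis=1).mean())
    return batch + weight * reg

# Toy data (assumption): five well-separated cells, and a second batch that
# contains the same cells shifted by a constant offset.
batch_a = 10.0 * np.eye(5)
batch_b = batch_a + 0.5
pairs = mnn_pairs(batch_a, batch_b, k=1)

uncorrected = total_loss(batch_a, batch_b, batch_a, batch_b, pairs, weight=0.1)
# Hand-made "correction": move batch_b back onto batch_a.
corrected = total_loss(batch_a, batch_b - 0.5, batch_a, batch_b, pairs, weight=0.1)
```

In this toy case the correction drives the batch term to zero at the cost of a small regularization term, so `corrected` is lower than `uncorrected`. The weight controls how strongly the output is held close to the input, which is the lever that limits over-correction.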


Blood ◽  
2016 ◽  
Vol 128 (22) ◽  
pp. 3515-3515
Author(s):  
Muntasir M Majumder ◽  
Aino Maija Leppä ◽  
Caroline A Heckman

Abstract Introduction: Off-target cytotoxicity resulting in severe side effects and compromising patient survival often hampers the development of new cancer therapeutics. Understanding the complete drug response landscape of different cell populations is crucial to identify drugs that selectively eradicate the malignant cell population, but spare healthy cells. Here, we developed a high-content, no-wash, multi-parametric flow cytometry-based assay that enables testing of blood cancer patient samples and simultaneously monitors the effects of several drugs on 11 hematopoietic cell types. The assay can be used to i) dissect malignant from healthy cell responses and predict off-target effects; ii) assess drug effects on immune cell subsets; iii) identify drugs that can potentially be repositioned to new blood cancer indications. Methods: Mononuclear cells were prepared from bone marrow aspirates of 7 multiple myeloma (MM) and 3 acute myeloid leukemia (AML) patients plus the peripheral blood from a healthy donor, which were collected following informed consent and in compliance with the Declaration of Helsinki. Optimal cell density, antibody dilutions, incubation time, and wash versus no wash assay conditions for the selected antibody panels were determined. Cells were incubated at a density of 2 million cells/ml in either 96- or 384-well plates for 3 days. The antibodies were tested in two panels to study the effects of 6 drugs in 5 dilutions (1-10000 nM) (clofarabine, bortezomib, dexamethasone, navitoclax, venetoclax and omipalisib) on 11 cell populations, namely hematopoietic stem cells (HSCs) (CD34+CD38-), common progenitor cells (CPCs) (CD34+CD38+), monocytes (CD14+), B cells (CD45+CD19+), cytotoxic T cells (CD45+CD3+CD8+), T helper cells (CD45+CD3+CD4+), NK-T cells (CD45+CD3+CD56+), NK cells (CD45+CD56+CD3-), clonal plasma cells (CD138+CD38+), other plasma cells (CD138+CD38-) and granulocytes (CD45+, SSC++).
Annexin-V and 7AAD were used to distinguish live cell populations from apoptotic and dead cells. After 1 h incubation with antibodies, the plates were read with the iQue Screener PLUS instrument (Intellicyt). Counts for each population were used to generate four-parameter nonlinear regression fitted dose response curves with GraphPad Prism 7. Three samples were tested in duplicate to assess reproducibility. Results: To decrease the complexity of the assay, we tested all antibodies under wash and no wash conditions, and found that results from both conditions were comparable. To minimize the amount of sample needed as well as maximize the number of drugs tested and cell populations that can be detected, we set up the assay in both 96- and 384-well plates. The assay was highly reproducible when samples were tested in replicate and was scalable to a 384-well format without compromising sensitivity to detect rare populations such as plasma cells. Due to the differentiation of immature cells to specialized cell types, the drug responses of specific populations tended to drift. HSCs (CD34+CD38-) were shown to be refractory to the tested drugs compared to CPCs (CD34+CD38+) and other cell types. Interestingly, the proteasome inhibitor bortezomib was cytotoxic to all cell populations except for CD138+CD38- plasma cells. Clofarabine, a nucleoside analog used to treat ALL, effectively targeted CPC, NK and B cells, while HSCs and plasma cells were resistant. The glucocorticoid and immunosuppressive drug dexamethasone specifically targeted B and NK cells compared to T cell populations (CD8+, CD4+), while NK-T cells were modestly sensitive. The cell population response patterns were similar in samples derived from MM, AML and healthy individuals, highlighting that the drug responses are highly cell type specific.
Summary: Using a high-content, multi-parametric assay, we could rapidly assess the effect of several drugs on specific cell populations in individual patient samples. Our results demonstrate that many drugs preferentially affect different hematological cell lineages. Although heterogeneity was observed between individual patients, the pattern of cytotoxic response exhibited by specific cell types was consistent among samples derived from MM, AML and healthy donors. The assay will be useful to identify drugs with maximal on-target and minimal off-target specificity, and can potentially be used to guide treatment decisions and predict patient response. Disclosures Heckman: Celgene: Research Funding; Pfizer: Research Funding.
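For readers unfamiliar with the four-parameter dose response model mentioned in the methods, the sketch below evaluates one common form of the four-parameter logistic curve. It is illustrative only: the abstract reports fitting in GraphPad Prism 7 and gives no parameter values, so the function name, the 10-fold dose spacing, and every parameter value here are assumptions, and no fitting is performed.

```python
import numpy as np

def four_param_logistic(dose, bottom, top, ec50, hill):
    """Response falls from `top` towards `bottom` as dose rises (for hill > 0)."""
    dose = np.asarray(dose, dtype=float)
    return bottom + (top - bottom) / (1.0 + (dose / ec50) ** hill)

# Five dilutions spanning 1-10000 nM; the 10-fold spacing is an assumption.
doses_nm = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
# Hypothetical parameters: response as a percentage of untreated cell counts.
viability = four_param_logistic(doses_nm, bottom=10.0, top=100.0, ec50=100.0, hill=1.0)
```

At a dose equal to `ec50` the curve sits halfway between `top` and `bottom`. Fitting one such curve per cell population is what allows sensitivity to a drug to be compared across the 11 populations.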


Blood ◽  
2016 ◽  
Vol 128 (22) ◽  
pp. 4278-4278
Author(s):  
Shovik Bandyopadhyay ◽  
Liyang Yu ◽  
Daniel A.C. Fisher ◽  
Olga Malkova ◽  
Stephen T. Oh

Abstract Introduction: Mass cytometry is a powerful tool for analyzing cellular networks, with the ability to generate massive data sets encompassing > 40 parameters measured simultaneously at the single cell level. Various groups have created a variety of platforms to analyze this high dimensional data in unique and efficient ways. These tools have a range of applications: from using phenotypic similarities to cluster cells, stratifying unique signaling subpopulations based on observed stimulation responses, mapping the developmental trajectory of cell types, among many others. We have previously utilized mass cytometry to characterize NFkB hyperactivation in myeloproliferative neoplasms. Here we applied mass cytometric analysis to a cohort of patients with secondary acute myeloid leukemia (sAML) following a history of chronic MPN. The objective of this work was to identify populations of functionally primitive leukemic cells, relying not only on traditional immunophenotypic designations (which can vary considerably in sAML), but also by inferring functional status based on the presence or absence of cytokine hypersensitivity and constitutively active signaling in specific cell populations. Results: Dimensionality reduction and clustering analysis by viSNE and SPADE identified multiple cell subsets outside the hematopoietic stem/progenitor cell (HSPC) compartment that exhibited overt thrombopoietin (TPO) sensitivity, while healthy controls had highly localized responses largely restricted to the HSPC compartment. Using Phenograph, ten sAML metaclusters were identified containing cells from six sAML patients analyzed. One of these metaclusters represented a distinct subpopulation of CD61+ CD34- CD38- CD45lo cells with variable CD90 and CD11b expression. 
This subpopulation of CD61+ cells was not identified by manual gating, and exhibited significantly greater STAT3/STAT5 phosphorylation in response to TPO than did lineage-negative CD34+ CD38- cells in five out of six (83%) AML patients examined. In addition, substantially elevated basal STAT3 phosphorylation in this population was hypersensitive to TPO and largely resistant to ex vivo ruxolitinib. The classify function of Phenograph was utilized to determine whether the cytokine hypersensitivity observed in the viSNE and SPADE analysis could be entirely accounted for by the aforementioned CD61+ CD34- CD38- CD45lo population. The signaling responses highly predictive of specific cell types were identified, which were used to assess the functional status of sAML cells compared to healthy Lin- CD34+ CD38- cells. By this approach, sAML cells were found to exhibit significant incongruity between surface cell type designation and functional designation. Furthermore, functionally primitive cells displayed a spectrum of myeloid surface markers, suggesting that restricting analysis to a subset of strictly surface-defined cells would potentially obscure populations of interest. Conclusions: Our analysis revealed a distinct, previously undescribed population of CD61+ CD34- CD38- CD45lo cells in sAML. While the biological relevance of this population requires validation by functional assays, this result demonstrates that immunophenotypic changes in traditional surface-marker-defined populations may conceal important cell populations. These cells, and other functionally primitive but mature-designated cells, could be relevant to studying sAML disease pathogenesis, progression, and/or response to therapy. This study further demonstrates the potential for mass cytometry to elucidate rare leukemic subpopulations in highly heterogeneous tumors.
Disclosures Oh: Gilead: Membership on an entity's Board of Directors or advisory committees, Research Funding; Incyte Corporation: Membership on an entity's Board of Directors or advisory committees, Research Funding; Janssen: Research Funding; CTI: Research Funding.


2020 ◽  
Author(s):  
Wanqiu Chen ◽  
Yongmei Zhao ◽  
Xin Chen ◽  
Xiaojiang Xu ◽  
Zhaowei Yang ◽  
...  

Abstract Single-cell RNA sequencing (scRNA-seq) has become a very powerful technology for biomedical research and is becoming much more affordable as methods continue to evolve, but it is unknown how reproducible different platforms are using different bioinformatics pipelines, particularly the recently developed scRNA-seq batch correction algorithms. We carried out a comprehensive multi-center cross-platform comparison on different scRNA-seq platforms using standard reference samples. We compared six pre-processing pipelines, seven bioinformatics normalization procedures, and seven batch effect correction methods including CCA, MNN, Scanorama, BBKNN, Harmony, limma and ComBat to evaluate the performance and reproducibility of 20 scRNA-seq data sets derived from four different platforms and centers. We benchmarked scRNA-seq performance across different platforms and testing sites using global gene expression profiles as well as some cell-type specific marker genes. We showed that there were large batch effects; and the reproducibility of scRNA-seq across platforms was dictated both by the expression level of genes selected and the batch correction methods used. We found that CCA, MNN, and BBKNN all corrected the batch variations fairly well for the scRNA-seq data derived from biologically similar samples across platforms/sites. However, for the scRNA-seq data derived from or consisting of biologically distinct samples, limma and ComBat failed to correct batch effects, whereas CCA over-corrected the batch effect and misclassified the cell types and samples. In contrast, MNN, Harmony and BBKNN separated biologically different samples/cell types into correspondingly distinct dimensional subspaces; however, consistent with this algorithm’s logic, MNN required that the samples evaluated each contain a shared portion of highly similar cells.
In summary, we found a great cross-platform consistency in separating two distinct samples when an appropriate batch correction method was used. We hope this large cross-platform/site scRNA-seq data set will provide a valuable resource, and that our findings will offer useful advice for the single-cell sequencing community.


2021 ◽  
Author(s):  
Laura M. Richards ◽  
Mazdak Riverin ◽  
Suluxan Mohanraj ◽  
Shamini Ayyadhury ◽  
Danielle C. Croucher ◽  
...  

Tumours are routinely profiled with single-cell RNA sequencing (scRNA-seq) to characterize their diverse cellular ecosystems of malignant, immune, and stromal cell types. When combining data from multiple samples or studies, batch-specific technical variation can confound biological signals. However, scRNA-seq batch integration methods are often not designed for, or benchmarked on, datasets containing cancer cells. Here, we compare 5 data integration tools applied to 171,206 cells from 5 tumour scRNA-seq datasets. Based on our results, STACAS and fastMNN are the most suitable methods for integrating tumour datasets, demonstrating robust batch effect correction while preserving relevant biological variability in the malignant compartment. This comparison provides a framework for evaluating how well single-cell integration methods correct for technical variability while preserving biological heterogeneity of malignant and non-malignant cell populations.


Blood ◽  
2020 ◽  
Vol 136 (Supplement 1) ◽  
pp. 41-42
Author(s):  
Yanyan Wang ◽  
Christian Rohde ◽  
Fengbiao Zhou ◽  
Marco Hennrich ◽  
Laura Poisa-Beiro ◽  
...  

Introduction: RNA modifications are emerging as important determinants of cell identity and cell fate. Small nucleolar RNAs (snoRNA) guide pseudouridylation and 2'-O-methylation of RNA species. C/D box snoRNAs are essential for AML1/ETO-induced leukemia (Zhou et al. Nat Cell Biol 2017). Dynamics and relevance of these modifications in hematopoiesis are unknown. Here, we aimed to determine the plasticity of ribosomal 2'-O-methylation (Ribomethylome) patterns in hematopoietic cell populations and the interdependence with snoRNA expression, transcriptomics and proteomics. Methods: Healthy donors (19-86 yrs) donated bone marrow and six cell populations were sorted or prepared: Hematopoietic stem/progenitor cell (HPC), Monocyte/macrophage precursor (MON), Granulocytic precursor (GRA), Erythroid precursor (ERY), Lymphocyte progenitor (LYM), and Mesenchymal stem/stromal cell (MSC). Small RNA sequencing and Ribometh-seq data were obtained for 65 and 55 samples, respectively. Data were analyzed together with accompanying RNA-seq and Mass-spec proteomics data which were available for all the specimens. Bioinformatics analyses were based on PCA, tSNE, Spearman correlation, paired t-test, GSEA and ANOVA. Results: The analyses of 2'-O-methylation (Ribomethylome) in six bone marrow cell types from healthy donors revealed that ribosomal modifications differed during the process of hematopoietic differentiation. Among these sites, HPC and Myeloid lineage showed significant variability between different cell populations. Ribomethylome patterns differed between cell types and PCA analyses indicated that cellular identity was matched with a specific Ribomethylome pattern. Plasticity in Ribomethylomes was most evident for HPC, LYM, GRA and MON which showed high levels of 2'-O-methylation (almost 100% of rRNA methylated) whereas methylation levels in MSC cells were much lower (Spearman correlation<0.4).
These findings indicated that Ribomethylome patterns were cell type specific. Using snoRNA sequencing, we showed that snoRNA expression levels differed between the different cell types. C/D box snoRNAs were variably expressed, and the expression differences for SNORD68 and SNORD87 were associated with respective Ribomethylome changes of predicted target sites. We next analyzed the association between specific 2'-O-methylation levels and the levels of protein expression. Only those proteins were included for which no association between mRNA and total protein levels was observed. Spearman rank analyses suggested that RAB7A, PSME1 involved in "antigen processing and presentation" and FLNA, RCC2 involved in "cell migration" correlated closely with 2'-O-methylation of the dynamically regulated sites 28S_3723_SNORD87 and 5.8S_14_SNORD71. Conclusion: Our findings based on multi-omics analyses identify cell type specific Ribomethylomes. Myeloid differentiation is associated with specific Ribomethylome changes. Distinct Ribomethylomes may contribute to cellular identity by directing translation of specific sets of mRNAs. Figure 1: The effects of ribomethylome and protein translation were evident and separated by different cell populations (tSNE). Disclosures Müller-Tidow: Daiichi Sankyo: Research Funding; BiolineRx: Research Funding; Janssen-Cilag GmbH: Speakers Bureau; Pfizer: Research Funding, Speakers Bureau.

