scholarly journals Projected t-SNE for batch correction

2020 ◽  
Vol 36 (11) ◽  
pp. 3522-3527 ◽  
Author(s):  
Emanuele Aliverti ◽  
Jeffrey L Tilson ◽  
Dayne L Filer ◽  
Benjamin Babcock ◽  
Alejandro Colaneri ◽  
...  

Abstract Motivation Low-dimensional representations of high-dimensional data are routinely employed in biomedical research to visualize, interpret and communicate results from different pipelines. In this article, we propose a novel procedure to directly estimate t-SNE embeddings that are not driven by batch effects. Without correction, interesting structure in the data can be obscured by batch effects. The proposed algorithm can therefore significantly aid visualization of high-dimensional data. Results The proposed methods are based on linear algebra and constrained optimization, leading to efficient algorithms and fast computation in many high-dimensional settings. Results on artificial single-cell transcription profiling data show that the proposed procedure successfully removes multiple batch effects from t-SNE embeddings, while retaining fundamental information on cell types. When applied to single-cell gene expression data to investigate mouse medulloblastoma, the proposed method successfully removes batches related with mice identifiers and the date of the experiment, while preserving clusters of oligodendrocytes, astrocytes, and endothelial cells and microglia, which are expected to lie in the stroma within or adjacent to the tumours. Availability and implementation Source code implementing the proposed approach is available as an R package at https://github.com/emanuelealiverti/BC_tSNE, including a tutorial to reproduce the simulation studies. Contact [email protected]

2020 ◽  
Author(s):  
Mohit Goyal ◽  
Guillermo Serrano ◽  
Ilan Shomorony ◽  
Mikel Hernaez ◽  
Idoia Ochoa

AbstractSingle-cell RNA-seq is a powerful tool in the study of the cellular composition of different tissues and organisms. A key step in the analysis pipeline is the annotation of cell-types based on the expression of specific marker genes. Since manual annotation is labor-intensive and does not scale to large datasets, several methods for automated cell-type annotation have been proposed based on supervised learning. However, these methods generally require feature extraction and batch alignment prior to classification, and their performance may become unreliable in the presence of cell-types with very similar transcriptomic profiles, such as differentiating cells. We propose JIND, a framework for automated cell-type identification based on neural networks that directly learns a low-dimensional representation (latent code) in which cell-types can be reliably determined. To account for batch effects, JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, hence avoiding the need of retraining the model whenever a new dataset becomes available. JIND also learns cell-type-specific confidence thresholds to identify and reject cells that cannot be reliably classified. We show on datasets with and without batch effects that JIND classifies cells more accurately than previously proposed methods while rejecting only a small proportion of cells. Moreover, JIND batch alignment is parallelizable, being more than five or six times faster than Seurat integration. Availability: https://github.com/mohit1997/JIND.


Author(s):  
Samuel Melton ◽  
Sharad Ramanathan

Abstract Motivation Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions. Results We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace. Availability and implementation https://github.com/smelton/SMD. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Massimo Andreatta ◽  
Santiago J Carmona

Abstract Summary STACAS is a computational method for the identification of integration anchors in the Seurat environment, optimized for the integration of single-cell (sc) RNA-seq datasets that share only a subset of cell types. We demonstrate that by (i) correcting batch effects while preserving relevant biological variability across datasets, (ii) filtering aberrant integration anchors with a quantitative distance measure and (iii) constructing optimal guide trees for integration, STACAS can accurately align scRNA-seq datasets composed of only partially overlapping cell populations. Availability and implementation Source code and R package available at https://github.com/carmonalab/STACAS; Docker image available at https://hub.docker.com/repository/docker/mandrea1/stacas_demo.


2021 ◽  
Vol 12 ◽  
Author(s):  
Bin Zou ◽  
Tongda Zhang ◽  
Ruilong Zhou ◽  
Xiaosen Jiang ◽  
Huanming Yang ◽  
...  

It is well recognized that batch effect in single-cell RNA sequencing (scRNA-seq) data remains a big challenge when integrating different datasets. Here, we proposed deepMNN, a novel deep learning-based method to correct batch effect in scRNA-seq data. We first searched mutual nearest neighbor (MNN) pairs across different batches in a principal component analysis (PCA) subspace. Subsequently, a batch correction network was constructed by stacking two residual blocks and further applied for the removal of batch effects. The loss function of deepMNN was defined as the sum of a batch loss and a weighted regularization loss. The batch loss was used to compute the distance between cells in MNN pairs in the PCA subspace, while the regularization loss was to make the output of the network similar to the input. The experiment results showed that deepMNN can successfully remove batch effects across datasets with identical cell types, datasets with non-identical cell types, datasets with multiple batches, and large-scale datasets as well. We compared the performance of deepMNN with state-of-the-art batch correction methods, including the widely used methods of Harmony, Scanorama, and Seurat V4 as well as the recently developed deep learning-based methods of MMD-ResNet and scGen. The results demonstrated that deepMNN achieved a better or comparable performance in terms of both qualitative analysis using uniform manifold approximation and projection (UMAP) plots and quantitative metrics such as batch and cell entropies, ARI F1 score, and ASW F1 score under various scenarios. Additionally, deepMNN allowed for integrating scRNA-seq datasets with multiple batches in one step. Furthermore, deepMNN ran much faster than the other methods for large-scale datasets. These characteristics of deepMNN made it have the potential to be a new choice for large-scale single-cell gene expression data analysis.


2020 ◽  
Author(s):  
Wanqiu Chen ◽  
Yongmei Zhao ◽  
Xin Chen ◽  
Xiaojiang Xu ◽  
Zhaowei Yang ◽  
...  

AbstractSingle-cell RNA sequencing (scRNA-seq) has become a very powerful technology for biomedical research and is becoming much more affordable as methods continue to evolve, but it is unknown how reproducible different platforms are using different bioinformatics pipelines, particularly the recently developed scRNA-seq batch correction algorithms. We carried out a comprehensive multi-center cross-platform comparison on different scRNA-seq platforms using standard reference samples. We compared six pre-processing pipelines, seven bioinformatics normalization procedures, and seven batch effect correction methods including CCA, MNN, Scanorama, BBKNN, Harmony, limma and ComBat to evaluate the performance and reproducibility of 20 scRNA-seq data sets derived from four different platforms and centers. We benchmarked scRNA-seq performance across different platforms and testing sites using global gene expression profiles as well as some cell-type specific marker genes. We showed that there were large batch effects; and the reproducibility of scRNA-seq across platforms was dictated both by the expression level of genes selected and the batch correction methods used. We found that CCA, MNN, and BBKNN all corrected the batch variations fairly well for the scRNA-seq data derived from biologically similar samples across platforms/sites. However, for the scRNA-seq data derived from or consisting of biologically distinct samples, limma and ComBat failed to correct batch effects, whereas CCA over-corrected the batch effect and misclassified the cell types and samples. In contrast, MNN, Harmony and BBKNN separated biologically different samples/cell types into correspondingly distinct dimensional subspaces; however, consistent with this algorithm’s logic, MNN required that the samples evaluated each contain a shared portion of highly similar cells. In summary, we found a great cross-platform consistency in separating two distinct samples when an appropriate batch correction method was used. We hope this large cross-platform/site scRNA-seq data set will provide a valuable resource, and that our findings will offer useful advice for the single-cell sequencing community.


2016 ◽  
Author(s):  
Kieran R Campbell ◽  
Christopher Yau

AbstractPseudotime estimation from single-cell gene expression allows the recovery of temporal information from otherwise static profiles of individual cells. This pseudotemporal information can be used to characterise transient events in temporally evolving biological systems. Conventional algorithms typically emphasise an unsupervised transcriptome-wide approach and use retrospective analysis to evaluate the behaviour of individual genes. Here we introduce an orthogonal approach termed “Ouija” that learns pseudotimes from a small set of marker genes that might ordinarily be used to retrospectively confirm the accuracy of unsupervised pseudotime algorithms. Crucially, we model these genes in terms of switch-like or transient behaviour along the trajectory, allowing us to understand why the pseudotimes have been inferred and learn informative parameters about the behaviour of each gene. Since each gene is associated with a switch or peak time the genes are effectively ordered along with the cells, allowing each part of the trajectory to be understood in terms of the behaviour of certain genes. In the following we introduce our model and demonstrate that in many instances a small panel of marker genes can recover pseudotimes that are consistent with those obtained using the entire transcriptome. Furthermore, we show that our method can detect differences in the regulation timings between two genes and identify “metastable” states - discrete cell types along the continuous trajectories - that recapitulate known cell types. Ouija therefore provides a powerful complimentary approach to existing whole transcriptome based pseudotime estimation methods. An open source implementation is available at http://www.github.com/kieranrcampbell/ouija as an R package and at http://www.github.com/kieranrcampbell/ouijaflow as a Python/TensorFlow package.


Author(s):  
Massimo Andreatta ◽  
Santiago J. Carmona

AbstractComputational tools for the integration of single-cell transcriptomics data are designed to correct batch effects between technical replicates or different technologies applied to the same population of cells. However, they have inherent limitations when applied to heterogeneous sets of data with moderate overlap in cell states or sub-types. STACAS is a package for the identification of integration anchors in the Seurat environment, optimized for the integration of datasets that share only a subset of cell types. We demonstrate that by i) correcting batch effects while preserving relevant biological variability across datasets, ii) filtering aberrant integration anchors with a quantitative distance measure, and iii) constructing optimal guide trees for integration, STACAS can accurately align scRNA-seq datasets composed of only partially overlapping cell populations. We anticipate that the algorithm will be a useful tool for the construction of comprehensive single-cell atlases by integration of the growing amount of single-cell data becoming available in public repositories.Code availabilityR package:https://github.com/carmonalab/STACASDocker image:https://hub.docker.com/repository/docker/mandrea1/stacas_demo


2022 ◽  
Author(s):  
Stephen Coleman ◽  
Xaquin Castro Dopico ◽  
Gunilla B Karlsson Hedestam ◽  
Paul DW Kirk ◽  
Chris Wallace

Systematic differences between batches of samples present significant challenges when analysing biological data. Such batch effects are well-studied and are liable to occur in any setting where multiple batches are assayed. Many existing methods for accounting for these have focused on high-dimensional data such as RNA-seq and have assumptions that reflect this. Here we focus on batch-correction in low-dimensional classification problems. We propose a semi-supervised Bayesian generative classifier based on mixture models that jointly predicts class labels and models batch effects. Our model allows observations to be probabilistically assigned to classes in a way that incorporates uncertainty arising from batch effects. We explore two choices for the within-class densities: the multivariate normal and the multivariate t. A simulation study demonstrates that our method performs well compared to popular off-the-shelf machine learning methods and is also quick; performing 15,000 iterations on a dataset of 500 samples with 2 measurements each in 7.3 seconds for the MVN mixture model and 11.9 seconds for the MVT mixture model. We apply our model to two datasets generated using the enzyme-linked immunosorbent assay (ELISA), a spectrophotometric assay often used to screen for antibodies. The examples we consider were collected in 2020 and measure seropositivity for SARS-CoV-2. We use our model to estimate seroprevalence in the populations studied. We implement the models in C++ using a Metropolis-within-Gibbs algorithm; this is available in the R package at https://github.com/stcolema/BatchMixtureModel. Scripts to recreate our analysis are at https://github.com/stcolema/BatchClassifierPaper.


2020 ◽  
Author(s):  
Yang Chen ◽  
Tadepally Lakshmikanth ◽  
Jaromir Mikes ◽  
Petter Brodin

AbstractSingle-cell methods such as flow cytometry, Mass cytometry and single-cell mRNA sequencing collect high-dimensional data on thousands to millions of individual cells. An important aim during the analysis of such data is to classify cells into known categories and cell types. One commonly used approach towards this is clustering of cells with similar features followed by manual annotation of clusters in relation to known biology. A second approach, commonly used for cytometry data relies on manual sorting or “gating” of cells, often based on pairwise combinations of measurements used in a stepwise and very tedious process of cell annotation. Both of these approaches require manual inspection and annotation of every new dataset generated, a process that is not only time consuming but also subjective and surely influential for the conclusions drawn. The manual annotation is also difficult to reproduce by other researchers with a different perception of features that signify their cells of interest. Here we propose an alternative strategy based on machine learning of known phenotypes from manually curated, high-dimensional data and thereby enabling rapid classification of subsequent datasets in a more reproducible manner. This simple approach increases both throughput, reproducibility and simplicity of cell classification in single-cell biology.


2019 ◽  
Author(s):  
Tongxin Wang ◽  
Travis S Johnson ◽  
Wei Shao ◽  
Zixiao Lu ◽  
Bryan R Helm ◽  
...  

AbstractTo fully utilize the power of single-cell RNA sequencing (scRNA-seq) technologies for cell lineation and identifyingbona fidetranscriptional signals, it is necessary to combine data from multiple experiments. We presentBERMUDA(Batch-Effect ReMoval Using Deep Autoencoders) — a novel transfer-learning-based method for batch-effect correction in scRNA-seq data.BERMUDAeffectively combines different batches of scRNA-seq data with vastly different cell population compositions and amplifies biological signals by transferring information among batches. We demonstrate thatBERMUDAoutperforms existing methods for removing batch effects and distinguishing cell types in multiple simulated and real scRNA-seq datasets.


Sign in / Sign up

Export Citation Format

Share Document