A semi-supervised Bayesian mixture modelling approach for joint batch correction and classification

2022
Author(s): Stephen Coleman, Xaquin Castro Dopico, Gunilla B Karlsson Hedestam, Paul DW Kirk, Chris Wallace

Systematic differences between batches of samples present significant challenges when analysing biological data. Such batch effects are well studied and are liable to occur in any setting where multiple batches are assayed. Many existing methods for accounting for batch effects focus on high-dimensional data such as RNA-seq and make assumptions that reflect this. Here we focus on batch correction in low-dimensional classification problems. We propose a semi-supervised Bayesian generative classifier based on mixture models that jointly predicts class labels and models batch effects. Our model allows observations to be probabilistically assigned to classes in a way that incorporates uncertainty arising from batch effects. We explore two choices for the within-class densities: the multivariate normal and the multivariate t. A simulation study demonstrates that our method performs well compared to popular off-the-shelf machine learning methods and is also quick, performing 15,000 iterations on a dataset of 500 samples with 2 measurements each in 7.3 seconds for the MVN mixture model and 11.9 seconds for the MVT mixture model. We apply our model to two datasets generated using the enzyme-linked immunosorbent assay (ELISA), a spectrophotometric assay often used to screen for antibodies. The examples we consider were collected in 2020 and measure seropositivity for SARS-CoV-2. We use our model to estimate seroprevalence in the populations studied. We implement the models in C++ using a Metropolis-within-Gibbs algorithm; this is available in the R package at https://github.com/stcolema/BatchMixtureModel. Scripts to recreate our analysis are at https://github.com/stcolema/BatchClassifierPaper.
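A minimal base-R sketch of the data-generating structure the abstract describes, where each observation's mean combines a class effect and a batch shift; the model form and all parameter values here are illustrative assumptions, not the BatchMixtureModel API.

```r
# Illustrative simulation only (assumed model form, not the package's code):
# each observation's mean is a class effect plus a batch shift.
set.seed(1)
n_per_cell <- 50
classes <- rep(1:2, each = 2 * n_per_cell)               # two classes
batches <- rep(rep(1:2, each = n_per_cell), times = 2)   # two batches
class_mean  <- c(0, 3)[classes]
batch_shift <- c(-0.5, 0.5)[batches]
x <- rnorm(length(classes), mean = class_mean + batch_shift, sd = 1)

# A classifier that ignores batch sees shifted class densities; the boxplot
# makes the confounding visible.
boxplot(x ~ interaction(classes, batches),
        xlab = "class.batch", ylab = "measurement")
```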

2020
Vol 36 (11), pp. 3522-3527
Author(s): Emanuele Aliverti, Jeffrey L Tilson, Dayne L Filer, Benjamin Babcock, Alejandro Colaneri, ...

Abstract
Motivation: Low-dimensional representations of high-dimensional data are routinely employed in biomedical research to visualize, interpret and communicate results from different pipelines. In this article, we propose a novel procedure to directly estimate t-SNE embeddings that are not driven by batch effects. Without correction, interesting structure in the data can be obscured by batch effects, so the proposed algorithm can significantly aid visualization of high-dimensional data.
Results: The proposed methods are based on linear algebra and constrained optimization, leading to efficient algorithms and fast computation in many high-dimensional settings. Results on artificial single-cell transcription profiling data show that the proposed procedure successfully removes multiple batch effects from t-SNE embeddings while retaining fundamental information on cell types. When applied to single-cell gene expression data from a mouse medulloblastoma study, the proposed method successfully removes batch effects associated with mouse identifiers and the date of the experiment, while preserving clusters of oligodendrocytes, astrocytes, endothelial cells and microglia, which are expected to lie in the stroma within or adjacent to the tumours.
Availability and implementation: Source code implementing the proposed approach is available as an R package at https://github.com/emanuelealiverti/BC_tSNE, including a tutorial to reproduce the simulation studies.
Contact: [email protected]
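To make the motivating problem concrete, here is a hedged sketch using the CRAN package Rtsne (not the authors' BC_tSNE code): an additive batch effect in simulated data dominates a standard, uncorrected t-SNE embedding.

```r
# Standard (uncorrected) t-SNE via the CRAN package Rtsne; the batch effect,
# not biology, drives the separation in the embedding.
library(Rtsne)

set.seed(2)
n <- 200
batch <- rep(1:2, each = n / 2)
X <- matrix(rnorm(n * 50), n, 50)
X[batch == 2, 1:10] <- X[batch == 2, 1:10] + 2   # additive batch effect

emb <- Rtsne(X, dims = 2, perplexity = 30)$Y
plot(emb, col = batch, pch = 19,
     main = "Uncorrected t-SNE separates points by batch")
```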


2019
Author(s): Miao Yu, Anna Roszkowska, Janusz Pawliszyn

Abstract Batch effects influence the interpretation of metabolomics data, so they should be corrected and normalized prior to statistical analysis to avoid misleading results. Metabolomics studies are usually performed without targeted compounds (e.g., internal standards), which makes validating batch correction methods a challenging task. In addition, the statistical properties of metabolomics data differ substantially from those of genomics data, from which most currently used batch correction methods originate. In this study, we first analyzed previously published metabolomics datasets to summarize and discuss their statistical properties. Then, based on the available datasets, we developed in silico simulations of metabolomics peak intensity data grounded in these statistical properties, and used them to analyze the influence of batch effects on metabolomics data under currently available batch correction strategies. Overall, 252,000 batch corrections on 14,000 different in silico simulated datasets, with their related differential analyses, were performed to evaluate and validate various batch correction methods. The results indicate that log transformations strongly influence the performance of all investigated batch correction methods: false positive rates increased after applying the correction methods, with almost no improvement in true positive rates. Hence, in metabolomics studies we recommend preliminary experiments that simulate batch effects from real data in order to select an adequate batch correction method for the observed distribution of peak intensities. The presented study is reproducible, and the related R package, mzrtsim, is available online (https://github.com/yufree/mzrtsim).
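A toy version of the simulate-correct-test workflow the study uses, in base R; it does not reproduce mzrtsim or the paper's findings, just the mechanics of counting false positives after a naive per-batch correction under a null of no true group differences.

```r
# Toy illustration (not mzrtsim): log-normal peak intensities with a
# multiplicative batch effect and no true group differences.
set.seed(3)
n_peaks <- 1000; n_samp <- 40
batch <- rep(1:2, each = n_samp / 2)
group <- rep(rep(1:2, each = n_samp / 4), times = 2)
batch_fc <- matrix(rep(c(1, 1.5)[batch], each = n_peaks), n_peaks, n_samp)
raw <- matrix(rlnorm(n_peaks * n_samp, meanlog = 5, sdlog = 1),
              n_peaks, n_samp) * batch_fc

# Log-transform, then a naive per-batch mean-centering "correction".
logged <- log(raw)
corrected <- logged - t(apply(logged, 1, function(z) ave(z, batch)))

# Differential analysis per peak; under the null, ~5% should pass p < 0.05.
pvals <- apply(corrected, 1, function(z) t.test(z ~ group)$p.value)
mean(pvals < 0.05)
```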


2017
Author(s): Maren Büttner, Zhichao Miao, F Alexander Wolf, Sarah A Teichmann, Fabian J Theis

Abstract Single-cell transcriptomics is a versatile tool for exploring heterogeneous cell populations. As with all genomics experiments, batch effects can hamper data integration and interpretation. The success of batch effect correction is often evaluated by visual inspection of dimension-reduced representations such as principal component analysis. This is inherently imprecise due to the high number of genes and the non-normal distribution of gene expression. Here, we present a k-nearest-neighbour batch effect test (kBET, https://github.com/theislab/kBET) to quantitatively measure batch effects. kBET is easier to interpret, more sensitive and more robust than visual evaluation and other measures of batch effects. We use kBET to assess commonly used batch regression and normalisation approaches, and quantify the extent to which they remove batch effects while preserving biological variability. Our results illustrate that batch correction based on log-transformation or scran pooling followed by ComBat reduced the batch effect while preserving structure across data sets. Finally, we show that kBET can pinpoint successful data integration methods across multiple data sets, in this case from different publications all charting mouse embryonic development. This has important implications for future data integration efforts, which will be central to projects such as the Human Cell Atlas, where data for the same tissue may be generated in multiple locations around the world. [Before final publication, we will upload the R package to Bioconductor.]
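Basic kBET usage on simulated data, following the package README; the exact output fields may differ between versions, so treat the `summary` access below as an assumption.

```r
# kBET on simulated data: cells in rows, features in columns, plus batch labels.
library(kBET)

set.seed(4)
data  <- matrix(rnorm(500 * 20), nrow = 500, ncol = 20)
batch <- rep(1:2, each = 250)
data[batch == 2, ] <- data[batch == 2, ] + 0.5   # mild batch shift

est <- kBET(data, batch, plot = FALSE)
# Observed rejection rates well above the expected rate indicate a detectable
# batch effect (field name assumed from the package README).
est$summary
```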


2021
Vol 0 (0)
Author(s): Yixin Kong, Ariangela Kozik, Cindy H. Nakatsu, Yava L. Jones-Hall, Hyonho Chun

Abstract Latent factor models for count data are widely applied to deconvolute mixed signals in biological data, as exemplified by sequencing data in transcriptome or microbiome studies. When pure samples, such as single-cell transcriptome data, are available, the accuracy of the estimates can be much improved; however, this advantage quickly disappears in the presence of excessive zeros. To correctly account for this phenomenon in both mixed and pure samples, we propose a zero-inflated non-negative matrix factorization and derive an effective multiplicative parameter updating rule. In simulation studies, our method yielded the smallest bias. We applied our approach to brain gene expression and fecal microbiome datasets, illustrating its superior performance. Our method is implemented as a publicly available R package, iNMF.
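For readers unfamiliar with multiplicative updates, here is the classic Lee-Seung rule for standard (not zero-inflated) NMF under a Frobenius loss, as a base-R sketch; the paper derives a modified rule for the zero-inflated model, which this sketch does not reproduce.

```r
# Lee-Seung multiplicative updates for V ~ W %*% H with non-negative factors.
nmf_multiplicative <- function(V, k, n_iter = 200, eps = 1e-9) {
  n <- nrow(V); m <- ncol(V)
  W <- matrix(runif(n * k), n, k)
  H <- matrix(runif(k * m), k, m)
  for (i in seq_len(n_iter)) {
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)   # update H
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)   # update W
  }
  list(W = W, H = H)
}

set.seed(5)
V <- matrix(rpois(100 * 30, lambda = 3), 100, 30)   # count data
fit <- nmf_multiplicative(V, k = 4)
```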


2021
Vol 2021 (1)
Author(s): Oz Amram, Cristina Mantilla Suarez

Abstract There has been substantial progress in applying machine learning techniques to classification problems in collider and jet physics. But as these techniques grow in sophistication, they are becoming more sensitive to subtle features of jets that may not be well modeled in simulation. Relying on simulations for training will therefore lead to sub-optimal performance on data, but the lack of true class labels makes it difficult to train on real data. To address this challenge we introduce a new approach, called Tag N’ Train (TNT), that can be applied to unlabeled data containing two distinct sub-objects. The technique uses a weak classifier for one of the objects to tag signal-rich and background-rich samples. These samples are then used to train a stronger classifier for the other object. We demonstrate the power of this method by applying it to a dijet resonance search. Starting with autoencoders trained directly on data as the weak classifiers, we use TNT to train substantially improved classifiers. We show that Tag N’ Train can be a powerful tool in model-agnostic searches and discuss other potential applications.
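A schematic of the TNT idea in R, with logistic regression standing in for the neural networks of the original work; the features and cut value are invented for illustration.

```r
# Toy Tag N' Train: a weak score on object A tags signal-rich vs
# background-rich samples, which become pseudo-labels for training a
# classifier on object B.
set.seed(6)
n <- 2000
is_signal <- rbinom(n, 1, 0.2)              # hidden truth, never used to train
feat_a <- rnorm(n, mean = 1.0 * is_signal)  # object A feature
feat_b <- rnorm(n, mean = 1.5 * is_signal)  # object B feature

weak_score <- feat_a                        # stand-in weak classifier
pseudo_label <- as.integer(weak_score > quantile(weak_score, 0.8))

strong <- glm(pseudo_label ~ feat_b, family = binomial)

# The strong classifier on object B correlates with the true labels despite
# never seeing them.
cor(predict(strong, type = "response"), is_signal)
```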


2020
Author(s): Mohit Goyal, Guillermo Serrano, Ilan Shomorony, Mikel Hernaez, Idoia Ochoa

Abstract Single-cell RNA-seq is a powerful tool for studying the cellular composition of different tissues and organisms. A key step in the analysis pipeline is the annotation of cell types based on the expression of specific marker genes. Since manual annotation is labor-intensive and does not scale to large datasets, several methods for automated cell-type annotation have been proposed based on supervised learning. However, these methods generally require feature extraction and batch alignment prior to classification, and their performance may become unreliable in the presence of cell types with very similar transcriptomic profiles, such as differentiating cells. We propose JIND, a framework for automated cell-type identification based on neural networks that directly learns a low-dimensional representation (latent code) in which cell types can be reliably determined. To account for batch effects, JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, avoiding the need to retrain the model whenever a new dataset becomes available. JIND also learns cell-type-specific confidence thresholds to identify and reject cells that cannot be reliably classified. We show on datasets with and without batch effects that JIND classifies cells more accurately than previously proposed methods while rejecting only a small proportion of cells. Moreover, JIND's batch alignment is parallelizable, running five to six times faster than Seurat integration. Availability: https://github.com/mohit1997/JIND.
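The rejection step is easy to state generically: keep a prediction only if its class probability clears the threshold learned for that cell type. A base-R sketch of that logic follows; the thresholds and names are illustrative, not JIND's code.

```r
# Reject cells whose top class probability is below that class's threshold.
reject_low_confidence <- function(probs, thresholds) {
  idx  <- max.col(probs, ties.method = "first")      # top class per cell
  pred <- colnames(probs)[idx]
  conf <- probs[cbind(seq_len(nrow(probs)), idx)]    # its probability
  ifelse(conf >= thresholds[pred], pred, "Unassigned")
}

probs <- matrix(c(0.90, 0.05, 0.05,
                  0.40, 0.35, 0.25), nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("Tcell", "Bcell", "NK")))
reject_low_confidence(probs, c(Tcell = 0.7, Bcell = 0.7, NK = 0.7))
# -> "Tcell" "Unassigned"
```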


F1000Research
2020
Vol 9, pp. 709
Author(s): Liis Kolberg, Uku Raudvere, Ivan Kuzmin, Jaak Vilo, Hedi Peterson

g:Profiler (https://biit.cs.ut.ee/gprofiler) is a widely used gene list functional profiling and namespace conversion toolset that has supported reproducible biological data analysis since 2007. Here we introduce the accompanying R package, gprofiler2, developed to facilitate programmatic access to g:Profiler computations and databases via its REST API. The gprofiler2 package provides easy-to-use functions that enable researchers to incorporate functional enrichment analysis into automated analysis pipelines written in R. The package also implements interactive visualisation methods that help to interpret the enrichment results and to illustrate them for publications. In addition, gprofiler2 gives access to the versatile gene/protein identifier conversion functionality in g:Profiler, enabling mapping between hundreds of different identifier types and across orthologous species. The gprofiler2 package is freely available from the CRAN repository.
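A minimal gprofiler2 session, based on the package vignette: gost() runs the enrichment query, gostplot() draws the Manhattan-style overview, and gconvert() maps identifiers (the gene list here is arbitrary).

```r
library(gprofiler2)

# Functional enrichment for a small, arbitrary gene list.
gostres <- gost(query = c("TP53", "BRCA1", "CDK2", "EGFR"),
                organism = "hsapiens")
head(gostres$result[, c("source", "term_name", "p_value")])

# Static Manhattan-style overview of the enriched terms.
gostplot(gostres, capped = TRUE, interactive = FALSE)

# Identifier conversion, e.g. gene symbols to Ensembl gene IDs.
gconvert(query = c("TP53", "BRCA1"), organism = "hsapiens", target = "ENSG")
```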


2020
Author(s): Ruben Chazarra-Gil, Stijn van Dongen, Vladimir Yu Kiselev, Martin Hemberg

Abstract As the cost of single-cell RNA-seq experiments has decreased, an increasing number of datasets have become available. Combining newly generated and publicly accessible datasets is challenging due to non-biological signals, commonly known as batch effects. Although several computational methods are available that can remove batch effects, evaluating which method performs best is not straightforward. Here we present BatchBench (https://github.com/cellgeni/batchbench), a modular and flexible pipeline for comparing batch correction methods for single-cell RNA-seq data. We apply BatchBench to eight methods, highlighting their methodological differences and assessing their performance and computational requirements through a compendium of well-studied datasets. This systematic comparison guides users in the choice of batch correction tool, and the pipeline makes it easy to evaluate other datasets.
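One common ingredient of such comparisons is a batch-mixing score. Below is a hedged sketch (not the pipeline's code) of a per-cell Shannon entropy of batch labels among k nearest neighbours, using the FNN package; higher mean entropy indicates better mixing.

```r
# Per-cell entropy of batch labels in each cell's kNN neighbourhood.
library(FNN)

batch_entropy <- function(X, batch, k = 30) {
  nn <- get.knn(X, k = k)$nn.index
  apply(nn, 1, function(idx) {
    p <- table(batch[idx]) / k
    -sum(p * log(p))
  })
}

set.seed(9)
X <- matrix(rnorm(400 * 10), 400, 10)   # well-mixed synthetic data
batch <- rep(1:2, each = 200)
mean(batch_entropy(X, batch))           # near log(2) ~ 0.69 when well mixed
```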


2019
Author(s): William de Cothi, Caswell Barry

Abstract The hippocampus has long been observed to encode a representation of an animal’s position in space. Recent evidence suggests that the nature of this representation is somewhat predictive and can be modelled by learning a successor representation (SR) between distinct positions in an environment. However, this discretisation of space is subjective, making it difficult to formulate predictions about how some environmental manipulations should impact the hippocampal representation. Here we present a model of place and grid cell firing as a consequence of learning an SR from a basis set of known neurobiological features: boundary vector cells (BVCs). The model describes place cell firing as the successor features of the SR, with grid cells forming a low-dimensional representation of these successor features. We show that the place and grid cells generated using the BVC-SR model provide a good account of biological data for a variety of environmental manipulations, including dimensional stretches, barrier insertions, and the influence of environmental geometry on the hippocampal representation of space.
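The SR itself has a compact closed form that such models build on: for a transition matrix T and discount gamma, M = (I - gamma * T)^(-1), whose entries are expected discounted future state occupancies. A base-R illustration on a 4-state ring world follows; applying M to a basis of BVC firing rates would give the successor features the abstract refers to.

```r
# Successor representation of a 4-state ring world in closed form.
n <- 4
T_mat <- matrix(0, n, n)
for (s in 1:n) {
  right <- s %% n + 1
  left  <- (s - 2) %% n + 1
  T_mat[s, c(right, left)] <- 0.5   # random walk: step left or right
}
gamma <- 0.9
M <- solve(diag(n) - gamma * T_mat)  # M = I + gamma*T + gamma^2*T^2 + ...
round(M, 2)

# Given a basis Phi (states x BVCs), successor features would be M %*% Phi.
```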


Biostatistics
2021
Author(s): Theresa A Alexander, Rafael A Irizarry, Héctor Corrada Bravo

Summary High-dimensional biological data collection across heterogeneous groups of samples has become increasingly common, creating high demand for dimensionality reduction techniques that capture the underlying structure of the data. Discovering low-dimensional embeddings that describe the separation of any underlying discrete latent structure in data is an important motivation for applying these techniques, since these latent classes can represent important sources of unwanted variability, such as batch effects, or interesting sources of signal, such as unknown cell types. The features that define this discrete latent structure are often hard to identify in high-dimensional data. Principal component analysis (PCA) is one of the most widely used methods as an unsupervised step for dimensionality reduction; it finds linear transformations of the data that explain total variance. When the goal is detecting discrete structure, PCA is applied under the assumption that classes will be separated in directions of maximum variance; however, PCA will fail to accurately find discrete latent structure if this assumption does not hold. Visualization techniques such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) attempt to mitigate these problems by creating a low-dimensional space in which similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. However, since t-SNE and UMAP are computationally expensive, a PCA reduction is often performed before applying them, making the result sensitive to PCA's shortcomings. Moreover, t-SNE is limited to two or three dimensions as a visualization tool, which may not be adequate for retaining discriminatory information. For interpretable feature weights, the linear transformations of PCA are also preferable to the non-linear transformations provided by methods like t-SNE and UMAP. Here, we propose iterative discriminant analysis (iDA), a dimensionality reduction technique designed to mitigate these limitations. iDA produces an embedding that carries discriminatory information which optimally separates latent clusters, using linear transformations that permit post hoc analysis to determine the features that define these latent structures.
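A rough sketch of the cluster-then-discriminate alternation that the name "iterative discriminant analysis" suggests, using k-means and MASS::lda; the authors' algorithm may differ in its details, and label switching between k-means runs is ignored here.

```r
# Alternate k-means clustering with LDA projection until labels stop changing.
library(MASS)

ida_sketch <- function(X, k, n_iter = 10) {
  labels <- kmeans(X, centers = k)$cluster
  for (i in seq_len(n_iter)) {
    fit <- lda(X, grouping = labels)
    Z <- predict(fit, X)$x                      # discriminant coordinates
    new_labels <- kmeans(Z, centers = k)$cluster
    if (all(new_labels == labels)) break        # converged (up to relabeling)
    labels <- new_labels
  }
  list(embedding = Z, clusters = labels, scalings = fit$scaling)
}

set.seed(11)
X <- rbind(matrix(rnorm(150, mean = 0), 50, 3),
           matrix(rnorm(150, mean = 2), 50, 3))
res <- ida_sketch(X, k = 2)
```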

