Review of Batch Effects Prevention, Diagnostics, and Correction Approaches

Correcting for experiment-specific variability in expression compendia can remove underlying signals

GigaScience ◽

10.1093/gigascience/giaa117 ◽

2020 ◽

Vol 9 (11) ◽

Author(s):

Alexandra J Lee ◽

YoSon Park ◽

Georgia Doing ◽

Deborah A Hogan ◽

Casey S Greene

Keyword(s):

Gene Expression ◽

Large Scale ◽

Original Signal ◽

Batch Effects ◽

Technical Variability ◽

The Past ◽

Statistical Correction ◽

Before And After ◽

Data Collections ◽

Biological Patterns

Abstract Motivation In the past two decades, scientists in different laboratories have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, or "batch effects," may result from combining samples collected and processed at different times and in different settings. Such variability may distort our ability to extract true underlying biological patterns. As more integrative analysis methods arise and data collections get bigger, we must determine how technical variability affects our ability to detect desired patterns when many experiments are combined. Objective We sought to determine the extent to which an underlying signal was masked by technical variability by simulating compendia comprising data aggregated across multiple experiments. Method We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability. Results The signal from a baseline compendium was obscured when the number of added sources of variability was small. Applying statistical correction methods rescued the underlying signal in these cases. However, as the number of sources of variability increased, it became easier to detect the original signal even without correction. In fact, statistical correction reduced our power to detect the underlying signal. Conclusion When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces our ability to extract underlying patterns.

Download Full-text

Abstract 893: Batch effects in tumor biomarker studies using tissue microarrays: Extent, impact, and remediation

10.1158/1538-7445.am2021-893 ◽

2021 ◽

Author(s):

Konrad H. Stopsack ◽

Molin Wang ◽

Svitlana Tyekucheva ◽

Travis A. Gerke ◽

J. Bailey Vaselkiv ◽

...

Keyword(s):

Tissue Microarrays ◽

Tumor Biomarker ◽

Batch Effects

Download Full-text

Removal of batch effects using stratified subsampling of metabolomic data for in vitro endocrine disruptors screening

Talanta ◽

10.1016/j.talanta.2018.11.019 ◽

2019 ◽

Vol 195 ◽

pp. 77-86 ◽

Cited By ~ 7

Author(s):

Julien Boccard ◽

David Tonoli ◽

Petra Strajhar ◽

Fabienne Jeanneret ◽

Alex Odermatt ◽

...

Keyword(s):

Endocrine Disruptors ◽

Batch Effects ◽

Metabolomic Data

Download Full-text

Evidence of batch effects masking treatment effect in GAW20 methylation data

BMC Proceedings ◽

10.1186/s12919-018-0129-6 ◽

2018 ◽

Vol 12 (S9) ◽

Cited By ~ 2

Author(s):

Angelo J. Canty ◽

Andrew D. Paterson

Keyword(s):

Treatment Effect ◽

Methylation Data ◽

Batch Effects

Download Full-text

Tolerance Bounds and Cpk Confidence Bounds Under Batch Effects

Advances in Stochastic Models for Reliability, Quality and Safety ◽

10.1007/978-1-4612-2234-7_24 ◽

1998 ◽

pp. 361-379 ◽

Cited By ~ 3

Author(s):

Fritz Scholz ◽

Mark Vangel

Keyword(s):

Batch Effects ◽

Confidence Bounds

Download Full-text

Batch effects correction for microbiome data with Dirichlet-multinomial regression

Bioinformatics ◽

10.1093/bioinformatics/bty874 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2348-2348 ◽

Cited By ~ 2

Author(s):

Zhenwei Dai ◽

Sunny H Wong ◽

Jun Yu ◽

Yingying Wei

Keyword(s):

Batch Effects ◽

Multinomial Regression ◽

Microbiome Data

Download Full-text

Comparative analysis of antibody- and lipid-based multiplexing methods for single-cell RNA-seq

10.1101/2020.11.16.384222 ◽

2020 ◽

Author(s):

Viacheslav Mylka ◽

Jeroen Aerts ◽

Irina Matetovici ◽

Suresh Poovathingal ◽

Niels Vandamme ◽

...

Keyword(s):

Genetic Variation ◽

Comparative Analysis ◽

Single Cell ◽

Cell Lines ◽

Clinical Studies ◽

Clinical Samples ◽

Rna Seq ◽

Batch Effects ◽

Single Cell Sequencing ◽

Single Nucleus

ABSTRACTMultiplexing of samples in single-cell RNA-seq studies allows significant reduction of experimental costs, straightforward identification of doublets, increased cell throughput, and reduction of sample-specific batch effects. Recently published multiplexing techniques using oligo-conjugated antibodies or - lipids allow barcoding sample-specific cells, a process called ‘hashing’. Here, we compare the hashing performance of TotalSeq-A and -C antibodies, custom synthesized lipids and MULTI-seq lipid hashes in four cell lines, both for single-cell RNA-seq and single-nucleus RNA-seq. Hashing efficiency was evaluated using the intrinsic genetic variation of the cell lines. Benchmarking of different hashing strategies and computational pipelines indicates that correct demultiplexing can be achieved with both lipid- and antibody-hashed human cells and nuclei, with MULTISeqDemux as the preferred demultiplexing function and antibody-based hashing as the most efficient protocol on cells. Antibody hashing was further evaluated on clinical samples using PBMCs from healthy and SARS-CoV-2 infected patients, where we demonstrate a more affordable approach for large single-cell sequencing clinical studies, while simultaneously reducing batch effects.

Download Full-text

JIND: Joint Integration and Discrimination for Automated Single-Cell Annotation

10.1101/2020.10.06.327601 ◽

2020 ◽

Author(s):

Mohit Goyal ◽

Guillermo Serrano ◽

Ilan Shomorony ◽

Mikel Hernaez ◽

Idoia Ochoa

Keyword(s):

Single Cell ◽

Cell Types ◽

Marker Genes ◽

Specific Marker ◽

Rna Seq ◽

Batch Effects ◽

Cell Type ◽

Latent Space ◽

Cell Type Specific ◽

Low Dimensional

AbstractSingle-cell RNA-seq is a powerful tool in the study of the cellular composition of different tissues and organisms. A key step in the analysis pipeline is the annotation of cell-types based on the expression of specific marker genes. Since manual annotation is labor-intensive and does not scale to large datasets, several methods for automated cell-type annotation have been proposed based on supervised learning. However, these methods generally require feature extraction and batch alignment prior to classification, and their performance may become unreliable in the presence of cell-types with very similar transcriptomic profiles, such as differentiating cells. We propose JIND, a framework for automated cell-type identification based on neural networks that directly learns a low-dimensional representation (latent code) in which cell-types can be reliably determined. To account for batch effects, JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, hence avoiding the need of retraining the model whenever a new dataset becomes available. JIND also learns cell-type-specific confidence thresholds to identify and reject cells that cannot be reliably classified. We show on datasets with and without batch effects that JIND classifies cells more accurately than previously proposed methods while rejecting only a small proportion of cells. Moreover, JIND batch alignment is parallelizable, being more than five or six times faster than Seurat integration. Availability: https://github.com/mohit1997/JIND.

Download Full-text

Batch Effect Removal via Batch-Free Encoding

10.1101/380816 ◽

2018 ◽

Cited By ~ 1

Author(s):

Uri Shaham

Keyword(s):

Deep Learning ◽

Biological Properties ◽

Batch Effect ◽

Training Data ◽

Rna Seq ◽

Batch Effects ◽

Training Time ◽

Learning Techniques ◽

Downstream Analysis ◽

Biological Patterns

AbstractBiological measurements often contain systematic errors, also known as “batch effects”, which may invalidate downstream analysis when not handled correctly. The problem of removing batch effects is of major importance in the biological community. Despite recent advances in this direction via deep learning techniques, most current methods may not fully preserve the true biological patterns the data contains. In this work we propose a deep learning approach for batch effect removal. The crux of our approach is learning a batch-free encoding of the data, representing its intrinsic biological properties, but not batch effects. In addition, we also encode the systematic factors through a decoding mechanism and require accurate reconstruction of the data. Altogether, this allows us to fully preserve the true biological patterns represented in the data. Experimental results are reported on data obtained from two high throughput technologies, mass cytometry and single-cell RNA-seq. Beyond good performance on training data, we also observe that our system performs well on test data obtained from new patients, which was not available at training time. Our method is easy to handle, a publicly available code can be found at https://github.com/ushaham/BatchEffectRemoval2018.

Download Full-text

Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects

10.1101/269316 ◽

2018 ◽

Cited By ~ 1

Author(s):

Allison A. Regier ◽

Yossi Farjoun ◽

David Larson ◽

Olga Krasheninina ◽

Hyun Min Kang ◽

...

Keyword(s):

Data Processing ◽

Genome Sequencing ◽

Statistical Power ◽

Human Genetics ◽

Variant Calling ◽

Joint Analysis ◽

Sequencing Analysis ◽

Batch Effects ◽

Many Sources ◽

Genomic Regions

AbstractHundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years to interrogate a broad range of traits, across diverse populations. These data are more valuable in aggregate: joint analysis of genomes from many sources increases sample size and statistical power for trait mapping, and will enable studies of genome biology, population genetics and genome function at unprecedented scale. A central challenge for joint analysis is that different WGS data processing and analysis pipelines cause substantial batch effects in combined datasets, necessitating computationally expensive reprocessing and harmonization prior to variant calling. This approach is no longer tenable given the scale of current studies and data volumes. Here, in a collaboration across multiple genome centers and NIH programs, we define WGS data processing standards that allow different groups to produce “functionally equivalent” (FE) results suitable for joint variant calling with minimal batch effects. Our approach promotes broad harmonization of upstream data processing steps, while allowing for diverse variant callers. Importantly, it allows each group to continue innovating on data processing pipelines, as long as results remain compatible. We present initial FE pipelines developed at five genome centers and show that they yield similar variant calling results – including single nucleotide (SNV), insertion/deletion (indel) and structural variation (SV) – and produce significantly less variability than sequencing replicates. Residual inter-pipeline variability is concentrated at low quality sites and repetitive genomic regions prone to stochastic effects. This work alleviates a key technical bottleneck for genome aggregation and helps lay the foundation for broad data sharing and community-wide “big-data” human genetics studies.

Download Full-text