batch effect
Recently Published Documents


TOTAL DOCUMENTS

106
(FIVE YEARS 71)

H-INDEX

13
(FIVE YEARS 6)

2022 ◽  
Vol 12 ◽  
Author(s):  
Ricardo Melo Ferreira ◽  
Benjamin J. Freije ◽  
Michael T. Eadon

The kidney is composed of heterogeneous groups of epithelial, endothelial, immune, and stromal cells, all in close anatomic proximity. Spatial transcriptomic technologies allow the interrogation of in situ expression signatures in health and disease, overlaid upon a histologic image. However, some spatial gene expression platforms have not yet reached single-cell resolution. As such, deconvolution of spatial transcriptomic spots is important to understand the proportion of cell signature arising from these varied cell types in each spot. This article reviews the various deconvolution strategies discussed in the 2021 Indiana O’Brien Center for Microscopy workshop. The unique features of Seurat transfer score methodology, SPOTlight, Robust Cell Type Decomposition, and BayesSpace are reviewed. The application of normalization and batch effect correction across spatial transcriptomic samples is also discussed.


Genes ◽  
2021 ◽  
Vol 13 (1) ◽  
pp. 44
Author(s):  
Iago Maceda ◽  
Oscar Lao

The 1000 Genomes Project (1000G) is one of the most popular whole genome sequencing datasets used in different genomics fields and has boosted our knowledge in medical and population genomics, among other fields. Recent studies have reported the presence of ghost mutation signals in the 1000G. Furthermore, studies have shown that these mutations can influence the outcomes of follow-up studies based on the genetic variation of 1000G, such as single nucleotide variant (SNV) imputation. While the overall effect of these ghost mutations can be considered negligible for common genetic variants in many populations, the potential bias remains unclear when studying low frequency genetic variants in the population. In this study, we analyze the effect of the sequencing center on predicted loss of function (LoF) alleles, the number of singletons, and the patterns of archaic introgression in the 1000G. Our results support previous studies showing that the sequencing center is associated with LoF and singletons independent of the population that is considered. Furthermore, we observed that patterns of archaic introgression were distorted for some populations depending on the sequencing center. When analyzing the frequency of SNPs showing extreme patterns of genotype differentiation among centers for CEU, YRI, CHB, and JPT, we observed that the magnitude of the sequencing batch effect was stronger at MAF < 0.2 and showed different profiles between CHB and the other populations. All these results suggest that data from 1000G must be interpreted with caution when considering statistics using variants at low frequency.


Author(s):  
Qing Xia ◽  
Jeffrey A. Thompson ◽  
Devin C. Koestler

Batch effects present challenges in the analysis of high-throughput molecular data and are particularly problematic in longitudinal studies when interest lies in identifying genes/features whose expression changes over time, but time is confounded with batch. While many methods to correct for batch effects exist, most assume independence across samples, an assumption that is unlikely to hold in longitudinal microarray studies. We propose Batch effect Reduction of mIcroarray data with Dependent samples usinG Empirical Bayes (BRIDGE), a three-step parametric empirical Bayes approach that leverages technical replicate samples profiled at multiple timepoints/batches, so-called “bridge samples”, to inform batch-effect reduction/attenuation in longitudinal microarray studies. Extensive simulation studies and an analysis of a real biological data set were conducted to benchmark the performance of BRIDGE against both ComBat and longitudinal ComBat. Our results demonstrate that while all methods perform well in facilitating accurate estimates of time effects, BRIDGE outperforms both ComBat and longitudinal ComBat in the removal of batch effects in data sets with bridging samples and, perhaps as a result, was observed to have improved statistical power for detecting genes with a time effect. BRIDGE demonstrated competitive performance in batch effect reduction of confounded longitudinal microarray studies, in both simulated and real data sets, and may serve as a useful preprocessing method for researchers conducting longitudinal microarray studies that include bridging samples.
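
The role of bridge samples described above can be illustrated with a deliberately simplified sketch. It is not the BRIDGE method itself (the three-step empirical Bayes procedure is not reproduced here); it only assumes an additive per-gene batch shift and estimates that shift from technical replicates profiled in both batches. All names and the toy values are hypothetical.

```python
from statistics import mean

def bridge_offsets(batch_a, batch_b, bridge_ids):
    """Per-gene batch offset estimated from bridge samples only.

    batch_a, batch_b: dict sample_id -> list of expression values (one per gene).
    bridge_ids: sample ids profiled in both batches (technical replicates).
    """
    n_genes = len(next(iter(batch_a.values())))
    return [
        mean(batch_b[s][g] - batch_a[s][g] for s in bridge_ids)
        for g in range(n_genes)
    ]

def correct_batch(batch_b, offsets):
    """Shift every sample in batch B back onto batch A's scale."""
    return {
        s: [v - off for v, off in zip(values, offsets)]
        for s, values in batch_b.items()
    }

# Toy data (assumption): 2 genes; batch B adds +1.0 to gene 0 and -0.5 to gene 1.
batch_a = {"bridge1": [5.0, 3.0], "bridge2": [6.0, 2.0], "t0_s1": [4.0, 4.0]}
batch_b = {"bridge1": [6.0, 2.5], "bridge2": [7.0, 1.5], "t1_s1": [6.5, 3.5]}

offsets = bridge_offsets(batch_a, batch_b, ["bridge1", "bridge2"])
corrected = correct_batch(batch_b, offsets)
print(offsets)
print(corrected["t1_s1"])
```

Because the offset is estimated only from samples whose biology is identical across batches, a genuine time effect in the non-bridge samples is not absorbed into the correction, which is the intuition behind using replicates when time and batch are confounded.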


2021 ◽  
Author(s):  
Qian Wang ◽  
Jingyang Niu ◽  
Wei Xu ◽  
Dongming Wei ◽  
Kun Qian

Background: The amount of available biological data has exploded since the emergence of high-throughput technologies, which is not only revolutionizing the way we recognize molecules and diseases but also bringing novel analytical challenges to bioinformatics analysis. In the last decade, deep learning has become a dominant technique in data science. However, classification accuracy is plagued by domain discrepancy. Notably, in the presence of multiple batches, domain discrepancy typically arises between individual batches. The recently proposed pair-wise adaptation approach may be suboptimal, as it fails to eliminate external factors across multiple batches and to take the classification task into account simultaneously. Results: We propose a joint deep learning framework for integrating batch effect removal and classification upon various omics data. To this end, we validate it on two private metabolomics (MALDI MS) datasets and one public transcriptomics (scRNA-seq) dataset. Especially for the former, we have achieved the highest diagnostic accuracy (ACC), with a notable ~10% improvement over state-of-the-art methods. Overall, these results indicate that our approach removes batch effect more effectively than conventional methods and yields more accurate classification results for smart diagnosis.


2021 ◽  
Author(s):  
Michael F. Adamer ◽  
Sarah C. Brueningk ◽  
Alejandro Tejada-Arranz ◽  
Fabienne Estermann ◽  
Marek Basler ◽  
...  

With the steadily increasing abundance of omics data residing in public databases, produced all over the world, sometimes decades apart and under vastly different experimental conditions, a crucial step in many data-driven bioinformatics applications is that of data integration. The challenge of batch effect removal for entire databases lies in the large number and coincidence of both batches and desired, biological variation, resulting in design matrix singularity. This problem currently cannot be solved by any common batch correction algorithm. In this study, we present reComBat, a regularised version of the empirical Bayes method, to overcome this limitation. We demonstrate our approach for the harmonisation of public gene expression data of the human opportunistic pathogen Pseudomonas aeruginosa and study several metrics to empirically demonstrate that batch effects are successfully mitigated while biologically meaningful gene expression variation is retained. reComBat fills the gap in batch correction approaches applicable to large scale, public omics databases and opens up new avenues for data driven analysis of complex biological processes beyond the scope of a single study.
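
The design matrix singularity mentioned above can be made concrete with a small sketch. The toy design is hypothetical, and a plain ridge penalty is used as a stand-in: the abstract says only that reComBat is "regularised", so the specific penalty and the value of `lam` are assumptions made for illustration.

```python
def gram(X):
    """Return X^T X for a design matrix given as a list of rows."""
    p = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Columns: intercept, batch indicator, condition indicator.
# Every batch-1 sample is also condition-1, so batch and biology coincide
# and the last two columns are identical.
X = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 1],
]

G = gram(X)
lam = 0.1  # hypothetical ridge penalty, applied to all coefficients for simplicity
G_ridge = [[G[i][j] + (lam if i == j else 0.0) for j in range(3)] for i in range(3)]

print(det3(G))            # zero determinant: no unique least-squares solution
print(det3(G_ridge) > 0)  # the penalised system is invertible
```

With an unpenalised fit, the batch and condition coefficients cannot be separated because only their sum is identified; adding the penalty selects one solution, at the cost of shrinking the estimates.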


2021 ◽  
Author(s):  
Mathias N Stokholm ◽  
Maria B Rabaglino ◽  
Haja N Kadarmideen

Transcriptomic data is often expensive and difficult to generate in large cohorts in comparison to genomic data, and it is therefore often important to integrate multiple transcriptomic datasets, from both microarray and next generation sequencing (NGS) based platforms, across similar experiments or clinical trials to improve analytical power and discovery of novel transcripts and genes. However, transcriptomic data integration presents a few challenges, including re-annotation and batch effect removal. We developed the Gene Expression Data Integration (GEDI) R package to enable transcriptomic data integration by combining already existing R packages. With just four functions, the GEDI R package makes constructing a transcriptomic data integration pipeline straightforward. Together, the functions overcome the complications in transcriptomic data integration by automatically re-annotating the data and removing the batch effect. The removal of the batch effect is verified with Principal Component Analysis and the data integration is verified using a logistic regression model with forward stepwise feature selection. To demonstrate the functionalities of the GEDI package, we integrated five bovine endometrial transcriptomic datasets from the NCBI Gene Expression Omnibus. The datasets included Affymetrix, Agilent and RNA-sequencing data. Furthermore, we compared the GEDI package to already existing tools and found that GEDI is the only tool that provides a full transcriptomic data integration pipeline including verification of both batch effect removal and data integration.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Marta Moreno-Torres ◽  
Guillem García-Llorens ◽  
Erika Moro ◽  
Rebeca Méndez ◽  
Guillermo Quintás ◽  
...  

REACH (Registration, Evaluation, Authorization and Restriction of Chemicals) is a global strategy and regulation policy of the EU that aims to improve the protection of human health and the environment through the better and earlier identification of the intrinsic properties of chemical substances. It entered into force on 1st June 2007 (EC 1907/2006). REACH and EU policies plead for the use of robust high-throughput "omic" techniques for the in vitro investigation of the toxicity of chemicals that can provide an estimation of their hazards as well as information regarding the underlying mechanisms of toxicity. In agreement with the 3R’s principles, cultured cells are nowadays widely used for this purpose, where metabolomics can provide a real-time picture of the metabolic effects caused by exposure of cells to xenobiotics, enabling estimation of their toxicological hazards. High quality and robust metabolomics data sets are essential for precise and accurate hazard predictions. Currently, the acquisition of consistent and representative metabolomic data is hampered by experimental drawbacks that hinder reproducibility and make robust hazard interpretation difficult. Using the differentiated human liver HepG2 cells as a model system, and incubating with hepatotoxic (acetaminophen and valproic acid) and non-hepatotoxic compounds (citric acid), we evaluated in depth the impact of several key experimental factors (namely, cell passage, processing day and storage time, and compound treatment) and instrumental factors (batch effect) on the outcome of a UPLC-MS metabolomic analysis data set. Results showed that processing day and storage time had a significant impact on the retrieved cell metabolome, while the effect of cell passage was minor.
Meta-analysis of results from pathway analysis showed that batch effect corrections and quality control (QC) measures are critical to enable consistent and meaningful estimations of the effects caused by compounds on cells. The quantitative analysis of the changes in metabolic pathways upon bioactive compound treatment remained consistent despite the concurrent causes of metabolomic data variation. Thus, upon appropriate data retrieval and correction and by an innovative metabolic pathway analysis, the metabolic alteration predictions remained conclusive despite the acknowledged sources of variability.


Blood ◽  
2021 ◽  
Vol 138 (Supplement 1) ◽  
pp. 2954-2954
Author(s):  
Chern Han Yong ◽  
Shawn Hoon ◽  
Sanjay De Mel ◽  
Stacy Xu ◽  
Jonathan Adam Scolnick ◽  
...  

Introduction: Many cancers involve the participation of rare cell populations that may only be found in a subset of patients. Single-cell RNA sequencing (scRNA-seq) can identify distinct cell populations across multiple samples, with batch normalization used to reduce processing-based effects between samples. However, aggressive normalization obscures rare cell populations, which may be erroneously grouped with other cell types. There is a need for conservative batch normalization that maintains the biological signal necessary to detect rare cell populations.
MapBatch: We designed a batch normalization tool, MapBatch, based on two principles: an autoencoder trained with a single sample learns the underlying gene expression structure of cell types without batch effect; and an ensemble model combines multiple autoencoders, allowing the use of multiple samples for training. Each autoencoder is trained on one sample, learning a projection into the biological space S representing the real expression differences between cells in that sample (Figure 1a, middle). When other samples are projected into S, the projection reduces expression differences orthogonal to S, while preserving differences along S. The reverse projection transforms the data back into gene space at the autoencoder's output, sans expression differences orthogonal to S (Figure 1a, right). Since batch-based technical differences are not represented in S, this transformation selectively removes batch effect between samples, while preserving biological signal. The autoencoder output thus represents normalized expression data, conditioned on the training sample. To incorporate multiple samples into training, MapBatch uses an ensemble of autoencoders, each trained with a single sample (Figure 1b). We train with a minimal number of samples necessary to cover the different cell populations in the dataset. We implement regularization using dropout and noise layers, and an a priori feature extraction layer using KEGG gene modules. The autoencoders' outputs are concatenated for downstream analysis. For visualization and clustering, we use the top principal components of the concatenated outputs. For differential expression (DE), we perform DE on each of the gene matrices output by each model, then take the result with the lowest P-value. To test MapBatch, we generated a synthetic dataset based on 7 batches of publicly available PBMC data. For each batch we simulated rare cell populations by selecting one of three cell types to perturb by up- and down-regulating 40 genes in 0.5%-2% of the cells (Figure 1c). We simulated additional batch effect by scaling each gene in each batch with a scaling factor. Upon visualization and clustering, cells grouped largely by batch (Figure 1d). After batch normalization, cells grouped by cell type rather than batch, and all three perturbed cell populations were successfully delineated (Figure 1e). DE between each perturbed population and its mother cells accurately retrieved the perturbed genes, showing that normalization maintained real expression differences (Figure 1e). In contrast, the three other methods tested, Seurat (Stuart et al., 2019), Harmony (Korsunsky et al., 2019), and Liger (Welch et al., 2019), could only derive a subset of the perturbed populations (Figures 1f-h).
MapBatch identifies rare populations in multiple myeloma (MM): We used MapBatch to process bone marrow scRNA-seq data from 14 MM samples and 2 healthy controls. After batch normalization, unsupervised clustering identified 20 clusters, which we annotated using MapCell (Koh & Hoon, 2019) (Figures 2a, 2b). We identified 3 small clusters of cells that could not be reliably annotated, comprising less than 1% of total cells and found in only a subset of patients (Figures 2c, 2d). As validation, we observed that these cells were present in distinct clusters in individual samples using their uncorrected expression data, providing evidence that these clusters were driven neither by batch effect nor by MapBatch (Figure 2e).
Conclusion: Batch normalization of scRNA-seq data involves a trade-off between minimizing batch effect and maximizing the remaining biological signal. While most methods lean towards the former, MapBatch maintains more biological signal for downstream analysis, enabling the discovery of previously difficult-to-find cell populations.
Disclosures: Xu: Proteona Pte Ltd: Current Employment. Scolnick: Proteona Pte Ltd: Current holder of individual stocks in a privately-held company. Huo: Proteona Pte Ltd: Ended employment in the past 24 months. Lovci: Proteona Pte Ltd: Current Employment. Chng: Amgen: Honoraria, Research Funding; Abbvie: Honoraria; Janssen: Honoraria, Research Funding; Novartis: Honoraria; Celgene: Honoraria, Research Funding.
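
The projection idea in the MapBatch abstract above (differences orthogonal to the biological space S are removed, differences along S are kept) can be sketched with a linear projection. This is only a linear analogue under stated assumptions: the actual tool uses autoencoders, and the 3-gene space, the single basis vector for S, and the sample values below are all hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(x, basis):
    """Project x onto the span of an orthonormal basis (list of vectors)."""
    out = [0.0] * len(x)
    for b in basis:
        coeff = sum(xi * bi for xi, bi in zip(x, b))
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out

# Hypothetical 3-gene space; S is spanned by one "biological" direction.
S = [normalize([1.0, 1.0, 0.0])]

cell = [2.0, 2.0, 0.0]                   # lies in S
same_cell_other_batch = [2.0, 2.0, 3.0]  # shifted along gene 3, orthogonal to S
different_cell = [4.0, 4.0, 0.0]         # differs from `cell` along S

p1 = project(cell, S)
p2 = project(same_cell_other_batch, S)
p3 = project(different_cell, S)

print(p1, p2)  # the orthogonal (batch-like) shift is removed
print(p3)      # the difference along S is preserved
```

In this toy setting the first two cells map to the same point while the third stays distinct, which mirrors the claim that the transformation removes batch-like differences without collapsing real expression differences.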


Cancers ◽  
2021 ◽  
Vol 13 (19) ◽  
pp. 4809
Author(s):  
Heather M. Whitney ◽  
Hui Li ◽  
Yu Ji ◽  
Peifang Liu ◽  
Maryellen L. Giger

Radiomic features extracted from medical images may demonstrate a batch effect when cases come from different sources. We investigated classification performance using training and independent test sets drawn from two sources using both pre-harmonization and post-harmonization features. In this retrospective study, a database of thirty-two radiomic features, extracted from DCE-MR images of breast lesions after fuzzy c-means segmentation, was collected. There were 944 unique lesions in Database A (208 benign lesions, 736 cancers) and 1986 unique lesions in Database B (481 benign lesions, 1505 cancers). The lesions from each database were divided by year of image acquisition into training and independent test sets, separately by database and in combination. ComBat batch harmonization was conducted on the combined training set to minimize the batch effect on eligible features by database. The empirical Bayes estimates from the feature harmonization were applied to the eligible features of the combined independent test set. The training sets (A, B, and combined) were then used in training linear discriminant analysis classifiers after stepwise feature selection. The classifiers were then run on the A, B, and combined independent test sets. Classification performance was compared using pre-harmonization features to post-harmonization features, including their corresponding feature selection, evaluated using the area under the receiver operating characteristic curve (AUC) as the figure of merit. Four out of five training and independent test scenarios demonstrated statistically equivalent classification performance when compared pre- and post-harmonization. These results demonstrate that translation of machine learning techniques with batch data harmonization can potentially yield generalizable models that maintain classification performance.


Metabolomics ◽  
2021 ◽  
Vol 17 (10) ◽  
Author(s):  
Kui Deng ◽  
Falin Zhao ◽  
Zhiwei Rong ◽  
Lei Cao ◽  
Liuchao Zhang ◽  
...  
