batch correction
Recently Published Documents


Total documents: 65 (five years: 47)
H-index: 8 (five years: 3)

2022 ◽  
Author(s):  
Stephen Coleman ◽  
Xaquin Castro Dopico ◽  
Gunilla B Karlsson Hedestam ◽  
Paul DW Kirk ◽  
Chris Wallace

Systematic differences between batches of samples present significant challenges when analysing biological data. Such batch effects are well-studied and are liable to occur in any setting where multiple batches are assayed. Many existing methods for accounting for these have focused on high-dimensional data such as RNA-seq and make assumptions that reflect this. Here we focus on batch correction in low-dimensional classification problems. We propose a semi-supervised Bayesian generative classifier based on mixture models that jointly predicts class labels and models batch effects. Our model allows observations to be probabilistically assigned to classes in a way that incorporates uncertainty arising from batch effects. We explore two choices for the within-class densities: the multivariate normal (MVN) and the multivariate t (MVT). A simulation study demonstrates that our method performs well compared to popular off-the-shelf machine learning methods and is also quick, performing 15,000 iterations on a dataset of 500 samples with 2 measurements each in 7.3 seconds for the MVN mixture model and 11.9 seconds for the MVT mixture model. We apply our model to two datasets generated using the enzyme-linked immunosorbent assay (ELISA), a spectrophotometric assay often used to screen for antibodies. The examples we consider were collected in 2020 and measure seropositivity for SARS-CoV-2. We use our model to estimate seroprevalence in the populations studied. We implement the models in C++ using a Metropolis-within-Gibbs algorithm; this is available in the R package at https://github.com/stcolema/BatchMixtureModel. Scripts to recreate our analysis are at https://github.com/stcolema/BatchClassifierPaper.


2021 ◽  
Author(s):  
Xiangchun Li ◽  
Xilin Shen

Integration of the evolving large-scale single-cell transcriptomes requires scalable batch-correction approaches. Here we propose a simple batch-correction method that scales to very large single-cell transcriptome collections from diverse sources. The core idea of the method is to encode the batch information of each cell as a trainable parameter and add it to the cell's expression profile; a contrastive learning approach is then used to learn a feature representation of the resulting additive expression profile. We demonstrate the scalability of the proposed method by integrating 18 million cells obtained from the Human Cell Atlas. Our benchmark comparisons with current state-of-the-art single-cell integration methods demonstrated that our method could achieve comparable data alignment and cluster preservation. Our study should facilitate the integration of very large-scale single-cell transcriptomes. The source code is available at https://github.com/xilinshen/Fugue.


2021 ◽  
Author(s):  
Michael F. Adamer ◽  
Sarah C. Brueningk ◽  
Alejandro Tejada-Arranz ◽  
Fabienne Estermann ◽  
Marek Basler ◽  
...  

With the steadily increasing abundance of omics data residing in public databases, produced all over the world, sometimes decades apart and under vastly different experimental conditions, a crucial step in many data-driven bioinformatics applications is data integration. The challenge of batch effect removal for entire databases lies in the large number of batches and in the coincidence of batches with the desired biological variation, which results in design matrix singularity. This problem currently cannot be solved by any common batch correction algorithm. In this study, we present reComBat, a regularised version of the empirical Bayes method, to overcome this limitation. We demonstrate our approach for the harmonisation of public gene expression data of the human opportunistic pathogen Pseudomonas aeruginosa and study several metrics to empirically demonstrate that batch effects are successfully mitigated while biologically meaningful gene expression variation is retained. reComBat fills the gap in batch correction approaches applicable to large-scale, public omics databases and opens up new avenues for data-driven analysis of complex biological processes beyond the scope of a single study.
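As a rough illustration of why regularisation helps when batches coincide with biological variation, the sketch below builds a toy design matrix in which the batch and condition indicators are identical, so ordinary least squares has no unique solution, and then fits it with a ridge penalty. This is not reComBat's implementation: the plain ridge penalty (applied to every coefficient, including the intercept) is an assumption used as a stand-in, and `ridge_fit` and the toy data are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y (hypothetical helper)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# Toy design: intercept, one batch indicator, one condition indicator.
# Batch and condition coincide perfectly, so the two columns are identical
# and X^T X is singular.
batch = np.array([0, 0, 0, 1, 1, 1], dtype=float)
condition = batch.copy()
X = np.column_stack([np.ones(6), batch, condition])
y = np.array([1.0, 1.1, 0.9, 3.0, 3.1, 2.9])

rank = np.linalg.matrix_rank(X)   # 2, although X has 3 columns
beta = ridge_fit(X, y, lam=0.1)   # the penalty makes the system solvable
fitted = X @ beta
```

With identical columns, the penalty splits the shared effect evenly between the batch and condition coefficients; it makes the fit well defined but cannot by itself tell the two sources of variation apart.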


2021 ◽  
Author(s):  
Scott R Tyler ◽  
Supinda Bunyavanich ◽  
Eric E Schadt

Single cell RNAseq (scRNAseq) batches range from technical replicates to multi-tissue atlases, thus requiring robust batch correction methods that operate effectively across this similarity spectrum. Currently, no metrics allow for full benchmarking across this spectrum, resulting in benchmarks that quantify removal of batch effects without quantifying preservation of real batch differences. Here, we address these gaps with a new statistical metric [Percent Maximum Difference (PMD)] that linearly quantifies batch similarity, and simulations generating cells from mixtures of distinct gene expression programs (cell-lineages/-types/-states). Using 690 real-world and 672 simulated integrations (7.2e6 cells total) we compared 7 batch integration approaches across the spectrum of similarity with batch-confounded gene expression. Count downsampling appeared the most robust, while others left residual batch effects or produced over-merged datasets. We further released open-source PMD and downsampling packages, with the latter capable of downsampling an organism atlas (245,389 cells) in tens of minutes on a standard computer.
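Count downsampling can be sketched as drawing a fixed number of counts from each cell without replacement, so that all cells share a common depth. The code below is a minimal sketch under stated assumptions, not the authors' released package: the function name `downsample_counts` is hypothetical, and leaving cells that are already at or below the target depth unchanged (rather than discarding them) is an illustrative choice.

```python
import numpy as np


def downsample_counts(counts, target_depth, rng):
    """Downsample each cell (row) to at most target_depth total counts.

    Counts are drawn without replacement, which corresponds to a
    multivariate hypergeometric draw over genes.
    """
    counts = np.asarray(counts, dtype=np.int64)
    out = np.empty_like(counts)
    for i, cell in enumerate(counts):
        if cell.sum() <= target_depth:
            # Assumption: shallow cells are kept as-is instead of dropped.
            out[i] = cell
        else:
            out[i] = rng.multivariate_hypergeometric(cell, target_depth)
    return out


# Toy matrix: 3 cells x 4 genes with total depths 10, 40 and 3.
toy_counts = np.array([[5, 0, 3, 2],
                       [10, 20, 0, 10],
                       [1, 1, 1, 0]])
rng = np.random.default_rng(0)
downsampled = downsample_counts(toy_counts, target_depth=8, rng=rng)
```

Because sampling is without replacement, no gene can end up with more counts than it had originally, and genes with zero counts stay at zero.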


2021 ◽  
Author(s):  
Tianyu Liu ◽  
Yuge Wang ◽  
Hong-yu Zhao

With the advancement of technology, we can generate and access large-scale, high dimensional and diverse genomics data, especially through single-cell RNA sequencing (scRNA-seq). However, integrative downstream analysis from multiple scRNA-seq datasets remains challenging due to batch effects. In this paper, we focus on scRNA-seq data integration and propose a new deep learning framework based on Wasserstein Generative Adversarial Network (WGAN) combined with an attention mechanism to reduce the differences among batches. We also discuss the limitations of the existing methods and demonstrate the advantages of our new model from both theoretical and practical aspects, advocating the use of deep learning in genomics research.


2021 ◽  
Author(s):  
Xuanlin Meng ◽  
Fei Tao ◽  
Ping Xu

In microbial research, heterogeneity is closely associated with microbial physiology in multiple dimensions. So far, a few studies have used transcriptome and proteome analysis to uncover heterogeneity among single cells. However, microbial single-cell metabolomics has not yet been possible. Herein, we developed a method, RespectM, based on discontinuous mass spectrometry imaging, which can detect more than 700 metabolites at a rate of 500 cells per hour. While maintaining high throughput, RespectM integrates matrix sublimation, QC-based peak filtering, and batch correction strategies to improve accuracy. The results show that RespectM can distinguish single microbial cells from the blank matrix with an accuracy of 98.4%, depending on the classification algorithm. Furthermore, to verify the accuracy of RespectM for distinguishing different single cells, we performed a classification test on Chlamydomonas reinhardtii single cells among allelic strains. The results showed an accuracy of 93.1%, which gives us enough confidence in RespectM to perform microbial single-cell metabolomics analysis. As we expected, untreated microbial cells spontaneously form metabolic groups that are coherent with genetic and biochemical similarities. Interestingly, the pseudo-time analysis also provided intuitive evidence on the metabolic dimension, indicating that the cell grouping is based on microbial population heterogeneity. We believe that RespectM can offer a powerful tool for microbial studies. Researchers can now directly analyze changes in microbial metabolism at the single-cell level with high efficiency.


Blood ◽  
2021 ◽  
Vol 138 (Supplement 1) ◽  
pp. 3268-3268
Author(s):  
Bianca A Ulloa ◽  
Samima S Habbsa ◽  
Kathryn S Potts ◽  
Alana Lewis ◽  
Mia McKinstry ◽  
...  

Definitive hematopoietic stem cells (HSCs) emerge in the embryo and sustain major adult hematopoietic lineages. Although their functional potential is detected by transplant, nascent HSC contribution during development is both unknown and difficult to address due to the overlapping emergence of HSC-independent progenitors, cells that lack the multipotency and/or longevity of HSCs but express many of the same markers. Using sorted hematopoietic stem and progenitor cells from zebrafish embryos, we performed single cell RNA sequencing to decipher HSC and HSC-independent progenitor heterogeneity during the time frame of their emergence and initial maturation. After batch correction and dimensional reduction, we identified seven distinct populations that are inferred from RNA velocity analysis to originate from pre-hemogenic endothelium and develop into three main differentiation trajectories. We also determined that HSCs can be distinguished from HSC-independent progenitors based on the temporal regulation and differential activity of the draculin (drl) promoter that was previously shown to mark adult-contributing HSCs. From these studies, we found that the drl promoter is active in HSCs and HSC-independent progenitors at 1-day post-fertilization (dpf) but becomes highly expressed primarily in HSCs by 2 dpf. We applied a drl:cre-ERT2 tamoxifen-inducible Cre-loxP lineage-tracing approach to selectively lineage trace HSCs starting at 2 dpf and track their myeloid and lymphoid contribution during larval development and adulthood. We determined that HSC-independent progenitors primarily contribute to developmental lymphomyelopoiesis with minimal HSC contribution until after 7 dpf. Consistent with this result, we demonstrated that although HSCs robustly regenerated after hematopoietic injury using a novel inducible larval HSC injury model, their depletion had almost no impact on lymphoid and myeloid cell numbers up to 7 dpf. These findings suggest that HSCs are not entirely dormant during development and that there exists an uncoupling of HSC self-renewal and differentiation in development. In conclusion, we determine that it is the HSC-independent progenitors, and not HSCs, that sustain embryonic and early larval lymphomyelopoiesis. Acquiring a greater understanding regarding developmental differences in progenitor and HSC specification and maturation will inform and improve the generation of functional HSCs from renewable pluripotent stem cells. Disclosures: No relevant conflicts of interest to declare.


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Changsheng Sun ◽  
Jiatong Han ◽  
Yixin Bai ◽  
Zhaowei Zhong ◽  
Yingtao Song ◽  
...  

Background. The aim of this study was to investigate the association between major depressive disorder (MDD) and periodontitis based on crosstalk genes and neuropeptides. Methods. Datasets for periodontitis (GSE10334, GSE16134, and GSE23586) and MDD (GSE38206 and GSE39653) were downloaded from GEO. Following batch correction, a differential expression analysis was applied (MDD: |log2 FC| > 0; periodontitis: |log2 FC| ≥ 0.5; p < 0.05). The neuropeptide data were downloaded from NeuroPep and NeuroPedia. Intersected genes were potential crosstalk genes. The correlation between neuropeptides and crosstalk genes in MDD and periodontitis was analyzed with the Pearson correlation coefficient. Subsequently, regression analysis was performed to calculate the differentially regulated link. Cytoscape was used to map the pathways of crosstalk genes and neuropeptides and to construct the protein-protein interaction network. Lasso regression was applied to screen neuropeptides, whereby boxplots were created, and receiver operating characteristic (ROC) curve analysis was conducted. Results. The MDD dataset contained 30 case and 33 control samples, and the periodontitis dataset contained 430 case and 139 control samples. 35 crosstalk genes were obtained. A total of 102 neuropeptides were extracted from the database, which were not differentially expressed in MDD and periodontitis and had no intersection with crosstalk genes. Through lasso regression, 9 neuropeptides in MDD and 43 neuropeptides in periodontitis were obtained. Four intersected neuropeptide genes were obtained, i.e., ADM, IGF2, PDYN, and RETN. The results of ROC analysis showed that IGF2 was highly predictive in MDD and periodontitis. ADM was better than the other three genes in predicting MDD. A total of 13 crosstalk genes were differentially coexpressed with four neuropeptides, whereby FOSB was highly expressed in MDD and periodontitis. Conclusion. The neuropeptide genes ADM, IGF2, PDYN, and RETN were intersected between periodontitis and MDD, and FOSB was a crosstalk gene related to these neuropeptides on the transcriptomic level. These results are a basis for future research in the field and need further validation.


GigaScience ◽  
2021 ◽  
Vol 10 (10) ◽  
Author(s):  
Vinay S Swamy ◽  
Temesgen D Fufa ◽  
Robert B Hufnagel ◽  
David M McGaughey

Background: The development of highly scalable single-cell transcriptome technology has resulted in the creation of thousands of datasets, >30 in the retina alone. Analyzing the transcriptomes between different projects is highly desirable because this would allow for better assessment of which biological effects are consistent across independent studies. However, it is difficult to compare and contrast data across different projects because there are substantial batch effects from computational processing, the single-cell technology utilized, and natural biological variation. While many single-cell transcriptome-specific batch correction methods purport to remove the technical noise, it is difficult to ascertain which method functions best. Results: We developed a lightweight R package (scPOP, single-cell Pick Optimal Parameters) that brings in batch integration methods and uses a simple heuristic to balance batch merging and cell type/cluster purity. We use this package along with a Snakefile-based workflow system to demonstrate how to optimally merge 766,615 cells from 33 retina datasets and 3 species to create a massive ocular single-cell transcriptome meta-atlas. Conclusions: This provides a model for how to efficiently create meta-atlases for tissues and cells of interest.


2021 ◽  
Vol 12 ◽  
Author(s):  
Bin Zou ◽  
Tongda Zhang ◽  
Ruilong Zhou ◽  
Xiaosen Jiang ◽  
Huanming Yang ◽  
...  

It is well recognized that batch effects in single-cell RNA sequencing (scRNA-seq) data remain a big challenge when integrating different datasets. Here, we proposed deepMNN, a novel deep learning-based method to correct batch effects in scRNA-seq data. We first searched for mutual nearest neighbor (MNN) pairs across different batches in a principal component analysis (PCA) subspace. Subsequently, a batch correction network was constructed by stacking two residual blocks and further applied for the removal of batch effects. The loss function of deepMNN was defined as the sum of a batch loss and a weighted regularization loss. The batch loss was used to compute the distance between cells in MNN pairs in the PCA subspace, while the regularization loss was used to keep the output of the network similar to the input. The experimental results showed that deepMNN can successfully remove batch effects across datasets with identical cell types, datasets with non-identical cell types, datasets with multiple batches, and large-scale datasets as well. We compared the performance of deepMNN with state-of-the-art batch correction methods, including the widely used methods of Harmony, Scanorama, and Seurat V4 as well as the recently developed deep learning-based methods of MMD-ResNet and scGen. The results demonstrated that deepMNN achieved a better or comparable performance in terms of both qualitative analysis using uniform manifold approximation and projection (UMAP) plots and quantitative metrics such as batch and cell entropies, ARI F1 score, and ASW F1 score under various scenarios. Additionally, deepMNN allowed for integrating scRNA-seq datasets with multiple batches in one step. Furthermore, deepMNN ran much faster than the other methods for large-scale datasets. These characteristics give deepMNN the potential to be a new choice for large-scale single-cell gene expression data analysis.
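The two ingredients described above, MNN pair search and a loss that sums a batch term and a weighted regularization term, can be sketched on toy data as follows. This is only an illustration and not the deepMNN code: the function names `mnn_pairs` and `mnn_style_loss` are hypothetical, the brute-force neighbour search stands in for whatever search the authors use, and the specific distance choices (mean Euclidean distance for the batch term, mean squared difference for the regularization term) are assumptions.

```python
import numpy as np


def mnn_pairs(a, b, k):
    """Return (i, j) pairs where a[i] and b[j] are mutual k-nearest neighbours."""
    # Pairwise squared Euclidean distances, shape (len(a), len(b)).
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    nn_ab = np.argsort(d, axis=1)[:, :k]    # for each cell in a: k nearest in b
    nn_ba = np.argsort(d, axis=0)[:k, :].T  # for each cell in b: k nearest in a
    return [(i, int(j)) for i in range(len(a)) for j in nn_ab[i] if i in nn_ba[j]]


def mnn_style_loss(out_a, out_b, in_a, in_b, pairs, weight):
    """Batch loss over MNN pairs plus a weighted regularization loss."""
    ia = [i for i, _ in pairs]
    ib = [j for _, j in pairs]
    # Assumption: batch loss is the mean distance between paired outputs.
    batch_loss = np.linalg.norm(out_a[ia] - out_b[ib], axis=1).mean()
    # Assumption: regularization is the mean squared output-input difference.
    reg_loss = np.mean((out_a - in_a) ** 2) + np.mean((out_b - in_b) ** 2)
    return batch_loss + weight * reg_loss


# Toy "PCA subspace": batch b is batch a shifted by a constant offset.
emb_a = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
shift = np.array([0.2, 0.0])
emb_b = emb_a + shift

pairs = mnn_pairs(emb_a, emb_b, k=1)
loss_uncorrected = mnn_style_loss(emb_a, emb_b, emb_a, emb_b, pairs, weight=1.0)
loss_corrected = mnn_style_loss(emb_a, emb_b - shift, emb_a, emb_b, pairs, weight=1.0)
```

In this toy case, removing the shift drives the batch term to zero at the cost of a small regularization term, so `loss_corrected` is lower than `loss_uncorrected`; in the method described above, a network would be trained to make that trade-off rather than having the shift supplied by hand.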

