MultiBaC: A strategy to remove batch effects between different omic data types

2020 ◽  
Vol 29 (10) ◽  
pp. 2851-2864
Author(s):  
Manuel Ugidos ◽  
Sonia Tarazona ◽  
José M Prats-Montalbán ◽  
Alberto Ferrer ◽  
Ana Conesa

The diversity of omic technologies has expanded in recent years, together with the number of omic data integration strategies. However, multiomic data generation is costly, and many research groups cannot afford projects in which many different omic data types are generated, at least not at the same time. As most researchers share their data in public repositories, different omic datasets of the same biological system obtained at different labs can be combined to construct a multiomic study. However, data obtained at different labs or moments in time are typically subject to batch effects that need to be removed for successful data integration. While there are methods to correct batch effects on the same data types obtained in different studies, they cannot be applied to correct lab or batch effects across omics. This impairs multiomic meta-analysis. Fortunately, in many cases, at least one omics platform (e.g. gene expression) is repeatedly measured across labs, together with the additional omic modalities that are specific to each study. This creates an opportunity for batch analysis. We have developed MultiBaC (Multiomics Batch-effect Correction), a strategy to correct batch effects from multiomic datasets distributed across different labs or data acquisition events. Our strategy is based on the existence of at least one shared data type, which allows data prediction across omics. We validate this approach both on simulated data and on a case where the multiomic design is fully shared by two labs, so that batch effect correction within the same omic modality using traditional methods can be compared with the MultiBaC correction across data types. Finally, we apply MultiBaC to a true multiomic data integration problem to show that we are able to improve the detection of meaningful biological effects.
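The idea of predicting across omics through a shared data type can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the prediction model, so a one-variable ordinary least squares fit is swapped in as a stand-in, and all names and numbers (`x_lab_a`, `y_lab_a`, `x_lab_b`, `fit_line`) are hypothetical. Lab A measures both the shared omic and a second omic; lab B measures only the shared omic, so the second omic is predicted for lab B.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical toy data (assumption): one shared feature and one lab-specific feature.
x_lab_a = [1.0, 2.0, 3.0, 4.0]   # shared omic, measured in lab A
y_lab_a = [2.1, 3.9, 6.2, 7.8]   # second omic, measured only in lab A
x_lab_b = [1.5, 2.5, 3.5, 4.5]   # shared omic, measured again in lab B

# Learn the relationship between the two omics where both were measured.
slope, intercept = fit_line(x_lab_a, y_lab_a)

# Predict the second omic for lab B, where it was never measured.
y_lab_b_pred = [slope * x + intercept for x in x_lab_b]

# Naive offset between labs for the second omic. It is only interpretable as a
# batch shift if both labs profiled the same biological conditions (assumption).
batch_offset = mean(y_lab_b_pred) - mean(y_lab_a)
```

Once the second omic exists in both batches (observed in one, predicted in the other), a within-omic batch correction method can in principle be applied to it, which is the opportunity the abstract describes.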

Biotechnology ◽  
2019 ◽  
pp. 1826-1866
Author(s):  
Kristel Van Steen ◽  
Nuria Malats

The identification of causal or predictive variants/genes/mechanisms for disease-associated traits is characterized by “complex” networks of molecular phenotypes. Present technology and computing power allow large collections of these data types to be built and processed. However, rapid data generation is offset by the slow pace at which data integration methods are developed. Most currently available integrative analytic tools pair omics data and focus on relationships between data sources, making strong assumptions about the architecture within each data source. Only a limited number of initiatives aim to find the best ways to analyze multiple, possibly related, omics databases while fully acknowledging the specific characteristics of each data type. A thorough understanding of the underlying assumptions of integrative methods is needed to draw sound conclusions afterwards. In this chapter, the authors discuss how the field of “integromics” has evolved and give pointers towards essential research developments in this context.


2019 ◽  
Vol 35 (22) ◽  
pp. 4696-4706 ◽  
Author(s):  
Travis S Johnson ◽  
Tongxin Wang ◽  
Zhi Huang ◽  
Christina Y Yu ◽  
Yi Wu ◽  
...  

Motivation: Rapid advances in single cell RNA sequencing (scRNA-seq) have produced higher-resolution cellular subtypes in multiple tissues and species. Methods are increasingly needed across datasets and species to (i) remove systematic biases, (ii) model multiple datasets with ambiguous labels and (iii) classify cells and map cell type labels. However, most methods only address one of these problems on broad cell types or simulated data using a single model type. It is also important to address higher-resolution cellular subtypes, subtype labels from multiple datasets, models trained on multiple datasets simultaneously and generalizability beyond a single model type. Results: We developed a species- and dataset-independent transfer learning framework (LAmbDA) to train models on multiple datasets (even from different species) and applied our framework on simulated, pancreas and brain scRNA-seq experiments. These models mapped corresponding cell types between datasets with inconsistent cell subtype labels while simultaneously reducing batch effects. We achieved high accuracy in labeling cellular subtypes (weighted accuracy simulated 1 datasets: 90%; simulated 2 datasets: 94%; pancreas datasets: 88% and brain datasets: 66%) using LAmbDA Feedforward 1 Layer Neural Network with bagging. This method achieved higher weighted accuracy in labeling cellular subtypes than two other state-of-the-art methods, scmap and CaSTLe in brain (66% versus 60% and 32%). Furthermore, it achieved better performance in correctly predicting ambiguous cellular subtype labels across datasets in 88% of test cases compared with CaSTLe (63%), scmap (50%) and MetaNeighbor (50%). LAmbDA is model- and dataset-independent and generalizable to diverse data types representing an advance in biocomputing. Availability and implementation: github.com/tsteelejohnson91/LAmbDA. Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Sean M. Gibbons ◽  
Claire Duvallet ◽  
Eric J. Alm

Abstract: High-throughput data generation platforms, like mass-spectrometry, microarrays, and second-generation sequencing are susceptible to batch effects due to run-to-run variation in reagents, equipment, protocols, or personnel. Currently, batch correction methods are not commonly applied to microbiome sequencing datasets. In this paper, we compare different batch-correction methods applied to microbiome case-control studies. We introduce a model-free normalization procedure where features (i.e. bacterial taxa) in case samples are converted to percentiles of the equivalent features in control samples within a study prior to pooling data across studies. We look at how this percentile-normalization method compares to traditional meta-analysis methods for combining independent p-values and to limma and ComBat, widely used batch-correction models developed for RNA microarray data. Overall, we show that percentile-normalization is a simple, non-parametric approach for correcting batch effects and improving sensitivity in case-control meta-analyses. Author Summary: Batch effects are obstacles to comparing results across studies. Traditional meta-analysis techniques for combining p-values from independent studies, like Fisher’s method, are effective but statistically conservative. If batch-effects can be corrected, then statistical tests can be performed on data pooled across studies, increasing sensitivity to detect differences between treatment groups. Here, we show how a simple, model-free approach corrects for batch effects in case-control microbiome datasets.
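The percentile-normalization step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`percentile_of_score`, `percentile_normalize`) and the toy abundances are hypothetical, and ties are handled by averaging the strict and non-strict ranks, which is one of several reasonable conventions.

```python
from bisect import bisect_left, bisect_right

def percentile_of_score(sorted_ref, value):
    """Percentile (0-100) of value within a sorted reference list.

    Ties are handled by averaging the counts of reference values
    strictly below and at-or-below the value (an assumption).
    """
    below = bisect_left(sorted_ref, value)
    at_or_below = bisect_right(sorted_ref, value)
    return 100.0 * (below + at_or_below) / (2 * len(sorted_ref))

def percentile_normalize(cases, controls):
    """Convert each feature of each case sample to a percentile of the
    same feature among the control samples of the same study.

    cases, controls: lists of samples, each a list of feature values.
    """
    n_features = len(controls[0])
    sorted_controls = [sorted(s[j] for s in controls) for j in range(n_features)]
    return [
        [percentile_of_score(sorted_controls[j], sample[j]) for j in range(n_features)]
        for sample in cases
    ]

# Hypothetical study with two taxa, four control samples and two case samples.
controls = [[0.1, 5], [0.2, 7], [0.3, 9], [0.4, 11]]
cases = [[0.25, 12], [0.1, 5]]

normalized = percentile_normalize(cases, controls)
# normalized == [[50.0, 100.0], [12.5, 12.5]]
```

Because each study is normalized against its own controls, the resulting percentiles are on a common 0 to 100 scale and can be pooled across studies before testing.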


2017 ◽  
Author(s):  
Maren Büttner ◽  
Zhichao Miao ◽  
F Alexander Wolf ◽  
Sarah A Teichmann ◽  
Fabian J Theis

Abstract: Single-cell transcriptomics is a versatile tool for exploring heterogeneous cell populations. As with all genomics experiments, batch effects can hamper data integration and interpretation. The success of batch effect correction is often evaluated by visual inspection of dimension-reduced representations such as principal component analysis. This is inherently imprecise due to the high number of genes and non-normal distribution of gene expression. Here, we present a k-nearest neighbour batch effect test (kBET, https://github.com/theislab/kBET) to quantitatively measure batch effects. kBET is easier to interpret, more sensitive and more robust than visual evaluation and other measures of batch effects. We use kBET to assess commonly used batch regression and normalisation approaches, and quantify the extent to which they remove batch effects while preserving biological variability. Our results illustrate that batch correction based on log-transformation or scran pooling followed by ComBat reduced the batch effect while preserving structure across data sets. Finally we show that kBET can pinpoint successful data integration methods across multiple data sets, in this case from different publications all charting mouse embryonic development. This has important implications for future data integration efforts, which will be central to projects such as the Human Cell Atlas where data for the same tissue may be generated in multiple locations around the world. [Before final publication, we will upload the R package to Bioconductor]
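The intuition behind a k-nearest neighbour batch test can be sketched as follows. This is a simplified illustration, not the kBET package: the function name `knn_batch_rejection_rate` and the toy data are hypothetical, the sketch assumes exactly two batches (so a Pearson chi-square test with one degree of freedom applies), it excludes each sample from its own neighbourhood, and it tests every sample rather than a subsample. For each sample, the batch composition of its k nearest neighbours is compared with the global batch composition; the fraction of samples where the test rejects is reported.

```python
import math

def knn_batch_rejection_rate(points, batches, k, alpha=0.05):
    """Fraction of samples whose local batch mix differs from the global mix."""
    labels = sorted(set(batches))
    if len(labels) != 2:
        raise ValueError("this sketch assumes exactly two batches")
    n = len(points)
    # Expected neighbour counts per batch if batches were perfectly mixed.
    expected = {b: k * batches.count(b) / n for b in labels}
    rejected = 0
    for i, p in enumerate(points):
        # k nearest neighbours by Euclidean distance, excluding the sample itself.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        neighbours = others[:k]
        # Pearson chi-square statistic over the two batch labels.
        stat = sum(
            (sum(1 for j in neighbours if batches[j] == b) - expected[b]) ** 2
            / expected[b]
            for b in labels
        )
        # Chi-square survival function for one degree of freedom.
        p_value = math.erfc(math.sqrt(stat / 2))
        if p_value < alpha:
            rejected += 1
    return rejected / n

# Hypothetical data: two clearly separated batches (strong batch effect).
separated_points = [(float(x), 0.0) for x in range(6)] + \
                   [(float(x), 0.0) for x in range(100, 106)]
separated_batches = ["A"] * 6 + ["B"] * 6
rate_separated = knn_batch_rejection_rate(separated_points, separated_batches, k=5)

# Hypothetical data: batches interleaved along a line (well mixed).
mixed_points = [(float(x), 0.0) for x in range(12)]
mixed_batches = ["A", "B"] * 6
rate_mixed = knn_batch_rejection_rate(mixed_points, mixed_batches, k=4)
```

In this toy setup the separated data give a rejection rate of 1.0 and the interleaved data give 0.0, matching the reading that a low rejection rate indicates well-mixed batches.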


Epigenomics ◽  
2021 ◽  
Author(s):  
Amy L Non

Aim: Social scientists have placed particularly high expectations on the study of epigenomics to explain how exposure to adverse social factors like poverty, child maltreatment and racism – particularly early in childhood – might contribute to complex diseases. However, progress has stalled, reflecting many of the same challenges faced in genomics, including overhype, lack of diversity in samples, limited replication and difficulty interpreting significance of findings. Materials & methods: This review focuses on the future of social epigenomics by discussing progress made, ongoing methodological and analytical challenges and suggestions for improvement. Results & conclusion: Recommendations include more diverse sample types, cross-cultural, longitudinal and multi-generational studies. True integration of social and epigenomic data will require increased access to both data types in publicly available databases, enhanced data integration frameworks, and more collaborative efforts between social scientists and geneticists.


CNS Spectrums ◽  
2018 ◽  
Vol 24 (5) ◽  
pp. 479-495 ◽  
Author(s):  
Marco Solmi ◽  
Michele Fornaro ◽  
Kuniyoshi Toyoshima ◽  
Andrè F. Carvalho ◽  
Cristiano A. Köhler ◽  
...  

Objective: Our aim was to summarize the efficacy and safety of atomoxetine, amphetamines, and methylphenidate in schizophrenia. Methods: We undertook a systematic review, searching PubMed/Scopus/Clinicaltrials.gov for double-blind, randomized, placebo-controlled studies of psychostimulants or atomoxetine in schizophrenia published up to 1 January 2017. A meta-analysis of outcomes reported in two or more studies is presented. Results: We included 22 studies investigating therapeutic effects of stimulants (k=14) or measuring symptomatic worsening/relapse prediction after stimulant challenge (k=6). Six studies of these two groups plus one additional study investigated biological effects of psychostimulants or atomoxetine. No effect resulted from interventional studies on weight loss (k=1), smoking cessation (k=1), and positive symptoms (k=12), and no improvement was reported with atomoxetine (k=3) for negative symptoms, with equivocal findings for negative (k=6) and mood symptoms (k=2) with amphetamines. Attention, processing speed, working memory, problem solving, and executive functions, among others, showed from no to some improvement with atomoxetine (k=3) or amphetamines (k=6). Meta-analysis did not confirm any effect of stimulants in any symptom domain, including negative symptoms, apart from atomoxetine improving problem solving (k=2, standardized mean difference (SMD)=0.73, 95% CI=0.10–1.36, p=0.02, I²=0%), and trending toward significant improvement in executive functions with amphetamines (k=2, SMD=0.80, 95% CI=−1.68 to +0.08, p=0.08, I²=66%). In challenge studies, amphetamines (k=1) did not worsen symptoms, and methylphenidate (k=5) consistently worsened or predicted relapse. Biological effects of atomoxetine (k=1) and amphetamines (k=1) were cortical activation, without change in β-endorphin (k=1), improved response to antipsychotics after amphetamine challenge (k=2), and an increase of growth hormone–mediated psychosis with methylphenidate (k=2). No major side effects were reported (k=6). Conclusions: No efficacy for stimulants or atomoxetine on negative symptoms is proven. Atomoxetine or amphetamines may improve cognitive symptoms, while methylphenidate should be avoided in patients with schizophrenia. Insufficient evidence is available to draw firm conclusions.


2018 ◽  
Author(s):  
Uri Shaham

Abstract: Biological measurements often contain systematic errors, also known as “batch effects”, which may invalidate downstream analysis when not handled correctly. The problem of removing batch effects is of major importance in the biological community. Despite recent advances in this direction via deep learning techniques, most current methods may not fully preserve the true biological patterns the data contains. In this work we propose a deep learning approach for batch effect removal. The crux of our approach is learning a batch-free encoding of the data, representing its intrinsic biological properties, but not batch effects. In addition, we also encode the systematic factors through a decoding mechanism and require accurate reconstruction of the data. Altogether, this allows us to fully preserve the true biological patterns represented in the data. Experimental results are reported on data obtained from two high-throughput technologies, mass cytometry and single-cell RNA-seq. Beyond good performance on training data, we also observe that our system performs well on test data obtained from new patients, which were not available at training time. Our method is easy to use, and publicly available code can be found at https://github.com/ushaham/BatchEffectRemoval2018.

