A multivariate method to correct for batch effects in microbiome data

2020 ◽  
Author(s):  
Yiwen Wang ◽  
Kim-Anh Lê Cao

Abstract Microbial communities are highly dynamic and sensitive to changes in the environment. Thus, microbiome data are highly susceptible to batch effects, defined as sources of unwanted variation that are not related to, and obscure, any factors of interest. Existing batch correction methods have been primarily developed for gene expression data. As such, they do not consider the inherent characteristics of microbiome data, including zero inflation, overdispersion and correlation between variables. We introduce a new multivariate and non-parametric batch correction method based on Partial Least Squares Discriminant Analysis. PLSDA-batch first estimates treatment and batch variation with latent components and then subtracts batch variation from the data. The resulting batch-effect-corrected data can then be used as input to any downstream statistical analysis. Two variants are also proposed to handle unbalanced batch x treatment designs and to include variable selection during component estimation. We compare our approaches with the existing batch correction methods removeBatchEffect and ComBat on simulated data and three case studies. We show that our three methods lead to competitive performance in removing batch variation while preserving treatment variation, especially when batch effects have high variability. Reproducible code and vignettes are available on GitHub.
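The "subtract batch variation" step can be illustrated with a generic rank-one deflation, as commonly used in PLS-type methods. This is only a minimal sketch, not the PLSDA-batch implementation: the function name `deflate` and the input `t` (a hypothetical score vector for one latent component assumed to capture batch variation) are illustrative assumptions, and how such a component is estimated is not shown here.

```python
def deflate(X, t):
    """Remove from each column of X the part explained by score vector t.

    X is a list of rows (samples x variables); t has one score per sample.
    Both names are hypothetical and used for illustration only.
    """
    tt = sum(v * v for v in t)
    if tt == 0:
        raise ValueError("score vector must be non-zero")
    n_rows, n_cols = len(X), len(X[0])
    # Loading for variable j: p_j = (t . X[:, j]) / (t . t)
    loadings = [
        sum(t[i] * X[i][j] for i in range(n_rows)) / tt
        for j in range(n_cols)
    ]
    # Subtract the rank-one reconstruction t p^T from X.
    return [
        [X[i][j] - t[i] * loadings[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]
```

After deflation, every column of the returned matrix is orthogonal to `t`, which is the sense in which the variation along that component has been removed.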

2019 ◽  
Author(s):  
Miao Yu ◽  
Anna Roszkowska ◽  
Janusz Pawliszyn

Abstract Batch effects will influence the interpretation of metabolomics data. In order to avoid misleading results, batch effects should be corrected and normalized prior to statistical analysis. Metabolomics studies are usually performed without targeted compounds (e.g., internal standards), and it is a challenging task to validate batch effect correction methods. In addition, the statistical properties of metabolomics data are quite different from those of genomics data (where most of the currently used batch correction methods originated). In this study, we first analyzed already published metabolomics datasets so as to summarize and discuss their statistical properties. Then, based on available datasets, we developed novel statistical properties-based in silico simulations of metabolomics peak intensity data so as to analyze the influence of batch effects on metabolomic data with the use of currently available batch correction strategies. Overall, 252000 batch corrections on 14000 different in silico simulated datasets and related differential analyses were performed in order to evaluate and validate various batch correction methods. The obtained results indicate that log transformations strongly influence the performance of all investigated batch correction methods. False positive rates increased after application of batch correction methods, with almost no improvement in true positive rates among the analyzed batch correction methods. Hence, in metabolomic studies it is recommended to implement preliminary experiments to simulate batch effects from real data in order to select an adequate batch correction method, based on a given distribution of peak intensity. The presented study is reproducible, and the related R package mzrtsim can be found online (https://github.com/yufree/mzrtsim).


NAR Cancer ◽  
2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Susanne Ibing ◽  
Birgitta E Michels ◽  
Moritz Mosdzien ◽  
Helen R Meyer ◽  
Lars Feuerbach ◽  
...  

Abstract MicroRNAs (miRNAs) are small non-coding RNAs with diverse functions in post-transcriptional regulation of gene expression. Sequence and length variants of miRNAs are called isomiRs and can exert different functions compared to their canonical counterparts. The Cancer Genome Atlas (TCGA) provides isomiR-level expression data for patients of various cancer entities collected in a multi-center approach over several years. However, the impact of batch effects within individual cohorts has not been systematically investigated and corrected for before. Therefore, the aim of this study was to identify relevant cohort-specific batch variables and generate batch-corrected isomiR expression data for 16 TCGA cohorts. The main batch variables included sequencing platform, plate, sample purity and sequencing depth. Platform bias was related to certain length and sequence features of individual recurrently affected isomiRs. Furthermore, significant downregulation of reported tumor suppressive isomiRs in lung tumor tissue compared to normal samples was only observed after batch correction, highlighting the importance of working with corrected data. Batch-corrected datasets for all cohorts including quality control are provided as supplement. In summary, this study reveals that batch effects present in the TCGA dataset might mask biologically relevant effects and provides a valuable resource for research on isomiRs in cancer (accessible through GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164767).


2014 ◽  
Vol 7 (7) ◽  
pp. 1969-1977 ◽  
Author(s):  
G. J. Zheng ◽  
Y. Cheng ◽  
K. B. He ◽  
F. K. Duan ◽  
Y. L. Ma

Abstract. The Sunset semi-continuous carbon analyzer (SCCA) is an instrument widely used for carbonaceous aerosol measurement. Despite previous validation work, in this study we identified a new type of SCCA calculation discrepancy caused by the default multipoint baseline correction method. When the carbon load exceeds a certain threshold, multipoint correction could cause significant total carbon (TC) underestimation. This calculation discrepancy was characterized for both sucrose and ambient samples, with two protocols based on IMPROVE (Interagency Monitoring of PROtected Visual Environments) (i.e., IMPshort and IMPlong) and one NIOSH (National Institute for Occupational Safety and Health)-like protocol (rtNIOSH). For ambient samples, the IMPshort, IMPlong and rtNIOSH protocols underestimated TC by 22, 36 and 12%, respectively, with the corresponding thresholds being ~ 0, 20 and 25 μgC. For sucrose, however, such a discrepancy was observed only with the IMPshort protocol, indicating the need for a more refractory SCCA calibration substance. Although the calculation discrepancy could be largely reduced by the single-point baseline correction method, the instrumental blanks of the single-point method were higher. The correction method proposed was to use multipoint-corrected data when below the determined threshold, and to use single-point results when beyond that threshold. The effectiveness of this correction method was supported by correlation with optical data.
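The threshold rule described at the end of this abstract can be sketched as a small selection function. This is a sketch under stated assumptions rather than the authors' procedure: the name `hybrid_tc` is hypothetical, the single-point TC value is assumed to serve as the carbon-load estimate compared against the threshold (since the multipoint value is the one reported to be biased low), and a load exactly at the threshold is assumed to fall on the single-point side.

```python
def hybrid_tc(multipoint_tc, singlepoint_tc, threshold):
    """Pick a total carbon (TC) value, in ugC, using a load threshold.

    Assumption: singlepoint_tc is used as the carbon-load estimate,
    because the multipoint value may be underestimated at high load.
    """
    if singlepoint_tc < threshold:
        # Below the threshold: keep the multipoint-corrected value.
        return multipoint_tc
    # At or beyond the threshold: fall back to the single-point value.
    return singlepoint_tc


# Illustrative values only, with a 20 ugC threshold.
low_load = hybrid_tc(10.0, 11.0, 20.0)    # multipoint value is kept
high_load = hybrid_tc(24.0, 30.0, 20.0)   # single-point value is used
```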


2018 ◽  
Vol 35 (13) ◽  
pp. 2348-2348 ◽  
Author(s):  
Zhenwei Dai ◽  
Sunny H Wong ◽  
Jun Yu ◽  
Yingying Wei

2020 ◽  
Author(s):  
Ruben Chazarra-Gil ◽  
Stijn van Dongen ◽  
Vladimir Yu Kiselev ◽  
Martin Hemberg

Abstract As the cost of single-cell RNA-seq experiments has decreased, an increasing number of datasets are now available. Combining newly generated and publicly accessible datasets is challenging due to non-biological signals, commonly known as batch effects. Although there are several computational methods available that can remove batch effects, evaluating which method performs best is not straightforward. Here we present BatchBench (https://github.com/cellgeni/batchbench), a modular and flexible pipeline for comparing batch correction methods for single-cell RNA-seq data. We apply BatchBench to eight methods, highlight their methodological differences, and assess their performance and computational requirements through a compendium of well-studied datasets. This systematic comparison guides users in the choice of batch correction tool, and the pipeline makes it easy to evaluate other datasets.


2020 ◽  
pp. 580-592
Author(s):  
Libi Hertzberg ◽  
Assif Yitzhaky ◽  
Metsada Pasmanik-Chor

This article describes how the last decade has been characterized by the production of huge amounts of different types of biological data. Following that, a flood of bioinformatics tools has been published. However, many of these tools are commercial or require computational skills. In addition, not all tools provide intuitive and highly accessible visualization of the results. The authors have developed GEView (Gene Expression View), which is a free, user-friendly tool harboring several existing algorithms and statistical methods for the analysis of high-throughput gene, microRNA or protein expression data. It can be used to perform basic analysis such as quality control, outlier detection, batch correction and differential expression analysis, through a single intuitive graphical user interface. GEView is unique in its simplicity and in the highly accessible visualization it provides. Together with its basic and intuitive functionality, it allows biomedical scientists with no computational skills to independently analyze and visualize high-throughput data produced in their own labs.


2002 ◽  
Vol 19 (3) ◽  
pp. 322-339 ◽  
Author(s):  
Brian L. Bosart ◽  
Wen-Chau Lee ◽  
Roger M. Wakimoto

Abstract The navigation correction method proposed in Testud et al. (referred to as the THL method) systematically identifies uncertainties in the aircraft Inertial Navigation System and errors in the radar-pointing angles by analyzing the radar returns from a flat and stationary earth surface. This paper extends the THL study to address 1) error characteristics on the radar display, 2) sensitivity of the dual-Doppler analyses to navigation errors, 3) fine-tuning the navigation corrections for individual flight legs, and 4) identifying navigation corrections over a flat and nonstationary earth surface (e.g., ocean). The results show that the errors in each of the parameters affect the dual-Doppler wind analyses and the first-order derivatives in different manners. The tilt error is the most difficult parameter to determine and has the greatest impact on the dual-Doppler analysis. The extended THL method can further reduce the drift, ground speed, and tilt errors in all flight legs over land by analyzing the residual velocities of the earth surface using the corrections obtained in the calibration legs. When reliable dual-Doppler winds can be deduced at flight level, the Bosart–Lee–Wakimoto method presented here can identify all eight errors by satisfying three criteria: 1) the flight-level dual-Doppler winds near the aircraft are statistically consistent with the in situ winds, 2) the flight-level dual-Doppler winds are continuous across the flight track, and 3) the surface velocities of the left (right) fore radar have the same magnitude but opposite sign as their counterparts of right (left) aft radar. This procedure is able to correct airborne Doppler radar data over the ocean and has been evaluated using datasets collected during past experiments. Consistent calibration factors are obtained in multiple legs. The dual-Doppler analyses using the corrected data are statistically superior to those using uncorrected data.


2014 ◽  
Vol 31 (5) ◽  
pp. 1098-1103 ◽  
Author(s):  
Dong Xia ◽  
Haobo Tan ◽  
Ling Chen ◽  
Weiqiang Mo ◽  
Zhiyang Yuan ◽  
...  

Abstract Observation of UV radiation is of major importance to human health and to the calculation of photochemical reaction rates. However, the sensitivity of UV radiometers decays because of equipment aging. A correction method is therefore proposed by using a decrement formula that is approximately a quadratic function of time and is obtained by fitting the clear-sky observation data from an aged UVS-AB-T UV radiometer with the data simulated by the Tropospheric Ultraviolet and Visible (TUV) radiative transfer model. The corrected data from the older radiometer are verified by the data from another newer radiometer on selected clear-sky days. The results show a high correlation and a low bias between the radiometers, and the mean of the corrected data from the older radiometer is 94.5% of that from the newer radiometer. After a long time of use, the decrement of the observation data would increase dramatically and errors of the data after correction would still be significant. In Dongguan, China, a recommendation is made that a UV radiometer should not be used for more than 5 years when the decrement rate reaches 50%.
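A time-dependent correction of this kind can be sketched as a least-squares quadratic fit followed by a rescaling. The sketch below is illustrative only and makes several assumptions not stated in the abstract: the fitted quantity is taken to be the ratio of observed to simulated clear-sky values, the correction divides each observation by the fitted ratio at its time, and the function names `fit_quadratic` and `correct` are hypothetical.

```python
def fit_quadratic(t, y):
    """Least-squares fit y ~ a + b*t + c*t**2 via the normal equations."""
    # Power sums of t give the 3x3 normal-equation matrix.
    s = [sum(ti ** k for ti in t) for k in range(5)]
    A = [[s[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(yi * ti ** i for ti, yi in zip(t, y)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            rhs[r] -= f * rhs[col]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (rhs[i] - tail) / A[i][i]
    return coef  # [a, b, c]


def correct(observed, t, coef):
    """Divide each observation by the fitted sensitivity ratio at its time."""
    a, b, c = coef
    return [o / (a + b * ti + c * ti * ti) for o, ti in zip(observed, t)]
```

In use, `t` would hold the instrument age at each clear-sky observation and `y` the observed-to-simulated ratio at that time; at least three distinct times are needed for the fit to be determined.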

