pldist: ecological dissimilarities for paired and longitudinal microbiome association analysis

2019 · Vol 35 (19) · pp. 3567-3575
Author(s): Anna M Plantinga, Jun Chen, Robert R Jenq, Michael C Wu

Abstract
Motivation: The human microbiome is notoriously variable across individuals, with a wide range of ‘healthy’ microbiomes. Paired and longitudinal studies of the microbiome have become increasingly popular as a way to reduce unmeasured confounding and to increase statistical power by reducing large inter-subject variability. Statistical methods for analyzing such datasets are scarce.
Results: We introduce a paired UniFrac dissimilarity that summarizes within-individual (or within-pair) shifts in microbiome composition and then compares these compositional shifts across individuals (or pairs). This dissimilarity depends on a novel transformation of relative abundances, which we then extend to more than two time points and incorporate into several phylogenetic and non-phylogenetic dissimilarities. The data transformation and resulting dissimilarities may be used in a wide variety of downstream analyses, including ordination analysis and distance-based hypothesis testing. Simulations demonstrate that tests based on these dissimilarities retain appropriate type 1 error and high power. We apply the method in two real datasets.
Availability and implementation: The R package pldist is available on GitHub at https://github.com/aplantin/pldist.
Supplementary information: Supplementary data are available at Bioinformatics online.
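The core idea, within-subject compositional shifts compared across subjects, can be sketched in a few lines of base R. This is a conceptual illustration only, not pldist's actual transformation or its paired UniFrac variants:

```r
## Conceptual sketch of a paired dissimilarity (not pldist's exact method):
## summarize each subject's within-pair shift in relative abundance, then
## compare the shift vectors across subjects.
set.seed(1)
n_subj <- 4; n_taxa <- 6
pre  <- matrix(rexp(n_subj * n_taxa), n_subj, n_taxa)  # taxon abundances, time 1
post <- matrix(rexp(n_subj * n_taxa), n_subj, n_taxa)  # taxon abundances, time 2
rel  <- function(x) sweep(x, 1, rowSums(x), "/")       # relative abundances
shift <- rel(post) - rel(pre)                          # within-subject change
## Manhattan-type dissimilarity between subjects' compositional shifts;
## downstream, such a matrix could feed ordination or distance-based tests.
D <- as.matrix(dist(shift, method = "manhattan"))
round(D, 2)
```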

Author(s): Darawan Rinchai, Jessica Roelands, Mohammed Toufiq, Wouter Hendrickx, Matthew C Altman, ...

Abstract
Motivation: We previously described the construction and characterization of generic and reusable blood transcriptional module repertoires. More recently we released a third iteration (“BloodGen3” module repertoire) that comprises 382 functionally annotated gene sets (modules) and encompasses 14,168 transcripts. Custom bioinformatic tools are needed to support downstream analysis, visualization and interpretation relying on such fixed module repertoires.
Results: We have developed, and describe here, an R package, BloodGen3Module. The functions of our package permit group comparison analyses to be performed at the module level and the results to be displayed as annotated fingerprint grid plots. A parallel workflow is available for computing module repertoire changes for individual samples rather than groups of samples; these results are displayed as fingerprint heatmaps. An illustrative case is used to demonstrate the steps involved in generating blood transcriptome repertoire fingerprints of septic patients. Taken together, this resource could facilitate the analysis and interpretation of changes in blood transcript abundance observed across a wide range of pathological and physiological states.
Availability: The BloodGen3Module package and documentation are freely available from GitHub: https://github.com/Drinchai/BloodGen3Module
Supplementary information: Supplementary data are available at Bioinformatics online.
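The notion of a module-level group comparison can be illustrated with a toy calculation in base R. The module labels, significance cutoff and summary statistic below are illustrative assumptions, not BloodGen3Module's API:

```r
## Toy module-level comparison: summarize each module by the percentage of
## its transcripts significantly up- vs down-regulated between two groups.
set.seed(2)
expr   <- matrix(rnorm(200 * 20), 200, 20)               # 200 transcripts x 20 samples
group  <- rep(c("case", "control"), each = 10)
module <- sample(paste0("M", 1:10), 200, replace = TRUE) # hypothetical module labels
pvals <- apply(expr, 1, function(g) t.test(g ~ group)$p.value)
up    <- apply(expr, 1, function(g) mean(g[group == "case"]) > mean(g[group == "control"]))
pct_up   <- tapply(pvals < 0.05 & up,  module, mean) * 100
pct_down <- tapply(pvals < 0.05 & !up, module, mean) * 100
round(pct_up - pct_down, 1)   # one signed "module response" per module
```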


Author(s): Xiaofan Lu, Jialin Meng, Yujie Zhou, Liyun Jiang, Fangrong Yan

Abstract
Summary: Stratification of cancer patients into distinct molecular subgroups based on multi-omics data is an important issue in the context of precision medicine. Here, we present MOVICS, an R package for multi-omics integration and visualization in cancer subtyping. MOVICS provides a unified interface for 10 state-of-the-art multi-omics integrative clustering algorithms, and incorporates the most commonly used downstream analyses in cancer subtyping research, including characterization and comparison of identified subtypes from multiple perspectives, and verification of subtypes in external cohorts using two model-free approaches for multiclass prediction. MOVICS also creates feature-rich, customizable visualizations with minimal effort. By analysing two published breast cancer cohorts, we show that MOVICS can serve a wide range of users and assist cancer therapy by moving away from the ‘one-size-fits-all’ approach to patient care.
Availability and implementation: MOVICS package and online tutorial are freely available at https://github.com/xlucpu/MOVICS.
Supplementary information: Supplementary data are available at Bioinformatics online.
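As a point of reference for what "multi-omics integrative clustering" means, here is the naive baseline such methods improve on (this is not one of MOVICS's ten algorithms; all dimensions are arbitrary): scale each omic layer, concatenate, and cluster.

```r
## Naive multi-omics subtyping baseline: per-layer scaling + concatenation.
set.seed(3)
mrna <- matrix(rnorm(30 * 50), 30, 50)   # 30 patients x 50 genes (hypothetical)
meth <- matrix(rnorm(30 * 30), 30, 30)   # 30 patients x 30 CpG sites (hypothetical)
combined <- cbind(scale(mrna), scale(meth))
subtype  <- kmeans(combined, centers = 3, nstart = 25)$cluster
table(subtype)                           # patients per putative subtype
```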


1988 · Vol 8 (2) · pp. 125-128
Author(s): D. N. Churchill, D. W. Taylor, S. I. Vas, J. Singer, M. L. Beecroft, ...

A double-blind randomized controlled trial compared the effectiveness of prophylactic oral trimethoprim/sulfamethoxazole (cotrimoxazole) to a placebo in preventing peritonitis in continuous ambulatory peritoneal dialysis (CAPD) patients. A daily trimethoprim/sulfamethoxazole dose of 160/800 mg gives a steady-state dialysate concentration of 1.07/4.35 mg/L in the final dwell of each dosing interval. Identification of a 40% reduction in peritonitis probability with 80% statistical power and a type 1 error probability of 0.05 required 52 subjects per group. With stratification by previous peritonitis, 56 were allocated to cotrimoxazole and 49 to placebo. For cotrimoxazole there were five deaths and seven catheter losses; for placebo, three deaths and nine catheter losses. There were 20 withdrawals from the cotrimoxazole group and 9 from the placebo group. With respect to time to peritonitis, there was no statistically significant difference between the cotrimoxazole and placebo groups (p = 0.19). At 6 months, 64.1% of the cotrimoxazole group and 62.5% of the placebo group were peritonitis-free; at 12 months, 41.9% and 35%, respectively. There was no effect (p > 0.05) of age, sex, catheter care technique, spike or Luer connection, or dialysate additives. Previous peritonitis increased the risk of peritonitis by a factor of 2.06 (95% CI, 1.18–3.61), while frequent (every six weeks) extension tubing changes increased the risk by a factor of 1.79 (95% CI, 1.02–3.04) compared to changes every six months. Cotrimoxazole appears ineffective in prevention of CAPD peritonitis.
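The quoted sample-size requirement can be roughly reproduced with base R's power.prop.test, assuming (from the 12-month figures above) a ~65% peritonitis rate under placebo. The trial's own calculation was presumably time-to-event based, so this is only a sanity check:

```r
## Rough check of the reported sample size: a 40% relative reduction from an
## assumed 65% placebo peritonitis rate, 80% power, two-sided alpha = 0.05.
power.prop.test(p1 = 0.65, p2 = 0.65 * 0.6, power = 0.80, sig.level = 0.05)
## yields roughly 57 subjects per group, close to the reported 52
```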


1986 · Vol 20 (2) · pp. 189-200
Author(s): Kevin D. Bird, Wayne Hall

Statistical power is neglected in much psychiatric research, with the consequence that many studies do not provide a reasonable chance of detecting differences between groups if they exist in the population. This paper attempts to improve current practice by providing an introduction to the essential quantities required for performing a power analysis (sample size, effect size, type 1 and type 2 error rates). We provide simplified tables for estimating the sample size required to detect a specified size of effect with a type 1 error rate of α and a type 2 error rate of β, and for estimating the power provided by a given sample size for detecting a specified size of effect with a type 1 error rate of α. We show how to modify these tables to perform power analyses for multiple comparisons in univariate and some multivariate designs. Power analyses for each of these types of design are illustrated by examples.
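The quantities the paper tabulates (sample size, effect size, type 1 and type 2 error rates) can now be computed directly in base R, a compact companion to such tables; the effect size and n below are arbitrary example values:

```r
## Sample size per group to detect a standardized mean difference of 0.5
## with alpha = 0.05 and power = 0.80 (i.e., a type 2 error rate of 0.20):
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
## Power achieved by n = 30 per group for the same effect size:
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)
```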


2021
Author(s): Maximilian Maier, Daniel Lakens

The default use of an alpha level of 0.05 is suboptimal for two reasons. First, decisions based on data can be made more efficiently by choosing an alpha level that minimizes the combined Type 1 and Type 2 error rate. Second, in studies with very high statistical power, p-values lower than the alpha level can be more likely when the null hypothesis is true than when the alternative hypothesis is true (i.e., Lindley's paradox). This manuscript explains two approaches that can be used to justify a better choice of an alpha level than relying on the default threshold of 0.05. The first approach either minimizes or balances the Type 1 and Type 2 error rates. The second approach lowers the alpha level as a function of the sample size to prevent Lindley's paradox. An R package and Shiny app are provided to perform the required calculations. Both approaches have their limitations (e.g., the challenge of specifying relative costs and priors), but they can offer an improvement over current practices, especially when sample sizes are large. The use of alpha levels that have a better justification should improve statistical inferences and can increase the efficiency and informativeness of scientific research.
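The first approach can be sketched in base R without the authors' package: for a planned test, treat the combined error rate as a function of alpha and minimize it numerically. The design values (n = 50 per group, d = 0.5, equal weights on the two error types) are illustrative assumptions:

```r
## Weighted combined error rate for a two-sample t-test as a function of
## alpha: w1 * alpha + (1 - w1) * beta, where beta = 1 - power at that alpha.
combined_error <- function(alpha, n = 50, d = 0.5, w1 = 0.5) {
  beta <- 1 - power.t.test(n = n, delta = d, sd = 1, sig.level = alpha)$power
  w1 * alpha + (1 - w1) * beta
}
opt <- optimize(combined_error, interval = c(1e-5, 0.5))
opt$minimum   # the alpha level that minimizes the combined error rate
```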


2021
Author(s): Shing Wan Choi, Timothy Shin Heng Mak, Clive J. Hoggart, Paul F. O'Reilly

Background: Polygenic risk score (PRS) analyses are now routinely applied in biomedical research, with great hope that they will aid in our understanding of disease aetiology and contribute to personalized medicine. The continued growth of multi-cohort genome-wide association studies (GWASs) and large-scale biobank projects has provided researchers with a wealth of GWAS summary statistics and individual-level data suitable for performing PRS analyses. However, as the size of these studies increases, so does the risk of inter-cohort sample overlap and close relatedness. Ideally, sample overlap would be identified and removed directly, but this is typically not possible due to privacy laws or consent agreements. This sample overlap, whether known or not, is a major problem in PRS analyses because it can lead to inflation of type 1 error and, thus, erroneous conclusions in published work.
Results: Here, for the first time, we report the scale of the sample overlap problem for PRS analyses by generating known sample overlap across sub-samples of the UK Biobank data, which we then use to produce GWAS and target data to mimic the effects of inter-cohort sample overlap. We demonstrate that inter-cohort overlap results in a significant and often substantial inflation in the observed PRS-trait association, coefficient of determination (R2) and false-positive rate. This inflation can be high even when the absolute number of overlapping individuals is small, if this makes up a notable fraction of the target sample. We develop and introduce EraSOR (Erase Sample Overlap and Relatedness), a software for adjusting inflation in PRS prediction and association statistics in the presence of sample overlap or close relatedness between the GWAS and target samples. A key component of the EraSOR approach is inference of the degree of sample overlap from the intercept of a bivariate LD score regression applied to the GWAS and target data, making it well-powered in settings where both have sample sizes over 1,000 individuals. Through extensive benchmarking using UK Biobank and HapGen2 simulated genotype-phenotype data, we demonstrate that PRSs calculated using EraSOR-adjusted GWAS summary statistics are robust to inter-cohort overlap in a wide range of realistic scenarios, and are even robust to high levels of residual genetic and environmental stratification.
Conclusion: The results of all PRS analyses for which sample overlap cannot be definitively ruled out should be considered with caution, given the high type 1 error observed in the presence of even low overlap between base and target cohorts. Given the strong performance of EraSOR in eliminating inflation caused by sample overlap in PRS studies with large (>5k) target samples, we recommend that EraSOR be used in all future such PRS studies to mitigate the potential effects of inter-cohort overlap and close relatedness.
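The inflation mechanism itself is easy to reproduce in a toy simulation (this sketch is unrelated to EraSOR's adjustment, and all sample sizes are arbitrary): with no true genetic effects, a PRS built from GWAS estimates computed on a base sample that includes some target individuals still "predicts" the target phenotype through shared noise.

```r
## Null simulation: m independent SNPs, no true effects, yet overlap between
## the GWAS (base) and target samples inflates the PRS-trait association.
set.seed(7)
m <- 500; n_base <- 2000; n_target <- 500; n_overlap <- 250
n_total <- n_base + n_target - n_overlap
G <- matrix(rbinom(n_total * m, 2, 0.3), nrow = n_total)   # genotypes (0/1/2)
y <- rnorm(n_total)                        # phenotype: pure noise under the null
base   <- 1:n_base                         # GWAS sample
target <- (n_base - n_overlap + 1):n_total # target shares n_overlap subjects
beta_hat <- apply(G[base, ], 2, function(g) coef(lm(y[base] ~ g))[2])
prs <- as.vector(G[target, ] %*% beta_hat)
cor.test(prs, y[target])$p.value           # spuriously tiny despite a true null
```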


2017
Author(s): Daniel Lakens, Alexander Etz

Psychology journals rarely publish non-significant results. At the same time, it is often very unlikely (or ‘too good to be true’) that a set of studies yields exclusively significant results. Here, we use likelihood ratios to explain when sets of studies that contain a mix of significant and non-significant results are likely to be true, or ‘too true to be bad’. As we show, mixed results are not only likely to be observed in lines of research, but when observed, mixed results often provide evidence for the alternative hypothesis, given reasonable levels of statistical power and an adequately controlled low Type 1 error rate. Researchers should feel comfortable submitting such lines of research with an internal meta-analysis for publication. A better understanding of probabilities, accompanied by more realistic expectations of what real lines of studies look like, might be an important step in mitigating publication bias in the scientific literature.
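The paper's core calculation can be written in one line of base R: the likelihood of observing k significant results out of n studies is binomial, with success probability equal to the studies' power under the alternative and to the alpha level under the null. The power of 0.80 and alpha of 0.05 below are example values:

```r
## Likelihood ratio for a mixed set of results: k significant out of n.
mixed_lr <- function(k, n, power = 0.80, alpha = 0.05) {
  dbinom(k, n, power) / dbinom(k, n, alpha)
}
mixed_lr(k = 2, n = 3)  # ~54: two of three significant strongly favors H1
```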


2017 · Vol 8 (8) · pp. 875-881
Author(s): Daniël Lakens, Alexander J. Etz

Psychology journals rarely publish nonsignificant results. At the same time, it is often very unlikely (or “too good to be true”) that a set of studies yields exclusively significant results. Here, we use likelihood ratios to explain when sets of studies that contain a mix of significant and nonsignificant results are likely to be true or “too true to be bad.” As we show, mixed results are not only likely to be observed in lines of research but also, when observed, often provide evidence for the alternative hypothesis, given reasonable levels of statistical power and an adequately controlled low Type 1 error rate. Researchers should feel comfortable submitting such lines of research with an internal meta-analysis for publication. A better understanding of probabilities, accompanied by more realistic expectations of what real sets of studies look like, might be an important step in mitigating publication bias in the scientific literature.


2020 · Vol 36 (10) · pp. 3276-3278
Author(s): Alemu Takele Assefa, Jo Vandesompele, Olivier Thas

Abstract
Summary: SPsimSeq is a semi-parametric simulation method to generate bulk and single-cell RNA-sequencing data. It is designed to simulate gene expression data with maximal retention of the characteristics of real data. It is reasonably flexible to accommodate a wide range of experimental scenarios, including different sample sizes, biological signals (differential expression) and confounding batch effects.
Availability and implementation: The R package and associated documentation are available from https://github.com/CenterForStatistics-UGent/SPsimSeq.
Supplementary information: Supplementary data are available at Bioinformatics online.
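The general flavor of semi-parametric simulation, keeping the empirical distribution of real counts while spiking in known signal, can be caricatured in a few lines. This is emphatically not SPsimSeq's actual algorithm (which estimates distributions from real data), and all numbers are arbitrary:

```r
## Caricature of semi-parametric RNA-seq simulation: resample real samples
## to keep realistic count distributions, then add known fold changes.
set.seed(9)
real_counts <- matrix(rnbinom(1000 * 10, mu = 50, size = 2), 1000, 10)  # stand-in for real data
sim <- real_counts[, sample(10, 20, replace = TRUE)]  # 20 simulated samples
group <- rep(1:2, each = 10)
de_genes <- sample(1000, 100)                         # 10% of genes made DE
sim[de_genes, group == 2] <- round(sim[de_genes, group == 2] * 2)  # 2-fold signal
```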


2020 · Vol 36 (8) · pp. 2345-2351
Author(s): Xinyan Zhang, Nengjun Yi

Abstract
Motivation: Longitudinal metagenomics data, including both 16S rRNA and whole-metagenome shotgun sequencing data, have enhanced our ability to understand the dynamic associations between the human microbiome and various diseases. However, analytic tools have not been fully developed to simultaneously address the main challenges of longitudinal metagenomics data, i.e. high-dimensionality, dependence among samples and zero-inflation of observed counts.
Results: We propose a fast zero-inflated negative binomial mixed modeling (FZINBMM) approach to analyze high-dimensional longitudinal metagenomic count data. The FZINBMM approach is based on zero-inflated negative binomial mixed models (ZINBMMs) for modeling longitudinal metagenomic count data and a fast EM-IWLS algorithm for fitting ZINBMMs. FZINBMM takes advantage of a commonly used procedure for fitting linear mixed models, which allows us to include various types of fixed and random effects and within-subject correlation structures and to quickly analyze many taxa. We found that FZINBMM was statistically comparable with two R packages that use numerical integration to fit ZINBMMs, GLMMadaptive and glmmTMB, while remarkably outperforming them in computational efficiency. Extensive simulations and real data applications showed that FZINBMM outperformed previous methods, including linear mixed models, negative binomial mixed models and zero-inflated Gaussian mixed models.
Availability and implementation: FZINBMM has been implemented in the R package NBZIMM, available in the public GitHub repository http://github.com/nyiuab/NBZIMM.
Supplementary information: Supplementary data are available at Bioinformatics online.
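For orientation, this is what fitting a ZINBMM to a single taxon looks like with glmmTMB, one of the comparator packages named above (the data are simulated here; FZINBMM's own interface in NBZIMM will differ):

```r
library(glmmTMB)
## Simulated long-format data for one taxon (hypothetical structure).
set.seed(11)
d <- data.frame(subject = factor(rep(1:20, each = 4)),
                time    = rep(0:3, times = 20),
                count   = rnbinom(80, mu = 5, size = 1))
d$count[sample(80, 20)] <- 0          # extra zeros to mimic zero-inflation
fit <- glmmTMB(count ~ time + (1 | subject),  # random intercept per subject
               ziformula = ~ 1,       # zero-inflation component
               family    = nbinom2,   # negative binomial response
               data      = d)
summary(fit)
```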

