Shrinkage improves estimation of microbial associations under different normalization methods

2020, Vol 2 (4)
Author(s): Michelle Badri, Zachary D. Kurtz, Richard Bonneau, Christian L. Müller

Abstract Estimation of statistical associations in microbial genomic survey count data is fundamental to microbiome research. Experimental limitations, including count compositionality, low sample sizes and technical variability, obstruct standard application of association measures and require data normalization prior to statistical estimation. Here, we investigate the interplay between data normalization, microbial association estimation and available sample size by leveraging the large-scale American Gut Project (AGP) survey data. We analyze the statistical properties of two prominent linear association estimators, correlation and proportionality, under different sample scenarios and data normalization schemes, including RNA-seq analysis workflows and log-ratio transformations. We show that shrinkage estimation, a standard statistical regularization technique, can universally improve the quality of taxon–taxon association estimates for microbiome data. We find that large-scale association patterns in the AGP data can be grouped into five normalization-dependent classes. Using microbial association network construction and clustering as downstream data analysis examples, we show that variance-stabilizing and log-ratio approaches enable the most taxonomically and structurally coherent estimates. Taken together, the findings from our reproducible analysis workflow have important implications for microbiome studies in multiple stages of analysis, particularly when only small sample sizes are available.
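
The shrinkage step described above can be sketched briefly. The following is a minimal illustration, not the paper's pipeline: it assumes a toy count matrix, applies a centered log-ratio (CLR) transform, and uses scikit-learn's Ledoit-Wolf estimator as one standard shrinkage covariance estimator for comparison against the naive sample correlation.

```python
# Minimal sketch: shrinkage estimation of taxon-taxon associations on
# CLR-transformed counts. Toy data and the Ledoit-Wolf estimator are
# illustrative choices, not the paper's exact pipeline.
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
counts = rng.poisson(lam=20, size=(25, 40)) + 1   # 25 samples x 40 taxa; pseudocount avoids log(0)

# Centered log-ratio (CLR) transform: log counts minus each sample's mean log count
log_x = np.log(counts)
clr = log_x - log_x.mean(axis=1, keepdims=True)

# Shrinkage covariance (Ledoit-Wolf), converted to a correlation matrix
cov = LedoitWolf().fit(clr).covariance_
d = np.sqrt(np.diag(cov))
shrunk_corr = cov / np.outer(d, d)

# Naive sample correlation for comparison; shrinkage pulls noisy
# off-diagonal entries toward zero, which matters most at small n
naive_corr = np.corrcoef(clr, rowvar=False)
off_diag = ~np.eye(40, dtype=bool)
print("mean |association|, shrunk vs. naive:",
      np.abs(shrunk_corr[off_diag]).mean(), np.abs(naive_corr[off_diag]).mean())
```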

2018
Author(s): Michelle Badri, Zachary D. Kurtz, Richard Bonneau, Christian L. Müller

ABSTRACT Consistent estimation of associations in microbial genomic survey count data is fundamental to microbiome research. Technical limitations, including compositionality, low sample sizes, and technical variability, obstruct standard application of association measures and require data normalization prior to estimating associations. Here, we investigate the interplay between data normalization and microbial association estimation by a comprehensive analysis of statistical consistency. Leveraging the large sample size of the American Gut Project (AGP), we assess the consistency of two prominent linear association estimators, correlation and proportionality, under different sample scenarios and data normalization schemes, including RNA-seq analysis workflows and log-ratio transformations. We show that shrinkage estimation, a standard technique in high-dimensional statistics, can universally improve the quality of association estimates for microbiome data. We find that large-scale association patterns in the AGP data can be grouped into five normalization-dependent classes. Using microbial association network construction and clustering as examples of exploratory data analysis, we show that variance-stabilizing and log-ratio approaches provide the most consistent estimates of taxonomic and structural coherence. Taken together, the findings from our reproducible analysis workflow have important implications for microbiome studies in multiple stages of analysis, particularly when only small sample sizes are available.
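
As a rough companion to the exploratory analyses mentioned above (association network construction and clustering), the sketch below thresholds an estimated association matrix into a graph and clusters it. The threshold value, the random toy matrix, and the use of networkx's greedy modularity communities are illustrative assumptions, not the study's exact choices.

```python
# Minimal sketch: build a taxon-taxon association network from an
# estimated association matrix and cluster it. The threshold and the
# clustering routine are illustrative, not the study's exact choices.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def association_network(assoc, taxa, threshold=0.3):
    """Keep an edge wherever the absolute association exceeds the threshold."""
    g = nx.Graph()
    g.add_nodes_from(taxa)
    for i in range(len(taxa)):
        for j in range(i + 1, len(taxa)):
            if abs(assoc[i, j]) >= threshold:
                g.add_edge(taxa[i], taxa[j], weight=abs(assoc[i, j]))
    return g

# Toy input: a random symmetric "association" matrix over 10 taxa
rng = np.random.default_rng(1)
a = rng.uniform(-1, 1, size=(10, 10))
assoc = (a + a.T) / 2
np.fill_diagonal(assoc, 1.0)
taxa = [f"taxon_{i}" for i in range(10)]

g = association_network(assoc, taxa)
g.remove_nodes_from(list(nx.isolates(g)))          # drop taxa with no retained edges
clusters = greedy_modularity_communities(g, weight="weight")
print([sorted(c) for c in clusters])
```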


2021, Vol 21 (1)
Author(s): Stephen P. Fortin, Stephen S. Johnston, Martijn J. Schuemie

Abstract
Background: Cardinality matching (CM), a novel matching technique, finds the largest matched sample meeting prespecified balance criteria, thereby overcoming limitations of propensity score matching (PSM) associated with limited covariate overlap, which are especially pronounced in studies with small sample sizes. The current study proposes a framework for large-scale CM (LS-CM) and compares large-scale PSM (LS-PSM) and LS-CM in terms of post-match sample size, covariate balance and residual confounding at progressively smaller sample sizes.
Methods: We evaluated LS-PSM and LS-CM within a comparative cohort study of new users of angiotensin-converting enzyme inhibitor (ACEI) and thiazide or thiazide-like diuretic monotherapy identified from a U.S. insurance claims database. Candidate covariates included patient demographics and all observed prior conditions, drug exposures and procedures. Propensity scores were calculated using LASSO regression, and candidate covariates with non-zero beta coefficients in the propensity model were defined as matching covariates for use in LS-CM. One-to-one matching was performed using progressively tighter parameter settings. Covariate balance was assessed using standardized mean differences. Hazard ratios for negative control outcomes presumed to be unassociated with treatment (i.e., true hazard ratio of 1) were estimated using unconditional Cox models. Residual confounding was assessed using the expected systematic error of the empirical null distribution of negative control effect estimates compared to the ground truth. To simulate diverse research conditions, analyses were repeated within 10%, 1% and 0.5% subsample groups with increasingly limited covariate overlap.
Results: A total of 172,117 patients (ACEI: 129,078; thiazide: 43,039) met the study criteria. Compared to LS-PSM, LS-CM was associated with increased sample retention. Although LS-PSM achieved balance across all matching covariates within the full study population, substantial matching covariate imbalance was observed within the 1% and 0.5% subsample groups. Meanwhile, LS-CM achieved matching covariate balance across all analyses. LS-PSM was associated with better candidate covariate balance within the full study population. Otherwise, both matching techniques achieved comparable candidate covariate balance and expected systematic error.
Conclusions: LS-CM found the largest matched sample meeting prespecified balance criteria while achieving comparable candidate covariate balance and residual confounding. We recommend LS-CM as an alternative to LS-PSM in studies with small sample sizes or limited covariate overlap.
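
Two pieces of the Methods lend themselves to a short sketch: the LASSO propensity model used to select matching covariates, and the standardized mean difference (SMD) used as the balance criterion. The data, variable names, and thresholds below are hypothetical, and cardinality matching itself (an integer-programming step) is not shown.

```python
# Minimal sketch of two steps from the Methods: (1) an L1-penalized
# (LASSO) logistic propensity model whose non-zero coefficients define
# the matching covariates, and (2) standardized mean differences (SMD)
# as the balance criterion. Data are simulated placeholders; the
# cardinality-matching optimization itself is omitted.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n, p = 2000, 20
X = rng.normal(size=(n, p))                                             # candidate covariates
treat = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1]))))   # exposure indicator

# LASSO propensity model; covariates with non-zero coefficients would be
# carried forward as matching covariates for cardinality matching
ps_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
ps_model.fit(X, treat)
matching_covs = np.flatnonzero(ps_model.coef_.ravel())
print("matching covariates:", matching_covs)

def smd(x, t):
    """Standardized mean difference between exposed and comparator groups."""
    x1, x0 = x[t == 1], x[t == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled_sd

balance = np.array([smd(X[:, j], treat) for j in matching_covs])
print("matching covariates with |SMD| > 0.1:", int((np.abs(balance) > 0.1).sum()))
```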


2021, pp. 107385842110170
Author(s): Brian P. Johnson, Eran Dayan, Nitzan Censor, Leonardo G. Cohen

Behavioral research in cognitive and human systems neuroscience has been largely carried out in person in laboratory settings. Underpowering and lack of reproducibility due to small sample sizes have weakened the conclusions of these investigations. In other disciplines, such as neuroeconomics and the social sciences, crowdsourcing has been extensively used as a data collection tool and a means to increase sample sizes. Recent methodological advances allow scientists, for the first time, to test more complex cognitive, perceptual, and motor tasks online. Here we review the nascent literature on the use of online crowdsourcing in cognitive and human systems neuroscience. These investigations take advantage of the ability to reliably track the activity of a participant’s computer keyboard, mouse, and eye gaze in large-scale online studies that involve diverse research participant pools. Crowdsourcing allows for testing the generalizability of behavioral hypotheses in real-life environments that are less accessible to lab-designed investigations. Crowdsourcing is further useful when in-laboratory studies are limited, for example during the current COVID-19 pandemic. We also discuss current limitations of crowdsourcing research and suggest pathways to address them. We conclude that online crowdsourcing is likely to widen the scope and strengthen the conclusions of cognitive and human systems neuroscience investigations.


2018
Author(s): Christopher Chabris, Patrick Ryan Heck, Jaclyn Mandart, Daniel Jacob Benjamin, Daniel J. Simons

Williams and Bargh (2008) reported that holding a hot cup of coffee caused participants to judge a person’s personality as warmer, and that holding a therapeutic heat pad caused participants to choose rewards for other people rather than for themselves. These experiments featured large effects (r = .28 and .31), small sample sizes (41 and 53 participants), and barely statistically significant results. We attempted to replicate both experiments in field settings with more than triple the sample sizes (128 and 177) and double-blind procedures, but found near-zero effects (r = –.03 and .02). In both cases, Bayesian analyses suggest there is substantially more evidence for the null hypothesis of no effect than for the original physical warmth priming hypothesis.
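
The Bayesian comparison referred to above can be roughed out from the reported correlations and sample sizes. The sketch below assumes pingouin's default Bayes factor for a Pearson correlation as the tool; this is illustrative tooling, not the authors' exact analysis.

```python
# Rough sketch of the kind of Bayesian evidence comparison described above:
# Bayes factors for the reported correlations, original vs. replication.
# pingouin's default Pearson-correlation Bayes factor is assumed here as
# the tool; it is not the authors' exact analysis.
import pingouin as pg

studies = {
    "original exp. 1": (0.28, 41),
    "original exp. 2": (0.31, 53),
    "replication exp. 1": (-0.03, 128),
    "replication exp. 2": (0.02, 177),
}

for label, (r, n) in studies.items():
    bf10 = float(pg.bayesfactor_pearson(r, n))   # evidence for an effect over the null
    print(f"{label}: r = {r:+.2f}, n = {n}, BF10 = {bf10:.2f}, BF01 = {1 / bf10:.2f}")
```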


2021, Vol 11 (6), pp. 497
Author(s): Yoonsuk Jung, Eui Im, Jinhee Lee, Hyeah Lee, Changmo Moon

Previous studies have evaluated the effects of antithrombotic agents on the performance of fecal immunochemical tests (FITs) for the detection of colorectal cancer (CRC), but the results were inconsistent and based on small sample sizes. We studied this topic using a large-scale population-based database. Using the Korean National Cancer Screening Program Database, we compared the performance of FITs for CRC detection between users and non-users of antiplatelet agents and warfarin. Non-users were matched according to age and sex. Among 5,426,469 eligible participants, 768,733 used antiplatelet agents (mono/dual/triple therapy, n = 701,683/63,211/3839), and 19,569 used warfarin, while 4,638,167 were non-users. Among antiplatelet agents, aspirin, clopidogrel, and cilostazol ranked first, second, and third, respectively, in terms of prescription rates. Users of antiplatelet agents (3.62% vs. 4.45%; relative risk (RR): 0.83; 95% confidence interval (CI): 0.78–0.88), aspirin (3.66% vs. 4.13%; RR: 0.90; 95% CI: 0.83–0.97), and clopidogrel (3.48% vs. 4.88%; RR: 0.72; 95% CI: 0.61–0.86) had lower positive predictive values (PPVs) for CRC detection than non-users. However, there were no significant differences in PPV between cilostazol users and non-users, or between warfarin users and non-users. For PPV, the RR (users vs. non-users) for antiplatelet monotherapy was 0.86, while the RRs for dual and triple antiplatelet therapies (excluding cilostazol) were 0.67 and 0.22, respectively. For all antithrombotic agents, the sensitivity for CRC detection did not differ between users and non-users. Use of antiplatelet agents, except cilostazol, may increase false positives without improving the sensitivity of FITs for CRC detection.
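
The PPV and relative-risk figures above follow standard 2x2 arithmetic, sketched below. The counts are hypothetical placeholders (chosen only to roughly reproduce the reported PPVs), and the interval is the usual Wald CI on the log relative risk.

```python
# Worked sketch of the PPV / relative-risk arithmetic reported above.
# The 2x2 counts are hypothetical placeholders chosen only to roughly
# reproduce the reported PPVs; the interval is a Wald CI on log(RR).
import math

def ppv(true_pos, false_pos):
    """Positive predictive value: CRC cases among FIT-positive participants."""
    return true_pos / (true_pos + false_pos)

def relative_risk(a, n1, c, n0, z=1.96):
    """RR of CRC given a positive FIT, group 1 (users) vs. group 0 (non-users)."""
    rr = (a / n1) / (c / n0)
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    lo, hi = (math.exp(math.log(rr) + s * z * se) for s in (-1, 1))
    return rr, lo, hi

# Hypothetical counts: CRC cases and totals among FIT-positive participants
users_crc, users_fit_pos = 362, 10_000
nonusers_crc, nonusers_fit_pos = 4_450, 100_000

print("PPV, users:", ppv(users_crc, users_fit_pos - users_crc))
print("PPV, non-users:", ppv(nonusers_crc, nonusers_fit_pos - nonusers_crc))
print("RR (95% CI):", relative_risk(users_crc, users_fit_pos, nonusers_crc, nonusers_fit_pos))
```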


2021, Vol 11 (1)
Author(s): Florent Le Borgne, Arthur Chatton, Maxime Léger, Rémi Lenain, Yohann Foucher

Abstract In clinical research, there is a growing interest in the use of propensity score-based methods to estimate causal effects. G-computation (GC) is an alternative because of its high statistical power. Machine learning is also increasingly used because of its possible robustness to model misspecification. In this paper, we aimed to propose an approach that combines machine learning and G-computation when both the outcome and the exposure status are binary and that is able to deal with small samples. We evaluated the performance of several methods, including penalized logistic regressions, a neural network, a support vector machine, boosted classification and regression trees, and a super learner through simulations. We proposed six different scenarios characterised by various sample sizes, numbers of covariates, and relationships between covariates, exposure statuses, and outcomes. We also illustrated the application of these methods by using them to estimate the efficacy of barbiturates prescribed during the first 24 h of an episode of intracranial hypertension. In the context of GC, for estimating the individual outcome probabilities in the two counterfactual worlds, we found that the super learner tended to outperform the other approaches in terms of both bias and variance, especially for small sample sizes. The support vector machine also performed well, but its mean bias was slightly higher than that of the super learner. In the investigated scenarios, G-computation combined with the super learner was a performant method for drawing causal inferences, even from small sample sizes.
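
The G-computation procedure described above can be sketched for a binary exposure and outcome: fit an outcome model on exposure plus covariates, predict every subject's outcome probability under exposure and under no exposure, and average the two counterfactual predictions. The sketch below uses scikit-learn's StackingClassifier as a stand-in for a super learner; the learners and the simulated data are illustrative, not the paper's exact setup.

```python
# Minimal sketch of G-computation with a stacked ("super learner"-style)
# outcome model for a binary exposure and outcome. The learners, data,
# and stacking setup are illustrative stand-ins, not the paper's models.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(7)
n = 500
X = rng.normal(size=(n, 5))                          # baseline covariates
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))      # binary exposure
logit_y = -1 + 0.8 * A + 0.5 * X[:, 1]               # true exposure effect on the outcome
Y = rng.binomial(1, 1 / (1 + np.exp(-logit_y)))      # binary outcome

# Outcome model Q(A, X): stack several learners with a logistic meta-learner
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("gbt", GradientBoostingClassifier()),
        ("svm", SVC(probability=True)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
design = np.column_stack([A, X])
stack.fit(design, Y)

# G-computation: predict each subject's outcome probability under A=1 and A=0,
# then average the two counterfactual predictions over the sample
d1 = np.column_stack([np.ones(n), X])
d0 = np.column_stack([np.zeros(n), X])
p1 = stack.predict_proba(d1)[:, 1].mean()
p0 = stack.predict_proba(d0)[:, 1].mean()
print(f"marginal risk under exposure: {p1:.3f}, under no exposure: {p0:.3f}")
print(f"risk difference: {p1 - p0:.3f}, marginal odds ratio: {(p1/(1-p1))/(p0/(1-p0)):.3f}")
```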


2013, Vol 113 (1), pp. 221-224
Author(s): David R. Johnson, Lauren K. Bachan

In a recent article, Regan, Lakhanpal, and Anguiano (2012) highlighted the lack of evidence for different relationship outcomes between arranged and love-based marriages. Yet the sample size (n = 58) used in the study is insufficient for making such inferences. This reply discusses and demonstrates how small sample sizes reduce the utility of this research.
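
The insufficiency argument can be made concrete with a standard power calculation, sketched below. The even 29/29 split, the two-sided alpha of 0.05, and the 80% power target are assumptions for illustration, not details from the reply.

```python
# Rough illustration of the sample-size point above: with n = 58 split
# across two groups, only fairly large effects are detectable with
# conventional power. The 29/29 split and 80% power target are assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
detectable_d = analysis.solve_power(nobs1=29, ratio=1.0, alpha=0.05, power=0.80)
print(f"minimum detectable effect size (Cohen's d) at n = 58: {detectable_d:.2f}")

# Conversely, power to detect a medium effect (d = 0.5) with n = 58:
power_medium = analysis.solve_power(effect_size=0.5, nobs1=29, ratio=1.0, alpha=0.05)
print(f"power for d = 0.5: {power_medium:.2f}")
```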

