LOCOM: A logistic regression model for testing differential abundance in compositional microbiome data with false discovery rate control

2021 ◽  
Author(s):  
Yingtian Hu ◽  
Glen A. Satten ◽  
Yi-Juan Hu

Abstract Motivation: Compositional analysis is based on the premise that a relatively small proportion of taxa are “differentially abundant”, while the ratios of the relative abundances of the remaining taxa remain unchanged. Most existing methods of compositional analysis such as ANCOM or ANCOM-BC use log-transformed data, but log-transformation of data with pervasive zero counts is problematic, and these methods cannot always control the false discovery rate (FDR). Further, high-throughput microbiome data such as 16S amplicon or metagenomic sequencing are subject to experimental biases that are introduced in every step of the experimental workflow. McLaren, Willis and Callahan [1] have recently proposed a model for how these biases affect relative abundance data. Methods: Motivated by [1], we show that the (log) odds ratios in a logistic regression comparing counts in two taxa are invariant to experimental biases. With this motivation, we propose LOCOM, a robust logistic regression approach to compositional analysis, that does not require pseudocounts. We use a Firth bias-corrected estimating function to account for sparse data. Inference is based on permutation to account for overdispersion and small sample sizes. Traits can be either binary or continuous, and adjustment for continuous and/or discrete confounding covariates is supported. Results: Our simulations indicate that LOCOM always preserved FDR and had much improved sensitivity over existing methods. In contrast, ANCOM often had inflated FDR; ANCOM-BC largely controlled FDR but still had modest inflation occasionally; ALDEx2 generally had low sensitivity. LOCOM and ANCOM were robust to experimental biases in every situation, while ANCOM-BC and ALDEx2 had elevated FDR when biases at causal and non-causal taxa were differentially distributed. The flexibility of our method for a variety of microbiome studies is illustrated by the analysis of data from two microbiome studies. Availability and implementation: Our R package LOCOM is available on GitHub at https://github.com/yijuanhu/LOCOM in formats appropriate for Macintosh or Windows.
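
The bias-invariance claim in the Methods can be checked numerically. The sketch below assumes a multiplicative bias model of the kind attributed to [1], in which each taxon's observed abundance is its true abundance times a taxon-specific efficiency factor, followed by renormalization. The abundances, efficiency factors, and function names are invented for illustration; this is not the LOCOM implementation.

```python
import math

def log_odds_ratio(group0, group1, taxon, ref):
    """Log odds ratio of `taxon` versus `ref` between two groups of relative abundances."""
    return math.log(group1[taxon] / group1[ref]) - math.log(group0[taxon] / group0[ref])

def apply_bias(abundances, efficiency):
    """Multiply each taxon by its (hypothetical) efficiency factor, then renormalize."""
    biased = [a * e for a, e in zip(abundances, efficiency)]
    total = sum(biased)
    return [b / total for b in biased]

# Hypothetical true relative abundances of three taxa in two groups.
true_g0 = [0.5, 0.3, 0.2]
true_g1 = [0.2, 0.3, 0.5]
# Hypothetical taxon-specific measurement efficiencies (same in both groups).
efficiency = [4.0, 1.0, 0.25]

obs_g0 = apply_bias(true_g0, efficiency)
obs_g1 = apply_bias(true_g1, efficiency)

# The efficiency factors cancel in the ratio of ratios, so both values agree.
lor_true = log_odds_ratio(true_g0, true_g1, taxon=0, ref=1)
lor_obs = log_odds_ratio(obs_g0, obs_g1, taxon=0, ref=1)
print(lor_true, lor_obs)
```

The observed proportions differ from the true ones, yet the log odds ratio is unchanged because each taxon's factor appears in both groups and cancels.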


2019 ◽  
Vol 35 (17) ◽  
pp. 3184-3186
Author(s):  
Xiao-Fei Zhang ◽  
Le Ou-Yang ◽  
Shuo Yang ◽  
Xiaohua Hu ◽  
Hong Yan

Abstract Summary To identify biological network rewiring under different conditions, we develop a user-friendly R package, named DiffNetFDR, to implement two methods developed for testing the difference in different Gaussian graphical models. Compared to existing tools, our methods have the following features: (i) they are based on Gaussian graphical models which can capture the changes of conditional dependencies; (ii) they determine the tuning parameters in a data-driven manner; (iii) they take a multiple testing procedure to control the overall false discovery rate; and (iv) our approach defines the differential network based on partial correlation coefficients so that the spurious differential edges caused by the variants of conditional variances can be excluded. We also develop a Shiny application to provide easier analysis and visualization. Simulation studies are conducted to evaluate the performance of our methods. We also apply our methods to two real gene expression datasets. The effectiveness of our methods is validated by the biological significance of the identified differential networks. Availability and implementation R package and Shiny app are available at https://github.com/Zhangxf-ccnu/DiffNetFDR. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
David Gerard

Abstract Many bioinformatics pipelines include tests for equilibrium. Tests for diploids are well studied and widely available, but extending these approaches to autopolyploids is hampered by the presence of double reduction, the co-migration of sister chromatid segments into the same gamete during meiosis. Though a hindrance for equilibrium tests, double reduction rates are quantities of interest in their own right, as they provide insights about the meiotic behavior of autopolyploid organisms. Here, we develop procedures to (i) test for equilibrium while accounting for double reduction, and (ii) estimate double reduction given equilibrium. To do so, we take two approaches: a likelihood approach, and a novel U-statistic minimization approach that we show generalizes the classical equilibrium χ² test in diploids. For small sample sizes and uncertain genotypes, we further develop a bootstrap procedure based on our U-statistic to test for equilibrium. Finally, we highlight the difficulty in distinguishing between random mating and equilibrium in tetraploids at biallelic loci. Our methods are implemented in the hwep R package on GitHub at https://github.com/dcgerard/hwep.
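
For orientation, the classical diploid χ² test that the abstract refers to can be sketched as follows. This is only the textbook biallelic diploid case, not the hwep U-statistic or likelihood procedures; the function name and genotype counts are illustrative assumptions.

```python
import math

def hwe_chisq_test(n_aa, n_ab, n_bb):
    """Classical chi-square test for Hardy-Weinberg equilibrium at a biallelic diploid locus.

    Returns (statistic, p_value) using 1 degree of freedom.
    Assumes both alleles are present, so no expected count is zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical genotype counts: the first matches equilibrium exactly,
# the second has a strong heterozygote deficit.
print(hwe_chisq_test(25, 50, 25))
print(hwe_chisq_test(40, 20, 40))
```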


Open Biology ◽  
2021 ◽  
Vol 11 (2) ◽  
pp. 200182
Author(s):  
Siriluck Ponsuksili ◽  
Michael Oster ◽  
Henry Reyer ◽  
Frieder Hadlich ◽  
Nares Trakooljul ◽  
...  

Improved utilization of phytates and mineral phosphorus (P) in monogastric animals contributes significantly to preserving the finite resource of mineral P and mitigating environmental pollution. In order to identify pathways and to prioritize candidate genes related to P utilization (PU), the genomic heritability of 77 and 80 trait-dependent expressed miRNAs and mRNAs in 482 Japanese quail were estimated and eQTL (expression quantitative trait loci) were detected. In total, 104 miR-eQTL (microRNA expression quantitative trait loci) were associated with SNP markers (false discovery rate less than 10%) including 41 eQTL of eight miRNAs. Similarly, 944 mRNA-eQTL were identified at the 5% false discovery rate threshold, with 573 being cis-eQTL of 36 mRNAs. High heritabilities of miRNA and mRNA expression coincide with highly significant eQTL. Integration of phenotypic data with transcriptome and microbiome data of the same animals revealed genetically regulated mRNA and miRNA transcripts (SMAD3, CAV1, ENNPP6, ATP2B4, miR-148a-3p, miR-146b-5p, miR-16-5p, miR-194, miR-215-5p, miR-199-3p, miR-1388a-3p) and microbes (Candidatus Arthromitus, Enterococcus) that are associated with PU. The results reveal novel insights into the role of mRNAs and miRNAs in host gut tissue functions, which are involved in PU and other related traits, in terms of the genetic regulation and inheritance of their expression and in association with microbiota components.


mSystems ◽  
2017 ◽  
Vol 2 (6) ◽  
Author(s):  
Lingjing Jiang ◽  
Amnon Amir ◽  
James T. Morton ◽  
Ruth Heller ◽  
Ery Arias-Castro ◽  
...  

ABSTRACT DS-FDR can achieve higher statistical power to detect significant findings in sparse and noisy microbiome data compared to the commonly used Benjamini-Hochberg procedure and other FDR-controlling procedures. Differential abundance testing is a critical task in microbiome studies that is complicated by the sparsity of data matrices. Here we adapt for microbiome studies a solution from the field of gene expression analysis to produce a new method, discrete false-discovery rate (DS-FDR), that greatly improves the power to detect differential taxa by exploiting the discreteness of the data. Additionally, DS-FDR is relatively robust to the number of noninformative features, and thus removes the problem of filtering taxonomy tables by an arbitrary abundance threshold. We show by using a combination of simulations and reanalysis of nine real-world microbiome data sets that this new method outperforms existing methods at the differential abundance testing task, producing a false-discovery rate that is up to threefold more accurate, and halves the number of samples required to find a given difference (thus increasing the efficiency of microbiome experiments considerably). We therefore expect DS-FDR to be widely applied in microbiome studies. IMPORTANCE DS-FDR can achieve higher statistical power to detect significant findings in sparse and noisy microbiome data compared to the commonly used Benjamini-Hochberg procedure and other FDR-controlling procedures.
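
As a point of reference, the Benjamini-Hochberg procedure that DS-FDR is compared against can be sketched in a few lines. This is the standard step-up baseline only, not DS-FDR itself; the function name and the example p-values are illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of hypotheses rejected by the BH step-up procedure."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank k with p_(k) <= alpha * k / m.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff_rank])

# Hypothetical p-values for three taxa.
print(benjamini_hochberg([0.02, 0.03, 0.035]))
```

Because the procedure is step-up, the smallest p-value in this example is rejected even though it misses its own threshold (0.02 > 0.05/3): a larger p-value further down the list passes its threshold.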


2020 ◽  
Vol 18 (2) ◽  
pp. 2-18
Author(s):  
David A. Walker ◽  
Thomas J. Smith

This study examined the impact of sparse data conditions among one or more predictor variables in logistic regression and assessed the effectiveness of the Firth (1993) procedure in reducing potential parameter estimation bias. Results indicated that sparseness in binary predictors introduces bias that is substantial with small sample sizes, and that the Firth procedure can effectively correct this bias.
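
A small worked case shows why sparseness matters. The sketch below relies on a known special case: for a logistic model with a single binary predictor (a 2×2 table), the Firth-penalized estimate coincides with the ordinary estimate computed after adding 0.5 to each cell. The table counts and function names are hypothetical, and general Firth regression with several predictors needs an iterative fit that is not shown here.

```python
import math

def log_odds_ratio_mle(a, b, c, d):
    """Maximum likelihood log odds ratio for a 2x2 table.

    a, b: events / non-events with the predictor present
    c, d: events / non-events with the predictor absent
    Returns None when any cell is zero, since the estimate is then not finite.
    """
    if 0 in (a, b, c, d):
        return None
    return math.log((a * d) / (b * c))

def log_odds_ratio_firth_2x2(a, b, c, d):
    """Firth-type estimate for the 2x2 special case: add 0.5 to every cell."""
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return math.log((a * d) / (b * c))

# Hypothetical sparse table: no non-events when the predictor is present.
mle = log_odds_ratio_mle(5, 0, 2, 8)
firth = log_odds_ratio_firth_2x2(5, 0, 2, 8)
print(mle, firth)
```

The unpenalized estimate does not exist for this table, while the penalized one is finite.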


Neurosurgery ◽  
2017 ◽  
Vol 80 (5) ◽  
pp. 769-777 ◽  
Author(s):  
Lucas R. Philipp ◽  
D. Jay McCracken ◽  
Courtney E. McCracken ◽  
Sameer H. Halani ◽  
Brendan P. Lovasik ◽  
...  

Abstract BACKGROUND: Computerized tomography angiography (CTA) is commonly used to diagnose ruptured cerebral aneurysms with sensitivities reported as high as 97% to 100%. Studies validating CTA accuracy in the setting of subarachnoid hemorrhage (SAH) are scarce and limited by small sample sizes. OBJECTIVE: To evaluate the diagnostic accuracy of CTA in detecting intracranial aneurysms in the setting of SAH. METHODS: A single-center, retrospective cohort of 643 patients was reviewed. A total of 401 patients were identified whose diagnostic workup included both CTA and confirmatory digital subtraction angiography (DSA). Aneurysms missed by CTA but diagnosed by DSA were further stratified by size and location. RESULTS: Three hundred and thirty aneurysms were detected by CTA while DSA detected a total of 431 aneurysms. False positive CTA results were seen for 24 aneurysms. DSA identified 125 aneurysms that were missed by CTA and 83.2% of those were <5 mm in diameter. The sensitivity of CTA was 57.6% for aneurysms smaller than 5 mm in size, and 45% for aneurysms originating from the internal carotid artery. The overall sensitivity of CTA in the setting of SAH was 70.7%. CONCLUSION: The accuracy of CTA in the diagnosis of ruptured intracranial aneurysm may be lower than previously reported. CTA has a low sensitivity for aneurysms less than 5 mm in size, in locations adjacent to bony structures, and for those arising from small caliber parent vessels. It is our recommendation that CTA should be used with caution when used alone in the diagnosis of ruptured intracranial aneurysms.


Author(s):  
Xin Bai ◽  
Jie Ren ◽  
Yingying Fan ◽  
Fengzhu Sun

Abstract Motivation The rapid development of sequencing technologies has enabled us to generate a large number of metagenomic reads from genetic materials in microbial communities, making it possible to gain deep insights into the differences between the genetic materials of different groups of microorganisms, such as bacteria, viruses, plasmids, etc. Computational methods based on k-mer frequencies have been shown to be highly effective for classifying metagenomic sequencing reads into different groups. However, such methods usually use all the k-mers as features for prediction without selecting relevant k-mers for the different groups of sequences, i.e. unique nucleotide patterns containing biological significance. Results To select k-mers for distinguishing different groups of sequences with guaranteed false discovery rate (FDR) control, we develop KIMI, a general framework based on model-X knockoffs, regarded as the state-of-the-art statistical method for FDR control, for sequence motif discovery with arbitrary target FDR level, such that reproducibility can be theoretically guaranteed. KIMI is shown through simulation studies to be effective in simultaneously controlling FDR and yielding high power, outperforming the broadly used Benjamini–Hochberg procedure and the q-value method for FDR control. To illustrate the usefulness of KIMI in analyzing real datasets, we take the viral motif discovery problem as an example and implement KIMI on a real dataset consisting of viral and bacterial contigs. We show that the accuracy of predicting viral and bacterial contigs can be increased by training the prediction model only on relevant k-mers selected by KIMI. Availability and implementation Our implementation of KIMI is available at https://github.com/xinbaiusc/KIMI. Supplementary information Supplementary data are available at Bioinformatics online.
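
The k-mer frequency features that such classifiers start from can be computed as in the sketch below. It covers feature construction only and is not part of KIMI; the function name is an assumption, and the sketch ignores ambiguous bases and reverse complements.

```python
from collections import Counter

def kmer_frequencies(sequence, k):
    """Return a dict mapping each k-mer in `sequence` to its relative frequency."""
    sequence = sequence.upper()
    # Slide a window of length k across the sequence, one position at a time.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}  # sequence shorter than k
    return {kmer: n / total for kmer, n in counts.items()}

# Hypothetical read: five overlapping 2-mers, with AC occurring twice.
freqs = kmer_frequencies("ACGTAC", 2)
print(freqs)
```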


2018 ◽  
Author(s):  
Colleen Molloy Farrelly

This study aims to confirm prior findings on the usefulness of topological data analysis (TDA) in the analysis of small samples, particularly focused on cohorts of profoundly gifted students, as well as explore the use of TDA-based regression methods for statistical modeling with small samples. A subset of the Gross sample is analyzed through supervised and unsupervised methods, including 16 and 17 individuals, respectively. Unsupervised learning confirmed prior results suggesting that evenly gifted and unevenly gifted subpopulations fundamentally differ. Supervised learning focused on predicting graduate school attendance and awards earned during undergraduate studies, and TDA-based logistic regression models were compared with more traditional machine learning models for logistic regression. Results suggest 1) that TDA-based methods are capable of handling small samples and seem more robust to the issues that arise in small samples than other machine learning methods and 2) that early childhood achievement scores and several factors related to childhood education interventions (such as early entry and radical acceleration) play a role in predicting key educational and professional achievements in adulthood. Possible new directions from this work include the use of TDA-based tools in the analysis of rare cohorts thus far relegated to qualitative analytics or case studies, as well as potential exploration of early educational factors and adult-level achievement in larger populations of the profoundly gifted, particularly within the Study of Exceptional Talent and Talent Identification Program cohorts.

