A simultaneous feature selection and compositional association test for detecting sparse associations in high-dimensional metagenomic data

Author(s):  
Andrew Hinton ◽  
Peter J. Mucha

Abstract Background: Numerous metagenomic studies aim to discover associations between the microbial composition of an environment (e.g., gut, skin, oral) and a phenotype of interest. Multivariate analysis (MVA) is often performed in these studies without critical a priori knowledge of which taxa are associated with the phenotype being studied. Consequently, non-parametric MVA methods are applied directly to all taxa surveyed, independent of noise. This approach typically reduces statistical power in settings where true associations among only a few taxa are obscured by high dimensionality (i.e., sparse association signals). At the same time, the inclusion of all taxa can confound the extraction of key biological insights. Further, low sample size and compositional sample space constraints exist in these data, whereby beyond-study generalizability may be reduced if they are not properly accounted for. More powerful association tests that are interpretable and directly account for compositional constraints while detecting sparse association signals are needed. Methods: We developed Selection-Energy-Permutation (SelEnergyPerm), a non-parametric group association test with embedded feature selection. SelEnergyPerm directly accounts for compositional constraints by selecting parsimonious log ratio signatures from the set of all pairwise log ratios (PLR) between features (OTUs, taxa, etc.). To do this, network methods are used to rank, select, and maximize the between-group association of a candidate log ratio subset. This process is then repeated with an appropriate permutation testing design to simultaneously determine the significance of the selected signatures and association. Results: Simulation results show SelEnergyPerm selects small independent sets of log ratios that capture strong associations in a range of scenarios with small and large dimensional feature spaces. Additionally, our simulation results demonstrate SelEnergyPerm consistently detects/rejects associations in synthetic data with sparse, dense, or no association signals. We demonstrate the novel benefits of our method in four case studies utilizing publicly available 16S rRNA and whole-genome sequencing datasets. Conclusions: Tools to analyze complex high-dimensional metagenomic datasets with sparse association signals using robust PLR have not been sufficiently developed previously. We propose SelEnergyPerm, a novel framework for the discovery of phenotype-associated, metagenomic log ratio signatures for characterizing and understanding alterations in microbial community structure. SelEnergyPerm is implemented in R, available at https://github.com/andrew84830813/selEnergyPermR.
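As a rough illustration of two ingredients named in this abstract (pairwise log ratios and a permutation test on a between-group statistic), the following Python sketch may help. It is not the SelEnergyPerm implementation, which is an R package: the function names and toy data are hypothetical, the energy distance is assumed here as the between-group statistic, and the network-based selection step (which the abstract says is repeated inside the permutation design) is omitted entirely.

```python
import itertools
import math
import random


def pairwise_log_ratios(sample):
    """Return log(x_i / x_j) for every pair i < j of strictly positive parts."""
    return [math.log(sample[i] / sample[j])
            for i, j in itertools.combinations(range(len(sample)), 2)]


def _mean_dist(a, b):
    """Mean Euclidean distance over all pairs (row of a, row of b)."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def energy_statistic(a, b):
    """Two-sample energy distance: 2*E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return 2 * _mean_dist(a, b) - _mean_dist(a, a) - _mean_dist(b, b)


def permutation_p_value(a, b, n_perm=199, seed=0):
    """Permutation p-value obtained by shuffling samples across the two groups."""
    rng = random.Random(seed)
    observed = energy_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if energy_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy compositions (3 taxa, 5 samples per group).
group_a = [[0.60, 0.30, 0.10], [0.55, 0.35, 0.10], [0.65, 0.25, 0.10],
           [0.60, 0.28, 0.12], [0.58, 0.32, 0.10]]
group_b = [[0.20, 0.30, 0.50], [0.25, 0.30, 0.45], [0.18, 0.32, 0.50],
           [0.22, 0.28, 0.50], [0.20, 0.35, 0.45]]

lr_a = [pairwise_log_ratios(s) for s in group_a]
lr_b = [pairwise_log_ratios(s) for s in group_b]
p_value = permutation_p_value(lr_a, lr_b)
print("energy statistic:", energy_statistic(lr_a, lr_b))
print("permutation p-value:", p_value)
```

Because this sketch performs no feature selection, its p-value is not comparable to one from a design in which selection is re-run on every permuted dataset.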

2021 ◽  
Author(s):  
Zequn Sun ◽  
Jing Zhao ◽  
Zhaoqian Liu ◽  
Qin Ma ◽  
Dongjun Chung

Abstract Identification of disease-associated microbial species is of great biological and clinical interest. However, this investigation remains challenging due to heterogeneity in microbial composition between individuals, data quality issues, and complex relationships among species. In this paper, we propose a novel data purification algorithm that allows elimination of noise observations, which leads to increased statistical power to detect disease-associated microbial species. We illustrate the proposed algorithm using the metagenomic data generated from colorectal cancer patients.


2021 ◽  
Author(s):  
Andrew Lamont Hinton ◽  
Peter J Mucha

The demand for tight integration of compositional data analysis and machine learning methodologies for predictive modeling in high-dimensional settings has increased dramatically with the increasing availability of metagenomics data. We develop the differential compositional variation machine learning framework (DiCoVarML) with robust multi-level log ratio bio-marker discovery for metagenomic datasets. Our framework makes use of the full set of pairwise log ratios, scoring ratios according to their variation between classes and then selecting out a small subset of log ratios to accurately predict classes. Importantly, DiCoVarML supports a targeted feature selection mode enabling researchers to define the number of predictors used to develop models. We demonstrate the performance of our framework for binary classification tasks using both synthetic and real datasets. Selecting from all pairwise log ratios within the DiCoVarML framework provides greater flexibility that can in demonstrated cases lead to higher accuracy and enhanced biological insight.
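The score-then-select idea described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DiCoVarML itself: the scoring rule (squared difference of class means over pooled within-class variance), the function names, and the toy counts are all hypothetical choices made for this example.

```python
import itertools
import math
import statistics


def log_ratio_table(samples):
    """Map each part pair (i, j), i < j, to its log ratio across all samples."""
    n_parts = len(samples[0])
    return {
        (i, j): [math.log(s[i] / s[j]) for s in samples]
        for i, j in itertools.combinations(range(n_parts), 2)
    }


def separation_score(values, labels):
    """Assumed score: squared class-mean difference over pooled variance."""
    g0 = [v for v, y in zip(values, labels) if y == 0]
    g1 = [v for v, y in zip(values, labels) if y == 1]
    pooled_var = (statistics.variance(g0) + statistics.variance(g1)) / 2
    # A small constant guards against division by zero for constant ratios.
    return (statistics.mean(g0) - statistics.mean(g1)) ** 2 / (pooled_var + 1e-12)


def top_k_log_ratios(samples, labels, k):
    """Return the k part pairs whose log ratios best separate the two classes."""
    table = log_ratio_table(samples)
    ranked = sorted(table,
                    key=lambda pair: separation_score(table[pair], labels),
                    reverse=True)
    return ranked[:k]


# Hypothetical counts: parts 0 and 1 swap dominance between classes while
# their overall magnitude varies widely from sample to sample.
samples = [[40, 10, 25, 25], [80, 21, 25, 26], [20, 5.2, 24, 25],
           [10, 40, 25, 25], [21, 80, 26, 25], [5.2, 20, 25, 24]]
labels = [0, 0, 0, 1, 1, 1]

print(top_k_log_ratios(samples, labels, k=2))
```

In the toy data, parts 0 and 1 individually fluctuate a great deal, but their ratio is stable within each class, which is the kind of signal a log ratio representation is meant to expose.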


2022 ◽  
Vol 4 (1) ◽  
Author(s):  
Kalins Banerjee ◽  
Jun Chen ◽  
Xiang Zhan

ABSTRACT The important role of the human microbiome is being increasingly recognized in health and disease conditions. Since microbiome data is typically high dimensional, one popular mode of statistical association analysis for microbiome data is to pool individual microbial features into a group, and then conduct group-based multivariate association analysis. A corresponding challenge within this approach is to achieve adequate power to detect an association signal between a group of microbial features and the outcome of interest across a wide range of scenarios. Recognizing some existing methods’ susceptibility to the adverse effects of noise accumulation, we introduce the Adaptive Microbiome Association Test (AMAT), a novel and powerful tool for multivariate microbiome association analysis, which unifies both the blessings of feature selection in high-dimensional inference and the robustness of adaptive statistical association testing. AMAT first alleviates the burden of noise accumulation via distance correlation learning, and then conducts a data-adaptive association test under the flexible generalized linear model framework. Extensive simulation studies and real data applications demonstrate that AMAT is highly robust and often more powerful than several existing methods, while preserving the correct type I error rate. A free implementation of AMAT in the R computing environment is available at https://github.com/kzb193/AMAT.
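To make the distance correlation step more concrete, here is a minimal Python sketch of the sample distance correlation for two univariate variables, followed by a simple top-k screen. It is an assumption-laden illustration rather than the AMAT procedure: the helper names, the fixed top-k rule, and the toy vectors are hypothetical, and the subsequent adaptive test under a generalized linear model is not shown.

```python
import math


def _centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length numeric sequences."""
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(v * v for row in a for v in row) / n ** 2
    dvar_y = sum(v * v for row in b for v in row) / n ** 2
    if dvar_x == 0 or dvar_y == 0:
        return 0.0  # a constant variable carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))


def screen_features(feature_columns, outcome, k):
    """Return indices of the k features most dependent on the outcome."""
    scores = [distance_correlation(col, outcome) for col in feature_columns]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Hypothetical toy data: a quadratic relationship that a linear
# correlation coefficient would miss.
x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]
print("distance correlation:", distance_correlation(x, y))
```

Unlike Pearson correlation, which is zero for the symmetric quadratic example above, distance correlation is sensitive to nonlinear dependence, which is one plausible reason to use it for screening noisy features.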


Author(s):  
Chen Cao ◽  
Devin Kwok ◽  
Shannon Edie ◽  
Qing Li ◽  
Bowei Ding ◽  
...  

Abstract The power of genotype–phenotype association mapping studies increases greatly when contributions from multiple variants in a focal region are meaningfully aggregated. Currently, there are two popular categories of variant aggregation methods. Transcriptome-wide association studies (TWAS) represent a set of emerging methods that select variants based on their effect on gene expression, providing pretrained linear combinations of variants for downstream association mapping. In contrast, kernel methods such as the sequence kernel association test (SKAT) model genotypic and phenotypic variance using various kernel functions that capture genetic similarity between subjects, allowing nonlinear effects to be included. From the perspective of machine learning, these two methods cover two complementary aspects of feature engineering: feature selection/pruning and feature aggregation. Thus far, no thorough comparison has been made between these categories, and no methods exist which incorporate the advantages of TWAS- and kernel-based methods. In this work, we developed a novel method called kernel-based TWAS (kTWAS) that applies TWAS-like feature selection to a SKAT-like kernel association test, combining the strengths of both approaches. Through extensive simulations, we demonstrate that kTWAS has higher power than TWAS and multiple SKAT-based protocols, and we identify novel disease-associated genes in Wellcome Trust Case Control Consortium genotyping array data and MSSNG (Autism) sequence data. The source code for kTWAS and our simulations is available in our GitHub repository (https://github.com/theLongLab/kTWAS).
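The combination of per-variant weights with a kernel-style test can be illustrated with a weighted linear-kernel score statistic. The Python sketch below is not the kTWAS code: the weights are hypothetical stand-ins for expression-derived variant weights, there are no covariates, and a permutation p-value is swapped in here in place of the analytic null distribution typically used by SKAT-type tests.

```python
import random


def weighted_kernel_statistic(genotypes, phenotype, weights):
    """Q = sum_j w_j * (sum_i g_ij * (y_i - mean(y)))**2 for a linear kernel."""
    mean_y = sum(phenotype) / len(phenotype)
    residuals = [y - mean_y for y in phenotype]
    q = 0.0
    for j, w in enumerate(weights):
        score = sum(row[j] * r for row, r in zip(genotypes, residuals))
        q += w * score ** 2
    return q


def permutation_p_value(genotypes, phenotype, weights, n_perm=499, seed=0):
    """Permutation p-value from shuffling phenotypes across subjects."""
    rng = random.Random(seed)
    observed = weighted_kernel_statistic(genotypes, phenotype, weights)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if weighted_kernel_statistic(genotypes, shuffled, weights) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 3 subjects, 2 variants coded as allele counts.
genotypes = [[0, 1], [1, 1], [2, 0]]
phenotype = [1.0, 2.0, 3.0]
weights = [1.0, 0.5]  # assumed weights; a zero weight prunes a variant

print("Q =", weighted_kernel_statistic(genotypes, phenotype, weights))
```

Setting a weight to zero removes that variant from the statistic, which is how feature selection and kernel aggregation can be expressed in a single quantity in this simplified setting.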


Author(s):  
Miguel García-Torres ◽  
Francisco Gómez-Vela ◽  
Federico Divina ◽  
Diego P. Pinto-Roa ◽  
José Luis Vázquez Noguera ◽  
...  
