Phylogeny-Guided Microbiome OTU-Specific Association Test (POST)

Abstract. Background: The relationship between host conditions and microbiome profiles, typically characterized by operational taxonomic units (OTUs), contains important information about the microbial role in human health. Traditional association testing frameworks are challenged by the high dimensionality and sparsity of typical microbiome profiles. Phylogenetic information is often incorporated to address these challenges, under the assumption that evolutionarily similar taxa tend to behave similarly. However, this assumption may not always be valid because of the complex effects of microbes, so phylogenetic information should be incorporated in a data-supervised fashion. Results: In this work, we propose a local collapsing test called the Phylogeny-guided microbiome OTU-Specific association Test (POST). In POST, whether to borrow information from the neighboring OTUs in the phylogenetic tree, and how much to borrow, is supervised by phylogenetic distance and the outcome-OTU association. POST is constructed under the kernel machine framework to accommodate complex OTU effects, and it extends kernel machine microbiome tests from the community level to the OTU level. Using simulation studies, we show that when the phylogenetic tree is informative, POST performs better than existing OTU-level association tests. When the phylogenetic tree is not informative, POST achieves performance similar to that of existing methods. Finally, we show that POST can identify more outcome-associated OTUs of biological relevance in real data applications on bacterial vaginosis and on preterm birth. Conclusions: Using POST, we show that the power to detect associated microbiome features can be enhanced by adaptively leveraging phylogenetic information when testing a target OTU. We developed a user-friendly R package, POSTm, which is available on CRAN (https://CRAN.R-project.org/package=POSTm) for public access.
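
The local collapsing idea can be illustrated with a small sketch. It assumes that each neighboring OTU is down-weighted by an exponential decay in its phylogenetic distance to the target OTU, controlled by a tuning value `c`; this decay form, the names `borrowing_weights` and `collapse`, and the toy numbers are assumptions made for illustration and may differ from the weighting actually used in POST or POSTm. The outcome-supervised choice of `c` is not shown.

```python
import math


def borrowing_weights(distances, c):
    """Weight each OTU by exp(-d / c), where d is its distance to the target.

    c = 0 is treated as "no borrowing": only OTUs at distance 0 (the target
    itself) receive weight 1. Larger c borrows more from distant OTUs.
    This decay form is an illustrative assumption, not the POST definition.
    """
    if c < 0 or any(d < 0 for d in distances):
        raise ValueError("distances and c must be non-negative")
    if c == 0:
        return [1.0 if d == 0 else 0.0 for d in distances]
    return [math.exp(-d / c) for d in distances]


def collapse(abundances, weights):
    """Weighted sum of OTU abundances for one sample."""
    return sum(a * w for a, w in zip(abundances, weights))


# Hypothetical distances from the target OTU (first entry) to its neighbors.
distances = [0.0, 1.0, 2.0]
sample = [5.0, 2.0, 1.0]  # hypothetical abundances for one sample

no_borrowing = collapse(sample, borrowing_weights(distances, 0))
some_borrowing = collapse(sample, borrowing_weights(distances, 1.0))
```

With `c = 0` the collapsed value is just the target OTU's abundance, and it grows toward the plain sum of all OTUs as `c` increases.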

PaIRKAT: A pathway integrated regression-based kernel association test with applications to metabolomics and COPD phenotypes

10.1101/2021.04.23.440821 ◽ 2021

Author(s): Charlie M Carpenter ◽ Weiming Zhang ◽ Lucas Gillenwater ◽ Cameron Severn ◽ Tusharkanti Ghosh ◽ ...

Keyword(s): Score Test ◽ Real Data ◽ Association Test ◽ Simulation Studies ◽ Data Types ◽ Metabolomics Data ◽ Kernel Machine ◽ Pathway Data ◽ Regression Framework ◽ Testing Power

High-throughput data such as metabolomics, genomics, transcriptomics, and proteomics have become familiar data types within the "-omics" family. For this work, we focus on subsets that interact with one another and represent these "pathways" as graphs. Observed pathways often have disjoint components, i.e., nodes or sets of nodes (metabolites, etc.) that are not connected to any other node within the pathway, which notably lessens testing power. In this paper we propose the Pathway Integrated Regression-based Kernel Association Test (PaIRKAT), a new kernel machine regression method for incorporating known pathway information into the semi-parametric kernel regression framework. This paper also contributes an application of a graph kernel regularization method for overcoming disconnected pathways. By incorporating a regularized or "smoothed" graph into a score test, PaIRKAT can provide more powerful tests for associations between biological pathways and phenotypes of interest and will be helpful in identifying novel pathways for targeted clinical research. We evaluate this method through several simulation studies and an application to real metabolomics data from the COPDGene study. Our simulation studies illustrate the robustness of this method to incorrect and incomplete pathway knowledge, and the real data analysis shows meaningful improvements of testing power in pathways. PaIRKAT was developed for application to metabolomic pathway data, but the techniques are easily generalizable to other data sources with a graph-like structure.
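
To make the notion of disjoint components concrete, the sketch below lists the connected components of a pathway graph with a breadth-first search. It only shows how disconnected nodes can be detected; it is not the PaIRKAT regularization step, and the function name and metabolite labels are hypothetical.

```python
from collections import deque


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph.

    `nodes` is a list of labels and `edges` a list of (a, b) pairs whose
    endpoints must appear in `nodes`. Each component is a sorted list.
    """
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(adjacency[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# Hypothetical pathway: three linked metabolites and two isolated ones.
nodes = ["glucose", "pyruvate", "lactate", "citrate", "urea"]
edges = [("glucose", "pyruvate"), ("pyruvate", "lactate")]
components = connected_components(nodes, edges)
```

A pathway with more than one component is "disconnected" in the sense described above; here `citrate` and `urea` each form a component of their own.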

Phylogenetic tree-based microbiome association test

Bioinformatics ◽ 10.1093/bioinformatics/btz686 ◽ 2019

Author(s): Kang Jin Kim ◽ Jaehyun Park ◽ Sang-Chul Park ◽ Sungho Won

Keyword(s): Phylogenetic Tree ◽ Phylogenetic Trees ◽ Statistical Power ◽ False Negative ◽ Amplicon Sequencing ◽ R Package ◽ Chronic Fatigue ◽ Association Test ◽ Supplementary Information ◽ Association Analyses

Abstract. Motivation: Ecological patterns of the human microbiota exhibit high inter-subject variation, with few operational taxonomic units (OTUs) shared across individuals. To overcome these issues, non-parametric approaches, such as the Mann–Whitney U-test and Wilcoxon rank-sum test, have often been used to identify OTUs associated with host diseases. However, these approaches only use the ranks of observed relative abundances, leading to information loss, and are associated with high false-negative rates. In this study, we propose a phylogenetic tree-based microbiome association test (TMAT) to analyze the associations between microbiome OTU abundances and disease phenotypes. Phylogenetic trees illustrate patterns of similarity among different OTUs, and TMAT provides an efficient method for utilizing such information for association analyses. The proposed TMAT provides test statistics for each node, which are combined to identify mutations associated with host diseases. Results: Power estimates of TMAT were compared with existing methods using extensive simulations based on real absolute abundances. Simulation studies showed that TMAT preserves the nominal type-1 error rate, and estimates of its statistical power generally outperformed existing methods in the considered scenarios. Furthermore, TMAT can be used to detect phylogenetic mutations associated with host diseases, providing more in-depth insight into bacterial pathology. Availability and implementation: The 16S rRNA amplicon sequencing metagenomics datasets for colorectal carcinoma and myalgic encephalomyelitis/chronic fatigue syndrome are available from the European Nucleotide Archive (ENA) database under project accession numbers PRJEB6070 and PRJEB13092, respectively. TMAT was implemented as an R package. Detailed information is available at http://healthstat.snu.ac.kr/software/tmat. Supplementary information: Supplementary data are available at Bioinformatics online.

Detection of differentially methylated CpG sites between tumor samples with uneven tumor purities

Bioinformatics ◽ 10.1093/bioinformatics/btz885 ◽ 2019 ◽ Vol 36 (7) ◽ pp. 2017-2024

Author(s): Weiwei Zhang ◽ Ziyi Li ◽ Nana Wei ◽ Hua-Jun Wu ◽ Xiaoqi Zheng

Keyword(s): Real Data ◽ R Package ◽ Differential Methylation ◽ Least Square ◽ Epigenetic Mechanism ◽ Supplementary Information ◽ Cpg Sites ◽ Tumor Purity ◽ Different Sources ◽ Normal Controls

Abstract. Motivation: Inference of differentially methylated (DM) CpG sites between two groups of tumor samples with different geno- or pheno-types is a critical step to uncover the epigenetic mechanism of tumorigenesis and to identify biomarkers for cancer subtyping. However, uneven distributions of tumor purity between the two groups, a major confounding factor, will lead to biased discovery of DM sites if not properly accounted for. Results: We here propose InfiniumDM, a generalized least square model that adjusts for the effect of tumor purity in differential methylation analysis. Our method is applicable to a variety of experimental designs, including designs with or without normal controls and designs with different sources of normal tissue contamination. We compared our method with conventional methods, including minfi, limma and limma corrected by tumor purity, using simulated datasets. Our method shows significantly better performance across different differential methylation thresholds, sample sizes and mean purity deviations. We also applied the proposed method to breast cancer samples from the TCGA database to further evaluate its performance. Overall, both simulation and real data analyses demonstrate favorable performance over existing methods serving a similar purpose. Availability and implementation: InfiniumDM is part of the R package InfiniumPurify, which is freely available from GitHub (https://github.com/Xiaoqizheng/InfiniumPurify). Supplementary information: Supplementary data are available at Bioinformatics online.

mixIndependR: a R package for statistical independence testing of loci in database of multi-locus genotypes

BMC Bioinformatics ◽ 10.1186/s12859-020-03945-0 ◽ 2021 ◽ Vol 22 (1)

Author(s): Bing Song ◽ August E. Woerner ◽ John Planz

Keyword(s): Population Genetics ◽ Linkage Disequilibrium ◽ Genetic Markers ◽ Software Package ◽ Tandem Repeats ◽ Population Data ◽ Real Data ◽ R Package ◽ Nucleotide Polymorphisms ◽ Mutual Independence

Abstract. Background: Multi-locus genotype data are widely used in population genetics and disease studies. In evaluating the utility of multi-locus data, the independence of markers is commonly considered in many genomic assessments. Generally, pairwise non-random associations are tested by linkage disequilibrium; however, the dependence within a panel might involve triplets, quartets, or larger sets of markers. Therefore, a compatible and user-friendly software package is necessary for testing and assessing the global linkage disequilibrium among mixed genetic data. Results: This study describes a software package for testing the mutual independence of mixed genetic datasets. Mutual independence is defined as the absence of non-random associations among all subsets of the tested panel. The new R package “mixIndependR” calculates basic genetic parameters such as allele frequency, genotype frequency, heterozygosity, Hardy–Weinberg equilibrium, and linkage disequilibrium (LD) by mutual independence from population data, regardless of the type of markers, such as simple nucleotide polymorphisms, short tandem repeats, insertions and deletions, and any other genetic markers. A novel method of assessing the dependence of mixed genetic panels is developed in this study and functionally analyzed in the software package. The overall independence is tested by comparing the observed distributions of two common summary statistics (the number of heterozygous loci [K] and the number of shared alleles [X]) with their expected distributions under the assumption of mutual independence. Conclusion: The package “mixIndependR” is compatible with all categories of genetic markers and detects the overall non-random associations. Compared to pairwise disequilibrium, the approach described herein tends to have higher power, especially when the number of markers is large. With this package, more multi-functional or stronger genetic panels can be developed, such as mixed panels with different kinds of markers. In population genetics, the package “mixIndependR” makes it possible to discover more about admixture of populations, natural selection, genetic drift, and population demographics, as a more powerful method of detecting LD. Moreover, this new approach can optimize variant selection in disease studies and contribute to panel combination for treatments in multimorbidity. Application of this approach to real data is expected in the future, and this might bring a leap in the field of genetic technology. Availability: The R package mixIndependR is available on the Comprehensive R Archive Network (CRAN) at: https://cran.r-project.org/web/packages/mixIndependR/index.html.
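
The expected distribution of the number of heterozygous loci K under mutual independence can be sketched as follows. If each locus is heterozygous independently with its own probability, K is a sum of independent Bernoulli variables, and its distribution can be built one locus at a time. This is a generic calculation under that assumption, not the mixIndependR code; the function name and the per-locus heterozygosities are hypothetical.

```python
def heterozygous_loci_distribution(het_probs):
    """Return [P(K = 0), ..., P(K = L)] for L independent loci.

    `het_probs[l]` is the probability that locus l is heterozygous.
    The list is extended by one entry per locus (dynamic programming).
    """
    dist = [1.0]  # with zero loci, K = 0 with probability 1
    for p in het_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("heterozygosity must lie in [0, 1]")
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # this locus is homozygous
            nxt[k + 1] += mass * p      # this locus is heterozygous
        dist = nxt
    return dist


# Hypothetical expected heterozygosities for a three-locus panel.
expected = heterozygous_loci_distribution([0.5, 0.5, 0.8])
```

An observed histogram of K across individuals could then be compared with `expected` (scaled by the sample size) using a goodness-of-fit test.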

MonoPhy: A simple R package to find and visualize monophyly issues

10.7287/peerj.preprints.1600v1 ◽ 2015

Author(s): Orlando Schwery ◽ Brian C O'Meara

Keyword(s): Phylogenetic Tree ◽ R Package ◽ Input File ◽ Higher Taxa ◽ Additional Input

Background. The monophyly of taxa is an important attribute of a phylogenetic tree, as a lack of it may hint at shortcomings of either the tree or the current taxonomy and can misguide subsequent analyses. While monophyly is conceptually simple, it is tedious and time-consuming to assess manually on modern phylogenies of hundreds to thousands of species. Results. The R package MonoPhy allows assessment and exploration of monophyly of taxa in a phylogeny. It can assess the monophyly of genera using the phylogeny only, and with an additional input file, any other desired higher taxa or unranked groups can be checked as well. Conclusion. Summary tables, easily subsettable results and several visualization options allow quick and convenient exploration of monophyly issues, thus making MonoPhy a valuable tool for any researcher working with phylogenies.
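
The underlying check is simple to state in code: a group is monophyletic when the smallest clade containing all of its tips contains no other tips. The sketch below represents a rooted tree as nested tuples with string tips; this representation, the function names, and the convention that a tip label starts with the genus name followed by an underscore are assumptions for illustration and are not the MonoPhy interface.

```python
def tips(node):
    """Return the set of tip labels below a node (nested tuples, str tips)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= tips(child)
    return result


def is_monophyletic(tree, group):
    """True if the smallest clade containing `group` holds no other tips."""
    group = set(group)
    if not group or not group <= tips(tree):
        raise ValueError("group must be a non-empty subset of the tree's tips")
    node = tree
    while not isinstance(node, str):
        # Descend while a single child still contains the whole group.
        holders = [child for child in node if group <= tips(child)]
        if not holders:
            break
        node = holders[0]
    return tips(node) == group


def genus_monophyly(tree):
    """Check every genus, taking the genus as the text before the first '_'."""
    genera = {}
    for tip in tips(tree):
        genera.setdefault(tip.split("_")[0], set()).add(tip)
    return {genus: is_monophyletic(tree, members)
            for genus, members in genera.items()}


# Hypothetical tree: genus A is monophyletic, genus B is not.
tree = ((("A_x", "A_y"), "B_x"), ("B_y", "C_x"))
report = genus_monophyly(tree)
```

Recomputing tip sets at every step keeps the sketch short; on phylogenies with thousands of tips a real implementation would cache them.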

Categorical Functional Data Analysis. The cfda R Package

Mathematics ◽ 10.3390/math9233074 ◽ 2021 ◽ Vol 9 (23) ◽ pp. 3074

Author(s): Cristian Preda ◽ Quentin Grimonprez ◽ Vincent Vandewalle

Keyword(s): Functional Data ◽ Multiple Correspondence Analysis ◽ Real Data ◽ Jump Process ◽ R Package ◽ Finite Basis ◽ Data Set ◽ Stochastic Jump ◽ Finite Set ◽ Infinite Set

Categorical functional data represented by paths of a stochastic jump process with continuous time and a finite set of states are considered. As an extension of the multiple correspondence analysis to an infinite set of variables, optimal encodings of states over time are approximated using an arbitrary finite basis of functions. This allows dimension reduction, optimal representation, and visualisation of data in lower dimensional spaces. The methodology is implemented in the cfda R package and is illustrated using a real data set in the clustering framework.

The asymptotic distribution of the Net Benefit estimator in presence of right-censoring

Statistical Methods in Medical Research ◽ 10.1177/09622802211037067 ◽ 2021 ◽ pp. 096228022110370

Author(s): Brice Ozenne ◽ Esben Budtz-Jørgensen ◽ Julien Péron

Keyword(s): Asymptotic Distribution ◽ Nuisance Parameter ◽ Real Data ◽ R Package ◽ Right Censoring ◽ Drop Out ◽ Net Benefit ◽ Asymptotic Results ◽ Finite Samples ◽ Benefit Risk Assessment

The benefit–risk balance is critical information when evaluating a new treatment. The Net Benefit has been proposed as a metric for benefit–risk assessment and has been applied in oncology to simultaneously consider gains in survival and possible side effects of chemotherapies. With complete data, one can construct a U-statistic estimator for the Net Benefit and obtain its asymptotic distribution using standard results of U-statistic theory. However, real data are often subject to right-censoring, e.g. patient drop-out in clinical trials. It is then possible to estimate the Net Benefit using a modified U-statistic, which involves the survival time. The latter can be seen as a nuisance parameter affecting the asymptotic distribution of the Net Benefit estimator. We present here how existing asymptotic results on U-statistics can be applied to estimate the distribution of the Net Benefit estimator, and assess their validity in finite samples. The methodology generalizes to other statistics obtained using generalized pairwise comparisons, such as the win ratio. It is implemented in the R package BuyseTest (version 2.3.0 and later), available on the Comprehensive R Archive Network.
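
For complete data, the U-statistic estimator can be sketched as a comparison of every treated patient with every control patient: the estimate is the proportion of favorable pairs minus the proportion of unfavorable pairs. The code below is a minimal illustration for a single outcome where larger values are better; it does not handle right-censoring and is not the BuyseTest implementation. The function name, the `threshold` argument, and the data are hypothetical.

```python
from itertools import product


def net_benefit(treated, control, threshold=0.0):
    """Complete-data U-statistic estimate of the Net Benefit.

    A pair is favorable when the treated outcome exceeds the control
    outcome by more than `threshold`, unfavorable in the mirrored case,
    and neutral otherwise.
    """
    if not treated or not control:
        raise ValueError("both arms need at least one observation")
    favorable = unfavorable = 0
    for x, y in product(treated, control):
        if x - y > threshold:
            favorable += 1
        elif y - x > threshold:
            unfavorable += 1
    return (favorable - unfavorable) / (len(treated) * len(control))


# Hypothetical, fully observed survival times in months.
treated = [12.0, 15.0, 9.0]
control = [10.0, 8.0, 14.0]
estimate = net_benefit(treated, control)
estimate_2m = net_benefit(treated, control, threshold=2.0)
```

Raising the threshold turns small differences into neutral pairs, which is why `estimate_2m` is closer to zero than `estimate` in this toy example.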

powmic: an R package for power assessment in microbiome case–control studies

Bioinformatics ◽ 10.1093/bioinformatics/btaa197 ◽ 2020 ◽ Vol 36 (11) ◽ pp. 3563-3565

Author(s): Li Chen

Keyword(s): Power Analysis ◽ Real Data ◽ Analytical Form ◽ R Package ◽ Case Control ◽ Supplementary Information ◽ Metagenomic Sequencing ◽ Case Control Studies ◽ Simulation Based ◽ Over Dispersion

Abstract. Summary: Power analysis is essential for deciding the sample size of metagenomic sequencing experiments in a case–control study that aims to identify differentially abundant (DA) microbes. However, the complexity of microbial data characteristics, such as excessive zeros, over-dispersion, compositionality, intrinsic microbial correlations and variable sequencing depths, makes the power analysis particularly challenging because the analytical form is usually unavailable. Here, we develop a simulation-based power assessment strategy and the R package powmic, which considers the complexity of microbial data characteristics. A real data example demonstrates the usage of powmic. Availability and implementation: The powmic R package and online tutorial are available at https://github.com/lichen-lab/powmic. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

MiRKAT: kernel machine regression-based global association tests for the microbiome

Bioinformatics ◽ 10.1093/bioinformatics/btaa951 ◽ 2020

Author(s): Nehemiah Wilson ◽ Ni Zhao ◽ Xiang Zhan ◽ Hyunwook Koh ◽ Weijia Fu ◽ ...

Keyword(s): R Package ◽ Effect Sizes ◽ Supplementary Information ◽ Time To Event ◽ Kernel Machine ◽ Association Testing ◽ Higher Power ◽ Kernel Machine Regression ◽ Two Measures ◽ Rv Coefficient

Abstract. Summary: Distance-based tests of microbiome beta diversity are an integral part of many microbiome analyses. MiRKAT enables distance-based association testing with a wide variety of outcome types, including continuous, binary, censored time-to-event, multivariate, correlated and high-dimensional outcomes. Omnibus tests allow simultaneous consideration of multiple distance and dissimilarity measures, providing higher power across a range of simulation scenarios. Two measures of effect size, a modified R-squared coefficient and a kernel RV coefficient, are incorporated to allow comparison of effect sizes across multiple kernels. Availability and implementation: MiRKAT is available on CRAN as an R package. Supplementary information: Supplementary data are available at Bioinformatics online.
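
Kernel machine tests operate on a kernel (similarity) matrix, so a pairwise distance matrix D has to be converted first. A common construction is Gower centering, K = -1/2 J D² J with J = I - 11'/n, where D² is the elementwise square. The sketch below implements only this centering step in plain Python; the function name is hypothetical, the MiRKAT R functions are not called, and the correction usually applied to negative eigenvalues of non-Euclidean distances is omitted.

```python
def distance_to_kernel(dist):
    """Gower-center a distance matrix: K = -1/2 * J * D^2 * J.

    `dist` is a square, symmetric list of lists of pairwise distances.
    Double centering is done with row, column, and grand means of D^2.
    """
    n = len(dist)
    if n == 0 or any(len(row) != n for row in dist):
        raise ValueError("distance matrix must be square and non-empty")
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]


# Toy example: three samples located at 0, 1 and 3 on a line.
D = [[0.0, 1.0, 3.0],
     [1.0, 0.0, 2.0],
     [3.0, 2.0, 0.0]]
K = distance_to_kernel(D)
```

For Euclidean distances such as these, K equals the Gram matrix of the mean-centered coordinates, which gives a quick sanity check on the result.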

Treeio: An R Package for Phylogenetic Tree Input and Output with Richly Annotated and Associated Data

Molecular Biology and Evolution ◽ 10.1093/molbev/msz240 ◽ 2019 ◽ Vol 37 (2) ◽ pp. 599-603 ◽ Cited By ~ 25

Author(s): Li-Gen Wang ◽ Tommy Tsan-Yuk Lam ◽ Shuangbin Xu ◽ Zehan Dai ◽ Lang Zhou ◽ ...

Keyword(s): Phylogenetic Tree ◽ Phylogenetic Trees ◽ R Package ◽ External Data ◽ Input And Output ◽ Evolutionary Context ◽ Tree Data ◽ Downstream Analysis ◽ Different Sources ◽ Associated Data

Abstract Phylogenetic trees and data are often stored in incompatible and inconsistent formats. The outputs of software tools that contain trees with analysis findings are often not compatible with each other, making it hard to integrate the results of different analyses in a comparative study. The treeio package is designed to connect phylogenetic tree input and output. It supports extracting phylogenetic trees as well as the outputs of commonly used analytical software. It can link external data to phylogenies and merge tree data obtained from different sources, enabling analyses of phylogeny-associated data from different disciplines in an evolutionary context. Treeio also supports export of a phylogenetic tree with heterogeneous-associated data to a single tree file, including BEAST compatible NEXUS and jtree formats; these facilitate data sharing as well as file format conversion for downstream analysis. The treeio package is designed to work with the tidytree and ggtree packages. Tree data can be processed using the tidy interface with tidytree and visualized by ggtree. The treeio package is released within the Bioconductor and rOpenSci projects. It is available at https://www.bioconductor.org/packages/treeio/.
