PaIRKAT: A pathway integrated regression-based kernel association test with applications to metabolomics and COPD phenotypes

2021 · Author(s): Charlie M Carpenter, Weiming Zhang, Lucas Gillenwater, Cameron Severn, Tusharkanti Ghosh, ...

High-throughput data such as metabolomics, genomics, transcriptomics, and proteomics have become familiar data types within the "-omics" family. In this work, we focus on subsets of molecules that interact with one another and represent these "pathways" as graphs. Observed pathways often have disjoint components, i.e., nodes or sets of nodes (metabolites, etc.) not connected to any others within the pathway, which notably lessens testing power. In this paper we propose the Pathway Integrated Regression-based Kernel Association Test (PaIRKAT), a new kernel machine regression method for incorporating known pathway information into the semi-parametric kernel regression framework. This paper also contributes an application of a graph kernel regularization method for overcoming disconnected pathways. By incorporating a regularized, or "smoothed," graph into a score test, PaIRKAT provides more powerful tests for associations between biological pathways and phenotypes of interest, and it can help identify novel pathways for targeted clinical research. We evaluate the method through several simulation studies and an application to real metabolomics data from the COPDGene study. Our simulation studies illustrate the robustness of the method to incorrect and incomplete pathway knowledge, and the real data analysis shows meaningful improvements in testing power within pathways. PaIRKAT was developed for metabolomic pathway data, but the techniques generalize readily to other data sources with a graph-like structure.
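As a rough sketch of the regularization idea (an illustration under assumed choices — a Gaussian kernel, a linear null model, and a smoothing parameter tau — rather than the paper's exact specification), one can smooth features over the pathway graph with the regularized inverse of the graph Laplacian before forming a kernel score test:

```python
import numpy as np

def graph_laplacian(A):
    """Graph Laplacian L = D - A of a pathway adjacency matrix A (p x p)."""
    return np.diag(A.sum(axis=1)) - A

def smooth_features(Z, A, tau=1.0):
    """Diffuse feature values Z (n samples x p metabolites) over the
    pathway graph via the regularized inverse (I + tau * L)^{-1}; the
    penalty shares signal across connected nodes, mitigating the
    disjoint-component problem."""
    L = graph_laplacian(A)
    return Z @ np.linalg.inv(np.eye(A.shape[0]) + tau * L)

def gaussian_kernel(Z, bandwidth=1.0):
    """Gaussian kernel between samples, built on the smoothed features."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / bandwidth)

def kernel_score_stat(y, X, K):
    """Score-type statistic Q = r' K r, with residuals r from the null
    linear model y ~ X; a p-value would come from a mixture-of-chi-squares
    approximation, omitted here."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ K @ r)
```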

2021 · Author(s): Caizhi Huang, Benjamin John Callahan, Michael C Wu, Shannon T. Holloway, Hayden Brochu, ...

Background: The relationship between host conditions and microbiome profiles, typically characterized by operational taxonomic units (OTUs), contains important information about the microbial role in human health. Traditional association testing frameworks are challenged by the high dimensionality and sparsity of typical microbiome profiles. Phylogenetic information is often incorporated to address these challenges, under the assumption that evolutionarily similar taxa tend to behave similarly. However, this assumption may not always hold, owing to the complex effects of microbes, so phylogenetic information should be incorporated in a data-supervised fashion.

Results: In this work, we propose a local collapsing test called the Phylogeny-guided microbiome OTU-Specific association Test (POST). In POST, whether to borrow information, and how much to borrow, from the neighboring OTUs in the phylogenetic tree is supervised by phylogenetic distance and the outcome-OTU association. POST is constructed under the kernel machine framework to accommodate complex OTU effects, and it extends kernel machine microbiome tests from the community level to the OTU level. Using simulation studies, we show that when the phylogenetic tree is informative, POST performs better than existing OTU-level association tests; when the phylogenetic tree is not informative, POST achieves performance similar to existing methods. Finally, we show that POST identifies more outcome-associated OTUs of biological relevance in real data applications on bacterial vaginosis and on preterm birth.

Conclusions: Using POST, we show that the power to detect associated microbiome features can be enhanced by adaptively leveraging phylogenetic information when testing a target OTU. We developed a user-friendly R package, POSTm, which is available on CRAN (https://CRAN.R-project.org/package=POSTm) for public access.
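A minimal sketch of the local-collapsing step for a single target OTU (the Gaussian-decay weight form, the hard truncation, and the neighborhood size c are illustrative assumptions; POST chooses how much to borrow in a data-supervised way):

```python
import numpy as np

def post_weights(dist, c=0.1):
    """Weights for borrowing from neighbors of a target OTU:
    exponential decay in phylogenetic distance, truncated at c."""
    w = np.exp(-dist**2 / (2 * c**2))
    w[dist > c] = 0.0
    return w

def otu_level_kernel(otu_table, dist, c=0.1):
    """Weighted linear kernel for one target OTU: columns of the
    (n samples x p OTUs) table are scaled by their weights, so the
    target OTU is tested together with its down-weighted neighbors."""
    Zw = otu_table * post_weights(dist, c)  # broadcasts over columns
    return Zw @ Zw.T
```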


2018 · Vol 28 (10-11) · pp. 3123-3141 · Author(s): Yi Tang, Wan Tang

Excessive zeros are common in practice and may cause overdispersion and invalidate inference when fitting Poisson regression models. Zero-inflated Poisson regression models may be applied when there are inflated zeros; however, it is desirable to test for inflated zeros before such models are applied. Assuming a constant probability of being a structural zero in a zero-inflated Poisson regression model, the existence of inflated zeros may be tested by testing whether that constant probability is zero. In such situations, the Wald, score, and likelihood ratio tests can be applied. Without specifying a zero-inflated Poisson model, He et al. recently developed a test that compares the number of observed zeros with the number expected under the Poisson model. In this paper, we derive a closed form for that test and compare it with the Wald, score, and likelihood ratio tests through simulation studies. The simulation studies show that the test of He et al. is the best at controlling type I error, while the score test generally has the least power among the tests. The tests are illustrated with two real data examples.
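For intuition, a sketch of a "compare observed with expected zeros" statistic of the kind studied here (the variance below is a simple plug-in approximation that ignores the uncertainty in the fitted means, not the closed form derived in the paper):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def zero_excess_test(y, X):
    """Fit a Poisson GLM, then compare the observed number of zeros
    with the expected number sum_i exp(-mu_i) under the fitted model."""
    fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    p0 = np.exp(-fit.mu)                 # P(Y_i = 0) under the Poisson fit
    observed = np.sum(y == 0)
    expected = p0.sum()
    var = np.sum(p0 * (1 - p0))          # plug-in variance approximation
    z = (observed - expected) / np.sqrt(var)
    return z, stats.norm.sf(z)           # one-sided: excess zeros
```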


Author(s): Zhongxue Chen, Shizhong Han, Kai Wang

Many gene- and pathway-based association tests have been proposed in the literature. Among them, the sequence kernel association test (SKAT) is widely used, especially for rare-variant association studies. In this paper, we investigate the connection between SKAT and principal component analysis. This investigation leads to a procedure that encompasses SKAT as a special case. Through simulation studies and real data applications, we compare the proposed method with some existing tests.
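The connection can be made explicit with a standard identity (our notation, not necessarily the authors'): writing r for the null-model residuals, G for the genotype matrix, and W for the variant weights, a singular value decomposition of GW turns the SKAT statistic into a weighted sum of squared projections of r onto principal component directions:

```latex
Q = r^{\top} G W W^{\top} G^{\top} r,
\qquad
GW = \sum_{k} d_k \, u_k v_k^{\top}
\;\Longrightarrow\;
Q = \sum_{k} d_k^{2} \left( u_k^{\top} r \right)^{2}.
```

Replacing the weights d_k^2 with other choices then yields a family of tests containing SKAT as a special case.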


Author(s): Guanghao Qi, Nilanjan Chatterjee

Background: Previous studies have often evaluated methods for Mendelian randomization (MR) analysis using simulations that do not adequately reflect the data-generating mechanisms in genome-wide association studies (GWAS), and there are often discrepancies between the performance of MR methods in simulations and in real data sets.

Methods: We use a simulation framework that generates data for full GWAS of two traits under a realistic model of effect-size distribution, coherent with the heritability, co-heritability, and polygenicity typically observed for complex traits. We further use recent data from GWAS of 38 biomarkers in the UK Biobank and perform down-sampling to investigate trends in estimates of the causal effects of these biomarkers on the risk of type 2 diabetes (T2D).

Results: Simulation studies show that the weighted mode and MRMix are the only two methods that maintain the correct type I error rate across a diverse set of scenarios. Between the two methods, MRMix tends to be more powerful for larger GWAS, whereas the opposite is true for smaller sample sizes. Among the other methods, random-effect IVW (the inverse-variance weighted method), MR-Robust, and MR-RAPS (robust adjusted profile score) tend to perform best at maintaining a low mean-squared error when the InSIDE assumption is satisfied, but they can produce large bias when InSIDE is violated. In the real-data analysis, some biomarkers showed major heterogeneity across methods in the estimates of their causal effects on T2D risk, and estimates from many methods trended in one direction with increasing sample size, with patterns similar to those observed in the simulation studies.

Conclusion: The relative performance of different MR methods depends heavily on the sample sizes of the underlying GWAS, the proportion of valid instruments, and the validity of the InSIDE assumption. Down-sampling analysis can be used in large GWAS for the possible detection of bias in MR methods.
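For reference, a sketch of the random-effect IVW estimator mentioned above (the multiplicative random-effects form; variable names are ours):

```python
import numpy as np

def ivw_random_effects(beta_x, beta_y, se_y):
    """Inverse-variance weighted MR estimate from per-SNP summary
    statistics: exposure effects beta_x, outcome effects beta_y, and
    outcome standard errors se_y.  The multiplicative random-effects
    correction inflates the SE when the per-SNP ratio estimates are
    overdispersed."""
    ratio = beta_y / beta_x              # per-SNP Wald ratios
    w = beta_x**2 / se_y**2              # first-order IVW weights
    est = np.sum(w * ratio) / np.sum(w)
    se = 1.0 / np.sqrt(np.sum(w))
    resid = (ratio - est) * np.sqrt(w)
    phi = max(1.0, np.sum(resid**2) / (len(ratio) - 1))
    return est, se * np.sqrt(phi)
```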


2017 · Vol 16 · pp. 117693511774727 · Author(s): Jian Wang, Rajesh Talluri, Sanjay Shete

To address the complexity of the X-chromosome inactivation (XCI) process, we previously developed a unified approach for testing the association between X-chromosomal single-nucleotide polymorphisms (SNPs) and the disease of interest, accounting for different biological possibilities of XCI: random, skewed, and escaping XCI. In the original study, we focused on the SNP-disease association test but did not provide knowledge about the underlying XCI models. One can use the highest likelihood ratio (LLR) to select XCI models (the max-LLR approach). However, that approach does not formally compare the LLRs corresponding to different XCI models to assess whether the models are distinguishable. We therefore propose an LLR comparison procedure (the comp-LLR approach), inspired by the Cox test, to formally compare the LLRs of different XCI models and select the most likely XCI model describing the underlying XCI process. We conduct simulation studies to investigate the max-LLR and comp-LLR approaches. The simulation results show that, compared with the max-LLR approach, the comp-LLR approach has a higher probability of identifying the correct underlying XCI model when the underlying XCI process is random XCI, escaping XCI, or XCI skewed toward the deleterious allele. We apply both approaches to a head and neck cancer genetic study to investigate the underlying XCI processes for the X-chromosomal genetic variants.
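Schematically, the two selection strategies differ as below (a Vuong/Cox-style standardization of the log-likelihood difference is used for illustration; it is not the exact comp-LLR construction):

```python
import numpy as np
from scipy import stats

def max_llr_select(loglik):
    """max-LLR: pick the XCI model ('random', 'skewed', 'escaping', ...)
    with the largest maximized log-likelihood."""
    return max(loglik, key=loglik.get)

def comp_llr(ll_a, ll_b):
    """Cox-style comparison of two non-nested XCI models from their
    per-subject log-likelihood contributions: a non-significant z
    suggests the two models are not distinguishable on these data."""
    d = np.asarray(ll_a) - np.asarray(ll_b)
    z = d.sum() / (np.std(d, ddof=1) * np.sqrt(len(d)))
    return z, 2 * stats.norm.sf(abs(z))
```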


2021 · Author(s): Nastaran Maus Esfahani, Daniel Catchpoole, Javed Khan, Paul J. Kennedy

Background: Copy number variants (CNVs) are gains or losses of DNA segments in the genome. Studies have shown that CNVs are linked to various disorders, including autism, intellectual disability, and schizophrenia. Consequently, interest in studying possible associations between CNVs and specific disease traits is growing. However, due to the specific multi-dimensional characteristics of CNVs, methods for testing the association between CNVs and disease-related traits are still underdeveloped. In this paper, we propose a novel multi-dimensional CNV kernel association test (MCKAT), which aims to find significant associations between CNVs and disease-related traits using kernel-based methods.

Results: We address the multi-dimensionality of CNV characteristics. We first design a single-pair CNV kernel, containing three sub-kernels, to summarize the similarity between two CNVs across all CNV characteristics. We then aggregate the single-pair CNV kernels into a whole-chromosome CNV kernel, which summarizes the similarity between the CNVs in two or more chromosomes. Finally, the association between CNVs and disease-related traits is evaluated by comparing similarity in the trait with the kernel-based similarity, using a score test in a random effects model. We apply MCKAT to genome-wide CNV datasets to examine associations between CNVs and disease-related traits, which demonstrates the potential usefulness of the proposed method for CNV association testing. We compare the performance of MCKAT with that of CKAT, a uni-dimensional kernel method. Based on the results, MCKAT yields stronger evidence (smaller p-values) for significant associations between CNVs and disease-related traits in both rare and common CNV datasets.

Conclusion: The multi-dimensional copy number variant kernel association test can detect CNVs significantly associated with disease-related traits. MCKAT can help biologists detect such CNVs across a patient group instead of examining the CNVs case by case in each subject.
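To illustrate the single-pair kernel, here is one plausible combination of three sub-kernels for position, length, and dosage similarity (the weights and the exact sub-kernel forms are assumptions, not the MCKAT definitions):

```python
def cnv_pair_kernel(c1, c2, w=(1/3, 1/3, 1/3)):
    """Similarity between two CNVs c = (start, end, copy_number):
    a weighted sum of overlap, length, and dosage sub-kernels."""
    s1, e1, cn1 = c1
    s2, e2, cn2 = c2
    overlap = max(0.0, min(e1, e2) - max(s1, s2))
    union = max(e1, e2) - min(s1, s2)
    k_pos = overlap / union if union > 0 else 0.0
    len1, len2 = e1 - s1, e2 - s2
    k_len = min(len1, len2) / max(len1, len2) if max(len1, len2) > 0 else 1.0
    k_dose = 1.0 if cn1 == cn2 else 0.0          # gain/loss state match
    return w[0] * k_pos + w[1] * k_len + w[2] * k_dose
```

Aggregating these pairwise values over all CNVs carried by two subjects would then give the whole-chromosome kernel that enters the score test.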


Symmetry · 2021 · Vol 13 (11) · pp. 2164 · Author(s): Héctor J. Gómez, Diego I. Gallardo, Karol I. Santoro

In this paper, we present an extension of the truncated positive normal (TPN) distribution for modeling positive data with high kurtosis. The new model is defined as the quotient of two random variables: a TPN variate in the numerator and a power of a standard uniform variate in the denominator. The resulting model has greater kurtosis than the TPN distribution. We study properties of the distribution, such as its moments, asymmetry, and kurtosis. Parameter estimation is based on the method of moments, and maximum likelihood estimation is carried out with the expectation-maximization algorithm. We perform simulation studies to assess parameter recovery and illustrate the model with a real data application related to body weight. The computational implementation of this work is included in the tpn package for R.
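A quick way to simulate from this construction (the exponent parameterization 1/q is an assumption for illustration):

```python
import numpy as np
from scipy import stats

def rtpn_slash(n, mu, sigma, q, seed=None):
    """Draw Y = X / U**(1/q) with X ~ TPN(mu, sigma) (normal truncated
    to the positive axis) and U ~ Uniform(0, 1); smaller q produces
    heavier tails (higher kurtosis) than the TPN itself."""
    rng = np.random.default_rng(seed)
    a = -mu / sigma                       # standardized lower bound at 0
    x = stats.truncnorm.rvs(a, np.inf, loc=mu, scale=sigma,
                            size=n, random_state=rng)
    u = rng.uniform(size=n)
    return x / u**(1.0 / q)
```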


Author(s): Xiaozhou Wang, Xi Chen, Qihang Lin, Weidong Liu

The performance of clustering depends on an appropriately defined similarity between two items. When similarity is measured based on human perception, human workers are often employed to estimate similarity scores between items in order to support clustering, a procedure called crowdsourced clustering. Because a monetary reward is paid to a worker for each similarity score, and because pairwise similarities and workers' reliability are highly diverse, it is critical under a limited budget to wisely assign pairs of items to different workers to optimize the clustering result. We model this budget allocation problem as a Markov decision process in which item pairs are dynamically assigned to workers based on the historical similarity scores they have provided. We propose an optimistic knowledge gradient policy in which the assignment of items at each stage is based on the minimum-weight K-cut defined on a similarity graph. We provide simulation studies and real data analysis to demonstrate the performance of the proposed method.
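A schematic of the per-pair index such a policy might use (Beta posteriors on pair similarities; the real policy additionally couples pairs through the minimum-weight K-cut on the similarity graph):

```python
def optimistic_kg_index(a, b):
    """Optimistic knowledge-gradient style index for a pair whose
    similarity has a Beta(a, b) posterior: the more favorable of the
    two possible one-step posterior-mean changes."""
    mean = a / (a + b)
    up = (a + 1) / (a + b + 1)    # posterior mean if the next label is 1
    down = a / (a + b + 1)        # posterior mean if the next label is 0
    return max(abs(up - mean), abs(down - mean))

def next_pair(posteriors):
    """Assign the next worker to the pair with the largest index;
    posteriors maps pair -> (a, b)."""
    return max(posteriors, key=lambda p: optimistic_kg_index(*posteriors[p]))
```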


Author(s): Nada M. Alhakkak

BigGIS is a new product that resulted from developing GIS for the "Big Data" area; it stores and processes big geographical data and helps solve the associated issues. This chapter describes an optimized big GIS framework in a MapReduce environment, M2BG. The suggested framework is integrated into the MapReduce environment in order to solve storage issues and benefit from the Hadoop environment. M2BG includes two steps: a big GIS warehouse and big GIS MapReduce. The first step contains three main layers: the Data Source and Storage Layer (DSSL), the Data Processing Layer (DPL), and the Data Analysis Layer (DAL). The second step is responsible for clustering, using swarms as inputs for the Hadoop phase; work is then scheduled in the map part with a preemptive priority scheduling algorithm, with some data types classified as critical and the rest as ordinary, while the reduce part uses a merge sort algorithm. M2BG is intended to address security and to be implemented with real data, first in a simulated environment and later in the real world.
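As a toy illustration of the scheduling and reduce steps described above (two assumed priority classes; this is not the chapter's implementation):

```python
from heapq import heappush, heappop, merge

def schedule_map_tasks(tasks):
    """Map-phase sketch: serve 'critical' records (priority 0) before
    'ordinary' ones (priority 1); the index i keeps ties stable."""
    heap = []
    for i, (priority, record) in enumerate(tasks):
        heappush(heap, (priority, i, record))
    while heap:
        yield heappop(heap)[2]

def reduce_merge(sorted_runs):
    """Reduce-phase sketch: merge already-sorted partitions,
    i.e., the merge step of merge sort."""
    return list(merge(*sorted_runs))
```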


Stats · 2019 · Vol 2 (1) · pp. 111-120 · Author(s): Dewi Rahardja

We construct point and interval estimators, using a Bayesian approach, for the difference of two population proportions based on two independent samples of binomial data subject to one type of misclassification. Specifically, we derive an easy-to-implement, closed-form algorithm for drawing from the posterior distributions. For illustration, we apply our algorithm to a real data example. Finally, we conduct simulation studies to demonstrate the efficiency of our algorithm for Bayesian inference.
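A minimal sketch of the Monte Carlo side of such an analysis (uniform Beta(1, 1) priors; the misclassification adjustment from the paper is omitted, so this illustrates only the no-misclassification special case):

```python
import numpy as np

def posterior_diff_draws(x1, n1, x2, n2, ndraw=10_000, seed=None):
    """Posterior draws of p1 - p2 for two independent binomial samples
    under Beta(1, 1) priors (conjugate, so the draws are closed-form)."""
    rng = np.random.default_rng(seed)
    p1 = rng.beta(x1 + 1, n1 - x1 + 1, size=ndraw)
    p2 = rng.beta(x2 + 1, n2 - x2 + 1, size=ndraw)
    return p1 - p2

draws = posterior_diff_draws(40, 100, 25, 100, seed=1)
print(draws.mean(), np.percentile(draws, [2.5, 97.5]))  # estimate and 95% CrI
```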

