A joint regression modeling framework for analyzing bivariate binary data in R

2017 ◽  
Vol 5 (1) ◽  
pp. 268-294 ◽  
Author(s):  
Giampiero Marra ◽  
Rosalba Radice

Abstract We discuss some of the features of the R add-on package GJRM, which implements a flexible joint modeling framework for fitting a number of multivariate response regression models under various sampling schemes. In particular, we focus on the case in which the user wishes to fit bivariate binary regression models in the presence of several forms of selection bias. The framework allows for Gaussian and non-Gaussian dependencies through the use of copulae, and for the association and mean parameters to depend on flexible functions of covariates. We describe some of the methodological details underpinning the bivariate binary models implemented in the package and illustrate them by fitting interpretable models of different complexity on three data sets.
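The likelihood behind a bivariate binary joint model is built from the four joint response probabilities implied by the copula. A minimal numpy/scipy sketch of the Gaussian-copula (bivariate probit) special case follows; it is an illustration of the idea, not the GJRM implementation, and all coefficient values are made up:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)
n = 250
x = rng.normal(size=n)

# Simulate from a bivariate probit: latent errors are bivariate normal,
# i.e. the Gaussian-copula special case with probit margins.
rho_true = 0.5
e = rng.multivariate_normal([0, 0], [[1, rho_true], [rho_true, 1]], size=n)
y1 = (0.3 + 0.8 * x + e[:, 0] > 0).astype(int)
y2 = (-0.2 + 0.6 * x + e[:, 1] > 0).astype(int)

def cell_probs(eta1, eta2, rho):
    """Joint probabilities of the four (y1, y2) outcomes."""
    pts = np.column_stack([eta1, eta2])
    p11 = np.atleast_1d(multivariate_normal.cdf(
        pts, mean=[0, 0], cov=[[1, rho], [rho, 1]]))
    p1 = norm.cdf(eta1)   # marginal P(y1 = 1)
    p2 = norm.cdf(eta2)   # marginal P(y2 = 1)
    return p11, p1 - p11, p2 - p11, 1 - p1 - p2 + p11

def nll(theta):
    b10, b11, b20, b21, z = theta
    rho = np.tanh(z)      # keep the association parameter in (-1, 1)
    p11, p10, p01, p00 = cell_probs(b10 + b11 * x, b20 + b21 * x, rho)
    p = (y1 * y2 * p11 + y1 * (1 - y2) * p10
         + (1 - y1) * y2 * p01 + (1 - y1) * (1 - y2) * p00)
    return -np.sum(np.log(np.clip(p, 1e-12, None)))

res = minimize(nll, np.zeros(5), method="BFGS")
rho_hat = np.tanh(res.x[4])   # estimated copula association
```

GJRM additionally allows non-Gaussian copulae and lets both the mean and association parameters depend on smooth functions of covariates; this sketch keeps everything linear with a constant association.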

mBio ◽  
2020 ◽  
Vol 11 (4) ◽  
Author(s):  
John A. Lees ◽  
T. Tien Mai ◽  
Marco Galardini ◽  
Nicole E. Wheeler ◽  
Samuel T. Horsfield ◽  
...  

ABSTRACT Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations create challenges. We introduce an alignment-free method which finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. The variants selected by our joint modeling approach overlap substantially with those of previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold. IMPORTANCE Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis.
These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances methodology for both association and generation of portable prediction models.
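The core modeling step, regressing a phenotype jointly on all variants with an elastic net penalty, can be sketched with scikit-learn on a synthetic unitig presence/absence matrix. This is an illustrative stand-in for the authors' method, not their implementation; the causal indices and effect sizes are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_unitigs = 300, 200
# Presence/absence calls for each unitig (rows: isolates).
X = rng.binomial(1, 0.3, size=(n_samples, n_unitigs)).astype(float)

causal = [3, 50, 120]                      # illustrative causal unitigs
logit = -1.0 + X[:, causal] @ np.array([2.0, 2.0, -2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # binary phenotype

# Joint model over all unitigs at once, with an elastic net penalty
# (mix of L1 sparsity and L2 shrinkage) rather than per-variant tests.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=0.5, max_iter=5000)
enet.fit(X, y)

selected = np.flatnonzero(enet.coef_[0])   # unitigs retained by the model
```

The fitted model serves both purposes described in the abstract: the nonzero coefficients identify associated loci, and `enet.predict` gives phenotype predictions for new isolates.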


2016 ◽  
Vol 25 (4) ◽  
pp. 1661-1676 ◽  
Author(s):  
Edmund N Njagi ◽  
Geert Molenberghs ◽  
Dimitris Rizopoulos ◽  
Geert Verbeke ◽  
Michael G Kenward ◽  
...  

2017 ◽  
Vol 16 (06) ◽  
pp. 1707-1727 ◽  
Author(s):  
Morteza Mashayekhi ◽  
Robin Gras

Decision trees are examples of easily interpretable models whose predictive accuracy is normally low. In comparison, decision tree ensembles (DTEs) such as random forest (RF) exhibit high predictive accuracy while being regarded as black-box models. We propose three new rule extraction algorithms from DTEs. The RF[Formula: see text]DHC method, a hill climbing method with downhill moves (DHC), is used to search for a rule set that decreases the number of rules dramatically. In the RF[Formula: see text]SGL and RF[Formula: see text]MSGL methods, the sparse group lasso (SGL) method and the multiclass SGL (MSGL) method are employed, respectively, to find a sparse weight vector corresponding to the rules generated by RF. Experimental results with 24 data sets show that the proposed methods outperform similar state-of-the-art methods, in terms of human comprehensibility, by greatly reducing the number of rules and limiting the number of antecedents in the retained rules, while preserving the same level of accuracy.
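The general recipe, turning forest rules into binary features and then sparsely re-weighting them, can be sketched in scikit-learn. Since scikit-learn offers no (multiclass) sparse group lasso, a plain L1-penalized logistic regression stands in for SGL/MSGL here, and tree leaves stand in for the extracted rules; this is an illustration of the idea, not the paper's algorithms:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# A small forest of shallow trees; each leaf corresponds to a
# conjunction of antecedents, i.e. one candidate rule.
rf = RandomForestClassifier(n_estimators=20, max_depth=3,
                            random_state=0).fit(X, y)
leaves = rf.apply(X)                     # (n_samples, n_trees) leaf ids
R = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves)

# Sparse re-weighting of the rules: the L1 penalty drives most rule
# weights to exactly zero, leaving a small, comprehensible rule set.
lasso = LogisticRegression(penalty="l1", solver="liblinear",
                           C=0.1).fit(R, y)
n_rules_kept = int(np.count_nonzero(lasso.coef_))
```

Shallow trees keep the number of antecedents per retained rule small, which is the comprehensibility criterion the paper emphasizes.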


2006 ◽  
Vol 42 (8-9) ◽  
pp. 617-638 ◽  
Author(s):  
Nikolaj Tatti

2021 ◽  
Vol 72 ◽  
pp. 901-942
Author(s):  
Aliaksandr Hubin ◽  
Geir Storvik ◽  
Florian Frommlet

Regression models are used in a wide range of applications, providing a powerful scientific tool for researchers from different fields. Linear, or simple parametric, models are often not sufficient to describe complex relationships between input variables and a response. Such relationships can be better described through flexible approaches such as neural networks, but this results in less interpretable models and potential overfitting. Alternatively, specific parametric nonlinear functions can be used, but the specification of such functions is in general complicated. In this paper, we introduce an approach for constructing and selecting highly flexible nonlinear parametric regression models. Nonlinear features are generated hierarchically, similarly to deep learning, but with additional flexibility in the types of features considered. This flexibility, combined with variable selection, allows us to find a small set of important features and thereby more interpretable models. Within the space of possible functions, a Bayesian approach, introducing priors for functions based on their complexity, is considered. A genetically modified mode jumping Markov chain Monte Carlo algorithm is adopted to perform Bayesian inference and estimate posterior probabilities for model averaging. In various applications, we illustrate how our approach is used to obtain meaningful nonlinear models. Additionally, we compare its predictive performance with several machine learning algorithms.
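The two ingredients, hierarchical generation of nonlinear features and complexity-penalized selection, can be illustrated with a deliberately crude numpy sketch: one layer of generated transforms and greedy forward selection under BIC, standing in for the paper's prior-based Bayesian inference and genetically modified MJMCMC search. All function names and data are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x1, x2 = rng.normal(size=(2, n))
y = 1.5 * np.sin(x1) + 0.5 * x2**2 + rng.normal(scale=0.3, size=n)

# Layer 1: raw inputs; layer 2: nonlinear transforms of layer-1 features.
feats = {"x1": x1, "x2": x2}
for name in list(feats):
    feats[f"sin({name})"] = np.sin(feats[name])
    feats[f"{name}^2"] = feats[name] ** 2

def bic(cols):
    """BIC of an OLS fit on the given feature columns (plus intercept)."""
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + X.shape[1] * np.log(n)

# Greedy forward selection: add the feature that lowers BIC most, stop
# when nothing improves it, so complexity is penalized as in the paper's
# priors, yielding a small, interpretable set of nonlinear features.
selected, pool = [], dict(feats)
while pool:
    current = bic([feats[s] for s in selected])
    scores = {k: bic([feats[s] for s in selected] + [v])
              for k, v in pool.items()}
    best = min(scores, key=scores.get)
    if scores[best] >= current:
        break
    selected.append(best)
    del pool[best]
```

With the signal above, the search recovers a compact model built from the generated features rather than a black-box fit, which is the interpretability argument of the abstract.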


2020 ◽  
Vol 16 (12) ◽  
pp. e1008473
Author(s):  
Pamela N. Luna ◽  
Jonathan M. Mansbach ◽  
Chad A. Shaw

Changes in the composition of the microbiome over time are associated with myriad human illnesses. Unfortunately, the lack of analytic techniques has hindered researchers’ ability to quantify the association between longitudinal microbial composition and time-to-event outcomes. Prior methodological work developed the joint model for longitudinal and time-to-event data to incorporate time-dependent biomarker covariates into the hazard regression approach to disease outcomes. The original implementation of this joint modeling approach employed a linear mixed effects model to represent the time-dependent covariates. However, when the distribution of the time-dependent covariate is non-Gaussian, as is the case with microbial abundances, researchers require different statistical methodology. We present a joint modeling framework that uses a negative binomial mixed effects model to determine longitudinal taxon abundances. We incorporate these modeled microbial abundances into a hazard function with a parameterization that not only accounts for the proportional nature of microbiome data, but also generates biologically interpretable results. Herein we demonstrate the performance improvements of our approach over existing alternatives via simulation as well as a previously published longitudinal dataset studying the microbiome during pregnancy. The results demonstrate that our joint modeling framework for longitudinal microbiome count data provides a powerful methodology to uncover associations between changes in microbial abundances over time and the onset of disease. This method offers the potential to equip researchers with a deeper understanding of the associations between longitudinal microbial composition changes and disease outcomes. This new approach could potentially lead to new diagnostic biomarkers or inform clinical interventions to help prevent or treat disease.
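The count-model component can be sketched by fitting a negative binomial regression of taxon counts on time by direct maximum likelihood. This sketch uses fixed effects only (no subject-level random effects) and omits the hazard component entirely, so it is a simplified stand-in for the paper's joint model; all parameter values are invented:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(0)
t = np.tile(np.arange(5.0), 60)              # 60 subjects x 5 visits
mu_true = np.exp(2.0 + 0.3 * t)              # taxon abundance trend
k_true = 2.0                                 # NB dispersion
counts = rng.negative_binomial(k_true, k_true / (k_true + mu_true))

def nb_nll(theta):
    """Negative log-likelihood of a fixed-effects NB regression."""
    b0, b1, log_k = theta
    mu = np.exp(b0 + b1 * t)
    k = np.exp(log_k)
    return -np.sum(gammaln(counts + k) - gammaln(k) - gammaln(counts + 1)
                   + k * np.log(k / (k + mu))
                   + counts * np.log(mu / (k + mu)))

res = minimize(nb_nll, x0=[0.0, 0.0, 0.0], method="Nelder-Mead",
               options={"maxiter": 3000, "xatol": 1e-8, "fatol": 1e-8})
b0_hat, b1_hat, log_k_hat = res.x
```

The extra dispersion parameter is what lets the negative binomial accommodate overdispersed microbial counts that a Gaussian (or Poisson) submodel would misrepresent; in the full joint model the fitted abundance trajectory then enters the hazard function as a time-dependent covariate.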


2019 ◽  
Author(s):  
John A. Lees ◽  
T. Tien Mai ◽  
Marco Galardini ◽  
Nicole E. Wheeler ◽  
Jukka Corander

ABSTRACT Discovery of influential genetic variants and prediction of phenotypes such as antibiotic resistance are becoming routine tasks in bacterial genomics. Genome-wide association study (GWAS) methods can be applied to study bacterial populations, with a particular emphasis on alignment-free approaches, which are necessitated by the more plastic nature of bacterial genomes. Here we advance bacterial GWAS by introducing a computationally scalable joint modeling framework, where genetic variants covering the entire pangenome are compactly represented by unitigs, and the model fitting is achieved using elastic net penalization. In contrast to current leading GWAS approaches, which test each genotype-phenotype association separately for each variant, our joint modeling approach is shown to lead to increased statistical power while maintaining control of the false positive rate. Our inference procedure also delivers an estimate of the narrow-sense heritability, which is gaining considerable interest in studies of bacteria. Using an extensive set of state-of-the-art bacterial population genomic data sets we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. We expect that these advances will pave the way for the next generation of high-powered association and prediction studies for an increasing number of bacterial species.


1985 ◽  
Vol 65 (1) ◽  
pp. 109-122 ◽  
Author(s):  
L. M. DWYER ◽  
H. N. HAYHOE

Estimates of monthly soil temperatures under short-grass cover across Canada using a macroclimatic model (Ouellet 1973a) were compared to monthly averages of soil temperatures monitored over winter at Ottawa between November 1959 and April 1981. Although the fit between monthly estimates and Ottawa observations was generally good (R for all months and depths 0.10, 0.20, 0.50, 1.00 and 1.50 m was 0.90), it was noted that midwinter estimates were generally below observed temperatures at all soil depths. Data sets used in the development of the original Ouellet (1973a) multiple regression equations were collected from stations across Canada, many of which have reduced snow cover. It was found that the buffering capability of the snow cover accumulated at Ottawa during the winter months was underestimated by the pertinent partial regression coefficients in these equations. The coefficients were therefore modified for the Ottawa station during the winter months. The resultant regression models were used to estimate soil temperature during the winters of 1981–1982 and 1982–1983. Although the Ottawa-based models included fewer variables because of the smaller data base available from a single site, comparisons of model estimates and observations were good (R = 0.84 and 0.91) and midwinter estimates were not consistently underestimated as they were using the original Ouellet (1973a) model. Reliable monthly estimates of soil temperatures are important since they are a necessary input to more detailed predictive models of daily soil temperatures. Key words: Regression model, snowcover, stepwise regression, variable selection
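The recalibration step described above, refitting partial regression coefficients on a single station's record, amounts to ordinary least squares on the local observations. A minimal numpy sketch on synthetic data (the predictors and coefficient values are illustrative only, not the Ouellet model):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 120
air_temp = rng.normal(-5.0, 6.0, n)        # monthly mean air temperature (C)
snow = np.abs(rng.normal(20.0, 10.0, n))   # snow depth (cm)

# Synthetic "observed" soil temperatures in which snow strongly buffers
# the air-temperature signal (coefficients are invented for illustration).
soil = 0.4 * air_temp + 0.08 * snow + rng.normal(0.0, 0.5, n)

# Refit the partial regression coefficients on the local record alone,
# as was done for the Ottawa station during the winter months.
X = np.column_stack([np.ones(n), air_temp, snow])
coef, *_ = np.linalg.lstsq(X, soil, rcond=None)
```

A coefficient on snow depth fitted to a snow-rich site will exceed one fitted to a pooled national data set with many low-snow stations, which is the underestimation the paper corrects.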


2021 ◽  
pp. 096228022110605
Author(s):  
Ujjwal Das ◽  
Ranojoy Basu

We consider partially observed binary matched-pair data. We assume that the incompletely observed subjects are missing at random. Within this missingness framework, we propose an EM-algorithm-based approach to construct an interval estimator of the proportion difference incorporating all the subjects. In conjunction with our proposed method, we also present two improvements to the interval estimator through correction factors. The performance of the three competing methods is then evaluated through extensive simulation. Recommendations are given based on each method's ability to preserve the type-I error rate for various sample sizes. Finally, the methods are illustrated on two real-world data sets. An R function is developed to implement the three proposed methods.
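One plausible EM scheme for this setting distributes each partially classified subject across the compatible cells of the 2x2 table in proportion to the current cell probabilities. The numpy sketch below illustrates that E/M cycle and the resulting proportion-difference point estimate; the counts are invented, and the paper's interval construction and correction factors are not reproduced:

```python
import numpy as np

# Complete pairs: rows index the first response (1, 0), columns the second.
n = np.array([[40.0, 10.0],
              [5.0, 45.0]])
r1 = np.array([12.0, 8.0])   # only the first response observed: [y1=1, y1=0]
r2 = np.array([6.0, 9.0])    # only the second response observed: [y2=1, y2=0]
N = n.sum() + r1.sum() + r2.sum()

p = np.full((2, 2), 0.25)    # initial cell probabilities
for _ in range(500):
    # E-step: split each partially classified subject across the two
    # compatible cells in proportion to the current cell probabilities
    # (valid under the missing-at-random assumption).
    e = n.copy()
    e[0] += r1[0] * p[0] / p[0].sum()
    e[1] += r1[1] * p[1] / p[1].sum()
    e[:, 0] += r2[0] * p[:, 0] / p[:, 0].sum()
    e[:, 1] += r2[1] * p[:, 1] / p[:, 1].sum()
    # M-step: complete-data MLE of the cell probabilities.
    p_new = e / N
    if np.abs(p_new - p).max() < 1e-12:
        p = p_new
        break
    p = p_new

# Proportion difference P(y1=1) - P(y2=1) = p10 - p01.
delta = p[0, 1] - p[1, 0]
```

Because every subject contributes, including the incompletely observed ones, the estimate uses all the data rather than discarding incomplete pairs.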

