gamboostLSS: An R Package for Model Building and Variable Selection in the GAMLSS Framework

Background Model building is a crucial part of omics-based biomedical research, used to transfer classifications and obtain insights into underlying mechanisms. Feature selection is often based on minimizing the error between model predictions and a given classification (maximizing accuracy). Human ratings/classifications, however, may be error-prone, with discordance rates between experts of 5–15%. We therefore evaluate whether a feature pre-filtering step improves identification of features associated with the true underlying groups. Methods Data were simulated for up to 100 samples and up to 10,000 features, 10% of which were associated with the ground truth comprising 2–10 normally distributed populations. Binary and semi-quantitative ratings with varying error probabilities were used as classifications. For feature preselection, standard cross-validation (V2) was compared to a novel heuristic (V1) applying univariate testing, multiplicity adjustment and cross-validation on switched dependent (classification) and independent (features) variables. Preselected features were used to train logistic regression/linear models (backward selection, AIC). Predictions were compared against the ground truth (ROC, multiclass-ROC). As a use case, multiple feature selection/classification methods were benchmarked against the novel heuristic to identify prognostically different G-CIMP negative glioblastoma tumors from the TCGA-GBM 450k methylation array data cohort, starting from a fuzzy UMAP-based rough and erroneous separation. Results V1 yielded higher median AUC ranks for two true groups (ground truth), with smaller differences for true graduated differences (3–10 groups). Lower fractions of models were successfully fit with V1. Median AUCs for binary classification and two true groups were 0.91 (range: 0.54–1.00) for V1 (Benjamini-Hochberg) and 0.70 (range: 0.28–1.00) for V2; 13% (n = 616) of V2 models showed AUCs ≤ 50% for 25 samples and 100 features. For larger numbers of features and samples, median AUCs were 0.75 (range: 0.59–1.00) for V1 and 0.54 (range: 0.32–0.75) for V2. In the TCGA-GBM data, modelBuildR allowed the best prognostic separation of patients, with the highest median overall survival difference (7.51 months), followed by a difference of 6.04 months for a random forest-based method. Conclusions The proposed heuristic is beneficial for the retrieval of features associated with two true groups classified with errors. We provide the R package modelBuildR to simplify (comparative) evaluation/application of the proposed heuristic (http://github.com/mknoll/modelBuildR).
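
The multiplicity-adjustment step named in the abstract (Benjamini-Hochberg) can be illustrated with a short Python sketch. It is only an illustration of the standard Benjamini-Hochberg procedure applied to a list of per-feature p-values; it is not the modelBuildR implementation, and the function names `benjamini_hochberg` and `prefilter` as well as the `alpha` threshold are assumptions made for this example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def prefilter(p_values, alpha=0.05):
    """Hypothetical helper: indices of features with adjusted p <= alpha."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(adjusted) if q <= alpha]
```

In a pre-filtering workflow of this kind, the p-values would come from one univariate test per feature, and only the features returned by `prefilter` would be passed on to model fitting.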


Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2018-0065 ◽

2019 ◽

Vol 18 (6) ◽

Author(s):

Oliver M. Crook ◽

Laurent Gatto ◽

Paul D. W. Kirk

Keyword(s):

Variable Selection ◽

Dirichlet Process ◽

Bayesian Model ◽

Bayesian Model Averaging ◽

Model Averaging ◽

R Package ◽

The Cancer Genome Atlas ◽

Fast Method ◽

Model Based Clustering ◽

Pan Cancer

Abstract The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang & Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS) problem and avoiding the use of computationally costly Markov chain Monte Carlo methods. Here we consider how this approach may be extended to permit variable selection for clustering, and also demonstrate the benefits of Bayesian model averaging (BMA) in place of BMS. Through an array of simulation examples and well-studied examples from cancer transcriptomics, we show that our method performs competitively with the current state-of-the-art, while also offering computational benefits. We apply our approach to reverse-phase protein array (RPPA) data from The Cancer Genome Atlas (TCGA) in order to perform a pan-cancer proteomic characterisation of 5157 tumour samples. We have implemented our approach, together with the original SUGS algorithm, in an open-source R package named sugsvarsel, which accelerates analysis by performing intensive computations in C++ and provides automated parallel processing. The R package is freely available from: https://github.com/ococrook/sugsvarsel
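
The idea of replacing MCMC with sequential, greedy allocation can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the sugsvarsel implementation: it assumes one-dimensional data, a Gaussian likelihood with known variance `sigma2`, a conjugate normal prior on cluster means (`mu0`, `tau2`), and a fixed concentration parameter `alpha`. It omits the model-selection, model-averaging and variable-selection steps the abstract describes, and all names are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def greedy_dp_clustering(data, alpha=1.0, sigma2=1.0, mu0=0.0, tau2=100.0):
    """Assign each point, in order, to its most probable cluster (assumed model)."""
    labels, counts, sums = [], [], []
    for i, x in enumerate(data):
        scores = []
        # Existing clusters: DP prior weight times posterior predictive density.
        for n_k, s_k in zip(counts, sums):
            post_var = 1.0 / (1.0 / tau2 + n_k / sigma2)
            post_mean = post_var * (mu0 / tau2 + s_k / sigma2)
            weight = n_k / (alpha + i)
            scores.append(weight * normal_pdf(x, post_mean, post_var + sigma2))
        # New cluster: weight alpha / (alpha + i) times prior predictive density.
        scores.append(alpha / (alpha + i) * normal_pdf(x, mu0, tau2 + sigma2))
        best = max(range(len(scores)), key=lambda k: scores[k])
        if best == len(counts):
            # Open a new cluster for this point.
            counts.append(0)
            sums.append(0.0)
        counts[best] += 1
        sums[best] += x
        labels.append(best)
    return labels
```

Because each point is allocated once and never revisited, the cost is a single pass over the data, which is the source of the speed advantage over sampling-based inference; the price is that the result depends on the order of the observations.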


NonpModelCheck: An R Package for Nonparametric Lack-of-Fit Testing and Variable Selection

Journal of Statistical Software ◽

10.18637/jss.v077.i10 ◽

2017 ◽

Vol 77 (10) ◽

Cited By ~ 1

Author(s):

Adriano Zanin Zambom ◽

Michael G. Akritas

Keyword(s):

Variable Selection ◽

R Package ◽

Lack Of Fit ◽

Fit Testing


BayICE: A hierarchical Bayesian deconvolution model with stochastic search variable selection

10.1101/732743 ◽

2019 ◽

Author(s):

An-Shun Tai ◽

George C. Tseng ◽

Wen-Ping Hsieh

Keyword(s):

Gene Expression ◽

Variable Selection ◽

Immune Cell ◽

Expression Profiles ◽

Gene Expression Profiles ◽

R Package ◽

Stochastic Search ◽

Hierarchical Bayesian ◽

Stochastic Search Variable Selection ◽

Search Variable

Abstract Gene expression deconvolution is a powerful tool for exploring the microenvironment of complex tissues composed of multiple cell groups using transcriptomic data. Characterizing cell activities for a particular condition has been regarded as a primary mission against diseases. For example, cancer immunology aims to clarify the role of the immune system in the progression and development of cancer by analyzing the immune cell components of tumors. To that end, many deconvolution methods have been proposed for inferring cell subpopulations within tissues. Nevertheless, two problems limit the practicality of current approaches. First, all approaches use external purified data to preselect cell type-specific genes that contribute to deconvolution. However, some types of cells cannot be found in purified profiles, and the genes specifically over- or under-expressed in them cannot be identified. This is particularly a problem in cancer studies. Hence, a preselection strategy that is independent of deconvolution is inappropriate. The second problem is that existing approaches do not recover the expression profiles of unknown cells present in bulk tissues, which results in biased estimation of unknown cell proportions. Furthermore, it causes the shift-invariant property of deconvolution to fail, which then affects the estimation performance. To address these two problems, we propose a novel deconvolution approach, BayICE, which employs hierarchical Bayesian modeling with stochastic search variable selection. We develop a comprehensive Markov chain Monte Carlo procedure through Gibbs sampling to estimate cell proportions, gene expression profiles, and signature genes. Simulation and validation studies illustrate that BayICE outperforms existing deconvolution approaches in estimating cell proportions. Subsequently, we demonstrate an application of BayICE to RNA sequencing data from patients with non-small cell lung cancer. The model is implemented in the R package “BayICE” and the algorithm is available for download.


RavenR v2.1.4: an open source R package to support flexible hydrologic modelling

10.5194/gmd-2021-336 ◽

2021 ◽

Author(s):

Robert Chlumsky ◽

James R. Craig ◽

Simon G. M. Lin ◽

Sarah Grass ◽

Leland Scantlebury ◽

...

Keyword(s):

Model Building ◽

R Package ◽

Learning Curves ◽

Use Cases ◽

Hydrologic Models ◽

Hydrologic Modelling ◽

Modelling Framework ◽

Building Process ◽

Model Configuration ◽

Modelling Studies

Abstract. In recent decades, advances in the flexibility and complexity of hydrologic models have enhanced their utility in scientific studies and practice alike. However, the increasing complexity of these tools leads to a number of challenges, including steep learning curves for new users and difficulties in reproducing modelling studies. Here, we present the RavenR package, an R package that leverages the power of scripting to both enhance the usability of the Raven hydrologic modelling framework and provide complementary analyses that are useful for modellers. The RavenR package contains functions that may be useful in each step of the model-building process, particularly for preparing input files and analyzing model outputs, and these tools may be useful even for non-Raven users. The utility of the RavenR package is demonstrated through six use cases for a model of the Liard River basin in Canada. These use cases provide examples of visually reviewing the model configuration, preparing input files for observation and forcing data, simplifying the model discretization, performing reality checks on the model output, and evaluating the performance of the model. All of the use cases are fully reproducible, with additional reproducible examples of RavenR functions included with the package distribution itself. It is anticipated that the RavenR package will continue to evolve with the Raven project and will provide a useful tool to new and experienced users of Raven alike.


MicroBVS: Dirichlet-tree multinomial regression models with Bayesian variable selection - an R package

BMC Bioinformatics ◽

10.1186/s12859-020-03640-0 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 1

Author(s):

Matthew D. Koslovsky ◽

Marina Vannucci

Keyword(s):

Variable Selection ◽

Compositional Data ◽

Human Microbiome ◽

R Package ◽

Bayesian Variable Selection ◽

Multinomial Regression ◽

Phylogenetic Structure ◽

Prior Probabilities ◽

Abundance Data ◽

Model Selection Uncertainty

Abstract Background Understanding the relation between the human microbiome and modulating factors, such as diet, may help researchers design intervention strategies that promote and maintain healthy microbial communities. Numerous analytical tools are available to help identify these relations, oftentimes via automated variable selection methods. However, available tools frequently ignore evolutionary relations among microbial taxa, potential relations between modulating factors, as well as model selection uncertainty. Results We present MicroBVS, an R package for Dirichlet-tree multinomial models with Bayesian variable selection, for the identification of covariates associated with microbial taxa abundance data. The underlying Bayesian model accommodates phylogenetic structure in the abundance data and various parameterizations of covariates’ prior probabilities of inclusion. Conclusion While developed to study the human microbiome, our software can be employed in various research applications, where the aim is to generate insights into the relations between a set of covariates and compositional data with or without a known tree-like structure.
