GPSeqClus: an r package for sequential clustering of animal location data for model building, model application, and field site investigations

Background Model building is a crucial part of omics based biomedical research to transfer classifications and obtain insights into underlying mechanisms. Feature selection is often based on minimizing error between model predictions and given classification (maximizing accuracy). Human ratings/classifications, however, might be error prone, with discordance rates between experts of 5–15%. We therefore evaluate if a feature pre-filtering step might improve identification of features associated with true underlying groups. Methods Data was simulated for up to 100 samples and up to 10,000 features, 10% of which were associated with the ground truth comprising 2–10 normally distributed populations. Binary and semi-quantitative ratings with varying error probabilities were used as classification. For feature preselection standard cross-validation (V2) was compared to a novel heuristic (V1) applying univariate testing, multiplicity adjustment and cross-validation on switched dependent (classification) and independent (features) variables. Preselected features were used to train logistic regression/linear models (backward selection, AIC). Predictions were compared against the ground truth (ROC, multiclass-ROC). As use case, multiple feature selection/classification methods were benchmarked against the novel heuristic to identify prognostically different G-CIMP negative glioblastoma tumors from the TCGA-GBM 450 k methylation array data cohort, starting from a fuzzy umap based rough and erroneous separation. Results V1 yielded higher median AUC ranks for two true groups (ground truth), with smaller differences for true graduated differences (3–10 groups). Lower fractions of models were successfully fit with V1. Median AUCs for binary classification and two true groups were 0.91 (range: 0.54–1.00) for V1 (Benjamini-Hochberg) and 0.70 (0.28–1.00) for V2, 13% (n = 616) of V2 models showed AUCs < = 50% for 25 samples and 100 features. For larger numbers of features and samples, median AUCs were 0.75 (range 0.59–1.00) for V1 and 0.54 (range 0.32–0.75) for V2. In the TCGA-GBM data, modelBuildR allowed best prognostic separation of patients with highest median overall survival difference (7.51 months) followed a difference of 6.04 months for a random forest based method. Conclusions The proposed heuristic is beneficial for the retrieval of features associated with two true groups classified with errors. We provide the R package modelBuildR to simplify (comparative) evaluation/application of the proposed heuristic (http://github.com/mknoll/modelBuildR).

Download Full-text

sympl (v. 0.4.0) and climt (v. 0.15.3) – towards a flexible framework for building model hierarchies in Python

Geoscientific Model Development ◽

10.5194/gmd-11-3781-2018 ◽

2018 ◽

Vol 11 (9) ◽

pp. 3781-3794 ◽

Cited By ~ 9

Author(s):

Joy Merwin Monteiro ◽

Jeremy McGibbon ◽

Rodrigo Caballero

Keyword(s):

Model Building ◽

Relevant Information ◽

System Modelling ◽

Climate Modelling ◽

Earth System ◽

Wide Audience ◽

Fine Grained ◽

Building Model ◽

Building Models ◽

Model Components

Abstract. sympl (System for Modelling Planets) and climt (Climate Modelling and Diagnostics Toolkit) are an attempt to rethink climate modelling frameworks from the ground up. The aim is to use expressive data structures available in the scientific Python ecosystem along with best practices in software design to allow scientists to easily and reliably combine model components to represent the climate system at a desired level of complexity and to enable users to fully understand what the model is doing. sympl is a framework which formulates the model in terms of a state that gets evolved forward in time or modified within a specific time by well-defined components. sympl's design facilitates building models that are self-documenting, are highly interoperable, and provide fine-grained control over model components and behaviour. sympl components contain all relevant information about the input they expect and output that they provide. Components are designed to be easily interchanged, even when they rely on different units or array configurations. sympl provides basic functions and objects which could be used in any type of Earth system model. climt is an Earth system modelling toolkit that contains scientific components built using sympl base objects. These include both pure Python components and wrapped Fortran libraries. climt provides functionality requiring model-specific assumptions, such as state initialization and grid configuration. climt's programming interface designed to be easy to use and thus appealing to a wide audience. Model building, configuration and execution are performed through a Python script (or Jupyter Notebook), enabling researchers to build an end-to-end Python-based pipeline along with popular Python data analysis and visualization tools.

Download Full-text

RavenR v2.1.4: an open source R package to support flexible hydrologic modelling

10.5194/gmd-2021-336 ◽

2021 ◽

Author(s):

Robert Chlumsky ◽

James R. Craig ◽

Simon G. M. Lin ◽

Sarah Grass ◽

Leland Scantlebury ◽

...

Keyword(s):

Model Building ◽

R Package ◽

Learning Curves ◽

Use Cases ◽

Hydrologic Models ◽

Hydrologic Modelling ◽

Modelling Framework ◽

Building Process ◽

Model Configuration ◽

Modelling Studies

Abstract. In recent decades, advances in the flexibility and complexity of hydrologic models has enhanced their utility in scientific studies and practice alike. However, the increasing complexity of these tools leads to a number of challenges, including steep learning curves for new users and in the reproducibility of modelling studies. Here, we present the RavenR package, an R package that leverages the power of scripting to both enhance the usability of the Raven hydrologic modelling framework and provide complimentary analyses that are useful for modellers. The RavenR package contains functions that may be useful in each step of the model-building process, particularly for preparing input files and analyzing model outputs, and these tools may be useful even for non-Raven users. The utility of the RavenR package is demonstrated with the presentation of six use cases for a model of the Liard River basin in Canada. These use cases provide examples of visually reviewing the model configuration, preparing input files for observation and forcing data, simplifying the model discretization, performing reality checks on the model output, and evaluating the performance of the model. All of the use cases are fully reproducible, with additional reproducible examples of RavenR functions included with the package distribution itself. It is anticipated that the RavenR package will continue to evolve with the Raven project, and will provide a useful tool to new and experienced users of Raven alike.

Download Full-text

Model Building, Model Testing and Model Fitting

The Practical Handbook of Genetic Algorithms ◽

10.1201/9781420035568.fmatt1 ◽

2000 ◽

Cited By ~ 1

Author(s):

J Everett

Keyword(s):

Model Building ◽

Model Fitting ◽

Model Testing ◽

Building Model

Download Full-text

Simulation modeling of a manufacturing process using Tecnomatix plant simulation software

Journal of Modern Manufacturing Systems and Technology ◽

10.15282/jmmst.v5i1.6083 ◽

2021 ◽

Vol 5 (1) ◽

pp. 56-62

Author(s):

Mebrahitom Asmelash ◽

Nurul Najihaha ◽

Azmir Azharic ◽

Freselam Mulubrhan

Keyword(s):

Simulation Modeling ◽

Model Building ◽

Constant Pressure ◽

Data Fitting ◽

Time Frame ◽

Simulation Software ◽

Simulation Tools ◽

Building Model ◽

Collection Data ◽

Plant Simulation

Industries in our community are under constant pressure and have high demands of customer orders for their products and there is the need to maximize the output for the same input of resources. In the case of lagging orders, it is very difficult for companies to manage and optimize the process flow for simultaneously coming orders. Process simulation can be suitably applied for studying and analyzing the system which can provide a framework for predicting and optimizing the process based on mathematical models. This work presents how to implement simulation tools in the real production planning so that an increase in the number of throughput in the time frame is achieved. The procedure starts with input data collection, data fitting, simulation model building, model validation, identification of the number throughput, and developed improvement system.

Download Full-text

hdnom: Building Nomograms for Penalized Cox Models with High-Dimensional Survival Data

10.1101/065524 ◽

2016 ◽

Cited By ~ 5

Author(s):

Nan Xiao ◽

Qing-Song Xu ◽

Miao-Zhu Li

Keyword(s):

Survival Data ◽

Web Application ◽

Model Building ◽

Cox Regression ◽

Cox Model ◽

R Package ◽

High Dimensional ◽

Cox Models ◽

Link Type ◽

Open Source Software Package

AbstractSummaryWe developed hdnom, an R package for survival modeling with high-dimensional data. The package is the first free and open-source software package that streamlines the workflow of penalized Cox model building, validation, calibration, comparison, and nomogram visualization, with nine types of penalized Cox regression methods fully supported. A web application and an online prediction tool maker are offered to enhance interac-tivity and flexibility in high-dimensional survival analysis.AvailabilityThe hdnom R package is available from CRAN:https://cran.r-project.org/package=hdnomunder GPL. The hdnom web application can be accessed athttp://hdnom.io. The web application maker is available fromhttp://hdnom.org/appmaker. The hdnom project website:http://[email protected]@duke.edu

Download Full-text

gamboostLSS: An R Package for Model Building and Variable Selection in the GAMLSS Framework

Journal of Statistical Software ◽

10.18637/jss.v074.i01 ◽

2016 ◽

Vol 74 (1) ◽

Cited By ~ 16

Author(s):

Benjamin Hofner ◽

Andreas Mayr ◽

Matthias Schmid

Keyword(s):

Variable Selection ◽

Model Building ◽

R Package

Download Full-text

Statistical downscaling with the downscaleR package: Contribution to the VALUE intercomparison experiment

10.5194/gmd-2019-224 ◽

2019 ◽

Cited By ~ 1

Author(s):

Joaquín Bedia ◽

Jorge Baño-Medina ◽

Mikel N. Legasa ◽

Maialen Iturbide ◽

Rodrigo Manzanas ◽

...

Keyword(s):

Statistical Downscaling ◽

Model Building ◽

Climate Model ◽

Comprehensive Evaluation ◽

Regional Climate ◽

Model Development ◽

Computational Cost ◽

Data Retrieval ◽

R Package ◽

Climate Information

Abstract. The increasing demand for high-resolution climate information has attracted a growing attention for statistical downscaling methods (SD), due in part to their relative advantages and merits as compared to dynamical approaches (based on regional climate model simulations), such as their much lower computational cost and their fitness-for-purpose for many local-scale applications. As a result, a plethora of SD methods is nowadays available for climate scientists, which has motivated recent efforts for their comprehensive evaluation, like the VALUE Project (http://www.value-cost.eu). The systematic intercomparison of a large number of SD techniques undertaken in VALUE, many of them independently developed by different authors and modeling centers in a variety of languages/environments, has shown a compelling need for new tools allowing for their application within an integrated framework. With this regard, downscaleR is an R package for statistical downscaling of climate information which covers the most popular approaches (Model Output Statistics – including the so called 'bias correction' methods – and Perfect Prognosis) and state-of-the-art techniques. It has been conceived to work primarily with daily data and can be used in the framework of both seasonal forecasting and climate change studies. Its full integration within the climate4R framework (Iturbide et al. 2019) makes possible the development of end-to-end downscaling applications, from data retrieval to model building, validation and prediction, bringing to climate scientists and practitioners a unique comprehensive framework for SD model development. In this article the main features of downscaleR are showcased through the replication of some of the results obtained in the VALUE Project, making an emphasis in the most technically complex stages of perfect-prog model calibration (predictor screening, cross-validation and model selection) that are accomplished through simple commands allowing for extremely flexible model tuning, tailored to the needs of users requiring an easy interface for different levels of experimental complexity. As part of the open-source climate4R framework, downscaleR is freely available and the necessary data and R scripts to fully replicate the experiments included in this paper are also provided as a companion notebook.

Download Full-text

Phylogeographic Estimation and Simulation of Global Diffusive Dispersal

Systematic Biology ◽

10.1093/sysbio/syaa061 ◽

2020 ◽

Author(s):

Stilianos Louca

Keyword(s):

Brownian Motion ◽

Geographic Location ◽

R Package ◽

Estimation Methods ◽

Influenza B ◽

Independent Contrasts ◽

Time Resolved ◽

Location Data ◽

Brownian Motion Model ◽

Random Walk Simulation

Abstract The analysis of time-resolved phylogenies (timetrees) and geographic location data allows estimation of dispersal rates, for example, for invasive species and infectious diseases. Many estimation methods are based on the Brownian Motion model for diffusive dispersal on a 2D plane; however, the accuracy of these methods deteriorates substantially when dispersal occurs at global scales because spherical Brownian motion (SBM) differs from planar Brownian motion. No statistical method exists for estimating SBM diffusion coefficients from a given timetree and tip coordinates, and no method exists for simulating SBM along a given timetree. Here, I present new methods for simulating SBM along a given timetree, and for estimating SBM diffusivity from a given timetree and tip coordinates using a modification of Felsenstein’s independent contrasts and maximum likelihood. My simulation and fitting methods can accommodate arbitrary time-dependent diffusivities and scale efficiently to trees with millions of tips, thus enabling new analyses even in cases where planar BM would be a sufficient approximation. I demonstrate these methods using a timetree of marine and terrestrial Cyanobacterial genomes, as well as timetrees of two globally circulating Influenza B clades. My methods are implemented in the R package “castor.” [Independent contrasts; phylogenetic; random walk; simulation; spherical Brownian motion.]

Download Full-text

Parametric modeling of quantile regression coefficient functions with count data

Statistical Methods & Applications ◽

10.1007/s10260-021-00557-7 ◽

2021 ◽

Author(s):

Paolo Frumento ◽

Nicola Salvati

Keyword(s):

Quantile Regression ◽

Count Data ◽

Model Building ◽

R Package ◽

Parametric Modeling ◽

Medical Expenditure ◽

National Medical ◽

Discrete Response ◽

The Us ◽

Parametric Functions

AbstractApplying quantile regression to count data presents logical and practical complications which are usually solved by artificially smoothing the discrete response variable through jittering. In this paper, we present an alternative approach in which the quantile regression coefficients are modeled by means of (flexible) parametric functions. The proposed method avoids jittering and presents numerous advantages over standard quantile regression in terms of computation, smoothness, efficiency, and ease of interpretation. Estimation is carried out by minimizing a “simultaneous” version of the loss function of ordinary quantile regression. Simulation results show that the described estimators are similar to those obtained with jittering, but are often preferable in terms of bias and efficiency. To exemplify our approach and provide guidelines for model building, we analyze data from the US National Medical Expenditure Survey. All the necessary software is implemented in the existing R package .

Download Full-text