parallelMCMCcombine: An R Package for Bayesian Methods for Big Data and Analytics

Fitting Cox models in a big data context -on a massive scale in terms of volume, intensity, and complexity exceeding the capacity of usual analytic tools-is often challenging. If some data are missing, it is even more difficult. We proposed algorithms that were able to fit Cox models in high dimensional settings using extensions of partial least squares regression to the Cox models. Some of them were able to cope with missing data. We were recently able to extend our most recent algorithms to big data, thus allowing to fit Cox model for big data with missing values. When cross-validating standard or extended Cox models, the commonly used criterion is the cross-validated partial loglikelihood using a naive or a van Houwelingen scheme —to make efficient use of the death times of the left out data in relation to the death times of all the data. Quite astonishingly, we will show, using a strong simulation study involving three different data simulation algorithms, that these two cross-validation methods fail with the extensions, either straightforward or more involved ones, of partial least squares regression to the Cox model. This is quite an interesting result for at least two reasons. Firstly, several nice features of PLS based models, including regularization, interpretability of the components, missing data support, data visualization thanks to biplots of individuals and variables —and even parsimony or group parsimony for Sparse partial least squares or sparse group SPLS based models, account for a common use of these extensions by statisticians who usually select their hyperparameters using cross-validation. Secondly, they are almost always featured in benchmarking studies to assess the performance of a new estimation technique used in a high dimensional or big data context and often show poor statistical properties. We carried out a vast simulation study to evaluate more than a dozen of potential cross-validation criteria, either AUC or prediction error based. Several of them lead to the selection of a reasonable number of components. Using these newly found cross-validation criteria to fit extensions of partial least squares regression to the Cox model, we performed a benchmark reanalysis that showed enhanced performances of these techniques. In addition, we proposed sparse group extensions of our algorithms and defined a new robust measure based on the Schmid score and the R coefficient of determination for least absolute deviation: the integrated R Schmid Score weighted. The R-package used in this article is available on the CRAN, http://cran.r-project.org/web/packages/plsRcox/index.html. The R package bigPLS will soon be available on the CRAN and, until then, is available on Github https://github.com/fbertran/bigPLS.

Download Full-text

bWGR: Bayesian whole-genome regression

Bioinformatics ◽

10.1093/bioinformatics/btz794 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alencar Xavier ◽

William M Muir ◽

Katy M Rainey

Keyword(s):

Bayesian Methods ◽

Expectation Maximization ◽

Complex Traits ◽

Hierarchical Models ◽

R Package ◽

Supplementary Information ◽

Whole Genome ◽

Regression Methods ◽

Genome Wide ◽

User Friendly

AbstractMotivationWhole-genome regressions methods represent a key framework for genome-wide prediction, cross-validation studies and association analysis. The bWGR offers a compendium of Bayesian methods with various priors available, allowing users to predict complex traits with different genetic architectures.ResultsHere we introduce bWGR, an R package that enables users to efficient fit and cross-validate Bayesian and likelihood whole-genome regression methods. It implements a series of methods referred to as the Bayesian alphabet under the traditional Gibbs sampling and optimized expectation-maximization. The package also enables fitting efficient multivariate models and complex hierarchical models. The package is user-friendly and computational efficient.Availability and implementationbWGR is an R package available in the CRAN repository. It can be installed in R by typing: install.packages(‘bWGR’).Supplementary informationSupplementary data are available at Bioinformatics online.

Download Full-text

BAYESIAN ANALYSIS OF BIG DATA IN INSURANCE PREDICTIVE MODELING USING DISTRIBUTED COMPUTING

Astin Bulletin ◽

10.1017/asb.2017.15 ◽

2017 ◽

Vol 47 (3) ◽

pp. 943-961 ◽

Cited By ~ 2

Author(s):

Yanwei Zhang

Keyword(s):

Big Data ◽

Distributed Computing ◽

Predictive Modeling ◽

Bayesian Methods ◽

Distributed Algorithm ◽

Bayesian Computation ◽

Actuarial Science ◽

Parallel Method ◽

Bayesian Hierarchical ◽

The Empirical Analysis

AbstractWhile Bayesian methods have attracted considerable interest in actuarial science, they are yet to be embraced in large-scaled insurance predictive modeling applications, due to inefficiencies of Bayesian estimation procedures. The paper presents an efficient method that parallelizes Bayesian computation using distributed computing on Apache Spark across a cluster of computers. The distributed algorithm dramatically boosts the speed of Bayesian computation and expands the scope of applicability of Bayesian methods in insurance modeling. The empirical analysis applies a Bayesian hierarchical Tweedie model to a big data of 13 million insurance claim records. The distributed algorithm achieves as much as 65 times performance gain over the non-parallel method in this application. The analysis demonstrates that Bayesian methods can be of great value to large-scaled insurance predictive modeling.

Download Full-text

Big Data Analysis of the Dynamic Relationship between Stock Prices and Business Cycles Via Bayesian Methods

International Journal of Trade Economics and Finance ◽

10.18178/ijtef.2018.9.6.620 ◽

2018 ◽

Vol 9 (6) ◽

pp. 224-230

Author(s):

Koki Kyo ◽

Keyword(s):

Big Data ◽

Data Analysis ◽

Business Cycles ◽

Bayesian Methods ◽

Stock Prices ◽

Big Data Analysis ◽

Dynamic Relationship

Download Full-text

Bayesian Methods for Model Comparison, Selection, and Big Data

Bayesian Econometric Methods ◽

10.1017/9781108525947.018 ◽

2019 ◽

pp. 321-342

Keyword(s):

Big Data ◽

Bayesian Methods ◽

Model Comparison

Download Full-text

Perspectives on Bayesian Methods and Big Data

Customer Needs and Solutions ◽

10.1007/s40547-014-0017-9 ◽

2014 ◽

Vol 1 (3) ◽

pp. 169-175 ◽

Cited By ~ 14

Author(s):

Greg M. Allenby ◽

Eric T. Bradlow ◽

Edward I. George ◽

John Liechty ◽

Robert E. McCulloch

Keyword(s):

Big Data ◽

Bayesian Methods

Download Full-text

Detailed Bibliographic Big data Analysis Using R package to identify Trends in Scientific Inquiry Education Research

Korean Association For Learner-Centered Curriculum And Instruction ◽

10.22251/jlcci.2020.20.6.859 ◽

2020 ◽

Vol 20 (6) ◽

pp. 859-883

Author(s):

Sangwoo Ha ◽

Hunkoog Jho

Keyword(s):

Big Data ◽

Education Research ◽

Data Analysis ◽

Scientific Inquiry ◽

R Package ◽

Big Data Analysis

Download Full-text

Improving Bayesian statistics understanding in the age of Big Data with the bayesvl R package

Software Impacts ◽

10.1016/j.simpa.2020.100016 ◽

2020 ◽

Vol 4 ◽

pp. 100016 ◽

Cited By ~ 5

Author(s):

Quan-Hoang Vuong ◽

Viet-Phuong La ◽

Minh-Hoang Nguyen ◽

Manh-Toan Ho ◽

Manh-Tung Ho ◽

...

Keyword(s):

Big Data ◽

Bayesian Statistics ◽

R Package

Download Full-text

A primer on high-dimensional data analysis workflows for studying visual cortex development and plasticity

10.1101/554378 ◽

2019 ◽

Cited By ~ 1

Author(s):

Justin L. Balsor ◽

David G. Jones ◽

Kathryn M. Murphy

Keyword(s):

Big Data ◽

Visual Cortex ◽

Clustering Algorithms ◽

High Dimensional Data ◽

R Package ◽

High Dimensional ◽

Data Sets ◽

Data Set ◽

Dimensional Changes ◽

Or Genes

AbstractNew techniques for quantifying large numbers of proteins or genes are transforming the study of plasticity mechanisms in visual cortex (V1) into the era of big data. With those changes comes the challenge of applying new analytical methods designed for high-dimensional data. Studies of V1, however, can take advantage of the known functions that many proteins have in regulating experience-dependent plasticity to facilitate linking big data analyses with neurobiological functions. Here we discuss two workflows and provide example R code for analyzing high-dimensional changes in a group of proteins (or genes) using two data sets. The first data set includes 7 neural proteins, 9 visual conditions, and 3 regions in V1 from an animal model for amblyopia. The second data set includes 23 neural proteins and 31 ages (20d-80yrs) from human post-mortem samples of V1. Each data set presents different challenges and we describe using PCA, tSNE, and various clustering algorithms including sparse high-dimensional clustering. Also, we describe a new approach for identifying high-dimensional features and using them to construct a plasticity phenotype that identifies neurobiological differences among clusters. We include an R package “v1hdexplorer” that aggregates the various coding packages and custom visualization scripts written in R Studio.

Download Full-text