DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization

A commonly used tool for estimating the parameters of a mixture model is the Expectation–Maximization (EM) algorithm, which is an iterative procedure that can serve as a maximum-likelihood estimator. The EM algorithm has well-documented drawbacks, such as the need for good initial values and the possibility of being trapped in local optima. Nevertheless, because of its appealing properties, EM plays an important role in estimating the parameters of mixture models. To overcome these initialization problems with EM, in this paper, we propose the Rough-Enhanced-Bayes mixture estimation (REBMIX) algorithm as a more effective initialization algorithm. Three different strategies are derived for dealing with the unknown number of components in the mixture model. These strategies are thoroughly tested on artificial datasets, density–estimation datasets and image–segmentation problems and compared with state-of-the-art initialization methods for the EM. Our proposal shows promising results in terms of clustering and density-estimation performance as well as in terms of computational efficiency. All the improvements are implemented in the rebmix R package.

Download Full-text

Fitting and Cross-Validating Cox Models to Censored Big Data With Missing Values Using Extensions of Partial Least Squares Regression Models

Frontiers in Big Data ◽

10.3389/fdata.2021.684794 ◽

2021 ◽

Vol 4 ◽

Author(s):

Frédéric Bertrand ◽

Myriam Maumy-Bertrand

Keyword(s):

Big Data ◽

Least Squares ◽

Partial Least Squares ◽

Partial Least Squares Regression ◽

Missing Values ◽

Cross Validation ◽

Cox Model ◽

R Package ◽

Least Squares Regression ◽

Cox Models

Fitting Cox models in a big data context -on a massive scale in terms of volume, intensity, and complexity exceeding the capacity of usual analytic tools-is often challenging. If some data are missing, it is even more difficult. We proposed algorithms that were able to fit Cox models in high dimensional settings using extensions of partial least squares regression to the Cox models. Some of them were able to cope with missing data. We were recently able to extend our most recent algorithms to big data, thus allowing to fit Cox model for big data with missing values. When cross-validating standard or extended Cox models, the commonly used criterion is the cross-validated partial loglikelihood using a naive or a van Houwelingen scheme —to make efficient use of the death times of the left out data in relation to the death times of all the data. Quite astonishingly, we will show, using a strong simulation study involving three different data simulation algorithms, that these two cross-validation methods fail with the extensions, either straightforward or more involved ones, of partial least squares regression to the Cox model. This is quite an interesting result for at least two reasons. Firstly, several nice features of PLS based models, including regularization, interpretability of the components, missing data support, data visualization thanks to biplots of individuals and variables —and even parsimony or group parsimony for Sparse partial least squares or sparse group SPLS based models, account for a common use of these extensions by statisticians who usually select their hyperparameters using cross-validation. Secondly, they are almost always featured in benchmarking studies to assess the performance of a new estimation technique used in a high dimensional or big data context and often show poor statistical properties. We carried out a vast simulation study to evaluate more than a dozen of potential cross-validation criteria, either AUC or prediction error based. Several of them lead to the selection of a reasonable number of components. Using these newly found cross-validation criteria to fit extensions of partial least squares regression to the Cox model, we performed a benchmark reanalysis that showed enhanced performances of these techniques. In addition, we proposed sparse group extensions of our algorithms and defined a new robust measure based on the Schmid score and the R coefficient of determination for least absolute deviation: the integrated R Schmid Score weighted. The R-package used in this article is available on the CRAN, http://cran.r-project.org/web/packages/plsRcox/index.html. The R package bigPLS will soon be available on the CRAN and, until then, is available on Github https://github.com/fbertran/bigPLS.

Download Full-text

Expectation-maximization algorithm for topic modeling on big data streams

2016 IEEE 7th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON) ◽

10.1109/uemcon.2016.7777818 ◽

2016 ◽

Cited By ~ 2

Author(s):

Walisa Romsaiyud

Keyword(s):

Big Data ◽

Data Streams ◽

Expectation Maximization ◽

Topic Modeling ◽

Expectation Maximization Algorithm ◽

Big Data Streams

Download Full-text

bWGR: Bayesian whole-genome regression

Bioinformatics ◽

10.1093/bioinformatics/btz794 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alencar Xavier ◽

William M Muir ◽

Katy M Rainey

Keyword(s):

Bayesian Methods ◽

Expectation Maximization ◽

Complex Traits ◽

Hierarchical Models ◽

R Package ◽

Supplementary Information ◽

Whole Genome ◽

Regression Methods ◽

Genome Wide ◽

User Friendly

AbstractMotivationWhole-genome regressions methods represent a key framework for genome-wide prediction, cross-validation studies and association analysis. The bWGR offers a compendium of Bayesian methods with various priors available, allowing users to predict complex traits with different genetic architectures.ResultsHere we introduce bWGR, an R package that enables users to efficient fit and cross-validate Bayesian and likelihood whole-genome regression methods. It implements a series of methods referred to as the Bayesian alphabet under the traditional Gibbs sampling and optimized expectation-maximization. The package also enables fitting efficient multivariate models and complex hierarchical models. The package is user-friendly and computational efficient.Availability and implementationbWGR is an R package available in the CRAN repository. It can be installed in R by typing: install.packages(‘bWGR’).Supplementary informationSupplementary data are available at Bioinformatics online.

Download Full-text

DETECTING PHYLODIVERSITY-DEPENDENT DIVERSIFICATION WITH A GENERAL PHYLOGENETIC INFERENCE FRAMEWORK

10.1101/2021.07.01.450729 ◽

2021 ◽

Author(s):

Francisco Richter ◽

Ernst C. Wit ◽

Rampal S. Etienne ◽

Thijs Janzen ◽

Hanno Hildenbrandt

Keyword(s):

Species Richness ◽

Expectation Maximization ◽

Phylogenetic Diversity ◽

R Package ◽

Inference Method ◽

Likelihood Methods ◽

Species Diversification ◽

As Species ◽

Monte Carlo Expectation Maximization ◽

Phylogenetic Divergence

Diversity-dependent diversification models have been extensively used to study the effect of ecological limits and feedback of community structure on species diversification processes, such as speciation and extinction. Current diversity-dependent diversification models characterise ecological limits by carrying capacities for species richness. Such ecological limits have been justified by niche filling arguments: as species diversity increases, the number of available niches for diversification decreases. However, as species diversify they may diverge from one another phenotypically, which may open new niches for new species. Alternatively, this phenotypic divergence may not affect the species diversification process or even inhibit further diversification. Hence, it seems natural to explore the consequences of phylogenetic diversity-dependent (or phylodiversity-dependent) diversification. Current likelihood methods for estimating diversity-dependent diversification parameters cannot be used for this, as phylodiversity is continuously changing as time progresses and species form and become extinct. Here, we present a new method based on Monte Carlo Expectation-Maximization (MCEM), designed to perform statistical inference on a general class of species diversification models and implemented in the R package emphasis. We use the method to fit phylodiversity-dependent diversification models to 14 phylogenies, and compare the results to the fit of a richness-dependent diversification model. We find that in a number of phylogenies, phylogenetic divergence indeed spurs speciation even though species richness reduces it. Not only do we thus shine a new light on diversity-dependent diversification, we also argue that our inference framework can handle a large class of diversification models for which currently no inference method exists.

Download Full-text

Fast Bayesian Estimation for the Four-Parameter Logistic Model (4PLM)

SAGE Open ◽

10.1177/21582440211052556 ◽

2021 ◽

Vol 11 (4) ◽

pp. 215824402110525

Author(s):

Chanjin Zheng ◽

Shaoyang Guo ◽

Justin L Kern

Keyword(s):

Item Response ◽

Expectation Maximization ◽

Logistic Model ◽

Sampling Method ◽

Real Data ◽

R Package ◽

Response Model ◽

Item Response Model ◽

The Em Algorithm ◽

Bayes Modal

There is a rekindled interest in the four-parameter logistic item response model (4PLM) after three decades of neglect among the psychometrics community. Recent breakthroughs in item calibration include the Gibbs sampler specially made for 4PLM and the Bayes modal estimation (BME) method as implemented in the R package mirt. Unfortunately, the MCMC is often time-consuming, while the BME method suffers from instability due to the prior settings. This paper proposes an alternative BME method, the Bayesian Expectation-Maximization-Maximization-Maximization (BE3M) method, which is developed from by combining an augmented variable formulation of the 4PLM and a mixture model conceptualization of the 3PLM. The simulation shows that the BE3M can produce estimates as accurately as the Gibbs sampling method and as fast as the EM algorithm. A real data example is also provided.

Download Full-text

An Enhanced Unsupervised Fuzzy Expectation Maximization Clustering for Deduplication of Records in Big data

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.c1269.1083s219 ◽

2019 ◽

Vol 8 (3S2) ◽

pp. 988-993

Keyword(s):

Big Data ◽

Data Warehouse ◽

Expectation Maximization ◽

Cloud Storage ◽

Detection Rate ◽

Storage Capacity ◽

Main Issue ◽

Computation Complexity ◽

Proposed Model ◽

Simulation Results

The main issue while handling records in data warehouse or cloud storage is the presence of duplicate records which may unnecessarily test the storage capacity and computation complexity. This is an issue while integrating various databases. This paper focuses on discovering records, entirely and partly replicated, before storing them in cloud storage. This work converts whole content of data to numeric values for applying deduplication using radix method. Fuzzy Expectation Maximization (FEM) is used to cluster the numerals, so that the time taken for comparison between records is reduced. To discover and eliminate the duplicate records, this paper used divided-and-conquer-algorithm to match records among intra-clusters, which further enhances the performance of the model. The simulation results have proved that the performance of the proposed model achieves higher detection rate of duplicate records.

Download Full-text