The Information Bottleneck and Geometric Clustering

DJ Strouse; David J. Schwab

doi:10.1162/neco_a_01136

The Information Bottleneck and Geometric Clustering

Neural Computation ◽

10.1162/neco_a_01136 ◽

2019 ◽

Vol 31 (3) ◽

pp. 596-612 ◽

Cited By ~ 6

Author(s):

DJ Strouse ◽

David J. Schwab

Keyword(s):

Clustering Algorithms ◽

Gaussian Mixture Models ◽

Selection Procedure ◽

Gaussian Mixture ◽

Model Parameters ◽

Data Set ◽

Information Bottleneck ◽

Geometric Clustering ◽

Data Points ◽

Intuitive Method

The information bottleneck (IB) approach to clustering takes a joint distribution $P(X, Y)$ and maps the data $X$ to cluster labels $T$, which retain maximal information about $Y$ (Tishby, Pereira, & Bialek, 1999). This objective results in an algorithm that clusters data points based on the similarity of their conditional distributions $P(Y \mid X)$. This is in contrast to classic geometric clustering algorithms such as $k$-means and gaussian mixture models (GMMs), which take a set of observed data points $\{x_i\}_{i=1}^{N}$ and cluster them based on their geometric (typically Euclidean) distance from one another. Here, we show how to use the deterministic information bottleneck (DIB) (Strouse & Schwab, 2017), a variant of IB, to perform geometric clustering by choosing cluster labels that preserve information about data point location on a smoothed data set. We also introduce a novel intuitive method to choose the number of clusters via kinks in the information curve. We apply this approach to a variety of simple clustering problems, showing that DIB with our model selection procedure recovers the generative cluster labels. We also show that, in particular limits of our model parameters, clustering with DIB and IB is equivalent to $k$-means and EM fitting of a GMM with hard and soft assignments, respectively. Thus, clustering with (D)IB generalizes and provides an information-theoretic perspective on these classic algorithms.
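The contrast between hard and soft assignments mentioned in this abstract can be illustrated with a minimal sketch. It is not the DIB or IB algorithm from the paper; it only shows the textbook relationship in which the soft responsibilities of an equal-weight, isotropic gaussian mixture approach the hard nearest-center rule of k-means as the shared standard deviation shrinks. The function names, centers, and `sigma` values are assumptions chosen for illustration.

```python
import math

def squared_distances(x, centers):
    """Squared Euclidean distance from point x to every center."""
    return [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]

def soft_assignments(x, centers, sigma):
    """Responsibilities under equal-weight isotropic gaussians with std sigma."""
    d2 = squared_distances(x, centers)
    # Subtract the smallest distance before exponentiating for numerical stability.
    d2_min = min(d2)
    weights = [math.exp(-(d - d2_min) / (2.0 * sigma ** 2)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]

def hard_assignment(x, centers):
    """k-means style assignment: index of the nearest center."""
    d2 = squared_distances(x, centers)
    return d2.index(min(d2))

# Hypothetical example: two centers and one point that is closer to the first.
centers = [(0.0, 0.0), (4.0, 0.0)]
point = (1.5, 0.0)
wide = soft_assignments(point, centers, sigma=5.0)    # broad components: nearly 50/50
narrow = soft_assignments(point, centers, sigma=0.1)  # narrow components: nearly one-hot
nearest = hard_assignment(point, centers)
```

With a large `sigma` the point is shared almost evenly between the two components, while with a small `sigma` nearly all of the responsibility goes to the component that `hard_assignment` selects.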

Penalized Probabilistic Clustering

Neural Computation ◽

10.1162/neco.2007.19.6.1528 ◽

2007 ◽

Vol 19 (6) ◽

pp. 1528-1567 ◽

Cited By ~ 31

Author(s):

Zhengdong Lu ◽

Todd K. Leen

Keyword(s):

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Model Parameters ◽

Clustering Methods ◽

Probabilistic Clustering ◽

Starting Point ◽

Special Cases ◽

Out Of Sample ◽

Data Points ◽

Semisupervised Clustering

While clustering is usually an unsupervised operation, there are circumstances in which we believe (with varying degrees of certainty) that items A and B should be assigned to the same cluster, while items A and C should not. We would like such pairwise relations to influence cluster assignments of out-of-sample data in a manner consistent with the prior knowledge expressed in the training set. Our starting point is probabilistic clustering based on gaussian mixture models (GMM) of the data distribution. We express clustering preferences in a prior distribution over assignments of data points to clusters. This prior penalizes cluster assignments according to the degree to which they violate the preferences. The model parameters are fit with the expectation-maximization (EM) algorithm. Our model provides a flexible framework that encompasses several other semisupervised clustering models as its special cases. Experiments on artificial and real-world problems show that our model can consistently improve clustering results when pairwise relations are incorporated. The experiments also demonstrate the superiority of our model to other semisupervised clustering methods in handling noisy pairwise relations.

Comparison of dimensionality reduction and clustering methods for SARS-CoV-2 genome

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v10i4.2803 ◽

2021 ◽

Vol 10 (4) ◽

pp. 2170-2180

Author(s):

Untari N. Wisesty ◽

Tati Rajab Mengko

Keyword(s):

Dimensionality Reduction ◽

Dimensional Reduction ◽

Clustering Algorithm ◽

Sequence Data ◽

Clustering Algorithms ◽

Gaussian Mixture Models ◽

Reduction Process ◽

Principal Component ◽

Gaussian Mixture ◽

Clustering Methods

This paper analyzes SARS-CoV-2 genome variation by comparing the results of genome clustering using several clustering algorithms and the distribution of sequences in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, these clustering algorithms have difficulty grouping very high-dimensional data such as genome data, so a dimensionality reduction process is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and an autoencoder, with three models that produce 2, 10, and 50 features. The main contributions are the dimensionality reduction and clustering scheme for SARS-CoV-2 sequence data and the performance analysis of each experiment for each scheme and the hyperparameters of each method. Based on the experimental results, PCA combined with the DBSCAN algorithm achieves the highest silhouette score of 0.8770, with three clusters, when using two features. However, dimensionality reduction using the autoencoder needs more iterations to converge. In the testing process with Indonesian sequence data, more than half of the sequences fall into one cluster and the rest are distributed across the other two clusters.
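The silhouette score used above to rank the schemes can be computed as in the following minimal sketch. It uses the standard definition (for each point, the mean distance `a` to its own cluster and the smallest mean distance `b` to any other cluster, combined as `(b - a) / max(a, b)`), with Euclidean distance and a score of 0 for singleton clusters as assumptions. The toy points and labels are hypothetical and are not the genome data from the paper.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the geometry
```

Scores close to 1 indicate compact, well-separated clusters, and negative scores indicate points that sit closer to another cluster than to their own.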

GMM with parameters initialization based on SVD for network threat detection

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-200066 ◽

2021 ◽

Vol 40 (1) ◽

pp. 477-490

Author(s):

Yanping Xu ◽

Tingcong Ye ◽

Xin Wang ◽

Yuping Lai ◽

Jian Qiu ◽

...

Keyword(s):

Gaussian Mixture Models ◽

Singular Values ◽

Gaussian Mixture ◽

Singular Value ◽

Optimal Parameters ◽

Clustering Methods ◽

Data Set ◽

Detection Model ◽

Clustering Model ◽

Threat Behavior

In the field of security, data labels are often unknown or too expensive to obtain, so clustering methods are used to detect the threat behavior contained in big data. The most widely used probabilistic clustering model is the Gaussian mixture model (GMM), which is flexible and powerful in applying prior knowledge to model the uncertainty of the data. Therefore, in this paper, we use GMM to build the threat behavior detection model. Commonly, Expectation Maximization (EM) and Variational Inference (VI) are used to estimate the optimal parameters of GMM. However, both EM and VI are quite sensitive to the initial values of the parameters. Therefore, we propose to use Singular Value Decomposition (SVD) to initialize the parameters. First, SVD is used to factorize the data set matrix to get the singular value matrix and singular matrices. Then we calculate the number of components of the GMM from the first two singular values in the singular value matrix and the dimension of the data. Next, the other parameters of the GMM, such as the mixing coefficients, the means, and the covariances, are calculated based on the number of components. After that, the initial values of the parameters are passed to EM and VI to estimate the optimal parameters of the GMM. The experimental results indicate that our proposed method performs well at initializing the parameters of GMM clustering when EM and VI are used for estimation.

Video Image Segmentation Using Gaussian Mixture Models Based on the Differential Evolution-Based Parameter Estimation

Key Engineering Materials ◽

10.4028/www.scientific.net/kem.474-476.442 ◽

2011 ◽

Vol 474-476 ◽

pp. 442-447

Author(s):

Zhi Gao Zeng ◽

Li Xin Ding ◽

Sheng Qiu Yi ◽

San You Zeng ◽

Zi Hua Qiu

Keyword(s):

Image Segmentation ◽

Differential Evolution ◽

Mixture Models ◽

Clustering Algorithms ◽

Gaussian Mixture Models ◽

Expectation Maximization Algorithm ◽

Image Data ◽

Gaussian Mixture ◽

Parameters Estimation ◽

Video Image

To improve the accuracy of image segmentation in video surveillance sequences, and to overcome the limits of traditional clustering algorithms that cannot accurately model image data sets containing noisy data, the paper presents an automatic and accurate video image segmentation algorithm that uses Gaussian mixture models, according to spatial properties, to segment the image. Because the expectation-maximization algorithm is very sensitive to initial values and easily falls into local optima, the paper also presents a differential evolution-based parameter estimation for Gaussian mixture models. The experimental results show that segmentation accuracy is greatly improved compared with traditional segmentation algorithms.

Modeling with Gaussian mixture regression for lactation milk yield in Anatolian buffaloes

Indian Journal of Animal Research ◽

10.18805/ijar.v0iof.4545 ◽

2016 ◽

Author(s):

Abdullah Yesilova ◽

Ayhan Yilmaz ◽

Gazel Ser ◽

Baris Kaki

Keyword(s):

Milk Yield ◽

Environmental Effects ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Lactation Period ◽

Multivariate Statistical ◽

Yield Data ◽

Data Set ◽

Mixture Regression ◽

Lactation Duration

The purpose of this study was to classify Anatolian buffalo using a Gaussian mixture regression model according to discrete and continuous environmental effects. The Gaussian mixture model performs regression analysis separately both within and between groups. This is an important property of Gaussian mixture models that distinguishes them from other multivariate statistical methods. The data were obtained from 1455 lactation milk yield records of Anatolian buffalo reared in seven different locations in Bitlis province, Turkey. Age of dam, lactation duration, and location were considered as environmental effects on lactation milk yield. The data set was divided into three homogeneous subgroups with respect to AIC and BIC in the Gaussian mixture regression, based on environmental effects on lactation milk yield. Estimated means for lactation milk yield and mixing probabilities for the first, second, and third subgroups were determined as 1494.33 kg (16.9%), 540.33 kg (45.2%), and 847.61 kg (37.9%), respectively. The numbers of buffalo in each subgroup according to mixing probability were obtained as 159, 756, and 540 for the first, second, and third groups, respectively. The effects of lactation period, age of dam, and village were found to be statistically significant on lactation milk yield in subgroup 1, which had the highest mean lactation milk yield (p < 0.01). In conclusion, the results showed that Gaussian mixture regression is an important tool for classifying quantitative traits while considering environmental effects in animal breeding.
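Choosing the number of subgroups with AIC and BIC, as described above, amounts to comparing penalized log-likelihoods across candidate models. The sketch below uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The log-likelihoods, parameter counts, and sample size are hypothetical placeholders, not values from the study.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fits, keyed by number of subgroups: (maximized log-likelihood, free parameters).
candidates = {1: (-120.0, 2), 2: (-95.0, 5), 3: (-94.0, 8)}
n_obs = 200  # hypothetical number of records

best_by_aic = min(candidates, key=lambda k: aic(*candidates[k]))
best_by_bic = min(candidates, key=lambda k: bic(*candidates[k], n_obs))
```

In this made-up example both criteria prefer two subgroups, because the third subgroup improves the log-likelihood too little to offset the penalty for its extra parameters.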

SPSM: A NEW HYBRID DATA CLUSTERING ALGORITHM FOR NONLINEAR DATA ANALYSIS

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001409007685 ◽

2009 ◽

Vol 23 (08) ◽

pp. 1701-1737 ◽

Cited By ~ 3

Author(s):

UREERAT WATTANACHON ◽

CHIDCHANOK LURSINSAP

Keyword(s):

Clustering Algorithm ◽

Color Image ◽

Clustering Algorithms ◽

Noisy Data ◽

Second Phase ◽

Data Sets ◽

Data Set ◽

Cluster Distance ◽

Data Points ◽

Hybrid Data

Existing clustering algorithms, such as single-link clustering, k-means, CURE, and CSM, are designed to find clusters based on predefined parameters specified by users. These algorithms may be unsuccessful if the choice of parameters is inappropriate with respect to the data set being clustered. Most of these algorithms work very well for compact and hyper-spherical clusters. In this paper, a new hybrid clustering algorithm called Self-Partition and Self-Merging (SPSM) is proposed. The SPSM algorithm partitions the input data set into several subclusters in the first phase and then removes the noisy data in the second phase. In the third phase, the normal subclusters are continuously merged to form larger clusters based on the inter-cluster distance and intra-cluster distance criteria. The experimental results show that the SPSM algorithm is very efficient at handling noisy data sets and at clustering data sets of arbitrary shapes and different densities. Several color image examples show the versatility of the proposed method, and the results are compared with those described in the literature for the same images. The computational complexity of the SPSM algorithm is $O(N^2)$, where $N$ is the number of data points.

A Riemannian Newton trust-region method for fitting Gaussian mixture models

Statistics and Computing ◽

10.1007/s11222-021-10071-1 ◽

2021 ◽

Vol 32 (1) ◽

Author(s):

Lena Sembach ◽

Jan Pablo Burgard ◽

Volker Schulz

Keyword(s):

Mixture Models ◽

Gaussian Mixture Models ◽

Trust Region Method ◽

Trust Region ◽

Gaussian Mixture ◽

Model Parameters ◽

Density Approximation ◽

Hidden Information ◽

Large Share ◽

Region Method

Gaussian Mixture Models are a powerful tool in Data Science and Statistics that are mainly used for clustering and density approximation. In practice, the task of estimating the model parameters is often solved by the expectation maximization (EM) algorithm, which benefits from its simplicity and low per-iteration costs. However, EM converges slowly if there is a large share of hidden information or overlapping clusters. Recent advances in Manifold Optimization for Gaussian Mixture Models have gained increasing interest. We introduce an explicit formula for the Riemannian Hessian for Gaussian Mixture Models. In addition, we propose a new Riemannian Newton Trust-Region method which outperforms current approaches in terms of both runtime and number of iterations. We apply our method to clustering problems and density approximation tasks. Our method is very powerful for data with a large share of hidden information compared to existing methods.
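For context, the EM baseline that this abstract describes as simple and cheap per iteration can be sketched in a few lines for a one-dimensional mixture. This is a naive illustration of plain EM, not the Riemannian Newton trust-region method proposed in the paper. The function names, the variance floor, the toy data, and the starting values are assumptions, and the sketch does not guard against every numerical corner case (for example, a point with zero density under all components).

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50, var_floor=1e-6):
    """Plain EM for a 1D gaussian mixture, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, var_floor)  # floor avoids collapsing components
            weights[j] = nj / len(data)
    return means, variances, weights

# Hypothetical toy data: two well-separated groups around 0 and 10.
data = [-0.2, 0.1, 0.0, 0.3, 9.8, 10.1, 10.0, 10.3]
means, variances, weights = em_gmm_1d(
    data, means=[1.0, 8.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

On well-separated data like this, EM settles almost immediately. The slow convergence noted in the abstract arises when components overlap heavily, so that the responsibilities stay far from 0 and 1.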

Clustering algorithms subjected to K-mean and gaussian mixture model on multidimensional data set

Periodicals of Engineering and Natural Sciences (PEN) ◽

10.21533/pen.v7i2.484 ◽

2019 ◽

Vol 7 (2) ◽

pp. 448 ◽

Cited By ~ 1

Author(s):

Saadaldeen Rashid Ahmed Ahmed ◽

Israa Al Barazanchi ◽

Zahraa A. Jaaz ◽

Haider Rasheed Abdulshaheed

Keyword(s):

Gaussian Mixture Model ◽

Mixture Model ◽

Clustering Algorithms ◽

Gaussian Mixture ◽

Multidimensional Data ◽

Data Set ◽

Multidimensional Data Set

Linear inversion of body‐wave data—Part III: Model parameterization

Geophysics ◽

10.1190/1.1441624 ◽

1984 ◽

Vol 49 (12) ◽

pp. 2088-2093 ◽

Cited By ~ 5

Author(s):

M. Bée ◽

R. S. Jacobson

Keyword(s):

Velocity Gradient ◽

Model Parameters ◽

Model Resolution ◽

Data Set ◽

Model Parameterization ◽

Number Of Layers ◽

Gradient Model ◽

Data Points ◽

Constrained Model ◽

Constrained Models

A velocity gradient model parameterized with the tau‐zeta inversion for seismic refraction data is examined with respect to a synthetic traveltime data set. The velocity‐depth model consists of a stack of laterally homogeneous layers, each with a constant velocity gradient. The free model parameters are the velocities of the layer bounds and the number of layers. The best velocity gradient solutions, i.e., with the least deviation from the true model, were obtained from “constrained” models in which the velocities of the layer bounds are the velocities of the observed refracted waves. An arbitrary selection of layer bound velocities was found to be a suboptimal choice of model parameterization for the tau‐zeta inversion. A trade‐off curve between model resolution and solution variance was constructed with the constrained model parameterization from examination of numerous solutions with a diverse number of layers. A constrained model with as many layers as observed data points represents a satisfactory compromise between model resolution and solution variance. Constrained models with more layers than observed data points, however, can increase the resolution of the velocity gradient model. If model resolution is favored over solution variance, a constrained model with many more layers than observed data points is therefore the best model parameterization with the tau‐zeta inversion technique.

Independent Vector Analysis for Source Separation Using a Mixture of Gaussians Prior

Neural Computation ◽

10.1162/neco.2010.11-08-906 ◽

2010 ◽

Vol 22 (6) ◽

pp. 1646-1673 ◽

Cited By ~ 17

Author(s):

Jiucang Hao ◽

Intae Lee ◽

Te-Won Lee ◽

Terrence J. Sejnowski

Keyword(s):

Em Algorithm ◽

Music Performance ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Vector Analysis ◽

Model Parameters ◽

Sensor Noise ◽

Independent Vector ◽

Convolutive Mixtures ◽

Independent Vector Analysis

Convolutive mixtures of signals, which are common in acoustic environments, can be difficult to separate into their component sources. Here we present a uniform probabilistic framework to separate convolutive mixtures of acoustic signals using independent vector analysis (IVA), which is based on a joint distribution for the frequency components originating from the same source and is capable of preventing permutation disorder. Different gaussian mixture models (GMM) served as source priors, in contrast to the original IVA model, where all sources were modeled by identical multivariate Laplacian distributions. This flexible source prior enabled the IVA model to separate different types of signals. Three classes of models were derived and tested: noiseless IVA, online IVA, and noisy IVA. In the IVA model without sensor noise, the unmixing matrices were efficiently estimated by the expectation maximization (EM) algorithm. An online EM algorithm was derived for the online IVA algorithm to track the movement of the sources and separate them under nonstationary conditions. The noisy IVA model included the sensor noise and combined denoising with separation. An EM algorithm was developed that found the model parameters and separated the sources simultaneously. These algorithms were applied to separate mixtures of speech and music. Performance as measured by the signal-to-interference ratio (SIR) was substantial for all three models.
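The signal-to-interference ratio mentioned above is commonly reported in decibels as ten times the base-10 logarithm of the ratio of target energy to interference energy. The sketch below assumes that a separated output has already been decomposed into a target component and an interference component. The paper's exact evaluation protocol is not stated in the abstract, and the sample values are hypothetical.

```python
import math

def sir_db(target, interference):
    """Signal-to-interference ratio in dB from two equal-length sample sequences."""
    p_target = sum(s * s for s in target)        # energy of the target component
    p_interf = sum(s * s for s in interference)  # energy of the residual interference
    if p_interf == 0:
        raise ValueError("interference energy is zero; SIR is unbounded")
    return 10.0 * math.log10(p_target / p_interf)

# Hypothetical samples: the interference has one tenth of the target's amplitude.
target = [1.0, -1.0, 1.0, -1.0]
interference = [0.1, -0.1, 0.1, -0.1]
sir = sir_db(target, interference)
```

An amplitude ratio of 10 corresponds to an energy ratio of 100, so this example evaluates to about 20 dB.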
