Mixture-of-Experts Variational Autoencoder for clustering and generating from similarity-based representations on single cell data

Clustering high-dimensional data, such as images or biological measurements, is a long-standing problem and has been studied extensively. Recently, Deep Clustering has gained popularity due to its flexibility in fitting the specific peculiarities of complex data. Here we introduce the Mixture-of-Experts Similarity Variational Autoencoder (MoE-Sim-VAE), a novel generative clustering model. The model can learn multi-modal distributions of high-dimensional data and use these to generate realistic data with high efficacy and efficiency. MoE-Sim-VAE is based on a Variational Autoencoder (VAE), where the decoder consists of a Mixture-of-Experts (MoE) architecture. This specific architecture allows for various modes of the data to be automatically learned by means of the experts. Additionally, we encourage the lower dimensional latent representation of our model to follow a Gaussian mixture distribution and to accurately represent the similarities between the data points. We assess the performance of our model on the MNIST benchmark data set and challenging real-world tasks of clustering mouse organs from single-cell RNA-sequencing measurements and defining cell subpopulations from mass cytometry (CyTOF) measurements on hundreds of different datasets. MoE-Sim-VAE exhibits superior clustering performance on all these tasks in comparison to the baselines as well as competitor methods.

The Generalized Bayes Method for High-Dimensional Data Recognition with Applications to Audio Signal Recognition

Symmetry ◽ 10.3390/sym13010019 ◽ 2020 ◽ Vol 13 (1) ◽ pp. 19

Author(s): Hsiuying Wang

Keyword(s): Gaussian Mixture Model ◽ Mixture Model ◽ Conventional Method ◽ High Dimensional Data ◽ Audio Signal ◽ Gaussian Mixture ◽ High Dimensional ◽ Signal Recognition ◽ Bayes Method ◽ Generalized Bayes

The high-dimensional data recognition problem based on the Gaussian mixture model (GMM) has useful applications in many areas, such as audio signal recognition, image analysis, and biological evolution. The expectation-maximization algorithm is a popular approach to deriving the maximum likelihood estimators of the GMM. An alternative solution is to adopt a generalized Bayes estimator for parameter estimation. In this study, an estimator based on the generalized Bayes approach is established. A simulation study shows that the proposed approach performs competitively with the conventional method in high-dimensional Gaussian mixture model recognition. We use a musical data example to illustrate this recognition problem. Suppose that we have audio data of a piece of music and know that the music is from one of four compositions, but we do not know exactly which composition it comes from. The generalized Bayes method shows a higher average recognition rate than the conventional method. This result shows that the generalized Bayes method is a competitor to the conventional method in this real application.

A Novel Density-based Technique for Outlier Detection of High Dimensional Data Utilizing Full Feature Space

Information Technology And Control ◽ 10.5755/j01.itc.50.1.25588 ◽ 2021 ◽ Vol 50 (1) ◽ pp. 138-152

Author(s): Mujeeb Ur Rehman ◽ Dost Muhammad Khan

Keyword(s): Data Mining ◽ Outlier Detection ◽ High Dimensional Data ◽ Research Work ◽ Feature Space ◽ High Dimensional ◽ Data Set ◽ Data Points ◽ Low Dimensional ◽ Intrinsic Feature

Recently, anomaly detection has attracted considerable attention from data mining researchers as its use has grown steadily in practical domains such as product marketing, fraud detection, medical diagnosis, and fault detection. Outlier detection in high-dimensional data poses exceptional challenges because of the curse of dimensionality and the resulting resemblance of distant and adjoining points. Traditional algorithms and techniques for outlier detection were tested on the full feature space. Customary methodologies concentrate largely on low-dimensional data and hence are ineffective at discovering anomalies in a data set with a large number of dimensions. Digging out anomalies in a high-dimensional data set becomes very difficult and tedious when all subsets of projections need to be explored. All data points in high-dimensional data behave like similar observations because of an intrinsic property: the relative difference between the distances of near and far observations approaches zero as the number of dimensions tends to infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. This state-of-the-art technique opens a new line of research towards resolving inherent problems of high-dimensional data where outliers reside within clusters having different densities. A high-dimensional dataset from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
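The distance-concentration effect mentioned in the abstract above can be illustrated with a small experiment. The following is a minimal sketch, not the authors' method: the function name `relative_contrast`, the uniform unit-hypercube data, and the sample sizes are all assumptions made for illustration. It measures how much the farthest point differs from the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points random points in the unit hypercube.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)
```

Calling `relative_contrast(200, d)` for increasing `d` (for example 2, 10, 100, 1000) would be expected to give a shrinking value, which is why distance-based outlier scores lose discriminative power on the full feature space.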

GMM with parameters initialization based on SVD for network threat detection

Journal of Intelligent & Fuzzy Systems ◽ 10.3233/jifs-200066 ◽ 2021 ◽ Vol 40 (1) ◽ pp. 477-490

Author(s): Yanping Xu ◽ Tingcong Ye ◽ Xin Wang ◽ Yuping Lai ◽ Jian Qiu ◽ ...

Keyword(s): Gaussian Mixture Models ◽ Singular Values ◽ Gaussian Mixture ◽ Singular Value ◽ Optimal Parameters ◽ Clustering Methods ◽ Data Set ◽ Detection Model ◽ Clustering Model ◽ Threat Behavior

In the field of security, data labels are often unknown or too expensive to obtain, so clustering methods are used to detect threat behavior contained in big data. The most widely used probabilistic clustering model is the Gaussian Mixture Model (GMM), which is flexible and can incorporate prior knowledge to model the uncertainty of the data. Therefore, in this paper, we use GMM to build the threat behavior detection model. Commonly, Expectation Maximization (EM) and Variational Inference (VI) are used to estimate the optimal parameters of GMM. However, both EM and VI are quite sensitive to the initial values of the parameters. Therefore, we propose to use Singular Value Decomposition (SVD) to initialize the parameters. First, SVD is used to factorize the data set matrix to obtain the singular value matrix and the singular matrices. Then we calculate the number of GMM components from the first two singular values in the singular value matrix and the dimension of the data. Next, the other parameters of GMM, such as the mixing coefficients, the means and the covariances, are calculated based on the number of components. After that, the initial values of the parameters are passed to EM and VI to estimate the optimal parameters of GMM. The experimental results indicate that our proposed method performs well at initializing the parameters of GMM clustering when EM and VI are used for parameter estimation.
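To show where the initial values enter the estimation, here is a minimal sketch of EM for a one-dimensional GMM. It is an illustration under stated assumptions, not the paper's implementation: the SVD-based initializer is not reproduced (the abstract does not give its formulas), so the initial weights, means and variances are simply passed in by the caller. The names `em_gmm_1d` and `sample_two_clusters` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, weights, means, variances, n_iter=50):
    """Run EM for a 1-D Gaussian mixture, starting from the given parameters."""
    k = len(weights)
    weights, means, variances = list(weights), list(means), list(variances)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # component received no mass; leave it unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
    return weights, means, variances


def sample_two_clusters(n_per_cluster, seed=0):
    """Hypothetical toy data: two unit-variance clusters centred at -3 and +3."""
    rng = random.Random(seed)
    left = [rng.gauss(-3.0, 1.0) for _ in range(n_per_cluster)]
    right = [rng.gauss(3.0, 1.0) for _ in range(n_per_cluster)]
    return left + right
```

Running `em_gmm_1d` on the same data with different starting means is a simple way to observe the sensitivity to initialization that motivates the paper; any initializer, SVD-based or otherwise, would only need to supply the three starting lists.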

Data segmentation based on the local intrinsic dimension

Scientific Reports ◽ 10.1038/s41598-020-72222-0 ◽ 2020 ◽ Vol 10 (1)

Author(s): Michele Allegra ◽ Elena Facco ◽ Francesco Denti ◽ Alessandro Laio ◽ Antonietta Mira

Keyword(s): High Dimensional Data ◽ Large Data ◽ Large Data Sets ◽ High Dimensional ◽ Data Sets ◽ Imaging Data ◽ Unsupervised Segmentation ◽ Real World Data ◽ Data Set ◽ Intrinsic Dimension

One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
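The notion of intrinsic dimension can be made concrete with a small estimator. The sketch below uses the two-nearest-neighbour ratio idea: for each point, the ratio of the distances to its second and first neighbours is collected, and the ID is estimated by maximum likelihood as N divided by the sum of the log-ratios. The abstract does not name its estimator, so this choice is an assumption, and the sketch returns a single global ID; it does not perform the local segmentation the paper describes. The names `two_nn_id` and `sample_plane` are hypothetical.

```python
import math
import random


def two_nn_id(points):
    """Global intrinsic-dimension estimate from nearest-neighbour distance ratios.

    For each point, mu = r2 / r1 (second over first neighbour distance);
    the maximum-likelihood estimate is N / sum(log(mu)).
    """
    log_ratios = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point (O(N^2), fine for a demo).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, for which the ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


def sample_plane(n_points, ambient_dim, seed=0):
    """Hypothetical toy data: a 2-D unit square embedded in ambient_dim dimensions."""
    rng = random.Random(seed)
    return [
        [rng.random(), rng.random()] + [0.0] * (ambient_dim - 2)
        for _ in range(n_points)
    ]
```

For points drawn with `sample_plane(400, 5)`, one would expect an estimate close to 2 even though each point has five coordinates, which is the gap between ambient and intrinsic dimension that the abstract refers to.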

Outlier Detection Algorithm Basing on Similarity Measurement Relation

Advanced Engineering Forum ◽ 10.4028/www.scientific.net/aef.6-7.621 ◽ 2012 ◽ Vol 6-7 ◽ pp. 621-624

Author(s): Hong Bin Fang

Keyword(s): Outlier Detection ◽ Credit Card ◽ High Dimensional Data ◽ Detection Algorithm ◽ Experimental Result ◽ Similarity Measurement ◽ High Dimensional ◽ Data Set ◽ Network Intrusion ◽ Metric Function

Outlier detection is an important field of data mining, which is widely used in credit card fraud detection, network intrusion detection, etc. The paper gives a similarity metric function for high-dimensional data and the concept of class density, based on the combination of hierarchical clustering and similarity. After redefining density-based outliers in high dimensions, it presents an outlier detection algorithm based on similarity measurement. Experimental results indicate that the algorithm has some value for outlier detection in high-dimensional data sets.

How to visualize high-dimensional data: a roadmap

Journal of Data Mining & Digital Humanities ◽ 10.46298/jdmdh.5594 ◽ 2020 ◽ Vol Special issue on...

Author(s): Hermann Moisl

Keyword(s): Cluster Analysis ◽ High Dimensional Data ◽ Latent Structure ◽ High Dimensional ◽ Graphical Methods ◽ Data Set ◽ The Third ◽ Historical Text ◽ International Audience ◽ And Cluster Analysis

Discovery of the chronological or geographical distribution of collections of historical text can be more reliable when based on multivariate rather than on univariate data because multivariate data provide a more complete description. Where the data are high-dimensional, however, their complexity can defy analysis using traditional philological methods. The first step in dealing with such data is to visualize them using graphical methods in order to identify any latent structure. If found, such structure facilitates formulation of hypotheses which can be tested using a range of mathematical and statistical methods. Where, however, the dimensionality is greater than 3, direct graphical investigation is impossible. The present discussion presents a roadmap of how this obstacle can be overcome, and is in three main parts: the first part presents some fundamental data concepts, the second describes an example corpus and a high-dimensional data set derived from it, and the third outlines two approaches to visualization of that data set: dimensionality reduction and cluster analysis.

Inter and Intra-Clonal Heterogeneity in Multiple Myeloma and Waldenstrom Macroglobulinemia

Blood ◽ 10.1182/blood.v124.21.2070.2070 ◽ 2014 ◽ Vol 124 (21) ◽ pp. 2070-2070

Author(s): Jana Jakubikova ◽ Danka Cholujova ◽ Teru Hideshima ◽ Jacob P. Laubach ◽ Nikhil C. Munshi ◽ ...

Keyword(s): Multiple Myeloma ◽ B Cells ◽ B Cell ◽ Single Cell ◽ High Dimensional Data ◽ Signaling Molecules ◽ High Dimensional ◽ Disease Pathogenesis ◽ Clonal Heterogeneity ◽ Cytogenetic Abnormalities

Introduction: Intra-clonal heterogeneity in malignant plasma cells (PC) and B-cells has recently been reported in both multiple myeloma (MM) and Waldenstrom macroglobulinemia (WM). Further phenotypic and molecular characterization of inter- and intra-clonal genetic complexity will enhance our understanding of disease pathogenesis and identify novel therapeutic strategies. Methods: In this study, we compared normal and malignant PC maturation-associated B-cell subsets using bone marrow samples from individuals with monoclonal gammopathy of undetermined significance (MGUS), smoldering MM (SMM), newly diagnosed MM, and relapsed/refractory MM versus age-matched healthy donors (HD). We also similarly analyzed WM. In addition to corrupted B-cell lineage, we examined phenotypic and molecular features of intra-clonal architecture (complexity) of malignant PC in MM and clonal B-cells in WM on a single cell level using time-of-flight mass cytometry (CyTOF) technology. CyTOF technology is based on antibodies bound to stable rare earth elemental isotopes that target epitopes on and within cells: up to 40 different markers can be assessed simultaneously on a single cell, including phenotype, transcription factors, regulatory signaling molecules and enzymes, as well as activation of signaling molecules. The resulting high-dimensional data were analyzed by SPADE, viSNE and Wanderlust software. Results: Clustered analyses of our high-dimensional data showed significantly decreased CD19+CD27- patient cells in MM with cytogenetic abnormalities (cytog+) including del(13q), t(4;14), t(14;16), t(3;14), +1q or t(11;14) versus patient cells in MM without any cytogenetic abnormalities (cytog-; P=0.013). In contrast, there was a significant increase of transitional B cells (CD19+CD27-IgM+CD10-IgDlow) in patients with MM cytog+ vs. MM cytog- (P=0.028).
A significant increase of mature (naïve) B cells (CD19+CD27-IgM+CD10-IgD+) was also detected in MM cytog+ versus MM cytog- patients (P=0.013), but not in WM cytog- vs. WM cytog+ (46XY, -Y, +18q, +6p, 14q). Clonal PC (CD19-CD38++CD138+CD45-/dim; either cyk or cyl +) were significantly upregulated in MM cytog+ compared to MM cytog- (P=0.021) by CyTOF analyses. To investigate phenotypic profiles and the molecular signature of intra-clonal heterogeneity of PC in MM, we performed high-dimensional analyses by SPADE and viSNE, which revealed that clonal PC clustered separately from B cells by virtue of high CD319 and CD47 expression; variable expression of CD52, CD56, CD81, CD44, CD200; and low expression of CD28, CD117, CD338, CD325, and CD243. For example, adhesion CD56 and anti-adhesion CD52 molecules were significantly increased in MM cytog+ compared to MM cytog-. Clonal PC highly expressed IRF4 and Notch1; variously expressed FGFR3, sXBP-1, KLF4 and c-Myc; and only minimally expressed Bcl-6, WHSC-1 (MMSET) and RARa2. sXBP-1 was significantly upregulated in all MM stages compared to HD. Furthermore, expression of stem cell markers including Sox-2, Oct3/4 and Nestin was detected only at low level in clonal PC, except for higher expression of Nanog. In WM, clonal B cells expressed Bcl-6 (4-36%) and MYD88 (2-27%) by CyTOF analyses. Finally, cluster analyses by SPADE and viSNE allow for detection of phenotypic and molecular changes not only in clonal populations but also at distinct B-lineage maturation stages, such as expression of Pax-5 and Bcl-2 on early B cell progenitors. These data represent a cohort of MM (N=35) and WM (N=15) patients; a significantly larger data set of MM (N=100) and WM (N=50) will be presented. Conclusion: This study characterizes the molecular and phenotypic profile associated with inter- and intra-clonal heterogeneity in MM and WM. It not only enhances our understanding of disease pathogenesis, but may allow for individualized targeted therapy.
Disclosures: No relevant conflicts of interest to declare.

A System for Outlier Detection of High Dimensional Data

International Journal of Computer Science and Informatics ◽ 10.47893/ijcsi.2012.1037 ◽ 2012 ◽ pp. 197-201

Author(s): Bharat Gupta ◽ Durga Toshniwal

Keyword(s): Outlier Detection ◽ High Dimensional Data ◽ Research Problem ◽ High Dimensional ◽ Full Data ◽ Data Set ◽ Detection Techniques ◽ New Concepts ◽ Low Dimensional ◽ Important Research Problem

In high-dimensional data, a large number of outliers are embedded in low-dimensional subspaces; these are known as projected outliers. Most existing outlier detection techniques are unable to find projected outliers because they search for abnormal patterns in the full data space. Outlier detection in high-dimensional data has therefore become an important research problem. In this paper we propose an approach for outlier detection in high-dimensional data. We modify the existing SPOT approach by adding three new concepts: adaptation of the Sparse Sub-Space Template (SST), different combinations of PCS parameters, and a set of non-outlying cells for the testing data set.

Restraint Method Research for Coupling Random Error Based on High Dimensional Data Set Multiscale Analysis

Proceedings of the 2015 International Conference on Artificial Intelligence and Industrial Engineering ◽ 10.2991/aiie-15.2015.61 ◽ 2015

Author(s): X.Y. Zhou ◽ J.Q. Wang ◽ Z.M. Wang ◽ Y.Y. Jiao ◽ X.G. Pan ◽ ...

Keyword(s): Random Error ◽ Multiscale Analysis ◽ High Dimensional Data ◽ High Dimensional ◽ Data Set

A Novel Convex Clustering Method for High-Dimensional Data Using Semiproximal ADMM

Mathematical Problems in Engineering ◽ 10.1155/2020/9216351 ◽ 2020 ◽ Vol 2020 ◽ pp. 1-12

Author(s): Huangyue Chen ◽ Lingchen Kong ◽ Yan Li

Keyword(s): High Dimensional Data ◽ Group Lasso ◽ High Dimensional ◽ Clustering Methods ◽ Finite Sample ◽ Clustering Method ◽ Sparse Group Lasso ◽ Clustering Model ◽ Sample Error ◽ Convex Clustering

Clustering is an important ingredient of unsupervised learning; classical clustering methods include K-means clustering and hierarchical clustering. These methods may suffer from instability because they are prone to getting trapped in local optima of the nonconvex optimization model. In this paper, we propose a new convex clustering method for high-dimensional data based on the sparse group lasso penalty, which can simultaneously group observations and eliminate noninformative features. In this method, the number of clusters can be learned from the data instead of being given in advance as a parameter. We theoretically prove that the proposed method has desirable statistical properties, including a finite sample error bound and feature screening consistency. Furthermore, the semiproximal alternating direction method of multipliers is designed to solve the sparse group lasso convex clustering model, and its convergence analysis is established without any conditions. Finally, the effectiveness of the proposed method is thoroughly demonstrated through simulated experiments and real applications.
