Machine learning friendly set version of Johnson–Lindenstrauss lemma

2019 ◽  
Vol 62 (5) ◽  
pp. 1961-2009
Author(s):  
Mieczysław A. Kłopotek

Abstract The widely discussed and applied Johnson–Lindenstrauss (JL) Lemma has an existential form saying that for each set of data points $Q$ in $n$-dimensional space there exists a transformation $f$ into an $n'$-dimensional space ($n'<n$) such that for each pair $\mathbf{u},\mathbf{v} \in Q$: $(1-\delta)\Vert \mathbf{u}-\mathbf{v}\Vert^2 \le \Vert f(\mathbf{u})-f(\mathbf{v})\Vert^2 \le (1+\delta)\Vert \mathbf{u}-\mathbf{v}\Vert^2$ for a user-defined error parameter $\delta$. Furthermore, it is asserted that with some finite probability the transformation $f$ may be found as a random projection (with scaling) onto the $n'$-dimensional subspace, so that after sufficiently many repetitions of the random projection, $f$ will be found with a user-defined success rate $1-\epsilon$. In this paper, we make a novel use of the JL Lemma. We prove a theorem stating that the target dimensionality of a random-projection-type JL linear transformation can be chosen so that, with probability $1-\epsilon$, all data points from $Q$ fall into the predefined error range $\delta$ for any user-predefined failure probability $\epsilon$ when performing a single random projection. This result is important for applications such as data clustering, where we want an a priori dimensionality-reducing transformation instead of attempting a (large) number of them, as with the traditional Johnson–Lindenstrauss Lemma. Furthermore, we investigate the important issue of whether the projection according to the JL Lemma is really useful for data processing, that is, whether the solutions to clustering in the projected space apply to the original space. In particular, we take a closer look at the k-means algorithm and prove that a good solution in the projected space is also a good solution in the original space. Furthermore, under proper assumptions local optima in the original space are also local optima in the projected space. We also investigate the broader issue of preserving clusterability under the JL Lemma projection. We define the conditions under which the clusterability property of the original space is transmitted to the projected space, so that a broad class of clustering algorithms for the original space is applicable in the projected space.
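As a rough illustration of the kind of transformation the lemma concerns, the sketch below performs a single scaled Gaussian random projection and checks the pairwise-distance distortion. The target-dimension formula used here is a classical JL-type bound, not the set-friendly bound derived in the paper, and all names and parameter values are illustrative.

```python
import numpy as np

def jl_random_projection(X, delta=0.3, rng=None):
    """Single scaled Gaussian random projection of the rows of X.

    The target dimension n' below uses a classical Dasgupta-Gupta style bound,
    4*ln(m) / (delta^2/2 - delta^3/3); the paper derives its own, set-friendly
    choice of n' (and failure probability), which is not reproduced here.
    """
    rng = np.random.default_rng(rng)
    m, n = X.shape
    n_prime = int(np.ceil(4.0 * np.log(m) / (delta ** 2 / 2 - delta ** 3 / 3)))
    n_prime = min(n_prime, n)                                  # never project upwards
    R = rng.standard_normal((n, n_prime)) / np.sqrt(n_prime)   # projection with scaling
    return X @ R

# toy check: most squared pairwise distances stay within roughly (1 +/- delta)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5000))
Y = jl_random_projection(X, delta=0.3, rng=1)
i, j = np.triu_indices(len(X), k=1)
ratio = ((Y[i] - Y[j]) ** 2).sum(axis=1) / ((X[i] - X[j]) ** 2).sum(axis=1)
print(Y.shape[1], ratio.min(), ratio.max())
```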

Author(s):  
Ping Deng ◽  
Qingkai Ma ◽  
Weili Wu

Clustering can be considered the most important unsupervised learning problem. It has been discussed thoroughly by both the statistics and database communities due to its numerous applications in problems such as classification, machine learning, and data mining. A summary of clustering techniques can be found in (Berkhin, 2002). Most well-known clustering algorithms, such as DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) and CURE (Guha, Rastogi, & Shim, 1998), cluster data points based on all dimensions. As the dimensionality of the space grows, these algorithms lose efficiency and accuracy because of the so-called "curse of dimensionality". It is shown in (Beyer, Goldstein, Ramakrishnan, & Shaft, 1999) that computing distances based on all dimensions is not meaningful in high-dimensional space, since the distance from a point to its nearest neighbor approaches the distance to its farthest neighbor as the dimensionality increases. In fact, natural clusters might exist in subspaces: data points in different clusters may be correlated with respect to different subsets of dimensions. To address this problem, feature selection (Kohavi & Sommerfield, 1995) and dimension reduction (Raymer, Punch, Goodman, Kuhn, & Jain, 2000) have been proposed to find the closely correlated dimensions for all the data and the clusters in those dimensions. Although both methods reduce the dimensionality of the space before clustering, they do not handle well the case where clusters exist in different subspaces of the full dimensionality. Projected clustering has been proposed recently to deal effectively with high dimensionality. Finding clusters and their relevant dimensions is the objective of projected clustering algorithms. Instead of projecting the entire dataset onto the same subspace, projected clustering focuses on finding a specific projection for each cluster such that similarity is preserved as much as possible.
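The nearest-versus-farthest-neighbor observation attributed to Beyer et al. can be illustrated with a few lines of code; this is a generic numeric demonstration, not material from the cited papers, and the point count and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.random((500, dim))                 # uniform points in the unit hypercube
    q = rng.random(dim)                        # a query point
    d = np.linalg.norm(X - q, axis=1)
    # the relative contrast between farthest and nearest neighbour shrinks with dimension
    print(f"dim={dim:5d}  (d_max - d_min) / d_min = {(d.max() - d.min()) / d.min():.3f}")
```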


Author(s):  
Samuel Melton ◽  
Sharad Ramanathan

Abstract Motivation Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions. Results We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace. Availability and implementation https://github.com/smelton/SMD. Supplementary information Supplementary data are available at Bioinformatics online.
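A minimal sketch of the core idea of averaging discriminators over proposed cluster configurations follows. The proposal mechanism (k-means on random feature subsets), the logistic-regression discriminator, and all sizes are placeholders; the actual SMD procedure in the linked repository differs in these choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def score_features(X, n_configs=50, k=5, seed=0):
    """Score each feature by its average absolute weight in linear discriminators
    trained on an ensemble of proposed cluster configurations (a sketch only)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.zeros(p)
    for _ in range(n_configs):
        # propose a cluster configuration from a random subset of features
        cols = rng.choice(p, size=max(2, p // 10), replace=False)
        labels = KMeans(n_clusters=k, n_init=5,
                        random_state=int(rng.integers(1_000_000))).fit_predict(X[:, cols])
        # train a linear discriminator on all features for this configuration
        clf = LogisticRegression(max_iter=1000).fit(X, labels)
        scores += np.abs(clf.coef_).mean(axis=0)
    return scores / n_configs

# features with the highest scores span the low-dimensional subspace
# in which the final clustering is carried out
```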


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate the effect of the first phase, which detects dense regions, on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL can discover clusters in all subspaces with high quality, and its efficiency outperforms that of PROCLUS.
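The dense/sparse-region detection in the first phase can be pictured with a crude per-dimension density filter. This is only a sketch of the general idea, not MOSCL's actual relevance analysis; the histogram binning, thresholds, and voting rule are invented for illustration.

```python
import numpy as np

def filter_sparse_region_points(X, n_bins=20, min_frac=0.02):
    """Flag points that mostly fall into sparse 1-D regions.

    A crude per-dimension density vote: a point is kept if the histogram bin
    it occupies is dense in at least half of the dimensions.
    """
    n, p = X.shape
    dense_votes = np.zeros(n)
    for j in range(p):
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        bins = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        dense_votes += counts[bins] >= min_frac * n
    keep = dense_votes >= p / 2
    return X[keep], np.where(~keep)[0]     # retained points and outlier indices
```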


2019 ◽  
Vol 20 (S19) ◽  
Author(s):  
Thomas A. Geddes ◽  
Taiyun Kim ◽  
Lihao Nan ◽  
James G. Burchfield ◽  
Jean Y. H. Yang ◽  
...  

Abstract Background Single-cell RNA-sequencing (scRNA-seq) is a transformative technology, allowing global transcriptomes of individual cells to be profiled with high accuracy. An essential task in scRNA-seq data analysis is the identification of cell types from complex samples or tissues profiled in an experiment. To this end, clustering has become a key computational technique for grouping cells based on their transcriptome profiles, enabling subsequent cell type identification from each cluster of cells. Due to the high feature-dimensionality of the transcriptome (i.e. the large number of measured genes in each cell) and because only a small fraction of genes are cell type-specific and therefore informative for generating cell type-specific clusters, clustering directly on the original feature/gene dimension may lead to uninformative clusters and hinder correct cell type identification. Results Here, we propose an autoencoder-based cluster ensemble framework in which we first take random subspace projections from the data, then compress each random projection to a low-dimensional space using an autoencoder artificial neural network, and finally apply ensemble clustering across all encoded datasets to generate clusters of cells. We employ four evaluation metrics to benchmark clustering performance and our experiments demonstrate that the proposed autoencoder-based cluster ensemble can lead to substantially improved cell type-specific clusters when applied with both the standard k-means clustering algorithm and a state-of-the-art kernel-based clustering algorithm (SIMLR) designed specifically for scRNA-seq data. Compared to directly using these clustering algorithms on the original datasets, the performance improvement in some cases is up to 100%, depending on the evaluation metric used. Conclusions Our results suggest that the proposed framework can facilitate more accurate cell type identification as well as other downstream analyses. The code for creating the proposed autoencoder-based cluster ensemble framework is freely available from https://github.com/gedcom/scCCESS
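A compact sketch of the three stages described above (random subspace projection, autoencoder compression, ensemble clustering), written with PyTorch and scikit-learn. The network sizes, the co-association consensus step, and all parameters are placeholders rather than the scCCESS settings published at the linked repository.

```python
import numpy as np
import torch
from torch import nn
from sklearn.cluster import KMeans

def encode(X, code_dim=16, epochs=200, lr=1e-3):
    """Compress X with a small fully connected autoencoder; return the code layer."""
    X_t = torch.tensor(X, dtype=torch.float32)
    d = X_t.shape[1]
    enc = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, code_dim))
    dec = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, d))
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(X_t)), X_t)
        loss.backward()
        opt.step()
    return enc(X_t).detach().numpy()

def ensemble_cluster(X, k, n_members=10, sub_frac=0.5, seed=0):
    """Random subspace projections -> autoencoder codes -> consensus clustering."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    member_labels = []
    for _ in range(n_members):
        cols = rng.choice(p, size=int(sub_frac * p), replace=False)  # random subspace
        codes = encode(X[:, cols])                                   # compressed representation
        member_labels.append(KMeans(n_clusters=k, n_init=10).fit_predict(codes))
    L = np.stack(member_labels, axis=1)                              # (n, n_members)
    co = (L[:, None, :] == L[None, :, :]).mean(axis=2)               # co-association matrix
    return KMeans(n_clusters=k, n_init=10).fit_predict(co)           # consensus labels
```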


2021 ◽  
Vol 11 (15) ◽  
pp. 6963
Author(s):  
Jan Y. K. Chan ◽  
Alex Po Leung ◽  
Yunbo Xie

Using random projection, a method to speed up both kernel k-means and centroid initialization with k-means++ is proposed. Motivated by upper error bounds, we approximate the kernel matrix and distances in a lower-dimensional space R^d before the kernel k-means clustering. With random projections, previous work on bounds for dot products and an improved bound for kernel methods are considered for kernel k-means. The complexities of both kernel k-means with Lloyd's algorithm and centroid initialization with k-means++ are known to be O(nkD) and Θ(nkD), respectively, with n the number of data points, D the dimensionality of the input feature vectors, and k the number of clusters. The proposed method reduces the computational complexity of the kernel computation in kernel k-means from O(n²D) to O(n²d), and the subsequent computation for k-means with Lloyd's algorithm and centroid initialization from O(nkD) to O(nkd). Our experiments demonstrate that the speed-up of the clustering method with reduced dimensionality d=200 is 2 to 26 times, with very little performance degradation (less than one percent) in general.
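A sketch of the dimensionality-reduction step only: project the data once with a scaled Gaussian matrix, then evaluate the kernel and run k-means++ in the d-dimensional space. The RBF kernel, the gamma value, and d=200 are placeholder choices, and the kernel k-means iterations on the approximated kernel matrix are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def random_project(X, d=200, seed=0):
    """Gaussian random projection R^D -> R^d; the 1/sqrt(d) scaling keeps
    expected squared norms (and hence dot products) roughly unchanged."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], d)) / np.sqrt(d)
    return X @ R

X = np.random.default_rng(1).standard_normal((1000, 5000))   # n points in D dimensions
Xd = random_project(X, d=200)

# kernel matrix built in the projected space: O(n^2 d) work instead of O(n^2 D)
K = rbf_kernel(Xd, gamma=1.0 / Xd.shape[1])

# k-means++ initialization and Lloyd iterations likewise run on the d-dim data: O(nkd)
labels = KMeans(n_clusters=10, init="k-means++", n_init=10).fit_predict(Xd)
```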


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Baicheng Lyu ◽  
Wenhua Wu ◽  
Zhiqiang Hu

Abstract With the wide application of cluster analysis, the number of clusters being considered is gradually increasing, as is the difficulty of selecting judgment indicators for the number of clusters. Moreover, small clusters are crucial for discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes connections between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the number of parameters to be adjusted to a minimum. Based on the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. Different cutoff distances and cutoff densities are assigned to each data cluster, which results in improved clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city light satellite images.
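For illustration, the local density of a point is commonly defined as the number of neighbours within a cutoff distance; the snippet below computes that quantity. BCALoD goes further and assigns cluster-specific cutoff distances and densities, which this sketch does not attempt.

```python
import numpy as np
from scipy.spatial.distance import cdist

def local_density(X, cutoff):
    """Number of neighbours of each point within `cutoff` (the point itself excluded)."""
    D = cdist(X, X)
    return (D < cutoff).sum(axis=1) - 1
```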


2021 ◽  
Vol 12 (4) ◽  
pp. 169-185
Author(s):  
Saida Ishak Boushaki ◽  
Omar Bendjeghaba ◽  
Nadjet Kamel

Clustering is an important unsupervised analysis technique for big data mining. It finds application in several domains, including the biomedical documents of the MEDLINE database. Document clustering based on metaheuristics is an active research area. However, these algorithms suffer from getting trapped in local optima, require many parameters to be tuned, and the documents must be indexed by a high-dimensionality matrix when the traditional vector space model is used. To overcome these limitations, this paper proposes a new parameter-free document clustering algorithm (ASOS-LSI). It is based on the recent symbiotic organisms search (SOS) metaheuristic and enhanced by an acceleration technique. Furthermore, the documents are represented by semantic indexing based on the well-known latent semantic indexing (LSI). Experiments conducted on well-known biomedical document datasets show the significant superiority of ASOS-LSI over five famous algorithms in terms of compactness, F-measure, purity, misclassified documents, entropy, and runtime.
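The LSI representation step can be sketched as a TF-IDF matrix reduced by truncated SVD; the toy documents and the number of components below are placeholders, and the SOS metaheuristic that subsequently clusters these vectors is not reproduced.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "protein binding assay results",
    "gene expression in tumour cells",
    "randomised clinical trial of a new drug",
]                                                            # placeholder documents

tfidf = TfidfVectorizer().fit_transform(docs)                # high-dimensional term space
lsi_vectors = TruncatedSVD(n_components=2).fit_transform(tfidf)
# `lsi_vectors` is the compact semantic representation handed to the clustering metaheuristic
```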


2021 ◽  
Author(s):  
Faruk Alpak ◽  
Yixuan Wang ◽  
Guohua Gao ◽  
Vivek Jain

Abstract Recently, a novel distributed quasi-Newton (DQN) derivative-free optimization (DFO) method was developed for generic reservoir performance optimization problems, including well-location optimization (WLO) and well-control optimization (WCO). DQN is designed to effectively locate multiple local optima of highly nonlinear optimization problems. However, its performance has neither been validated on realistic applications nor compared to other DFO methods. We have integrated DQN into a versatile field-development optimization platform designed specifically for iterative workflows enabled through distributed-parallel flow simulations. DQN is benchmarked against alternative DFO techniques, namely the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method hybridized with Direct Pattern Search (BFGS-DPS), Mesh Adaptive Direct Search (MADS), Particle Swarm Optimization (PSO), and the Genetic Algorithm (GA). DQN is a multi-thread optimization method that distributes an ensemble of optimization tasks among multiple high-performance-computing nodes. Thus, it can locate multiple optima of the objective function in parallel within a single run. Simulation results computed from one DQN optimization thread are shared with the others by updating a unified set of training data points composed of the responses (implicit variables) of all successful simulation jobs. The sensitivity matrix at the current best solution of each optimization thread is approximated by a linear-interpolation technique using all or a subset of the training data points. The gradient of the objective function is computed analytically using the estimated sensitivities of the implicit variables with respect to the explicit variables. The Hessian matrix is then updated using the quasi-Newton method. A new search point for each thread is obtained by solving a trust-region subproblem for the next iteration. In contrast, other DFO methods rely on a single-thread optimization paradigm that can only locate a single optimum. To locate multiple optima with such methods, one must repeat the same optimization process multiple times, starting from different initial guesses. Moreover, simulation results generated from a single-thread optimization task cannot be shared with other tasks. Benchmarking results are presented for synthetic yet challenging WLO and WCO problems. Finally, the DQN method is field-tested on two realistic applications. DQN identifies the global optimum with the fewest simulations and the shortest run time on a synthetic problem with a known solution. On the other benchmarking problems, which have no known solution, DQN identified comparable local optima with a reasonably smaller number of simulations than the alternative techniques. The field-testing results reinforce the auspicious computational attributes of DQN. Overall, the results indicate that DQN is a novel and effective parallel algorithm for field-scale development optimization problems.
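Some of the ingredients mentioned above can be sketched generically: an affine (linear-interpolation) estimate of the sensitivity matrix from previously simulated points, the chain-rule gradient, and a standard BFGS Hessian update. This is not the DQN implementation; the function and variable names are illustrative assumptions.

```python
import numpy as np

def estimate_sensitivities(X, R, x0):
    """Least-squares affine fit of the responses R (implicit variables) around x0,
    giving an estimate of dR/dx from existing training points X (explicit variables)."""
    A = np.hstack([np.ones((len(X), 1)), X - x0])
    coef, *_ = np.linalg.lstsq(A, R, rcond=None)
    return coef[1:]                               # shape (n_explicit, n_implicit)

def chain_rule_gradient(S, dJ_dR):
    """Gradient of the objective w.r.t. the explicit variables via the sensitivities."""
    return S @ dJ_dR

def bfgs_update(H, s, y):
    """Standard BFGS update of the Hessian approximation from step s and gradient change y."""
    sy = s @ y
    if sy <= 1e-12:
        return H                                  # skip when the curvature condition fails
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / sy
```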


2018 ◽  
Vol 7 (3) ◽  
pp. 581-604 ◽  
Author(s):  
Armin Eftekhari ◽  
Michael B Wakin ◽  
Rachel A Ward

Abstract Leverage scores, loosely speaking, reflect the importance of the rows and columns of a matrix. Ideally, given the leverage scores of a rank-r matrix $M\in \mathbb{R}^{n\times n}$, that matrix can be reliably completed from just $O(rn\log^{2}n)$ samples if the samples are chosen randomly from a non-uniform distribution induced by the leverage scores. In practice, however, the leverage scores are often unknown a priori. As such, the sample complexity in uniform matrix completion (using uniform random sampling) increases to $O(\eta(M)\cdot rn\log^{2}n)$, where $\eta(M)$ is the largest leverage score of $M$. In this paper, we propose a two-phase algorithm called MC2 for matrix completion: in the first phase, the leverage scores are estimated based on uniform random samples, and in the second phase the matrix is resampled non-uniformly based on the estimated leverage scores and then completed. For well-conditioned matrices, the total sample complexity of MC2 is no worse than that of uniform matrix completion, and for certain classes of well-conditioned matrices, namely reasonably coherent matrices whose leverage scores exhibit mild decay, MC2 requires substantially fewer samples. Numerical simulations suggest that the algorithm outperforms uniform matrix completion on a broad class of matrices and, in particular, is much less sensitive to the condition number than our theory currently requires.
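The quantities involved are easy to write down: the leverage scores of a rank-r estimate and a non-uniform sampling distribution over entries induced by them. The exact distribution and the phase-one estimator used by MC2 may differ from this sketch, so the choices below are assumptions.

```python
import numpy as np

def leverage_scores(M_hat, r):
    """Row and column leverage scores of a rank-r matrix estimate M_hat."""
    U, _, Vt = np.linalg.svd(M_hat, full_matrices=False)
    row = np.sum(U[:, :r] ** 2, axis=1)     # ||U_r^T e_i||^2
    col = np.sum(Vt[:r, :] ** 2, axis=0)    # ||V_r^T e_j||^2
    return row, col

def entry_sampling_probabilities(row, col):
    """One standard non-uniform distribution over matrix entries induced by the scores."""
    P = np.add.outer(row, col)
    return P / P.sum()
```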


1997 ◽  
Vol 07 (08) ◽  
pp. 1791-1809 ◽  
Author(s):  
Fawad Rauf ◽  
Hassan M. Ahmed

We present a new approach to nonlinear adaptive filtering based on Successive Linearization. Our approach provides a simple, modular, and unified implementation for a broad class of polynomial filters. We refer to this implementation as the layered structure and note that it offers substantial computational efficiency over previous methods. A new class of Polynomial Autoregressive filters is introduced which can model limit-cycle and chaotic dynamics. Existing geometric methods for modeling and characterizing chaotic processes suffer from several drawbacks: they require a huge number of data points to reconstruct the attractor geometry, and their performance is severely limited by noisy experimental measurements. We present a new method for processing chaotic signals using nonlinear adaptive filters and demonstrate the modeling, prediction, and filtering of these signals. We also show how the growth rate of the prediction error can be used to estimate the effective Lyapunov exponent of the chaotic map. Our approach requires orders of magnitude fewer data points and is robust to noise in the experimental data. Although reconstruction of the attractor geometry is unnecessary, the adaptive filter contains most of the geometric information.
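As a generic illustration of a polynomial adaptive predictor applied to a chaotic signal (not the layered Successive Linearization structure of the paper), the sketch below adapts a second-order polynomial filter with the LMS rule and predicts a logistic-map series one step ahead; the tap count, step size, and map parameter are arbitrary.

```python
import numpy as np

def polynomial_lms_predict(x, taps=3, mu=0.05):
    """One-step-ahead prediction with a second-order (quadratic) polynomial
    filter over the last `taps` samples, adapted by the LMS rule."""
    preds = np.zeros_like(x)
    w = None
    for t in range(taps, len(x)):
        u = x[t - taps:t]
        # linear terms plus quadratic cross-terms of the tap vector
        phi = np.concatenate([u, np.outer(u, u)[np.triu_indices(taps)]])
        if w is None:
            w = np.zeros_like(phi)
        preds[t] = w @ phi
        w += mu * (x[t] - preds[t]) * phi          # LMS weight update
    return preds

# chaotic test signal: the logistic map x_{t+1} = 3.9 * x_t * (1 - x_t)
series = np.empty(2000)
series[0] = 0.4
for t in range(1, len(series)):
    series[t] = 3.9 * series[t - 1] * (1 - series[t - 1])
pred = polynomial_lms_predict(series)
print(np.mean((pred[1000:] - series[1000:]) ** 2))  # prediction error after adaptation
```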

