Machine learning friendly set version of Johnson–Lindenstrauss lemma

2019 ◽  
Vol 62 (5) ◽  
pp. 1961-2009
Author(s):  
Mieczysław A. Kłopotek

Abstract The widely discussed and applied Johnson–Lindenstrauss (JL) Lemma has an existential form saying that for each set of data points $Q$ in $n$-dimensional space there exists a transformation $f$ into an $n'$-dimensional space ($n'<n$) such that for each pair $\mathbf{u},\mathbf{v} \in Q$: $(1-\delta)\Vert \mathbf{u}-\mathbf{v}\Vert^2 \le \Vert f(\mathbf{u})-f(\mathbf{v})\Vert^2 \le (1+\delta)\Vert \mathbf{u}-\mathbf{v}\Vert^2$ for a user-defined error parameter $\delta$. Furthermore, it is asserted that with some finite probability the transformation $f$ may be found as a random projection (with scaling) onto the $n'$-dimensional subspace, so that after sufficiently many repetitions of the random projection, $f$ will be found with a user-defined success rate $1-\epsilon$. In this paper, we make a novel use of the JL Lemma. We prove a theorem stating that the target dimensionality of a random-projection-type JL linear transformation can be chosen so that, with probability $1-\epsilon$, all data points from $Q$ fall into the predefined error range $\delta$ for any user-predefined failure probability $\epsilon$ when performing a single random projection. This result is important for applications such as data clustering, where we want an a priori dimensionality-reducing transformation instead of attempting a (large) number of them, as with the traditional Johnson–Lindenstrauss Lemma. Furthermore, we investigate the important issue of whether the projection according to the JL Lemma is really useful for data processing, that is, whether the solutions to clustering in the projected space apply to the original space. In particular, we take a closer look at the k-means algorithm and prove that a good solution in the projected space is also a good solution in the original space. Furthermore, under proper assumptions local optima in the original space are also local optima in the projected space. We also investigate the broader issue of preserving clusterability under the JL Lemma projection. We define the conditions under which the clusterability property of the original space is transmitted to the projected space, so that a broad class of clustering algorithms for the original space is applicable in the projected space.
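As a rough illustration of the kind of transformation the lemma concerns, the sketch below performs a single scaled Gaussian random projection and checks the pairwise-distance distortion. The target-dimension formula used here is a classical JL-type bound, not the set-friendly bound derived in the paper, and all names and parameter values are illustrative.

```python
import numpy as np

def jl_random_projection(X, delta=0.3, rng=None):
    """Single scaled Gaussian random projection of the rows of X.

    The target dimension n' below uses a classical Dasgupta-Gupta style bound,
    4*ln(m) / (delta^2/2 - delta^3/3); the paper derives its own, set-friendly
    choice of n' (and failure probability), which is not reproduced here.
    """
    rng = np.random.default_rng(rng)
    m, n = X.shape
    n_prime = int(np.ceil(4.0 * np.log(m) / (delta ** 2 / 2 - delta ** 3 / 3)))
    n_prime = min(n_prime, n)                                  # never project upwards
    R = rng.standard_normal((n, n_prime)) / np.sqrt(n_prime)   # projection with scaling
    return X @ R

# toy check: most squared pairwise distances stay within roughly (1 +/- delta)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5000))
Y = jl_random_projection(X, delta=0.3, rng=1)
i, j = np.triu_indices(len(X), k=1)
ratio = ((Y[i] - Y[j]) ** 2).sum(axis=1) / ((X[i] - X[j]) ** 2).sum(axis=1)
print(Y.shape[1], ratio.min(), ratio.max())
```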

Author(s):  
Ping Deng ◽  
Qingkai Ma ◽  
Weili Wu

Clustering can be considered the most important unsupervised learning problem. It has been discussed thoroughly by both the statistics and database communities due to its numerous applications in problems such as classification, machine learning, and data mining. A summary of clustering techniques can be found in (Berkhin, 2002). Most well-known clustering algorithms, such as DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) and CURE (Guha, Rastogi, & Shim, 1998), cluster data points based on all dimensions. As the dimensionality of the space grows, these algorithms lose efficiency and accuracy because of the so-called "curse of dimensionality". It is shown in (Beyer, Goldstein, Ramakrishnan, & Shaft, 1999) that computing distances based on all dimensions is not meaningful in high-dimensional space, since the distance from a point to its nearest neighbor approaches the distance to its farthest neighbor as the dimensionality increases. In fact, natural clusters might exist in subspaces: data points in different clusters may be correlated with respect to different subsets of dimensions. To address this problem, feature selection (Kohavi & Sommerfield, 1995) and dimension reduction (Raymer, Punch, Goodman, Kuhn, & Jain, 2000) have been proposed to find the closely correlated dimensions for all the data and the clusters in those dimensions. Although both methods reduce the dimensionality of the space before clustering, they do not handle well the case where clusters exist in different subspaces of the full dimensionality. Projected clustering has been proposed recently to deal effectively with high dimensionality. Finding clusters and their relevant dimensions is the objective of projected clustering algorithms. Instead of projecting the entire dataset onto the same subspace, projected clustering focuses on finding a specific projection for each cluster such that similarity is preserved as much as possible.
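The nearest-versus-farthest-neighbor observation attributed to Beyer et al. can be illustrated with a few lines of code; this is a generic numeric demonstration, not material from the cited papers, and the point count and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.random((500, dim))                 # uniform points in the unit hypercube
    q = rng.random(dim)                        # a query point
    d = np.linalg.norm(X - q, axis=1)
    # the relative contrast between farthest and nearest neighbour shrinks with dimension
    print(f"dim={dim:5d}  (d_max - d_min) / d_min = {(d.max() - d.min()) / d.min():.3f}")
```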


Author(s):  
Samuel Melton ◽  
Sharad Ramanathan

Abstract Motivation Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions. Results We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace. Availability and implementation https://github.com/smelton/SMD. Supplementary information Supplementary data are available at Bioinformatics online.
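A minimal sketch of the core idea of averaging discriminators over proposed cluster configurations follows. The proposal mechanism (k-means on random feature subsets), the logistic-regression discriminator, and all sizes are placeholders; the actual SMD procedure in the linked repository differs in these choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def score_features(X, n_configs=50, k=5, seed=0):
    """Score each feature by its average absolute weight in linear discriminators
    trained on an ensemble of proposed cluster configurations (a sketch only)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.zeros(p)
    for _ in range(n_configs):
        # propose a cluster configuration from a random subset of features
        cols = rng.choice(p, size=max(2, p // 10), replace=False)
        labels = KMeans(n_clusters=k, n_init=5,
                        random_state=int(rng.integers(1_000_000))).fit_predict(X[:, cols])
        # train a linear discriminator on all features for this configuration
        clf = LogisticRegression(max_iter=1000).fit(X, labels)
        scores += np.abs(clf.coef_).mean(axis=0)
    return scores / n_configs

# features with the highest scores span the low-dimensional subspace
# in which the final clustering is carried out
```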


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate the effect of the first phase, which detects dense regions, on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL can discover clusters in all subspaces with high quality, and its efficiency outperforms that of PROCLUS.
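The dense/sparse-region detection in the first phase can be pictured with a crude per-dimension density filter. This is only a sketch of the general idea, not MOSCL's actual relevance analysis; the histogram binning, thresholds, and voting rule are invented for illustration.

```python
import numpy as np

def filter_sparse_region_points(X, n_bins=20, min_frac=0.02):
    """Flag points that mostly fall into sparse 1-D regions.

    A crude per-dimension density vote: a point is kept if the histogram bin
    it occupies is dense in at least half of the dimensions.
    """
    n, p = X.shape
    dense_votes = np.zeros(n)
    for j in range(p):
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        bins = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        dense_votes += counts[bins] >= min_frac * n
    keep = dense_votes >= p / 2
    return X[keep], np.where(~keep)[0]     # retained points and outlier indices
```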


2019 ◽  
Vol 20 (S19) ◽  
Author(s):  
Thomas A. Geddes ◽  
Taiyun Kim ◽  
Lihao Nan ◽  
James G. Burchfield ◽  
Jean Y. H. Yang ◽  
...  

Abstract Background Single-cell RNA-sequencing (scRNA-seq) is a transformative technology, allowing global transcriptomes of individual cells to be profiled with high accuracy. An essential task in scRNA-seq data analysis is the identification of cell types from complex samples or tissues profiled in an experiment. To this end, clustering has become a key computational technique for grouping cells based on their transcriptome profiles, enabling subsequent cell type identification from each cluster of cells. Due to the high feature-dimensionality of the transcriptome (i.e. the large number of measured genes in each cell) and because only a small fraction of genes are cell type-specific and therefore informative for generating cell type-specific clusters, clustering directly on the original feature/gene dimension may lead to uninformative clusters and hinder correct cell type identification. Results Here, we propose an autoencoder-based cluster ensemble framework in which we first take random subspace projections from the data, then compress each random projection to a low-dimensional space using an autoencoder artificial neural network, and finally apply ensemble clustering across all encoded datasets to generate clusters of cells. We employ four evaluation metrics to benchmark clustering performance and our experiments demonstrate that the proposed autoencoder-based cluster ensemble can lead to substantially improved cell type-specific clusters when applied with both the standard k-means clustering algorithm and a state-of-the-art kernel-based clustering algorithm (SIMLR) designed specifically for scRNA-seq data. Compared to directly using these clustering algorithms on the original datasets, the performance improvement in some cases is up to 100%, depending on the evaluation metric used. Conclusions Our results suggest that the proposed framework can facilitate more accurate cell type identification as well as other downstream analyses. The code for creating the proposed autoencoder-based cluster ensemble framework is freely available from https://github.com/gedcom/scCCESS
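A compact sketch of the three stages described above (random subspace projection, autoencoder compression, ensemble clustering), written with PyTorch and scikit-learn. The network sizes, the co-association consensus step, and all parameters are placeholders rather than the scCCESS settings published at the linked repository.

```python
import numpy as np
import torch
from torch import nn
from sklearn.cluster import KMeans

def encode(X, code_dim=16, epochs=200, lr=1e-3):
    """Compress X with a small fully connected autoencoder; return the code layer."""
    X_t = torch.tensor(X, dtype=torch.float32)
    d = X_t.shape[1]
    enc = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, code_dim))
    dec = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, d))
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(X_t)), X_t)
        loss.backward()
        opt.step()
    return enc(X_t).detach().numpy()

def ensemble_cluster(X, k, n_members=10, sub_frac=0.5, seed=0):
    """Random subspace projections -> autoencoder codes -> consensus clustering."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    member_labels = []
    for _ in range(n_members):
        cols = rng.choice(p, size=int(sub_frac * p), replace=False)  # random subspace
        codes = encode(X[:, cols])                                   # compressed representation
        member_labels.append(KMeans(n_clusters=k, n_init=10).fit_predict(codes))
    L = np.stack(member_labels, axis=1)                              # (n, n_members)
    co = (L[:, None, :] == L[None, :, :]).mean(axis=2)               # co-association matrix
    return KMeans(n_clusters=k, n_init=10).fit_predict(co)           # consensus labels
```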


2021 ◽  
Vol 11 (15) ◽  
pp. 6963
Author(s):  
Jan Y. K. Chan ◽  
Alex Po Leung ◽  
Yunbo Xie

Using random projection, a method to speed up both kernel k-means and centroid initialization with k-means++ is proposed. Motivated by upper error bounds, we approximate the kernel matrix and distances in a lower-dimensional space R^d before the kernel k-means clustering. With random projections, previous work on bounds for dot products and an improved bound for kernel methods are considered for kernel k-means. The complexities of both kernel k-means with Lloyd's algorithm and centroid initialization with k-means++ are known to be O(nkD) and Θ(nkD), respectively, with n the number of data points, D the dimensionality of the input feature vectors, and k the number of clusters. The proposed method reduces the computational complexity of the kernel computation in kernel k-means from O(n²D) to O(n²d), and the subsequent computation for k-means with Lloyd's algorithm and centroid initialization from O(nkD) to O(nkd). Our experiments demonstrate that the speed-up of the clustering method with reduced dimensionality d=200 is 2 to 26 times, with very little performance degradation (less than one percent) in general.
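A sketch of the dimensionality-reduction step only: project the data once with a scaled Gaussian matrix, then evaluate the kernel and run k-means++ in the d-dimensional space. The RBF kernel, the gamma value, and d=200 are placeholder choices, and the kernel k-means iterations on the approximated kernel matrix are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def random_project(X, d=200, seed=0):
    """Gaussian random projection R^D -> R^d; the 1/sqrt(d) scaling keeps
    expected squared norms (and hence dot products) roughly unchanged."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], d)) / np.sqrt(d)
    return X @ R

X = np.random.default_rng(1).standard_normal((1000, 5000))   # n points in D dimensions
Xd = random_project(X, d=200)

# kernel matrix built in the projected space: O(n^2 d) work instead of O(n^2 D)
K = rbf_kernel(Xd, gamma=1.0 / Xd.shape[1])

# k-means++ initialization and Lloyd iterations likewise run on the d-dim data: O(nkd)
labels = KMeans(n_clusters=10, init="k-means++", n_init=10).fit_predict(Xd)
```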


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Baicheng Lyu ◽  
Wenhua Wu ◽  
Zhiqiang Hu

Abstract With the wide application of cluster analysis, the number of clusters being considered is gradually increasing, as is the difficulty of selecting judgment indicators for the number of clusters. Moreover, small clusters are crucial for discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes connections between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the number of parameters to be adjusted to a minimum. Based on the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. Different cutoff distances and cutoff densities are assigned to each data cluster, which results in improved clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city light satellite images.
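For illustration, the local density of a point is commonly defined as the number of neighbours within a cutoff distance; the snippet below computes that quantity. BCALoD goes further and assigns cluster-specific cutoff distances and densities, which this sketch does not attempt.

```python
import numpy as np
from scipy.spatial.distance import cdist

def local_density(X, cutoff):
    """Number of neighbours of each point within `cutoff` (the point itself excluded)."""
    D = cdist(X, X)
    return (D < cutoff).sum(axis=1) - 1
```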


2021 ◽  
Vol 12 (4) ◽  
pp. 169-185
Author(s):  
Saida Ishak Boushaki ◽  
Omar Bendjeghaba ◽  
Nadjet Kamel

Clustering is an important unsupervised analysis technique for big data mining. It finds application in several domains, including the biomedical documents of the MEDLINE database. Document clustering based on metaheuristics is an active research area. However, these algorithms suffer from getting trapped in local optima, require many parameters to be tuned, and the documents must be indexed by a high-dimensionality matrix when the traditional vector space model is used. To overcome these limitations, this paper proposes a new parameter-free document clustering algorithm (ASOS-LSI). It is based on the recent symbiotic organisms search (SOS) metaheuristic and enhanced by an acceleration technique. Furthermore, the documents are represented by semantic indexing based on the well-known latent semantic indexing (LSI). Experiments conducted on well-known biomedical document datasets show the significant superiority of ASOS-LSI over five famous algorithms in terms of compactness, F-measure, purity, misclassified documents, entropy, and runtime.
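The LSI representation step can be sketched as a TF-IDF matrix reduced by truncated SVD; the toy documents and the number of components below are placeholders, and the SOS metaheuristic that subsequently clusters these vectors is not reproduced.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "protein binding assay results",
    "gene expression in tumour cells",
    "randomised clinical trial of a new drug",
]                                                            # placeholder documents

tfidf = TfidfVectorizer().fit_transform(docs)                # high-dimensional term space
lsi_vectors = TruncatedSVD(n_components=2).fit_transform(tfidf)
# `lsi_vectors` is the compact semantic representation handed to the clustering metaheuristic
```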


2021 ◽  
Author(s):  
Faruk Alpak ◽  
Yixuan Wang ◽  
Guohua Gao ◽  
Vivek Jain

Abstract Recently, a novel distributed quasi-Newton (DQN) derivative-free optimization (DFO) method was developed for generic reservoir performance optimization problems, including well-location optimization (WLO) and well-control optimization (WCO). DQN is designed to effectively locate multiple local optima of highly nonlinear optimization problems. However, its performance has neither been validated on realistic applications nor compared to other DFO methods. We have integrated DQN into a versatile field-development optimization platform designed specifically for iterative workflows enabled through distributed-parallel flow simulations. DQN is benchmarked against alternative DFO techniques, namely the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method hybridized with Direct Pattern Search (BFGS-DPS), Mesh Adaptive Direct Search (MADS), Particle Swarm Optimization (PSO), and the Genetic Algorithm (GA). DQN is a multi-thread optimization method that distributes an ensemble of optimization tasks among multiple high-performance-computing nodes. Thus, it can locate multiple optima of the objective function in parallel within a single run. Simulation results computed from one DQN optimization thread are shared with the others by updating a unified set of training data points composed of the responses (implicit variables) of all successful simulation jobs. The sensitivity matrix at the current best solution of each optimization thread is approximated by a linear-interpolation technique using all or a subset of the training data points. The gradient of the objective function is computed analytically using the estimated sensitivities of the implicit variables with respect to the explicit variables. The Hessian matrix is then updated using the quasi-Newton method. A new search point for each thread is obtained by solving a trust-region subproblem for the next iteration. In contrast, other DFO methods rely on a single-thread optimization paradigm that can only locate a single optimum. To locate multiple optima with such methods, one must repeat the same optimization process multiple times, starting from different initial guesses. Moreover, simulation results generated from a single-thread optimization task cannot be shared with other tasks. Benchmarking results are presented for synthetic yet challenging WLO and WCO problems. Finally, the DQN method is field-tested on two realistic applications. DQN identifies the global optimum with the fewest simulations and the shortest run time on a synthetic problem with a known solution. On the other benchmarking problems, which have no known solution, DQN identified comparable local optima with a reasonably smaller number of simulations than the alternative techniques. The field-testing results reinforce the auspicious computational attributes of DQN. Overall, the results indicate that DQN is a novel and effective parallel algorithm for field-scale development optimization problems.
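Some of the ingredients mentioned above can be sketched generically: an affine (linear-interpolation) estimate of the sensitivity matrix from previously simulated points, the chain-rule gradient, and a standard BFGS Hessian update. This is not the DQN implementation; the function and variable names are illustrative assumptions.

```python
import numpy as np

def estimate_sensitivities(X, R, x0):
    """Least-squares affine fit of the responses R (implicit variables) around x0,
    giving an estimate of dR/dx from existing training points X (explicit variables)."""
    A = np.hstack([np.ones((len(X), 1)), X - x0])
    coef, *_ = np.linalg.lstsq(A, R, rcond=None)
    return coef[1:]                               # shape (n_explicit, n_implicit)

def chain_rule_gradient(S, dJ_dR):
    """Gradient of the objective w.r.t. the explicit variables via the sensitivities."""
    return S @ dJ_dR

def bfgs_update(H, s, y):
    """Standard BFGS update of the Hessian approximation from step s and gradient change y."""
    sy = s @ y
    if sy <= 1e-12:
        return H                                  # skip when the curvature condition fails
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / sy
```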


2018 ◽  
Vol 7 (3) ◽  
pp. 581-604 ◽  
Author(s):  
Armin Eftekhari ◽  
Michael B Wakin ◽  
Rachel A Ward

Abstract Leverage scores, loosely speaking, reflect the importance of the rows and columns of a matrix. Ideally, given the leverage scores of a rank-r matrix $M\in \mathbb{R}^{n\times n}$, that matrix can be reliably completed from just $O(rn\log^{2}n)$ samples if the samples are chosen randomly from a non-uniform distribution induced by the leverage scores. In practice, however, the leverage scores are often unknown a priori. As such, the sample complexity in uniform matrix completion (using uniform random sampling) increases to $O(\eta(M)\cdot rn\log^{2}n)$, where $\eta(M)$ is the largest leverage score of $M$. In this paper, we propose a two-phase algorithm called MC2 for matrix completion: in the first phase, the leverage scores are estimated based on uniform random samples, and in the second phase the matrix is resampled non-uniformly based on the estimated leverage scores and then completed. For well-conditioned matrices, the total sample complexity of MC2 is no worse than that of uniform matrix completion, and for certain classes of well-conditioned matrices, namely reasonably coherent matrices whose leverage scores exhibit mild decay, MC2 requires substantially fewer samples. Numerical simulations suggest that the algorithm outperforms uniform matrix completion on a broad class of matrices and, in particular, is much less sensitive to the condition number than our theory currently requires.
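The quantities involved are easy to write down: the leverage scores of a rank-r estimate and a non-uniform sampling distribution over entries induced by them. The exact distribution and the phase-one estimator used by MC2 may differ from this sketch, so the choices below are assumptions.

```python
import numpy as np

def leverage_scores(M_hat, r):
    """Row and column leverage scores of a rank-r matrix estimate M_hat."""
    U, _, Vt = np.linalg.svd(M_hat, full_matrices=False)
    row = np.sum(U[:, :r] ** 2, axis=1)     # ||U_r^T e_i||^2
    col = np.sum(Vt[:r, :] ** 2, axis=0)    # ||V_r^T e_j||^2
    return row, col

def entry_sampling_probabilities(row, col):
    """One standard non-uniform distribution over matrix entries induced by the scores."""
    P = np.add.outer(row, col)
    return P / P.sum()
```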


1997 ◽  
Vol 07 (08) ◽  
pp. 1791-1809 ◽  
Author(s):  
Fawad Rauf ◽  
Hassan M. Ahmed

We present a new approach to nonlinear adaptive filtering based on Successive Linearization. Our approach provides a simple, modular, and unified implementation for a broad class of polynomial filters. We refer to this implementation as the layered structure and note that it offers substantial computational efficiency over previous methods. A new class of Polynomial Autoregressive filters is introduced which can model limit-cycle and chaotic dynamics. Existing geometric methods for modeling and characterizing chaotic processes suffer from several drawbacks: they require a huge number of data points to reconstruct the attractor geometry, and their performance is severely limited by noisy experimental measurements. We present a new method for processing chaotic signals using nonlinear adaptive filters and demonstrate the modeling, prediction, and filtering of these signals. We also show how the growth rate of the prediction error can be used to estimate the effective Lyapunov exponent of the chaotic map. Our approach requires orders of magnitude fewer data points and is robust to noise in the experimental data. Although reconstruction of the attractor geometry is unnecessary, the adaptive filter contains most of the geometric information.
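As a generic illustration of a polynomial adaptive predictor applied to a chaotic signal (not the layered Successive Linearization structure of the paper), the sketch below adapts a second-order polynomial filter with the LMS rule and predicts a logistic-map series one step ahead; the tap count, step size, and map parameter are arbitrary.

```python
import numpy as np

def polynomial_lms_predict(x, taps=3, mu=0.05):
    """One-step-ahead prediction with a second-order (quadratic) polynomial
    filter over the last `taps` samples, adapted by the LMS rule."""
    preds = np.zeros_like(x)
    w = None
    for t in range(taps, len(x)):
        u = x[t - taps:t]
        # linear terms plus quadratic cross-terms of the tap vector
        phi = np.concatenate([u, np.outer(u, u)[np.triu_indices(taps)]])
        if w is None:
            w = np.zeros_like(phi)
        preds[t] = w @ phi
        w += mu * (x[t] - preds[t]) * phi          # LMS weight update
    return preds

# chaotic test signal: the logistic map x_{t+1} = 3.9 * x_t * (1 - x_t)
series = np.empty(2000)
series[0] = 0.4
for t in range(1, len(series)):
    series[t] = 3.9 * series[t - 1] * (1 - series[t - 1])
pred = polynomial_lms_predict(series)
print(np.mean((pred[1000:] - series[1000:]) ** 2))  # prediction error after adaptation
```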

