Dimensionality reduction for k-distance applied to persistent homology

Author(s):  
Shreya Arya ◽  
Jean-Daniel Boissonnat ◽  
Kunal Dutta ◽  
Martin Lotz

Abstract: Given a set P of n points and a constant k, we are interested in computing the persistent homology of the Čech filtration of P for the k-distance, and we investigate the effectiveness of dimensionality reduction for this problem, answering an open question of Sheehy (The persistent homology of distance functions under random projection. In Cheng, Devillers (eds), 30th Annual Symposium on Computational Geometry, SOCG’14, Kyoto, Japan, June 08–11, p 328, ACM, 2014). We show that any linear transformation that preserves pairwise distances up to a $(1\pm\varepsilon)$ multiplicative factor must preserve the persistent homology of the Čech filtration up to a factor of $(1-\varepsilon)^{-1}$. Our results also show that the Vietoris–Rips and Delaunay filtrations for the k-distance, as well as the Čech filtration for the approximate k-distance of Buchet et al. [J Comput Geom, 58:70–96, 2016], are preserved up to a $(1\pm\varepsilon)$ factor. We also prove extensions of our main theorem for point sets (i) lying in a region of bounded Gaussian width or (ii) on a low-dimensional submanifold, obtaining embeddings with the dimension bounds of Lotz (Proc R Soc A Math Phys Eng Sci, 475(2230):20190081, 2019) and Clarkson (Tighter bounds for random projections of manifolds. In Teillaud (ed), Proceedings of the 24th ACM Symposium on Computational Geometry, College Park, MD, USA, June 9–11, pp 39–48, ACM, 2008), respectively. Our results also hold in the terminal dimensionality reduction setting, where the distance from any point in the original ambient space to any point in P needs to be approximately preserved.
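As a rough illustration of the distance-preservation hypothesis in this result (not the authors' construction), the following Python sketch applies a Gaussian random projection to a synthetic point set and checks how far pairwise distances move; the dimensions and ε are arbitrary illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Minimal sketch of the Johnson-Lindenstrauss-style hypothesis the paper builds on:
# a Gaussian random projection preserves pairwise distances up to (1 +/- eps) with
# high probability, and hence (by the paper's result) the Cech persistence up to
# (1 - eps)^{-1}. All sizes below are illustrative, not values from the paper.

rng = np.random.default_rng(0)
n, d, eps = 200, 2000, 0.5
target_dim = int(np.ceil(8 * np.log(n) / eps**2))   # standard JL-style target dimension

P = rng.normal(size=(n, d))                                  # point set P in high dimension
G = rng.normal(size=(d, target_dim)) / np.sqrt(target_dim)   # random linear map
Q = P @ G                                                    # projected points

ratio = pdist(Q) / pdist(P)
print(f"target_dim={target_dim}, distance ratios in [{ratio.min():.3f}, {ratio.max():.3f}]")
```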

Author(s):  
Ewa Skubalska-Rafajłowicz

Random Projection RBF Nets for Multidimensional Density Estimation

The dimensionality and the amount of data that need to be processed when intensive data streams are observed grow rapidly with the development of sensor arrays, CCD and CMOS cameras, and other devices. The aim of this paper is to propose an approach to dimensionality reduction as a first stage of training RBF nets. As a vehicle for presenting the ideas, the problem of estimating multivariate probability densities is chosen. The linear projection method is briefly surveyed. Using random projections as the first (additional) layer, we are able to reduce the dimensionality of the input data. Bounds on the accuracy of RBF nets equipped with a random projection layer, in comparison to RBF nets without dimensionality reduction, are established. Finally, the results of simulations concerning multidimensional density estimation are briefly reported.
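The sketch below illustrates the idea of placing a random projection layer in front of an RBF-style density estimator, under invented dimensions and bandwidth; it uses a plain Gaussian-kernel expansion in the projected space rather than the trained RBF net of the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's exact network): estimate a multivariate
# density with a Gaussian-kernel RBF expansion after a random projection layer.
# Centres, bandwidth, and dimensions are arbitrary choices for demonstration.

rng = np.random.default_rng(1)
d, k, n = 100, 5, 2000          # ambient dim, projected dim, sample size
X = rng.normal(size=(n, d))     # stand-in for high-dimensional stream data

R = rng.normal(size=(d, k)) / np.sqrt(k)   # random projection layer
Z = X @ R                                   # reduced-dimension inputs

centres = Z[rng.choice(n, size=50, replace=False)]  # RBF centres picked from the data
h = 0.5                                             # kernel bandwidth (illustrative)

def density(z):
    """RBF-style density estimate in the projected space."""
    sq = np.sum((centres - z) ** 2, axis=1)
    weights = np.exp(-sq / (2 * h**2)) / ((2 * np.pi * h**2) ** (k / 2))
    return weights.mean()

print("estimated density at the origin:", density(np.zeros(k)))
```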


Author(s):  
Ata Kabán

Learning from high-dimensional data is challenging in general; however, the data are often not truly high dimensional, in the sense that they may have some hidden low-complexity geometry. We give new, user-friendly PAC bounds that are able to take advantage of such benign geometry to reduce the dimension dependence of error guarantees in settings where such dependence is known to be essential in general. This is achieved by employing random projection as an analytic tool and exploiting its structure-preserving compression ability. We introduce an auxiliary function class that operates on reduced-dimensional inputs, and a new complexity term, defined as the distortion of the loss under random projections. The latter is a hypothesis-dependent data complexity whose analytic estimates turn out to recover various regularisation schemes in parametric models and, in the case of the nearest neighbour rule, a notion of intrinsic dimension quantified by the Gaussian width of the input support. If benign geometry is present, the bounds become tighter; otherwise they recover the original dimension-dependent bounds.
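The notion of intrinsic dimension referenced above is the Gaussian width of the input support, $w(T) = \mathbb{E}_g\big[\sup_{x\in T}\langle g, x\rangle\big]$ with $g$ standard Gaussian. A crude Monte Carlo estimate on a synthetic set with hidden low-complexity geometry (sizes chosen only for illustration) looks like this:

```python
import numpy as np

# Rough Monte Carlo estimate of the Gaussian width w(T) = E_g[ sup_{x in T} <g, x> ],
# the intrinsic-dimension measure referenced in the abstract. T is a finite sample
# lying in a low-dimensional subspace of a high-dimensional space; all sizes here
# are illustrative only.

rng = np.random.default_rng(2)
d, intrinsic, n = 200, 3, 500
basis = rng.normal(size=(intrinsic, d))
T = rng.normal(size=(n, intrinsic)) @ basis       # points with hidden low-complexity geometry
T /= np.linalg.norm(T, axis=1, keepdims=True)     # normalise to the unit sphere

def gaussian_width(points, trials=2000):
    g = rng.normal(size=(trials, points.shape[1]))
    return (g @ points.T).max(axis=1).mean()      # average over g of sup_x <g, x>

print("estimated Gaussian width:", gaussian_width(T))
print("sqrt(ambient dim) for comparison:", np.sqrt(d))
```

A small Gaussian width relative to the square root of the ambient dimension is exactly the kind of benign geometry under which the bounds tighten.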


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Joshua T. Vogelstein ◽  
Eric W. Bridgeford ◽  
Minh Tang ◽  
Da Zheng ◽  
Christopher Douville ◽  
...  

Abstract: To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees. We introduce an approach to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-Rank Projection, incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that Linear Optimal Low-Rank Projection and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of more than 150 million features, and several genomics datasets with more than 500,000 features, Linear Optimal Low-Rank Projection outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while only requiring a few minutes on a standard desktop computer.
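A minimal sketch of the underlying idea, on synthetic data with arbitrary dimensions (this is not the authors' implementation): combine the class-conditional mean difference with the top principal directions to form the low-dimensional projection, so a class-discriminating direction is kept even when it carries little total variance.

```python
import numpy as np

# Minimal sketch of the idea behind Linear Optimal Low-Rank Projection:
# augment the projection with the class-conditional mean difference before the
# top principal directions. Dimensions and data below are illustrative.

rng = np.random.default_rng(3)
n, d, r = 300, 50, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[y == 1, 0] += 2.0                       # class signal in a low-variance direction

mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
delta = (mu1 - mu0)[:, None]              # class-conditional mean difference

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
A = np.hstack([delta, Vt[: r - 1].T])     # mean-difference column + top PCA directions
Q, _ = np.linalg.qr(A)                    # orthonormalise the combined projection

Z = X @ Q                                 # r-dimensional representation for classification
print("projected shape:", Z.shape)
```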


2021 ◽  
Vol 11 (3) ◽  
pp. 1013
Author(s):  
Zvezdan Lončarević ◽  
Rok Pahič ◽  
Aleš Ude ◽  
Andrej Gams

Autonomous robot learning in unstructured environments often faces the problem that the dimensionality of the search space is too large for practical applications. Dimensionality reduction techniques have been developed to address this problem and describe motor skills in low-dimensional latent spaces. Most of these techniques require the availability of a sufficiently large database of example task executions to compute the latent space. However, the generation of many example task executions on a real robot is tedious and prone to errors and equipment failures. The main result of this paper is a new approach for efficient database gathering: a small number of task executions are performed with a real robot, and statistical generalization, e.g., Gaussian process regression, is applied to generate more data. We have shown in our experiments that the data generated this way can be used for dimensionality reduction with autoencoder neural networks. The resulting latent spaces can be exploited to implement robot learning more efficiently. The proposed approach has been evaluated on the problem of robotic throwing at a target. Simulation and real-world results with the humanoid robot TALOS are provided. They confirm the effectiveness of generalization-based database acquisition and the efficiency of learning in a low-dimensional latent space.
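A hypothetical sketch of the generalization-based database acquisition step, with synthetic stand-ins for the task queries and motor parameters: a Gaussian process regressor fitted on a few real executions is queried densely to produce the larger database that the autoencoder later compresses.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical sketch of generalization-based database gathering: a few real
# executions map task queries (e.g. target distance) to motor parameters, a
# Gaussian process regressor generalizes over queries, and the densely sampled
# outputs form the database later compressed by an autoencoder. All quantities
# here are synthetic stand-ins, not the paper's robot data.

rng = np.random.default_rng(4)
real_queries = np.linspace(0.5, 2.0, 8)[:, None]             # 8 real throws at different targets
real_params = np.column_stack([np.sin(real_queries[:, 0]),   # stand-in motor parameters
                               real_queries[:, 0] ** 2]) + 0.01 * rng.normal(size=(8, 2))

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-4)
gpr.fit(real_queries, real_params)

dense_queries = np.linspace(0.5, 2.0, 200)[:, None]          # many generated examples
database = gpr.predict(dense_queries)                         # synthetic database for the autoencoder
print("generated database shape:", database.shape)
```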


2020 ◽  
Author(s):  
Aristidis G. Vrahatis ◽  
Sotiris Tasoulis ◽  
Spiros Georgakopoulos ◽  
Vassilis Plagianakos

Abstract: Nowadays, biomedical data are generated at an exponential rate, creating datasets of ultra-high dimensionality and complexity for analysis. This revolution, driven by recent advances in biotechnology, has led to big-data and data-driven computational approaches. An indicative example is the emerging single-cell RNA-sequencing (scRNA-seq) technology, which isolates and measures individual cells. Although scRNA-seq has revolutionized the biotechnology domain, the computational analysis of such data remains a major challenge because of its ultra-high dimensionality and complexity. Following this direction, in this work we study the properties, effectiveness and generalization of the recently proposed MRPV algorithm for single-cell RNA-seq data. MRPV is an ensemble classification technique utilizing multiple ultra-low-dimensional randomly projected spaces. A given classifier determines the class of each sample in each independent space, while a majority voting scheme defines the predominant class. We show that random projection ensembles offer a platform not only for low computational cost but also for enhanced classification performance. The developed methodologies were applied to four real high-dimensional biomedical datasets from single-cell RNA-seq studies and compared against well-known, similar classification tools. Experimental results showed that, using simple tools, we can create a computationally fast, simple, yet effective approach for single-cell RNA-seq data with ultra-high dimensionality.
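A hedged sketch of the random-projection ensemble idea, using synthetic data and a k-NN base learner rather than the paper's exact configuration: the same classifier is trained in several ultra-low-dimensional randomly projected spaces and the per-space predictions are combined by majority vote.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Sketch of the random-projection ensemble idea behind MRPV: train a base
# classifier in several ultra-low-dimensional randomly projected spaces and
# combine predictions by majority vote. Data, projection count, and the k-NN
# base learner are illustrative choices, not the paper's exact setup.

rng = np.random.default_rng(5)
n, d, k, n_proj = 400, 2000, 10, 15          # samples, genes, projected dim, ensemble size
X = rng.normal(size=(n, d))
y = (X[:, :5].sum(axis=1) > 0).astype(int)   # synthetic labels driven by a few features

X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

votes = np.zeros((len(X_test), 2))
for _ in range(n_proj):
    R = rng.normal(size=(d, k)) / np.sqrt(k)          # ultra-low-dimensional random projection
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train @ R, y_train)
    pred = clf.predict(X_test @ R)
    votes[np.arange(len(pred)), pred] += 1            # accumulate the vote of this space

majority = votes.argmax(axis=1)
print("ensemble accuracy:", (majority == y_test).mean())
```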


2020 ◽  
Vol 49 (3) ◽  
pp. 421-437
Author(s):  
Genggeng Liu ◽  
Lin Xie ◽  
Chi-Hua Chen

Dimensionality reduction plays an important role in the data processing of machine learning and data mining, making the processing of high-dimensional data more efficient. Dimensionality reduction extracts a low-dimensional feature representation of high-dimensional data; an effective dimensionality reduction method not only extracts most of the useful information of the original data, but also removes useless noise. Dimensionality reduction methods can be applied to all types of data, especially image data. Although supervised learning methods have achieved good results when applied to dimensionality reduction, their performance depends on the number of labeled training samples. With the growth of information from the Internet, labeling data requires more resources and is more difficult. Therefore, using unsupervised learning to learn data features has extremely important research value. In this paper, an unsupervised multilayered variational auto-encoder model is studied on text data, so that mapping high-dimensional features to low-dimensional features becomes efficient and the low-dimensional features retain as much of the main information as possible. Low-dimensional features obtained by different dimensionality reduction methods are compared with the dimensionality reduction results of the variational auto-encoder (VAE), and the method shows significant improvement over the other comparison methods.
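The following Python (PyTorch) sketch shows the core mechanics of a variational auto-encoder used for dimensionality reduction, with illustrative layer sizes; the paper's model is multilayered and tuned for text, so this is only a simplified stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified sketch of a variational auto-encoder for dimensionality reduction:
# the encoder outputs a mean and log-variance, the reparameterisation trick
# samples a low-dimensional code, and the loss adds a KL term to the
# reconstruction error. Layer sizes are illustrative, not the paper's.

class VAE(nn.Module):
    def __init__(self, d_in=1000, d_hidden=128, d_latent=16):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.mu = nn.Linear(d_hidden, d_latent)
        self.logvar = nn.Linear(d_hidden, d_latent)
        self.dec = nn.Sequential(nn.Linear(d_latent, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_in))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    rec = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    return rec + kl

model = VAE()
x = torch.randn(32, 1000)                 # stand-in for high-dimensional text feature vectors
recon, mu, logvar = model(x)
print("loss:", vae_loss(x, recon, mu, logvar).item())
```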


Biostatistics ◽  
2021 ◽  
Author(s):  
Theresa A Alexander ◽  
Rafael A Irizarry ◽  
Héctor Corrada Bravo

Summary: High-dimensional biological data collection across heterogeneous groups of samples has become increasingly common, creating high demand for dimensionality reduction techniques that capture the underlying structure of the data. Discovering low-dimensional embeddings that describe the separation of any underlying discrete latent structure in data is an important motivation for applying these techniques, since these latent classes can represent important sources of unwanted variability, such as batch effects, or interesting sources of signal, such as unknown cell types. The features that define this discrete latent structure are often hard to identify in high-dimensional data. Principal component analysis (PCA) is one of the most widely used methods as an unsupervised step for dimensionality reduction. This reduction technique finds linear transformations of the data which explain total variance. When the goal is detecting discrete structure, PCA is applied with the assumption that classes will be separated in directions of maximum variance. However, PCA will fail to accurately find discrete latent structure if this assumption does not hold. Visualization techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), attempt to mitigate these problems with PCA by creating a low-dimensional space where similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. However, since t-SNE and UMAP are computationally expensive, a PCA reduction is often performed before applying them, which makes the result sensitive to PCA's shortcomings. Also, t-SNE is limited to only two or three dimensions as a visualization tool, which may not be adequate for retaining discriminatory information. For interpretable feature weights, the linear transformations of PCA are preferable to the non-linear transformations provided by methods like t-SNE and UMAP. Here, we propose iterative discriminant analysis (iDA), a dimensionality reduction technique designed to mitigate these limitations. iDA produces an embedding that carries the discriminatory information which optimally separates latent clusters, using linear transformations that permit post hoc analysis to determine the features that define these latent structures.
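The failure mode described above can be shown in a few lines of Python (this illustrates the PCA assumption, not the iDA algorithm itself): when classes are separated along a low-variance direction, the top principal component misses them while a simple discriminant direction does not.

```python
import numpy as np

# Illustrative sketch of the PCA failure mode the abstract describes: two latent
# classes are separated along a low-variance direction, so the top principal
# component ignores the class structure while a discriminant direction (here,
# the difference of class means) captures it. Not the iDA algorithm.

rng = np.random.default_rng(6)
n = 500
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(scale=10.0, size=n),        # high-variance, class-irrelevant direction
    y + rng.normal(scale=0.3, size=n),     # low-variance direction carrying the classes
])

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]                           # projection onto the top principal component

disc = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
disc_proj = Xc @ (disc / np.linalg.norm(disc))   # simple discriminant projection

def separation(scores):
    return abs(scores[y == 1].mean() - scores[y == 0].mean()) / scores.std()

print("class separation along PC1:             ", round(separation(pc1), 3))
print("class separation along discriminant dir:", round(separation(disc_proj), 3))
```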


2019 ◽  
Vol 11 (1) ◽  
pp. 168781401881917
Author(s):  
Fang Lv ◽  
Yuliang Wei ◽  
Xixian Han ◽  
Bailing Wang

With the explosive growth of surveillance data, exact match queries become much more difficult because of the data's high dimensionality and volume. Owing to its good balance between retrieval performance and computational cost, hash learning is widely used to solve approximate nearest neighbor search problems. Dimensionality reduction plays a critical role in hash learning, as its goal is to preserve as much of the original information as possible in low-dimensional vectors. However, existing dimensionality reduction methods neglect to unify diverse resources in the original space when learning a downsized subspace. In this article, we propose a numeric and semantic consistency semi-supervised hash learning method, which unifies numeric features and supervised semantic features into a low-dimensional subspace before hash encoding, and which improves a multiple-table hash method with a complementary numeric local distribution structure. A consistency-based learning method, which confers semantic meaning on numeric features during dimensionality reduction, is presented. The experiments are conducted on two public datasets: the web image dataset NUS-WIDE and the text dataset DBLP. Experimental results demonstrate that the semi-supervised hash learning method, with the consistency-based information subspace, is more effective in preserving useful information for hash encoding than state-of-the-art methods and achieves high-quality retrieval performance in a multi-table context.
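As a simplified stand-in for this pipeline (the learned projection and hash functions are replaced by random ones), the following sketch reduces dimensionality linearly, binarises with random hyperplanes, and ranks candidates by Hamming distance:

```python
import numpy as np

# Simplified stand-in for the hash-learning pipeline (not the paper's learned
# hash functions): reduce dimensionality with a linear projection, binarise with
# random hyperplanes to obtain compact hash codes, and rank candidates for an
# approximate nearest-neighbour query by Hamming distance.

rng = np.random.default_rng(7)
n, d, k, bits = 1000, 512, 32, 16
X = rng.normal(size=(n, d))                       # database feature vectors

W = rng.normal(size=(d, k)) / np.sqrt(k)          # low-dimensional subspace (random here)
H = rng.normal(size=(k, bits))                    # random hyperplanes for hashing

codes = (X @ W @ H) > 0                           # boolean hash codes, shape (n, bits)

query = X[0] + 0.05 * rng.normal(size=d)          # a slightly perturbed database item
q_code = (query @ W @ H) > 0

hamming = (codes != q_code).sum(axis=1)
print("closest items by Hamming distance:", np.argsort(hamming)[:5])
```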

