Computing Dense Cubes Embedded in Sparse Data

Author(s):  
Lixin Fu

In high-dimensional data sets, both the number of dimensions and the cardinalities of the dimensions are large, and the data is often very sparse, that is, most cube cells are empty. For such large data sets, computing the aggregation of a measure over arbitrary combinations of dimensions efficiently is a well-known, challenging problem. In real-world applications, however, users are usually not interested in all the sparse cube cells, most of which are empty or contain only one or a few tuples. Instead, they focus on the "big picture": the highly aggregated data, where the WHERE clauses of the SQL queries involve only a few dimensions. Although the input data set is sparse, this aggregate data is dense. The existing multi-pass, full-cube computation algorithms are prohibitively slow for this type of application involving very large input data sets. We propose a new dynamic data structure called the Restricted Sparse Statistics Tree (RSST) and a novel cube evaluation algorithm, which are especially well suited for efficiently computing dense sub-cubes embedded in high-dimensional sparse data sets. RSST computes only the aggregations of non-empty cube cells whose number of non-star coordinates (i.e., the number of GROUP BY attributes) is restricted to no more than a user-specified threshold. Our algorithms are scalable and I/O efficient. RSST is incrementally maintainable, which makes it suitable for data warehousing and the analysis of streaming data. We have compared our algorithms with state-of-the-art cube computation algorithms such as Dwarf and QCT in terms of construction time, query response time, and data compression. Experiments demonstrate the excellent performance and good scalability of our approach.
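The restriction RSST enforces can be illustrated without the tree structure itself. Below is a minimal Python sketch (not the authors' RSST implementation; the dictionary-based storage and all names are illustrative) that aggregates a measure over every GROUP BY combination of at most `max_dims` dimensions, so only cells with at most that many non-star coordinates are ever materialized:

```python
from itertools import combinations
from collections import defaultdict

def restricted_cube(rows, num_dims, max_dims):
    """Aggregate (SUM) a measure for every cube cell with at most
    `max_dims` non-star coordinates. Only non-empty cells are stored."""
    cube = defaultdict(float)
    for *coords, measure in rows:
        # Enumerate all GROUP BY attribute subsets of size <= max_dims.
        for k in range(max_dims + 1):
            for dims in combinations(range(num_dims), k):
                # A cell is keyed by its group-by dims and their values;
                # all other coordinates are implicitly '*' (ALL).
                key = (dims, tuple(coords[d] for d in dims))
                cube[key] += measure
    return cube

# Example: 3 dimensions, measure in the last position.
rows = [(1, 5, 2, 10.0), (1, 7, 2, 4.0), (3, 5, 9, 1.0)]
cube = restricted_cube(rows, num_dims=3, max_dims=2)
# SUM over all rows with dim0 == 1, the other dims starred:
print(cube[((0,), (1,))])  # 14.0
```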

2020, Vol 10 (1)
Author(s):  
Michele Allegra, Elena Facco, Francesco Denti, Alessandro Laio, Antonietta Mira

One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be used proficiently even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular-dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
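As a flavor of what a pointwise ID estimate looks like, here is a Python sketch of a local two-nearest-neighbour estimator in the spirit of the TWO-NN method of Facco et al.; it is a simplified illustration, not the authors' full Bayesian segmentation procedure, and the neighbourhood size and test data are illustrative assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree

def local_twonn_id(X, k=20):
    """Pointwise intrinsic-dimension estimate: apply the TWO-NN
    maximum-likelihood formula to the mu = r2/r1 ratios within each
    point's k-nearest-neighbour neighbourhood."""
    tree = cKDTree(X)
    # Distances to the two nearest neighbours of every point (col 0 is self).
    dists, _ = tree.query(X, k=3)
    log_mu = np.log(dists[:, 2] / dists[:, 1])     # log(r2 / r1) per point
    # Neighbourhoods over which the likelihood estimate is evaluated.
    _, nbrs = tree.query(X, k=k + 1)
    # d_hat = n / sum(log mu) over each point's neighbourhood.
    return (k + 1) / log_mu[nbrs].sum(axis=1)

# Two regions of different ID embedded in the same 5-dimensional space:
rng = np.random.default_rng(0)
line = np.c_[rng.normal(size=(500, 1)), np.zeros((500, 4))]   # local ID ~ 1
ball = rng.normal(size=(500, 5)) + 10.0                       # local ID ~ 5
ids = local_twonn_id(np.vstack([line, ball]))
print(ids[:500].mean(), ids[500:].mean())   # low vs. high local ID
```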


Author(s):  
Justin L. Balsor, David G. Jones, Kathryn M. Murphy

Many neural mechanisms regulate experience-dependent plasticity in the visual cortex (V1), and new techniques for quantifying large numbers of proteins or genes are bringing the study of plasticity into the era of big data. With those large data sets comes the challenge of extracting biologically meaningful results about visual plasticity from data-driven analytical methods designed for high-dimensional data. In other areas of neuroscience, high-information-content methodologies are revealing more subtle aspects of neural development and the individual variations that give rise to a richer picture of brain disorders. We have developed an approach for studying V1 plasticity that takes advantage of the known functions of many synaptic proteins in regulating visual plasticity and uses them to rebrand the results of high-dimensional analyses as a plasticity phenotype. Here we provide a primer for analyzing experience-dependent plasticity in V1, using example R code to identify high-dimensional changes in a group of proteins. We describe how to use PCA to classify high-dimensional plasticity features and how to combine them into a plasticity phenotype. In the examples, we show how the plasticity phenotype can be visualized and used to identify neurobiological features in V1 that change during development or after different visual rearing conditions. We include an R package, "v1hdexplorer", that aggregates the various coding packages and custom visualization scripts written in RStudio.
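The paper's code is in R; for consistency with the other examples on this page, the following Python sketch (scikit-learn) shows the generic PCA step the authors describe: reducing a protein-expression matrix to a few components and inspecting which proteins load on each. The protein names and random data are illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
proteins = ["GluA2", "GluN1", "GluN2A", "GluN2B", "Gephyrin", "PSD95", "Synapsin"]
X = rng.normal(size=(40, len(proteins)))     # 40 tissue samples (illustrative)

# Standardize, then keep the few components explaining most variance.
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(Z)
scores = pca.transform(Z)                    # per-sample plasticity features
print("feature matrix:", scores.shape)

# Proteins loading strongly on each component hint at its biological meaning
# and become candidate elements of a "plasticity phenotype".
for i, comp in enumerate(pca.components_):
    top = [proteins[j] for j in np.argsort(-np.abs(comp))[:3]]
    print(f"PC{i+1} ({pca.explained_variance_ratio_[i]:.0%} var): {top}")
```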


2017, Vol 10 (13), pp. 355
Author(s):  
Reshma Remesh, Pattabiraman. V

Dimensionality reduction techniques are used to reduce the complexity of analyzing high-dimensional data sets. The raw input data may have many dimensions, and analysis can be time-consuming and lead to wrong predictions if unnecessary attributes are considered. Using dimensionality reduction techniques, one can reduce the dimensionality of the input data and move toward accurate prediction at lower cost. In this paper, the different machine learning approaches to dimensionality reduction, such as PCA, SVD, LDA, kernel principal component analysis (KPCA), and artificial neural networks (ANN), are studied.
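A brief scikit-learn sketch contrasting most of the surveyed techniques (illustrative usage on a standard data set, not code from the paper; the ANN-based approach, e.g. an autoencoder, is omitted for brevity):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, KernelPCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)          # 64-dimensional inputs

# Unsupervised: PCA and truncated SVD project onto directions of max variance.
X_pca = PCA(n_components=10).fit_transform(X)
X_svd = TruncatedSVD(n_components=10).fit_transform(X)

# Kernel PCA captures non-linear structure via an implicit feature map.
X_kpca = KernelPCA(n_components=10, kernel="rbf").fit_transform(X)

# Supervised: LDA uses class labels, yielding at most (n_classes - 1) axes.
X_lda = LinearDiscriminantAnalysis(n_components=9).fit_transform(X, y)

for name, Z in [("PCA", X_pca), ("SVD", X_svd), ("KPCA", X_kpca), ("LDA", X_lda)]:
    print(name, X.shape, "->", Z.shape)
```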


2002, Vol 1 (1), pp. 20-34
Author(s):  
Daniel A. Keim, Ming C. Hao, Umesh Dayal, Meichun Hsu

Simple presentation graphics are intuitive and easy to use, but they either show only highly aggregated data, presenting a very small number of data values (as in bar charts), or suffer from a high degree of overlap that occludes a significant portion of the data values (as in x-y plots). In this article, the authors therefore propose a generalization of traditional bar charts and x-y plots that allows the visualization of large amounts of data. The basic idea is to use the pixels within the bars to present detailed information about the data records. These so-called pixel bar charts retain the intuitiveness of traditional bar charts while allowing very large data sets to be visualized in an effective way. It is shown that effective pixel placement requires solving a complex optimization problem; the authors then present an algorithm that solves it efficiently. The application to a number of real-world e-commerce data sets shows the wide applicability and usefulness of this new idea, and a comparison to other well-known visualization techniques (parallel coordinates and spiral techniques) shows a number of clear advantages.
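The basic idea, one colored pixel per record stacked inside each bar, can be sketched in a few lines of Python with matplotlib. This is a simplified illustration with synthetic data: records are merely sorted within each bar, standing in for the paper's optimized pixel-placement algorithm:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Synthetic e-commerce-style records: a category and a value per record.
category = rng.integers(0, 4, size=2000)
value = rng.gamma(2.0, 50.0, size=2000)

width = 12                                   # pixel columns per bar
fig, axes = plt.subplots(1, 4, figsize=(6, 4), sharey=True)
for c, ax in enumerate(axes):
    v = np.sort(value[category == c])        # simple ordering stands in for
                                             # the paper's optimized placement
    height = int(np.ceil(len(v) / width))
    img = np.full(height * width, np.nan)    # NaN pixels render as background
    img[: len(v)] = v                        # one pixel per record
    ax.imshow(img.reshape(height, width), origin="lower", aspect="auto",
              cmap="viridis")
    ax.set_title(f"cat {c}")                 # bar height ~ number of records
    ax.set_xticks([])
plt.show()
```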


2019, Vol 19 (1), pp. 1-4
Author(s):  
Ivan Gavrilyuk, Boris N. Khoromskij

The most important computational problems today are those related to the processing of large data sets and to the numerical solution of high-dimensional integro-differential equations. These problems arise in numerical modeling in quantum chemistry, materials science, and multiparticle dynamics, as well as in machine learning, computer simulation of stochastic processes, and many other applications related to big data analysis. Modern tensor numerical methods enable the solution of multidimensional partial differential equations (PDEs) in $\mathbb{R}^{d}$ by reducing them to one-dimensional calculations. They thus avoid the so-called "curse of dimensionality", i.e., the exponential growth of computational complexity with the dimension d, in the numerical solution of high-dimensional problems. At present, both tensor numerical methods and the multilinear algebra of big data continue to expand actively into further theoretical and applied research topics. This issue of CMAM is devoted to recent developments in the theory of tensor numerical methods and their applications in scientific computing and data analysis. Current activities in this emerging field, on the effective numerical modeling of temporal and stationary multidimensional PDEs and beyond, are presented in the following ten articles, and some future trends are highlighted therein.
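To make the dimensionality-reduction idea concrete, here is a minimal Python sketch (a generic illustration, not drawn from any article in the issue) of the simplest tensor format, the rank-1 canonical representation: for a separable function, a sum over an n^d grid factorizes into d one-dimensional sums, so the cost drops from n^d to O(dn):

```python
import numpy as np

# f(x1,...,xd) = prod_k g_k(x_k) is separable, so its values on an n^d grid
# form a rank-1 tensor determined by d factor vectors of length n each.
d, n = 10, 50
grid = np.linspace(0.0, 1.0, n)
factors = [np.exp(-(k + 1) * grid**2) for k in range(d)]   # g_k on the grid

# Summing f over the full grid would need n**d = 50**10 terms. In rank-1
# format the sum factorizes into a product of d one-dimensional sums: O(d*n).
total = np.prod([f.sum() for f in factors])
print(total)

# Sanity check on a small case where the full tensor fits in memory:
small = [f[:4] for f in factors[:3]]
full = np.einsum("i,j,k->ijk", *small)                     # explicit 4^3 tensor
assert np.isclose(full.sum(), np.prod([f.sum() for f in small]))
```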

