Computing Dense Cubes Embedded in Sparse Data

Author(s):  
Lixin Fu

In high-dimensional data sets, both the number of dimensions and the cardinalities of the dimensions are large, and the data is often very sparse, that is, most cube cells are empty. For such large data sets, computing the aggregation of a measure over arbitrary combinations of dimensions efficiently is a well-known, challenging problem. In real-world applications, however, users are usually not interested in all the sparse cube cells, most of which are empty or contain only one or a few tuples. Instead, they focus on the "big picture": the highly aggregated data, where the WHERE clauses of the SQL queries involve only a few dimensions. Although the input data set is sparse, this aggregate data is dense. The existing multi-pass, full-cube computation algorithms are prohibitively slow for this type of application involving very large input data sets. We propose a new dynamic data structure called the Restricted Sparse Statistics Tree (RSST) and a novel cube evaluation algorithm, which are especially well suited for efficiently computing dense sub-cubes embedded in high-dimensional sparse data sets. RSST computes only the aggregations of non-empty cube cells whose number of non-star coordinates (i.e., the number of GROUP BY attributes) is restricted to no more than a user-specified threshold. Our algorithms are scalable and I/O efficient. RSST is incrementally maintainable, which makes it suitable for data warehousing and the analysis of streaming data. We have compared our algorithms with state-of-the-art cube computation algorithms such as Dwarf and QCT in terms of construction time, query response time, and data compression. Experiments demonstrate the excellent performance and good scalability of our approach.
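The restriction RSST enforces can be illustrated without the tree structure itself. Below is a minimal Python sketch (not the authors' RSST implementation; the dictionary-based storage and all names are illustrative) that aggregates a measure over every GROUP BY combination of at most `max_dims` dimensions, so only cells with at most that many non-star coordinates are ever materialized:

```python
from itertools import combinations
from collections import defaultdict

def restricted_cube(rows, num_dims, max_dims):
    """Aggregate (SUM) a measure for every cube cell with at most
    `max_dims` non-star coordinates. Only non-empty cells are stored."""
    cube = defaultdict(float)
    for *coords, measure in rows:
        # Enumerate all GROUP BY attribute subsets of size <= max_dims.
        for k in range(max_dims + 1):
            for dims in combinations(range(num_dims), k):
                # A cell is keyed by its group-by dims and their values;
                # all other coordinates are implicitly '*' (ALL).
                key = (dims, tuple(coords[d] for d in dims))
                cube[key] += measure
    return cube

# Example: 3 dimensions, measure in the last position.
rows = [(1, 5, 2, 10.0), (1, 7, 2, 4.0), (3, 5, 9, 1.0)]
cube = restricted_cube(rows, num_dims=3, max_dims=2)
# SUM over all rows with dim0 == 1, the other dims starred:
print(cube[((0,), (1,))])  # 14.0
```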

2020, Vol 10 (1)
Author(s):  
Michele Allegra, Elena Facco, Francesco Denti, Alessandro Laio, Antonietta Mira

One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be used proficiently even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular-dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
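As a flavor of what a pointwise ID estimate looks like, here is a Python sketch of a local two-nearest-neighbour estimator in the spirit of the TWO-NN method of Facco et al.; it is a simplified illustration, not the authors' full Bayesian segmentation procedure, and the neighbourhood size and test data are illustrative assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree

def local_twonn_id(X, k=20):
    """Pointwise intrinsic-dimension estimate: apply the TWO-NN
    maximum-likelihood formula to the mu = r2/r1 ratios within each
    point's k-nearest-neighbour neighbourhood."""
    tree = cKDTree(X)
    # Distances to the two nearest neighbours of every point (col 0 is self).
    dists, _ = tree.query(X, k=3)
    log_mu = np.log(dists[:, 2] / dists[:, 1])     # log(r2 / r1) per point
    # Neighbourhoods over which the likelihood estimate is evaluated.
    _, nbrs = tree.query(X, k=k + 1)
    # d_hat = n / sum(log mu) over each point's neighbourhood.
    return (k + 1) / log_mu[nbrs].sum(axis=1)

# Two regions of different ID embedded in the same 5-dimensional space:
rng = np.random.default_rng(0)
line = np.c_[rng.normal(size=(500, 1)), np.zeros((500, 4))]   # local ID ~ 1
ball = rng.normal(size=(500, 5)) + 10.0                       # local ID ~ 5
ids = local_twonn_id(np.vstack([line, ball]))
print(ids[:500].mean(), ids[500:].mean())   # low vs. high local ID
```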


Author(s):  
Justin L. Balsor, David G. Jones, Kathryn M. Murphy

Many neural mechanisms regulate experience-dependent plasticity in the visual cortex (V1), and new techniques for quantifying large numbers of proteins or genes are bringing the study of plasticity into the era of big data. With those large data sets comes the challenge of extracting biologically meaningful results about visual plasticity from data-driven analytical methods designed for high-dimensional data. In other areas of neuroscience, high-information-content methodologies are revealing more subtle aspects of neural development and the individual variations that give rise to a richer picture of brain disorders. We have developed an approach for studying V1 plasticity that takes advantage of the known functions of many synaptic proteins in regulating visual plasticity and uses them to rebrand the results of high-dimensional analyses as a plasticity phenotype. Here we provide a primer for analyzing experience-dependent plasticity in V1, using example R code to identify high-dimensional changes in a group of proteins. We describe how to use PCA to classify high-dimensional plasticity features and how to combine them into a plasticity phenotype. In the examples, we show how the plasticity phenotype can be visualized and used to identify neurobiological features in V1 that change during development or after different visual rearing conditions. We include an R package, "v1hdexplorer", that aggregates the various coding packages and custom visualization scripts written in RStudio.
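The paper's code is in R; for consistency with the other examples on this page, the following Python sketch (scikit-learn) shows the generic PCA step the authors describe: reducing a protein-expression matrix to a few components and inspecting which proteins load on each. The protein names and random data are illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
proteins = ["GluA2", "GluN1", "GluN2A", "GluN2B", "Gephyrin", "PSD95", "Synapsin"]
X = rng.normal(size=(40, len(proteins)))     # 40 tissue samples (illustrative)

# Standardize, then keep the few components explaining most variance.
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(Z)
scores = pca.transform(Z)                    # per-sample plasticity features
print("feature matrix:", scores.shape)

# Proteins loading strongly on each component hint at its biological meaning
# and become candidate elements of a "plasticity phenotype".
for i, comp in enumerate(pca.components_):
    top = [proteins[j] for j in np.argsort(-np.abs(comp))[:3]]
    print(f"PC{i+1} ({pca.explained_variance_ratio_[i]:.0%} var): {top}")
```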


2017, Vol 10 (13), pp. 355
Author(s):  
Reshma Remesh, Pattabiraman. V

Dimensionality reduction techniques are used to reduce the complexity of analyzing high-dimensional data sets. The raw input data may have many dimensions, and analysis can be time-consuming and lead to wrong predictions if unnecessary attributes are considered. Using dimensionality reduction techniques, one can reduce the dimensionality of the input data and move toward accurate prediction at lower cost. In this paper, the different machine learning approaches to dimensionality reduction, such as PCA, SVD, LDA, kernel principal component analysis (KPCA), and artificial neural networks (ANN), are studied.
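A brief scikit-learn sketch contrasting most of the surveyed techniques (illustrative usage on a standard data set, not code from the paper; the ANN-based approach, e.g. an autoencoder, is omitted for brevity):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, KernelPCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)          # 64-dimensional inputs

# Unsupervised: PCA and truncated SVD project onto directions of max variance.
X_pca = PCA(n_components=10).fit_transform(X)
X_svd = TruncatedSVD(n_components=10).fit_transform(X)

# Kernel PCA captures non-linear structure via an implicit feature map.
X_kpca = KernelPCA(n_components=10, kernel="rbf").fit_transform(X)

# Supervised: LDA uses class labels, yielding at most (n_classes - 1) axes.
X_lda = LinearDiscriminantAnalysis(n_components=9).fit_transform(X, y)

for name, Z in [("PCA", X_pca), ("SVD", X_svd), ("KPCA", X_kpca), ("LDA", X_lda)]:
    print(name, X.shape, "->", Z.shape)
```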


2002, Vol 1 (1), pp. 20-34
Author(s):  
Daniel A. Keim, Ming C. Hao, Umesh Dayal, Meichun Hsu

Simple presentation graphics are intuitive and easy to use, but they either show only highly aggregated data, presenting a very small number of data values (as in bar charts), or suffer from a high degree of overlap that occludes a significant portion of the data values (as in x-y plots). In this article, the authors therefore propose a generalization of traditional bar charts and x-y plots that allows the visualization of large amounts of data. The basic idea is to use the pixels within the bars to present detailed information about the data records. These so-called pixel bar charts retain the intuitiveness of traditional bar charts while allowing very large data sets to be visualized in an effective way. It is shown that effective pixel placement requires solving a complex optimization problem; the authors then present an algorithm that solves it efficiently. The application to a number of real-world e-commerce data sets shows the wide applicability and usefulness of this new idea, and a comparison to other well-known visualization techniques (parallel coordinates and spiral techniques) shows a number of clear advantages.
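The basic idea, one colored pixel per record stacked inside each bar, can be sketched in a few lines of Python with matplotlib. This is a simplified illustration with synthetic data: records are merely sorted within each bar, standing in for the paper's optimized pixel-placement algorithm:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Synthetic e-commerce-style records: a category and a value per record.
category = rng.integers(0, 4, size=2000)
value = rng.gamma(2.0, 50.0, size=2000)

width = 12                                   # pixel columns per bar
fig, axes = plt.subplots(1, 4, figsize=(6, 4), sharey=True)
for c, ax in enumerate(axes):
    v = np.sort(value[category == c])        # simple ordering stands in for
                                             # the paper's optimized placement
    height = int(np.ceil(len(v) / width))
    img = np.full(height * width, np.nan)    # NaN pixels render as background
    img[: len(v)] = v                        # one pixel per record
    ax.imshow(img.reshape(height, width), origin="lower", aspect="auto",
              cmap="viridis")
    ax.set_title(f"cat {c}")                 # bar height ~ number of records
    ax.set_xticks([])
plt.show()
```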


2019, Vol 19 (1), pp. 1-4
Author(s):  
Ivan Gavrilyuk, Boris N. Khoromskij

The most important computational problems today are those related to the processing of large data sets and to the numerical solution of high-dimensional integro-differential equations. These problems arise in numerical modeling in quantum chemistry, materials science, and multiparticle dynamics, as well as in machine learning, computer simulation of stochastic processes, and many other applications related to big data analysis. Modern tensor numerical methods enable the solution of multidimensional partial differential equations (PDEs) in $\mathbb{R}^{d}$ by reducing them to one-dimensional calculations. They thus avoid the so-called "curse of dimensionality", i.e., the exponential growth of computational complexity with the dimension d, in the numerical solution of high-dimensional problems. At present, both tensor numerical methods and the multilinear algebra of big data continue to expand actively into further theoretical and applied research topics. This issue of CMAM is devoted to recent developments in the theory of tensor numerical methods and their applications in scientific computing and data analysis. Current activities in this emerging field, on the effective numerical modeling of temporal and stationary multidimensional PDEs and beyond, are presented in the following ten articles, and some future trends are highlighted therein.
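To make the dimensionality-reduction idea concrete, here is a minimal Python sketch (a generic illustration, not drawn from any article in the issue) of the simplest tensor format, the rank-1 canonical representation: for a separable function, a sum over an n^d grid factorizes into d one-dimensional sums, so the cost drops from n^d to O(dn):

```python
import numpy as np

# f(x1,...,xd) = prod_k g_k(x_k) is separable, so its values on an n^d grid
# form a rank-1 tensor determined by d factor vectors of length n each.
d, n = 10, 50
grid = np.linspace(0.0, 1.0, n)
factors = [np.exp(-(k + 1) * grid**2) for k in range(d)]   # g_k on the grid

# Summing f over the full grid would need n**d = 50**10 terms. In rank-1
# format the sum factorizes into a product of d one-dimensional sums: O(d*n).
total = np.prod([f.sum() for f in factors])
print(total)

# Sanity check on a small case where the full tensor fits in memory:
small = [f[:4] for f in factors[:3]]
full = np.einsum("i,j,k->ijk", *small)                     # explicit 4^3 tensor
assert np.isclose(full.sum(), np.prod([f.sum() for f in small]))
```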

