NetCube

2009 ◽  
pp. 2011-2036
Author(s):  
Dimitris Margaritis ◽  
Christos Faloutsos ◽  
Sebastian Thrun

We present a novel method for answering count queries from a large database approximately and quickly. Our method implements an approximate DataCube of the application domain, which can be used to answer any conjunctive count query that can be formed by the user. The DataCube is a conceptual device that in principle stores the number of matching records for all possible such queries. However, because its size and generation time are inherently exponential, our approach uses one or more Bayesian networks to implement it approximately. Bayesian networks are statistical graphical models that can succinctly represent the underlying joint probability distribution of the domain, and can therefore be used to calculate approximate counts for any conjunctive query combination of attribute values and “don’t cares.” The structure and parameters of these networks are learned from the database in a preprocessing stage. By means of such a network, the proposed method, called NetCube, exploits correlations and independencies among attributes to answer a count query quickly without accessing the database. Our preprocessing algorithm scales linearly in the size of the database and admits a straightforward parallel implementation. We give an algorithm for estimating the count result of arbitrary queries whose running time is constant in the database size. Our experimental results show that NetCubes are fast to generate and use, achieve excellent compression, and have low reconstruction error. Moreover, they naturally allow for visualization and data mining at no extra cost.
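To make the count-estimation idea concrete, here is a minimal Python sketch (illustrative only, not the authors' implementation): a toy three-attribute Bayesian network whose factorization is used to turn a conjunctive query with “don’t cares” into an approximate count, by marginalizing the don’t-care attributes and scaling by the number of records. All parameters and names are made up, and the enumeration over don’t-cares is exponential in their number; a real system would use proper Bayesian-network inference.

```python
from itertools import product

# Toy Bayesian network over three binary attributes a, b, c with
# factorization P(a, b, c) = P(a) * P(b | a) * P(c | b).
P_a = {0: 0.6, 1: 0.4}
P_b_given_a = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key: (b, a)
P_c_given_b = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key: (c, b)

def joint(a, b, c):
    """Joint probability from the network's factorization."""
    return P_a[a] * P_b_given_a[(b, a)] * P_c_given_b[(c, b)]

def estimate_count(query, n_records):
    """Approximate count of records matching a conjunctive query.

    `query` maps each attribute to 0, 1, or None ("don't care");
    don't-care attributes are summed out of the joint distribution.
    """
    domains = [[query[v]] if query[v] is not None else [0, 1]
               for v in ("a", "b", "c")]
    prob = sum(joint(a, b, c) for a, b, c in product(*domains))
    return prob * n_records

# Query: count records with a = 1 and c = 0, any value of b.
print(estimate_count({"a": 1, "b": None, "c": 0}, n_records=1_000_000))
```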


Author(s):  
Marco F. Ramoni ◽  
Paola Sebastiani

Born at the intersection of artificial intelligence, statistics, and probability, Bayesian networks (Pearl, 1988) are a representation formalism at the cutting edge of knowledge discovery and data mining (Heckerman, 1997). Bayesian networks belong to a more general class of models called probabilistic graphical models (Whittaker, 1990; Lauritzen, 1996) that arise from the combination of graph theory and probability theory, and their success rests on their ability to handle complex probabilistic models by decomposing them into smaller, amenable components. A probabilistic graphical model is defined by a graph, where nodes represent stochastic variables and arcs represent dependencies among those variables. The arcs are annotated by probability distributions that shape the interaction between the linked variables. A probabilistic graphical model is called a Bayesian network when the graph connecting its variables is a directed acyclic graph (DAG). This graph represents conditional independence assumptions that are used to factorize the joint probability distribution of the network variables, thus making the process of learning from a large database computationally tractable. A Bayesian network induced from data can be used to investigate distant relationships between variables, as well as for prediction and explanation, by computing the conditional probability distribution of one variable given the values of some others.
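As a concrete illustration of the factorization described above (a toy sketch, not from the original article): for the DAG A → B, A → C, the joint distribution factorizes as P(A, B, C) = P(A) P(B | A) P(C | A), and prediction reduces to summing and normalizing this product. All numbers below are made up.

```python
# Toy DAG: A -> B, A -> C, so the joint factorizes as
# P(A, B, C) = P(A) * P(B | A) * P(C | A).
P_A = {0: 0.7, 1: 0.3}
P_B_given_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # key: (B, A)
P_C_given_A = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}  # key: (C, A)

def joint(a, b, c):
    return P_A[a] * P_B_given_A[(b, a)] * P_C_given_A[(c, a)]

# Prediction/explanation: P(A | B = 1) by summing C out of the joint
# and normalizing over A (Bayes' rule on the factorized distribution).
unnorm = {a: sum(joint(a, 1, c) for c in (0, 1)) for a in (0, 1)}
z = sum(unnorm.values())
posterior = {a: p / z for a, p in unnorm.items()}
print(posterior)  # e.g. P(A = 1 | B = 1) = 0.72
```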


2014 ◽  
Vol 926-930 ◽  
pp. 3594-3597
Author(s):  
Cai Chang Ding ◽  
Wen Xiu Peng ◽  
Wei Ming Wang

Estimation of Distribution Algorithms (EDAs) are a family of algorithms belonging to the field of Evolutionary Computation. In EDAs there are neither crossover nor mutation operators. Instead, the new population of individuals is sampled from a probability distribution that is estimated from a database containing the individuals selected from the previous generation. Thus, the interrelations between the different variables that represent the individuals may be expressed explicitly through the joint probability distribution associated with the individuals selected at each generation.
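A minimal sketch of the simplest member of this family, a UMDA-style univariate EDA on the classic OneMax objective (all names and parameter values below are illustrative): per-bit marginals are estimated from the selected individuals and the next population is sampled from them, with no crossover or mutation. Note that a univariate model ignores the interrelations between variables mentioned above; multivariate EDAs estimate a richer joint distribution.

```python
import random

def umda_onemax(n_bits=20, pop_size=100, n_select=30, generations=50):
    """Univariate EDA (UMDA-style): no crossover or mutation; each new
    population is sampled from per-bit marginals estimated from the
    individuals selected in the previous generation."""
    probs = [0.5] * n_bits  # initial distribution: uniform per bit
    best = None
    for _ in range(generations):
        pop = [[int(random.random() < p) for p in probs] for _ in range(pop_size)]
        pop.sort(key=sum, reverse=True)  # fitness: OneMax (count of ones)
        selected = pop[:n_select]
        if best is None or sum(pop[0]) > sum(best):
            best = pop[0]
        # Estimate the (univariate) distribution from the selected individuals.
        probs = [sum(ind[i] for ind in selected) / n_select for i in range(n_bits)]
    return best

print(sum(umda_onemax()))  # approaches n_bits after a few generations
```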


Author(s):  
Yang Xiang

Graphical models such as Bayesian networks (BNs) (Pearl, 1988) and decomposable Markov networks (DMNs) (Xiang, Wong & Cercone, 1997) have been applied widely to probabilistic reasoning in intelligent systems. Figure 1 illustrates a BN and a DMN on a trivial uncertain domain: A virus can damage computer files, and so can a power glitch. A power glitch also causes a VCR to reset. The BN in (a) has four nodes, corresponding to four binary variables taking values from {true, false}. The graph structure encodes a set of dependence and independence assumptions (e.g., that f is directly dependent on v and p but is independent of r once the value of p is known). Each node is associated with a conditional probability distribution conditioned on its parent nodes (e.g., P(f | v, p)). The joint probability distribution is the product P(v, p, f, r) = P(f | v, p) P(r | p) P(v) P(p). The DMN in (b) has two groups of nodes that are maximally pair-wise connected, called cliques. Each clique is associated with a probability distribution (e.g., clique {v, p, f} is assigned P(v, p, f)). The joint probability distribution is P(v, p, f, r) = P(v, p, f) P(r, p) / P(p), where P(p) can be derived from one of the clique distributions. The networks, for instance, can be used to reason about whether there are viruses in the computer system, after observations on f and r are made.
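To make the closing example computable, here is a toy sketch following the structure and factorization of the BN in Figure 1(a), with made-up parameter values: diagnosing the virus variable v after observing f and r amounts to summing the factorized joint over the unobserved p and normalizing.

```python
# Illustrative (made-up) parameters for the BN of Figure 1(a):
# P(v, p, f, r) = P(f | v, p) * P(r | p) * P(v) * P(p).
P_v = {True: 0.05, False: 0.95}   # virus present
P_p = {True: 0.10, False: 0.90}   # power glitch
P_r_given_p = {True: {True: 0.8, False: 0.2},     # P(r | p), keyed by p then r
               False: {True: 0.05, False: 0.95}}
P_f_given_vp = {(True, True): 0.99, (True, False): 0.9,   # P(f = true | v, p)
                (False, True): 0.7, (False, False): 0.01}

def joint(v, p, f, r):
    pf = P_f_given_vp[(v, p)] if f else 1 - P_f_given_vp[(v, p)]
    return pf * P_r_given_p[p][r] * P_v[v] * P_p[p]

# Diagnosis: P(v | f = true, r = true), marginalizing the glitch p.
unnorm = {v: sum(joint(v, p, True, True) for p in (True, False))
          for v in (True, False)}
z = sum(unnorm.values())
print({v: q / z for v, q in unnorm.items()})
```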


Information ◽  
2018 ◽  
Vol 9 (9) ◽  
pp. 211
Author(s):  
David Kinney

This article considers the extent to which Bayesian networks with imprecise probabilities, which are used in statistics and computer science for predictive purposes, can be used to represent causal structure. It is argued that the adequacy conditions for causal representation in the precise context—the Causal Markov Condition and Minimality—do not readily translate into the imprecise context. Crucial to this argument is the fact that the independence relation between random variables can be understood in several different ways when the joint probability distribution over those variables is imprecise, none of which provides a compelling basis for the causal interpretation of imprecise Bayes nets. I conclude that there are serious limits to the use of imprecise Bayesian networks to represent causal structure.


Author(s):  
Juan I. Alonso-Barba ◽  
Jens D. Nielsen ◽  
Luis de la Ossa ◽  
Jose M. Puerta

Probabilistic Graphical Models (PGM) are a class of statistical models that use a graph structure over a set of variables to encode independence relations between those variables. By augmenting the graph with local parameters, a PGM allows for a compact representation of a joint probability distribution over the variables of the graph, which in turn enables efficient inference algorithms. PGMs are often used for modeling physical and biological systems, and such models are then used both to answer probabilistic queries concerning the variables and to represent certain causal and/or statistical relations in the domain. In this chapter, the authors give an overview of common techniques used for automatic construction of such models from a dataset of observations (usually referred to as learning), and they also review some important applications. The chapter guides the reader to the relevant literature for further study.
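As a small illustration of the learning the chapter surveys, here is a sketch of its simplest ingredient: maximum-likelihood parameter estimation by counting, for a fixed structure A → B (data and names are made up). Structure learning, the search over graphs, is the harder problem the chapter covers.

```python
from collections import Counter

# Maximum-likelihood parameter learning for a fixed structure A -> B:
# estimate P(A) and P(B | A) by counting in a dataset of observations.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]  # (a, b) pairs

count_a = Counter(a for a, _ in data)
count_ab = Counter(data)

P_A = {a: c / len(data) for a, c in count_a.items()}
P_B_given_A = {(b, a): count_ab[(a, b)] / count_a[a]
               for a in count_a for b in (0, 1)}

print(P_A)          # {0: 0.5, 1: 0.5}
print(P_B_given_A)  # e.g. P(B = 1 | A = 0) = 1/3
```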


Author(s):  
Andrés Cano ◽  
Manuel Gómez-Olmedo ◽  
Serafín Moral ◽  
Serafín Moral-García

Given a set of uncertain discrete variables with a joint probability distribution and a set of observations for some of them, the most probable explanation is a set or configuration of values for the non-observed variables maximizing the conditional probability of these variables given the observations. This is a hard problem, which can be solved by a deletion algorithm with max-marginalization, with complexity similar to that of computing conditional probabilities. When this approach is infeasible, an alternative is to carry out an approximate deletion algorithm, which can be used to guide the search for the most probable explanation using A* or branch and bound (the approximate+search approach). The most common approximation procedure has been the mini-bucket approach. In this paper, it is shown that using probability trees as the representation of potentials, with pruning of branches with similar values, can improve the performance of this procedure. This is corroborated by an experimental study in which computation times are compared using randomly generated Bayesian networks and benchmark networks from UAI competitions.
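For orientation, here is a brute-force sketch of the MPE problem itself on a toy chain network (made-up parameters, not the authors' probability-tree method): it enumerates all configurations of the non-observed variables, which is exactly what the deletion algorithm with max-marginalization, and the approximate+search approach studied in the paper, avoid.

```python
from itertools import product

# Toy chain A -> B -> C with joint P(a, b, c) = P(a) P(b | a) P(c | b).
P_A = {0: 0.6, 1: 0.4}
P_B_given_A = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key: (b, a)
P_C_given_B = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key: (c, b)

def joint(a, b, c):
    return P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c, b)]

def mpe(evidence):
    """Most probable explanation: argmax over the non-observed variables
    of P(vars | evidence); the normalizing constant does not affect it."""
    names = ("a", "b", "c")
    free = [v for v in names if v not in evidence]
    best, best_p = None, -1.0
    for values in product((0, 1), repeat=len(free)):
        assign = dict(evidence, **dict(zip(free, values)))
        p = joint(assign["a"], assign["b"], assign["c"])
        if p > best_p:
            best, best_p = assign, p
    return best, best_p

print(mpe({"c": 1}))  # most probable (a, b) given the observation c = 1
```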

