Discovering dependencies with reliable mutual information

2020 ◽  
Vol 62 (11) ◽  
pp. 4223-4253
Author(s):  
Panagiotis Mandros ◽  
Mario Boley ◽  
Jilles Vreeken

We consider the task of discovering functional dependencies in data for target attributes of interest. To solve it, we have to answer two questions: How do we quantify the dependency in a model-agnostic and interpretable way as well as reliably against sample size and dimensionality biases? How can we efficiently discover the exact or $\alpha$-approximate top-k dependencies? We address the first question by adopting information-theoretic notions. Specifically, we consider the mutual information score, for which we propose a reliable estimator that enables robust optimization in high-dimensional data. To address the second question, we then systematically explore the algorithmic implications of using this measure for optimization. We show the problem is NP-hard and justify worst-case exponential-time as well as heuristic search methods. We propose two bounding functions for the estimator, which we use as pruning criteria in branch-and-bound search to efficiently mine dependencies with approximation guarantees. Empirical evaluation shows that the derived estimator has desirable statistical properties, the bounding functions lead to effective exact and greedy search algorithms, and when combined, qualitative experiments show the framework indeed discovers highly informative dependencies.
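The sample-size and dimensionality bias mentioned in this abstract can be illustrated with a minimal sketch. It assumes discrete attributes, uses the plug-in estimate of I(X;Y)/H(Y), and subtracts a Monte Carlo estimate of the score expected when the target is randomly permuted. The function names, the Monte Carlo approximation, and the parameters `n_perm` and `seed` are assumptions made for illustration and are not taken from the paper.

```python
import math
import random
from collections import Counter

def entropy(values):
    """Plug-in (empirical) Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def fraction_of_information(x, y):
    """Plug-in estimate of I(X;Y) / H(Y), a score in [0, 1]."""
    h_y = entropy(y)
    if h_y == 0:
        return 0.0
    mutual_information = entropy(x) + h_y - entropy(list(zip(x, y)))
    return mutual_information / h_y

def corrected_fraction(x, y, n_perm=200, seed=0):
    """Plug-in score minus its average over random permutations of y.

    Assumption: the permutation baseline is approximated by sampling,
    which is an illustrative stand-in for an exact expectation.
    """
    rng = random.Random(seed)
    y_perm = list(y)
    baseline = 0.0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        baseline += fraction_of_information(x, y_perm)
    return fraction_of_information(x, y) - baseline / n_perm
```

With a unique identifier as the attribute, for example `x = list(range(8))` and `y = [0, 1] * 4`, the plug-in score is 1.0 even though `x` carries no generalizable information about `y`, while the corrected score is 0.0 because every permutation of `y` scores equally well.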

Author(s):  
Panagiotis Mandros ◽  
Mario Boley ◽  
Jilles Vreeken

The reliable fraction of information is an attractive score for quantifying (functional) dependencies in high-dimensional data. In this paper, we systematically explore the algorithmic implications of using this measure for optimization. We show that the problem is NP-hard, justifying worst-case exponential-time as well as heuristic search methods. We then substantially improve the practical performance for both optimization styles by deriving a novel admissible bounding function that has an unbounded potential for additional pruning over the previously proposed one. Finally, we empirically investigate the approximation ratio of the greedy algorithm and show that it produces highly competitive results in a fraction of time needed for complete branch-and-bound style search.


2020 ◽  
Vol 501 (1) ◽  
pp. 994-1001
Author(s):  
Suman Sarkar ◽  
Biswajit Pandey ◽  
Snehasish Bhattacharjee

We use an information-theoretic framework to analyse data from the Galaxy Zoo 2 project and study whether there are any statistically significant correlations between the presence of bars in spiral galaxies and their environment. We measure the mutual information between the barredness of galaxies and their environments in a volume-limited sample ($M_r \le -21$) and compare it with the same in data sets where (i) the bar/unbar classifications are randomized and (ii) the spatial distribution of galaxies is shuffled on different length scales. We assess the statistical significance of the differences in the mutual information using a t-test and find that both randomization of morphological classifications and shuffling of spatial distribution do not alter the mutual information in a statistically significant way. The non-zero mutual information between the barredness and environment arises due to the finite and discrete nature of the data set and can be entirely explained by mock Poisson distributions. We also separately compare the cumulative distribution functions of the barred and unbarred galaxies as a function of their local density. Using a Kolmogorov–Smirnov test, we find that the null hypothesis cannot be rejected even at the 75 per cent confidence level. Our analysis indicates that environments do not play a significant role in the formation of a bar, which is largely determined by the internal processes of the host galaxy.
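The comparison of cumulative distribution functions described in this abstract rests on the two-sample Kolmogorov–Smirnov statistic, the largest vertical gap between two empirical CDFs. The sketch below computes only that statistic, not the associated significance level; the function name and inputs are hypothetical, and the samples stand in for local-density values of two galaxy populations.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed points, so checking
    # those points is enough to find the maximum gap.
    for point in a + b:
        cdf_a = bisect_right(a, point) / len(a)
        cdf_b = bisect_right(b, point) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

A small statistic means the two empirical distributions are close, which is the situation in which the null hypothesis of a common parent distribution cannot be rejected.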


2021 ◽  
Vol 68 (4) ◽  
pp. 1-25
Author(s):  
Thodoris Lykouris ◽  
Sergei Vassilvitskii

Traditional online algorithms encapsulate decision making under uncertainty, and give ways to hedge against all possible future events, while guaranteeing a nearly optimal solution, as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error. In this work, we develop a framework for augmenting online algorithms with a machine-learned predictor to achieve competitive ratios that provably improve upon unconditional worst-case lower bounds when the predictor has low error. Our approach treats the predictor as a complete black box and is not dependent on its inner workings or the exact distribution of its errors. We apply this framework to the traditional caching problem: creating an eviction strategy for a cache of size k. We demonstrate that naively following the oracle’s recommendations may lead to very poor performance, even when the average error is quite low. Instead, we show how to modify the Marker algorithm to take into account the predictions and prove that this combined approach achieves a competitive ratio that both (i) decreases as the predictor’s error decreases and (ii) is always capped by O(log k), which can be achieved without any assistance from the predictor. We complement our results with an empirical evaluation of our algorithm on real-world datasets and show that it performs well empirically even when using simple off-the-shelf predictions.
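The idea of combining the Marker algorithm with predictions can be sketched in simplified form. The code below keeps the phase and marking structure of the classical Marker algorithm and, instead of evicting a random unmarked page, evicts the unmarked page whose predicted next request is furthest away. This is an illustrative simplification and not the algorithm analysed in the paper: it omits whatever safeguards are needed to retain the O(log k) cap under bad predictions. The function name and the `predicted_next` input format are assumptions.

```python
import math

def marker_with_predictions(requests, predicted_next, k):
    """Count cache misses for a prediction-guided Marker-style policy.

    requests: sequence of page identifiers.
    predicted_next: predicted_next[t] is the predicted time of the next
        request for requests[t] (math.inf if none is expected).
    k: cache size.
    """
    cache = {}       # page -> most recent prediction of its next request
    marked = set()   # pages requested during the current phase
    misses = 0
    for t, page in enumerate(requests):
        if page not in cache:
            misses += 1
            if len(cache) >= k:
                if len(marked) == len(cache):
                    marked.clear()  # all pages marked: start a new phase
                unmarked = [q for q in cache if q not in marked]
                # Evict the unmarked page predicted to be needed last.
                victim = max(unmarked, key=lambda q: cache[q])
                del cache[victim]
        cache[page] = predicted_next[t]
        marked.add(page)
    return misses
```

Restricting evictions to unmarked pages is what ties the policy to the Marker framework: a page requested in the current phase is never evicted before the phase ends, regardless of what the predictor says.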


2021 ◽  
Author(s):  
Gourab Das

LitRev is a novel, robust, data-driven approach developed for quick literature review on a particular topic of interest. The method identifies common biological phrases that follow a power-law distribution, as well as important phrases whose normalized pointwise mutual information score is greater than zero.
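For reference, normalized pointwise mutual information divides PMI by -log p(x, y), giving a score in [-1, 1] where 0 corresponds to independence. A minimal sketch from co-occurrence counts follows; the function name and the count-based interface are assumptions and do not describe LitRev's actual implementation.

```python
import math

def npmi(count_xy, count_x, count_y, total):
    """Normalized PMI of two terms from occurrence counts.

    count_xy: number of contexts containing both terms.
    count_x, count_y: number of contexts containing each term.
    total: total number of contexts.
    """
    p_xy = count_xy / total
    if p_xy == 0:
        return -1.0  # limiting value when the terms never co-occur
    if p_xy == 1:
        raise ValueError("NPMI is undefined when p(x, y) = 1")
    p_x, p_y = count_x / total, count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

Under this definition, a threshold of zero keeps exactly those term pairs that co-occur more often than independence would predict.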


2021 ◽  
Vol 2021 (9) ◽  
Author(s):  
Alex May

We prove a theorem showing that the existence of “private” curves in the bulk of AdS implies two regions of the dual CFT share strong correlations. A private curve is a causal curve which avoids the entanglement wedge of a specified boundary region $\mathcal{U}$. The implied correlation is measured by the conditional mutual information $I(\mathcal{V}_1 : \mathcal{V}_2 \mid \mathcal{U})$, which is $O(1/G_N)$ when a private causal curve exists. The regions $\mathcal{V}_1$ and $\mathcal{V}_2$ are specified by the endpoints of the causal curve and the placement of the region $\mathcal{U}$. This gives a causal perspective on the conditional mutual information in AdS/CFT, analogous to the causal perspective on the mutual information given by earlier work on the connected wedge theorem. We give an information-theoretic argument for our theorem, along with a bulk geometric proof. In the geometric perspective, the theorem follows from the maximin formula and entanglement wedge nesting. In the information-theoretic approach, the theorem follows from resource requirements for sending private messages over a public quantum channel.
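For readers unfamiliar with the quantity, the conditional mutual information has the standard expansion in terms of entropies shown below. The notation $S(\cdot)$ for the von Neumann entropy of the reduced state on a region is an assumption about the paper's conventions.

```latex
I(\mathcal{V}_1 : \mathcal{V}_2 \mid \mathcal{U})
  = S(\mathcal{V}_1 \mathcal{U}) + S(\mathcal{V}_2 \mathcal{U})
  - S(\mathcal{V}_1 \mathcal{V}_2 \mathcal{U}) - S(\mathcal{U})
```

Strong subadditivity guarantees that this combination is non-negative.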


Author(s):  
Greg Ver Steeg

Learning by children and animals occurs effortlessly and largely without obvious supervision. Successes in automating supervised learning have not translated to the more ambiguous realm of unsupervised learning where goals and labels are not provided. Barlow (1961) suggested that the signal that brains leverage for unsupervised learning is dependence, or redundancy, in the sensory environment. Dependence can be characterized using the information-theoretic multivariate mutual information measure called total correlation. The principle of Total Cor-relation Ex-planation (CorEx) is to learn representations of data that "explain" as much dependence in the data as possible. We review some manifestations of this principle along with successes in unsupervised learning problems across diverse domains including human behavior, biology, and language.
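Total correlation, the dependence measure named in this abstract, is the sum of the marginal entropies minus the joint entropy. The sketch below computes the plug-in estimate for discrete variables; the function names and the column-wise input format are assumptions, and the code illustrates only the measure itself, not the CorEx learning procedure.

```python
import math
from collections import Counter

def entropy(values):
    """Plug-in (empirical) Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def total_correlation(columns):
    """TC(X_1, ..., X_n) = sum_i H(X_i) - H(X_1, ..., X_n).

    columns: list of equal-length sequences, one per variable.
    """
    joint = list(zip(*columns))  # one tuple per observation
    return sum(entropy(col) for col in columns) - entropy(joint)
```

Three copies of the same fair bit give a total correlation of 2 bits, since two of the three variables are entirely redundant, while independent variables give 0.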


2021 ◽  
Vol 12 ◽  
Author(s):  
Richard Futrell

I present a computational-level model of semantic interference effects in online word production within a rate–distortion framework. I consider a bounded-rational agent trying to produce words. The agent's action policy is determined by maximizing accuracy in production subject to computational constraints. These computational constraints are formalized using mutual information. I show that semantic similarity-based interference among words falls out naturally from this setup, and I present a series of simulations showing that the model captures some of the key empirical patterns observed in Stroop and Picture–Word Interference paradigms, including comparisons to human data from previous experiments.


Author(s):  
Yang Xu ◽  
Ronghao Zheng ◽  
Meiqin Liu ◽  
Senlin Zhang
