Mutual Information between Discrete and Continuous Data Sets

PLoS ONE ◽  
2014 ◽  
Vol 9 (2) ◽  
pp. e87357 ◽  
Author(s):  
Brian C. Ross


2020 ◽  
Vol 501 (1) ◽  
pp. 994-1001
Author(s):  
Suman Sarkar ◽  
Biswajit Pandey ◽  
Snehasish Bhattacharjee

ABSTRACT We use an information theoretic framework to analyse data from the Galaxy Zoo 2 project and study whether there are any statistically significant correlations between the presence of bars in spiral galaxies and their environment. We measure the mutual information between the barredness of galaxies and their environments in a volume limited sample (Mr ≤ −21) and compare it with the same in data sets where (i) the bar/unbar classifications are randomized and (ii) the spatial distribution of galaxies is shuffled on different length scales. We assess the statistical significance of the differences in the mutual information using a t-test and find that neither randomization of morphological classifications nor shuffling of the spatial distribution alters the mutual information in a statistically significant way. The non-zero mutual information between the barredness and environment arises due to the finite and discrete nature of the data set that can be entirely explained by mock Poisson distributions. We also separately compare the cumulative distribution functions of the barred and unbarred galaxies as a function of their local density. Using a Kolmogorov–Smirnov test, we find that the null hypothesis cannot be rejected even at the 75 per cent confidence level. Our analysis indicates that environments do not play a significant role in the formation of a bar, which is largely determined by the internal processes of the host galaxy.
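The label-randomization comparison described in this abstract can be illustrated with a minimal sketch. It assumes both variables are already discrete (a bar flag per galaxy and a binned environment label) and uses the simple plug-in estimator of mutual information; the function names and the binning are illustrative assumptions, not the authors' pipeline.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def randomized_mi(xs, ys, n_trials=200, seed=0):
    """MI values after shuffling the labels xs, which breaks any real
    association with ys while keeping both marginal distributions fixed."""
    rng = random.Random(seed)
    shuffled = list(xs)
    values = []
    for _ in range(n_trials):
        rng.shuffle(shuffled)
        values.append(mutual_information(shuffled, ys))
    return values


# Hypothetical toy data: bar flag (1 = barred) and a coarse density bin.
barred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
density_bin = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
observed = mutual_information(barred, density_bin)
null_values = randomized_mi(barred, density_bin)
```

With finite samples the shuffled values are generally above zero, so the observed value has to be compared with the spread of the randomized values (for instance with a t-test, as in the abstract) rather than with zero.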


2020 ◽  
Vol 497 (4) ◽  
pp. 4077-4090 ◽  
Author(s):  
Suman Sarkar ◽  
Biswajit Pandey

ABSTRACT A non-zero mutual information between the morphology of a galaxy and its large-scale environment is known to exist in the Sloan Digital Sky Survey (SDSS) up to a few tens of Mpc. It is important to test the statistical significance of this mutual information, if any. We propose three different methods to test the statistical significance of this non-zero mutual information and apply them to the SDSS and the Millennium Run simulation. We randomize the morphological information of SDSS galaxies without affecting their spatial distribution and compare the mutual information in the original and randomized data sets. We also divide the galaxy distribution into smaller subcubes and randomly shuffle them many times, keeping the morphological information of galaxies intact. We compare the mutual information in the original SDSS data and its shuffled realizations for different shuffling lengths. Using a t-test, we find that a small but statistically significant (at the 99.9 per cent confidence level) mutual information between morphology and environment exists up to the entire length-scale probed. We also conduct another experiment using mock data sets from a semi-analytic galaxy catalogue where we assign morphology to galaxies in a controlled manner based on the density at their locations. The experiment clearly demonstrates that mutual information can effectively capture the physical correlations between morphology and environment. Our analysis suggests that physical association between morphology and environment may extend to much larger length-scales than currently believed, and the information theoretic framework presented here can serve as a sensitive and useful probe of the assembly bias and large-scale environmental dependence of galaxy properties.
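The subcube shuffling step can be sketched as follows. This is a simplified illustration that assumes a cubic box, a plain random exchange of subcube positions with no rotation, and galaxies stored as coordinate tuples; the function name and parameters are hypothetical and are not taken from the paper.

```python
import random


def shuffle_subcubes(positions, box_size, n_side, seed=0):
    """Randomly permute the n_side**3 subcubes of a cubic box.

    positions: list of (x, y, z) tuples with 0 <= coordinate < box_size.
    Returns the new positions in the same order as the input, so any
    per-galaxy label (e.g. morphology) kept in a parallel list stays
    attached to its galaxy.
    """
    rng = random.Random(seed)
    cell = box_size / n_side
    cells = [(i, j, k)
             for i in range(n_side)
             for j in range(n_side)
             for k in range(n_side)]
    targets = cells[:]
    rng.shuffle(targets)
    move = dict(zip(cells, targets))  # source subcube -> destination subcube

    shuffled = []
    for p in positions:
        # Index of the subcube containing this galaxy (clamped at the edge).
        idx = tuple(min(int(c // cell), n_side - 1) for c in p)
        dest = move[idx]
        # Keep the offset inside the subcube, change only the subcube origin.
        shuffled.append(tuple(c - i * cell + d * cell
                              for c, i, d in zip(p, idx, dest)))
    return shuffled
```

The shuffling length is box_size / n_side: structure on scales smaller than one subcube is preserved, while coherence between different subcubes is destroyed.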


2014 ◽  
Vol 14 (4) ◽  
pp. 815-829 ◽  
Author(s):  
G. Anderson ◽  
D. Klugmann

Abstract. The Met Office has operated a very low frequency (VLF) lightning location network since 1987. The long-range capabilities of this network, referred to in its current form as ATDnet, allow for relatively continuous detection efficiency across Europe with only a limited number of sensors. The wide coverage and continuous data obtained by Arrival Time Differing NETwork (ATDnet) are here used to create data sets of lightning density across Europe. Results of annual and monthly detected lightning density using data from 2008–2012 are presented, along with more detailed analysis of statistics and features of interest. No adjustment has been made to the data for regional variations in detection efficiency.


2015 ◽  
Vol 2015 ◽  
pp. 1-6 ◽  
Author(s):  
Huaqiang Wang ◽  
Lijun Liang ◽  
Zhanwen Niu ◽  
Zhen He

The identification of critical-to-quality characteristics (CTQs) for complex products is the first step in implementing quality control. To improve the efficiency and accuracy of CTQ identification, we propose a novel hybrid approach based on mutual information and an improved gravitational search algorithm, which combines the advantages of filter and wrapper methods. First, the information relevance and redundancy are measured by mutual information. Then, the improved gravitational search algorithm is used to search for the CTQs. Experimentation is carried out using 2 UCI data sets, and the classification capability of the CTQs is tested by SVM and tenfold cross validation. The results show that the presented method is effective and practically applicable.
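The filter stage, in which mutual information measures relevance and redundancy, can be illustrated with a small sketch. Note that it substitutes a simple greedy search for the improved gravitational search algorithm used in the paper, and it assumes discrete features; the function names and the scoring rule (relevance minus mean redundancy) are illustrative assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def greedy_select(features, target, k):
    """Pick up to k feature names, trading relevance against redundancy.

    features: dict mapping a feature name to its list of discrete values.
    This greedy loop is a stand-in for the paper's search algorithm.
    """
    remaining = sorted(features)  # fixed order makes ties deterministic
    relevance = {f: mutual_information(features[f], target) for f in remaining}
    selected = []
    while remaining and len(selected) < k:
        def score(f):
            if not selected:
                return relevance[f]
            # Mean MI with already selected features measures redundancy.
            redundancy = sum(mutual_information(features[f], features[s])
                             for s in selected) / len(selected)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

A feature that duplicates an already selected one scores poorly even when it is highly relevant on its own, which is the behaviour the relevance and redundancy measures are meant to capture.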


Entropy ◽  
2021 ◽  
Vol 23 (10) ◽  
pp. 1357
Author(s):  
Katrin Sophie Bohnsack ◽  
Marika Kaden ◽  
Julia Abel ◽  
Sascha Saralajew ◽  
Thomas Villmann

In the present article we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon-, Rényi-, and Tsallis-entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
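For orientation, the three entropies named in this abstract have the following standard textbook forms, together with the classical mutual information function of a sequence at symbol distance τ. The notation is an assumption made here for illustration; the paper's resolved variants are not reproduced.

```latex
% Standard definitions for a discrete distribution p = (p_1, ..., p_n).
\begin{align}
  H(p)          &= -\sum_{i} p_i \log p_i
                   && \text{(Shannon)} \\
  H_{\alpha}(p) &= \frac{1}{1-\alpha} \log \sum_{i} p_i^{\alpha},
                   \quad \alpha > 0,\ \alpha \neq 1
                   && \text{(R\'enyi)} \\
  S_{q}(p)      &= \frac{1}{q-1} \Bigl( 1 - \sum_{i} p_i^{q} \Bigr),
                   \quad q \neq 1
                   && \text{(Tsallis)} \\
  F(\tau)       &= \sum_{x,y} p_{\tau}(x,y)
                   \log \frac{p_{\tau}(x,y)}{p(x)\,p(y)}
                   && \text{(mutual information function)}
\end{align}
% p_tau(x,y): probability of observing symbols x and y at distance tau.
% Both H_alpha and S_q reduce to the Shannon entropy as alpha, q -> 1.
```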


2019 ◽  
Vol 9 (2) ◽  
pp. 123-147 ◽  
Author(s):  
Ryotaro Kamimura

Abstract The present paper aims to propose a new type of information-theoretic method to maximize mutual information between inputs and outputs. The importance of mutual information in neural networks is well known, but the actual implementation of mutual information maximization has been quite difficult to undertake. In addition, mutual information has not extensively been used in neural networks, meaning that its applicability is very limited. To overcome the shortcoming of mutual information maximization, we present it here in a very simplified manner by supposing that mutual information is already maximized before learning, or at least at the beginning of learning. The method was applied to three data sets (crab data set, wholesale data set, and human resources data set) and examined in terms of generalization performance and connection weights. The results showed that by disentangling connection weights, maximizing mutual information made it possible to explicitly interpret the relations between inputs and outputs.


BMC Genomics ◽  
2019 ◽  
Vol 20 (S9) ◽  
Author(s):  
Chaowang Lan ◽  
Hui Peng ◽  
Gyorgy Hutvagner ◽  
Jinyan Li

Abstract Background: A long noncoding RNA (lncRNA) can act as a competing endogenous RNA (ceRNA) to compete with an mRNA for binding to the same miRNA. Such an interplay between the lncRNA, miRNA, and mRNA is called a ceRNA crosstalk. As an miRNA may have multiple lncRNA targets and multiple mRNA targets, connecting all the ceRNA crosstalks mediated by the same miRNA forms a ceRNA network. Methods have been developed to construct ceRNA networks in the literature. However, these methods have limits because they have not explored the expression characteristics of total RNAs. Results: We proposed a novel method for constructing ceRNA networks and applied it to a paired RNA-seq data set. The first step of the method takes a competition regulation mechanism to derive candidate ceRNA crosstalks. Second, the method combines a competition rule and pointwise mutual information to compute a competition score for each candidate ceRNA crosstalk. Then, ceRNA crosstalks which have significant competition scores are selected to construct the ceRNA network. The key idea, pointwise mutual information, is ideally suitable for measuring the complex point-to-point relationships embedded in the ceRNA networks. Conclusion: Computational experiments and results demonstrate that the ceRNA networks can capture important regulatory mechanisms of breast cancer, and have also revealed new insights into the treatment of breast cancer. The proposed method can be directly applied to other RNA-seq data sets for deeper disease understanding.
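Pointwise mutual information itself is simple to compute from co-occurrence counts, as the minimal sketch below shows. It covers only the generic PMI quantity; the paper's competition rule and the way PMI enters the competition score are not reproduced, and the function name and input format are hypothetical.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Pointwise mutual information, in bits, for each observed pair.

    pairs: list of (x, y) observations, e.g. one entry per sample in
    which item x and item y are seen together (illustrative input format).
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    """
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
    return {(x, y): math.log2(c * n / (left[x] * right[y]))
            for (x, y), c in joint.items()}
```

A positive value means the pair occurs together more often than expected if the two items were independent, a negative value means less often, and zero corresponds to independence.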


Entropy ◽  
2020 ◽  
Vol 22 (3) ◽  
pp. 351
Author(s):  
Nezamoddin N. Kachouie ◽  
Meshal Shutaywi

Background: A common task in machine learning is clustering data into different groups based on similarities. Clustering methods can be divided into two groups: linear and nonlinear. A commonly used linear clustering method is K-means. Its extension, kernel K-means, is a nonlinear technique that utilizes a kernel function to project the data to a higher dimensional space. The projected data will then be clustered in different groups. Different kernels do not perform similarly when they are applied to different datasets. Methods: A kernel function might be relevant for one application but perform poorly to project data for another application. In turn, choosing the right kernel for an arbitrary dataset is a challenging task. To address this challenge, a potential approach is aggregating the clustering results to obtain an impartial clustering result regardless of the selected kernel function. To this end, the main challenge is how to aggregate the clustering results. A potential solution is to combine the clustering results using a weight function. In this work, we introduce Weighted Mutual Information (WMI) for calculating the weights for different clustering methods based on their performance to combine the results. The performance of each method is evaluated using a training set with known labels. Results: We applied the proposed Weighted Mutual Information to four data sets that cannot be linearly separated. We also tested the method in different noise conditions. Conclusions: Our results show that the proposed Weighted Mutual Information method is impartial, does not rely on a single kernel, and performs better than each individual kernel, especially in high noise.


2011 ◽  
Vol 19 (04) ◽  
pp. 725-746 ◽  
Author(s):  
P. Ganesh Kumar ◽  
T. Aruldoss Albert Victoire

An important issue in the design of gene selection algorithms for microarray data analysis is the formation of a suitable criterion function for measuring the relevance between different gene expressions. Mutual information (MI) is a widely used criterion function, but it calculates the relevance on the entire set of samples only once, which cannot exactly identify the informative genes. This paper proposes a novel idea of computing MI in stages. The proposed multistage mutual information (MSMI) first computes MI using all the samples; then, based on the classification performance produced by an artificial neural network (ANN), MI is repeatedly calculated using only the unclassified samples until there is no improvement in the classification accuracy. The performance of the proposed approach is evaluated using ten gene expression data sets. Simulation results show that the proposed approach helps to improve the discriminative power of the genes with regard to the target disease of a microarray sample. Statistical analysis of the test results shows that the proposed method selects highly informative genes and produces classification accuracy comparable to that of the other approaches reported in the literature.

