Do galactic bars depend on environment?: an information theoretic analysis of Galaxy Zoo 2

Suman Sarkar; Biswajit Pandey; Snehasish Bhattacharjee

doi:10.1093/mnras/staa3665

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa3665 ◽

2020 ◽

Vol 501 (1) ◽

pp. 994-1001

Author(s):

Suman Sarkar ◽

Biswajit Pandey ◽

Snehasish Bhattacharjee

Keyword(s):

Spatial Distribution ◽

Mutual Information ◽

Local Density ◽

Statistical Significance ◽

Distribution Functions ◽

Cumulative Distribution ◽

Host Galaxy ◽

Data Sets ◽

Data Set ◽

Information Theoretic

ABSTRACT We use an information theoretic framework to analyse data from the Galaxy Zoo 2 project and study whether there are any statistically significant correlations between the presence of bars in spiral galaxies and their environment. We measure the mutual information between the barredness of galaxies and their environments in a volume limited sample (M_r ≤ −21) and compare it with the same in data sets where (i) the bar/unbar classifications are randomized and (ii) the spatial distribution of galaxies is shuffled on different length scales. We assess the statistical significance of the differences in the mutual information using a t-test and find that both randomization of morphological classifications and shuffling of spatial distribution do not alter the mutual information in a statistically significant way. The non-zero mutual information between the barredness and environment arises due to the finite and discrete nature of the data set that can be entirely explained by mock Poisson distributions. We also separately compare the cumulative distribution functions of the barred and unbarred galaxies as a function of their local density. Using a Kolmogorov–Smirnov test, we find that the null hypothesis cannot be rejected even at 75 per cent confidence level. Our analysis indicates that environments do not play a significant role in the formation of a bar, which is largely determined by the internal processes of the host galaxy.
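The comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the plug-in estimator, the five density bins, the 30 per cent bar fraction and the names `mutual_information` and `shuffled_mi` are all assumptions made for illustration. The synthetic labels are drawn independently of the environment, so any non-zero mutual information here comes only from the finite, discrete sample.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def shuffled_mi(xs, ys, n_shuffles, rng):
    """Mutual information after randomly permuting the labels ys."""
    labels = list(ys)
    values = []
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        values.append(mutual_information(xs, labels))
    return values


# Hypothetical data: a binned local density and a bar flag that is
# independent of it by construction.
rng = random.Random(0)
density_bin = [rng.randrange(5) for _ in range(2000)]
barred = [rng.random() < 0.3 for _ in range(2000)]

mi_observed = mutual_information(density_bin, barred)
mi_randomized = shuffled_mi(density_bin, barred, 50, rng)
print(mi_observed, sum(mi_randomized) / len(mi_randomized))
```

In this toy setting the observed and randomized values are expected to be of similar size, which is the situation the abstract reports for barredness and environment.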

A study on the statistical significance of mutual information between morphology of a galaxy and its large-scale environment

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa2236 ◽

2020 ◽

Vol 497 (4) ◽

pp. 4077-4090 ◽

Cited By ~ 1

Author(s):

Suman Sarkar ◽

Biswajit Pandey

Keyword(s):

Mutual Information ◽

Large Scale ◽

Statistical Significance ◽

Sloan Digital Sky Survey ◽

Data Sets ◽

Information Theoretic ◽

Galaxy Distribution ◽

Sky Survey ◽

The Galaxy ◽

Physical Correlations

ABSTRACT A non-zero mutual information between morphology of a galaxy and its large-scale environment is known to exist in Sloan Digital Sky Survey (SDSS) up to a few tens of Mpc. It is important to test the statistical significance of this mutual information. We propose three different methods to test the statistical significance of this non-zero mutual information and apply them to SDSS and the Millennium Run simulation. We randomize the morphological information of SDSS galaxies without affecting their spatial distribution and compare the mutual information in the original and randomized data sets. We also divide the galaxy distribution into smaller subcubes and randomly shuffle them many times keeping the morphological information of galaxies intact. We compare the mutual information in the original SDSS data and its shuffled realizations for different shuffling lengths. Using a t-test, we find that a small but statistically significant (at 99.9 per cent confidence level) mutual information between morphology and environment exists up to the entire length-scale probed. We also conduct another experiment using mock data sets from a semi-analytic galaxy catalogue where we assign morphology to galaxies in a controlled manner based on the density at their locations. The experiment clearly demonstrates that mutual information can effectively capture the physical correlations between morphology and environment. Our analysis suggests that physical association between morphology and environment may extend to much larger length-scales than currently believed, and the information theoretic framework presented here can serve as a sensitive and useful probe of the assembly bias and large-scale environmental dependence of galaxy properties.
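The subcube shuffling step mentioned in this abstract might look roughly like the sketch below. It assumes a cubic box split into `n_s**3` equal subcubes that are permuted by pure translation; the abstract does not say whether the authors also rotate the subcubes, and the function name, box size and point format `(x, y, z, label)` are hypothetical.

```python
import random


def shuffle_subcubes(points, box, n_s, rng):
    """Randomly permute the n_s**3 subcubes of a cubic box of side `box`.

    Each point (x, y, z, label) moves with its subcube, so positions and
    labels stay attached inside a subcube while any coherence on scales
    larger than box / n_s is destroyed.
    """
    side = box / n_s
    cells = [(i, j, k)
             for i in range(n_s) for j in range(n_s) for k in range(n_s)]
    targets = cells[:]
    rng.shuffle(targets)
    move = dict(zip(cells, targets))  # source cell -> destination cell

    shuffled = []
    for x, y, z, label in points:
        # Index of the subcube holding the point (clamped at the far edge).
        src = tuple(min(int(c // side), n_s - 1) for c in (x, y, z))
        dst = move[src]
        new_pos = tuple(c + (d - s) * side
                        for c, s, d in zip((x, y, z), src, dst))
        shuffled.append(new_pos + (label,))
    return shuffled


# Hypothetical catalogue: uniform positions with a binary morphology label.
rng = random.Random(42)
catalogue = [(rng.uniform(0, 100), rng.uniform(0, 100), rng.uniform(0, 100),
              i % 2) for i in range(1000)]
shuffled_catalogue = shuffle_subcubes(catalogue, 100.0, 4, rng)
```

Repeating this for several values of `n_s` gives shuffled realizations at different shuffling lengths, whose mutual information can then be compared with the unshuffled value.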

Supposed Maximum Mutual Information for Improving Generalization and Interpretation of Multi-Layered Neural Networks

Journal of Artificial Intelligence and Soft Computing Research ◽

10.2478/jaiscr-2018-0029 ◽

2019 ◽

Vol 9 (2) ◽

pp. 123-147 ◽

Cited By ~ 5

Author(s):

Ryotaro Kamimura

Keyword(s):

Neural Networks ◽

Mutual Information ◽

Data Sets ◽

Data Set ◽

Information Theoretic ◽

Information Maximization ◽

Maximum Mutual Information ◽

Information Theoretic Method ◽

Mutual Information Maximization ◽

Inputs And Outputs

Abstract The present paper aims to propose a new type of information-theoretic method to maximize mutual information between inputs and outputs. The importance of mutual information in neural networks is well known, but the actual implementation of mutual information maximization has been quite difficult to undertake. In addition, mutual information has not extensively been used in neural networks, meaning that its applicability is very limited. To overcome the shortcoming of mutual information maximization, we present it here in a very simplified manner by supposing that mutual information is already maximized before learning, or at least at the beginning of learning. The method was applied to three data sets (crab data set, wholesale data set, and human resources data set) and examined in terms of generalization performance and connection weights. The results showed that by disentangling connection weights, maximizing mutual information made it possible to explicitly interpret the relations between inputs and outputs.

UPPER-TRUNCATED POWER LAW DISTRIBUTIONS

Fractals ◽

10.1142/s0218348x01000658 ◽

2001 ◽

Vol 09 (02) ◽

pp. 209-222 ◽

Cited By ~ 21

Author(s):

STEPHEN M. BURROUGHS ◽

SARAH F. TEBBENS

Keyword(s):

Power Law ◽

Distribution Functions ◽

Cumulative Distribution ◽

Generalized Function ◽

Data Sets ◽

Scaling Exponent ◽

Cumulative Number ◽

Size Distributions ◽

Data Set ◽

Binned Data

Power law cumulative number-size distributions are widely used to describe the scaling properties of data sets and to establish scale invariance. We derive the relationships between the scaling exponents of non-cumulative and cumulative number-size distributions for linearly binned and logarithmically binned data. Cumulative number-size distributions for data sets of many natural phenomena exhibit a "fall-off" from a power law at the largest object sizes. Previous work has often either ignored the fall-off region or described this region with a different function. We demonstrate that when a data set is abruptly truncated at large object size, fall-off from a power law is expected for the cumulative distribution. Functions to describe this fall-off are derived for both linearly and logarithmically binned data. These functions lead to a generalized function, the upper-truncated power law, that is independent of binning method. Fitting the upper-truncated power law to a cumulative number-size distribution determines the parameters of the power law, thus providing the scaling exponent of the data. Unlike previous approaches that employ alternate functions to describe the fall-off region, an upper-truncated power law describes the data set, including the fall-off, with a single function.
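The fall-off described in this abstract follows from a short calculation. The notation below (number density n(r), constant C, exponent α, truncation size r_T) is assumed for illustration and is not quoted from the abstract. If the non-cumulative distribution is a pure power law that stops abruptly at r_T, integrating it from r up to r_T gives a cumulative count that bends below the power law as r approaches r_T:

```latex
% Assumed notation: n(r) = number density of objects of size r,
% N(r) = number of objects with size greater than r, r_T = truncation size.
\begin{align}
  n(r) &= \alpha\, C\, r^{-\alpha-1}, \qquad 0 < r \le r_T, \\
  N(r) &= \int_{r}^{r_T} n(r')\,\mathrm{d}r'
        = C\left(r^{-\alpha} - r_T^{-\alpha}\right), \\
  \lim_{r_T \to \infty} N(r) &= C\, r^{-\alpha}.
\end{align}
```

For r much smaller than r_T the second term is negligible and the usual power law is recovered, while N(r) drops to zero at r = r_T.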

Children with 5′-end NF1 gene mutations are more likely to have glioma

Neurology Genetics ◽

10.1212/nxg.0000000000000192 ◽

2017 ◽

Vol 3 (5) ◽

pp. e192 ◽

Cited By ~ 12

Author(s):

Corina Anastasaki ◽

Stephanie M. Morris ◽

Feng Gao ◽

David H. Gutmann

Keyword(s):

Gene Mutation ◽

Statistical Significance ◽

Gene Mutations ◽

Neurofibromatosis Type ◽

Published Data ◽

Data Sets ◽

Nonsense Mutations ◽

Data Set ◽

Nf1 Gene ◽

The Relationship

Objective: To ascertain the relationship between the germline NF1 gene mutation and glioma development in patients with neurofibromatosis type 1 (NF1). Methods: The relationship between the type and location of the germline NF1 mutation and the presence of a glioma was analyzed in 37 participants with NF1 from one institution (Washington University School of Medicine [WUSM]) with a clinical diagnosis of NF1. Odds ratios (ORs) were calculated using both unadjusted and weighted analyses of this data set in combination with 4 previously published data sets. Results: While no statistical significance was observed between the location and type of the NF1 mutation and glioma in the WUSM cohort, power calculations revealed that a sample size of 307 participants would be required to determine the predictive value of the position or type of the NF1 gene mutation. Combining our data set with 4 previously published data sets (n = 310), children with glioma were found to be more likely to harbor 5′-end gene mutations (OR = 2; p = 0.006). Moreover, while not clinically predictive due to insufficient sensitivity and specificity, this association with glioma was stronger for participants with 5′-end truncating (OR = 2.32; p = 0.005) or 5′-end nonsense (OR = 3.93; p = 0.005) mutations relative to those without glioma. Conclusions: Individuals with NF1 and glioma are more likely to harbor nonsense mutations in the 5′ end of the NF1 gene, suggesting that the NF1 mutation may be one predictive factor for glioma in this at-risk population.

Abstract WP291: Stochastic Methods Can Resolve the Dilemma of Emergency Stroke Transport

Stroke ◽

10.1161/str.51.suppl_1.wp291 ◽

2020 ◽

Vol 51 (Suppl_1) ◽

Author(s):

Daniel A Paydarfar ◽

David Paydarfar ◽

Peter J Mucha ◽

Joshua Chang

Keyword(s):

Census Data ◽

Infarct Volume ◽

Statistical Significance ◽

Distribution Functions ◽

Cumulative Distribution ◽

Practical Significance ◽

Stochastic Methods ◽

Model Parameters ◽

Collateral Flow ◽

Acute Stroke Care

Introduction: Drip and Ship (DNS) and Mothership (MS) are well-known emergency transport strategies in acute stroke care, but the criteria for choosing between the two are widely debated. Existing models define time-dependent outcomes but cannot resolve this debate with statistical significance because the independent variables are deterministic. We propose a novel stochastic framework that quantifies statistical significance between DNS and MS in a network of primary and comprehensive stroke centers. Methods: We represented the physiology of ischemic core growth as a stochastic first-order differential equation, enabling infarct volume at time of reperfusion to be calculated and mapped to 90-day mRS. Using Texas as a case study, we configured the state’s stroke network within 15,811 geographic blocks as defined by census data. For each block, we ran Monte Carlo simulations to generate Beta distributions of large- and small-vessel infarct volumes, which were then translated into cumulative distribution functions of mRS. A two-sample Kolmogorov-Smirnov test for significance and Cohen’s d effect size statistic for practical significance were computed between each DNS and MS pair. Stable effect sizes were assured by sampling > 5,000 total infarct volumes for each block. All model parameters were established from large cohort studies or trials. Results: Of the 13,113 blocks where the primary stroke center is the closest hospital from origin, DNS produces significantly better stroke outcomes than MS in 79.0% (0.3% SEM; P < 0.05; 0.2 < d < 0.5). For the subset of patients with large-vessel strokes, MS produces significantly better outcomes in 44.6% of blocks (1.3% SEM; P < 0.05; 0.4 < d < 0.85). Conclusion: Stochastic methods enable the use of clinically relevant metrics for comparative significance of DNS and MS in a geographic region. This formalism, which has not been incorporated in previous models, can be further generalized beyond stochastic infarct volumes if sufficiently large datasets become available. For example, the kinetic growth model can integrate the statistical distributions of times (pre-hospital and hospital) leading up to intervention, and patient attributes that affect outcomes, such as the degree of collateral flow and comorbidities.
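The two comparison statistics named in this abstract can be sketched in plain Python. This is a generic illustration, not the authors' model: the Beta shape parameters, the sample size of 5,000 and the function names are assumptions, and the samples stand in for simulated outcomes under the two transport strategies.

```python
import math
import random
import statistics


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical cumulative distribution functions of a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value equal to x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def cohens_d(a, b):
    """Cohen's d: difference of means in units of the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Hypothetical Monte Carlo draws for one geographic block.
rng = random.Random(1)
outcome_dns = [rng.betavariate(2.0, 5.0) for _ in range(5000)]
outcome_ms = [rng.betavariate(2.5, 5.0) for _ in range(5000)]

print(ks_statistic(outcome_dns, outcome_ms), cohens_d(outcome_dns, outcome_ms))
```

The KS statistic addresses whether the two distributions differ at all, while Cohen's d expresses how large the shift is, which is the split between statistical and practical significance that the abstract draws.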

Monofractal or multifractal: a case study of spatial distribution of mining-induced seismic activity

Nonlinear Processes in Geophysics ◽

10.5194/npg-1-182-1994 ◽

1994 ◽

Vol 1 (2/3) ◽

pp. 182-190 ◽

Cited By ~ 16

Author(s):

M. Eneva

Keyword(s):

Spatial Distribution ◽

Seismic Activity ◽

Real Data ◽

Data Sets ◽

Point Sets ◽

Data Set ◽

Limited Size ◽

The Real ◽

Induced Seismic Activity ◽

Generalized Correlation

Abstract. Using finite data sets and limited size of study volumes may result in significant spurious effects when estimating the scaling properties of various physical processes. These effects are examined with an example featuring the spatial distribution of induced seismic activity in Creighton Mine (northern Ontario, Canada). The events studied in the present work occurred during a three-month period, March-May 1992, within a volume of approximate size 400 × 400 × 180 m³. Two sets of microearthquake locations are studied: Data Set 1 (14,338 events) and Data Set 2 (1654 events). Data Set 1 includes the more accurately located events and amounts to about 30 per cent of all recorded data. Data Set 2 represents a portion of the first data set that is formed by the most accurately located and the strongest microearthquakes. The spatial distribution of events in the two data sets is examined for scaling behaviour using the method of generalized correlation integrals featuring various moments q. From these, generalized correlation dimensions are estimated using the slope method. Similar estimates are made for randomly generated point sets using the same numbers of events and the same study volumes as for the real data. Uniform and monofractal random distributions are used for these simulations. In addition, samples from the real data are randomly extracted and the dimension spectra for these are examined as well. The spectra for the uniform and monofractal random generations show spurious multifractality due only to the use of finite numbers of data points and limited size of study volume. Comparing these with the spectra of dimensions for Data Set 1 and Data Set 2 allows us to estimate the bias likely to be present in the estimates for the real data. The strong multifractality suggested by the spectrum for Data Set 2 appears to be largely spurious; the spatial distribution, while different from uniform, could originate from a monofractal process. The spatial distribution of microearthquakes in Data Set 1 is either monofractal as well, or only weakly multifractal. In all similar studies, comparisons of results from real data and simulated point sets may help distinguish between genuine and artificial multifractality, without necessarily resorting to large amounts of data.
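A generalized correlation integral and a two-radius slope estimate of the kind this abstract refers to can be sketched as follows. The definition used, C_q(r) = [mean_i (n_i(r)/(N-1))^(q-1)]^(1/(q-1)) with n_i(r) the number of other points within distance r of point i, is one common convention and is assumed here rather than taken from the paper; the sketch is restricted to q ≥ 2, and the test point set and radii are hypothetical.

```python
import math
import random


def correlation_integral(points, r, q=2):
    """Generalized correlation integral C_q(r) for q >= 2 (brute force, O(N^2))."""
    n = len(points)
    total = 0.0
    for i, p in enumerate(points):
        # Number of other points closer than r to point i.
        near = sum(1 for j, s in enumerate(points)
                   if j != i and math.dist(p, s) < r)
        total += (near / (n - 1)) ** (q - 1)
    return (total / n) ** (1.0 / (q - 1))


def slope_dimension(points, r_small, r_large, q=2):
    """Estimate D_q as the log-log slope of C_q(r) between two radii."""
    c_small = correlation_integral(points, r_small, q)
    c_large = correlation_integral(points, r_large, q)
    return math.log(c_large / c_small) / math.log(r_large / r_small)


# Hypothetical check: random points on a line segment embedded in 3D
# should give a dimension close to 1, up to finite-size bias.
rng = random.Random(7)
line_points = [(rng.random(), 0.0, 0.0) for _ in range(300)]
print(slope_dimension(line_points, 0.02, 0.1))
```

Running the same estimator on simulated uniform or monofractal sets with the sample size and volume of the real data gives a baseline for the spurious spread of D_q with q that the abstract attributes to finite numbers of points and limited study volume.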

Information Graphs for Binary Predictors

Phytopathology ◽

10.1094/phyto-02-14-0044-r ◽

2015 ◽

Vol 105 (1) ◽

pp. 9-17 ◽

Cited By ~ 5

Author(s):

G. Hughes ◽

N. McRoberts ◽

F. J. Burnett

Keyword(s):

Mutual Information ◽

Crop Protection ◽

Diagnostic Information ◽

Specific Information ◽

Data Set ◽

Information Theoretic ◽

Information Updating ◽

Entropy Information ◽

Wide Range ◽

Crop Disease

Binary predictors are used in a wide range of crop protection decision-making applications. Such predictors provide a simple analytical apparatus for the formulation of evidence related to risk factors, for use in the process of Bayesian updating of probabilities of crop disease. For diagrammatic interpretation of diagnostic probabilities, the receiver operating characteristic is available. Here, we view binary predictors from the perspective of diagnostic information. After a brief introduction to the basic information theoretic concepts of entropy and expected mutual information, we use an example data set to provide diagrammatic interpretations of expected mutual information, relative entropy, information inaccuracy, information updating, and specific information. Our information graphs also illustrate correspondences between diagnostic information and diagnostic probabilities.

MINKOWSKI FUNCTIONALS AND CLUSTER ANALYSIS FOR CMB MAPS

International Journal of Modern Physics D ◽

10.1142/s0218271899000225 ◽

1999 ◽

Vol 08 (03) ◽

pp. 291-306 ◽

Cited By ~ 45

Author(s):

D. NOVIKOV ◽

HUME A. FELDMAN ◽

SERGEI F. SHANDARIN

Keyword(s):

Large Data ◽

Threshold Level ◽

Distribution Functions ◽

Large Data Sets ◽

Cumulative Distribution ◽

Temperature Threshold ◽

Data Sets ◽

Minkowski Functionals ◽

And Cluster Analysis ◽

Non Gaussian

We suggest novel statistics for the CMB maps that are sensitive to non-Gaussian features. These statistics are natural generalizations of the geometrical and topological methods that have been already used in cosmology such as the cumulative distribution function and genus. We compute the distribution functions of the Partial Minkowski Functionals for the excursion set above or below a constant temperature threshold. Minkowski Functionals are additive and are translationally and rotationally invariant. Thus, they can be used for patchy and/or incomplete coverage. The technique is highly efficient computationally (it requires only O(N) operations, where N is the number of pixels per threshold level). Further, the procedure makes it possible to split large data sets into smaller subsets. The full advantage of these statistics can be obtained only on very large data sets. We apply it to the 4-year DMR COBE data corrected for the Galaxy contamination as an illustration of the technique.

Statistical analysis of water vapour and ozone in the UT/LS observed during SPURT and MOZAIC

Atmospheric Chemistry and Physics ◽

10.5194/acp-8-6603-2008 ◽

2008 ◽

Vol 8 (22) ◽

pp. 6603-6615 ◽

Cited By ~ 21

Author(s):

A. Kunz ◽

C. Schiller ◽

F. Rohrer ◽

H. G. J. Smit ◽

P. Nedelec ◽

...

Keyword(s):

Statistical Analysis ◽

Water Vapour ◽

Trace Gases ◽

Distribution Functions ◽

Variance Analysis ◽

Trace Gas ◽

Data Sets ◽

Data Set ◽

Atmospheric Processes ◽

Passenger Aircraft

Abstract. A statistical analysis for the comparability of water (H2O) and ozone (O3) data sets sampled during the SPURT aircraft campaigns and the MOZAIC passenger aircraft flights is presented. The Kolmogoroff-Smirnoff test reveals that the distribution functions from SPURT and MOZAIC trace gases differ from each other with a confidence of 95%. A variance analysis shows a different variability character in both trace gas data sets. While the SPURT H2O data only contain atmospheric processes variable on a diurnal or synoptical timescale, MOZAIC H2O data also reveal processes that vary on inter-seasonal and seasonal timescales. The SPURT H2O data set does not represent the full MOZAIC H2O variance in the UT/LS for climatological investigations, whereas the variance of O3 is much better represented. SPURT H2O data are better suited in the stratosphere, where the MOZAIC RH sensor loses its sensitivity.

Supernova Host Galaxy Association and Photometric Classification of over 10,000 Light Curves from the Zwicky Transient Facility

Research Notes of the AAS ◽

10.3847/2515-5172/ac416e ◽

2021 ◽

Vol 5 (12) ◽

pp. 283

Author(s):

Braden Garretson ◽

Dan Milisavljevic ◽

Jack Reynolds ◽

Kathryn E. Weil ◽

Bhagya Subrayan ◽

...

Keyword(s):

Value Added ◽

Light Curves ◽

Host Galaxy ◽

Massive Data ◽

Data Sets ◽

Data Set ◽

Scale Modeling ◽

Final Data ◽

Type Ia

Abstract Here we present a catalog of 12,993 photometrically-classified supernova-like light curves from the Zwicky Transient Facility, along with candidate host galaxy associations. By training a random forest classifier on spectroscopically classified supernovae from the Bright Transient Survey, we achieve an accuracy of 80% across four supernova classes resulting in a final data set of 8208 Type Ia, 2080 Type II, 1985 Type Ib/c, and 720 SLSN. Our work represents a pathfinder effort to supply massive data sets of supernova light curves with value-added information that can be used to enable population-scale modeling of explosion parameters and investigate host galaxy environments.