Shrinkage observed-to-expected ratios for robust and transparent large-scale pattern discovery

2011 ◽  
Vol 22 (1) ◽  
pp. 57-69 ◽  
Author(s):  
G Niklas Norén ◽  
Johan Hopstadius ◽  
Andrew Bate

Large observational data sets are a great asset to better understand the effects of medicines in clinical practice and, ultimately, improve patient care. For an empirical pattern in observational data to be of practical relevance, it should represent a substantial deviation from the null model. For the purpose of identifying such deviations, statistical significance tests are inadequate, as they do not on their own distinguish the magnitude of an effect from its data support. The observed-to-expected (OE) ratio on the other hand directly measures strength of association and is an intuitive basis to identify a range of patterns related to event rates, including pairwise associations, higher order interactions and temporal associations between events over time. It is sensitive to random fluctuations for rare events with low expected counts but statistical shrinkage can protect against spurious associations. Shrinkage OE ratios provide a simple but powerful framework for large-scale pattern discovery. In this article, we outline a range of patterns that are naturally viewed in terms of OE ratios and propose a straightforward and effective statistical shrinkage transformation that can be applied to any such ratio. The proposed approach retains emphasis on the practical relevance and transparency of highlighted patterns, while protecting against spurious associations.
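The effect of shrinkage on an OE ratio can be illustrated with a minimal sketch. The abstract does not state the transformation, so the additive form below, with the same constant `k` added to the observed and expected counts and a default of 0.5, is an assumption made for illustration only; the function names are hypothetical.

```python
import math


def shrunk_oe(observed, expected, k=0.5):
    """Shrunk observed-to-expected ratio (assumed additive form).

    Adding the same constant k to both counts pulls the ratio toward 1
    when the expected count is small, and changes it little when both
    counts are large. k = 0.5 is an assumption, not taken from the abstract.
    """
    if observed < 0 or expected < 0 or k <= 0:
        raise ValueError("counts must be non-negative and k positive")
    return (observed + k) / (expected + k)


def shrunk_log2_oe(observed, expected, k=0.5):
    """Base-2 logarithm of the shrunk ratio; 0 means no deviation from the null."""
    return math.log2(shrunk_oe(observed, expected, k))


# A single event against a tiny expected count: the raw ratio is 100,
# while the shrunk ratio stays below 3.
rare = shrunk_oe(1, 0.01)

# Well-supported counts: the shrunk ratio stays close to the raw ratio of 10.
common = shrunk_oe(100, 10)
```

With this form, a pattern is only highlighted when the ratio is large and the counts are large enough to outweigh the added constant, which matches the abstract's emphasis on separating effect magnitude from data support.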

2020 ◽  
Vol 497 (4) ◽  
pp. 4077-4090 ◽  
Author(s):  
Suman Sarkar ◽  
Biswajit Pandey

ABSTRACT A non-zero mutual information between the morphology of a galaxy and its large-scale environment is known to exist in the Sloan Digital Sky Survey (SDSS) up to a few tens of Mpc. It is important to test the statistical significance of this mutual information, if any. We propose three different methods to test the statistical significance of this non-zero mutual information and apply them to SDSS and the Millennium Run simulation. We randomize the morphological information of SDSS galaxies without affecting their spatial distribution and compare the mutual information in the original and randomized data sets. We also divide the galaxy distribution into smaller subcubes and randomly shuffle them many times, keeping the morphological information of galaxies intact. We compare the mutual information in the original SDSS data and its shuffled realizations for different shuffling lengths. Using a t-test, we find that a small but statistically significant (at the 99.9 per cent confidence level) mutual information between morphology and environment exists across the entire length-scale probed. We also conduct another experiment using mock data sets from a semi-analytic galaxy catalogue, where we assign morphology to galaxies in a controlled manner based on the density at their locations. The experiment clearly demonstrates that mutual information can effectively capture the physical correlations between morphology and environment. Our analysis suggests that the physical association between morphology and environment may extend to much larger length-scales than currently believed, and that the information theoretic framework presented here can serve as a sensitive and useful probe of the assembly bias and large-scale environmental dependence of galaxy properties.
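The randomization test described above can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline: it assumes both morphology and environment have already been reduced to discrete labels (for example, a morphology class and an index of the spatial cell a galaxy falls in), and the function names and the seed are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length label sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def randomized_mi(morphology, environment, n_trials=100, seed=0):
    """Mutual information after shuffling morphology labels among galaxies.

    Shuffling keeps the positions (environment labels) fixed and destroys
    any physical link between the two variables, so the returned values
    estimate the mutual information expected from finite sampling alone.
    """
    rng = random.Random(seed)
    shuffled = list(morphology)
    values = []
    for _ in range(n_trials):
        rng.shuffle(shuffled)
        values.append(mutual_information(shuffled, environment))
    return values
```

The observed value from `mutual_information` can then be compared with the distribution returned by `randomized_mi`, for instance with a t-test as in the abstract.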


2018 ◽  
Author(s):  
Miraine Dávila Felipe ◽  
Jean-Baka Domelevo Entfellner ◽  
Frédéric Lemoine ◽  
Jakub Truszkowski ◽  
Olivier Gascuel

Abstract The transfer distance (TD) was introduced in the classification framework and studied in the context of phylogenetic tree matching. Recently, Lemoine et al. (2018) showed that TD can be a powerful tool to assess the branch support of phylogenies with large data sets, thus providing a relevant alternative to Felsenstein’s bootstrap. This distance allows a reference branch β in a reference tree 𝒯 to be compared to a branch b from another tree T, both on the same set of n taxa. The TD between these branches is the number of taxa that must be transferred from one side of b to the other in order to obtain β. By taking the minimum TD from β to all branches in T we define the transfer index, denoted by ϕ(β, T), measuring the degree of agreement of β with T. Let us consider a reference branch β having p tips on its light side and define the transfer support (TS) as 1 – ϕ(β, T)/(p – 1). The aim of this article is to provide evidence that p – 1 is a meaningful normalization constant in the definition of TS, and to measure the statistical significance of TS, assuming that β is compared to a tree T drawn according to a null model. We obtain several results that shed light on these questions in a number of settings. In particular, we study the asymptotic behavior of TS when n tends to ∞, and fully characterize the distribution of ϕ when T is a caterpillar tree.
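The definitions of TD, ϕ and TS above translate directly into set operations. The sketch below assumes each branch is represented by the set of taxa on one of its sides, and that the caller supplies the branches of T as a list of such sets; this representation and the function names are illustrative choices, not taken from the article.

```python
def transfer_distance(ref_side, other_side, taxa):
    """Number of taxa to move across branch b to turn its bipartition into beta's.

    A bipartition has no preferred orientation, so both sides of b are
    tried and the smaller count is kept.
    """
    all_taxa = frozenset(taxa)
    ref = frozenset(ref_side)
    other = frozenset(other_side)
    direct = len(ref ^ other)                # symmetric difference with one side of b
    flipped = len(ref ^ (all_taxa - other))  # symmetric difference with the other side
    return min(direct, flipped)


def transfer_index(ref_side, tree_branches, taxa):
    """phi(beta, T): minimum transfer distance from beta to any branch of T."""
    return min(transfer_distance(ref_side, b, taxa) for b in tree_branches)


def transfer_support(ref_side, tree_branches, taxa):
    """TS = 1 - phi(beta, T) / (p - 1), with p the size of beta's light side."""
    all_taxa = frozenset(taxa)
    ref = frozenset(ref_side)
    p = min(len(ref), len(all_taxa - ref))
    if p < 2:
        raise ValueError("TS is undefined when the light side has fewer than 2 tips")
    return 1 - transfer_index(ref, tree_branches, all_taxa) / (p - 1)


# Hypothetical example on six taxa: beta separates {a, b, c} from {d, e, f}.
taxa = set("abcdef")
beta = {"a", "b", "c"}
branches_of_T = [{"a", "b"}, {"a", "b", "d"}, {"e", "f"}]
support = transfer_support(beta, branches_of_T, taxa)
```

In this toy example the closest branches of T differ from β by a single taxon, so ϕ = 1 and, with p = 3, the support is 1 – 1/2.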


2009 ◽  
Vol 5 (H15) ◽  
pp. 767-767
Author(s):  
C. Pinte ◽  
F. Ménard ◽  
G. Duchěne ◽  
J. C. Augereau

A wide range of high-quality data is becoming available for protoplanetary disks. From these data sets many issues have already been addressed, such as constraining the large-scale geometry of disks, finding evidence of dust grain evolution, as well as constraining the kinematics and physico-chemical conditions of the gas phase. Most of these results are based on models that emphasise fitting observations of either the dust component (SEDs or scattered light images or, more recently, interferometric visibilities), or the gas phase (resolved maps in molecular lines). In this contribution, we present a more global approach which aims at consistently interpreting the increasing amount of observational data in the framework of a single model, in order to better characterize both the dust population and the gas disk properties, as well as their interactions. We present results of such modeling applied to a few disks (e.g. IM Lup, see Figure) with large observational data sets available (scattered light images, polarisation maps, IR spectroscopy, X-ray spectrum, CO maps). These kinds of multi-wavelength studies will become very powerful in the context of forthcoming instruments such as Herschel and ALMA.


2019 ◽  
Author(s):  
Samuel Pawel ◽  
Leonhard Held

Throughout the last decade, the so-called replication crisis has stimulated many researchers to conduct large-scale replication projects. With data from four of these projects, we computed probabilistic forecasts of the replication outcomes, which we then evaluated regarding discrimination, calibration and sharpness. A novel model, which can take into account both inflation and heterogeneity of effects, was used and predicted the effect estimate of the replication study with good performance in two of the four data sets. In the other two data sets, predictive performance was still substantially improved compared to the naive model which does not consider inflation and heterogeneity of effects. The results suggest that many of the estimates from the original studies were inflated, possibly caused by publication bias or questionable research practices, and also that some degree of heterogeneity between original and replication effects should be expected. Moreover, the results indicate that the use of statistical significance as the only criterion for replication success may be questionable, since from a predictive viewpoint, non-significant replication results are often compatible with significant results from the original study. The developed statistical methods as well as the data sets are available in the R package ReplicationSuccess.


2017 ◽  
Vol 17 (18) ◽  
pp. 11193-11207 ◽  
Author(s):  
Masakazu Taguchi

Abstract. This study compares large-scale dynamical variability in the extratropical stratosphere, such as major stratospheric sudden warmings (MSSWs), among the Japanese 55-year Reanalysis (JRA-55) family data sets. The JRA-55 family consists of three products: a standard product (STDD) of the JRA-55 reanalysis data and two sub-products of JRA-55C (CONV) and JRA-55AMIP (AMIP). CONV assimilates only conventional surface and upper-air observations without assimilation of satellite observations, whereas AMIP runs the same numerical weather prediction model without assimilation of observational data. A comparison of the occurrence of MSSWs in Northern Hemisphere (NH) winter shows that, compared to STDD, CONV delays several MSSWs by 1 to 4 days and also misses a few MSSWs. CONV also misses the Southern Hemisphere (SH) MSSW in September 2002. AMIP shows significantly fewer MSSWs in Northern Hemisphere winter and especially lacks MSSWs of the high aspect ratio of the polar vortex in which the vortex is highly stretched or split. A further examination of daily geopotential height differences between STDD and CONV reveals occasional peaks in both hemispheres that are separated from MSSWs. The delayed and missed MSSW cases have smaller height differences in magnitude than such peaks. The height differences for those MSSWs include large contributions from the zonal component, which reflects underestimations in the weakening of the zonal mean polar night jet in CONV. We also explore strong planetary wave forcings and associated polar vortex weakenings for STDD and AMIP. We find a lower frequency of strong wave forcings and weaker vortex responses to such wave forcings in AMIP, consistent with the lower MSSW frequency.


2019 ◽  
Vol 20 (2) ◽  
pp. 123-129 ◽  
Author(s):  
Mariana Jesus ◽  
Tânia Silva ◽  
César Cagigal ◽  
Vera Martins ◽  
Carla Silva

Introduction: The field of nutritional psychiatry is growing fast. Although it initially focused on the effects of vitamins and micronutrients on mental health, in the last decade its focus has extended to dietary patterns. The possibility of a cost-effective dietary intervention in the most common mental disorder, depression, cannot be overlooked due to its potential large-scale impact. Method: A classic review of the literature was conducted, and studies published between 2010 and 2018 focusing on the impact of dietary patterns on depression and depressive symptoms were included. Results: We found 10 studies that matched our criteria. Most studies showed an inverse association between healthy dietary patterns (rich in fruits, vegetables, lean meats, nuts and whole grains, and low in processed and sugary foods) and depression and depressive symptoms across an array of age groups, although some authors reported statistical significance only in women. While most studies were of cross-sectional design, making it difficult to infer causality, a randomized controlled trial presented similar results. Discussion: The association between dietary patterns and depression is now well-established, although the exact etiological pathways are still unknown. Dietary intervention, with the implementation of healthier dietary patterns, closer to the traditional ones, can play an important role in the prevention and adjunctive therapy of depression and depressive symptoms. Conclusion: More large-scale randomized clinical trials need to be conducted in order to confirm the association between high-quality dietary patterns and lower risk of depression and depressive symptoms.


2021 ◽  
pp. 1-8
Author(s):  
Regina Sá ◽  
Tiago Pinho-Bandeira ◽  
Guilherme Queiroz ◽  
Joana Matos ◽  
João Duarte Ferreira ◽  
...  

Background: Ovar was the first Portuguese municipality to declare active community transmission of SARS-CoV-2, with total lockdown decreed on March 17, 2020. This context provided conditions for a large-scale testing strategy, allowing a referral system considering other symptoms besides the ones that were part of the case definition (fever, cough, and dyspnea). This study aims to identify other symptoms associated with COVID-19 since it may clarify the pre-test probability of the occurrence of the disease. Methods: This case-control study uses primary care registers between March 29 and May 10, 2020 in Ovar municipality. Pre-test clinical and exposure-risk characteristics, reported by physicians, were collected through a form, and linked with their laboratory result. Results: The study population included a total of 919 patients, of whom 226 (24.6%) were COVID-19 cases and 693 were negative for SARS-CoV-2. Only 27.1% of the patients reporting contact with a confirmed or suspected case tested positive. In the multivariate analysis, statistical significance was obtained for headaches (OR 0.558), odynophagia (OR 0.273), anosmia (OR 2.360), and other symptoms (OR 2.157). The interaction of anosmia and odynophagia appeared as possibly relevant with a borderline statistically significant OR of 3.375. Conclusion: COVID-19 has a wide range of symptoms. Of the myriad described, the present study highlights anosmia itself and calls for additional studies on the interaction between anosmia and odynophagia. Headaches and odynophagia by themselves are not associated with an increased risk for the disease. These findings may help clinicians in deciding when to test, especially when other diseases with similar symptoms are more prevalent, namely in winter.


2020 ◽  
Vol 501 (1) ◽  
pp. 994-1001
Author(s):  
Suman Sarkar ◽  
Biswajit Pandey ◽  
Snehasish Bhattacharjee

ABSTRACT We use an information theoretic framework to analyse data from the Galaxy Zoo 2 project and study whether there are any statistically significant correlations between the presence of bars in spiral galaxies and their environment. We measure the mutual information between the barredness of galaxies and their environments in a volume limited sample ($M_r \leq -21$) and compare it with the same in data sets where (i) the bar/unbar classifications are randomized and (ii) the spatial distribution of galaxies is shuffled on different length scales. We assess the statistical significance of the differences in the mutual information using a t-test and find that neither randomization of morphological classifications nor shuffling of the spatial distribution alters the mutual information in a statistically significant way. The non-zero mutual information between barredness and environment arises from the finite and discrete nature of the data set and can be entirely explained by mock Poisson distributions. We also separately compare the cumulative distribution functions of the barred and unbarred galaxies as a function of their local density. Using a Kolmogorov–Smirnov test, we find that the null hypothesis cannot be rejected even at the 75 per cent confidence level. Our analysis indicates that environments do not play a significant role in the formation of a bar, which is largely determined by the internal processes of the host galaxy.
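The comparison of cumulative distribution functions rests on the two-sample Kolmogorov–Smirnov statistic, the largest vertical gap between the two empirical CDFs. The sketch below computes only that statistic in plain Python; it does not compute a p-value, and the function and argument names are hypothetical (the two inputs would be, for instance, the local densities of barred and unbarred galaxies).

```python
from bisect import bisect_right


def ks_two_sample_statistic(sample_a, sample_b):
    """Largest absolute difference between the empirical CDFs of two samples."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    d_max = 0.0
    # The gap between two step functions can only change at observed values,
    # so it is enough to evaluate both empirical CDFs at every data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max
```

A small statistic, as reported in the abstract, means the two density distributions are hard to tell apart; converting it into a confidence level additionally requires the sample sizes and the Kolmogorov distribution.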


Author(s):  
Lior Shamir

Abstract Several recent observations using large data sets of galaxies showed non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim 8.7\cdot10^3$ spiral galaxies imaged by Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets show a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence shows a dipole axis with probabilities of $\sim 2.8\sigma$ and $\sim 7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^\circ,\delta=47^\circ)$ and is well within the $1\sigma$ error range compared to the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^\circ,\delta=61^\circ)$.

