Semblance: An empirical similarity kernel on probability spaces

2019 ◽  
Vol 5 (12) ◽  
pp. eaau9630 ◽  
Author(s):  
Divyansh Agarwal ◽  
Nancy R. Zhang

In data science, determining proximity between observations is critical to many downstream analyses such as clustering, classification, and prediction. However, when the data’s underlying probability distribution is unclear, the function used to compute similarity between data points is often arbitrarily chosen. Here, we present a novel definition of proximity, Semblance, that uses the empirical distribution of a feature to inform the pair-wise similarity between observations. The advantage of Semblance lies in its distribution-free formulation and its ability to place greater emphasis on proximity between observation pairs that fall at the outskirts of the data distribution, as opposed to those toward the center. Semblance is a valid Mercer kernel, allowing its principled use in kernel-based learning algorithms, and for any data modality. We demonstrate its consistently improved performance against conventional methods through simulations and real case studies from diverse applications in single-cell transcriptomics, image reconstruction, and financial forecasting.
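As a rough illustration of the idea (not the exact published definition), the following sketch computes a rank-based, distribution-free pairwise similarity in which, for each feature, a pair of observations scores highly when the empirical probability mass lying outside the interval they span is large, so that jointly extreme pairs are emphasised over central ones; the function name and the per-feature averaging are illustrative choices.

```python
import numpy as np

def empirical_tail_similarity(X):
    """Illustrative Semblance-style similarity (not the exact published formula).

    For each feature, the similarity of a pair (i, j) is the fraction of
    observations lying outside the interval spanned by x_i and x_j under that
    feature's empirical distribution, so pairs that sit together in a tail
    score higher than pairs near the centre. Per-feature scores are averaged.
    """
    n, p = X.shape
    K = np.zeros((n, n))
    for g in range(p):
        ranks = np.argsort(np.argsort(X[:, g]))   # empirical ranks 0 .. n-1
        lo = np.minimum.outer(ranks, ranks)       # lower rank of each pair
        hi = np.maximum.outer(ranks, ranks)       # higher rank of each pair
        inside = (hi - lo + 1) / n                # empirical mass spanned by the pair
        K += 1.0 - inside                         # mass outside the pair's interval
    return K / p

# Example: the resulting matrix can be fed to any kernel-based learner that
# accepts a precomputed similarity, e.g. spectral clustering or kernel PCA.
X = np.random.randn(100, 5)
K = empirical_tail_similarity(X)
```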

2013 ◽  
Vol 9 (1) ◽  
pp. 62-74 ◽  
Author(s):  
Robert Hodgson ◽  
Jing Cao

Abstract A test for evaluating wine judge performance is developed. The test is based on the premise that an expert wine judge will award similar scores to an identical wine. The definition of “similar” is parameterized to include varying numbers of adjacent awards on an ordinal scale, from No Award to Gold. For each index of similarity, a probability distribution is developed to determine the likelihood that a judge might pass the test by chance alone. When the test is applied to the results from a major wine competition, few judges pass the test. Of greater interest is that many judges who fail the test have vast professional experience in the wine industry. This leads us to question the basic premise that experts are able to provide consistent evaluations in wine competitions and, hence, whether wine competitions provide reliable recommendations of wine quality. (JEL Classifications: C02, C12, D81)
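The core calculation described here can be sketched as follows, under the assumption that a judge's awards on replicate pours of the same wine are independent draws from a fixed distribution over the ordinal award categories; the four-category distribution used below is hypothetical and only illustrates the mechanics.

```python
from itertools import product

def pass_probability(award_probs, n_replicates=3, window=1):
    """Probability that all replicate awards fall within `window` adjacent
    categories, i.e. that a judge passes the consistency test by chance alone."""
    k = len(award_probs)
    total = 0.0
    for combo in product(range(k), repeat=n_replicates):
        if max(combo) - min(combo) <= window:
            p = 1.0
            for c in combo:
                p *= award_probs[c]        # independence across replicate pours
            total += p
    return total

# Hypothetical award distribution over: No Award, Bronze, Silver, Gold
probs = [0.2, 0.4, 0.3, 0.1]
for w in (0, 1, 2):
    print(f"window={w}: pass-by-chance probability = {pass_probability(probs, window=w):.3f}")
```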


2012 ◽  
Vol 15 (01) ◽  
pp. 1250001 ◽  
Author(s):  
JIM GATHERAL ◽  
TAI-HO WANG

In this article, we derive a new most-likely-path (MLP) approximation for implied volatility in terms of local volatility, based on time-integration of the lowest order term in the heat-kernel expansion. This new approximation formula turns out to be a natural extension of the well-known formula of Berestycki, Busca and Florent. Various other MLP approximations have been suggested in the literature involving different choices of most-likely-path; our work fixes a natural definition of the most-likely-path. We confirm the improved performance of our new approximation relative to existing approximations in an explicit computation using a realistic S&P500 local volatility function.
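For context, the Berestycki, Busca and Florent (BBF) formula referenced above approximates implied volatility as the harmonic mean of local volatility along the straight line in log-moneyness; a minimal numerical sketch is below, using a toy local-volatility function rather than the realistic S&P 500 surface used in the paper (the MLP refinement instead integrates the heat-kernel term along a most-likely path, which is not shown here).

```python
import numpy as np

def bbf_implied_vol(local_vol, x, n_steps=400):
    """BBF approximation: implied vol at log-moneyness x is the harmonic mean
    of local vol between 0 and x, i.e. x / integral_0^x du / sigma_loc(u)."""
    if abs(x) < 1e-12:
        return local_vol(0.0)                                 # at-the-money limit
    u = np.linspace(0.0, x, n_steps)
    f = 1.0 / local_vol(u)
    integral = np.sum((f[1:] + f[:-1]) / 2.0 * np.diff(u))   # trapezoid rule
    return x / integral

# Toy local-volatility smile in log-moneyness (purely illustrative)
toy_local_vol = lambda u: 0.2 + 0.1 * u**2

for x in (-0.2, 0.0, 0.2):
    print(f"log-moneyness {x:+.1f}: BBF implied vol ~ {bbf_implied_vol(toy_local_vol, x):.4f}")
```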


Author(s):  
Justine Ina Davies ◽  
Adrian W. Gelb ◽  
Julian Gore-Booth ◽  
Janet Martin ◽  
Jannicke Mellin-Olsen ◽  
...  

Background: Indicators to evaluate progress towards timely access to safe surgical, anaesthesia, and obstetric (SAO) care were proposed in 2015 by the Lancet Commission on Global Surgery. Despite being rapidly taken up by practitioners, the data points from which to derive them were not defined, limiting comparability across time or settings. We convened global experts to evaluate and explicitly define, for the first time, the indicators to improve comparability and support achievement of 2030 goals to improve access to safe, affordable surgical and anaesthesia care. Methods and findings: The Utstein process for developing and reporting guidelines through a consensus-building process was followed. In-person discussions at a two-day meeting were followed by an iterative process conducted by email and virtual group meetings until consensus was reached. Participants consisted of experts in surgery, anaesthesia, and obstetric care, data science, and health indicators from high-, middle-, and low-income countries. Considering each of the six indicators in turn, we refined overarching descriptions and agreed upon the data points needed for construction of each indicator at the current time (basic data points), and as each evolves over 2-5-year (intermediate) and >5-year (full) timeframes. We removed one of the original six indicators (one of the two financial risk protection indicators was eliminated) and refined descriptions and defined data points required to construct the five remaining indicators: geospatial access, workforce, surgical volume, perioperative mortality, and catastrophic expenditure. Conclusions: To track global progress toward timely access to quality SAO care, these indicators, at the basic level, should be implemented universally. Intermediate and full evolutions will assist in developing national surgical plans and in collecting data for research studies.


2019 ◽  
Author(s):  
Levi John Wolf ◽  
Sergio J. Rey ◽  
Taylor M. Oshan

Open science practices are a large and healthy part of computational geography and the burgeoning field of spatial data science. In many forms, open geospatial cyberinfrastructure adheres to a varying and informal set of practices and codes that empower levels of collaboration that would be impossible otherwise. Pathbreaking work in the geographical sciences has explicitly brought these concepts into focus for our current model of open science in geography. In practice, however, these blend together into a somewhat ill-advised but easy-to-use working definition of open science: you know open science when you see it (on GitHub). However, open science lags far behind the needs revealed by this level of collaboration. In this paper, we describe the concerns of open geographic data science in terms of replicability and open science. We discuss the practical techniques that engender community-building in open science communities, and the impacts that these kinds of social changes have on the technological architecture of scientific infrastructure.


CJEM ◽  
2017 ◽  
Vol 19 (S1) ◽  
pp. S79-S80 ◽  
Author(s):  
S. AlQahtani ◽  
P. Menzies ◽  
B. Bigham ◽  
M. Welsford

Introduction: Early recognition of sepsis is key to delivering timely life-saving interventions. The role of paramedics in recognition of these patients is understudied. It is not known whether the prehospital information usually gathered is sufficient for severe sepsis recognition. We sought to: 1) evaluate the paramedic medical records (PMRs) of severe sepsis patients to describe epidemiologic characteristics; 2) determine which severe sepsis recognition and prediction scores are routinely captured by paramedics; and 3) determine how these scores perform in the prehospital setting. Methods: We performed a retrospective review of patients ≥18 years who met the definition of severe sepsis in one of two urban Emergency Departments (ED) and had arrived by ambulance over an eighteen-month period. PMRs were evaluated for demographic, physiologic and clinical variables. The information was entered into a database, which auto-filled a tool that determined SIRS criteria, shock index, prehospital critical illness score, NEWS, MEWS, HEWS, MEDS and qSOFA. Descriptive statistics were calculated. Results: We enrolled 298 eligible sepsis patients: male 50.3%, mean age 73 years, and mean prehospital transportation time 30 minutes. Hospital mortality was 37.5%. PMRs captured initial: respiratory rate 88.6%, heart rate 90%, systolic blood pressure 83.2%, oxygen saturation 59%, temperature 18.7%, and Glasgow Coma Scale 89%. Although the complete MEWS and HEWS data capture rate was <17%, 98% and 68% of patients met the cut-points defining “critically unwell” (MEWS ≥3) and “trigger score” (HEWS ≥5), respectively. The qSOFA criteria were completely captured in 82% of patients; however, qSOFA was positive in only 36%. It performed similarly to SIRS, which was positive in only 34% of patients. The other scores were intermediate in both completeness of data capture and performance for sepsis recognition. Conclusion: Patients transported by ambulance with severe sepsis have high mortality. Despite the variable rate of data capture, PMRs include sufficient data points to recognize prehospital severe sepsis. A validated screening tool that can be applied by paramedics is still lacking. qSOFA does not appear to be sensitive enough to be used as a prehospital screening tool for severe sepsis; however, MEWS or HEWS may be appropriate to evaluate in a large prospective study.
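To illustrate how such scores are derived from the vitals routinely captured in paramedic medical records, the sketch below computes qSOFA and a vitals-only SIRS count using their standard published cut-points; the function and field names are illustrative, and a missing vital simply leaves that criterion unscored.

```python
def qsofa(rr=None, sbp=None, gcs=None):
    """qSOFA from prehospital vitals (standard cut-points); None = not recorded."""
    score = 0
    if rr is not None and rr >= 22:        # respiratory rate >= 22 /min
        score += 1
    if sbp is not None and sbp <= 100:     # systolic blood pressure <= 100 mmHg
        score += 1
    if gcs is not None and gcs < 15:       # altered mentation
        score += 1
    return score                           # >= 2 conventionally treated as positive

def sirs_vitals_only(temp_c=None, hr=None, rr=None):
    """Vitals-only SIRS count (white-cell criterion unavailable prehospital)."""
    score = 0
    if temp_c is not None and (temp_c > 38.0 or temp_c < 36.0):
        score += 1
    if hr is not None and hr > 90:
        score += 1
    if rr is not None and rr > 20:
        score += 1
    return score                           # >= 2 conventionally treated as positive

# Hypothetical patient record abstracted from a paramedic medical record
print(qsofa(rr=24, sbp=96, gcs=14), sirs_vitals_only(temp_c=38.6, hr=112, rr=24))
```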


2019 ◽  
Vol 69 (2) ◽  
pp. 453-468
Author(s):  
Demetrios P. Lyberopoulos ◽  
Nikolaos D. Macheras ◽  
Spyridon M. Tzaninis

Abstract Under mild assumptions, the equivalence of the mixed Poisson process with mixing parameter a real-valued random variable to the one with a mixing probability distribution, as well as to the mixed Poisson process in the sense of Huang, is obtained, and a characterization of each of the above mixed Poisson processes in terms of disintegrations is provided. Moreover, some examples of “canonical” probability spaces admitting counting processes satisfying the equivalence of all the above statements are given. Finally, it is shown that our assumptions for the characterization of mixed Poisson processes in terms of disintegrations cannot be omitted.
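As a minimal illustration of the object under study, a mixed Poisson process can be simulated by first drawing the random intensity from the mixing distribution and then, conditionally on that draw, running an ordinary homogeneous Poisson process; the gamma mixing distribution below (which yields negative-binomial marginal counts) is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed_poisson_path(t_max, draw_rate, rng):
    """Sample arrival times of a mixed Poisson process on [0, t_max].

    draw_rate: callable returning one draw of the random intensity
               (the mixing random variable).
    """
    lam = draw_rate()                                   # random mixing parameter
    n = rng.poisson(lam * t_max)                        # conditional count on [0, t_max]
    return np.sort(rng.uniform(0.0, t_max, size=n))     # uniform order statistics give arrivals

# Illustrative gamma mixing distribution
arrivals = sample_mixed_poisson_path(10.0, lambda: rng.gamma(shape=3.0, scale=0.5), rng)
print(len(arrivals), arrivals[:5])
```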


Author(s):  
Tianhang Zheng ◽  
Changyou Chen ◽  
Kui Ren

Recent work on adversarial attacks has shown that the Projected Gradient Descent (PGD) adversary is a universal first-order adversary, and that a classifier adversarially trained with PGD is robust against a wide range of first-order attacks. It is worth noting that the original objective of an attack/defense model relies on a data distribution p(x), typically in the form of risk maximization/minimization, e.g., max/min E_{p(x)} L(x), with p(x) some unknown data distribution and L(·) a loss function. However, since PGD generates attack samples independently for each data sample based on L(·), the procedure does not necessarily lead to good generalization in terms of risk optimization. In this paper, we address this issue by proposing the distributionally adversarial attack (DAA), a framework that solves for an optimal adversarial-data distribution: a perturbed distribution that satisfies the L∞ constraint but deviates from the original data distribution so as to maximally increase the generalization risk. Algorithmically, DAA performs optimization over the space of potential data distributions, which introduces direct dependency between all data points when generating adversarial samples. DAA is evaluated by attacking state-of-the-art defense models, including the adversarially trained models provided by MIT MadryLab. Notably, DAA ranks first on MadryLab’s white-box leaderboards, reducing the accuracy of their secret MNIST model to 88.56% (with l∞ perturbations of ε = 0.3) and the accuracy of their secret CIFAR model to 44.71% (with l∞ perturbations of ε = 8.0). Code for the experiments is released at https://github.com/tianzheng4/Distributionally-Adversarial-Attack.
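For reference, the baseline PGD adversary that DAA generalizes is projected signed-gradient ascent on the per-sample loss within an ℓ∞ ball; a minimal PyTorch-style sketch is shown below (DAA itself additionally couples the samples through a distribution-level objective, which is not reproduced here).

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """Standard L-infinity PGD: iterated signed-gradient ascent on the loss,
    projected back into the eps-ball around the clean input x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()            # ascent step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project to eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)                           # keep valid pixel range
    return x_adv.detach()
```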


2019 ◽  
Vol 37 (6) ◽  
pp. 929-951 ◽  
Author(s):  
Laurent Remy ◽  
Dragan Ivanović ◽  
Maria Theodoridou ◽  
Athina Kritsotaki ◽  
Paul Martin ◽  
...  

Purpose: The purpose of this paper is to boost multidisciplinary research through the building of an integrated catalogue of research assets metadata. Such an integrated catalogue should enable researchers to solve problems or analyse phenomena that require a view across several scientific domains. Design/methodology/approach: There are two main approaches for integrating metadata catalogues provided by different e-science research infrastructures (e-RIs): centralised and distributed. The authors decided to implement a central metadata catalogue that describes, provides access to and records actions on the assets of a number of e-RIs participating in the system. The authors chose the CERIF data model for the description of assets available via the integrated catalogue. An analysis of popular metadata formats used in e-RIs has been conducted, and mappings between popular formats and the CERIF data model have been defined using an XML-based tool for the description and automatic execution of mappings. Findings: An integrated catalogue of research assets metadata has been created. Metadata from e-RIs supporting Dublin Core, ISO 19139, DCAT-AP, EPOS-DCAT-AP, OIL-E and CKAN formats can be integrated into the catalogue. Metadata are stored as CERIF RDF in the integrated catalogue. A web portal for searching this catalogue has been implemented. Research limitations/implications: Only five formats are supported at this moment. However, descriptions of mappings between other source formats and the target CERIF format can be defined in the future using the 3M tool, an XML-based tool for describing X3ML mappings that can then be automatically executed on XML metadata records. The approach and best practices described in this paper can thus be applied in future mappings between other metadata formats. Practical implications: The integrated catalogue is a part of the eVRE prototype, which is a result of the VRE4EIC H2020 project. Social implications: The integrated catalogue should boost the performance of multidisciplinary research; thus it has the potential to enhance the practice of data science and so contribute to an increasingly knowledge-based society. Originality/value: A novel approach for the creation of the integrated catalogue has been defined and implemented. The approach includes the definition of mappings between various formats. The defined mappings are effective and shareable.
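As a much-simplified illustration of the kind of format mapping the integrated catalogue depends on (the project itself uses declarative X3ML mappings executed by the 3M tool rather than ad hoc code), the sketch below extracts a few Dublin Core elements from an XML record and re-expresses them as subject-predicate-object triples; the element names follow the standard Dublin Core namespace, while the target predicate names are illustrative stand-ins for CERIF terms.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Illustrative correspondence between Dublin Core elements and CERIF-like predicates
FIELD_MAP = {"title": "cerif:name", "creator": "cerif:isAuthoredBy", "date": "cerif:hasDate"}

def dc_record_to_triples(xml_text, record_uri):
    """Extract mapped Dublin Core fields and emit (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    triples = []
    for dc_field, predicate in FIELD_MAP.items():
        for elem in root.iter(DC + dc_field):
            if elem.text:
                triples.append((record_uri, predicate, elem.text.strip()))
    return triples

record = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example dataset</dc:title>
  <dc:creator>Jane Doe</dc:creator>
  <dc:date>2019-01-01</dc:date>
</record>"""

print(dc_record_to_triples(record, "https://example.org/asset/1"))
```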


2019 ◽  
Vol 1 (1) ◽  
pp. 359-383 ◽  
Author(s):  
Frank Emmert-Streib ◽  
Matthias Dehmer

Regression models are a form of supervised learning that is important for machine learning, statistics, and data science in general. Although classical ordinary least squares (OLS) regression has been known for a long time, in recent years there have been many new developments that extend this model significantly. Above all, the least absolute shrinkage and selection operator (LASSO) model has gained considerable interest. In this paper, we review general regression models with a focus on the LASSO and extensions thereof, including the adaptive LASSO, elastic net, and group LASSO. We discuss the regularization terms responsible for inducing coefficient shrinkage and variable selection, leading to improved performance metrics of these regression models. This makes these modern, computational regression models valuable tools for analyzing high-dimensional problems.
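A brief scikit-learn sketch of the model family reviewed here: OLS, LASSO, and elastic net fitted to the same synthetic sparse regression problem, with the regularization strengths chosen arbitrarily for illustration (group LASSO is not part of scikit-learn and is omitted).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet

# Synthetic high-dimensional problem: 200 samples, 50 features, only 5 informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

models = {
    "OLS": LinearRegression(),
    "LASSO (L1)": Lasso(alpha=1.0),                        # shrinks and zeroes coefficients
    "Elastic net (L1+L2)": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

for name, model in models.items():
    model.fit(X, y)
    n_nonzero = np.sum(np.abs(model.coef_) > 1e-8)
    print(f"{name}: {n_nonzero} non-zero coefficients")
```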

