A Benchmark for Bivariate Causal Discovery Methods

Author(s):  
Christoph Käding ◽  
Jakob Runge

The Earth’s climate is a highly complex dynamical system. To better understand and robustly predict it, knowledge about its underlying dynamics and causal dependency structure is required. Since controlled experiments are infeasible in the climate system, observational, data-driven approaches are needed. Observational causal inference is a very active research topic and a plethora of methods have been proposed. Each of these approaches comes with inherent strengths, weaknesses, and assumptions about the data-generating process, as well as further constraints.

In this work, we focus on the fundamental case of bivariate causal discovery: given two data samples X and Y, the task is to detect whether X causes Y or Y causes X. We present a large-scale benchmark that covers combinations of various characteristics of data-generating processes and sample sizes. By comparing most of the current state-of-the-art methods, we aim to shed light on the real-world performance of the evaluated methods. Since we employ synthetic data, we can precisely control the data characteristics and unveil the behavior of methods when their underlying assumptions are met or violated. Finally, we complete our evaluation with a comparison on a set of real-world data with known causal relations.
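The paper benchmarks existing detectors rather than proposing a new one; as a concrete illustration of the task, below is a minimal sketch of one classic bivariate approach, the additive-noise-model (ANM) test. The nonlinear regressor and the mutual-information dependence measure are illustrative choices, not necessarily those evaluated in the benchmark.

```python
# ANM-style bivariate causal discovery sketch: regress each variable on the
# other and prefer the direction whose residuals are more independent of the
# putative cause. Illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import mutual_info_regression

def anm_direction_score(cause, effect, seed=0):
    """Fit effect = f(cause) + noise; return MI between residuals and cause.

    Under the ANM assumption, residuals are independent of the true cause,
    so the direction with the *smaller* score is preferred."""
    reg = GradientBoostingRegressor(random_state=seed)
    reg.fit(cause.reshape(-1, 1), effect)
    residuals = effect - reg.predict(cause.reshape(-1, 1))
    return mutual_info_regression(cause.reshape(-1, 1), residuals,
                                  random_state=seed)[0]

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 1000)
y = x ** 3 + rng.normal(0, 1, 1000)   # ground truth: X -> Y

score_xy = anm_direction_score(x, y)  # should be small
score_yx = anm_direction_score(y, x)  # should be larger
print("X -> Y" if score_xy < score_yx else "Y -> X")
```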

2013 ◽  
Vol 25 (6) ◽  
pp. 1605-1641 ◽  
Author(s):  
Zhitang Chen ◽  
Laiwan Chan

LiNGAM has been successfully applied to some real-world causal discovery problems. Nevertheless, causal sufficiency is assumed; that is, there is no latent confounder of the observations, which may be unrealistic for real-world problems. Taking latent confounders into consideration will improve the reliability and accuracy of estimates of the real causal structures. In this letter, we investigate a model called linear nongaussian acyclic models in the presence of latent gaussian confounders (LiNGAM-GC), which can be seen as a special case of lvLiNGAM. This model includes latent confounders, which are assumed to be independently gaussian distributed and statistically independent of the disturbances. To tackle the causal discovery problem for this model, we first propose a pairwise cumulant-based measure of causal direction for cause-effect pairs. We prove that despite the presence of latent gaussian confounders, the causal direction of the observed cause-effect pair can be identified under the mild condition that the disturbances are simultaneously supergaussian or subgaussian. We propose a simple and efficient method to detect violations of this condition. We then extend our work to multivariate causal network discovery problems. Specifically, we propose algorithms to estimate the causal network structure, including causal ordering and causal strengths, using an iterative root-finding-and-removing scheme based on the pairwise measure. To address the redundant-edge problem caused by finite sample size, we develop an efficient bootstrapping-based pruning algorithm. Experiments on synthetic and real-world data demonstrate the applicability of our model and the effectiveness of the proposed algorithms.
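To illustrate the flavor of such pairwise cumulant-based measures, the sketch below contrasts the fourth-order cross-cumulants cum(x,x,x,y) and cum(x,y,y,y): because all fourth-order cumulants of a Gaussian vanish, a latent gaussian confounder leaves these statistics untouched. This is a simplified stand-in for the letter's measure, whose exact statistic differs.

```python
# Illustrative fourth-order cross-cumulant contrast. For a linear model
# y = b*x + e with standardized variables and |b| < 1, multilinearity gives
# cum(x,x,x,y) = b*k4(x) and cum(x,y,y,y) = b^3*k4(x), so the cumulant with
# three copies of the true cause dominates; gaussian confounders add nothing.
import numpy as np

def cross_cumulants(x, y):
    """cum(x,x,x,y) and cum(x,y,y,y) for standardized zero-mean samples."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    rho = np.mean(x * y)
    c_xxxy = np.mean(x**3 * y) - 3 * rho
    c_xyyy = np.mean(x * y**3) - 3 * rho
    return c_xxxy, c_xyyy

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)            # latent gaussian confounder
e = rng.laplace(size=n)           # supergaussian disturbance (condition met)
x = rng.laplace(size=n) + z
y = 0.5 * x + e + z               # ground truth: X -> Y

c_xxxy, c_xyyy = cross_cumulants(x, y)
print("X -> Y" if abs(c_xxxy) > abs(c_xyyy) else "Y -> X")
```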


2020 ◽  
Author(s):  
Jenna M Reps ◽  
Peter Rijnbeek ◽  
Alana Cuthbert ◽  
Patrick B Ryan ◽  
Nicole Pratt ◽  
...  

Abstract

Background: Researchers developing prediction models are faced with numerous design choices that may impact model performance. One key decision is how to include patients who are lost to follow-up. In this paper we perform a large-scale empirical evaluation investigating the impact of this decision. In addition, we aim to provide guidelines for how to deal with loss to follow-up.

Methods: We generate a partially synthetic dataset with complete follow-up and simulate loss to follow-up based either on random selection or on selection based on comorbidity. In addition to our synthetic data study, we investigate 21 real-world data prediction problems. We compare four simple strategies for developing models when using a cohort design that encounters loss to follow-up. Three strategies employ a binary classifier with data that: i) include all patients (including those lost to follow-up); ii) exclude all patients lost to follow-up; or iii) exclude only those patients lost to follow-up who do not have the outcome before being lost to follow-up. The fourth strategy uses a survival model with data that include all patients. We empirically evaluate discrimination and calibration performance.

Results: The partially synthetic data study results show that excluding patients who are lost to follow-up can introduce bias when loss to follow-up is common and does not occur at random. However, when loss to follow-up was completely at random, the choice of how to address it had negligible impact on model discrimination performance. Our empirical real-world data results showed that the four design choices investigated resulted in comparable performance when the time-at-risk was 1 year but demonstrated differential bias at a 3-year time-at-risk. Removing patients who are lost to follow-up before experiencing the outcome while keeping patients who are lost to follow-up after the outcome can bias a model and should be avoided.

Conclusion: Based on this study we therefore recommend i) developing models using data that include patients who are lost to follow-up and ii) evaluating the discrimination and calibration of models twice: on a test set including patients lost to follow-up and on a test set excluding them.
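A minimal sketch of the partially synthetic setup, under assumed variable names and a deliberately simplified outcome model: loss to follow-up is driven by comorbidity (i.e., not at random), and strategies i (include everyone) and ii (exclude anyone lost) are compared on a complete-follow-up test set.

```python
# Simulate complete follow-up, impose comorbidity-driven loss to follow-up,
# then compare the include-all and exclude-lost strategies. All names and
# distributions here are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20_000
comorbidity = rng.normal(size=n)
risk = rng.normal(size=n)
outcome = (rng.random(n) < 1 / (1 + np.exp(-(risk + comorbidity)))).astype(int)
# Loss to follow-up is more likely for sicker patients (not at random);
# patients lost before the outcome window have their label hidden (set to 0).
lost = rng.random(n) < 1 / (1 + np.exp(-(comorbidity - 1)))
observed = np.where(lost, 0, outcome)

X = np.column_stack([comorbidity, risk])
train, test = np.arange(n // 2), np.arange(n // 2, n)

# Strategy i: include everyone, using the (possibly truncated) labels.
m1 = LogisticRegression().fit(X[train], observed[train])
# Strategy ii: exclude patients lost to follow-up.
keep = train[~lost[train]]
m2 = LogisticRegression().fit(X[keep], observed[keep])

for name, m in [("include all", m1), ("exclude lost", m2)]:
    auc = roc_auc_score(outcome[test], m.predict_proba(X[test])[:, 1])
    print(f"{name}: AUC vs complete follow-up = {auc:.3f}")
```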


2021 ◽  
Vol 36 ◽  
pp. 153331752110624
Author(s):  
Mishah Azhar ◽  
Lawrence Fiedler ◽  
Patricio S. Espinosa ◽  
Charles H. Hennekens

We reviewed the evidence on proton pump inhibitors (PPIs) and dementia. PPIs are among the most widely utilized drugs in the world. Dementia affects roughly 5% of the population aged 60 years and older in the United States (US) and worldwide. With respect to PPIs and dementia, basic research has suggested plausible mechanisms, but descriptive and analytic epidemiological studies are inconsistent. In addition, a single large-scale randomized trial showed no association. When the evidence is incomplete, it is appropriate for clinicians and researchers to remain uncertain. Regulatory or public health authorities sometimes need to make real-world decisions based on real-world data. When the evidence is complete, the most rational judgments for individual patients and for the health of the general public are possible. At present, the evidence on PPIs and dementia suggests more reassurance than alarm, but further large-scale randomized evidence is necessary before definitive conclusions can be drawn.


2020 ◽  
Author(s):  
Dan E. Webster ◽  
Meghasyam Tummalacherla ◽  
Michael Higgins ◽  
David Wing ◽  
Euan Ashley ◽  
...  

Abstract

Expanding access to precision medicine will increasingly require that patient biometrics can be measured in remote care settings. VO2max, the maximum volume of oxygen usable during intense exercise, is one of the most predictive biometric risk factors for cardiovascular disease, frailty, and overall mortality.1,2 However, VO2max measurements are rarely performed in clinical care or large-scale epidemiologic studies due to the high cost, participant burden, and need for specialized laboratory equipment and staff.3,4 To overcome these barriers, we developed two smartphone sensor-based protocols for estimating VO2max: a generalization of a 12-minute run test (12-MRT) and a submaximal 3-minute step test (3-MST). In laboratory settings, Lin's concordance for these two tests relative to gold-standard VO2max testing was pc=0.66 for 12-MRT and pc=0.61 for 3-MST. Relative to “silver standards”5 (Cooper/Tecumseh protocols), concordance was pc=0.96 and pc=0.94, respectively. However, in remote settings, 12-MRT was significantly less concordant with the gold standard (pc=0.25) than 3-MST (pc=0.61), though both had high test-retest reliability (ICC=0.88 and 0.86, respectively). These results demonstrate the importance of real-world evidence for the validation of digital health measurements. To validate 3-MST in a broadly representative population, in accordance with the All of Us Research Program6 for which this measurement was developed, the camera-based heart rate measurement was investigated for potential bias. No systematic measurement error was observed that corresponded to skin pigmentation level, operating system, or cost of the phone used. The smartphone-based 3-MST protocol, here termed Heart Snapshot, maintained fidelity across demographic variation in age and sex, across diverse skin pigmentation, and between iOS and Android implementations of various smartphone models. The source code for these smartphone measurements, along with the data used to validate them,6 is openly available to the research community.
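Lin's concordance correlation coefficient (the pc reported above) penalizes both weak correlation and systematic deviation from the identity line. A minimal NumPy implementation follows, with made-up example values rather than the study's data.

```python
# Lin's concordance correlation coefficient:
#   rho_c = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
import numpy as np

def lins_ccc(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# Example: estimated vs. gold-standard VO2max values (made-up numbers).
estimated = np.array([38.2, 45.1, 33.7, 51.0, 42.3])
gold      = np.array([40.0, 44.0, 31.5, 53.2, 41.0])
print(f"pc = {lins_ccc(estimated, gold):.2f}")
```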


Author(s):  
Drew Levin ◽  
Patrick Finley

Objective: To develop a spatially accurate biosurveillance synthetic data generator for the testing, evaluation, and comparison of new outbreak detection techniques.

Introduction: Development of new methods for the rapid detection of emerging disease outbreaks is a research priority in the field of biosurveillance. Because real-world data are often proprietary in nature, scientists must utilize synthetic data generation methods to evaluate new detection methodologies. Colizza et al. have shown that epidemic spread is dependent on the airline transportation network [1], yet current data generators do not operate over network structures. Here we present a new spatial data generator that models the spread of contagion across a network of cities connected by airline routes. The generator is developed in the R programming language and produces data compatible with the popular `surveillance' software package.

Methods: Colizza et al. demonstrate the power-law relationships between city population, air traffic, and degree distribution [1]. We generate a transportation network as a Chung-Lu random graph [2] that preserves these scale-free relationships (Figure 1). First, given a power-law exponent and a desired number of cities, a probability mass function (PMF) is generated that mirrors the expected degree distribution for the given power-law relationship. Values are then sampled from this PMF to generate an expected degree (number of connected cities) for each city in the network. Edges (airline connections) are added to the network probabilistically as described in [2]. Unconnected graph components are each joined to the largest component using linear preferential attachment. Finally, city sizes are calculated based on an observed three-quarter power-law scaling relationship with the sampled degree distribution. Each city is represented as a customizable stochastic compartmental SIR model. Transportation between cities is modeled similarly to [2]. An infection is initialized in a single random city and infection counts are recorded in each city for a fixed period of time. A consistent fraction of the modeled infection cases are recorded as daily clinic visits. These counts are then added onto statically generated baseline data for each city to produce a full synthetic data set. Alternatively, data sets can be generated using real-world networks, such as the one maintained by the International Air Transport Association.

Results: Dynamics such as the number of cities, degree distribution power-law exponent, traffic flow, and disease kinetics can be customized. In the presented example (Figure 2) the outbreak spreads over a 20-city transportation network. Infection spreads rapidly once the more populated hub cities are infected. Cities that are multiple flights away from the initially infected city are infected late in the process. The generator is capable of creating data sets of arbitrary size, length, and connectivity to better mirror a diverse set of observed network types.

Conclusions: New computational methods for outbreak detection and surveillance must be compared to established approaches. Outbreak mitigation strategies require a realistic model of human transportation behavior to best evaluate impact. These actions require test data that accurately reflect the complexity of the real-world data they would be applied to. The outbreak data generated here represent the complexity of modern transportation networks and are made to be easily integrated with established software packages to allow for rapid testing and deployment.

[Figure 1: Randomly generated scale-free transportation network with a power-law degree exponent of λ=1.8. City and link sizes are scaled to reflect their weight.]

[Figure 2: Observed daily outbreak-related clinic visits across a randomly generated network of 20 cities. Each city is colored by the number of flights required to reach it from the initial infection location. These generated counts are then added onto baseline data to create a synthetic data set for experimentation.]

Keywords: Simulation; Network; Spatial; Synthetic; Data
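A minimal sketch of the generator's first stage as described above: sample expected degrees from a power-law PMF and build a Chung-Lu random graph (networkx's expected_degree_graph implements this model), then assign city sizes via the three-quarter power-law scaling. The abstract describes an R implementation; Python is used here purely for illustration, and the population rescaling constant is an assumption.

```python
# Stage 1 of the synthetic outbreak generator: power-law degree sampling,
# Chung-Lu graph construction, and degree-based city sizing.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n_cities, gamma = 20, 1.8

# Power-law PMF over degrees 1..n_cities-1: P(k) proportional to k^(-gamma).
ks = np.arange(1, n_cities)
pmf = ks ** -gamma
pmf /= pmf.sum()
expected_degrees = rng.choice(ks, size=n_cities, p=pmf)

G = nx.expected_degree_graph(list(expected_degrees), seed=0, selfloops=False)
# (The full generator also joins disconnected components to the largest
# component via linear preferential attachment; omitted here for brevity.)

# City size follows a three-quarter power-law scaling with degree; the
# 1e5 rescaling factor is an illustrative assumption.
degrees = np.array([max(d, 1) for _, d in G.degree()])
population = (1e5 * degrees ** 0.75).astype(int)
print(dict(list(zip(G.nodes(), population))[:5]))
```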


Author(s):  
Dazhong Shen ◽  
Hengshu Zhu ◽  
Chen Zhu ◽  
Tong Xu ◽  
Chao Ma ◽  
...  

The job interview is considered one of the most essential tasks in talent recruitment, forming a bridge between candidates and employers in fitting the right person to the right job. While substantial efforts have been made to improve the job interview process, biased or inconsistent interview assessments are inevitable due to the subjective nature of the traditional interview process. To this end, in this paper we propose a novel approach to intelligent job interview assessment by learning from large-scale real-world interview data. Specifically, we develop a latent variable model named Joint Learning Model on Interview Assessment (JLMIA) to jointly model job descriptions, candidate resumes, and interview assessments. JLMIA can effectively learn the representative perspectives of different job interview processes from successful job application records in history. As a result, a variety of applications in job interviews can be enabled, such as person-job fit and interview question recommendation. Extensive experiments conducted on real-world data clearly validate the effectiveness of JLMIA, which can lead to substantially less bias in job interviews and provide a valuable understanding of job interview assessment.
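JLMIA itself is a bespoke latent variable model; as a simplified stand-in, the sketch below learns shared latent topics over concatenated (job description, resume, assessment) triples with off-the-shelf LDA. It illustrates only the joint-modeling idea, not the paper's actual model or inference procedure, and the toy records are invented.

```python
# Joint topic modeling over concatenated job/resume/assessment text as a
# crude proxy for learning shared latent "perspectives" of an interview.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy records: one string per (job, resume, assessment) triple.
records = [
    "python machine learning | built ml pipelines | strong coding skills",
    "sales negotiation crm | managed key accounts | persuasive communicator",
    "python data analysis | statistics coursework | solid analytical skills",
]

vec = CountVectorizer(token_pattern=r"[a-z]+")
X = vec.fit_transform(records)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each record's topic mixture could then feed downstream applications such
# as person-job fit scoring or interview question recommendation.
print(lda.transform(X).round(2))
```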


2020 ◽  
Author(s):  
Zhaoyi Chen ◽  
Hansi Zhang ◽  
Yi Guo ◽  
Thomas J George ◽  
Mattia Prosperi ◽  
...  

Abstract

Clinical trials are essential but often have high financial costs and long execution times. Trial simulation using real-world data (RWD) could potentially provide insights on a treatment’s efficacy and safety before running a large-scale trial. In this work, we explored the feasibility of using RWD from a large clinical data research network to simulate a randomized controlled trial of Alzheimer’s disease under two different scenarios: a one-arm simulation of the standard-of-care control arm, and a two-arm simulation comparing treatment safety between the intervention and control arms with proper patient matching algorithms. We followed the original trial’s design and addressed some key questions, including how to translate trial criteria into database queries and how to establish measures of safety (i.e., serious adverse events) from RWD. Our simulation generated results comparable to the original trial, but also exposed gaps both in trial simulation methodology and in the generalizability of clinical trials.
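A minimal sketch of the matching step in the two-arm scenario, assuming propensity-score nearest-neighbor matching; the abstract says only "proper patient matching algorithms", so the specific technique, variable names, and data below are assumptions.

```python
# Propensity-score matching sketch: fit a treatment-assignment model on RWD
# covariates, then greedily pair each treated patient with the nearest
# untreated patient by propensity score.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2_000
covariates = rng.normal(size=(n, 5))          # e.g., age, comorbidities, ...
treated = rng.random(n) < 1 / (1 + np.exp(-covariates[:, 0]))

ps = LogisticRegression().fit(covariates, treated).predict_proba(covariates)[:, 1]

treated_idx = np.flatnonzero(treated)
available = set(np.flatnonzero(~treated))
matches = {}
for i in treated_idx:
    pool = np.array(sorted(available))
    if pool.size == 0:
        break
    j = pool[np.argmin(np.abs(ps[pool] - ps[i]))]   # nearest propensity score
    matches[i] = j
    available.remove(j)

# Safety outcomes (serious adverse events) would then be compared between
# the matched intervention and control arms.
print(f"matched {len(matches)} treated/control pairs")
```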


Author(s):  
Yaron Kinar ◽  
Alon Lanyado ◽  
Avi Shoshan ◽  
Rachel Yesharim ◽  
Tamar Domany ◽  
...  

Abstract

Background: The global pandemic of COVID-19 has challenged healthcare organizations and caused numerous deaths and hospitalizations worldwide. The need for data-based decision support tools for many aspects of controlling and treating the disease is evident but has been hampered by the scarcity of reliable real-world data. Here we describe two approaches: a) the use of an existing EMR-based model for predicting complications due to influenza, combined with available epidemiological data, to create a model that identifies individuals at high risk of developing complications due to COVID-19, and b) a preliminary model trained on existing real-world COVID-19 data.

Methods: We utilized the computerized data of Maccabi Healthcare Services, a 2.3-million-member state-mandated health organization in Israel. The age- and sex-matched matrix used for training the XGBoost ILI-based model included circa 690,000 rows and 900 features. The available dataset for the COVID-based model included a total of 2137 SARS-CoV-2-positive individuals who were either not hospitalized (n = 1658), hospitalized and marked as mild (n = 332), or marked as having moderate (n = 83) or severe (n = 64) complications.

Findings: We computed the AUC of our models and of the priors on the 2137 COVID-19 patients, with moderate and severe complications as cases and all others as controls; the AUC was 0.852 [0.824–0.879] for the ILI-based model and 0.872 [0.847–0.879] for the COVID-19-based model.

Interpretation: These models can effectively identify patients at high risk of complications, thus allowing optimization of resources, more focused follow-up, and early triage of these patients once symptoms worsen.

Funding: There was no funding for this study.

Research in context

Evidence before this study: We searched PubMed for coronavirus[MeSH Major Topic] AND the following MeSH terms: risk score, predictive analytics, algorithm. Only a few studies were found on predictive analytics for developing COVID-19 complications using real-world data. Many of the relevant works were based on self-reported information and are therefore difficult to implement at large scale without patient or physician participation.

Added value of this study: We describe two models for assessing the risk of COVID-19 complications and mortality, based on EMR data. One model was derived by combining a machine-learning model for influenza complications with epidemiological data on age- and sex-dependent mortality rates due to COVID-19. The other was derived directly from initial COVID-19 complications data.

Implications of all the available evidence: The developed models may effectively identify patients at high risk of developing COVID-19 complications. Implementing such models in operational data systems may support COVID-19 care workflows and assist in triaging patients.
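A minimal sketch of the evaluation style reported above: fit a gradient-boosted classifier (the paper used XGBoost) and report an AUC with a bootstrap confidence interval. The data below are synthetic placeholders, not the Maccabi cohort, and the xgboost package is assumed to be installed.

```python
# Train a gradient-boosted classifier and report AUC [95% bootstrap CI],
# mirroring the 0.852 [0.824-0.879] style of interval reported above.
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 20))                 # stand-in for 900 features
y = (rng.random(5_000) < 1 / (1 + np.exp(-X[:, 0] - X[:, 1]))).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200, eval_metric="logloss")
model.fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

aucs = []
idx = np.arange(len(y_te))
for _ in range(1000):                            # bootstrap over the test set
    b = rng.choice(idx, size=len(idx), replace=True)
    if y_te[b].min() == y_te[b].max():           # need both classes present
        continue
    aucs.append(roc_auc_score(y_te[b], scores[b]))
lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC = {roc_auc_score(y_te, scores):.3f} [{lo:.3f}-{hi:.3f}]")
```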

