Causality in Linear Nongaussian Acyclic Models in the Presence of Latent Gaussian Confounders

LiNGAM has been successfully applied to some real-world causal discovery problems. Nevertheless, causal sufficiency is assumed; that is, there is no latent confounder of the observations, which may be unrealistic for real-world problems. Taking latent confounders into consideration will improve the reliability and accuracy of estimates of the real causal structures. In this letter, we investigate a model called linear nongaussian acyclic models in the presence of latent gaussian confounders (LiNGAM-GC), which can be seen as a specific case of lvLiNGAM. This model includes the latent confounders, which are assumed to be independent and gaussian distributed, and statistically independent of the disturbances. To tackle the causal discovery problem of this model, we first propose a pairwise cumulant-based measure of causal directions for cause-effect pairs. We prove that in spite of the presence of latent gaussian confounders, the causal direction of the observed cause-effect pair can be identified under the mild condition that the disturbances are simultaneously supergaussian or subgaussian. We propose a simple and efficient method to detect the violation of this condition. We extend our work to multivariate causal network discovery problems. Specifically, we propose algorithms to estimate the causal network structure, including causal ordering and causal strengths, using an iterative root finding-removing scheme based on the pairwise measure. To address the redundant edge problem due to the finite sample size effect, we develop an efficient bootstrapping-based pruning algorithm. Experiments on synthetic data and real-world data have been conducted to show the applicability of our model and the effectiveness of our proposed algorithms.
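The role of cumulants in the abstract above can be illustrated with a small simulation. The sketch below is not the measure proposed in the letter; it only shows the general property that fourth-order cross-cumulants vanish for Gaussian components, so an independent Gaussian confounder drops out of them. The toy model, the Laplace disturbances, the coefficient `rho = 0.5`, and the function names are all assumptions made for illustration.

```python
import numpy as np


def simulate_pair(n, rho, rng):
    """Toy model (assumed): x -> y with a shared Gaussian confounder c."""
    c = rng.normal(size=n)        # latent Gaussian confounder
    e_x = rng.laplace(size=n)     # supergaussian disturbance of x
    e_y = rng.laplace(size=n)     # supergaussian disturbance of y
    x = e_x + c
    y = rho * x + e_y + c
    return x, y


def cross_cumulants(x, y):
    """Sample estimates of cum(x,x,x,y) and cum(x,y,y,y) after centering."""
    x = x - x.mean()
    y = y - y.mean()
    m_xy = np.mean(x * y)
    c_xy = np.mean(x**3 * y) - 3.0 * np.mean(x**2) * m_xy
    c_yx = np.mean(x * y**3) - 3.0 * np.mean(y**2) * m_xy
    return float(c_xy), float(c_yx)


rng = np.random.default_rng(0)
x, y = simulate_pair(500_000, 0.5, rng)
c_xy, c_yx = cross_cumulants(x, y)
print(c_xy, c_yx)
```

In this toy model the population values are cum(x,x,x,y) = rho * k and cum(x,y,y,y) = rho^3 * k, where k = 12 is the fourth cumulant of a unit-scale Laplace variable. The confounder contributes nothing to either quantity, which is why the two estimates differ in magnitude whenever |rho| < 1.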


A Benchmark for Bivariate Causal Discovery Methods

10.5194/egusphere-egu21-8584 ◽

2021 ◽

Author(s):

Christoph Käding ◽

Jakob Runge

Keyword(s):

Real World ◽

Large Scale ◽

Synthetic Data ◽

Causal Discovery ◽

Real World Data ◽

Dependency Structure ◽

Current State ◽

Data Generating Process ◽

Active Research ◽

It Knowledge

The Earth's climate is a highly complex and dynamical system. To better understand and robustly predict it, knowledge about its underlying dynamics and causal dependency structure is required. Since controlled experiments are infeasible in the climate system, observational data-driven approaches are needed. Observational causal inference is a very active research topic and a plethora of methods have been proposed. Each of these approaches comes with inherent strengths, weaknesses, and assumptions about the data generating process as well as further constraints. In this work, we focus on the fundamental case of bivariate causal discovery, i.e., given two data samples X and Y the task is to detect whether X causes Y or Y causes X. We present a large-scale benchmark that represents combinations of various characteristics of data-generating processes and sample sizes. By comparing most of the current state-of-the-art methods, we aim to shed light onto the real-world performance of evaluated methods. Since we employ synthetic data, we are able to precisely control the data characteristics and can unveil the behavior of methods when their underlying assumptions are met or violated. Further, we give a comparison on a set of real-world data with known causal relations to complete our evaluation.


Boosting Instance Segmentation with Synthetic Data: A study to overcome the limits of real world data sets

10.1109/iccvw54120.2021.00110 ◽

2021 ◽

Author(s):

Florentin Poucin ◽

Andrea Kraus ◽

Martin Simon

Keyword(s):

Real World ◽

Synthetic Data ◽

Data Sets ◽

Real World Data ◽

World Data ◽

Instance Segmentation


An empirical analysis of dealing with patients who are lost to follow-up when developing prognostic models using a cohort design

10.21203/rs.3.rs-54715/v3 ◽

2020 ◽

Author(s):

Jenna M Reps ◽

Peter Rijnbeek ◽

Alana Cuthbert ◽

Patrick B Ryan ◽

Nicole Pratt ◽

...

Keyword(s):

At Risk ◽

Real World ◽

Synthetic Data ◽

Real World Data ◽

Test Set ◽

Loss To Follow Up ◽

World Data ◽

Cohort Design ◽

Lost To Follow Up

Background: Researchers developing prediction models are faced with numerous design choices that may impact model performance. One key decision is how to include patients who are lost to follow-up. In this paper we perform a large-scale empirical evaluation investigating the impact of this decision. In addition, we aim to provide guidelines for how to deal with loss to follow-up. Methods: We generate a partially synthetic dataset with complete follow-up and simulate loss to follow-up based either on random selection or on selection based on comorbidity. In addition to our synthetic data study we investigate 21 real-world data prediction problems. We compare four simple strategies for developing models when using a cohort design that encounters loss to follow-up. Three strategies employ a binary classifier with data that: i) include all patients (including those lost to follow-up), ii) exclude all patients lost to follow-up or iii) only exclude patients lost to follow-up who do not have the outcome before being lost to follow-up. The fourth strategy uses a survival model with data that include all patients. We empirically evaluate the discrimination and calibration performance. Results: The partially synthetic data study results show that excluding patients who are lost to follow-up can introduce bias when loss to follow-up is common and does not occur at random. However, when loss to follow-up was completely at random, the choice of addressing it had negligible impact on model discrimination performance. Our empirical real-world data results showed that the four design choices investigated to deal with loss to follow-up resulted in comparable performance when the time-at-risk was 1-year but demonstrated differential bias when we looked into 3-year time-at-risk. Removing patients who are lost to follow-up before experiencing the outcome but keeping patients who are lost to follow-up after the outcome can bias a model and should be avoided. Conclusion: Based on this study we therefore recommend i) developing models using data that include patients who are lost to follow-up and ii) evaluating the discrimination and calibration of models twice: on a test set including patients lost to follow-up and on a test set excluding patients lost to follow-up.


A Spatial Biosurveillance Synthetic Data Generator in R

Online Journal of Public Health Informatics ◽

10.5210/ojphi.v9i1.7583 ◽

2017 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Drew Levin ◽

Patrick Finley

Keyword(s):

Power Law ◽

Real World ◽

Degree Distribution ◽

Transportation Network ◽

Synthetic Data ◽

Data Sets ◽

Real World Data ◽

Data Set ◽

Scale Free ◽

Data Generator

Objective: To develop a spatially accurate biosurveillance synthetic data generator for the testing, evaluation, and comparison of new outbreak detection techniques. Introduction: Development of new methods for the rapid detection of emerging disease outbreaks is a research priority in the field of biosurveillance. Because real-world data are often proprietary in nature, scientists must utilize synthetic data generation methods to evaluate new detection methodologies. Colizza et al. have shown that epidemic spread is dependent on the airline transportation network [1], yet current data generators do not operate over network structures. Here we present a new spatial data generator that models the spread of contagion across a network of cities connected by airline routes. The generator is developed in the R programming language and produces data compatible with the popular 'surveillance' software package. Methods: Colizza et al. demonstrate the power-law relationships between city population, air traffic, and degree distribution [1]. We generate a transportation network as a Chung-Lu random graph [2] that preserves these scale-free relationships (Figure 1). First, given a power-law exponent and a desired number of cities, a probability mass function (PMF) is generated that mirrors the expected degree distribution for the given power-law relationship. Values are then sampled from this PMF to generate an expected degree (number of connected cities) for each city in the network. Edges (airline connections) are added to the network probabilistically as described in [2]. Unconnected graph components are each joined to the largest component using linear preferential attachment. Finally, city sizes are calculated based on an observed three-quarter power-law scaling relationship with the sampled degree distribution. Each city is represented as a customizable stochastic compartmental SIR model. Transportation between cities is modeled similarly to [2]. An infection is initialized in a single random city and infection counts are recorded in each city for a fixed period of time. A consistent fraction of the modeled infection cases are recorded as daily clinic visits. These counts are then added onto statically generated baseline data for each city to produce a full synthetic data set. Alternatively, data sets can be generated using real-world networks, such as the one maintained by the International Air Transport Association. Results: Dynamics such as the number of cities, degree distribution power-law exponent, traffic flow, and disease kinetics can be customized. In the presented example (Figure 2) the outbreak spreads over a 20-city transportation network. Infection spreads rapidly once the more populated hub cities are infected. Cities that are multiple flights away from the initially infected city are infected late in the process. The generator is capable of creating data sets of arbitrary size, length, and connectivity to better mirror a diverse set of observed network types. Conclusions: New computational methods for outbreak detection and surveillance must be compared to established approaches. Outbreak mitigation strategies require a realistic model of human transportation behavior to best evaluate impact. These actions require test data that accurately reflect the complexity of the real-world data they would be applied to. The outbreak data generated here represent the complexity of modern transportation networks and are made to be easily integrated with established software packages to allow for rapid testing and deployment.

Figure 1: Randomly generated scale-free transportation network with a power-law degree exponent of λ=1.8. City and link sizes are scaled to reflect their weight.

Figure 2: An example of observed daily outbreak-related clinic visits across a randomly generated network of 20 cities. Each city is colored by the number of flights required to reach the city from the initial infection location. These generated counts are then added onto baseline data to create a synthetic data set for experimentation.

Keywords: Simulation; Network; Spatial; Synthetic; Data
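The first two steps of the Methods above (sampling expected degrees from a power-law PMF, then adding edges probabilistically) can be sketched as follows. This is a minimal illustration written in Python rather than the authors' R code. It uses the standard Chung-Lu edge probability min(1, w_i * w_j / sum(w)), which is assumed to match what reference [2] describes. Function names, the maximum degree, and the seed are hypothetical, and the later steps (joining components, city sizes, the SIR model) are omitted.

```python
import numpy as np


def sample_power_law_degrees(n_cities, exponent, max_degree, rng):
    """Draw one expected degree per city from a PMF proportional to k**(-exponent)."""
    k = np.arange(1, max_degree + 1, dtype=float)
    pmf = k ** (-exponent)
    pmf /= pmf.sum()
    return rng.choice(k, size=n_cities, p=pmf)


def chung_lu_edges(expected_degrees, rng):
    """Add each edge (i, j) independently with probability min(1, w_i*w_j/sum(w))."""
    w = np.asarray(expected_degrees, dtype=float)
    total = w.sum()
    edges = []
    for i in range(len(w)):
        for j in range(i + 1, len(w)):
            p = min(1.0, w[i] * w[j] / total)
            if rng.random() < p:
                edges.append((i, j))
    return edges


rng = np.random.default_rng(42)
degrees = sample_power_law_degrees(n_cities=20, exponent=1.8, max_degree=19, rng=rng)
edges = chung_lu_edges(degrees, rng)
print(len(edges), "airline connections among", len(degrees), "cities")
```

The resulting graph may be disconnected, which is the situation the abstract handles by joining each unconnected component to the largest one.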


Evaluating a Longitudinal Synthetic Data Generator using Real World Data

2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS) ◽

10.1109/cbms52027.2021.00074 ◽

2021 ◽

Author(s):

Zhenchen Wang ◽

Puja Myles ◽

Anu Jain ◽

James L. Keidel ◽

Roberto Liddi ◽

...

Keyword(s):

Real World ◽

Synthetic Data ◽

Real World Data ◽

World Data ◽

Data Generator


Using synthetic data for the dissemination of computational geospatial models

European Journal of Geography ◽

10.48088/ejg.k.che.11.3.76.91 ◽

2020 ◽

Vol 11 (3) ◽

pp. 76-91

Author(s):

Kostas CHELIOTIS ◽

Keyword(s):

Real World ◽

Synthetic Data ◽

Geospatial Data ◽

World Systems ◽

Close Collaboration ◽

Real World Data ◽

Computational Tools ◽

Full Extent ◽

Data Content ◽

Synthetic Datasets

Detailed datasets of real-world systems are becoming more and more available, accompanied by a similar increase in their use in research. However, datasets are often provided to researchers with restrictions regarding their publication. This poses a major limitation for the dissemination of computational tools, whose comprehension often requires the availability of the detailed dataset around which the tool was built. This paper discusses the potential of synthetic datasets for circumventing such limitations, as it is often the data content itself that is proprietary, rather than the dataset schema. Therefore, new data can be generated that conform to the schema, and may then be distributed freely alongside the relevant models, allowing other researchers to explore tools in action to their full extent. This paper presents the process of creating synthetic geospatial data within the scope of a research project which relied on real-world data, originally captured through close collaboration with industry partners.


A Kernel Embedding–Based Approach for Nonstationary Causal Model Inference

Neural Computation ◽

10.1162/neco_a_01064 ◽

2018 ◽

Vol 30 (5) ◽

pp. 1394-1425 ◽

Cited By ~ 1

Author(s):

Shoubo Hu ◽

Zhitang Chen ◽

Laiwan Chan

Keyword(s):

Linear Model ◽

Real World ◽

Causal Model ◽

Real World Data ◽

Model Inference ◽

Causal Direction ◽

Causal Graphs ◽

Nonstationary Data ◽

Multiple Variables ◽

Multiple Domains

Although nonstationary data are more common in the real world, most existing causal discovery methods do not take nonstationarity into consideration. In this letter, we propose a kernel embedding–based approach, ENCI, for nonstationary causal model inference where data are collected from multiple domains with varying distributions. In ENCI, we transform the complicated relation of a cause-effect pair into a linear model of variables of which observations correspond to the kernel embeddings of the cause-and-effect distributions in different domains. In this way, we are able to estimate the causal direction by exploiting the causal asymmetry of the transformed linear model. Furthermore, we extend ENCI to causal graph discovery for multiple variables by transforming the relations among them into a linear nongaussian acyclic model. We show that by exploiting the nonstationarity of distributions, both cause-effect pairs and two kinds of causal graphs are identifiable under mild conditions. Experiments on synthetic and real-world data are conducted to justify the efficacy of ENCI over major existing methods.


Synthetic Data Generator for Electric Vehicle Charging Sessions: Modeling and Evaluation Using Real-World Data

Energies ◽

10.3390/en13164211 ◽

2020 ◽

Vol 13 (16) ◽

pp. 4211

Author(s):

Manu Lahariya ◽

Dries F. Benoit ◽

Chris Develder

Keyword(s):

Electric Vehicle ◽

Real World ◽

Synthetic Data ◽

Gaussian Mixture ◽

Real World Data ◽

Arrival Times ◽

World Data ◽

Data Generator ◽

Charging Stations ◽

Ev Charging

Electric vehicle (EV) charging stations have become prominent in electricity grids in the past few years. Their increased penetration introduces both challenges and opportunities; they contribute to increased load, but also offer flexibility potential, e.g., in deferring the load in time. To analyze such scenarios, realistic EV data are required, which are hard to come by. Therefore, in this article we define a synthetic data generator (SDG) for EV charging sessions based on a large real-world dataset. Arrival times of EVs are modeled assuming that the inter-arrival times of EVs follow an exponential distribution. The connection time of an EV depends on its arrival time and can be described using a conditional probability distribution. This distribution is estimated using Gaussian mixture models, and departure times can be calculated by sampling connection times for EV arrivals from this distribution. Our SDG is based on a novel method for the temporal modeling of EV sessions, and jointly models the arrival and departure times of EVs for a large number of charging stations. Our SDG was trained using real-world EV sessions, and used to generate synthetic samples of session data, which were statistically indistinguishable from the real-world data. We provide both (i) source code to train SDG models from new data, and (ii) trained models that reflect real-world datasets.
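The arrival and departure modeling described above can be sketched in a few lines. This is a simplified illustration, not the authors' SDG: arrivals come from exponential inter-arrival times as stated, but connection times are drawn from a fixed two-component Gaussian mixture that does not depend on arrival time, whereas the article conditions on it. The mixture weights, means, standard deviations, and the mean inter-arrival gap are hypothetical values.

```python
import numpy as np


def simulate_sessions(n_sessions, mean_gap_hours, rng):
    """Return (arrivals, departures) in hours since the start of the simulation."""
    # Exponential inter-arrival times; the cumulative sum gives arrival times.
    gaps = rng.exponential(scale=mean_gap_hours, size=n_sessions)
    arrivals = np.cumsum(gaps)

    # Hypothetical mixture: short top-up sessions and long parked sessions.
    weights = np.array([0.6, 0.4])
    means = np.array([2.0, 9.0])   # hours
    stds = np.array([0.5, 1.5])    # hours
    component = rng.choice(2, size=n_sessions, p=weights)
    durations = rng.normal(means[component], stds[component])
    durations = np.clip(durations, 0.1, None)  # keep connection times positive

    departures = arrivals + durations
    return arrivals, departures


rng = np.random.default_rng(7)
arrivals, departures = simulate_sessions(n_sessions=5, mean_gap_hours=0.5, rng=rng)
for a, d in zip(arrivals, departures):
    print(f"arrive {a:6.2f} h, depart {d:6.2f} h")
```

A closer approximation of the article's approach would fit the mixture to observed sessions and make its parameters a function of the arrival time.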


An empirical analysis of dealing with patients who are lost to follow-up when developing prognostic models using a cohort design

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01408-x ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Jenna M. Reps ◽

Peter Rijnbeek ◽

Alana Cuthbert ◽

Patrick B. Ryan ◽

Nicole Pratt ◽

...

Keyword(s):

At Risk ◽

Real World ◽

Synthetic Data ◽

Real World Data ◽

Test Set ◽

Loss To Follow Up ◽

World Data ◽

Cohort Design ◽

Lost To Follow Up

Background: Researchers developing prediction models are faced with numerous design choices that may impact model performance. One key decision is how to include patients who are lost to follow-up. In this paper we perform a large-scale empirical evaluation investigating the impact of this decision. In addition, we aim to provide guidelines for how to deal with loss to follow-up. Methods: We generate a partially synthetic dataset with complete follow-up and simulate loss to follow-up based either on random selection or on selection based on comorbidity. In addition to our synthetic data study we investigate 21 real-world data prediction problems. We compare four simple strategies for developing models when using a cohort design that encounters loss to follow-up. Three strategies employ a binary classifier with data that: (1) include all patients (including those lost to follow-up), (2) exclude all patients lost to follow-up or (3) only exclude patients lost to follow-up who do not have the outcome before being lost to follow-up. The fourth strategy uses a survival model with data that include all patients. We empirically evaluate the discrimination and calibration performance. Results: The partially synthetic data study results show that excluding patients who are lost to follow-up can introduce bias when loss to follow-up is common and does not occur at random. However, when loss to follow-up was completely at random, the choice of addressing it had negligible impact on model discrimination performance. Our empirical real-world data results showed that the four design choices investigated to deal with loss to follow-up resulted in comparable performance when the time-at-risk was 1-year but demonstrated differential bias when we looked into 3-year time-at-risk. Removing patients who are lost to follow-up before experiencing the outcome but keeping patients who are lost to follow-up after the outcome can bias a model and should be avoided. Conclusion: Based on this study we therefore recommend (1) developing models using data that include patients who are lost to follow-up and (2) evaluating the discrimination and calibration of models twice: on a test set including patients lost to follow-up and on a test set excluding patients lost to follow-up.
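The three binary-classifier cohort strategies compared in this abstract differ only in which patients enter the training data. The sketch below makes that selection explicit. The `Patient` record, its field names, and the day-based timing are assumptions for illustration; the study's actual data model is not given in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Patient:
    pid: int
    outcome_day: Optional[int]  # day the outcome was observed, None if never observed
    lost_day: Optional[int]     # day lost to follow-up, None if follow-up was complete


def is_lost(p: Patient) -> bool:
    return p.lost_day is not None


def has_outcome_before_loss(p: Patient) -> bool:
    return is_lost(p) and p.outcome_day is not None and p.outcome_day <= p.lost_day


def include_all(patients: List[Patient]) -> List[Patient]:
    """Strategy 1: keep everyone, including patients lost to follow-up."""
    return list(patients)


def exclude_all_lost(patients: List[Patient]) -> List[Patient]:
    """Strategy 2: drop every patient lost to follow-up."""
    return [p for p in patients if not is_lost(p)]


def exclude_lost_without_outcome(patients: List[Patient]) -> List[Patient]:
    """Strategy 3: drop lost patients unless the outcome occurred before the loss."""
    return [p for p in patients if not is_lost(p) or has_outcome_before_loss(p)]


cohort = [
    Patient(1, outcome_day=None, lost_day=None),  # complete follow-up, no outcome
    Patient(2, outcome_day=40, lost_day=None),    # complete follow-up, outcome
    Patient(3, outcome_day=None, lost_day=100),   # lost, no outcome observed
    Patient(4, outcome_day=30, lost_day=200),     # outcome, then lost
]
print([p.pid for p in include_all(cohort)])
print([p.pid for p in exclude_all_lost(cohort)])
print([p.pid for p in exclude_lost_without_outcome(cohort)])
```

Strategy 3 keeps patient 4 but drops patient 3, which is the asymmetric selection the Results section reports can bias a model.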


Analysis of cause-effect inference by comparing regression errors

PeerJ Computer Science ◽

10.7717/peerj-cs.169 ◽

2019 ◽

Vol 5 ◽

pp. e169 ◽

Cited By ~ 5

Author(s):

Patrick Blöbaum ◽

Dominik Janzing ◽

Takashi Washio ◽

Shohei Shimizu ◽

Bernhard Schölkopf

Keyword(s):

Causal Inference ◽

Least Squares ◽

Real World ◽

Causal Relation ◽

Data Sets ◽

Noise Distribution ◽

Real World Data ◽

World Data ◽

Causal Direction ◽

Inference Methods

We address the problem of inferring the causal direction between two variables by comparing the least-squares errors of the predictions in both possible directions. Under the assumption of independence between the function relating cause and effect, the conditional noise distribution, and the distribution of the cause, we show that the errors are smaller in the causal direction if both variables are equally scaled and the causal relation is close to deterministic. Based on this, we provide an easily applicable algorithm that only requires a regression in both possible causal directions and a comparison of the errors. The performance of the algorithm is compared with various related causal inference methods in different artificial and real-world data sets.
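The procedure described above (rescale both variables, regress in both directions, compare the errors) can be sketched as follows. The choice of min-max scaling to [0, 1], the cubic polynomial regressor, and the function names are assumptions for illustration; the abstract does not specify the regression model the authors use.

```python
import numpy as np


def rescale(v):
    """Min-max scale to [0, 1] so that both variables are equally scaled."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min())


def regression_mse(predictor, target, degree=3):
    """Mean squared error of a least-squares polynomial fit of target on predictor."""
    coeffs = np.polyfit(predictor, target, degree)
    residuals = target - np.polyval(coeffs, predictor)
    return float(np.mean(residuals**2))


def infer_direction(x, y, degree=3):
    """Return 'x->y' if predicting y from x gives the smaller error, else 'y->x'."""
    xs, ys = rescale(x), rescale(y)
    err_xy = regression_mse(xs, ys, degree)
    err_yx = regression_mse(ys, xs, degree)
    return "x->y" if err_xy < err_yx else "y->x"


# Toy example: a nearly deterministic nonlinear relation with x as the cause.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=2000)
y = x**3 + 0.02 * rng.normal(size=2000)
print(infer_direction(x, y))
```

The toy relation is nearly deterministic, which is the regime in which the abstract states that the error is smaller in the causal direction.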
