Causal Queries from Observational Data in Biological Systems via Bayesian Networks: An Empirical Study in Small Networks

Causal Datasheet for Datasets: An Evaluation Guide for Real-World Data Analysis and Data Collection Design Using Bayesian Networks

Frontiers in Artificial Intelligence ◽

10.3389/frai.2021.612551 ◽

2021 ◽

Vol 4 ◽

Author(s):

Bradley Butcher ◽

Vincent S. Huang ◽

Christopher Robinson ◽

Jeremy Reffin ◽

Sema K. Sgaier ◽

...

Keyword(s):

Global Health ◽

Bayesian Networks ◽

Sample Size ◽

Observational Data ◽

Real World ◽

Structure Learning ◽

Ground Truth ◽

Research Process ◽

Real World Data ◽

Real World Datasets

Developing data-driven solutions that address real-world problems requires understanding these problems’ causes and how their interaction affects the outcome, often with only observational data. Causal Bayesian Networks (BNs) have been proposed as a powerful method for discovering and representing causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Low- and Middle-Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications confidence in their results generally remains inadequate. This is partly because the results cannot be validated against a ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate the idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, we developed a tool that generates synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made depending on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process.
First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
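
The workflow described in this abstract (generate a synthetic network, sample a dataset from it, then score a learned structure against the known ground truth) can be illustrated with a minimal Python sketch. This is not the authors' tool: the function names (`random_dag`, `random_cpts`, `sample_rows`, `shd`), the restriction to binary variables, and the uniform random conditional probabilities are all assumptions made for illustration, and no structure learning algorithm is included.

```python
import random
from itertools import combinations, product


def random_dag(n_nodes, edge_prob, rng):
    """Draw a random DAG: shuffle the nodes, then only allow edges that
    point from an earlier to a later position in the shuffled order."""
    order = list(range(n_nodes))
    rng.shuffle(order)
    edges = set()
    for i, j in combinations(range(n_nodes), 2):
        if rng.random() < edge_prob:
            edges.add((order[i], order[j]))
    return order, edges


def random_cpts(n_nodes, edges, rng):
    """Assumed parameterization: every node is binary and
    P(node = 1 | parent configuration) is drawn uniformly from [0, 1)."""
    parents = {v: sorted(u for (u, w) in edges if w == v) for v in range(n_nodes)}
    cpts = {
        v: {cfg: rng.random() for cfg in product((0, 1), repeat=len(pa))}
        for v, pa in parents.items()
    }
    return parents, cpts


def sample_rows(order, parents, cpts, n_rows, rng):
    """Ancestral sampling: visit nodes in topological order so that each
    node's parents already have values when the node is sampled."""
    rows = []
    for _ in range(n_rows):
        row = [0] * len(order)
        for v in order:
            cfg = tuple(row[p] for p in parents[v])
            row[v] = 1 if rng.random() < cpts[v][cfg] else 0
        rows.append(row)
    return rows


def shd(true_edges, learned_edges):
    """Structural Hamming distance: number of node pairs whose edge status
    (absent, u->v, or v->u) differs; a reversed edge counts once."""
    def state(edges, pair):
        u, v = sorted(pair)
        if (u, v) in edges:
            return 1
        if (v, u) in edges:
            return -1
        return 0

    pairs = {frozenset(e) for e in true_edges | learned_edges}
    return sum(state(true_edges, p) != state(learned_edges, p) for p in pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    order, edges = random_dag(n_nodes=6, edge_prob=0.3, rng=rng)
    parents, cpts = random_cpts(6, edges, rng)
    data = sample_rows(order, parents, cpts, n_rows=500, rng=rng)
    # A real study would pass `data` to a structure learning algorithm here;
    # the empty graph below is only a stand-in for its output.
    print("true edges:", sorted(edges))
    print("SHD of empty graph vs truth:", shd(edges, set()))
```

Repeating such a loop over several sample sizes and network shapes, with an actual structure learner in place of the stand-in, would give the kind of expected-performance figures that the abstract says populate a Causal Datasheet.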

Modeling and analysis of disease and risk factors through learning Bayesian networks from observational data

Quality and Reliability Engineering International ◽

10.1002/qre.893 ◽

2008 ◽

Vol 24 (3) ◽

pp. 291-302 ◽

Cited By ~ 13

Author(s):

Jing Li ◽

Jianjun Shi ◽

Devin Satz

Keyword(s):

Risk Factors ◽

Bayesian Networks ◽

Observational Data ◽

Modeling And Analysis

Research Note—Toward a Causal Interpretation from Observational Data: A New Bayesian Networks Method for Structural Models with Latent Variables

Information Systems Research ◽

10.1287/isre.1080.0224 ◽

2010 ◽

Vol 21 (2) ◽

pp. 365-391 ◽

Cited By ~ 17

Author(s):

Zhiqiang (Eric) Zheng ◽

Paul A. Pavlou

Keyword(s):

Bayesian Networks ◽

Observational Data ◽

Latent Variables ◽

Structural Models ◽

Research Note ◽

Causal Interpretation

Knowledge discovery from observational data for process control using causal Bayesian networks

IIE Transactions ◽

10.1080/07408170600899532 ◽

2007 ◽

Vol 39 (6) ◽

pp. 681-690 ◽

Cited By ~ 45

Author(s):

Jing Li ◽

Jianjun Shi

Keyword(s):

Process Control ◽

Bayesian Networks ◽

Knowledge Discovery ◽

Observational Data ◽

Causal Bayesian Networks

Revealing Structure of Complex Biological Systems Using Bayesian Networks

Network Science ◽

10.1007/978-1-84996-396-1_9 ◽

2010 ◽

pp. 185-204 ◽

Cited By ~ 1

Author(s):

V. Anne Smith

Keyword(s):

Bayesian Networks ◽

Biological Systems

An Empirical Study of Massively Parallel Bayesian Networks Learning for Sentiment Extraction from Unstructured Text

Web Technologies and Applications - Lecture Notes in Computer Science ◽

10.1007/978-3-642-20291-9_47 ◽

2011 ◽

pp. 424-435 ◽

Cited By ~ 6

Author(s):

Wei Chen ◽

Lang Zong ◽

Weijing Huang ◽

Gaoyan Ou ◽

Yue Wang ◽

...

Keyword(s):

Empirical Study ◽

Bayesian Networks ◽

Massively Parallel ◽

Unstructured Text

Bayesian Network Approach to Estimate Gene Networks

Medical Informatics ◽

10.4018/978-1-60566-050-9.ch173 ◽

2011 ◽

pp. 2281-2305

Author(s):

Seiya Imoto

Keyword(s):

Bayesian Networks ◽

Observational Data ◽

Microarray Data ◽

Gene Networks ◽

Gene Network ◽

Directed Graphs ◽

Biological Data ◽

Biological Information ◽

Continuous Variables ◽

Genome Wide Data

In cells, genes interact with each other, and this system can be viewed as a directed graph. A gene network is a graphical representation of transcriptional relations between genes, and the problem of estimating gene networks from genome-wide data, such as DNA microarray gene expression data, is one of the important issues in bioinformatics and systems biology. Here, we present a statistical method based on Bayesian networks to estimate gene networks from microarray data and other biological data. Because microarray data are measured as continuous variables and the relationships between genes are usually nonlinear, we combine Bayesian networks and nonparametric regression to handle continuous variables and nonlinear relations. Most parts of gene networks are still unknown, and we need to estimate them from observational data. This problem is equivalent to the structural learning of Bayesian networks, and we solve it using a Bayesian approach. The main difficulty of gene network estimation is the number of genes involved in the network, which leads to model overfitting to observational data such as microarray data. Hence, a combination of various kinds of biological data is a key technique to estimate accurate gene networks. We show a general framework to combine microarray data and other biological information to estimate gene networks.

Bayesian Network Approach to Estimate Gene Networks

Bayesian Network Technologies ◽

10.4018/978-1-59904-141-4.ch013 ◽

2007 ◽

pp. 269-299

Author(s):

Seiya Imoto ◽

Satoru Miyano

Keyword(s):

Bayesian Networks ◽

Observational Data ◽

Microarray Data ◽

Gene Networks ◽

Gene Network ◽

Directed Graphs ◽

Biological Data ◽

Biological Information ◽

Continuous Variables ◽

Genome Wide Data

In cells, genes interact with each other, and this system can be viewed as a directed graph. A gene network is a graphical representation of transcriptional relations between genes, and the problem of estimating gene networks from genome-wide data, such as DNA microarray gene expression data, is one of the important issues in bioinformatics and systems biology. Here, we present a statistical method based on Bayesian networks to estimate gene networks from microarray data and other biological data. Because microarray data are measured as continuous variables and the relationships between genes are usually nonlinear, we combine Bayesian networks and nonparametric regression to handle continuous variables and nonlinear relations. Most parts of gene networks are still unknown, and we need to estimate them from observational data. This problem is equivalent to the structural learning of Bayesian networks, and we solve it using a Bayesian approach. The main difficulty of gene network estimation is the number of genes involved in the network, which leads to model overfitting to observational data such as microarray data. Hence, a combination of various kinds of biological data is a key technique to estimate accurate gene networks. We show a general framework to combine microarray data and other biological information to estimate gene networks.
