Causal Datasheet for Datasets: An Evaluation Guide for Real-World Data Analysis and Data Collection Design Using Bayesian Networks

2021, Vol 4
Author(s): Bradley Butcher, Vincent S. Huang, Christopher Robinson, Jeremy Reffin, Sema K. Sgaier, ...

Developing data-driven solutions that address real-world problems requires understanding the causes of these problems and how their interactions affect the outcome, often with only observational data. Causal Bayesian Networks (BNs) have been proposed as a powerful method for discovering and representing the causal relationships in observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for global health research in low- and middle-income countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partly due to the inability to validate against a ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate the idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed that can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made depending on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process.
First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
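
The idea of generating a synthetic Bayesian network together with a matching synthetic dataset can be illustrated with a minimal Python sketch. This is not the authors' tool: the function names, the restriction to binary variables, the edge probability, and the uniformly random conditional probabilities are all assumptions made for illustration.

```python
import itertools
import random


def random_dag(n_nodes, edge_prob, rng):
    """Return parents[j] for each node j.

    Edges only run from a lower to a higher index, so the graph is acyclic.
    """
    return {j: [i for i in range(j) if rng.random() < edge_prob]
            for j in range(n_nodes)}


def random_cpts(parents, rng):
    """For each binary node, map every parent configuration to P(node = 1)."""
    cpts = {}
    for node, pa in parents.items():
        configs = itertools.product((0, 1), repeat=len(pa))
        cpts[node] = {cfg: rng.random() for cfg in configs}
    return cpts


def sample_dataset(parents, cpts, n_rows, rng):
    """Ancestral sampling: draw each node after its parents (index order)."""
    order = sorted(parents)
    rows = []
    for _ in range(n_rows):
        row = {}
        for node in order:
            cfg = tuple(row[p] for p in parents[node])
            row[node] = 1 if rng.random() < cpts[node][cfg] else 0
        rows.append([row[node] for node in order])
    return rows


rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
parents = random_dag(n_nodes=5, edge_prob=0.4, rng=rng)
cpts = random_cpts(parents, rng)
data = sample_dataset(parents, cpts, n_rows=1000, rng=rng)
```

Because the generating DAG (`parents`) is known, a structure learning algorithm run on `data` could be scored against it, and repeating this for several values of `n_rows` would give a rough picture of how performance depends on sample size.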

Author(s): Hao Zhang, Liangxiao Jiang, Wenqiang Xu

Crowdsourcing services provide a fast, efficient, and cost-effective means of obtaining large amounts of labeled data for supervised learning. Ground truth inference, also called label integration, designs aggregation strategies to infer the unknown true label of each instance from the multiple noisy label set provided by ordinary crowd workers. However, to the best of our knowledge, nearly all existing label integration methods focus solely on the multiple noisy label set of each individual instance while ignoring the intercorrelation among the multiple noisy label sets of different instances. To solve this problem, a multiple noisy label distribution propagation (MNLDP) method is proposed in this study. MNLDP first transforms the multiple noisy label set of each instance into its multiple noisy label distribution and then propagates this distribution to the instance's nearest neighbors. Consequently, each instance absorbs a fraction of the multiple noisy label distributions of its nearest neighbors while retaining a fraction of its own original multiple noisy label distribution. Promising experimental results on simulated and real-world datasets validate the effectiveness of the proposed method.
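
The two steps described above (label set to label distribution, then propagation to nearest neighbors) can be sketched in Python as follows. This is a simplified illustration, not the published MNLDP algorithm: the Euclidean distance, the neighborhood size `k`, the retention fraction `keep`, the uniform neighbor weights, and the single propagation round are all assumptions.

```python
import math


def label_distribution(noisy_labels, n_classes):
    """Turn one instance's noisy label set into normalized vote fractions."""
    counts = [0] * n_classes
    for label in noisy_labels:
        counts[label] += 1
    return [c / len(noisy_labels) for c in counts]


def nearest_neighbors(features, i, k):
    """Indices of the k instances closest to instance i (Euclidean distance)."""
    ranked = sorted((math.dist(features[i], features[j]), j)
                    for j in range(len(features)) if j != i)
    return [j for _, j in ranked[:k]]


def propagate(features, label_sets, n_classes, k=2, keep=0.5):
    """Mix each instance's own distribution with the mean of its neighbors'."""
    dists = [label_distribution(ls, n_classes) for ls in label_sets]
    mixed = []
    for i in range(len(features)):
        nbrs = nearest_neighbors(features, i, k)
        absorbed = [sum(dists[j][c] for j in nbrs) / len(nbrs)
                    for c in range(n_classes)]
        mixed.append([keep * dists[i][c] + (1 - keep) * absorbed[c]
                      for c in range(n_classes)])
    return mixed


def integrate(dist):
    """Pick the class with the largest probability mass."""
    return max(range(len(dist)), key=dist.__getitem__)


# Two well-separated clusters; instance 1 has a misleading majority vote.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
label_sets = [[0, 0, 1], [0, 1, 1], [0, 0, 0],
              [1, 1, 1], [1, 1, 0], [1, 0, 1]]
propagated = propagate(features, label_sets, n_classes=2)
labels = [integrate(d) for d in propagated]
```

In this toy input, majority voting alone would assign instance 1 to class 1, whereas after absorbing the distributions of its two neighbors it is assigned to class 0, in line with the rest of its cluster.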


Author(s): Narayan Puthanmadam Subramaniyam, Reik V. Donner, Davide Caron, Gabriella Panuccio, Jari Hyttinen

Abstract: Identifying causal relationships is a challenging yet crucial problem in many fields of science like epidemiology, climatology, ecology, genomics, economics and neuroscience, to mention only a few. Recent studies have demonstrated that ordinal partition transition networks (OPTNs) allow inferring the coupling direction between two dynamical systems. In this work, we generalize this concept to the study of the interactions among multiple dynamical systems and we propose a new method to detect causality in multivariate observational data. By applying this method to numerical simulations of coupled linear stochastic processes as well as two examples of interacting nonlinear dynamical systems (coupled Lorenz systems and a network of neural mass models), we demonstrate that our approach can reliably identify the direction of interactions and the associated coupling delays. Finally, we study real-world observational microelectrode array electrophysiology data from rodent brain slices to identify the causal coupling structures underlying epileptiform activity. Our results, both from simulations and real-world data, suggest that OPTNs can provide a complementary and robust approach to infer causal effect networks from multivariate observational data.
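
As background for readers unfamiliar with OPTNs, the following Python sketch builds the transition counts of an ordinal partition transition network for a single time series. It shows only this univariate building block and is not the multivariate causality method proposed in the paper; the function names and the default embedding dimension and delay are assumptions.

```python
from collections import Counter


def ordinal_pattern(window):
    """Indices of the window's values in ascending order (ties by position)."""
    return tuple(sorted(range(len(window)), key=window.__getitem__))


def transition_counts(series, dim=3, delay=1):
    """Count transitions between consecutive ordinal patterns of a series."""
    span = (dim - 1) * delay  # distance from first to last point of a window
    patterns = [
        ordinal_pattern([series[t + k * delay] for k in range(dim)])
        for t in range(len(series) - span)
    ]
    # Each pattern is a network node; each consecutive pair is a directed edge.
    return Counter(zip(patterns, patterns[1:]))


rising = transition_counts([1, 2, 3, 4, 5, 6])
zigzag = transition_counts([1, 3, 2, 4, 3, 5])
```

A monotonically rising series visits a single pattern and therefore yields only one self-loop, while the zigzag series alternates between two patterns.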


Author(s): Lei Feng, Bo An

Partial label learning deals with the problem where each training instance is assigned a set of candidate labels, only one of which is correct. This paper provides the first attempt to leverage the idea of self-training for dealing with partially labeled examples. Specifically, we propose a unified formulation with proper constraints to train the desired model and perform pseudo-labeling jointly. For pseudo-labeling, unlike traditional self-training that manually differentiates the ground-truth label with sufficiently high confidence, we introduce the maximum infinity norm regularization on the modeling outputs to automatically achieve this desideratum, which results in a convex-concave optimization problem. We show that optimizing this convex-concave problem is equivalent to solving a set of quadratic programming (QP) problems. By proposing an upper-bound surrogate objective function, we reduce this to solving only one QP problem, which improves optimization efficiency. Extensive experiments on synthesized and real-world datasets demonstrate that the proposed approach significantly outperforms the state-of-the-art partial label learning approaches.


Author(s): Andreas Horner, Otto C Burghuber, Sylvia Hartl, Michael Studnicka, Monika Merkle, ...

Algorithms, 2021, Vol 14 (2), pp. 57
Author(s): Ryan Feng, Yu Yao, Ella Atkins

Autonomous vehicles require fleet-wide data collection for continuous algorithm development and validation. The smart black box (SBB) intelligent event data recorder has been proposed as a system for prioritized high-bandwidth data capture. This paper extends the SBB by applying anomaly detection and action detection methods for generalized event-of-interest (EOI) detection. An updated SBB pipeline is proposed for the real-time capture of driving video data. A video dataset is constructed to evaluate the SBB on real-world data for the first time. SBB performance is assessed by comparing the compression of normal and anomalous data and by comparing our prioritized data recording with a FIFO strategy. The results show that SBB data compression can increase the anomalous-to-normal memory ratio by ∼25%, while the prioritized recording strategy increases the anomalous-to-normal count ratio when compared to a FIFO strategy. We compare the real-world dataset SBB results to a baseline SBB given ground-truth anomaly labels and conclude that improved general EOI detection methods will greatly improve SBB performance.
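
The comparison between prioritized recording and a FIFO strategy can be illustrated with a small Python sketch of two fixed-capacity buffers. It is not the SBB implementation: the per-frame anomaly scores, the buffer capacity, and the rule of evicting the lowest-scoring frame are assumptions chosen for illustration.

```python
import heapq
from collections import deque


def record_fifo(stream, capacity):
    """Keep only the most recent frames; the oldest frame is evicted first."""
    buffer = deque(maxlen=capacity)
    for frame in stream:
        buffer.append(frame)
    return list(buffer)


def record_prioritized(stream, capacity):
    """Keep the highest-scoring frames; the lowest score is evicted first."""
    heap = []  # min-heap of (score, arrival index, is_anomalous)
    for idx, (score, is_anomalous) in enumerate(stream):
        item = (score, idx, is_anomalous)
        if len(heap) < capacity:
            heapq.heappush(heap, item)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, item)
    return [(score, is_anomalous) for score, _, is_anomalous in heap]


def anomalous_to_normal(frames):
    """Ratio of anomalous to normal frames among the recorded frames."""
    anomalous = sum(1 for _, is_anomalous in frames if is_anomalous)
    normal = len(frames) - anomalous
    return anomalous / normal if normal else float("inf")


# Hypothetical stream of (anomaly score, ground-truth anomaly flag) frames.
stream = [(0.9, True), (0.1, False), (0.8, True),
          (0.2, False), (0.1, False), (0.3, False)]
fifo_ratio = anomalous_to_normal(record_fifo(stream, capacity=4))
prioritized_ratio = anomalous_to_normal(record_prioritized(stream, capacity=4))
```

On this toy stream the FIFO buffer loses the early anomalous frame and ends with one anomalous frame for every three normal ones, while the prioritized buffer retains both anomalous frames and ends with a ratio of 1.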


2019, Vol 106 (1), pp. 57-59
Author(s): Jeffrey S. Barrett, Penny M. Heaton

2021, Vol 59 (1), pp. 127-138
Author(s): Sollip Kim, Jeonghyun Chang, Soo-Kyung Kim, Sholhui Park, Jungwon Huh, ...

Abstract. Objectives: To maintain the consistency of laboratory test results, between-reagent lot variation should be verified before using new reagent lots in the clinical laboratory. Although the Clinical and Laboratory Standards Institute (CLSI) document EP26-A deals with this issue, evaluation of reagent lot-to-lot difference is challenging in practice. We aim to investigate a practical way of determining between-reagent lot variation using real-world data in clinical chemistry. Methods: The CLSI EP26-A protocol was applied to 83 chemistry tests in three clinical labs. Three criteria were used to define the critical difference (CD) of each test: reference change value and total allowable error, which are based on biological variation, and acceptable limits set by external quality assurance agencies. The sample size and rejection limits that could detect the CD between reagent lots were determined. Results: For more than half of the chemistry tests, reagent lot-to-lot differences could be evaluated using only one patient sample per decision level. In many cases, the rejection limit that could detect a reagent lot-to-lot difference with ≥90% probability was 0.6 times the CD. However, the sample size and rejection limits vary depending on how the CD is defined. In some cases, impractical sample sizes or rejection limits were obtained. In some cases, information on the sample size and rejection limit that met the intended statistical power was not found in EP26-A. Conclusions: The CLSI EP26-A did not provide all necessary answers. Alternative practical approaches are suggested when CLSI EP26-A does not provide guidance.


Author(s): Shuji Hao, Peilin Zhao, Yong Liu, Steven C. H. Hoi, Chunyan Miao

Relative similarity learning (RSL) aims to learn similarity functions from data with relative constraints. Most previous algorithms developed for RSL are batch-based learning approaches, which scale poorly when dealing with real-world data arriving sequentially. These methods are often designed to learn a single similarity function for a specific task, so they may be sub-optimal for multi-task learning problems. To overcome these limitations, we propose a scalable RSL framework named OMTRSL (Online Multi-Task Relative Similarity Learning). Specifically, we first develop a simple yet effective online learning algorithm for multi-task relative similarity learning. Then, we also propose an active learning algorithm to save labeling cost. The proposed algorithms not only enjoy theoretical guarantees, but also show high efficacy and efficiency in extensive experiments on real-world datasets.

