A Spatial Biosurveillance Synthetic Data Generator in R

Author(s):  
Drew Levin ◽  
Patrick Finley

Objective: To develop a spatially accurate biosurveillance synthetic data generator for the testing, evaluation, and comparison of new outbreak detection techniques.

Introduction: Development of new methods for the rapid detection of emerging disease outbreaks is a research priority in the field of biosurveillance. Because real-world data are often proprietary in nature, scientists must utilize synthetic data generation methods to evaluate new detection methodologies. Colizza et al. have shown that epidemic spread is dependent on the airline transportation network [1], yet current data generators do not operate over network structures. Here we present a new spatial data generator that models the spread of contagion across a network of cities connected by airline routes. The generator is developed in the R programming language and produces data compatible with the popular 'surveillance' software package.

Methods: Colizza et al. demonstrate the power-law relationships between city population, air traffic, and degree distribution [1]. We generate a transportation network as a Chung-Lu random graph [2] that preserves these scale-free relationships (Figure 1). First, given a power-law exponent and a desired number of cities, a probability mass function (PMF) is generated that mirrors the expected degree distribution for the given power-law relationship. Values are then sampled from this PMF to generate an expected degree (number of connected cities) for each city in the network. Edges (airline connections) are added to the network probabilistically as described in [2]. Unconnected graph components are each joined to the largest component using linear preferential attachment. Finally, city sizes are calculated based on an observed three-quarter power-law scaling relationship with the sampled degree distribution. Each city is represented as a customizable stochastic compartmental SIR model. Transportation between cities is modeled similarly to [2]. An infection is initialized in a single random city and infection counts are recorded in each city for a fixed period of time. A consistent fraction of the modeled infection cases are recorded as daily clinic visits. These counts are then added onto statically generated baseline data for each city to produce a full synthetic data set. Alternatively, data sets can be generated using real-world networks, such as the one maintained by the International Air Transport Association.

Results: Dynamics such as the number of cities, degree distribution power-law exponent, traffic flow, and disease kinetics can be customized. In the presented example (Figure 2) the outbreak spreads over a 20-city transportation network. Infection spreads rapidly once the more populated hub cities are infected. Cities that are multiple flights away from the initially infected city are infected late in the process. The generator is capable of creating data sets of arbitrary size, length, and connectivity to better mirror a diverse set of observed network types.

Conclusions: New computational methods for outbreak detection and surveillance must be compared to established approaches. Outbreak mitigation strategies require a realistic model of human transportation behavior to best evaluate impact. These actions require test data that accurately reflect the complexity of the real-world data they would be applied to. The outbreak data generated here reflect the complexity of modern transportation networks and are made to be easily integrated with established software packages to allow for rapid testing and deployment.

Figure 1: Randomly generated scale-free transportation network with a power-law degree exponent of λ = 1.8. City and link sizes are scaled to reflect their weight.

Figure 2: An example of observed daily outbreak-related clinic visits across a randomly generated network of 20 cities. Each city is colored by the number of flights required to reach the city from the initial infection location. These generated counts are then added onto baseline data to create a synthetic data set for experimentation.

Keywords: Simulation; Network; Spatial; Synthetic; Data
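A minimal sketch of the two main ingredients the abstract describes, written in Python rather than R and using illustrative parameter values (number of cities, power-law exponent, SIR rates, clinic-visit fraction) that are not taken from the paper; it omits inter-city travel and the baseline data and is not the authors' generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def chung_lu_network(n_cities=20, gamma=1.8, max_degree=None):
    """Sample expected degrees from a power-law PMF and place edges
    with Chung-Lu probabilities p_ij = min(w_i * w_j / sum(w), 1)."""
    max_degree = max_degree or n_cities - 1
    degrees = np.arange(1, max_degree + 1)
    pmf = degrees.astype(float) ** (-gamma)
    pmf /= pmf.sum()
    w = rng.choice(degrees, size=n_cities, p=pmf).astype(float)  # expected degrees
    total = w.sum()
    adj = np.zeros((n_cities, n_cities), dtype=bool)
    for i in range(n_cities):
        for j in range(i + 1, n_cities):
            if rng.random() < min(w[i] * w[j] / total, 1.0):
                adj[i, j] = adj[j, i] = True
    # City populations follow a 3/4 power-law scaling with degree (illustrative constant).
    population = (1e5 * w ** 0.75).astype(int)
    return adj, population

def stochastic_sir(pop, days=60, beta=0.4, gamma_rec=0.2, seed_infected=10):
    """Binomial-chain stochastic SIR for a single city; returns daily new infections."""
    s, i, r = pop - seed_infected, seed_infected, 0
    new_cases = []
    for _ in range(days):
        p_inf = 1.0 - np.exp(-beta * i / pop)
        inf_today = rng.binomial(s, p_inf)
        rec_today = rng.binomial(i, gamma_rec)
        s, i, r = s - inf_today, i + inf_today - rec_today, r + rec_today
        new_cases.append(inf_today)
    return np.array(new_cases)

adj, pop = chung_lu_network()
cases = stochastic_sir(pop[0])
clinic_visits = rng.binomial(cases, 0.1)   # a fixed fraction of cases reach clinics
print(adj.sum() // 2, "edges;", clinic_visits.sum(), "clinic visits in the seeded city")
```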

2011 ◽  
Vol 2011 ◽  
pp. 1-14 ◽  
Author(s):  
Chunzhong Li ◽  
Zongben Xu

The structure of a data set is of critical importance in identifying clusters, especially differences in density. In this paper, we present a clustering algorithm based on density consistency: a filtering process that identifies points sharing the same structural feature and classifies them into the same cluster. The method is not restricted by cluster shape or by high-dimensional data, and it is robust to noise and outliers. Extensive experiments on synthetic and real-world data sets validate the proposed clustering algorithm.
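The abstract does not spell out the density-consistency procedure itself; as a point of reference only, the sketch below runs a standard density-based method (DBSCAN) on a non-convex synthetic set with added noise, the kind of setting the paper targets. It is not the authors' algorithm.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons plus uniform noise: non-convex clusters with outliers.
X, _ = make_moons(n_samples=400, noise=0.06, random_state=0)
noise = np.random.default_rng(0).uniform(-1.5, 2.5, size=(40, 2))
X = np.vstack([X, noise])

labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("points flagged as noise:", int(np.sum(labels == -1)))
```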


Author(s):  
Zhenchen Wang ◽  
Puja Myles ◽  
Anu Jain ◽  
James L. Keidel ◽  
Roberto Liddi ◽  
...  

2021 ◽  
Vol 39 (15_suppl) ◽  
pp. e18725-e18725
Author(s):  
Ravit Geva ◽  
Barliz Waissengrin ◽  
Dan Mirelman ◽  
Felix Bokstein ◽  
Deborah T. Blumenthal ◽  
...  

e18725 Background: Healthcare data sharing is important for the creation of diverse and large data sets, supporting clinical decision making and accelerating efficient research to improve patient outcomes. This is especially vital for real-world data analysis. However, stakeholders are reluctant to share their data without ensuring patients' privacy and proper protection of their data sets and of the ways they are being used. Homomorphic encryption is a cryptographic capability that can address these issues by enabling computation on encrypted data without ever decrypting it, so analytic results are obtained without revealing the raw data. The aim of this study is to demonstrate the accuracy of the analytic results and the practical efficiency of the technology. Methods: A real-world data set of colorectal cancer patients' survival data following two different treatment interventions, comprising 623 patients and 24 variables (14,952 data items), was encrypted using leveled homomorphic encryption implemented in the PALISADE software library. Statistical analysis of key oncological endpoints was blindly performed on both the raw data and the homomorphically encrypted data using descriptive statistics and survival analysis with Kaplan-Meier curves. Results were then compared with an accuracy goal of two decimal places. Results: For all variables analyzed, the difference between the raw-data results and the homomorphically encrypted results was within the pre-determined accuracy goal; these results, together with the practical efficiency of the encrypted computation measured by run time, are presented in the table. Conclusions: This study demonstrates that data encrypted with homomorphic encryption can be statistically analyzed with a precision of at least two decimal places, allowing safe drawing of clinical conclusions while preserving patients' privacy and protecting data owners' data assets. Homomorphic encryption allows efficient computation on encrypted data non-interactively and without requiring decryption during computation. Utilizing the technology will empower large-scale cross-institution and cross-stakeholder collaboration, allowing safe international collaborations. Clinical trial information: 0048-19-TLV. [Table: see text]
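The study used leveled homomorphic encryption from PALISADE (a C++ library); as a self-contained stand-in, the sketch below uses a textbook additively homomorphic Paillier scheme with toy-sized keys to show the core idea being validated: a statistic (here a mean of made-up survival times) computed entirely on ciphertexts matches the raw-data result to two decimal places. The key sizes, data, and the scheme itself are illustrative assumptions, not the study's setup.

```python
import math
import random

# Toy Paillier keypair (tiny primes for illustration only; real use needs >= 2048-bit keys).
p, q = 104723, 104729
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # inverse of L(g^lambda mod n^2) mod n

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

def add_encrypted(c1, c2):
    # Multiplying Paillier ciphertexts adds the underlying plaintexts.
    return (c1 * c2) % n2

# Made-up survival times in days (stand-in for the real 623-patient data set).
survival_days = [412, 388, 901, 267, 555, 730, 149, 623]

ciphertexts = [encrypt(d) for d in survival_days]
enc_sum = ciphertexts[0]
for c in ciphertexts[1:]:
    enc_sum = add_encrypted(enc_sum, c)

mean_raw = sum(survival_days) / len(survival_days)
mean_enc = decrypt(enc_sum) / len(survival_days)
print(f"raw mean: {mean_raw:.2f}  encrypted-domain mean: {mean_enc:.2f}")
assert round(mean_raw, 2) == round(mean_enc, 2)
```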


2005 ◽  
Vol 42 (03) ◽  
pp. 839-850 ◽  
Author(s):  
Zsolt Katona

Consider the random graph model of Barabási and Albert, where we add a new vertex in every step and connect it to some old vertices with probabilities proportional to their degrees. If we connect it to only one of the old vertices then this will be a tree. These graphs have been shown to have a power-law degree distribution, the same as that observed in some large real-world networks. We are interested in the width of the tree and we show that it is at the nth step; this also holds for a slight generalization of the model with another constant. We then see how this theoretical result can be applied to directory trees.
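A short simulation of the tree model described here, assuming the simplest version in which every new vertex attaches to exactly one existing vertex with probability proportional to its degree; it records the width, i.e. the maximum number of vertices on any level, which is the quantity the paper studies (the asymptotic formula itself is not reproduced here).

```python
import random
from collections import Counter

def ba_tree_width(n, seed=0):
    """Grow a Barabasi-Albert tree to n vertices and return its width."""
    random.seed(seed)
    depth = {0: 0, 1: 1}          # start with a single edge: root 0 and its child 1
    attach_pool = [0, 1]          # each vertex appears once per unit of degree
    for v in range(2, n):
        parent = random.choice(attach_pool)   # degree-proportional choice
        depth[v] = depth[parent] + 1
        attach_pool.extend([parent, v])
    level_sizes = Counter(depth.values())
    return max(level_sizes.values())

for n in (1000, 10000, 100000):
    print(n, ba_tree_width(n))
```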



2016 ◽  
Vol 28 (12) ◽  
pp. 2687-2725 ◽  
Author(s):  
Ken Takano ◽  
Hideitsu Hino ◽  
Shotaro Akaho ◽  
Noboru Murata

This study considers the common situation in data analysis when there are few observations of the distribution of interest (the target distribution), while abundant observations are available from auxiliary distributions. In this situation, it is natural to compensate for the lack of data from the target distribution by using data sets from these auxiliary distributions—in other words, approximating the target distribution in a subspace spanned by a set of auxiliary distributions. Mixture modeling is one of the simplest ways to integrate information from the target and auxiliary distributions in order to express the target distribution as accurately as possible. There are two typical mixtures in the context of information geometry: the m- and e-mixtures. The m-mixture is applied in a variety of research fields because of the presence of the well-known expectation-maximization algorithm for parameter estimation, whereas the e-mixture is rarely used because of its difficulty of estimation, particularly for nonparametric models. The e-mixture, however, is a well-tempered distribution that satisfies the principle of maximum entropy. To model a target distribution with scarce observations accurately, this letter proposes a novel framework for nonparametric modeling of the e-mixture and a geometrically inspired estimation algorithm. As numerical examples of the proposed framework, a transfer learning setup is considered. The experimental results show that this framework works well for three types of synthetic data sets, as well as an EEG real-world data set.
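For reference, the two mixtures can be written as follows, using the standard information-geometric definitions (the original abstract's formula placeholders were lost in extraction); $p_i$ are the component densities and $w_i \ge 0$, $\sum_i w_i = 1$ are mixing weights.

```latex
% m-mixture: ordinary (arithmetic) mixture of the component densities
p_m(x) = \sum_{i} w_i \, p_i(x)

% e-mixture: log-linear (geometric) mixture, renormalized by b(w)
\log p_e(x) = \sum_{i} w_i \log p_i(x) - b(w),
\qquad b(w) = \log \int \prod_{i} p_i(x)^{w_i} \, dx
```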


2020 ◽  
Vol 267 (S1) ◽  
pp. 185-196
Author(s):  
J. Gerb ◽  
S. A. Ahmadi ◽  
E. Kierig ◽  
B. Ertl-Wagner ◽  
M. Dieterich ◽  
...  

Background: Objective and volumetric quantification is a necessary step in the assessment and comparison of endolymphatic hydrops (ELH) results. Here, we introduce a novel tool for automatic volumetric segmentation of the endolymphatic space (ELS) for ELH detection in delayed intravenous gadolinium-enhanced magnetic resonance imaging of the inner ear (iMRI) data. Methods: The core component is a novel algorithm based on Volumetric Local Thresholding (VOLT). The study included three different data sets: a real-world data set (D1) to develop the novel ELH detection algorithm, and two validating data sets, one artificial (D2) and one entirely unseen prospective real-world data set (D3). D1 included 210 inner ears of 105 patients (50 male; mean age 50.4 ± 17.1 years), and D3 included 20 inner ears of 10 patients (5 male; mean age 46.8 ± 14.4 years) with episodic vertigo attacks of different etiology. D1 and D3 did not differ significantly concerning age, gender, grade of ELH, or data quality. As an artificial data set, D2 provided a known ground truth and consisted of an 8-bit cuboid volume using the same voxel size and grid as the real-world data, with differently sized cylindrical and cuboid-shaped cutouts (signal) whose grayscale values matched the real-world data set D1 (mean 68.7 ± 7.8; range 48.9–92.8). The evaluation included segmentation accuracy using the Sørensen-Dice overlap coefficient and segmentation precision by comparing the volume of the ELS. Results: VOLT resulted in a high level of performance and accuracy in comparison with the respective gold standard. In the case of the artificial data set, VOLT outperformed the gold standard at higher noise levels. Data processing steps are fully automated and run without further user input in less than 60 s. ELS volume measured by automatic segmentation correlated significantly with the clinical grading of the ELS (p < 0.01). Conclusion: VOLT enables open-source, reproducible, reliable, and automatic volumetric quantification of the inner ear's fluid space using MR volumetric assessment of endolymphatic hydrops. This tool constitutes an important step towards comparable and systematic big-data analyses of the ELS in patients with the frequent syndrome of episodic vertigo attacks. A generic version of our three-dimensional thresholding algorithm has been made available to the scientific community via GitHub as an ImageJ plugin.
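The published VOLT algorithm is not reproduced in the abstract; the sketch below only illustrates, on a random toy volume, the two generic building blocks it mentions: a 3D local (moving-window) threshold to produce a segmentation mask, and the Sørensen-Dice coefficient used to score it against a ground-truth mask. The window size, offset, and toy data are arbitrary assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_threshold_3d(volume, window=9, offset=5.0):
    """Mark a voxel as foreground if it exceeds the local mean in a window by `offset`."""
    local_mean = uniform_filter(volume.astype(float), size=window)
    return volume > local_mean + offset

def dice(mask_a, mask_b):
    """Sorensen-Dice overlap between two binary masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum())

# Toy 8-bit-like volume: bright cuboid "signal" on a noisy background (cf. data set D2).
rng = np.random.default_rng(0)
vol = rng.normal(68.7, 7.8, size=(64, 64, 64))
truth = np.zeros_like(vol, dtype=bool)
truth[20:40, 20:40, 20:40] = True
vol[truth] += 25.0

pred = local_threshold_3d(vol)
print(f"Dice overlap: {dice(pred, truth):.3f}")
```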


Energies ◽  
2020 ◽  
Vol 13 (16) ◽  
pp. 4211
Author(s):  
Manu Lahariya ◽  
Dries F. Benoit ◽  
Chris Develder

Electric vehicle (EV) charging stations have become prominent in electricity grids in the past few years. Their increased penetration introduces both challenges and opportunities: they contribute to increased load, but also offer flexibility potential, e.g., in deferring the load in time. To analyze such scenarios, realistic EV data are required, which are hard to come by. Therefore, in this article we define a synthetic data generator (SDG) for EV charging sessions based on a large real-world dataset. Arrival times of EVs are modeled assuming that the inter-arrival times of EVs follow an exponential distribution. The connection time of an EV depends on its arrival time and can be described using a conditional probability distribution. This distribution is estimated using Gaussian mixture models, and departure times can be calculated by sampling connection times for EV arrivals from this distribution. Our SDG is based on a novel method for the temporal modeling of EV sessions and jointly models the arrival and departure times of EVs for a large number of charging stations. Our SDG was trained using real-world EV sessions and used to generate synthetic samples of session data, which were statistically indistinguishable from the real-world data. We provide both (i) source code to train SDG models from new data, and (ii) trained models that reflect real-world datasets.
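A compact sketch of the generative structure described above, with all numeric parameters invented for illustration: exponential inter-arrival times give arrival instants, and connection (parking) durations are drawn from a Gaussian mixture whose component weights depend on whether the arrival falls in working hours, a crude stand-in for the paper's arrival-conditioned mixture fitted to real sessions.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_sessions(n_sessions=1000, mean_interarrival_h=0.5):
    """Generate (arrival_hour, connection_hours, departure_hour) tuples."""
    inter_arrivals = rng.exponential(mean_interarrival_h, size=n_sessions)
    arrivals = np.cumsum(inter_arrivals)                 # hours since start
    hour_of_day = arrivals % 24.0

    sessions = []
    for arr, hod in zip(arrivals, hour_of_day):
        # Two-component Gaussian mixture for connection time, conditioned (crudely)
        # on arrival hour: daytime arrivals are mostly short "office" sessions,
        # evening arrivals are mostly long overnight sessions.
        if 7.0 <= hod < 19.0:
            weights, means, stds = [0.8, 0.2], [4.0, 12.0], [1.0, 2.0]
        else:
            weights, means, stds = [0.2, 0.8], [4.0, 12.0], [1.0, 2.0]
        k = rng.choice(2, p=weights)
        duration = max(rng.normal(means[k], stds[k]), 0.25)   # at least 15 minutes
        sessions.append((arr, duration, arr + duration))
    return np.array(sessions)

sessions = sample_sessions()
print("mean connection time (h):", sessions[:, 1].mean().round(2))
```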


2008 ◽  
Vol DMTCS Proceedings vol. AI,... (Proceedings) ◽  
Author(s):  
Michael Drmota ◽  
Bernhard Gittenberger ◽  
Alois Panholzer

We develop a combinatorial structure to serve as a model of random real-world networks. Starting with plane-oriented recursive trees, we substitute the nodes with more complex graphs. In this way we obtain graphs that have a global tree-like structure while looking clustered locally. This fits with observations obtained from real-world networks. In particular, we show that the resulting graphs are scale-free; that is, the degree distribution has an asymptotic power law.
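A rough sketch of the construction, under simplifying assumptions not stated in the abstract: a plane-oriented recursive tree is grown by attaching each new node into one of the free positions (so a node with c children is chosen with probability proportional to c + 1), and each tree node is then substituted by a small clique (here a triangle) so the graph is locally clustered while keeping its global tree shape.

```python
import random
from collections import Counter

random.seed(1)

def port_tree(n):
    """Plane-oriented recursive tree: a node with c children is chosen w.p. proportional to c + 1."""
    parent = {}
    pool = [0]                       # node 0 starts with one free position
    for v in range(1, n):
        u = random.choice(pool)
        parent[v] = u
        pool.append(u)               # u gained a child, hence one more free position
        pool.append(v)               # v itself has one free position
    return parent

def substitute_triangles(parent, n):
    """Replace every tree node by a 3-clique; add one inter-clique edge per tree edge."""
    edges = set()
    for v in range(n):
        a, b, c = 3 * v, 3 * v + 1, 3 * v + 2
        edges |= {(a, b), (a, c), (b, c)}
    for v, u in parent.items():
        edges.add((3 * u, 3 * v))
    return edges

n = 5000
parent = port_tree(n)
edges = substitute_triangles(parent, n)
deg = Counter()
for a, b in edges:
    deg[a] += 1
    deg[b] += 1
dist = Counter(deg.values())
print("degree -> count (upper tail):", sorted(dist.items())[-5:])
```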

