Evaluating the utility of synthetic COVID-19 case data

JAMIA Open ◽  
2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Khaled El Emam ◽  
Lucy Mosquera ◽  
Elizabeth Jonker ◽  
Harpreet Sood

Abstract Background Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy-protective manner. Objectives To evaluate the utility of synthetic data by comparing analysis results between real and synthetic data. Methods A gradient boosted classification tree was built to predict death using Ontario’s 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy and relationships were evaluated, as well as privacy risks. The same model was developed on a synthesized dataset and compared with the model built from the original data. Results The AUROC and AUPRC for the real data model were 0.945 [95% confidence interval (CI), 0.941–0.948] and 0.34 (95% CI, 0.313–0.368), respectively. The synthetic data model had an AUROC and AUPRC of 0.94 (95% CI, 0.936–0.944) and 0.313 (95% CI, 0.286–0.342), with confidence interval overlaps of 45.05% and 52.02% when compared with the real data. The most important predictors of death for both the real and synthetic models were, in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two datasets. The attribute disclosure risk was 0.0585, and the membership disclosure risk was low. Conclusions This synthetic dataset could be used as a proxy for the real dataset.
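The utility comparison above rests on two quantities: the AUROC of each model and the overlap between the real-data and synthetic-data confidence intervals. A minimal, library-free sketch of both follows; the CI-overlap formula shown (intersection length averaged over each interval's width) is one common definition and an assumption, since the abstract does not spell out the exact formula used.

```python
import numpy as np

def _ranks(a):
    """Average ranks (1-based); tied values share their mean rank."""
    order = np.argsort(a, kind="mergesort")
    ranks = np.empty(len(a), dtype=float)
    ranks[order] = np.arange(1, len(a) + 1)
    sorted_a = a[order]
    i = 0
    while i < len(a):
        j = i
        while j + 1 < len(a) and sorted_a[j + 1] == sorted_a[i]:
            j += 1
        ranks[order[i:j + 1]] = ranks[order[i:j + 1]].mean()
        i = j + 1
    return ranks

def auroc(y_true, scores):
    """AUROC computed via the Mann-Whitney U statistic."""
    r = _ranks(scores)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    u = r[pos].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def ci_overlap(ci_a, ci_b):
    """Fractional overlap of two confidence intervals: intersection
    length as a fraction of each interval's width, averaged."""
    inter = max(0.0, min(ci_a[1], ci_b[1]) - max(ci_a[0], ci_b[0]))
    return 0.5 * (inter / (ci_a[1] - ci_a[0]) + inter / (ci_b[1] - ci_b[0]))
```

For example, `ci_overlap((0.941, 0.948), (0.936, 0.944))` quantifies how much the real and synthetic AUROC intervals above agree.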

2021 ◽  
Vol 40 (3) ◽  
pp. 1-12
Author(s):  
Hao Zhang ◽  
Yuxiao Zhou ◽  
Yifei Tian ◽  
Jun-Hai Yong ◽  
Feng Xu

Reconstructing hand-object interactions is a challenging task due to strong occlusions and complex motions. This article proposes a real-time system that uses a single depth stream to simultaneously reconstruct hand poses, object shape, and rigid/non-rigid motions. To achieve this, we first train a joint learning network to segment the hand and object in a depth image and to predict the 3D keypoints of the hand. With most layers shared between the two tasks, computation cost is reduced, enabling real-time performance. A hybrid dataset is constructed to train the network with real data (to learn real-world distributions) and synthetic data (to cover variations of objects, motions, and viewpoints). Next, the depths of the two targets and the keypoints are used in a unified optimization to reconstruct the interacting motions. Benefiting from a novel tangential contact constraint, the system not only resolves the remaining ambiguities but also maintains real-time performance. Experiments show that our system handles different hand and object shapes, various interactive motions, and moving cameras.
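The tangential contact constraint is described only at a high level; one schematic reading is that it penalizes relative tangential motion at hand-object contact points (a sticking contact should have zero slip in the tangent plane). A toy numpy sketch under that assumption, not the paper's actual optimization term:

```python
import numpy as np

def tangential_contact_residual(rel_vel, normal):
    """Tangential slip at a contact point: remove the component of the
    relative velocity along the surface normal; what remains should
    vanish for a sticking contact and can be penalized in an energy."""
    n = normal / np.linalg.norm(normal)
    v_normal = np.dot(rel_vel, n) * n
    return rel_vel - v_normal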


Geophysics ◽  
1990 ◽  
Vol 55 (9) ◽  
pp. 1166-1182 ◽  
Author(s):  
Irshad R. Mufti

Finite‐difference seismic models are commonly set up in 2-D space. Such models must be excited by a line source, which leads to amplitudes different from those in real data, which are commonly generated from a point source. Moreover, there is no provision for any out‐of‐plane events. These problems can be eliminated by using 3-D finite‐difference models. The fundamental strategy in designing efficient 3-D models is to minimize computational work without sacrificing accuracy. This was accomplished by using a (4,2) differencing operator which ensures the accuracy of much larger operators but requires many fewer numerical operations as well as significantly reduced manipulation of data in the computer memory. Such a choice also simplifies the problem of evaluating the wave field near the subsurface boundaries of the model where large operators cannot be used. We also exploited the fact that, unlike the real data, the synthetic data are free from ambient noise; consequently, one can retain sufficient resolution in the results by optimizing the frequency content of the source signal. Further computational efficiency was achieved by using the concept of the exploding reflector, which yields zero‐offset seismic sections without the need to evaluate the wave field for individual shot locations. These considerations opened up the possibility of carrying out a complete synthetic 3-D survey on a supercomputer to investigate the seismic response of a large‐scale structure located in Oklahoma. The analysis of results done on a geophysical workstation provides new insight regarding the role of interference and diffraction in the interpretation of seismic data.
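The (4,2) differencing operator mentioned above is fourth-order accurate in space and second-order accurate in time. A minimal 1-D sketch of one such time step is shown below; it is illustrative only (the paper's models are 3-D, with absorbing boundaries and source terms omitted here).

```python
import numpy as np

def step_wave_1d(u_prev, u_curr, c, dt, dx):
    """One time step of the 1-D constant-velocity wave equation with a
    (4,2) scheme: 4th-order central differences in space, 2nd-order
    leapfrog in time.  Boundary points are left untouched."""
    lap = np.zeros_like(u_curr)
    # 4th-order second-derivative stencil: (-1, 16, -30, 16, -1) / 12
    lap[2:-2] = (-u_curr[:-4] + 16 * u_curr[1:-3] - 30 * u_curr[2:-2]
                 + 16 * u_curr[3:-1] - u_curr[4:]) / (12 * dx**2)
    return 2 * u_curr - u_prev + (c * dt)**2 * lap
```

The wider stencil buys accuracy comparable to much longer low-order operators, which is the trade-off the abstract describes, while the time step must still respect the CFL stability limit.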


Geophysics ◽  
2014 ◽  
Vol 79 (1) ◽  
pp. M1-M10 ◽  
Author(s):  
Leonardo Azevedo ◽  
Ruben Nunes ◽  
Pedro Correia ◽  
Amílcar Soares ◽  
Luis Guerreiro ◽  
...  

Due to the nature of seismic inversion problems, there are multiple possible solutions that can equally fit the observed seismic data while diverging from the real subsurface model. Consequently, it is important to assess how inverse-impedance models are converging toward the real subsurface model. For this purpose, we evaluated a new methodology that combines the multidimensional scaling (MDS) technique with an iterative geostatistical elastic seismic inversion algorithm. The geostatistical inversion algorithm inverted partial angle stacks directly for acoustic and elastic impedance (AI and EI) models. It was based on a genetic algorithm in which the model perturbation at each iteration was performed using stochastic sequential simulation. To assess the reliability and convergence of the inverted models at each step, the simulated models can be projected into a metric space computed by MDS. This projection allowed distinguishing similar models from variable ones and assessing the convergence of inverted models toward the real impedance models. The geostatistical inversion results of a synthetic dataset, in which the real AI and EI models are known, were plotted in this metric space along with the known impedance models. We applied the same principle to a real dataset using a cross-validation technique. These examples revealed that MDS is a valuable tool for evaluating the convergence of the inverse methodology and the variability of the impedance models across iterations of the inversion process. In particular, the geostatistical inversion algorithm we evaluated retrieves reliable impedance models while still producing a set of simulated models with considerable variability.
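The projection of simulated models into a metric space can be illustrated with classical (Torgerson) MDS, which embeds points given only their pairwise distances. The choice of this specific MDS variant is an assumption for illustration; the abstract does not fix one. A small numpy sketch:

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed points into k dimensions from their pairwise distance
    matrix D (classical/Torgerson MDS via double centering)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D**2) @ J                 # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                  # eigenvalues ascending
    idx = np.argsort(w)[::-1][:k]             # keep k largest
    scale = np.sqrt(np.clip(w[idx], 0, None))
    return V[:, idx] * scale
```

In the paper's setting, each inverted impedance model (flattened to a vector) would be one point; the distance of each iteration's cloud to the projected true-model point visualizes convergence.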


Author(s):  
Mehmet Niyazi Çankaya

Systematic sampling is used as a method to obtain quantitative results from tissues and radiological images. Systematic sampling on the real line (R) is a very attractive method frequently consulted by practitioners in biomedical imaging. For systematic sampling on R, the measurement function (MF) is obtained by slicing the three-dimensional object equidistantly and systematically. If the parameter q of the MF is estimated to be small enough in terms of mean square error, important remarks can be made for design-based stereology. This study is an extension of [17], and an exact calculation method is proposed to calculate the constant λ(q,N) of the confidence interval in systematic sampling. In the results, synthetic data support the results from real data. The covariogram model currently used in the variance approximation, proposed by [28,29], is tested for different measurement functions to assess its performance in estimating the variance of systematic sampling on R. The exact value of the constant λ(q,N) is examined for the different measurement functions as well.


Author(s):  
Mehmet Niyazi Çankaya

Systematic sampling is used as a method to obtain quantitative results from tissues and radiological images. Systematic sampling on the real line (R) is a very attractive method frequently consulted by practitioners in biomedical imaging. For systematic sampling on R, the measurement function (MF) is obtained by slicing the three-dimensional object equidistantly and systematically. The covariogram model currently used in the variance approximation, proposed by [28,29], is tested for different measurement functions in a class to assess its performance in estimating the variance of systematic sampling on R. This study is an extension of [17], and an exact calculation method is proposed to calculate the constant λ(q,N) of the confidence interval in systematic sampling. The exact value of the constant λ(q,N) is examined for the different measurement functions as well. As a result, it is observed from the simulation that the proposed MF should be used to check the performance of the variance approximation and of the constant λ(q,N). Synthetic data support the results from real data.
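Systematic sampling on R with a measurement function amounts to summing the MF at equally spaced points with a random start in the first period (the Cavalieri estimator of its integral). A minimal sketch under that standard formulation; the function and argument names are illustrative, not from the paper:

```python
import numpy as np

def cavalieri_estimate(f, start, period, lo, hi):
    """Systematic-sampling (Cavalieri) estimate of the integral of a
    measurement function f over [lo, hi]: period times the sum of f at
    the systematic points lo+start, lo+start+period, ...  The start is
    assumed uniformly random in [0, period) for unbiasedness."""
    xs = np.arange(lo + start, hi, period)
    return period * np.sum(f(xs))
```

Variance approximation for this estimator is where the covariogram model of [28,29] enters: the error depends on the covariogram of f sampled at lags that are multiples of the period.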


PLoS ONE ◽  
2021 ◽  
Vol 16 (11) ◽  
pp. e0260308
Author(s):  
Mauro Castelli ◽  
Luca Manzoni ◽  
Tatiane Espindola ◽  
Aleš Popovič ◽  
Andrea De Lorenzo

Wireless networks are among the fundamental technologies used to connect people. Considering the constant advancements in the field, telecommunication operators must guarantee a high-quality service to retain their customer portfolio. To ensure this high-quality service, it is common to establish partnerships with specialized technology companies that deliver software services to monitor the networks and identify faults and their respective solutions. A common barrier faced by these specialized companies is the lack of data to develop and test their products. This paper investigates the use of generative adversarial networks (GANs), which are state-of-the-art generative models, for generating synthetic telecommunication data related to Wi-Fi signal quality. We developed, trained, and compared two of the most widely used GAN architectures: the vanilla GAN and the Wasserstein GAN (WGAN). Both models presented satisfactory results and were able to generate synthetic data similar to the real data. In particular, the distribution of the synthetic data overlaps the distribution of the real data for all of the considered features. Moreover, the generative models reproduce the associations observed among the real features. We chose the WGAN as the final model, but both models are suitable for addressing the problem at hand.
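The claim that synthetic and real feature distributions overlap can be quantified with a simple overlap coefficient over binned data. This particular metric is an assumption for illustration; the paper may rely on visual comparison or other statistics.

```python
import numpy as np

def histogram_overlap(real, synthetic, bins=20):
    """Overlap coefficient between two empirical distributions: the sum
    over shared bins of min(p_real, p_synth).  1.0 means identical
    histograms, 0.0 means fully disjoint support."""
    lo = min(real.min(), synthetic.min())
    hi = max(real.max(), synthetic.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synthetic, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return np.minimum(p, q).sum()
```

Computing this per feature gives one number per column of the Wi-Fi dataset, which is an easy way to rank which features a GAN reproduces well.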


Geophysics ◽  
2006 ◽  
Vol 71 (5) ◽  
pp. G211-G223 ◽  
Author(s):  
Lasse Amundsen ◽  
Lars Løseth ◽  
Rune Mittet ◽  
Svein Ellingsrud ◽  
Bjørn Ursin

This paper gives a unified treatment of electromagnetic (EM) field decomposition into upgoing and downgoing components for conductive and nonconductive media, where the electromagnetic data are measured on a plane in which the electric permittivity, magnetic permeability, and electrical conductivity are known constants with respect to space and time. Above and below the plane of measurement, the medium can be arbitrarily inhomogeneous and anisotropic. In particular, the proposed decomposition theory applies to marine EM, low-frequency data acquired for hydrocarbon mapping, where it is the upgoing components of the recorded field, guided and refracted from the reservoir, that are of interest for the interpretation. The direct-source field, the refracted airwave induced by the source, the reflected field from the sea surface, and most magnetotelluric noise traveling downward just below the seabed are field components that are considered to be noise in electromagnetic measurements. The viability and validity of the decomposition method are demonstrated using modeled and real marine EM data, also termed seabed logging (SBL) data. The synthetic data are simulated in a model that is fairly representative of the geologic area where the real SBL data were collected. The results from the synthetic data study are therefore used to assist in the interpretation of the real data from an area with [Formula: see text] water depth above a known gas province offshore Norway. The effect of the airwave is seen clearly in the measured data. After field decomposition just below the seabed, the upgoing component of the recorded electric field has almost linear phase, indicating that most of the effect of the airwave component has been removed.
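At the measurement plane, up/down decomposition of a plane-wave field can be expressed through the local impedance relating electric and magnetic components. The scalar 1-D form and sign convention below are deliberately simplified illustrative assumptions; the paper's operators act on multicomponent fields in wavenumber domain.

```python
import numpy as np

def updown_split(E, H, Z):
    """Schematic 1-D up/down split of a plane-wave EM field at a level
    where the medium impedance Z is a known constant.  A field with
    E = Z*H is purely downgoing under this convention; the upgoing and
    downgoing parts always sum back to the total field E."""
    up = 0.5 * (E - Z * H)
    down = 0.5 * (E + Z * H)
    return up, down
```

The practical point mirrors the abstract: the downgoing part collects the source field, airwave, and sea-surface reflection, while the upgoing part retains the reservoir response.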


2020 ◽  
Vol 12 (5) ◽  
pp. 771 ◽  
Author(s):  
Miguel Angel Ortíz-Barrios ◽  
Ian Cleland ◽  
Chris Nugent ◽  
Pablo Pancardo ◽  
Eric Järpe ◽  
...  

Automatic detection and recognition of Activities of Daily Living (ADL) are crucial for providing effective care to frail older adults living alone. A step forward in addressing this challenge is the deployment of smart home sensors capturing the intrinsic nature of ADLs performed by these people. As the real-life scenario is characterized by a comprehensive range of ADLs and smart home layouts, deviations are expected in the number of sensor events per activity (SEPA), a variable often used for training activity recognition models. Such models, however, rely on the availability of suitable and representative data, whose collection is habitually expensive and resource-intensive. Simulation tools are an alternative for tackling these barriers; nonetheless, an ongoing challenge is their ability to generate synthetic data representing the real SEPA. Hence, this paper proposes the use of Poisson regression modelling for transforming simulated data into a better approximation of real SEPA. First, synthetic and real data were compared to verify the equivalence hypothesis. Then, several Poisson regression models were formulated for estimating real SEPA using simulated data. The outcomes revealed that real SEPA can be better approximated (R²pred = 92.72%) if synthetic data are post-processed through Poisson regression incorporating dummy variables.
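Poisson regression of the kind used above (count outcome, log link, dummy variables for categorical effects) can be fit by Newton-Raphson. A self-contained numpy sketch; the design matrix and data are illustrative, not the paper's.

```python
import numpy as np

def poisson_regression(X, y, iters=50):
    """Fit a log-linear Poisson model y ~ Poisson(exp(X @ beta)) by
    Newton-Raphson on the log-likelihood.  X should include an
    intercept column; 0/1 dummy columns encode categorical effects."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta)                 # fitted means
        grad = X.T @ (y - mu)                 # score vector
        hess = X.T @ (X * mu[:, None])        # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta
```

In the paper's setting, y would be the real SEPA counts and the columns of X would carry the simulated SEPA plus dummy variables for activity type or layout.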


10.2196/16492 ◽  
2020 ◽  
Vol 8 (2) ◽  
pp. e16492 ◽  
Author(s):  
Anat Reiner Benaim ◽  
Ronit Almog ◽  
Yuri Gorelik ◽  
Irit Hochberg ◽  
Laila Nassar ◽  
...  

Background Privacy restrictions limit access to protected patient-derived health information for research purposes. Consequently, data anonymization is required to allow researchers data access for initial analysis before granting institutional review board approval. A system installed and activated at our institution enables synthetic data generation that mimics data from real electronic medical records, wherein only fictitious patients are listed. Objective This paper aimed to validate the results obtained when analyzing synthetic structured data for medical research. A comprehensive validation process concerning meaningful clinical questions and various types of data was conducted to assess the accuracy and precision of statistical estimates derived from synthetic patient data. Methods A cross-hospital project was conducted to validate results obtained from synthetic data produced for five contemporary studies on various topics. For each study, results derived from synthetic data were compared with those based on real data. In addition, repeatedly generated synthetic datasets were used to estimate the bias and stability of results obtained from synthetic data. Results This study demonstrated that results derived from synthetic data were predictive of results from real data. When the number of patients was large relative to the number of variables used, highly accurate and strongly consistent results were observed between synthetic and real data. For studies based on smaller populations that accounted for confounders and modifiers by multivariate models, predictions were of moderate accuracy, yet clear trends were correctly observed. Conclusions The use of synthetic structured data provides a close estimate to real data results and is thus a powerful tool for shaping research hypotheses and obtaining estimated analyses without risking patient privacy. Synthetic data enable broad access to data (eg, for out-of-organization researchers) and rapid, safe, and repeatable analysis of data in hospitals or other health organizations where patient privacy is a primary value.
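Estimating bias and stability from repeatedly generated synthetic datasets reduces to comparing the spread of synthetic-data estimates with the real-data estimate. A minimal sketch; the exact summary statistics used in the study are not specified, so mean bias and replicate standard deviation are assumptions here.

```python
import numpy as np

def bias_and_stability(real_estimate, synthetic_estimates):
    """Bias (mean synthetic estimate minus the real-data estimate) and
    stability (sample standard deviation across synthetic replicates)
    of a statistic computed on repeatedly generated synthetic data."""
    synth = np.asarray(synthetic_estimates, dtype=float)
    bias = synth.mean() - real_estimate
    stability = synth.std(ddof=1)
    return bias, stability
```

Any statistic works as the estimate: an odds ratio, a regression coefficient, or a prevalence, computed once on the real dataset and once per synthetic replicate.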


2020 ◽  
Vol 13 (12) ◽  
pp. 6361-6381
Author(s):  
Marisol Monterrubio-Velasco ◽  
F. Ramón Zúñiga ◽  
Quetzalcoatl Rodríguez-Pérez ◽  
Otilio Rojas ◽  
Armando Aguilar-Meléndez ◽  
...  

Abstract. Seismicity and magnitude distributions are fundamental for seismic hazard analysis. The Mexican subduction margin along the Pacific Coast is one of the most active seismic zones in the world, which makes it an optimal region for observation and experimentation analyses. Some remarkable seismicity features have been observed in a subvolume of this subduction region, suggesting that the observed simplicity of earthquake sources arises from the rupturing of single asperities. This subregion has been named SUB3 in a recent seismotectonic regionalization of Mexico. In this work, we numerically test this hypothesis using the TREMOL (sTochastic Rupture Earthquake MOdeL) v0.1.0 code. As test cases, we choose four of the most significant recent events (6.5 < Mw < 7.8) that occurred in the Guerrero–Oaxaca region (SUB3) during the period 1988–2018 and whose associated seismic histories are well recorded in the regional catalogs. Synthetic seismicity results show a reasonable fit to the real data, which improves as the available data from the real events increase. These results support the hypothesis that single-asperity ruptures are a distinctive feature controlling seismicity in SUB3. Moreover, a fault aspect-ratio sensitivity analysis is carried out to study how the synthetic seismicity varies. Our results indicate that asperity shape is an important modeling parameter controlling the frequency–magnitude distribution of synthetic data. Therefore, TREMOL provides appropriate means for modeling complex seismicity curves, such as those observed in the SUB3 region, highlighting its usefulness as a tool to shed additional light on the earthquake process.
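Frequency-magnitude distributions of the kind TREMOL reproduces are conventionally summarized by the Gutenberg-Richter b-value, which Aki's maximum-likelihood estimator gives directly from a magnitude catalog. Using this particular estimator is an assumption for illustration; the paper compares full distributions.

```python
import numpy as np

def gutenberg_richter_b(mags, m_min):
    """Maximum-likelihood b-value (Aki's estimator) of the
    Gutenberg-Richter frequency-magnitude law, using only events with
    magnitude >= the completeness threshold m_min."""
    m = np.asarray(mags, dtype=float)
    m = m[m >= m_min]
    return np.log10(np.e) / (m.mean() - m_min)
```

Comparing the b-value of a synthetic catalog with that of the regional catalog is a quick one-number check before comparing the full curves.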

