Spot the difference: comparing results of analyses from real patient data and synthetic derivatives

JAMIA Open ◽  
2020 ◽  
Author(s):  
Randi E Foraker ◽  
Sean C Yu ◽  
Aditi Gupta ◽  
Andrew P Michelson ◽  
Jose A Pineda Soto ◽  
...  

Abstract
Background: Synthetic data may provide a solution for researchers who wish to generate and share data in support of precision healthcare. Recent advances in data synthesis enable the creation and analysis of synthetic derivatives as if they were the original data; this process has significant advantages over data deidentification.
Objectives: To assess a big-data platform with data-synthesizing capabilities (MDClone Ltd., Beer Sheva, Israel) for its ability to produce data that can be used for research purposes while obviating privacy and confidentiality concerns.
Methods: We explored three use cases and tested the robustness of synthetic data by comparing the results of analyses using synthetic derivatives to analyses using the original data, applying traditional statistics, machine learning approaches, and spatial representations of the data. The use cases were designed to conduct analyses at the level of individual observations (Use Case 1), patient cohorts (Use Case 2), and populations (Use Case 3).
Results: For each use case, the results of the analyses were sufficiently statistically similar (P > 0.05) between the synthetic derivative and the real data to support the same conclusions.
Discussion and Conclusion: This article presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in clinical research for faster insights and improved data sharing in support of precision healthcare.
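
As a hedged illustration of the observation-level comparison described above (the abstract reports only that results were statistically similar, P > 0.05, without naming the test), a minimal sketch using a two-sample Kolmogorov-Smirnov test on one variable might look as follows; the column name and data are placeholders, not values from the paper.

    # Minimal sketch: compare one variable's distribution between real and
    # synthetic data with a two-sample Kolmogorov-Smirnov test.
    # The arrays below are stand-ins, not data from the paper.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    real_age = rng.normal(62, 15, size=1000)           # stand-in for a real EHR column
    synthetic_age = rng.normal(62.3, 15.2, size=1000)  # stand-in for its synthetic derivative

    stat, p_value = ks_2samp(real_age, synthetic_age)
    # P > 0.05 fails to reject that both samples share a distribution, the
    # criterion the abstract uses for "sufficiently statistically similar".
    print(f"KS statistic = {stat:.3f}, P = {p_value:.3f}")
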

2021 ◽  
Author(s):  
Randi Foraker ◽  
Aixia Guo ◽  
Jason Thomas ◽  
Noa Zamstein ◽  
Philip R.O. Payne ◽  
...  

Background: Synthetic data can be used by collaborators to generate and share data in support of answering critical research questions about the COVID-19 pandemic. Computationally derived ("synthetic") data enable the creation and analysis of clinical, laboratory, and diagnostic data as if they were the original electronic health record (EHR) data.
Objectives: To compare the results of analyses using synthetic derivatives to analyses using the original data downloaded from a big-data platform with data-synthesizing capabilities (MDClone Ltd., Beer Sheva, Israel), in order to assess the strengths and limitations of leveraging computationally derived data for research purposes.
Methods: We used the National COVID Cohort Collaborative's (N3C) instance of MDClone, comprising EHR data from 34 N3C institutional partners. We tested three use cases: (1) exploring the distributions of key features of the COVID-positive cohort; (2) training and testing predictive models for assessing the risk of admission among these patients; and (3) determining geospatial and temporal COVID-related measures and outcomes and constructing their respective epidemic curves. We compared the results of analyses using synthetic derivatives to analyses using the original data using traditional statistics, machine learning approaches, and temporal and spatial representations of the data.
Results: For each use case, the results of the synthetic data analyses successfully mimicked those of the original data: the distributions of the data were similar, and the predictive models demonstrated comparable performance. While the synthetic and original data yielded nearly the same results overall, there were exceptions, including an odds ratio falling on either side of the null in multivariable analyses (0.97 versus 1.01) and epidemic curves constructed for zip codes with low population counts.
Discussion and Conclusion: This paper presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in collaborative research for faster insights.
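
A hedged sketch of how an odds-ratio discrepancy like the 0.97-versus-1.01 example above could surface: fit the same multivariable logistic model to real and synthetic records and compare exponentiated coefficients. The covariates and data below are invented placeholders, not the N3C feature set.

    # Sketch: compare odds ratios from the same logistic model fit to real
    # versus synthetic records. All data here are placeholders.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)

    def fit_odds_ratios(X, y):
        model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
        return np.exp(model.params[1:])  # odds ratios, skipping the intercept

    # Placeholder covariates (e.g., age, a comorbidity flag) and an admission outcome.
    X_real = np.column_stack([rng.normal(60, 12, 2000), rng.integers(0, 2, 2000)])
    y_real = rng.integers(0, 2, 2000)
    X_syn = X_real + rng.normal(0, 0.1, X_real.shape)  # crude stand-in for a synthetic derivative
    y_syn = y_real.copy()

    print("real ORs:", fit_odds_ratios(X_real, y_real))
    print("synthetic ORs:", fit_odds_ratios(X_syn, y_syn))
    # A near-null predictor can land just below 1.0 in one dataset and just
    # above it in the other, exactly the kind of exception the abstract notes.
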


Mathematics ◽  
2020 ◽  
Vol 9 (1) ◽  
pp. 71
Author(s):  
Pablo Bonilla-Escribano ◽  
David Ramírez ◽  
Alejandro Porras-Segovia ◽  
Antonio Artés-Rodríguez

Variability is defined as the propensity of a given signal to change. There are many choices for measuring variability, and it is not generally known which ones offer better properties. This paper compares different variability metrics applied to irregularly (nonuniformly) sampled time series, which have important clinical applications, particularly in mental healthcare. Using both synthetic and real patient data, we identify the most robust and interpretable variability measures out of a set of 21 candidates, some of which are also proposed in this work based on the absolute slopes of the time series. An additional synthetic-data experiment shows that when the complete time series is unknown, as happens with real data, a non-negligible bias appears that favors normalized metrics and/or metrics based on the raw observations of the series. Therefore, only the results of the synthetic experiments, which have access to the full series, should be used to draw conclusions. Accordingly, the median absolute deviation of the absolute values of the successive slopes of the data is the best way of measuring variability for this kind of time series.
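
A minimal sketch of the winning metric as the abstract describes it, the median absolute deviation of the absolute values of the successive slopes, here applied to a nonuniformly sampled series; the function name and sample data are our own:

    import numpy as np

    def mad_abs_slopes(t, x):
        """Median absolute deviation of the absolute successive slopes of a
        nonuniformly sampled series (a sketch of the metric the abstract favors)."""
        slopes = np.abs(np.diff(x) / np.diff(t))  # |dx/dt| per interval handles irregular sampling
        return np.median(np.abs(slopes - np.median(slopes)))

    # Irregular sampling times, e.g., timestamps of smartphone-based mood ratings.
    t = np.array([0.0, 0.7, 1.1, 2.9, 3.0, 5.4])
    x = np.array([1.0, 1.5, 0.8, 0.9, 2.0, 1.7])
    print(mad_abs_slopes(t, x))
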


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
August DuMont Schütte ◽  
Jürgen Hetzel ◽  
Sergios Gatidis ◽  
Tobias Hepp ◽  
Benedikt Dietz ◽  
...  

Abstract
Privacy concerns around sharing personally identifiable information are a major barrier to data sharing in medical research. In many cases, researchers have no interest in a particular individual's information but rather aim to derive insights at the level of cohorts. Here, we utilise generative adversarial networks (GANs) to create medical imaging datasets consisting entirely of synthetic patient data. The synthetic images ideally have, in aggregate, similar statistical properties to those of a source dataset but do not contain sensitive personal information. We assess the quality of synthetic data generated by two GAN models for chest radiographs with 14 radiology findings and brain computed tomography (CT) scans with six types of intracranial haemorrhages. We measure synthetic image quality by the performance difference of predictive models trained on either the synthetic or the real dataset. We find that synthetic data performance disproportionately benefits from a reduced number of classes. Our benchmark also indicates that at low numbers of samples per class, label overfitting effects start to dominate GAN training. We also conducted a reader study in which trained radiologists discriminated between synthetic and real images; in accordance with our benchmark results, their classification accuracy improved with increasing resolution. Our study offers valuable guidelines and outlines practical conditions under which insights derived from synthetic images are similar to those that would have been derived from real data. Our results indicate that synthetic data sharing may be an attractive alternative to sharing real patient-level data in the right setting.
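
The evaluation idea above, measuring synthetic image quality by the performance gap between models trained on synthetic versus real data, can be sketched schematically as follows. The arrays stand in for flattened images and the linear model for the paper's image classifiers; everything here is a placeholder.

    # Schematic: train the same model on real vs. synthetic data and compare
    # performance on held-out real data. Placeholder arrays, not radiographs.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(2)
    n, d = 500, 64  # placeholder: 500 "images" flattened to 64 features
    X_real, y_real = rng.normal(size=(n, d)), rng.integers(0, 2, n)
    X_syn, y_syn = rng.normal(size=(n, d)), rng.integers(0, 2, n)    # stand-in for GAN output
    X_test, y_test = rng.normal(size=(n, d)), rng.integers(0, 2, n)  # held-out real data

    auc_real = roc_auc_score(y_test, LogisticRegression(max_iter=1000).fit(X_real, y_real).predict_proba(X_test)[:, 1])
    auc_syn = roc_auc_score(y_test, LogisticRegression(max_iter=1000).fit(X_syn, y_syn).predict_proba(X_test)[:, 1])
    # The smaller the gap, the closer the synthetic data's utility is to the real data's.
    print(f"trained-on-real AUC = {auc_real:.3f}, trained-on-synthetic AUC = {auc_syn:.3f}")
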


JAMIA Open ◽  
2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Khaled El Emam ◽  
Lucy Mosquera ◽  
Elizabeth Jonker ◽  
Harpreet Sood

Abstract
Background: Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy-protective manner.
Objectives: To evaluate the utility of synthetic data by comparing analysis results between real and synthetic data.
Methods: A gradient-boosted classification tree was built to predict death using Ontario's 90,514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy and relationships were evaluated, as well as privacy risks. The same model was developed on a synthesized dataset and compared to the one from the original data.
Results: The AUROC and AUPRC for the real data model were 0.945 [95% confidence interval (CI), 0.941–0.948] and 0.34 (95% CI, 0.313–0.368), respectively. The synthetic data model had an AUROC and AUPRC of 0.94 (95% CI, 0.936–0.944) and 0.313 (95% CI, 0.286–0.342), with confidence interval overlaps of 45.05% and 52.02% when compared with the real data. The most important predictors of death for both the real and synthetic models were, in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two datasets. The attribute disclosure risk was 0.0585, and the membership disclosure risk was low.
Conclusions: This synthetic dataset could be used as a proxy for the real dataset.
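
A hedged sketch of how AUROC/AUPRC point estimates with bootstrap confidence intervals, of the kind reported above, can be computed; the paper does not specify its exact CI procedure, and the scores and labels below are invented.

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    def bootstrap_ci(y_true, y_score, metric, n_boot=1000, seed=0):
        """Percentile bootstrap CI for a ranking metric (a common approach;
        an assumption here, not necessarily the paper's method)."""
        rng = np.random.default_rng(seed)
        stats, n = [], len(y_true)
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)
            if len(np.unique(y_true[idx])) < 2:  # resample must contain both classes
                continue
            stats.append(metric(y_true[idx], y_score[idx]))
        return np.percentile(stats, [2.5, 97.5])

    # Placeholder labels and risk scores weakly correlated with the labels.
    y_true = np.random.default_rng(3).integers(0, 2, 2000)
    y_score = np.clip(y_true * 0.3 + np.random.default_rng(4).normal(0.5, 0.2, 2000), 0, 1)
    print("AUROC:", roc_auc_score(y_true, y_score), bootstrap_ci(y_true, y_score, roc_auc_score))
    print("AUPRC:", average_precision_score(y_true, y_score), bootstrap_ci(y_true, y_score, average_precision_score))
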


Author(s):  
Yunita Yunita ◽  
Hidayat Hidayat ◽  
Harun Sitompul

This study aims to: (1) investigate the effect of Jigsaw cooperative learning on students' learning outcomes; (2) find the difference in learning outcomes between students with high and low learning motivation; and (3) find the interaction between learning approaches and learning motivation on learning outcomes. The population of the study comprised students of grades IVa, IVb, and IVc at SD Kasih Ibu Patumbak, and the sample consisted of grade IVa (35 students) and grade IVb (35 students). The results show that: (1) the average learning outcome under Jigsaw cooperative learning was 28.40, versus 24.14 under conventional learning; thus, learning outcomes with Jigsaw-type cooperative learning are higher than with conventional learning; (2) students with high motivation averaged 30.74, while those with low motivation averaged 22.72; thus, there are differences in learning outcomes between students with high and low learning motivation; and (3) among students taught with Jigsaw cooperative learning, the high-motivation group averaged 32.94 and the low-motivation group 24.58, while among students taught with conventional learning, the high-motivation group averaged 28.40 and the low-motivation group 20.95. Thus, there is no interaction between learning approaches and learning motivation on learning outcomes.


2020 ◽  
Vol 139 ◽  
pp. 93-102 ◽  
Author(s):  
MF Van Bressem ◽  
P Duignan ◽  
JA Raga ◽  
K Van Waerebeek ◽  
N Fraijia-Fernández ◽  
...  

Crassicauda spp. (Nematoda) infest the cranial sinuses of several odontocetes, causing diagnostic trabecular osteolytic lesions. We examined skulls of 77 Indian Ocean humpback dolphins Sousa plumbea and 69 Indo-Pacific bottlenose dolphins Tursiops aduncus caught in bather-protecting nets off KwaZulu-Natal (KZN) from 1970 to 2017, and skulls of 6 S. plumbea stranded along the southern Cape coast of South Africa from 1963 to 2002. Prevalence of cranial crassicaudiasis was evaluated according to sex and cranial maturity. Overall, prevalence in S. plumbea and T. aduncus taken off KZN was 13% and 31.9%, respectively. Parasitosis variably affected one or more cranial bones (frontal, pterygoid, maxillary, and sphenoid). No significant difference was found by sex for either species, allowing sexes to be pooled. However, there was a significant difference in lesion prevalence by age: immature T. aduncus were 4.6 times more likely to be affected than adults, while for S. plumbea the difference was 6.5-fold. As severe osteolytic lesions are unlikely to heal without trace, we propose that infection is more likely to have a fatal outcome for immature dolphins, possibly because of incomplete bone development, lower immune competence in clearing parasites, or an over-exuberant inflammatory response in concert with parasitic enzymatic erosion. Cranial osteolysis was not observed in mature males (18 S. plumbea, 21 T. aduncus), suggesting potential cohort-linked immune-mediated resistance to infestation. Crassicauda spp. may play a role in the natural mortality of S. plumbea and T. aduncus, but the pathogenesis and population-level impact remain unknown.


Author(s):  
P.L. Nikolaev

This article deals with a method for the binary classification of images containing small text. Classification is based on the fact that the text can have two orientations: it can be positioned horizontally and read from left to right, or it can be rotated 180 degrees, in which case the image must be turned before the text can be read. This type of text is found on the covers of a variety of books, so when recognizing covers it is necessary to determine the orientation of the text before recognizing the text itself. The article proposes a deep neural network for determining text orientation in the context of book-cover recognition. The results of training and testing a convolutional neural network on synthetic data, as well as examples of the network operating on real data, are presented.
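
A hedged sketch of the kind of network this describes: a small convolutional binary classifier that labels an image crop as upright or rotated 180 degrees, with synthetic training pairs made by rotating upright crops. The architecture, layer sizes, and input shape are our own illustration, not the paper's.

    import torch
    import torch.nn as nn

    # Sketch: a small CNN that classifies a grayscale crop as upright (0) or
    # rotated 180 degrees (1). Layer sizes are illustrative only.
    class OrientationNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.head = nn.Linear(32 * 16 * 16, 2)  # assumes 64x64 input crops

        def forward(self, x):
            return self.head(self.features(x).flatten(1))

    # Synthetic training pairs can be made by rotating upright crops 180 degrees:
    upright = torch.rand(8, 1, 64, 64)
    rotated = torch.rot90(upright, 2, dims=(2, 3))  # 180-degree rotation
    logits = OrientationNet()(torch.cat([upright, rotated]))
    print(logits.shape)  # (16, 2)
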


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Abstract
Background: Three-way data have been gaining popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with the state of the art is paramount. These comparisons are usually performed on real data without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of providing the ground truth (the triclustering solution) as output.
Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters.
Conclusions: Triclustering evaluation using G-Tric makes it possible to combine both intrinsic and extrinsic metrics, yielding more reliable comparisons of solutions. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the evaluation of new triclustering approaches.
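
A minimal sketch of the planting idea: embed a constant-pattern tricluster in a random three-way numpy array and keep its index sets as ground truth. G-Tric itself supports many pattern types, data types, and noise settings; this toy is our own.

    import numpy as np

    rng = np.random.default_rng(5)
    # Background: 100 observations x 20 features x 8 contexts, N(0, 1) noise.
    data = rng.normal(size=(100, 20, 8))

    # Plant one constant tricluster on chosen index subsets of each dimension.
    obs = rng.choice(100, 10, replace=False)
    feats = rng.choice(20, 4, replace=False)
    ctxs = rng.choice(8, 3, replace=False)
    data[np.ix_(obs, feats, ctxs)] = 2.5 + rng.normal(0, 0.1, (10, 4, 3))  # pattern + small noise

    # The ground truth (index sets) is exactly what a generator like G-Tric
    # outputs alongside the data, enabling extrinsic evaluation of triclustering results.
    ground_truth = {"observations": obs, "features": feats, "contexts": ctxs}
    print(data.shape, {k: sorted(v.tolist()) for k, v in ground_truth.items()})
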


2021 ◽  
Vol 40 (3) ◽  
pp. 1-12
Author(s):  
Hao Zhang ◽  
Yuxiao Zhou ◽  
Yifei Tian ◽  
Jun-Hai Yong ◽  
Feng Xu

Reconstructing hand-object interactions is a challenging task due to strong occlusions and complex motions. This article proposes a real-time system that uses a single depth stream to simultaneously reconstruct hand poses, object shape, and rigid/non-rigid motions. To achieve this, we first train a joint learning network to segment the hand and object in a depth image and to predict the 3D keypoints of the hand. With most layers shared by the two tasks, computation cost is reduced, supporting real-time performance. A hybrid dataset is constructed to train the network with real data (to learn real-world distributions) and synthetic data (to cover variations of objects, motions, and viewpoints). Next, the depths of the two targets and the keypoints are used in a uniform optimization to reconstruct the interacting motions. Benefiting from a novel tangential contact constraint, the system not only resolves the remaining ambiguities but also maintains real-time performance. Experiments show that our system handles different hand and object shapes, various interactive motions, and moving cameras.
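
A hedged sketch of the shared-computation idea: one backbone feeding two task heads, one for hand/object segmentation and one for hand-keypoint regression. The shapes and layer choices are illustrative only, not the paper's architecture.

    import torch
    import torch.nn as nn

    # Sketch: a joint network with a shared encoder and two task heads, mirroring
    # the idea that most layers are shared by the two tasks to save computation.
    class JointHandObjectNet(nn.Module):
        def __init__(self, num_keypoints=21):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Head 1: per-pixel segmentation into background / hand / object.
            self.seg_head = nn.Conv2d(64, 3, 1)
            # Head 2: 3D hand keypoints regressed from pooled features.
            self.kp_head = nn.Linear(64, num_keypoints * 3)

        def forward(self, depth):
            f = self.backbone(depth)
            seg = self.seg_head(f)                 # (B, 3, H/4, W/4)
            kp = self.kp_head(f.mean(dim=(2, 3)))  # (B, num_keypoints * 3)
            return seg, kp

    seg, kp = JointHandObjectNet()(torch.rand(2, 1, 128, 128))
    print(seg.shape, kp.shape)  # (2, 3, 32, 32) and (2, 63)
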


Electronics ◽  
2021 ◽  
Vol 10 (5) ◽  
pp. 592
Author(s):  
Radek Silhavy ◽  
Petr Silhavy ◽  
Zdenka Prokopova

Software size estimation is a complex and nontrivial task, based on data analysis or on an algorithmic estimation approach, and it is important for software project planning and management. In this paper, a new method called Actors and Use Cases Size Estimation is proposed. The new method is based only on the number of actors and use cases. It employs stepwise regression and leads to a very significant reduction in errors when estimating the size of software systems compared to Use Case Points-based methods. The proposed method is independent of Use Case Points, which eliminates the effect of inaccurately determined Use Case Points components, because such components are not used in the proposed method.
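
A hedged sketch of the core idea: regress software size on just two predictors, the number of actors and the number of use cases. The paper derives its model by stepwise regression over historical projects; the coefficients and project data below are invented for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Invented historical projects: (actors, use cases) -> delivered size.
    X = np.array([[3, 12], [5, 24], [2, 8], [8, 40], [4, 18], [6, 30]])
    y = np.array([6200, 13100, 4100, 22800, 9500, 16400])

    model = LinearRegression().fit(X, y)
    # Estimate a hypothetical new project with 5 actors and 20 use cases.
    print("estimated size:", model.predict(np.array([[5, 20]]))[0])
    print("coefficients (actors, use cases):", model.coef_, "intercept:", model.intercept_)
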

