SynSys: A Synthetic Data Generation System for Healthcare Applications

Sensors ◽  
2019 ◽  
Vol 19 (5) ◽  
pp. 1181 ◽  
Author(s):  
Jessamyn Dahmen ◽  
Diane Cook

Creation of realistic synthetic behavior-based sensor data is an important aspect of testing machine learning techniques for healthcare applications. Many existing approaches for generating synthetic data are limited in terms of complexity and realism. We introduce SynSys, a machine learning-based synthetic data generation method, to improve upon these limitations. We use this method to generate synthetic time series data composed of nested sequences, using hidden Markov models and regression models that are initially trained on real datasets. We test our synthetic data generation technique on a real annotated smart home dataset. We use time series distance measures as a baseline to determine how realistic the generated data is compared to real data, and demonstrate that SynSys produces more realistic data in terms of distance than random data generation, data from another home, and data from another time period. Finally, we apply our synthetic data generation technique to the problem of generating data when only a small amount of ground truth data is available. Using semi-supervised learning, we demonstrate that SynSys is able to improve activity recognition accuracy compared to using the small amount of real data alone.
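A minimal sketch of the kind of pipeline the abstract describes, assuming the hmmlearn and numpy libraries: fit a Gaussian HMM on a real feature sequence, sample a synthetic sequence from it, and compare both against random data with a simple time-series distance. This is illustrative only and not the SynSys implementation; the input series here is a random stand-in for real smart-home features.

```python
# Illustrative sketch (not SynSys): HMM-based synthetic sequence generation
# plus a distance-based realism check.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

# Stand-in for a real smart-home feature sequence (e.g., hourly activity counts).
real_series = np.cumsum(rng.normal(size=(500, 1)), axis=0)

# 1. Train an HMM on the real data (number of hidden states is a modelling choice).
hmm = GaussianHMM(n_components=5, covariance_type="diag", n_iter=100, random_state=0)
hmm.fit(real_series)

# 2. Sample a synthetic sequence of the same length.
synthetic_series, hidden_states = hmm.sample(len(real_series), random_state=1)

# 3. Baseline realism check: Euclidean distance to the real series,
#    compared against purely random data of the same scale.
random_series = rng.normal(real_series.mean(), real_series.std(), real_series.shape)

def distance(a, b):
    return float(np.linalg.norm(a - b))

print("synthetic vs real:", distance(synthetic_series, real_series))
print("random    vs real:", distance(random_series, real_series))
```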

2020 ◽  
Author(s):  
David Meyer

The use of real data for training machine learning (ML) models is often a cause of major limitations. For example, real data may be (a) representative of only a subset of situations and domains, (b) expensive to produce, or (c) limited to specific individuals due to licensing restrictions. Although the use of synthetic data is becoming increasingly popular in computer vision, ML models used in weather and climate models still rely on large real datasets. Here we present some recent work towards the generation of synthetic data for weather and climate applications and outline some of the major challenges and limitations encountered.


Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3694
Author(s):  
Fernando-Juan Pérez-Porras ◽  
Paula Triviño-Tarradas ◽  
Carmen Cima-Rodríguez ◽  
Jose-Emilio Meroño-de-Larriva ◽  
Alfonso García-Ferrer ◽  
...  

Wildfires are becoming more frequent in different parts of the globe, and predicting when and where they will occur is a complex process. Identifying wildfire events with a high probability of becoming a large wildfire is an important task for supporting initial attack planning. Different methods, including physics-based, statistical, and machine learning (ML) approaches, are used in wildfire analysis. Among these, the machine learning-based methods are relatively novel. In addition, because the number of wildfires is much greater than the number of large wildfires, the dataset to be used in an ML model is imbalanced, which can result in overfitting or underfitting. In this manuscript, we propose generating synthetic data from variables of interest, together with ML models, for the prediction of large wildfires. Specifically, five synthetic data generation methods have been evaluated, and their results are analyzed with four ML methods. The results show an improvement in predictive power when synthetic data are used, offering a new method to be taken into account in Decision Support Systems (DSS) when managing wildfires.
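A hedged sketch of the general idea, assuming scikit-learn and imbalanced-learn: oversample the rare "large wildfire" class with SMOTE (one common synthetic-data method; the paper evaluates five, which are not necessarily this one) and compare a classifier trained with and without the synthetic samples. The features below are randomly generated stand-ins for real fire variables.

```python
# Illustrative only: synthetic minority oversampling for an imbalanced wildfire-style dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Imbalanced toy data: ~5% "large wildfire" positives.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: train on the imbalanced data as-is.
base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# With synthetic minority samples added by SMOTE.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
aug = RandomForestClassifier(random_state=0).fit(X_res, y_res)

print("balanced accuracy, real only:  ", balanced_accuracy_score(y_te, base.predict(X_te)))
print("balanced accuracy, +synthetic:", balanced_accuracy_score(y_te, aug.predict(X_te)))
```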


2018 ◽  
Author(s):  
Elijah Bogart ◽  
Richard Creswell ◽  
Georg K. Gerber

Abstract. Longitudinal studies are crucial for discovering causal relationships between the microbiome and human disease. We present the Microbiome Interpretable Temporal Rule Engine (MITRE), the first machine learning method specifically designed for predicting host status from microbiome time-series data. Our method maintains interpretability by learning predictive rules over automatically inferred time periods and phylogenetically related microbes. We validate MITRE's performance on semi-synthetic data and on five real datasets measuring microbiome composition over time in infant and adult cohorts. Our results demonstrate that MITRE performs on par with or outperforms "black box" machine learning approaches, providing a powerful new tool enabling discovery of biologically interpretable relationships between the microbiome and the human host.
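A toy illustration of the kind of rule such a method learns, not the MITRE implementation: a detector of the form "the average abundance of a group of related taxa within a time window exceeds a threshold", evaluated over a cohort. The clade, window, and threshold below are hypothetical, and the data are random.

```python
# Toy "temporal rule" detector over microbiome time-series data (numpy only).
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_taxa, n_timepoints = 40, 20, 30
abundances = rng.random((n_subjects, n_taxa, n_timepoints))

# Hypothetical clade (set of phylogenetically related taxa) and time window.
clade = [2, 3, 5]
window = slice(10, 20)
threshold = 0.55

def rule_fires(subject_abundance):
    """True if the clade's mean abundance within the window exceeds the threshold."""
    return subject_abundance[clade, window].mean() > threshold

predictions = np.array([rule_fires(a) for a in abundances])
print("subjects for which the rule fires:", predictions.sum(), "of", n_subjects)
```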


Symmetry ◽  
2021 ◽  
Vol 13 (7) ◽  
pp. 1176
Author(s):  
Aleksei Boikov ◽  
Vladimir Payor ◽  
Roman Savelev ◽  
Alexandr Kolesnikov

The paper presents a methodology for training neural networks for vision tasks on synthesized data, using steel defect recognition in automated production control systems as an example. The article describes the process of procedurally generating a dataset of steel slab defects with a symmetrical distribution. The results of training two neural networks, U-Net and Xception, on the generated data and testing them on real data are presented. The performance of these neural networks was assessed using real data from the Severstal: Steel Defect Detection set. In both cases, the neural networks showed good results in the classification and segmentation of surface defects of steel workpieces in the image. The Dice score on synthetic data reaches 0.62, and the accuracy reaches 0.81.
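A minimal sketch of procedural defect-image generation in the spirit of the abstract, not the authors' pipeline: synthesize a noisy "steel surface" texture, draw a scratch-like defect onto it, and emit the matching binary segmentation mask that a segmentation network such as U-Net would train on. All parameters are illustrative assumptions.

```python
# Illustrative procedural generation of a (surface image, defect mask) training pair.
import numpy as np

def make_sample(h=128, w=128, rng=None):
    rng = rng or np.random.default_rng()
    surface = rng.normal(0.5, 0.05, (h, w))            # base steel texture
    mask = np.zeros((h, w), dtype=np.uint8)

    # Scratch defect: a random line with small jitter and a few pixels of width.
    x = np.arange(w)
    y0 = rng.integers(10, h - 10)
    slope = rng.uniform(-0.3, 0.3)
    y = np.clip((y0 + slope * x + rng.normal(0, 1, w)).astype(int), 0, h - 1)
    for thickness in range(-1, 2):
        rows = np.clip(y + thickness, 0, h - 1)
        surface[rows, x] += rng.uniform(0.2, 0.4)       # brighter scratch pixels
        mask[rows, x] = 1
    return np.clip(surface, 0, 1), mask

image, mask = make_sample(rng=np.random.default_rng(0))
print(image.shape, int(mask.sum()), "defect pixels")
```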


2021 ◽  
Vol 3 ◽  
Author(s):  
Peter Goodin ◽  
Andrew J. Gardner ◽  
Nasim Dokani ◽  
Ben Nizette ◽  
Saeed Ahmadizadeh ◽  
...  

Background: Exposure to thousands of head and body impacts during a career in contact and collision sports may contribute to current or later-life issues related to brain health. Wearable technology enables the measurement of impact exposure. Validation of impact detection is required for accurate exposure monitoring. In this study, we present a method for automatic identification (classification) of head and body impacts using an instrumented mouthguard, video-verified impacts, and machine-learning algorithms.

Methods: Time series data were collected via the Nexus A9 mouthguard from 60 elite-level men (mean age = 26.33; SD = 3.79) and four women (mean age = 25.50; SD = 5.91), all Australian Rules Football players from eight clubs, participating in 119 games during the 2020 season. Ground truth labeling of the captures used in this machine learning study was performed through analysis of game footage by two expert video reviewers using SportCode and Catapult Vision. The visual labeling process occurred independently of the mouthguard time series data. True positive captures (captures where the reviewer directly observed contact between the mouthguard wearer and another player, the ball, or the ground) were defined as hits. Spectral and convolutional kernel-based features were extracted from the time series data. The performance of untuned classification algorithms from scikit-learn, in addition to XGBoost, was assessed to select the best performing baseline method for tuning.

Results: Based on performance, XGBoost was selected as the classifier algorithm for tuning. A total of 13,712 video-verified captures were collected and used to train and validate the classifier. True positive detection ranged from 94.67% in the test set to 100% in the hold-out set. True negatives ranged from 95.65% to 96.83% in the test and hold-out sets, respectively.

Discussion and conclusion: This study suggests the potential for high-performing impact classification models to be used for Australian Rules Football and highlights the importance of frequencies <150 Hz for the identification of these impacts.
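A hedged sketch of this kind of feature/classifier pipeline, not the study code: band-limited spectral features (<150 Hz) computed from short signal windows and fed to an untuned XGBoost classifier. The sampling rate, window length, and data below are illustrative assumptions standing in for mouthguard captures.

```python
# Illustrative only: spectral features below 150 Hz + XGBoost impact classifier.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
fs = 1000                                   # assumed sampling rate in Hz
n_windows, win_len = 400, 256
signals = rng.normal(size=(n_windows, win_len))
labels = rng.integers(0, 2, n_windows)      # 1 = video-verified hit, 0 = not

# Spectral features: magnitude spectrum restricted to frequencies below 150 Hz.
freqs = np.fft.rfftfreq(win_len, d=1 / fs)
keep = freqs < 150
features = np.abs(np.fft.rfft(signals, axis=1))[:, keep]

clf = XGBClassifier(n_estimators=100, eval_metric="logloss")
clf.fit(features[:300], labels[:300])
print("held-out accuracy:", (clf.predict(features[300:]) == labels[300:]).mean())
```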


Author(s):  
V. V. Danilov ◽  
O. M. Gerget ◽  
D. Y. Kolpashchikov ◽  
N. V. Laptev ◽  
R. A. Manakov ◽  
...  

Abstract. In the era of data-driven machine learning algorithms, data is the new oil. Applying machine learning algorithms shows that they need large, heterogeneous datasets that are, crucially, correctly labeled. However, data collection and labeling are time-consuming and labor-intensive processes. A particular task we solve using machine learning is the segmentation of medical devices in echocardiographic images during minimally invasive surgery. The lack of data motivated us to develop an algorithm that generates synthetic samples based on real datasets. The concept of this algorithm is to place a medical device (catheter) in an empty cavity of an anatomical structure, for example, in a heart chamber, and then transform it. To create random transformations of the catheter, the algorithm uses a coordinate system that uniquely identifies each point regardless of the bend and the shape of the object. It is proposed to take a cylindrical coordinate system as a basis, modifying it by replacing the Z-axis with a spline along which the h-coordinate is measured. Using the proposed algorithm, we generated new images with the catheter inserted into different heart cavities while varying its location and shape. Afterward, we compared the results of deep neural networks trained on datasets comprised of real and synthetic data. The network trained on both real and synthetic data performed more accurate segmentation than the model trained only on real data. For instance, a modified U-net trained on the combined dataset performed segmentation with a Dice similarity coefficient of 92.6±2.2%, while the same model trained only on real samples achieved 86.5±3.6%. Using a synthetic dataset allowed us to decrease the accuracy spread and improve the generalization of the model. It is worth noting that the proposed algorithm reduces subjectivity, minimizes the labeling routine, increases the number of samples, and improves dataset heterogeneity.
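A toy sketch of the coordinate idea described above, not the authors' code: a cylindrical-like system whose axis is a curve, so a catheter point is given by (h, r) = (arc length along the axis curve, offset along the local normal). The example is 2D for brevity and uses scipy for the spline; the control points and offsets are hypothetical.

```python
# Illustrative mapping from spline-axis coordinates (h, r) to Cartesian points.
import numpy as np
from scipy.interpolate import CubicSpline

# Axis curve ("spline Z-axis") through a few control points inside a cavity.
control_t = np.array([0.0, 1.0, 2.0, 3.0])
control_xy = np.array([[0, 0], [10, 4], [20, 2], [30, 8]], dtype=float)
axis = CubicSpline(control_t, control_xy)        # maps t -> (x, y)

def point_from_h_r(h, r, n_samples=2000):
    """Walk arc length h along the axis curve, then step r along the local normal."""
    t = np.linspace(control_t[0], control_t[-1], n_samples)
    xy = axis(t)
    seg = np.diff(xy, axis=0)
    arc = np.concatenate([[0.0], np.cumsum(np.linalg.norm(seg, axis=1))])
    i = min(max(np.searchsorted(arc, h), 1), n_samples - 1)
    tangent = seg[i - 1] / np.linalg.norm(seg[i - 1])
    normal = np.array([-tangent[1], tangent[0]])
    return xy[i] + r * normal

# A bent catheter: points spaced along h with a fixed radial offset.
catheter = np.array([point_from_h_r(h, r=1.5) for h in np.linspace(0, 25, 20)])
print(catheter.shape)
```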

