Towards synthetic data generation for machine learning models in weather and climate

Author(s):  
David Meyer

The use of real data for training machine learning (ML) models is often a cause of major limitations. For example, real data may be (a) representative of only a subset of situations and domains, (b) expensive to produce, or (c) limited to specific individuals due to licensing restrictions. Although the use of synthetic data is becoming increasingly popular in computer vision, ML models used in weather and climate still rely on large datasets of real data. Here we present some recent work towards the generation of synthetic data for weather and climate applications and outline some of the major challenges and limitations encountered.

2021 ◽  
Vol 11 (5) ◽  
pp. 2158
Author(s):  
Fida K. Dankar ◽  
Mahmoud Ibrahim

Synthetic data provides a privacy-protecting mechanism for the broad usage and sharing of healthcare data for secondary purposes. It is considered a safe approach for the sharing of sensitive data as it generates an artificial dataset that contains no identifiable information. Synthetic data is increasing in popularity, with multiple synthetic data generators developed in the past decade, yet its utility is still a subject of research. This paper is concerned with evaluating the effect of various synthetic data generation and usage settings on the utility of the generated synthetic data and its derived models. Specifically, we investigate (i) the effect of data pre-processing on the utility of the synthetic data generated, (ii) whether tuning should be applied to the synthetic datasets when generating supervised machine learning models, and (iii) whether sharing preliminary machine learning results can improve the synthetic data models. Lastly, (iv) we investigate whether one utility measure (propensity score) can predict the accuracy of the machine learning models generated from the synthetic data when employed in real life. We use two popular measures of synthetic data utility, propensity score and classification accuracy, to compare the different settings. We adopt a recent mechanism for the calculation of propensity, which looks carefully into the choice of model for the propensity score calculation. Accordingly, this paper takes a new direction by investigating the effect of various data generation and usage settings on the quality of the generated data and its ensuing models. The goal is to inform the best strategies to follow when generating and using synthetic data.
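As a rough illustration of the propensity-score utility measure discussed above, the sketch below trains a classifier to distinguish real from synthetic rows and computes the propensity mean-squared error (pMSE). The logistic-regression model and numeric-only columns are assumptions for illustration, not the specific propensity mechanism adopted in the paper.

```python
# Illustrative pMSE sketch; assumes numeric columns and a logistic-regression
# propensity model (the paper adopts a more careful choice of model).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def pmse(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Lower pMSE means synthetic rows are harder to tell apart from real ones."""
    combined = pd.concat([real, synthetic], ignore_index=True)
    # Label each row by its origin: 0 = real, 1 = synthetic.
    labels = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])
    # Fit a propensity model that tries to distinguish synthetic from real rows.
    model = LogisticRegression(max_iter=1000).fit(combined, labels)
    scores = model.predict_proba(combined)[:, 1]
    c = len(synthetic) / len(combined)  # expected score if indistinguishable
    return float(np.mean((scores - c) ** 2))
```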


2021 ◽  
Author(s):  
David Meyer ◽  
Thomas Nagler ◽  
Robin J. Hogan

Abstract. Can we improve machine learning (ML) emulators with synthetic data? The use of real data for training ML models is often the cause of major limitations. For example, real data may be (a) only representative of a subset of situations and domains, (b) expensive to source, or (c) limited to specific individuals due to licensing restrictions. Although the use of synthetic data is becoming increasingly popular in computer vision, the training of ML emulators in weather and climate still relies on datasets of real data. Here we investigate whether the use of copula-based synthetically-augmented datasets improves the prediction of ML emulators for estimating the downwelling longwave radiation. Results show that bulk errors are cut by up to 75 % for the mean bias error (from 0.08 to −0.02 W m−2) and by up to 62 % (from 1.17 to 0.44 W m−2) for the mean absolute error, thus showing potential for improving the generalization of future ML emulators.
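As a minimal sketch of what copula-based augmentation can look like, the snippet below fits a simple Gaussian copula to a real training table and appends synthetic rows drawn from it. This is an assumption for illustration only and not necessarily the copula model used in the study.

```python
# Minimal Gaussian-copula augmentation sketch (illustrative assumption,
# not the authors' implementation).
import numpy as np
from scipy import stats

def gaussian_copula_augment(X: np.ndarray, n_synth: int, rng=None) -> np.ndarray:
    """X: (n_samples, n_features) real data; returns real plus synthetic rows."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # 1. Map each feature to normal scores via its empirical CDF (rank transform).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    z = stats.norm.ppf(ranks / (n + 1))
    # 2. Fit the copula correlation matrix in normal-score space.
    corr = np.corrcoef(z, rowvar=False)
    # 3. Sample new normal scores and map them back through the empirical quantiles.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_synth)
    u_new = stats.norm.cdf(z_new)
    X_synth = np.column_stack([np.quantile(X[:, j], u_new[:, j]) for j in range(d)])
    return np.vstack([X, X_synth])

# Usage sketch: X_aug = gaussian_copula_augment(X_train, n_synth=len(X_train))
```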


Sensors ◽  
2019 ◽  
Vol 19 (5) ◽  
pp. 1181 ◽  
Author(s):  
Jessamyn Dahmen ◽  
Diane Cook

Creation of realistic synthetic behavior-based sensor data is an important aspect of testing machine learning techniques for healthcare applications. Many of the existing approaches for generating synthetic data are often limited in terms of complexity and realism. We introduce SynSys, a machine learning-based synthetic data generation method, to improve upon these limitations. We use this method to generate synthetic time series data that is composed of nested sequences using hidden Markov models and regression models which are initially trained on real datasets. We test our synthetic data generation technique on a real annotated smart home dataset. We use time series distance measures as a baseline to determine how realistic the generated data is compared to real data and demonstrate that SynSys produces more realistic data in terms of distance compared to random data generation, data from another home, and data from another time period. Finally, we apply our synthetic data generation technique to the problem of generating data when only a small amount of ground truth data is available. Using semi-supervised learning we demonstrate that SynSys is able to improve activity recognition accuracy compared to using the small amount of real data alone.
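To illustrate the kind of generative step involved, the sketch below fits a Gaussian hidden Markov model to real-valued event features and samples a synthetic sequence with hmmlearn. The toy features are a stand-in for real smart home data, and SynSys additionally layers regression models and nested sequences on top of this step (omitted here).

```python
# Simplified sketch of the HMM sampling step used for synthetic sequence
# generation; the toy data below stands in for real smart home events.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Toy stand-in for real event features, e.g. [hour_of_day, duration_seconds].
X_real = np.column_stack([rng.uniform(0, 24, 500), rng.exponential(60, 500)])

# Fit a hidden Markov model to the real event stream.
hmm = GaussianHMM(n_components=4, covariance_type="diag", n_iter=100, random_state=0)
hmm.fit(X_real)

# Sample a synthetic sequence of the same length, plus its hidden activity states.
X_synth, states = hmm.sample(len(X_real))
```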


Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3691
Author(s):  
Ciprian Orhei ◽  
Silviu Vert ◽  
Muguras Mocofan ◽  
Radu Vasiu

Computer Vision is a cross-disciplinary research field with the main purpose of understanding the surrounding environment as closely as possible to human perception. Image processing systems are continuously growing and expanding into more complex systems, usually tailored to the specific needs or applications they serve. To better serve this purpose, research on the architecture and design of such systems is also important. We present the End-to-End Computer Vision Framework (EECVF), an open-source solution that aims to support researchers and teachers within the vast field of image processing. The framework incorporates Computer Vision features and Machine Learning models that researchers can use. Given the continuous need to add new Computer Vision algorithms in day-to-day research activity, our proposed framework has an advantage in its configurable and scalable architecture. Although the main focus of the framework is on the Computer Vision processing pipeline, it also offers solutions to incorporate more complex activities, such as training Machine Learning models. EECVF aims to become a useful tool for learning activities in the Computer Vision field, as it allows the learner and the teacher to handle only the topics at hand, and not the interconnections necessary for the visual processing flow.
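As an illustration of the configurable-pipeline idea (not the EECVF API itself), the sketch below declares processing steps as a list and applies them in order, so a new algorithm can be slotted in without touching the rest of the flow. The input file name is hypothetical.

```python
# Illustrative configurable pipeline: steps are declared as (function, kwargs)
# pairs and applied in sequence. Not the EECVF API; a generic sketch only.
import cv2

def run_pipeline(image_path: str, steps):
    """Apply a configured sequence of processing steps to a grayscale image."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    for func, kwargs in steps:
        image = func(image, **kwargs)
    return image

# Example configuration: Gaussian blur followed by Canny edge detection.
pipeline = [
    (cv2.GaussianBlur, {"ksize": (5, 5), "sigmaX": 1.0}),
    (cv2.Canny, {"threshold1": 50, "threshold2": 150}),
]
edges = run_pipeline("input.png", pipeline)  # "input.png" is a hypothetical file
```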


2018 ◽  
Vol 14 (5) ◽  
pp. 20170660 ◽  
Author(s):  
Ruth E. Baker ◽  
Jose-Maria Peña ◽  
Jayaratnam Jayamohan ◽  
Antoine Jérusalem

Ninety per cent of the world's data have been generated in the last 5 years (Machine learning: the power and promise of computers that learn by example. Report no. DES4702. Issued April 2017. Royal Society). A small fraction of these data is collected with the aim of validating specific hypotheses. These studies are led by the development of mechanistic models focused on the causality of input–output relationships. However, the vast majority is aimed at supporting statistical or correlation studies that bypass the need for causality and focus exclusively on prediction. Along these lines, there has been a vast increase in the use of machine learning models, in particular in the biomedical and clinical sciences, to try and keep pace with the rate of data generation. Recent successes now beg the question of whether mechanistic models are still relevant in this area. Said otherwise, why should we try to understand the mechanisms of disease progression when we can use machine learning tools to directly predict disease outcome?


Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3694
Author(s):  
Fernando-Juan Pérez-Porras ◽  
Paula Triviño-Tarradas ◽  
Carmen Cima-Rodríguez ◽  
Jose-Emilio Meroño-de-Larriva ◽  
Alfonso García-Ferrer ◽  
...  

Wildfires are becoming more frequent in different parts of the globe, and predicting when and where they will occur remains a complex task. Identifying wildfire events with a high probability of becoming a large wildfire is an important task for supporting initial attack planning. Different methods, including those that are physics-based, statistical, and based on machine learning (ML), are used in wildfire analysis. Among these, the machine learning methods are relatively novel. In addition, because the number of wildfires is much greater than the number of large wildfires, the dataset to be used in an ML model is imbalanced, resulting in models that overfit or underfit. In this manuscript, we propose generating synthetic data from variables of interest together with ML models for the prediction of large wildfires. Specifically, five synthetic data generation methods have been evaluated, and their results are analyzed with four ML methods. The results yield an improvement in prediction power when synthetic data are used, offering a new method to be taken into account in Decision Support Systems (DSS) when managing wildfires.
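As a hedged sketch of one way synthetic data can rebalance such a dataset (the paper itself evaluates five generation methods and four ML models), the example below oversamples the minority large-wildfire class with SMOTE on a toy dataset and trains a random forest. All names and parameters are illustrative.

```python
# Illustrative only: SMOTE oversampling of the minority (large-wildfire) class
# on a toy imbalanced dataset, then training and evaluating one ML model.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for fire-weather predictors (X) and labels (y: 1 = large wildfire).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Generate synthetic minority-class samples on the training split only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Train on the rebalanced data and evaluate on real, untouched test data.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)
print("test accuracy:", clf.score(X_test, y_test))
```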


Symmetry ◽  
2021 ◽  
Vol 13 (7) ◽  
pp. 1176
Author(s):  
Aleksei Boikov ◽  
Vladimir Payor ◽  
Roman Savelev ◽  
Alexandr Kolesnikov

The paper presents a methodology for training neural networks for vision tasks on synthesized data, using steel defect recognition in automated production control systems as an example. The article describes the process of procedurally generating a dataset of steel slab defects with a symmetrical distribution. The results of training two neural networks, Unet and Xception, on the generated data and testing them on real data are presented. The performance of these neural networks was assessed using real data from the Severstal: Steel Defect Detection set. In both cases, the neural networks showed good results in the classification and segmentation of surface defects of steel workpieces in the image. The Dice score on synthetic data reaches 0.62, and the accuracy reaches 0.81.
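For reference, the reported Dice score can be computed from a predicted binary defect mask and its ground-truth mask as in the short sketch below; this is a generic formulation rather than the authors' exact evaluation code.

```python
# Generic Dice score for binary segmentation masks (illustrative, not the
# authors' evaluation code).
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """pred, target: binary masks of the same shape (1 = defect pixel)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```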


Author(s):  
Gyasi Emmanuel Kwabena ◽  
Mageshbabu Ramamurthy ◽  
Akila Wijethunga ◽  
Purushotham Swarnalatha

The world is fascinated to see how technology evolves with each passing day, and one emerging technology that is trending around us is smart wearable technology. Little attention is usually paid to the data that our bodies radiate and communicate to us, but with the timely arrival of wearable sensors we now have numerous devices that can track and collect these data. Apart from the many benefits we derive from the functions provided by wearable technology, such as monitoring our fitness levels, another critical contribution of wearable technology is helping the advancement of artificial intelligence (AI) and machine learning (ML). Machine learning thrives on the availability of massive data, and wearable technology, which forms part of the internet of things (IoT), generates megabytes of data every single day. The data generated by these wearable devices are used as datasets for the training of machine learning models. Through the analysis of the outcomes of these machine learning models, scientific conclusions are made.

