Towards synthetic data generation for machine learning models in weather and climate

Author(s):  
David Meyer

The use of real data for training machine learning (ML) models is often a cause of major limitations. For example, real data may be (a) representative of only a subset of situations and domains, (b) expensive to produce, or (c) limited to specific individuals due to licensing restrictions. Although the use of synthetic data is becoming increasingly popular in computer vision, ML models used in weather and climate still rely on large datasets of real data. Here we present some recent work towards the generation of synthetic data for weather and climate applications and outline some of the major challenges and limitations encountered.

2021 ◽  
Vol 11 (5) ◽  
pp. 2158
Author(s):  
Fida K. Dankar ◽  
Mahmoud Ibrahim

Synthetic data provides a privacy-protecting mechanism for the broad usage and sharing of healthcare data for secondary purposes. It is considered a safe approach for the sharing of sensitive data as it generates an artificial dataset that contains no identifiable information. Synthetic data is increasing in popularity, with multiple synthetic data generators developed in the past decade, yet its utility is still a subject of research. This paper is concerned with evaluating the effect of various synthetic data generation and usage settings on the utility of the generated synthetic data and its derived models. Specifically, we investigate (i) the effect of data pre-processing on the utility of the synthetic data generated, (ii) whether tuning should be applied to the synthetic datasets when generating supervised machine learning models, and (iii) whether sharing preliminary machine learning results can improve the synthetic data models. Lastly, (iv) we investigate whether one utility measure (propensity score) can predict the accuracy of the machine learning models generated from the synthetic data when employed in real life. We use two popular measures of synthetic data utility, propensity score and classification accuracy, to compare the different settings. We adopt a recent mechanism for the calculation of propensity, which looks carefully into the choice of model for the propensity score calculation. Accordingly, this paper takes a new direction by investigating the effect of various data generation and usage settings on the quality of the generated data and its ensuing models. The goal is to inform the best strategies to follow when generating and using synthetic data.
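As a rough illustration of the propensity-score utility measure discussed above, the sketch below trains a classifier to distinguish real from synthetic rows and computes the propensity mean-squared error (pMSE). The logistic-regression model and numeric-only columns are assumptions for illustration, not the specific propensity mechanism adopted in the paper.

```python
# Illustrative pMSE sketch; assumes numeric columns and a logistic-regression
# propensity model (the paper adopts a more careful choice of model).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def pmse(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Lower pMSE means synthetic rows are harder to tell apart from real ones."""
    combined = pd.concat([real, synthetic], ignore_index=True)
    # Label each row by its origin: 0 = real, 1 = synthetic.
    labels = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])
    # Fit a propensity model that tries to distinguish synthetic from real rows.
    model = LogisticRegression(max_iter=1000).fit(combined, labels)
    scores = model.predict_proba(combined)[:, 1]
    c = len(synthetic) / len(combined)  # expected score if indistinguishable
    return float(np.mean((scores - c) ** 2))
```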


2021 ◽  
Author(s):  
David Meyer ◽  
Thomas Nagler ◽  
Robin J. Hogan

Abstract. Can we improve machine learning (ML) emulators with synthetic data? The use of real data for training ML models is often the cause of major limitations. For example, real data may be (a) only representative of a subset of situations and domains, (b) expensive to source, or (c) limited to specific individuals due to licensing restrictions. Although the use of synthetic data is becoming increasingly popular in computer vision, the training of ML emulators in weather and climate still relies on datasets of real data. Here we investigate whether the use of copula-based synthetically-augmented datasets improves the prediction of ML emulators for estimating the downwelling longwave radiation. Results show that bulk errors are cut by up to 75 % for the mean bias error (from 0.08 to −0.02 W m−2) and by up to 62 % (from 1.17 to 0.44 W m−2) for the mean absolute error, thus showing potential for improving the generalization of future ML emulators.
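As a minimal sketch of what copula-based augmentation can look like, the snippet below fits a simple Gaussian copula to a real training table and appends synthetic rows drawn from it. This is an assumption for illustration only and not necessarily the copula model used in the study.

```python
# Minimal Gaussian-copula augmentation sketch (illustrative assumption,
# not the authors' implementation).
import numpy as np
from scipy import stats

def gaussian_copula_augment(X: np.ndarray, n_synth: int, rng=None) -> np.ndarray:
    """X: (n_samples, n_features) real data; returns real plus synthetic rows."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # 1. Map each feature to normal scores via its empirical CDF (rank transform).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    z = stats.norm.ppf(ranks / (n + 1))
    # 2. Fit the copula correlation matrix in normal-score space.
    corr = np.corrcoef(z, rowvar=False)
    # 3. Sample new normal scores and map them back through the empirical quantiles.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_synth)
    u_new = stats.norm.cdf(z_new)
    X_synth = np.column_stack([np.quantile(X[:, j], u_new[:, j]) for j in range(d)])
    return np.vstack([X, X_synth])

# Usage sketch: X_aug = gaussian_copula_augment(X_train, n_synth=len(X_train))
```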


Sensors ◽  
2019 ◽  
Vol 19 (5) ◽  
pp. 1181 ◽  
Author(s):  
Jessamyn Dahmen ◽  
Diane Cook

Creation of realistic synthetic behavior-based sensor data is an important aspect of testing machine learning techniques for healthcare applications. Many of the existing approaches for generating synthetic data are often limited in terms of complexity and realism. We introduce SynSys, a machine learning-based synthetic data generation method, to improve upon these limitations. We use this method to generate synthetic time series data that is composed of nested sequences using hidden Markov models and regression models which are initially trained on real datasets. We test our synthetic data generation technique on a real annotated smart home dataset. We use time series distance measures as a baseline to determine how realistic the generated data is compared to real data and demonstrate that SynSys produces more realistic data in terms of distance compared to random data generation, data from another home, and data from another time period. Finally, we apply our synthetic data generation technique to the problem of generating data when only a small amount of ground truth data is available. Using semi-supervised learning we demonstrate that SynSys is able to improve activity recognition accuracy compared to using the small amount of real data alone.
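To illustrate the kind of generative step involved, the sketch below fits a Gaussian hidden Markov model to real-valued event features and samples a synthetic sequence with hmmlearn. The toy features are a stand-in for real smart home data, and SynSys additionally layers regression models and nested sequences on top of this step (omitted here).

```python
# Simplified sketch of the HMM sampling step used for synthetic sequence
# generation; the toy data below stands in for real smart home events.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Toy stand-in for real event features, e.g. [hour_of_day, duration_seconds].
X_real = np.column_stack([rng.uniform(0, 24, 500), rng.exponential(60, 500)])

# Fit a hidden Markov model to the real event stream.
hmm = GaussianHMM(n_components=4, covariance_type="diag", n_iter=100, random_state=0)
hmm.fit(X_real)

# Sample a synthetic sequence of the same length, plus its hidden activity states.
X_synth, states = hmm.sample(len(X_real))
```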


Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3691
Author(s):  
Ciprian Orhei ◽  
Silviu Vert ◽  
Muguras Mocofan ◽  
Radu Vasiu

Computer Vision is a cross-disciplinary research field with the main purpose of understanding the surrounding environment as closely as possible to human perception. Image processing systems are continuously growing and expanding into more complex systems, usually tailored to the specific needs or applications they serve. To better serve this purpose, research on the architecture and design of such systems is also important. We present the End-to-End Computer Vision Framework (EECVF), an open-source solution that aims to support researchers and teachers within the vast field of image processing. The framework incorporates Computer Vision features and Machine Learning models that researchers can use. Given the continuous need to add new Computer Vision algorithms in day-to-day research activity, our proposed framework has an advantage in its configurable and scalable architecture. Although the main focus of the framework is on the Computer Vision processing pipeline, it also offers solutions to incorporate more complex activities, such as training Machine Learning models. EECVF aims to become a useful tool for learning activities in the Computer Vision field, as it allows the learner and the teacher to handle only the topics at hand, and not the interconnections necessary for the visual processing flow.
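As an illustration of the configurable-pipeline idea (not the EECVF API itself), the sketch below declares processing steps as a list and applies them in order, so a new algorithm can be slotted in without touching the rest of the flow. The input file name is hypothetical.

```python
# Illustrative configurable pipeline: steps are declared as (function, kwargs)
# pairs and applied in sequence. Not the EECVF API; a generic sketch only.
import cv2

def run_pipeline(image_path: str, steps):
    """Apply a configured sequence of processing steps to a grayscale image."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    for func, kwargs in steps:
        image = func(image, **kwargs)
    return image

# Example configuration: Gaussian blur followed by Canny edge detection.
pipeline = [
    (cv2.GaussianBlur, {"ksize": (5, 5), "sigmaX": 1.0}),
    (cv2.Canny, {"threshold1": 50, "threshold2": 150}),
]
edges = run_pipeline("input.png", pipeline)  # "input.png" is a hypothetical file
```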


2018 ◽  
Vol 14 (5) ◽  
pp. 20170660 ◽  
Author(s):  
Ruth E. Baker ◽  
Jose-Maria Peña ◽  
Jayaratnam Jayamohan ◽  
Antoine Jérusalem

Ninety per cent of the world's data have been generated in the last 5 years (Machine learning: the power and promise of computers that learn by example. Report no. DES4702. Issued April 2017. Royal Society). A small fraction of these data is collected with the aim of validating specific hypotheses. These studies are led by the development of mechanistic models focused on the causality of input–output relationships. However, the vast majority is aimed at supporting statistical or correlation studies that bypass the need for causality and focus exclusively on prediction. Along these lines, there has been a vast increase in the use of machine learning models, in particular in the biomedical and clinical sciences, to try and keep pace with the rate of data generation. Recent successes now beg the question of whether mechanistic models are still relevant in this area. Said otherwise, why should we try to understand the mechanisms of disease progression when we can use machine learning tools to directly predict disease outcome?


Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3694
Author(s):  
Fernando-Juan Pérez-Porras ◽  
Paula Triviño-Tarradas ◽  
Carmen Cima-Rodríguez ◽  
Jose-Emilio Meroño-de-Larriva ◽  
Alfonso García-Ferrer ◽  
...  

Wildfires are becoming more frequent in different parts of the globe, and predicting when and where they will occur remains a complex task. Identifying wildfire events with a high probability of becoming a large wildfire is an important task for supporting initial attack planning. Different methods, including those that are physics-based, statistical, and based on machine learning (ML), are used in wildfire analysis. Among these, the machine learning methods are relatively novel. In addition, because the number of wildfires is much greater than the number of large wildfires, the dataset to be used in an ML model is imbalanced, resulting in models that overfit or underfit. In this manuscript, we propose generating synthetic data from variables of interest together with ML models for the prediction of large wildfires. Specifically, five synthetic data generation methods have been evaluated, and their results are analyzed with four ML methods. The results yield an improvement in prediction power when synthetic data are used, offering a new method to be taken into account in Decision Support Systems (DSS) when managing wildfires.
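As a hedged sketch of one way synthetic data can rebalance such a dataset (the paper itself evaluates five generation methods and four ML models), the example below oversamples the minority large-wildfire class with SMOTE on a toy dataset and trains a random forest. All names and parameters are illustrative.

```python
# Illustrative only: SMOTE oversampling of the minority (large-wildfire) class
# on a toy imbalanced dataset, then training and evaluating one ML model.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for fire-weather predictors (X) and labels (y: 1 = large wildfire).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Generate synthetic minority-class samples on the training split only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Train on the rebalanced data and evaluate on real, untouched test data.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)
print("test accuracy:", clf.score(X_test, y_test))
```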


Symmetry ◽  
2021 ◽  
Vol 13 (7) ◽  
pp. 1176
Author(s):  
Aleksei Boikov ◽  
Vladimir Payor ◽  
Roman Savelev ◽  
Alexandr Kolesnikov

The paper presents a methodology for training neural networks for vision tasks on synthesized data, using steel defect recognition in automated production control systems as an example. The article describes the process of procedurally generating a dataset of steel slab defects with a symmetrical distribution. The results of training two neural networks, Unet and Xception, on the generated data and testing them on real data are presented. The performance of these neural networks was assessed using real data from the Severstal: Steel Defect Detection set. In both cases, the neural networks showed good results in the classification and segmentation of surface defects of steel workpieces in the image. The Dice score on synthetic data reaches 0.62, and the accuracy reaches 0.81.
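For reference, the reported Dice score can be computed from a predicted binary defect mask and its ground-truth mask as in the short sketch below; this is a generic formulation rather than the authors' exact evaluation code.

```python
# Generic Dice score for binary segmentation masks (illustrative, not the
# authors' evaluation code).
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """pred, target: binary masks of the same shape (1 = defect pixel)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```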


Author(s):  
Gyasi Emmanuel Kwabena ◽  
Mageshbabu Ramamurthy ◽  
Akila Wijethunga ◽  
Purushotham Swarnalatha

The world is fascinated to see how technology evolves with each passing day, and one emerging technology that is trending around us is smart wearable technology. Little attention is usually paid to the data that our bodies radiate and communicate to us, but with the timely arrival of wearable sensors we now have numerous devices that can track and collect these data. Apart from the many benefits we derive from the functions provided by wearable technology, such as monitoring our fitness levels, another critical contribution of wearable technology is helping the advancement of artificial intelligence (AI) and machine learning (ML). Machine learning thrives on the availability of massive data, and wearable technology, which forms part of the internet of things (IoT), generates megabytes of data every single day. The data generated by these wearable devices are used as datasets for the training of machine learning models. Through the analysis of the outcomes of these machine learning models, scientific conclusions are made.

