Learning Optimal Time Series Combination and Pre-Processing by Smart Joins

2020 ◽  
Vol 10 (18) ◽  
pp. 6346
Author(s):  
Amaia Gil ◽  
Marco Quartulli ◽  
Igor G. Olaizola ◽  
Basilio Sierra

In industrial applications of data science and machine learning, most of the steps of a typical pipeline focus on optimizing measures of model fitness to the available data. Data preprocessing, instead, is often ad hoc and not based on the optimization of quantitative measures. This paper proposes the use of optimization in the preprocessing step, specifically studying a time series joining methodology, and introduces an error function to measure the adequacy of the joining. Experiments show how the method allows monitoring preprocessing errors for different time slices, indicating when a retraining of the preprocessing may be needed. Thus, this contribution helps quantify the implications of data preprocessing on the results of data analysis and machine learning methods. The methodology is applied to two case studies: synthetic simulation data with controlled distortions, and a real scenario of an industrial process.
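As an illustration of the kind of preprocessing being optimized here, the sketch below (a toy example, not the authors' actual method; the data and the offset-based error function are assumptions) joins an irregularly sampled sensor series onto reference timestamps by nearest-neighbour matching and scores the join with a quantitative error measure:

```python
import pandas as pd

# Toy reference series and an irregularly sampled sensor series.
ref = pd.DataFrame({
    "t": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 00:10", "2020-01-01 00:20"]),
    "y": [1.0, 2.0, 3.0],
})
sensor = pd.DataFrame({
    "t": pd.to_datetime(["2020-01-01 00:01", "2020-01-01 00:12", "2020-01-01 00:19"]),
    "x": [1.1, 2.2, 2.9],
})
sensor["t_src"] = sensor["t"]  # keep the original timestamp of each match

# Nearest-neighbour join of sensor values onto the reference timestamps.
joined = pd.merge_asof(ref, sensor, on="t", direction="nearest")

# A simple adequateness measure for the join: mean absolute time offset
# between each reference timestamp and the sensor sample matched to it.
offset_s = (joined["t"] - joined["t_src"]).abs().dt.total_seconds()
join_error = offset_s.mean()  # 80.0 s for this toy data
```

Monitoring such an error per time slice, as the abstract suggests, would then flag periods where the joining degrades.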

2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Andrei Bratu ◽  
Gabriela Czibula

Data augmentation is a commonly used technique in data science for improving the robustness and performance of machine learning models. The purpose of this paper is to study the feasibility of generating synthetic data points of a temporal nature towards this end. A general approach named DAuGAN (Data Augmentation using Generative Adversarial Networks) is presented for identifying poorly represented sections of a time series, synthesizing and integrating new data points, and measuring the performance improvement on a benchmark machine learning model. The problem is studied and applied in the domain of algorithmic trading, whose constraints are presented and taken into consideration. The experimental results highlight an improvement in the performance of a benchmark reinforcement learning agent trained to trade a financial instrument on a dataset enhanced with DAuGAN.
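The first step the abstract names, identifying poorly represented sections of a series, can be sketched as follows (an illustrative assumption, not the DAuGAN implementation): bin the series' returns and flag low-density bins, which would then be the target regimes for synthetic augmentation.

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 1.0, 1000)  # stand-in for asset returns

# Bin the returns and mark bins holding fewer than 2% of the samples.
counts, edges = np.histogram(returns, bins=10)
rare_bins = np.where(counts < 0.02 * len(returns))[0]

# Mask of the individual samples that fall into under-represented bins;
# a GAN would be asked to generate more data resembling these regimes.
bin_idx = np.clip(np.digitize(returns, edges) - 1, 0, len(counts) - 1)
rare_mask = np.isin(bin_idx, rare_bins)
```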


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Sergio Martin del Campo Barraza ◽  
William Lindskog ◽  
Davide Badalotti ◽  
Oskar Liew ◽  
Arash Toyser

Data-based models built using machine learning solutions are becoming more prominent in the condition monitoring, maintenance, and prognostics fields. The capacity to build these models using a machine learning approach depends largely on the quality of the data. Of particular importance is the availability of labelled data, which describes the conditions that are intended to be identified. However, properly labelled data that is useful in many machine learning strategies is a scarce resource. Furthermore, producing high-quality labelled data is expensive, time-consuming, and often inaccurate given the uncertainty surrounding the labeling process and the annotators. Active Learning (AL) has emerged as a semi-supervised approach that enables cost and time reductions in the labeling process. This approach has seen delayed adoption for time series classification given the difficulty of extracting and presenting the time series information in a way that is easy for the human annotator who incorporates the labels to understand. This difficulty arises from the large dimensionality that many of these time series possess. The challenge is exacerbated by the cold-start problem, where the initial labelled dataset used in typical AL frameworks may not exist, so the initial set of labels to be allocated to the time series samples is not available. This last challenge is particularly common in many condition monitoring applications where data samples of specific faults or problems do not exist. In this article, we present an AL framework for the classification of time series from industrial process data, in particular vibration waveforms originating from condition monitoring applications. In this framework, we deal with the absence of labels to train an initial classification model by introducing a pre-clustering step. This step uses an unsupervised clustering algorithm to identify the number of labels and selects the points with the strongest group membership as the initial samples to be labelled in the active learning step. Furthermore, the framework presents the information to the annotator in one of two ways: via time-series imaging or via automatic extraction of statistical features. Our work is motivated by the interest in reducing the effort required for labeling time-series waveforms while maintaining a high level of accuracy and consistency in those labels. In addition, we study the number of time-series samples that need to be labelled to achieve different levels of classification accuracy, as well as their confidence intervals. These experiments are carried out using vibration signals from a well-known rolling element bearing dataset and typical process data from a production plant. An active learning framework that considers the conditions of the data commonly found in maintenance and condition monitoring applications, while presenting the data in ways that are easy for human annotators to interpret, can facilitate the generation of reliable datasets. These datasets can, in turn, assist in the development of data-driven models that describe the many different processes that a machine undergoes.
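The pre-clustering step described above can be sketched roughly as follows (a toy example under assumptions, not the authors' implementation): cluster the unlabelled feature vectors, then hand the sample with the strongest membership in each cluster to the annotator as the initial seed to label.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D feature vectors standing in for features extracted from
# vibration waveforms: two well-separated groups (e.g. healthy vs. faulty).
rng = np.random.default_rng(42)
features = np.vstack([
    rng.normal(0.0, 0.1, (50, 2)),
    rng.normal(3.0, 0.1, (50, 2)),
])

# Unsupervised clustering to discover candidate label groups.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# "Strongest group belonging" proxied by distance to the cluster centroid:
# for each cluster, pick the sample closest to its centroid as the seed.
dist = km.transform(features)        # (n_samples, n_clusters) distances
seed_idx = dist.argmin(axis=0)       # one seed sample per cluster
```

In a real AL loop, only these seed samples would be sent to the human annotator to bootstrap the initial classifier.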


2021 ◽  
Author(s):  
Martijn Witjes ◽  
Leandro Parente ◽  
Chris J. van Diemen ◽  
Tomislav Hengl ◽  
Martin Landa ◽  
...  

Abstract A seamless spatiotemporal machine learning framework for automated prediction, uncertainty assessment, and analysis of land use / land cover (LULC) dynamics is presented. The framework includes: (1) harmonization and preprocessing of high-resolution spatial and spatiotemporal covariate datasets (GLAD Landsat, NPP/VIIRS) including 5 million harmonized LUCAS and CORINE Land Cover-derived training samples, (2) model building based on spatial k-fold cross-validation and hyper-parameter optimization, (3) prediction of the most probable class, class probabilities, and uncertainty per pixel, (4) LULC change analysis on time-series of produced maps. The spatiotemporal ensemble model was fitted by combining a random forest, gradient boosted trees, and an artificial neural network, with a logistic regressor as meta-learner. The results show that the most important covariates for mapping LULC in Europe are seasonal aggregates of the Landsat green and near-infrared bands, multiple Landsat-derived spectral indices, and elevation. Spatial cross-validation of the model indicates consistent performance across multiple years, with 62%, 70%, and 87% accuracy when predicting 33 (level-3), 14 (level-2), and 5 (level-1) classes; artificial surface classes such as 'airports' and 'railroads' show the lowest match with validation points. The spatiotemporal model outperforms spatial models by 2.7% on known-year classification and by 3.5% on unknown-year classification. The accuracy assessment using 48,365 independent test samples shows an 87% match with the validation points. Time-series analysis (of LULC probabilities and NDVI images) suggests gradual deforestation trends in large parts of Sweden, the Alps, and Scotland. An advantage of using spatiotemporal ML is that the fitted model can be used to predict LULC in years that were not included in its training dataset, allowing generalization to past and future periods, e.g. to predict land cover for years prior to 2000 and beyond 2020. The generated land cover time-series data stack (ODSE-LULC), including the training points, is publicly available via the Open Data Science (ODS)-Europe Viewer.
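The ensemble design and evaluation scheme the abstract describes can be sketched on toy data (a minimal sketch with assumed model sizes, not the production pipeline): random forest, gradient boosting, and a neural network stacked under a logistic-regression meta-learner, scored with grouped cross-validation so that samples from the same spatial block never appear in both training and test folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

# Toy stand-in for pixel covariates and their spatial tile membership.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
blocks = np.repeat(np.arange(5), 40)  # 5 assumed "spatial tiles"

# RF + GBT + ANN base learners, logistic regressor as meta-learner.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
        ("gbt", GradientBoostingClassifier(n_estimators=20, random_state=0)),
        ("ann", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                              random_state=0)),
    ],
    final_estimator=LogisticRegression(),
)

# Spatial k-fold CV: whole tiles are held out, mimicking spatial blocking.
scores = cross_val_score(stack, X, y, cv=GroupKFold(n_splits=5), groups=blocks)
```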


2020 ◽  
Vol 110 (9-10) ◽  
pp. 2445-2463 ◽  
Author(s):  
Yuanyuan Li ◽  
Stefano Carabelli ◽  
Edoardo Fadda ◽  
Daniele Manerba ◽  
Roberto Tadei ◽  
...  

Abstract Along with the fourth industrial revolution, different tools coming from the optimization, Internet of Things, data science, and artificial intelligence fields are creating new opportunities in production management. While manufacturing processes are stochastic and rescheduling decisions need to be made under uncertainty, it is still a complicated task to decide whether a rescheduling is worthwhile, and in practice this is often addressed on a greedy basis. To find a tradeoff between rescheduling frequency and the growing accumulation of delays, we propose a rescheduling framework that integrates machine learning (ML) techniques and optimization algorithms. To prove its effectiveness, we first model a flexible job-shop scheduling problem with sequence-dependent setup and limited dual resources (FJSP), inspired by an industrial application. Then, we solve the scheduling problem through a hybrid metaheuristic approach. We train an ML classification model to identify rescheduling patterns. Finally, we compare its rescheduling performance with periodical rescheduling approaches. The simulation results show that the integration of these techniques can provide a good compromise between rescheduling frequency and scheduling delays. The main contributions of the work are the formalization of the FJSP problem, the development of ad hoc solution methods, and the proposal and validation of an innovative ML- and optimization-based framework for supporting rescheduling decisions.
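The "ML classification model for identifying rescheduling patterns" could, in spirit, look like the following sketch. Everything here is an assumption for illustration: the shop-state features (accumulated delay, number of disrupted jobs), the labelling rule, and the choice of a decision tree are hypothetical, not the paper's setup.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Simulated shop-state snapshots with assumed features.
rng = np.random.default_rng(1)
delay = rng.uniform(0.0, 10.0, 500)        # accumulated delay (hours)
disrupted = rng.integers(0, 8, 500)        # jobs hit by disruptions
X = np.column_stack([delay, disrupted])

# Toy labelling rule: rescheduling pays off when disruption pressure is high.
y = (delay + disrupted > 9).astype(int)

# Train a classifier to reproduce the "is rescheduling worthwhile?" decision.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

At runtime, such a classifier would be queried after each disruption instead of rescheduling on a fixed period or greedily.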


2021 ◽  
pp. 014459872110117
Author(s):  
Amine Tadjer ◽  
Aojie Hong ◽  
Reidar B Bratvold

Traditional decline curve analyses (DCAs), both deterministic and probabilistic, use specific models to fit production data for production forecasting. Various decline curve models have been applied for unconventional wells, including the Arps model, stretched exponential model, Duong model, and combined capacitance-resistance model. However, it is not straightforward to determine which model should be used, as multiple models may fit a dataset equally well but provide different forecasts, and hastily selecting a model for probabilistic DCA can underestimate the uncertainty in a production forecast. Data science, machine learning, and artificial intelligence are revolutionizing the oil and gas industry by utilizing computing power more effectively and efficiently. In this paper, we propose a data-driven approach for short-term prediction of unconventional oil production. Two state-of-the-art models, DeepAR and Prophet, have been tested on petroleum production data. Compared with the traditional approach using decline curve models, the machine learning approach can be regarded as "model-free" (non-parametric) because the pre-determination of decline curve models is not required. The main goal of this work is to develop and apply neural networks and time series techniques to oil well data without requiring substantial knowledge of the extraction process or the physical relationship between the geological and dynamic parameters. For evaluation and verification purposes, the proposed method is applied to a selected well of the Midland field in the USA. By comparing our results, we can infer that both DeepAR and Prophet analysis are useful for gaining a better understanding of the behavior of oil wells and can mitigate the over/underestimates resulting from using a single decline curve model for forecasting. In addition, the proposed approach performs well in propagating model uncertainty to uncertainty in production forecasting; that is, we end up with a forecast that outperforms the standard DCA methods.
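For context, the traditional baseline this abstract contrasts with can be sketched as follows (synthetic data and parameter values are assumptions): fitting the Arps hyperbolic decline model q(t) = qi / (1 + b*Di*t)^(1/b) to production rates and extrapolating it as a forecast.

```python
import numpy as np
from scipy.optimize import curve_fit

# Arps hyperbolic decline: initial rate qi, initial decline di, exponent b.
def arps(t, qi, di, b):
    return qi / np.power(1.0 + b * di * t, 1.0 / b)

# Synthetic monthly production rates with noise (assumed true parameters).
t = np.arange(0, 36, dtype=float)
rng = np.random.default_rng(0)
q_obs = arps(t, qi=1000.0, di=0.08, b=0.9) + rng.normal(0.0, 5.0, t.size)

# Fit the decline model, then extrapolate 12 months ahead.
(qi_hat, di_hat, b_hat), _ = curve_fit(
    arps, t, q_obs, p0=[800.0, 0.05, 0.5],
    bounds=([0.0, 0.0, 0.01], [5000.0, 1.0, 2.0]),
)
forecast = arps(np.arange(36, 48, dtype=float), qi_hat, di_hat, b_hat)
```

The "model-free" approaches in the paper (DeepAR, Prophet) skip exactly this pre-commitment to a single parametric decline shape.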


Author(s):  
Meike Klettke ◽  
Adrian Lutsch ◽  
Uta Störl

Abstract Data engineering is an integral part of any data science and ML process. It consists of several subtasks that are performed to improve data quality and to transform data into a target format suitable for analysis. The quality and correctness of the data engineering steps are therefore important to ensure the quality of the overall process. In machine learning processes, requirements such as fairness and explainability are essential, and these requirements must also be met by the data engineering subtasks. In this article, we show how this can be achieved by logging, monitoring, and controlling the data changes in order to evaluate their correctness. Since data preprocessing algorithms are part of any machine learning pipeline, they must obviously also guarantee that they do not introduce data biases. We briefly introduce three classes of methods for measuring data changes in data engineering and present the research questions that still remain unanswered in this area.
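A minimal sketch of what "measuring data changes" can mean in practice (an illustration under assumptions, not one of the article's three method classes): profile a column before and after an imputation step and log the deltas, so that a preprocessing step that distorts the distribution, here mean imputation shrinking the variance, becomes visible.

```python
import pandas as pd

# Profile a column with a few summary statistics.
def profile(s: pd.Series) -> dict:
    return {
        "n_missing": int(s.isna().sum()),
        "mean": round(s.mean(), 3),
        "std": round(s.std(), 3),
    }

raw = pd.Series([1.0, 2.0, None, 4.0, None, 6.0])
before = profile(raw)

# Preprocessing step under scrutiny: mean imputation of missing values.
imputed = raw.fillna(raw.mean())
after = profile(imputed)

# Log before/after pairs; a monitoring layer would alert on large deltas.
change_log = {k: (before[k], after[k]) for k in before}
```

The mean is preserved while the standard deviation drops, which is exactly the kind of silent data change such logging is meant to surface.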


2021 ◽  
Author(s):  
Martijn Witjes ◽  
Leandro Parente ◽  
Chris J. van Diemen ◽  
Tomislav Hengl ◽  
Martin Landa ◽  
...  

Abstract A seamless spatiotemporal machine learning framework for automated prediction, uncertainty assessment, and analysis of long-term LULC dynamics is presented. The framework includes: (1) harmonization and preprocessing of high-resolution spatial and spatiotemporal input datasets (GLAD Landsat, NPP/VIIRS) including 5 million harmonized LUCAS and CORINE Land Cover-derived training samples, (2) model building based on spatial k-fold cross-validation and hyper-parameter optimization, (3) prediction of the most probable class, class probabilities, and uncertainty per pixel, (4) LULC change analysis on time-series of produced maps. The spatiotemporal ensemble model consists of a random forest, a gradient boosted tree classifier, and an artificial neural network, with a logistic regressor as meta-learner. The results show that the most important variables for mapping LULC in Europe are seasonal aggregates of the Landsat green and near-infrared bands, multiple Landsat-derived spectral indices, long-term surface water probability, and elevation. Spatial cross-validation of the model indicates consistent performance across multiple years, with overall accuracy (weighted F1-score) of 0.49, 0.63, and 0.83 when predicting 44 (level-3), 14 (level-2), and 5 (level-1) classes. The spatiotemporal model outperforms spatial models by 2.7% on known-year classification and by 3.5% on unknown-year classification. The accuracy assessment using 48,365 independent test samples shows an 87% match with the validation points. Time-series analysis (of LULC probabilities and NDVI images) suggests forest loss in large parts of Sweden, the Alps, and Scotland. An advantage of using spatiotemporal ML is that the fitted model can be used to predict LULC in years that were not included in its training dataset, allowing generalization to past and future periods, e.g. to predict land cover for years prior to 2000 and beyond 2020.
The generated land cover time-series data stack (ODSE-LULC), including the training points, is publicly available via the Open Data Science (ODS)-Europe Viewer. Functions used to prepare data and run the modeling are available via the eumap library for Python.


Author(s):  
Luca Barbaglia ◽  
Sergio Consoli ◽  
Sebastiano Manzan

Abstract Forecasting economic and financial variables is a challenging task for several reasons, such as the low signal-to-noise ratio, regime changes, and the effect of volatility, among others. A recent trend is to extract information from news as an additional source for forecasting economic activity and financial variables. The goal is to evaluate whether news can improve forecasts from standard methods, which are usually not well-specified and have poor out-of-sample performance. In a currently ongoing project, our goal is to combine a richer information set that includes news with a state-of-the-art machine learning model. In particular, we leverage two recent advances in data science, Word Embedding and Deep Learning models, which have recently attracted extensive attention in many scientific fields. We believe that by combining the two methodologies, effective solutions can be built to improve the prediction accuracy for economic and financial time series. In this preliminary contribution, we provide an overview of the methodology under development and some initial empirical findings. The forecasting model is based on DeepAR, an auto-regressive probabilistic Recurrent Neural Network model, combined with GloVe Word Embeddings extracted from economic news. The target variable is the spread between the US 10-Year Treasury Constant Maturity and the 3-Month Treasury Constant Maturity (T10Y3M). The DeepAR model is trained on a large number of related GloVe Word Embedding time series and employed to produce point and density forecasts.

