Supervised and unsupervised machine-learning for automated quality control of environmental sensor data

Author(s):  
Julius Polz ◽  
Lennart Schmidt ◽  
Luca Glawion ◽  
Maximilian Graf ◽  
Christian Werner ◽  
...  

We can observe a global decrease in well-maintained weather stations operated by meteorological services and governmental institutes. At the same time, the volume of environmental sensor data is increasing through the use of opportunistic or remote sensing approaches. Overall, the trend for environmental sensor networks is strongly towards automated routines, especially for quality control (QC), to provide usable data in near real-time. In a common QC scenario, data are flagged manually using expert knowledge and visual inspection by humans. To reduce this tedious process and to enable near real-time data provision, machine-learning (ML) algorithms exhibit a high potential, as they can be designed to imitate the experts' actions.

Here we address three common challenges when applying ML for QC: 1) robustness to missing values in the input data; 2) availability of training data, i.e. manual quality flags that mark erroneous data points; and 3) generalization of the model with respect to non-stationary behavior of one experimental system or changes in the experimental setup when applied to a different study area. We approach the QC problem and the related issues both as a supervised and an unsupervised learning problem, using deep neural networks on the one hand and dimensionality reduction combined with clustering algorithms on the other.

We compare the different ML algorithms on two time-series datasets to test their applicability across scales and domains. One dataset consists of signal levels of 4000 commercial microwave links distributed all over Germany that can be used to monitor precipitation. The second dataset contains time series of soil moisture and temperature from 120 sensors deployed at a small-scale measurement plot at the TERENO site “Hohes Holz”.

First results show that supervised ML provides optimized QC performance for an experimental system not subject to change, at the cost of a laborious preparation of the training data. The unsupervised approach is also able to separate valid from erroneous data at reasonable accuracy. However, it provides the additional benefit that it does not require manual flags and can thus be retrained more easily in case the system is subject to significant changes.

In this presentation, we discuss the performance, advantages and drawbacks of the proposed ML routines to tackle the aforementioned challenges. Thus, we aim to provide a starting point for researchers in the promising field of ML application for automated QC of environmental sensor data.
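As a rough illustration of the unsupervised route described above (dimensionality reduction combined with clustering), the following Python sketch reduces windowed sensor readings with PCA and flags low-density points with DBSCAN. The window length, component count and clustering parameters are illustrative assumptions, not the settings used in the study.

# Minimal sketch of unsupervised QC: reduce windowed sensor readings to a few
# dimensions, cluster them, and flag low-density points as suspect.
# Window length, component count, and DBSCAN parameters are illustrative
# assumptions, not values from the study.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

def flag_anomalies(series, window=24, n_components=3, eps=0.8, min_samples=10):
    # Build overlapping windows so each sample describes local signal shape.
    windows = np.lib.stride_tricks.sliding_window_view(series, window)
    X = StandardScaler().fit_transform(windows)
    # Dimensionality reduction followed by density-based clustering.
    Z = PCA(n_components=n_components).fit_transform(X)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Z)
    # DBSCAN marks low-density points with label -1; treat them as flagged.
    return labels == -1

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 40, 2000)) + 0.1 * rng.standard_normal(2000)
signal[800:810] += 5.0            # injected error for demonstration
print(flag_anomalies(signal).sum(), "suspect windows")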

2020 ◽  
Author(s):  
Lennart Schmidt ◽  
Hannes Mollenhauer ◽  
Corinna Rebmann ◽  
David Schäfer ◽  
Antje Claussnitzer ◽  
...  

With more and more data being gathered from environmental sensor networks, the importance of automated quality-control (QC) routines to provide usable data in near real-time is becoming increasingly apparent. Machine-learning (ML) algorithms exhibit a high potential in this respect, as they are able to exploit the spatio-temporal relation of multiple sensors to identify anomalies while allowing for non-linear functional relations in the data. In this study, we evaluate the potential of ML for automated QC on two spatio-temporal datasets at different spatial scales: one is a dataset of atmospheric variables at 53 stations across Northern Germany; the second contains time series of soil moisture and temperature at 40 sensors at a small-scale measurement plot.

Furthermore, we investigate strategies to tackle three challenges that are commonly present when applying ML for QC: 1) As sensors might drop out, the ML models have to be designed to be robust against missing values in the input data. We address this by comparing different data imputation methods, coupled with a binary representation of whether a value is missing or not. 2) Quality flags that mark erroneous data points, which serve as ground truth for model training, might not be available. And 3) there is no guarantee that the system under study is stationary, which might render the outputs of a trained model useless in the future. To address 2) and 3), we frame the problem both as a supervised and an unsupervised learning problem. Here, the use of unsupervised ML models can be beneficial, as they do not require ground-truth data and can thus be retrained more easily should the system be subject to significant changes. In this presentation, we discuss the performance, advantages and drawbacks of the proposed strategies to tackle the aforementioned challenges. Thus, we provide a starting point for researchers in the largely untouched field of ML application for automated quality control of environmental sensor data.
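A minimal sketch of the missing-value strategy mentioned above (imputation plus a binary representation of missingness), assuming a scikit-learn pipeline with a mean imputer and a random forest; the imputer and classifier choices are illustrative stand-ins, not the study's exact setup.

# Sketch of making a model robust to sensor dropouts: impute missing values
# and append a binary mask telling the model which entries were originally
# missing. Imputer and classifier are illustrative, not the study's setup.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

X = np.array([[0.31, 12.1], [np.nan, 11.8], [0.29, np.nan], [0.30, 12.0]])
y = np.array([0, 1, 1, 0])   # toy quality flags: 0 = valid, 1 = erroneous

# add_indicator=True appends one binary column per feature with missing values.
model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X, y)
print(model.predict([[np.nan, 12.2]]))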


2021 ◽  
Author(s):  
Arturo Magana-Mora ◽  
Mohammad AlJubran ◽  
Jothibasu Ramasamy ◽  
Mohammed AlBassam ◽  
Chinthaka Gooneratne ◽  
...  

Abstract Objective/Scope. Lost circulation events (LCEs) are among the top causes of drilling nonproductive time (NPT). The presence of natural fractures and vugular formations causes loss of drilling-fluid circulation, and drilling depleted zones with incorrect mud weights can also lead to drilling-induced losses. LCEs can further develop into additional drilling hazards, such as stuck pipe incidents, kicks, and blowouts. Traditionally, an LCE is diagnosed only when there is a reduction in mud volume in the mud pits in the case of moderate losses, or a reduction of the mud column in the annulus in the case of total losses. Using machine learning (ML) to predict the presence of a loss zone ahead and to estimate fracture parameters is highly beneficial, as it can immediately alert the drilling crew to take the required actions to mitigate or cure LCEs. Methods, Procedures, Process. Although different computational methods have been proposed for the prediction of LCEs, there is a need to further improve the models and reduce the number of false alarms. Robust and generalizable ML models require a sufficiently large amount of data that captures the different parameters and scenarios representing an LCE. For this, we derived a framework that automatically searches through historical data, locates LCEs, and extracts the surface drilling and rheology parameters surrounding such events. Results, Observations, and Conclusions. We derived different ML models utilizing various algorithms and evaluated them using a data-split technique at the level of wells to find the most suitable model for the prediction of an LCE. From the model comparison, the random forest classifier achieved the best results and successfully predicted LCEs before they occurred. The developed LCE model is designed to be implemented in the real-time drilling portal as an aid to the drilling engineers and the rig crew to minimize or avoid NPT. Novel/Additive Information. The main contribution of this study is the analysis of real-time surface drilling parameters and sensor data to predict an LCE from a statistically representative number of wells. The large-scale analysis of several wells that appropriately describe the different conditions before an LCE is critical for avoiding model undertraining or lack of generalization. Finally, we formulated the prediction of LCEs as a time-series problem and considered parameter trends to accurately determine the early signs of LCEs.
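The well-level data split described above can be sketched as follows, assuming scikit-learn's GroupKFold with the well identifier as the grouping variable so that no well contributes to both training and testing. The features, labels and model parameters below are placeholders for illustration, not the paper's.

# Sketch of well-level evaluation: group cross-validation by well identifier
# with a random forest on surface-drilling features. Feature columns and
# parameters are placeholders, not the paper's.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.standard_normal((600, 8))        # e.g. flow, standpipe pressure, torque trends
y = rng.integers(0, 2, 600)              # 1 = lost-circulation event ahead (toy labels)
wells = rng.integers(0, 12, 600)         # well identifier for each sample

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
scores = cross_val_score(clf, X, y, groups=wells, cv=GroupKFold(n_splits=4), scoring="f1")
print("well-level F1 per fold:", np.round(scores, 3))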


2020 ◽  
Author(s):  
Catharine Fairbairn ◽  
Dahyeon Kang ◽  
Nigel Bosch

Background: Transdermal biosensors offer a noninvasive, low-cost technology for the assessment of alcohol consumption, with broad potential applications in addiction science. Older-generation transdermal devices feature bulky designs and sparse sampling intervals, limiting potential applications for transdermal technology. Recently, a new generation of transdermal devices has become available, featuring smartphone connectivity, compact designs, and rapid sampling. Here we present initial laboratory research examining the validity of a new-generation transdermal sensor prototype. Methods: Participants were young drinkers administered alcohol (target BAC = 0.08%) or no alcohol in the laboratory. Participants wore transdermal sensors while providing repeated breathalyzer (BrAC) readings. We assessed the association between BrAC (measured BrAC for a specific time point) and eBrAC (BrAC estimated based only on transdermal readings collected in the immediately preceding time interval). Extra-Trees machine learning algorithms, incorporating transdermal time-series features as predictors, were used to create eBrAC. Results: Failure rates for the new-generation prototype sensor were high (16%-34%). Among participants with usable new-generation sensor data, models demonstrated strong capabilities for separating drinking from non-drinking episodes, and significant (moderate) ability to differentiate BrAC levels within intoxicated participants. Differences between eBrAC and BrAC were 60% higher for models based on data from old-generation vs. new-generation devices. Model comparisons indicated that both time-series analysis and machine learning contributed significantly to final model accuracy. Conclusions: Results provide favorable preliminary evidence for the accuracy of real-time BAC estimates from a new-generation sensor. Future research featuring variable alcohol doses and real-world contexts will be required to further validate these devices.
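A hedged sketch of the modelling idea: derive simple time-series features from the preceding transdermal readings and regress BrAC with an Extra-Trees model. The window size, feature set and synthetic data below are assumptions for illustration only, not the study's pipeline.

# Sketch: rolling time-series features from a transdermal trace feeding an
# Extra-Trees regressor that estimates BrAC. Windows, features, and data
# are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

def time_series_features(tac, window=12):
    s = pd.Series(tac)
    return pd.DataFrame({
        "mean": s.rolling(window).mean(),
        "slope": s.diff().rolling(window).mean(),
        "max": s.rolling(window).max(),
        "std": s.rolling(window).std(),
    }).dropna()

rng = np.random.default_rng(2)
tac = np.clip(np.cumsum(rng.normal(0.001, 0.01, 500)), 0, None)  # toy transdermal trace
feats = time_series_features(tac)
brac = 0.5 * feats["mean"] + rng.normal(0, 0.005, len(feats))    # toy target values

model = ExtraTreesRegressor(n_estimators=300, random_state=0).fit(feats, brac)
print("eBrAC estimate for the last window:", model.predict(feats.tail(1))[0])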


2018 ◽  
Vol 126 ◽  
pp. 1656-1665 ◽  
Author(s):  
Connor Scully-Allison ◽  
Vinh Le ◽  
Eric Fritzinger ◽  
Scotty Strachan ◽  
Frederick C. Harris ◽  
...  

Author(s):  
Negin Yousefpour ◽  
Steve Downie ◽  
Steve Walker ◽  
Nathan Perkins ◽  
Hristo Dikanski

Bridge scour is a challenge throughout the U.S.A. and other countries. Despite the scale of the issue, there is still a substantial lack of robust methods for scour prediction to support reliable, risk-based management and decision making. Throughout the past decade, the use of real-time scour monitoring systems has gained increasing interest among state departments of transportation across the U.S.A. This paper introduces three distinct methodologies for scour prediction using advanced artificial intelligence (AI)/machine learning (ML) techniques based on real-time scour monitoring data. The scour monitoring data comprised riverbed and river stage elevation time series at bridge piers gathered from various sources. Deep learning algorithms showed promise in predicting bed elevation and water level variations as early as a week in advance. Ensemble neural networks proved successful in predicting the maximum upcoming scour depth, using the observed sensor data at the onset of a scour episode, and based on bridge pier, flow and riverbed characteristics. In addition, two common empirical scour models were calibrated against the observed sensor data using the Bayesian inference method, showing significant improvement in prediction accuracy. Overall, this paper introduces a novel approach to scour risk management by integrating emerging AI/ML algorithms with real-time monitoring systems for early scour forecasting.
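The abstract does not specify the deep learning architecture; the sketch below assumes a small LSTM forecasting bed elevation several steps ahead from recent riverbed and river-stage readings. The network size, window length, 7-step horizon and synthetic series are illustrative assumptions.

# Sketch: sequence model forecasting riverbed elevation several steps ahead
# from recent bed and stage readings. Architecture and horizon are assumed
# for illustration, not taken from the paper.
import numpy as np
from tensorflow import keras

def make_windows(series, lookback=48, horizon=7):
    X, y = [], []
    for i in range(len(series) - lookback - horizon):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback + horizon - 1, 0])  # bed elevation ahead
    return np.array(X), np.array(y)

rng = np.random.default_rng(3)
t = np.arange(2000)
bed = -0.001 * t + 0.2 * np.sin(t / 50) + 0.05 * rng.standard_normal(2000)
stage = 1.0 + 0.5 * np.sin(t / 30) + 0.05 * rng.standard_normal(2000)
data = np.column_stack([bed, stage])

X, y = make_windows(data)
model = keras.Sequential([
    keras.layers.Input(shape=X.shape[1:]),
    keras.layers.LSTM(32),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64, verbose=0)
print("7-step-ahead bed elevation forecast:", model.predict(X[-1:], verbose=0)[0, 0])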


2021 ◽  
Vol 11 (2) ◽  
pp. 472
Author(s):  
Hyeongmin Cho ◽  
Sangkyun Lee

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer's or manager's point of view, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially for large-scale, high-dimensional data such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping, with statistical benefits on large-scale, high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale, high-dimensional datasets.
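A loose sketch of the general idea behind such measures follows: project the data to a low dimension with random projections, then estimate separability (between-class distance relative to within-class spread) and in-class variability over bootstrap resamples. The formulas below are illustrative proxies, not the measures proposed in the paper.

# Illustrative proxies for class separability and in-class variability using
# random projections and bootstrapping; not the paper's actual measures.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

def quality_measures(X, y, n_components=16, n_boot=20, seed=0):
    rng = np.random.default_rng(seed)
    sep, var = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), len(X))            # bootstrap resample
        Z = GaussianRandomProjection(n_components, random_state=seed).fit_transform(X[idx])
        yb = y[idx]
        centroids = np.array([Z[yb == c].mean(axis=0) for c in np.unique(yb)])
        within = np.mean([Z[yb == c].std(axis=0).mean() for c in np.unique(yb)])
        between = np.mean(np.linalg.norm(centroids - centroids.mean(axis=0), axis=1))
        sep.append(between / within)                     # separability proxy
        var.append(within)                               # in-class variability proxy
    return np.mean(sep), np.mean(var)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (200, 100)), rng.normal(3, 1, (200, 100))])
y = np.repeat([0, 1], 200)
print(quality_measures(X, y))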


Author(s):  
Meenakshi Narayan ◽  
Ann Majewicz Fey

Abstract Sensor data predictions could significantly improve the accuracy and effectiveness of modern control systems; however, existing machine learning and advanced statistical techniques for forecasting time-series data require significant computational resources, which is not ideal for real-time applications. In this paper, we propose a novel forecasting technique called Compact Form Dynamic Linearization Model-Free Prediction (CFDL-MFP), which is derived from the existing model-free adaptive control framework. This approach enables near real-time forecasts of seconds' worth of time-series data due to its basis as an optimal control problem. The performance of the CFDL-MFP algorithm was evaluated using four real datasets, including force sensor readings from a surgical needle, ECG measurements for heart rate, and atmospheric temperature and Nile water level recordings. On average, the forecast accuracy of CFDL-MFP was 28% better than the benchmark Autoregressive Integrated Moving Average (ARIMA) algorithm. The maximum computation time of CFDL-MFP was 49.1 ms, which was 170 times faster than ARIMA. Forecasts were best for deterministic data patterns, such as the ECG data, with a minimum average root mean squared error of 0.2 ± 0.2.
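For intuition only, the sketch below applies the compact-form dynamic linearization idea to a univariate series by treating the previous sample as the "input": the next increment is modelled as a time-varying gain (pseudo partial derivative) times the latest increment, with the gain updated recursively. This is a heavily simplified stand-in, not the authors' CFDL-MFP algorithm, and the update constants are assumptions.

# Simplified CFDL-style one-step predictor for a univariate series; the
# pseudo-partial-derivative update and constants are illustrative only.
import numpy as np

def cfdl_one_step_forecast(y, eta=0.5, mu=1.0, phi0=1.0):
    phi = phi0
    preds = list(y[:3])                         # no forecast for the first samples
    for k in range(2, len(y) - 1):
        dy = y[k] - y[k - 1]                    # latest output increment
        du = y[k - 1] - y[k - 2]                # previous "input" increment
        # Projection-style update of the pseudo partial derivative phi.
        phi = phi + eta * du / (mu + du ** 2) * (dy - phi * du)
        preds.append(y[k] + phi * dy)           # forecast of y[k + 1]
    return np.array(preds)

t = np.linspace(0, 20, 400)
y = np.sin(t) + 0.05 * np.random.default_rng(5).standard_normal(400)
yhat = cfdl_one_step_forecast(y)
print("one-step RMSE:", np.sqrt(np.mean((yhat[3:] - y[3:]) ** 2)))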


2021 ◽  
Author(s):  
Aurore Lafond ◽  
Maurice Ringer ◽  
Florian Le Blay ◽  
Jiaxu Liu ◽  
Ekaterina Millan ◽  
...  

Abstract Abnormal surface pressure is typically the first indicator of a number of problematic events, including kicks, losses, washouts and stuck pipe. These events account for 60–70% of all drilling-related nonproductive time, so their early and accurate detection has the potential to save the industry billions of dollars. Detecting these events today requires an expert user watching multiple curves, which can be costly and subject to human error. The solution presented in this paper aims to augment traditional models with new machine learning techniques that detect these events automatically and support the monitoring of the drilling well. Today's real-time monitoring systems employ complex physical models to estimate surface standpipe pressure while drilling. These require many inputs and are difficult to calibrate. Machine learning is an alternative method to predict pump pressure, but on its own it needs significant labelled training data, which is often lacking in the drilling world. The new system combines these approaches: a machine learning framework is used to enable automated learning while the physical models compensate for any gaps in the training data. The system uses only standard surface measurements, is fully automated, and is continuously retrained while drilling to ensure the most accurate pressure prediction. In addition, a stochastic (Bayesian) machine learning technique is used, which provides not only a prediction of the pressure but also the uncertainty and confidence of this prediction. Last, the new system includes a data quality control workflow. It discards periods of low data quality from the pressure anomaly detection and enables smarter real-time event analysis. The new system has been tested on historical wells using a new test and validation framework. The framework runs the system automatically on large volumes of both historical and simulated data to enable cross-referencing of the results with observations. In this paper, we show the results of the automated test framework as well as the capabilities of the new system in two specific case studies, one on land and another offshore. Moreover, large-scale statistics demonstrate the reliability and efficiency of this new detection workflow. The new system builds on the trend in our industry to better capture and utilize digital data for optimizing drilling.
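A minimal sketch of the stochastic prediction idea: a probabilistic regressor returns both an expected standpipe pressure and an uncertainty for each new sample, and a measurement far outside that uncertainty band would be treated as a potential anomaly. BayesianRidge and the toy surface measurements below are illustrative stand-ins, not the system described in the paper.

# Sketch: Bayesian regression giving a pressure prediction plus uncertainty.
# Model choice and feature names are illustrative assumptions.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(6)
# toy surface measurements: flow rate, RPM, hook load
X = rng.uniform([500, 50, 100], [3000, 200, 300], size=(400, 3))
pressure = 0.8 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 30, 400)

model = BayesianRidge().fit(X, pressure)
mean, std = model.predict([[2000, 120, 180]], return_std=True)
print(f"predicted pressure: {mean[0]:.0f} +/- {2 * std[0]:.0f}")
# A measured pressure far outside this band would be flagged as anomalous.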


Author(s):  
Tyler F. Rooks ◽  
Andrea S. Dargie ◽  
Valeta Carol Chancey

Abstract A shortcoming of using environmental sensors for the surveillance of potentially concussive events is substantial uncertainty regarding whether an event was caused by head acceleration ("head impacts") or by sensor motion (with no head acceleration). The goal of the present study is to develop a machine learning model to classify environmental sensor data obtained in the field and to evaluate its performance against that of the proprietary classification algorithm used by the environmental sensor. Data were collected from Soldiers attending sparring sessions conducted under a U.S. Army Combatives School course. Data from one sparring session were used to train a decision tree classification algorithm to identify good and bad signals; data from the remaining sparring sessions were kept as an external validation set. The performance of the proprietary algorithm used by the sensor was also compared with that of the trained algorithm. The trained decision tree was able to correctly classify 95% of events for internal cross-validation and 88% of events for the external validation set. Comparatively, the proprietary algorithm was only able to correctly classify 61% of the events. In general, the trained algorithm was better able to predict whether a signal was good or bad than the proprietary algorithm. The present study shows it is possible to train a decision tree algorithm using environmental sensor data collected in the field.
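The evaluation setup described above can be sketched as follows: train a decision tree on events from one session, score it with internal cross-validation, then test on the held-out sessions. The event features and synthetic data are placeholders, not the study's feature set.

# Sketch: decision tree trained on one session, validated internally by
# cross-validation and externally on the remaining sessions. Features and
# labels below are toy placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X_train = rng.standard_normal((300, 5))    # session 1 event features
y_train = rng.integers(0, 2, 300)          # 1 = real head impact, 0 = sensor motion
X_ext = rng.standard_normal((200, 5))      # remaining sessions (external validation)
y_ext = rng.integers(0, 2, 200)

tree = DecisionTreeClassifier(max_depth=5, random_state=0)
print("internal CV accuracy:", cross_val_score(tree, X_train, y_train, cv=5).mean())
print("external accuracy:", tree.fit(X_train, y_train).score(X_ext, y_ext))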


2020 ◽  
Vol 24 (5) ◽  
pp. 709-722
Author(s):  
Kieran Woodward ◽  
Eiman Kanjo ◽  
Andreas Oikonomou ◽  
Alan Chamberlain

Abstract In recent years, machine learning has developed rapidly, enabling applications with high levels of recognition accuracy for speech and images. However, other types of data to which these models can be applied have not yet been explored as thoroughly. Labelling is an indispensable stage of data pre-processing that can be particularly challenging, especially when applied to single- or multi-modal real-time sensor data collection approaches. Currently, real-time sensor data labelling is an unwieldy process, with a limited range of tools available and poor performance characteristics, which can compromise the performance of the resulting machine learning models. In this paper, we introduce new techniques for labelling at the point of collection, coupled with a pilot study and a systematic performance comparison of two popular types of deep neural networks running on five custom-built devices and a comparative mobile app (68.5–89% accuracy for the within-device GRU model, with a highest LSTM model accuracy of 92.8%). These devices are designed to enable real-time labelling with various buttons, slide potentiometers and force sensors. This exploratory work illustrates several key features that inform the design of data collection tools that can help researchers select and apply appropriate labelling techniques to their work. We also identify common bottlenecks in each architecture and provide field-tested guidelines to assist in building adaptive, high-performance edge solutions.
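For orientation, the sketch below shows a compact recurrent classifier of the kind compared in the paper: it takes short windows of labelled sensor readings (e.g. force and potentiometer channels) and predicts the label given at collection time. The layer sizes, window length and synthetic data are illustrative assumptions, not the paper's architectures.

# Sketch: small GRU classifier over windows of multi-channel sensor data;
# swap the GRU layer for LSTM(32) to reproduce the kind of comparison made
# in the paper. Sizes and data are illustrative.
import numpy as np
from tensorflow import keras

n_windows, window_len, n_channels, n_labels = 500, 50, 3, 4
rng = np.random.default_rng(8)
X = rng.standard_normal((n_windows, window_len, n_channels)).astype("float32")
y = rng.integers(0, n_labels, n_windows)   # labels assigned at collection time

model = keras.Sequential([
    keras.layers.Input(shape=(window_len, n_channels)),
    keras.layers.GRU(32),
    keras.layers.Dense(n_labels, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
print("predicted window label:", model.predict(X[:1], verbose=0).argmax())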

