On the potential and challenges of using machine-learning for automated quality control of environmental sensor data

Author(s):  
Lennart Schmidt ◽  
Hannes Mollenhauer ◽  
Corinna Rebmann ◽  
David Schäfer ◽  
Antje Claussnitzer ◽  
...  

<p>With more and more data being gathered from environmental sensor networks, the importance of automated quality-control (QC) routines that provide usable data in near real-time is becoming increasingly apparent. Machine-learning (ML) algorithms exhibit high potential in this respect, as they can exploit the spatio-temporal relations among multiple sensors to identify anomalies while allowing for non-linear functional relations in the data. In this study, we evaluate the potential of ML for automated QC on two spatio-temporal datasets at different spatial scales: the first comprises atmospheric variables at 53 stations across Northern Germany; the second contains time series of soil moisture and temperature from 40 sensors at a small-scale measurement plot.</p><p>Furthermore, we investigate strategies to tackle three challenges that commonly arise when applying ML for QC: 1) As sensors may drop out, the ML models have to be robust against missing values in the input data. We address this by comparing different data imputation methods, coupled with a binary representation of whether a value is missing. 2) Quality flags that mark erroneous data points, needed as ground truth for model training, might not be available. And 3) there is no guarantee that the system under study is stationary, which might render the outputs of a trained model useless in the future. To address 2) and 3), we frame the problem both as a supervised and as an unsupervised learning problem. Here, unsupervised ML models can be beneficial because they do not require ground-truth data and can thus be retrained more easily should the system undergo significant changes. In this presentation, we discuss the performance, advantages and drawbacks of the proposed strategies to tackle the aforementioned challenges.
Thus, we provide a starting point for researchers in the largely untouched field of ML application for automated quality control of environmental sensor data.</p>
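
The missing-value strategy described above (imputation coupled with a binary representation of missingness) can be sketched in a few lines of pandas. This is a minimal illustration rather than the study's actual pipeline; the column names and the choice of linear interpolation are assumptions:

```python
import numpy as np
import pandas as pd

def impute_with_mask(df):
    """Fill gaps in sensor columns and append binary missing-value indicators.

    Linear interpolation handles interior gaps; the column median covers
    anything left at the series edges. The indicator columns let a model
    learn from the missingness pattern itself.
    """
    mask = df.isna().astype(int).add_suffix("_missing")
    filled = df.interpolate(limit_direction="both")
    filled = filled.fillna(filled.median())
    return pd.concat([filled, mask], axis=1)

# Hypothetical soil-moisture readings from three sensors
readings = pd.DataFrame({
    "sm_1": [0.31, np.nan, 0.33, 0.35],
    "sm_2": [0.28, 0.29, np.nan, np.nan],
    "sm_3": [0.40, 0.41, 0.42, 0.43],
})
features = impute_with_mask(readings)  # 3 filled columns + 3 mask columns
```

A model trained on `features` sees both the imputed value and whether it was actually observed, so a sensor dropout does not silently masquerade as a real measurement.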

2021 ◽  
Author(s):  
Julius Polz ◽  
Lennart Schmidt ◽  
Luca Glawion ◽  
Maximilian Graf ◽  
Christian Werner ◽  
...  

<p>We can observe a global decrease in well-maintained weather stations run by meteorological services and governmental institutes. At the same time, environmental sensor data are increasing through the use of opportunistic or remote-sensing approaches. Overall, the trend for environmental sensor networks is strongly towards automated routines, especially for quality control (QC), to provide usable data in near real-time. In a common QC scenario, data are flagged manually using expert knowledge and visual inspection by humans. To reduce this tedious process and to enable near-real-time data provision, machine-learning (ML) algorithms exhibit high potential, as they can be designed to imitate the experts' actions.</p><p>Here we address three common challenges when applying ML for QC: 1) robustness to missing values in the input data; 2) availability of training data, i.e. manual quality flags that mark erroneous data points; and 3) generalization of the model with respect to non-stationary behavior of one experimental system or to changes in the experimental setup when applied to a different study area. We approach the QC problem and the related issues both as a supervised and as an unsupervised learning problem, using deep neural networks on the one hand and dimensionality reduction combined with clustering algorithms on the other.</p><p>We compare the different ML algorithms on two time-series datasets to test their applicability across scales and domains. One dataset consists of signal levels of 4000 commercial microwave links distributed all over Germany that can be used to monitor precipitation. The second dataset contains time series of soil moisture and temperature from 120 sensors deployed at a small-scale measurement plot at the TERENO site “Hohes Holz”.</p><p>First results show that supervised ML provides optimized QC performance for an experimental system not subject to change, at the cost of laborious preparation of the training data.
The unsupervised approach is also able to separate valid from erroneous data with reasonable accuracy. However, it provides the additional benefit of not requiring manual flags and can thus be retrained more easily in case the system is subject to significant changes.</p><p>In this presentation, we discuss the performance, advantages and drawbacks of the proposed ML routines to tackle the aforementioned challenges. Thus, we aim to provide a starting point for researchers in the promising field of ML application for automated QC of environmental sensor data.</p>
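
The unsupervised route (dimensionality reduction followed by clustering) can be sketched with scikit-learn. The synthetic data, the choice of PCA and k-means, and the two-cluster assumption are all illustrative; the authors' actual features and algorithms may differ:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical per-time-step feature vectors: mostly valid readings,
# plus a small group of faulty ones (e.g. drifting or spiking sensors)
valid = rng.normal(0.0, 1.0, size=(500, 8))
faulty = rng.normal(6.0, 1.0, size=(25, 8))
X = np.vstack([valid, faulty])

X_std = StandardScaler().fit_transform(X)
Z = PCA(n_components=2).fit_transform(X_std)   # dimensionality reduction
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

# Flag the smaller cluster as suspect data
suspect = labels == np.argmin(np.bincount(labels))
```

Because no manual flags enter the pipeline, the whole procedure can simply be re-run when the experimental setup changes.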


Author(s):  
Wei Sun ◽  
Ethan Stoop ◽  
Scott S. Washburn

Florida’s interstate rest areas are heavily utilized by commercial trucks for overnight parking. Many of these rest areas regularly experience 100% utilization of available commercial truck parking spaces during the evening and early-morning hours. Communicating the availability of commercial truck parking to drivers before they arrive at a rest area would reduce unnecessary stops at full rest areas as well as driver anxiety. To do this, it is critical to implement a vehicle detection technology that correctly reflects the parking status of the rest area. The objective of this project was to evaluate three different wireless in-pavement vehicle detection technologies as applied to commercial truck parking at interstate rest areas. This paper focuses on the following aspects: (a) accuracy of vehicle detection in parking spaces, (b) installation, setup, and maintenance of the vehicle detection technology, and (c) truck parking trends at the rest area study site. The final project report includes a more detailed summary of the evaluation. The research team recorded video of the rest areas as ground-truth data and developed a software tool to compare the video data with the parking sensor data. Two accuracy tests (event accuracy and occupancy accuracy) were conducted to evaluate each sensor’s ability to correctly reflect the status of each parking space. Overall, all three technologies performed well, with accuracy rates of 95% or better on both tests. This result suggests that, for implementation, pricing and/or maintenance issues may be more significant factors in the choice of technology.
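
The two accuracy metrics can be illustrated with a toy example. These definitions are a plausible reading of the paper's tests, not the project's exact scoring rules, and the data are hypothetical:

```python
def occupancy_accuracy(sensor, truth):
    """Share of sampled instants at which the sensor's occupied/vacant state
    matches the video ground truth (1 = occupied, 0 = vacant)."""
    return sum(s == t for s, t in zip(sensor, truth)) / len(truth)

def event_accuracy(sensor_events, truth_events):
    """Fraction of ground-truth arrival/departure events the sensor detected."""
    return sum(1 for e in truth_events if e in sensor_events) / len(truth_events)

# Hypothetical states for one truck space over eight observation instants
truth  = [0, 0, 1, 1, 1, 1, 0, 0]
sensor = [0, 0, 1, 1, 0, 1, 0, 0]    # one misread instant
occ = occupancy_accuracy(sensor, truth)

# Events as (instant, kind) pairs recovered from the state sequences
truth_events  = [(2, "arrive"), (6, "depart")]
sensor_events = [(2, "arrive"), (6, "depart")]
evt = event_accuracy(sensor_events, truth_events)
```

Here the sensor misses one occupied instant (occupancy accuracy 7/8) but still catches both state-change events (event accuracy 1.0), which is why the two tests are reported separately.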


2021 ◽  
Vol 14 (6) ◽  
pp. 997-1005
Author(s):  
Sandeep Tata ◽  
Navneet Potti ◽  
James B. Wendt ◽  
Lauro Beltrão Costa ◽  
Marc Najork ◽  
...  

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in a virtually infinite number of ways. A good solution to this problem is one that generalizes well not only to known templates, such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean and discuss three key data management challenges: 1) managing the quality of ground-truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without them. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.


Author(s):  
Sudhakar Tummala ◽  
Venkata Sainath Gupta Thadikemalla ◽  
Barbara A.K. Kreilkamp ◽  
Erik B. Dam ◽  
Niels K. Focke

Author(s):  
Tyler F. Rooks ◽  
Andrea S. Dargie ◽  
Valeta Carol Chancey

Abstract. A shortcoming of using environmental sensors for the surveillance of potentially concussive events is substantial uncertainty regarding whether an event was caused by head acceleration (“head impacts”) or by sensor motion (with no head acceleration). The goal of the present study is to develop a machine learning model to classify environmental sensor data obtained in the field and to evaluate its performance against the proprietary classification algorithm used by the environmental sensor. Data were collected from Soldiers attending sparring sessions conducted under a U.S. Army Combatives School course. Data from one sparring session were used to train a decision tree classification algorithm to identify good and bad signals. Data from the remaining sparring sessions were kept as an external validation set. The performance of the proprietary algorithm used by the sensor was also compared to that of the trained algorithm. The trained decision tree was able to correctly classify 95% of events under internal cross-validation and 88% of events in the external validation set. Comparatively, the proprietary algorithm was only able to correctly classify 61% of the events. In general, the trained algorithm was better able to distinguish good from bad signals than the proprietary algorithm. The present study shows it is possible to train a decision tree algorithm using environmental sensor data collected in the field.
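
A decision tree classifier of this kind can be sketched with scikit-learn. The waveform features below (peak acceleration and pulse duration) and their value ranges are invented for illustration; the study's actual feature set is not described in the abstract:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
# Hypothetical features: peak acceleration (g) and pulse duration (ms).
# Head impacts tend to produce longer, larger pulses than sensor motion.
impacts = np.column_stack([rng.uniform(20, 80, 200), rng.uniform(5, 12, 200)])
motion  = np.column_stack([rng.uniform(5, 25, 200), rng.uniform(0.2, 3, 200)])
X = np.vstack([impacts, motion])
y = np.array([1] * 200 + [0] * 200)   # 1 = head impact, 0 = sensor motion

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)   # internal cross-validation
```

The `cv=5` call mirrors the internal cross-validation step; the external validation set from the held-out sparring sessions would be scored separately after fitting on the training session.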


1987 ◽  
Vol 9 ◽  
pp. 253
Author(s):  
N. Young ◽  
I. Goodwin

Ground surveys of the ice sheet in Wilkes Land, Antarctica, have been made on oversnow traverses operating out of Casey. Data collected include surface elevation, accumulation rate, snow temperature, and physical characteristics of the snow cover. By the nature of the surveys, the data are mostly restricted to line profiles. In some regions, aerial surveys of surface topography have been made over a grid network. Satellite imagery and remote sensing are two means of extrapolating the results from measurements along lines to an areal presentation; they are also the only source of data over large areas of the continent. Landsat images in the visible and near-infrared wavelengths clearly depict many of the large- and small-scale features of the surface. The intensity of the reflected radiation varies with the aspect and magnitude of the surface slope, revealing the surface topography. The multi-channel nature of the Landsat data is exploited to distinguish between different surface types through their different spectral signatures, e.g. bare ice, glaze, snow, etc. Additional information on surface type can be gained at a coarser scale from other satellite-borne sensors such as ESMR and SMMR. Textural enhancement of the Landsat images reveals the surface micro-relief. Features in the enhanced images are compared to ground-truth data from the traverse surveys to produce a classification of surface types across the images and to determine the magnitude of the observed surface topography and micro-relief. The images can then be used to monitor changes over time.


Author(s):  
◽  
S. S. Ray

<p><strong>Abstract.</strong> Crop classification and recognition is a very important application of remote sensing. In the last few years, machine-learning classification techniques have been emerging for crop classification. Google Earth Engine (GEE) is a platform for exploring multiple satellite datasets with advanced classification techniques without even downloading the satellite data. The main objective of this study is to explore the ability of different machine-learning classification techniques, namely Random Forest (RF), Classification And Regression Trees (CART) and Support Vector Machine (SVM), for crop classification. High-resolution optical data from Sentinel-2 MSI (10 m) was used for crop classification of major crops on the Indian Agricultural Research Institute (IARI) farm for the Rabi season 2016. Around 100 crop fields (~400 hectares) at IARI were analysed. Smartphone-based ground-truth data were collected. The best cloud-free Sentinel-2 MSI image (5 Feb 2016) was selected for classification using automatic filtering by the percentage-cloud-cover property in GEE. Polygons were used as the training feature space, based on the ground-truth data, for crop classification with the machine-learning techniques. Post-classification accuracy assessment was done through generation of the confusion matrix (producer and user accuracy), kappa coefficient and F value. This study found that, using the GEE cloud platform, satellite data access, filtering and pre-processing could be done very efficiently. In terms of overall classification accuracy and kappa coefficient, the Random Forest classifier (93.3%, 0.9178) performed better than the SVM (74.3%, 0.6867) and CART (73.4%, 0.6755) classifiers. For validation, data from the Field Operation Service Unit (FOSU) division of IARI were used, and encouraging results were obtained.</p>
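
The accuracy-assessment step can be reproduced from a confusion matrix alone. Below is a sketch of the kappa coefficient computation; the matrix values and the three-class setup are invented, not the study's results:

```python
import numpy as np

def kappa_from_confusion(cm):
    """Cohen's kappa from a confusion matrix (rows: reference, cols: predicted)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n                    # observed agreement
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n**2   # chance agreement from marginals
    return (po - pe) / (1 - pe)

# Toy 3-class confusion matrix (illustrative crop classes, 150 test pixels)
cm = [[48, 1, 1],
      [2, 45, 3],
      [1, 2, 47]]
k = kappa_from_confusion(cm)
```

Producer and user accuracies come from the same matrix, as the diagonal divided by the reference-class totals and the predicted-class totals, respectively.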


2018 ◽  
Author(s):  
Christian Damgaard

Abstract. In order to fit population ecological models, e.g. plant competition models, to new drone-aided image data, we need to develop statistical models that take into account the new type of measurement uncertainty introduced by machine-learning algorithms and quantify its importance for statistical inferences and ecological predictions. Here, it is proposed to quantify the uncertainty and bias of image-predicted plant taxonomy and abundance in a hierarchical statistical model that is linked to ground-truth data obtained by the pin-point method. It is critical that the error rate in the species identification process is minimized when the image data are fitted to the population ecological models, and several avenues for reaching this objective are discussed. The outlined method for statistically modelling known sources of uncertainty when applying machine-learning algorithms may be relevant for other applied scientific disciplines.


2018 ◽  
Author(s):  
Iason-Zois Gazis ◽  
Timm Schoening ◽  
Evangelos Alevizos ◽  
Jens Greinert

Abstract. In this study, high-resolution bathymetric multibeam and optical image data, both obtained within the Belgian manganese (Mn) nodule mining license area by the autonomous underwater vehicle (AUV) Abyss, were combined in order to create a predictive Random Forests (RF) machine learning model. AUV bathymetry reveals small-scale terrain variations, allowing the calculation of bathymetric derivatives such as slope, curvature, and ruggedness. Optical AUV imagery provides quantitative information regarding the distribution (number and median size) of Mn-nodules. Within the area considered in this study, Mn-nodules show a heterogeneous and spatially clustered pattern, and their number per square meter is negatively correlated with their median size. A prediction of the number of Mn-nodules was achieved by combining information derived from the acoustic and optical data in an RF model. This model was tuned by examining the influence of the training set size, the number of growing trees (ntree) and the number of predictor variables randomly selected at each RF node (mtry) on the RF prediction accuracy. The use of larger training data sets with higher ntree and mtry values increases the accuracy. To estimate the Mn-nodule abundance, these predictions were linked to ground-truth data acquired by box coring. Linking optical and hydro-acoustic data revealed a non-linear relationship between the Mn-nodule distribution and topographic characteristics. This highlights the importance of a detailed terrain reconstruction for predictive modelling of Mn-nodule abundance. In addition, this study underlines the necessity of a sufficient spatial distribution of the optical data to provide reliable modelling input for the RF.
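
The ntree/mtry tuning loop translates directly to scikit-learn's `n_estimators` and `max_features`. The synthetic regression data below stand in for the terrain derivatives and nodule counts, and the grid values are illustrative, not the study's settings:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for terrain derivatives -> nodule counts
X, y = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=0)

results = {}
for ntree in (50, 200):          # 'ntree' in R's randomForest terminology
    for mtry in (2, 4):          # 'mtry': predictors tried at each split
        rf = RandomForestRegressor(n_estimators=ntree, max_features=mtry,
                                   oob_score=True, random_state=0)
        rf.fit(X, y)
        results[(ntree, mtry)] = rf.oob_score_   # out-of-bag R^2
best = max(results, key=results.get)
```

The out-of-bag score gives a generalization estimate per setting without a separate validation split, which makes this kind of grid search cheap to run.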


2012 ◽  
Vol 9 (12) ◽  
pp. 18175-18210
Author(s):  
J. R. Taylor ◽  
H. L. Loescher

Abstract. National and international networks and observatories of terrestrial-based sensors are emerging rapidly. As such, there is demand for a standardized approach to data quality control, as well as for interoperability of data among sensor networks. The National Ecological Observatory Network (NEON) has begun constructing its first terrestrial observing sites, with 60 locations expected to be distributed across the US by 2017. This will result in over 14 000 automated sensors recording more than 100 TB of data per year. These data are then used to create other datasets and subsequent "higher-level" data products. In anticipation of this challenge, an overall data quality assurance plan has been developed and the first suite of data quality control measures defined. This data-driven approach focuses on automated methods for defining a suite of plausibility test parameter thresholds. Specifically, these plausibility tests scrutinize data range, persistence, and stochasticity for each measurement type by employing a suite of binary checks. The statistical basis for each of these tests is developed and the methods for calculating test parameter thresholds are explored here. While these tests have been used elsewhere, we apply them in a novel approach by deriving their test parameter thresholds from the data themselves. Finally, automated quality control is demonstrated with preliminary data from a NEON prototype site.
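
The three plausibility tests named above (range, persistence, and a simple stochasticity check on step size) can be sketched as binary checks on a series. The thresholds and window sizes below are illustrative, not NEON's calibrated values:

```python
import numpy as np

def range_test(x, lo, hi):
    """Flag values outside physically plausible bounds."""
    return (x < lo) | (x > hi)

def persistence_test(x, window=5, tol=1e-6):
    """Flag runs where the signal is 'stuck' (variation below tol over a window)."""
    flags = np.zeros(len(x), dtype=bool)
    for i in range(len(x) - window + 1):
        w = x[i:i + window]
        if w.max() - w.min() < tol:
            flags[i:i + window] = True
    return flags

def step_test(x, max_step):
    """Flag implausible jumps between consecutive samples (a basic stochasticity check)."""
    d = np.abs(np.diff(x, prepend=x[0]))
    return d > max_step

# Hypothetical air-temperature series (deg C) with a spike and a stuck sensor
t = np.array([12.1, 12.3, 45.0, 12.4, 12.4, 12.4, 12.4, 12.4, 12.6])
bad = range_test(t, -30, 40) | persistence_test(t) | step_test(t, max_step=10)
```

In the data-driven approach described above, the thresholds themselves (`lo`, `hi`, `tol`, `max_step`) would be derived statistically from historical data for each measurement type rather than set by hand.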

