Using Machine Learning for Dependable Outlier Detection in Environmental Monitoring Systems

2021 ◽  
Vol 5 (3) ◽  
pp. 1-30
Author(s):  
Gonçalo Jesus ◽  
António Casimiro ◽  
Anabela Oliveira

Sensor platforms used in environmental monitoring applications are often subject to harsh environmental conditions while monitoring complex phenomena. Designing dependable monitoring systems is therefore challenging, given the external disturbances affecting sensor measurements. Even the apparently simple task of outlier detection in sensor data becomes a hard problem, amplified by the difficulty of distinguishing true data errors due to sensor faults from deviations due to natural phenomena, which look like data errors. Existing solutions for runtime outlier detection typically assume that the physical processes can be accurately modeled, or that outliers consist of large deviations that are easily detected and filtered by appropriate thresholds. Other solutions assume that it is possible to deploy multiple sensors providing redundant data to support voting-based techniques. In this article, we propose a new methodology for dependable runtime detection of outliers in environmental monitoring systems, aiming to increase data quality by treating detected outliers. We propose the use of machine learning techniques to model the behavior of each sensor, exploiting the existence of correlated data provided by other related sensors. Using these models, along with knowledge of processed past measurements, it is possible to obtain accurate estimates of the observed environmental parameters and to build failure detectors that use these estimates. When a failure is detected, the estimates also make it possible to correct the erroneous measurements and hence improve overall data quality. Our methodology not only distinguishes truly abnormal measurements from deviations due to complex natural phenomena, but also quantifies the quality of each measurement, which is relevant from a dependability perspective. We apply the methodology to real datasets from a complex aquatic monitoring system, measuring temperature and salinity, and illustrate the process of building the machine learning prediction models using a technique based on Artificial Neural Networks, denoted ANNODE (ANN Outlier Detection). From this application, we also observe the effectiveness of the ANNODE approach for accurate outlier detection in harsh environments. We then validate these positive results by comparing ANNODE with state-of-the-art solutions for outlier detection. The results show that ANNODE improves on existing solutions in outlier detection accuracy.
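
A minimal sketch of the core idea, assuming nothing about the paper's actual network architecture or thresholds: a small neural network is trained to predict one sensor's reading from correlated neighboring sensors and recent history, and a measurement is flagged (and replaced by the estimate) when the prediction residual exceeds a threshold. All names, the window length, and the threshold rule are illustrative assumptions, not the ANNODE configuration.

```python
# Illustrative sketch of ANN-based outlier detection from correlated sensors.
# Architecture, features, and threshold are assumptions for demonstration,
# not the configuration used in the ANNODE paper.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic example: the target sensor correlates with two neighbors.
t = np.linspace(0, 20, 2000)
neighbor1 = np.sin(t) + 0.05 * rng.standard_normal(t.size)
neighbor2 = np.sin(t + 0.3) + 0.05 * rng.standard_normal(t.size)
target = 0.6 * neighbor1 + 0.4 * neighbor2 + 0.05 * rng.standard_normal(t.size)

# Features: correlated sensors plus the previous target measurement.
X = np.column_stack([neighbor1[1:], neighbor2[1:], target[:-1]])
y = target[1:]

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X[:1500], y[:1500])

# Flag measurements whose residual exceeds k standard deviations of the
# training residuals; replace flagged values with the model's estimate.
resid = y[:1500] - model.predict(X[:1500])
threshold = 4.0 * resid.std()

estimates = model.predict(X[1500:])
outliers = np.abs(y[1500:] - estimates) > threshold
corrected = np.where(outliers, estimates, y[1500:])  # correction step
print(f"flagged {outliers.sum()} of {outliers.size} measurements")
```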

Author(s):  
Claudia C. Gutiérrez Rodríguez ◽  
Sylvie Servigne

With continuing technological improvements, sensor infrastructures now support many current and promising environmental applications. Environmental monitoring systems built on such sensors remove geographical, temporal, and other restraints while increasing both the coverage and the quality of our understanding of the real world. However, a central issue for such applications is the uncertainty of the data coming from sensors, which may affect experts' decisions. In this paper, the authors address this problem with an approach dedicated to providing environmental monitoring applications and their users with data quality information.


Sensors ◽  
2017 ◽  
Vol 17 (10) ◽  
pp. 2329 ◽  
Author(s):  
Robert Vasta ◽  
Ian Crandell ◽  
Anthony Millican ◽  
Leanna House ◽  
Eric Smith

Author(s):  
Negin Yousefpour ◽  
Steve Downie ◽  
Steve Walker ◽  
Nathan Perkins ◽  
Hristo Dikanski

Bridge scour is a challenge throughout the U.S.A. and other countries. Despite the scale of the issue, there is still a substantial lack of robust methods for scour prediction to support reliable, risk-based management and decision making. Throughout the past decade, the use of real-time scour monitoring systems has gained increasing interest among state departments of transportation across the U.S.A. This paper introduces three distinct methodologies for scour prediction using advanced artificial intelligence (AI)/machine learning (ML) techniques based on real-time scour monitoring data. The scour monitoring data comprised riverbed and river stage elevation time series at bridge piers gathered from various sources. Deep learning algorithms showed promise in predicting bed elevation and water level variations up to a week in advance. Ensemble neural networks proved successful in predicting the maximum upcoming scour depth, using the sensor data observed at the onset of a scour episode together with bridge pier, flow, and riverbed characteristics. In addition, two common empirical scour models were calibrated against the observed sensor data using Bayesian inference, showing significant improvement in prediction accuracy. Overall, this paper introduces a novel approach for scour risk management by integrating emerging AI/ML algorithms with real-time monitoring systems for early scour forecasting.
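
To illustrate the time-series side of such an approach, the sketch below frames week-ahead riverbed-elevation prediction as a supervised sliding-window problem and fits a small LSTM. The window length, horizon, architecture, and synthetic data are assumptions for demonstration, not the configurations reported in the paper.

```python
# Illustrative sketch: week-ahead bed-elevation forecasting with an LSTM.
# Window, horizon, and architecture are assumptions for demonstration.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)

# Synthetic stand-in for hourly riverbed-elevation and river-stage series.
n = 5000
stage = np.sin(np.linspace(0, 60, n)) + 0.1 * rng.standard_normal(n)
bed = -0.5 * stage + 0.05 * rng.standard_normal(n)

window, horizon = 168, 168  # one week of hourly input, one week ahead
features = np.column_stack([bed, stage])

X = np.stack([features[i : i + window] for i in range(n - window - horizon)])
y = bed[window + horizon :]

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window, 2)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X[:4000], y[:4000], epochs=5, batch_size=64, verbose=0)

pred = model.predict(X[4000:], verbose=0).ravel()
print("test MAE:", np.abs(pred - y[4000:]).mean())
```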


Author(s):  
Manmohan Singh Yadav ◽  
Shish Ahamad

Environmental disasters like flooding, earthquakes, etc. cause catastrophic effects all over the world. WSN-based techniques have become popular in susceptibility modelling of such disasters due to their strength and efficiency in predicting such threats. This paper demonstrates a machine learning-based approach to predicting outliers in sensor data, with bagging, boosting, random subspace, SVM, and KNN based frameworks for outlier prediction using WSN data. First, the database, collected from 14 sensor motes and containing outliers due to intrusion, is preprocessed. Subsequently, a segmented database is created from sensor pairs. Finally, the data entropy is calculated and used as a feature to determine the presence of outliers with the different approaches. Results show that the KNN model has the highest prediction capability for outlier assessment.
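
A minimal sketch of the entropy-as-feature idea, assuming a windowed segmentation and a KNN classifier; the window size, binning, and synthetic labels are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch: entropy of windowed sensor readings as a feature for a
# KNN outlier classifier. Window size, binning, and labels are assumptions.
import numpy as np
from scipy.stats import entropy
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def window_entropy(values, bins=10):
    """Shannon entropy of a histogram over one window of readings."""
    counts, _ = np.histogram(values, bins=bins)
    return entropy(counts + 1e-9)  # smooth to avoid log(0)

# Synthetic windows: clean windows are narrow; windows perturbed as if by
# intrusion are more dispersed and hence higher-entropy.
clean = [rng.normal(0, 0.1, 50) for _ in range(200)]
dirty = [np.append(rng.normal(0, 0.1, 45), rng.uniform(-3, 3, 5))
         for _ in range(200)]

X = np.array([[window_entropy(w)] for w in clean + dirty])
y = np.array([0] * 200 + [1] * 200)

idx = rng.permutation(len(X))
train, test = idx[:300], idx[300:]

knn = KNeighborsClassifier(n_neighbors=5).fit(X[train], y[train])
print("accuracy:", knn.score(X[test], y[test]))
```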


2019 ◽  
Author(s):  
Tomer Sagi ◽  
Nitzan Shmueli ◽  
Bruce Friedman ◽  
Ruth Bergman

BACKGROUND: Public Electronic Medical Record (EMR) datasets are a goldmine for vendors and researchers seeking to develop analytics designed to assist caregivers in monitoring, diagnosing, and treating patients. Both complex machine-learning-based tools, which require copious amounts of data to train, and a simple trend graph presented in a patient-centered dashboard are sensitive to noise.

OBJECTIVE: We aim to systematically explore data errors in MIMIC-III, as a representative of secondary-use datasets, and the impact of these errors on downstream analytics.

METHODS: We discuss the unique challenge of accounting for a specific patient's medical condition and personal characteristics, such as age, weight, and gender, when identifying data errors from only a few measurements per patient. To do so, we examine the prevalence and manifestations of errors in one of the most popular public medical research databases, MIMIC-III. We then evaluate how these errors impact visual analytics, the score-based sepsis analytics SOFA and qSOFA, and a machine-learning-based sepsis predictor.

RESULTS: We find a variety of error patterns in MIMIC-III and highlight effective methods to find them. All analytics are found to be sensitive to sporadic errors. Visual analytics are severely impacted, limiting their usefulness in the presence of error. qSOFA and SOFA suffer score changes of +1 (of 3) and +2.3-4 (of 15), respectively. The sepsis predictor suffers a 0.01-0.3 score change compared to a median score of 0.08.

CONCLUSIONS: The use of statistical methods to detect data errors is limited to high-throughput scenarios and large data aggregations. There is a dearth of medical guidelines and error-detection practices to support the rule-based systems required to keep analytics safe and trustworthy in low-volume scenarios. Analytics developers should test their software's sensitivity to error on public datasets. The medical informatics community should improve support for medical data-quality endeavors by creating guidelines for plausible values and analytics robustness to error, and by collecting real-world dirty datasets that contain errors as they appear in normal EMR use.
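
The rule-based checking the conclusions call for can be as simple as per-signal plausible-value ranges. The sketch below applies such ranges to flag implausible vitals; the field names and bounds are illustrative assumptions for demonstration only, not clinical guidance and not drawn from the paper.

```python
# Illustrative rule-based plausible-value check for EMR vitals. The field
# names and bounds are assumptions for demonstration, not clinical guidance.
PLAUSIBLE_RANGES = {
    "heart_rate_bpm": (20, 300),
    "temperature_c": (25.0, 45.0),
    "resp_rate_per_min": (4, 80),
}

def flag_implausible(record: dict) -> list[str]:
    """Return the names of fields falling outside their plausible range."""
    flags = []
    for field, (lo, hi) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is not None and not (lo <= value <= hi):
            flags.append(field)
    return flags

# Example: a temperature of 98.6 suggests Fahrenheit entered as Celsius,
# one unit-confusion error pattern such checks can surface.
print(flag_implausible({"heart_rate_bpm": 72, "temperature_c": 98.6}))
# -> ['temperature_c']
```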


Author(s):  
Nripesh Trivedi

In this paper, characteristics of the data obtained from the sensors used in the OpenSense project are identified in order to build a data-oriented approach. The approach applies the Class Outliers: Distance-Based (CODB) and Hoeffding tree algorithms, from which machine learning models are built to detect outliers in a sensor data stream. The approach presented in this paper may be used for developing methodologies for data-oriented outlier detection.
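
A minimal sketch of the Hoeffding-tree half of this approach on a labeled stream, using the `river` library as the tooling (our assumption; the paper does not name its implementation). CODB has no common off-the-shelf implementation, so it is omitted here.

```python
# Illustrative sketch: streaming outlier classification with a Hoeffding tree
# via the `river` library (a tooling assumption, not the paper's setup).
from river import tree
import random

random.seed(0)
model = tree.HoeffdingTreeClassifier()

def synthetic_stream(n=2000):
    """Yield (features, is_outlier) pairs mimicking a labeled sensor stream."""
    for _ in range(n):
        if random.random() < 0.05:  # occasional outlier reading
            yield {"value": random.uniform(5, 10)}, True
        else:
            yield {"value": random.gauss(0, 1)}, False

correct = total = 0
for x, y in synthetic_stream():
    y_pred = model.predict_one(x)   # test-then-train evaluation
    if y_pred is not None:
        correct += int(y_pred == y)
        total += 1
    model.learn_one(x, y)

print(f"prequential accuracy: {correct / total:.3f}")
```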


2021 ◽  
Author(s):  
Otoniel José Campos Escobar ◽  
Peter Baumann

Multi-dimensional arrays (also known as raster data, gridded data, or datacubes) are key, if not essential, in many science and engineering domains. In the case of Earth sciences, a significant amount of the data produced falls into the category of array data, and the volume produced daily in this field is huge, making it hard for researchers to analyze it and retrieve valuable insight from it. 1-D sensor data, 2-D satellite imagery, 3-D x/y/t image time series and x/y/z subsurface voxel data, and 4-D x/y/z/t atmospheric and ocean data often amount to dozens of terabytes every day, and the rate is only expected to increase in the future. In response, Array Database systems were specifically designed and built to provide modeling, storage, and processing support for multi-dimensional arrays. They offer a declarative query language for flexible data retrieval, and some, e.g., rasdaman, provide federation processing and standards-based query capabilities compliant with OGC standards such as WCS, WCPS, and WMS. However, despite these advances, the gap between efficient information retrieval and the actual application of this data remains very broad, especially in the domains of artificial intelligence (AI) and machine learning (ML).

In this contribution, we present the state of the art in performing ML through Array Databases. First, a motivating example is introduced from the Deep Rain project, which aims at enhancing rainfall prediction accuracy in mountainous areas by implementing ML code on top of an Array Database. Deep Rain also explores novel methods for training prediction models by implementing server-side ML processing inside the database. A brief introduction to the Array Database rasdaman used in this project is also provided, featuring its standards-based query capabilities and the scalable federation processing features required for rainfall data processing. Next, the workflow approach for ML and Array Databases employed in the Deep Rain project is described in detail, listing the benefits of using an Array Database with declarative query language capabilities in the machine learning pipeline. A concrete use case illustrates step by step how these tools integrate. Then, an alternative approach is presented where ML is done inside the Array Database using user-defined functions (UDFs). Finally, a detailed comparison between the UDF and workflow approaches is presented, explaining their challenges and benefits.
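
To make the workflow approach concrete: a client-side ML pipeline would typically pull training data out of rasdaman with a WCPS query issued through a WCS ProcessCoverages request. The sketch below is hypothetical; the endpoint URL, the coverage name AvgLandTemp, and the coordinates are illustrative assumptions, not artifacts of the Deep Rain project.

```python
# Hypothetical sketch of the workflow approach: fetch a time series from a
# rasdaman server via WCPS, then hand it to an ML pipeline. The endpoint,
# coverage name, and coordinates are illustrative assumptions.
import requests
import numpy as np

ENDPOINT = "https://example.org/rasdaman/ows"  # hypothetical server

# WCPS: a 1-D temperature time series at one location, encoded as CSV.
wcps = """
for $c in (AvgLandTemp)
return encode($c[Lat(53.08), Long(8.80), ansi("2014-01":"2014-12")], "csv")
"""

response = requests.get(
    ENDPOINT,
    params={
        "service": "WCS",
        "version": "2.0.1",
        "request": "ProcessCoverages",
        "query": wcps,
    },
    timeout=60,
)
response.raise_for_status()

# Parse the CSV payload (rasdaman may wrap values in braces or quotes)
# into a NumPy array ready for model training.
text = response.text.strip().strip('"{}')
series = np.array([float(v) for v in text.split(",")])
print(series.shape)
```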

