Robust Multivariate Outlier Detection Methods for Environmental Data

2010 ◽  
Vol 136 (11) ◽  
pp. 1299-1304 ◽  
Author(s):  
Ibrahim Alameddine ◽  
Melissa A. Kenney ◽  
Russell J. Gosnell ◽  
Kenneth H. Reckhow


Author(s):  
John Waller

Geographic outliers at GBIF (Global Biodiversity Information Facility) are a known problem. Outliers can be errors, coordinates with high uncertainty, or simply occurrences from an undersampled region. In data-cleaning pipelines, outliers are often removed (even when they are legitimate points) because the researcher does not have time to verify each record one by one. Outlier points are usually occurrences that need attention. Currently, there is no outlier detection implemented at GBIF, and it is up to the user to flag outliers themselves.

DBSCAN (a density-based algorithm for discovering clusters in large spatial databases with noise) is a simple and popular clustering algorithm. It uses two parameters, (1) a distance and (2) a minimum number of points per cluster, to decide whether a point is an outlier. Since occurrence data can be very patchy, non-clustering distance-based methods will often fail (Fig. 1). DBSCAN does not need to know the expected number of clusters in advance, and it does well using distance alone, without additional environmental variables such as Bioclim. (A minimal sketch of the flagging logic appears after this abstract.)

Advantages of DBSCAN:
- Simple and easy to understand
- Only two parameters to set
- Scales well
- No additional data sources needed
- Users would understand how their data was changed

Drawbacks:
- Only uses distance
- Parameter settings must be chosen
- Sensitive to sparse global sampling
- Does not include any other relevant environmental information
- Can only flag outliers outside of a point blob

Outlier detection and error detection are different. If your goal is to produce a system with no false positives, it will fail. While more complex, environmentally informed outlier detection methods (like reverse jackknifing (Chapman 2005)) might perform better for certain examples or even in general, DBSCAN performs adequately on almost everything despite being very simple.

Currently I am using DBSCAN to find errors and assess dataset quality. It is a Spark job written in Scala (github). It does not run on species with many (>30,000) unique latitude-longitude points, since the current implementation relies on an in-memory distance matrix. However, around 99% of species (plants, animals, fungi) on GBIF have fewer than 30,000 unique lat-long points; only 2,283 of 222,993 species keys exceed that threshold. There are other implementations (example) that might scale to many more points. There are no immediate plans to include DBSCAN outliers as a data quality flag on GBIF, but it could be done fairly easily, since this type of method does not rely on any external environmental data sources and already runs on the GBIF cluster.
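To make the two-parameter idea concrete, here is a minimal, self-contained DBSCAN sketch in Scala (the language of the GBIF job, though this toy version is not the actual Spark implementation). The object name, the haversineKm helper, the epsKm and minPts values, and the sample coordinates are all illustrative assumptions. Points labelled -1 are the noise points a pipeline would flag as candidate outliers; like the current GBIF job, this version effectively computes all pairwise distances, which is why it only suits modest point counts.

```scala
import scala.collection.mutable

// A minimal DBSCAN sketch for lat-lon occurrence points (illustrative only).
object DbscanSketch {

  // Great-circle distance in km between two (lat, lon) points in degrees.
  def haversineKm(a: (Double, Double), b: (Double, Double)): Double = {
    val R = 6371.0
    val dLat = math.toRadians(b._1 - a._1)
    val dLon = math.toRadians(b._2 - a._2)
    val h = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(a._1)) * math.cos(math.toRadians(b._1)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * R * math.asin(math.min(1.0, math.sqrt(h)))
  }

  /** Returns a cluster label per point; -1 marks noise (candidate outliers). */
  def dbscan(pts: Array[(Double, Double)], epsKm: Double, minPts: Int): Array[Int] = {
    val n = pts.length
    val labels = Array.fill(n)(-2) // -2 = unvisited, -1 = noise, >=0 = cluster id

    // O(n) scan per query, so O(n^2) overall -- the in-memory bottleneck noted above.
    def neighbors(i: Int): Seq[Int] =
      (0 until n).filter(j => haversineKm(pts(i), pts(j)) <= epsKm)

    var cluster = -1
    for (i <- 0 until n if labels(i) == -2) {
      val seeds = neighbors(i)
      if (seeds.size < minPts) labels(i) = -1 // not a core point: tentatively noise
      else {
        cluster += 1
        labels(i) = cluster
        val queue = mutable.Queue(seeds: _*)
        while (queue.nonEmpty) {
          val j = queue.dequeue()
          if (labels(j) == -1) labels(j) = cluster // former noise becomes a border point
          if (labels(j) == -2) {
            labels(j) = cluster
            val jSeeds = neighbors(j)
            if (jSeeds.size >= minPts) queue ++= jSeeds // j is core: keep expanding
          }
        }
      }
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    // A dense blob of points plus one far-away record that DBSCAN flags as noise.
    val pts = Array((52.0, 4.0), (52.01, 4.01), (52.02, 3.99), (52.01, 3.98), (-30.0, 150.0))
    println(dbscan(pts, epsKm = 5.0, minPts = 3).mkString(", ")) // prints: 0, 0, 0, 0, -1
  }
}
```

The two knobs map directly onto the trade-offs listed above: a larger epsKm or smaller minPts flags fewer outliers, and because distance is the only input, no Bioclim-style layers are needed.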


2021 ◽  
Vol 181 ◽  
pp. 1146-1153
Author(s):  
Pedro Aguiar ◽  
António Cunha ◽  
Matus Bakon ◽  
Antonio M. Ruiz-Armenteros ◽  
Joaquim J. Sousa

2021 ◽  
Vol 15 (4) ◽  
pp. 1-20
Author(s):  
Georg Steinbuss ◽  
Klemens Böhm

Benchmarking unsupervised outlier detection is difficult. Outliers are rare, and existing benchmark data contains outliers with various and unknown characteristics. Fully synthetic data usually consists of outliers and regular instances with clear characteristics and thus, in principle, allows for a more meaningful evaluation of detection methods. Nonetheless, there have been only a few attempts to include synthetic data in benchmarks for outlier detection. This might be due to the imprecise notion of outliers or to the difficulty of achieving good coverage of different domains with synthetic data. In this work, we propose a generic process for generating datasets for such benchmarking. The core idea is to reconstruct regular instances from existing real-world benchmark data while generating outliers so that they exhibit insightful characteristics. We describe this generic process together with three instantiations of it that generate outliers with specific characteristics, such as local outliers. To validate the process, we perform a benchmark with state-of-the-art detection methods and carry out experiments to study the quality of the data reconstructed in this way. Besides showcasing the workflow, this confirms the usefulness of the proposed process; in particular, it yields regular instances close to those from the real data. Summing up, we propose and validate a new and practical process for the benchmarking of unsupervised outlier detection.
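The abstract gives no code, so the following is only a rough, hypothetical illustration of the reconstruct-then-generate idea, not the authors' actual instantiations. The Scala sketch (kept in the same language as the DBSCAN example above) fits independent per-feature normal distributions to a stand-in "real" dataset, resamples regular instances from the fit, and draws outliers from an inflated version of the same model so they land in low-density regions. The object name, the per-feature independence assumption, the inflation factor, and the synthetic stand-in data are all assumptions for illustration.

```scala
import scala.util.Random

// Hypothetical sketch of synthetic benchmark generation: model the regular
// instances of a real dataset, then sample both regular points and
// deliberately low-density "outliers" from it. Per-feature normals stand in
// for the paper's richer reconstruction step.
object SyntheticOutlierBenchmark {
  val rng = new Random(42)

  // Fit mean and standard deviation of one feature column.
  def fitNormal(xs: Array[Double]): (Double, Double) = {
    val mu = xs.sum / xs.length
    val sd = math.sqrt(xs.map(x => (x - mu) * (x - mu)).sum / xs.length)
    (mu, sd)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for a real-world benchmark dataset: 500 rows, 3 features.
    val real = Array.fill(500, 3)(10.0 + 2.0 * rng.nextGaussian())
    val fits = real.transpose.map(fitNormal) // per-feature (mean, sd)

    // "Reconstructed" regular instances: resample from the fitted model.
    val regular = Array.fill(500)(fits.map { case (mu, sd) => mu + sd * rng.nextGaussian() })

    // Synthetic outliers: inflate the spread (factor is an assumption) so
    // samples fall in regions the regular model rarely reaches.
    val inflate = 5.0
    val outliers = Array.fill(25)(fits.map { case (mu, sd) => mu + sd * inflate * rng.nextGaussian() })

    println(s"regular: ${regular.length} rows, outliers: ${outliers.length} rows")
  }
}
```

Because the generator knows which rows were sampled as outliers, the labels serve as ground truth for scoring detection methods, which is what makes this kind of synthetic benchmark evaluable in a way raw real-world data is not.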

