Outlier Detection at GBIF Using DBSCAN

Author(s):  
John Waller

Geographic outliers at GBIF (Global Biodiversity Information Facility) are a known problem. Outliers can be errors, coordinates with high uncertainty, or simply occurrences from an undersampled region. In data-cleaning pipelines, outliers are often removed (even if they are legitimate points) because the researcher does not have time to verify each record one by one. Outlier points are usually occurrences that need attention. Currently, there is no outlier detection implemented at GBIF, and it is up to the user to flag outliers themselves.

DBSCAN (a density-based algorithm for discovering clusters in large spatial databases with noise) is a simple and popular clustering algorithm. It uses two parameters, (1) a distance and (2) a minimum number of points per cluster, to decide whether a point is an outlier. Since occurrence data can be very patchy, non-clustering distance-based methods will often fail (Fig. 1). DBSCAN does not need to know the expected number of clusters in advance, does well using only distance, and does not require additional environmental variables such as Bioclim layers.

Advantages of DBSCAN:
Simple
Easy to understand
Only two parameters to set
Scales well
No additional data sources needed
Users would understand how their data was changed

Drawbacks:
Only uses distance
Must choose parameter settings
Sensitive to sparse global sampling
Does not include any other relevant environmental information
Can only flag outliers outside of a point blob

Outlier detection and error detection are different. If your goal is to produce a system with no false positives, it will fail. While more complex, environmentally informed outlier detection methods (like reverse jackknifing (Chapman 2005)) might perform better for certain examples or even in general, DBSCAN performs adequately on almost everything despite being very simple.

Currently I am using DBSCAN to find errors and assess dataset quality. It is a Spark job written in Scala (github). It does not run on species with a large number (>30K) of unique latitude-longitude points, since the current implementation relies on an in-memory distance matrix. However, around 99% of species (plants, animals, fungi) on GBIF have fewer than 30K unique lat-long points (only 2,283 of 222,993 species keys exceed this threshold). There are other implementations (example) that might scale to many more points. There are no immediate plans to include DBSCAN outliers as a data quality flag on GBIF, but it could be done somewhat easily, since this type of method does not rely on any external environmental data sources and already runs on the GBIF cluster.
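
For illustration, here is a minimal sketch of the clustering step in Python with scikit-learn; the production job described above is a Scala Spark implementation, and the eps (distance) and min_samples values below are arbitrary examples, not the settings used at GBIF.

```python
# Minimal sketch: flag geographic outliers for one species with DBSCAN.
# Points that DBSCAN labels as noise (-1) are the candidate outliers.
import numpy as np
from sklearn.cluster import DBSCAN

# Example occurrence coordinates in decimal degrees: [latitude, longitude].
coords = np.array([
    [55.68, 12.57], [55.70, 12.55], [55.65, 12.60],  # dense cluster
    [56.10, 12.40], [55.90, 12.80],
    [-33.87, 151.21],                                 # far-away point
])

# The haversine metric works on radians; eps is an angular distance,
# so a kilometre threshold is converted via the Earth's radius.
eps_km = 500.0
earth_radius_km = 6371.0
db = DBSCAN(eps=eps_km / earth_radius_km, min_samples=3, metric="haversine")
labels = db.fit_predict(np.radians(coords))

outliers = coords[labels == -1]   # noise points flagged by DBSCAN
print("Flagged outliers (lat, lon):", outliers)
```

With these illustrative settings, the single far-away point is labelled as noise while the dense cluster is kept; in practice the distance threshold and minimum cluster size would be tuned to the patchiness of occurrence data.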

2010 ◽  
Vol 136 (11) ◽  
pp. 1299-1304 ◽  
Author(s):  
Ibrahim Alameddine ◽  
Melissa A. Kenney ◽  
Russell J. Gosnell ◽  
Kenneth H. Reckhow

Epidemiologia ◽  
2021 ◽  
Vol 2 (3) ◽  
pp. 315-324
Author(s):  
Juan M. Banda ◽  
Ramya Tekumalla ◽  
Guanyu Wang ◽  
Jingyuan Yu ◽  
Tuo Liu ◽  
...  

As the COVID-19 pandemic continues to spread worldwide, an unprecedented amount of open data is being generated for medical, genetic, and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and from data generated on the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and measure the social dynamics of such a unique worldwide event for biomedical, biological, and epidemiological analyses. For this purpose, we present a large-scale curated dataset of over 1.12 billion tweets, growing daily, related to COVID-19 chatter generated from 1 January 2020 to 27 June 2021 at the time of writing. This freely available resource gives researchers worldwide an additional data source for conducting a wide and diverse range of research projects, such as epidemiological analyses, studies of emotional and mental responses to social distancing measures, identification of sources of misinformation, and stratified measurement of sentiment towards the pandemic in near real time, among many others.
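
As one hypothetical illustration of the kind of analysis this resource supports, the sketch below scores tweet texts (hydrated from released tweet IDs) with a lexicon-based sentiment model and aggregates them per day; the column names, example texts, and the choice of NLTK's VADER scorer are assumptions made here for demonstration and are not part of the dataset's own tooling.

```python
# Hypothetical sketch: daily sentiment over hydrated COVID-19 tweet texts.
# Texts, column names, and the VADER scorer are illustrative assumptions.
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

# Pretend these rows were hydrated from the curated tweet-ID lists.
tweets = pd.DataFrame({
    "created_at": ["2020-03-15", "2020-03-15", "2020-03-16"],
    "text": [
        "Staying home to keep everyone safe, we can do this together!",
        "So tired of lockdown, this is awful.",
        "Vaccine trials are showing encouraging early results.",
    ],
})

# Compound score in [-1, 1]; averaging per day gives a simple sentiment series.
tweets["compound"] = tweets["text"].apply(lambda s: sia.polarity_scores(s)["compound"])
print(tweets.groupby("created_at")["compound"].mean())
```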


2021 ◽  
Vol 15 (4) ◽  
pp. 1-20
Author(s):  
Georg Steinbuss ◽  
Klemens Böhm

Benchmarking unsupervised outlier detection is difficult. Outliers are rare, and existing benchmark data contains outliers with various and unknown characteristics. Fully synthetic data usually consists of outliers and regular instances with clear characteristics and thus, in principle, allows for a more meaningful evaluation of detection methods. Nonetheless, there have been only a few attempts to include synthetic data in benchmarks for outlier detection. This might be due to the imprecise notion of outliers or to the difficulty of achieving good coverage of different domains with synthetic data. In this work, we propose a generic process for generating datasets for such benchmarking. The core idea is to reconstruct regular instances from existing real-world benchmark data while generating outliers so that they exhibit insightful characteristics. We describe this generic process for benchmarking unsupervised outlier detection and then present three instantiations of it that generate outliers with specific characteristics, such as local outliers. To validate our process, we perform a benchmark with state-of-the-art detection methods and carry out experiments to study the quality of data reconstructed in this way. Besides showcasing the workflow, this confirms the usefulness of our proposed process. In particular, our process yields regular instances close to the ones from real data. Summing up, we propose and validate a new and practical process for benchmarking unsupervised outlier detection.
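
As a rough, hypothetical illustration of the general recipe (not the authors' actual procedure), the sketch below models the regular instances of a dataset, resamples them, and injects synthetic outliers with a simple global-outlier characteristic, yielding labelled benchmark data.

```python
# Hypothetical sketch of the general recipe: reconstruct regular instances from
# real data, then inject synthetic outliers with a chosen characteristic.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a real-world benchmark dataset (regular instances only).
real = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.5], size=(500, 2))

# "Reconstruct" regular instances: fit a simple Gaussian model and resample.
mean, cov = real.mean(axis=0), np.cov(real, rowvar=False)
regular = rng.multivariate_normal(mean, cov, size=500)

# Inject synthetic global outliers: sample from a much wider box and keep
# only points far from the estimated centre of the regular data.
box = rng.uniform(low=real.min(axis=0) - 5, high=real.max(axis=0) + 5, size=(200, 2))
far = np.linalg.norm(box - mean, axis=1) > 3 * np.sqrt(np.trace(cov))
outliers = box[far][:25]

X = np.vstack([regular, outliers])
y = np.r_[np.zeros(len(regular)), np.ones(len(outliers))]  # 1 = synthetic outlier
print(X.shape, int(y.sum()), "synthetic outliers")
```

Local outliers, as in one of the instantiations described above, would instead be placed just outside individual clusters rather than far from all of the data.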


2014 ◽  
Vol 668-669 ◽  
pp. 1374-1377 ◽  
Author(s):  
Wei Jun Wen

ETL refers to the process of data extraction, transformation and loading, and is a critical step in ensuring the quality, specification and standardization of marine environmental data. Marine data, because of their complexity, field diversity and huge volume, remain decentralized and heterogeneous in source and structure, with differing semantics, and hence are far from being able to provide effective data sources for decision making. ETL enables the construction of a marine environmental data warehouse through the cleaning, transformation, integration, loading and periodic updating of basic marine data. The paper presents research on rules for the cleaning, transformation and integration of marine data, on the basis of which an original ETL system for a marine environmental data warehouse is designed and developed. The system further helps guarantee data quality and correctness in future analysis and decision-making based on marine environmental data.
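
As a hypothetical illustration of one such ETL pass (the column names, sentinel values, unit conversion, and validity rules below are assumptions, not the rules developed in the paper), the sketch extracts raw station records, applies cleaning and standardisation, and loads the result into a warehouse table.

```python
# Hypothetical sketch of a single extract-transform-load pass for marine records.
import sqlite3
import pandas as pd

# Extract: raw station measurements (in practice from CSV dumps, FTP, sensors, ...).
raw = pd.DataFrame({
    "station_id": ["S1", "S2", "S3", "S4"],
    "sea_temp_f": [59.0, 61.5, None, 999.0],  # Fahrenheit; 999.0 used as a sentinel
    "salinity_psu": [35.1, 34.8, 36.0, 34.9],
    "obs_time": ["2014-05-01 10:00", "2014-05-01 10:00",
                 "2014-05-01 11:00", "2014-05-01 11:00"],
})

# Clean: turn sentinel values into missing data, drop rows missing mandatory fields.
clean = raw.copy()
clean.loc[clean["sea_temp_f"] == 999.0, "sea_temp_f"] = float("nan")
clean = clean.dropna(subset=["sea_temp_f"])

# Transform: standardise units (deg F -> deg C) and parse timestamps.
clean = clean.assign(
    sea_temp_c=(clean["sea_temp_f"] - 32.0) * 5.0 / 9.0,
    obs_time=pd.to_datetime(clean["obs_time"]),
).drop(columns=["sea_temp_f"])

# Load: append into a warehouse fact table.
with sqlite3.connect("marine_dw.db") as conn:
    clean.to_sql("fact_sea_surface", conn, if_exists="append", index=False)
print(clean)
```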


2011 ◽  
Vol 90 (2-3) ◽  
pp. 73-94 ◽  
Author(s):  
E.P.M. Meijs

Uphill and drainage line environments reveal many hiatuses or discordances because of truncation by erosion. In downslope positions, accumulation often prevailed outside the drainage lines and prevented erosion, even during unstable periods. Consequently, downslope sections yield the most detailed environmental data, but often lack contact with uphill series. However, for stratigraphical correlation the contact between downslope and uphill series is essential. In the Veldwezelt loess sequence this contact is intact, which provides additional data on transitional processes. In view of these special palaeoenvironmental conditions, exhibiting a transition between downslope and uphill areas and a south-east trending stream, an extraordinarily detailed Late Saalian, Eemian and Weichselian loess sequence could be reconstructed. The Veldwezelt series furnished important pedological, sedimentological, faunal, tephrochronological and cryogenic data, on the basis of which palaeoenvironmental conclusions could be drawn and six types of pedo-sedimentological cycles distinguished. A stratigraphical overview was obtained by correlating the Veldwezelt section with other west European loess frameworks and tephra sequences: the sedimentary series at Harmignies (Mons Basin, southern Belgium) and the Greenland GRIP ice core.


Elem Sci Anth ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Kai-Lan Chang ◽  
Martin G. Schultz ◽  
Xin Lan ◽  
Audra McClure-Begley ◽  
Irina Petropavlovskikh ◽  
...  

This paper is aimed at atmospheric scientists without formal training in statistical theory. Its goal is to (1) provide a critical review of the rationale for trend analysis of the time series typically encountered in the field of atmospheric chemistry, (2) describe a range of trend-detection methods, and (3) demonstrate effective means of conveying the results to a general audience. Trend detection in atmospheric chemical composition data is often challenged by a variety of sources of uncertainty; such data often behave differently from other environmental phenomena such as temperature, precipitation rate, or stream flow, and may require specific methods depending on the science questions to be addressed. Some sources of uncertainty can be explicitly included in the model specification, such as autocorrelation and seasonality, but some inherent uncertainties are difficult to quantify, such as data heterogeneity and measurement uncertainty due to the combined effect of short- and long-term natural variability, instrumental stability, and aggregation of data from sparse sampling frequencies. Failure to account for these uncertainties might result in inappropriate inference of the trends and their estimation errors. On the other hand, variation in extreme events might be interesting for different scientific questions, for example, the frequency of extremely high surface ozone events and their relevance to human health. In this study we aim to (1) review trend detection methods for addressing different levels of data complexity in different chemical species, (2) demonstrate that the incorporation of scientifically interpretable covariates can outperform pure numerical curve-fitting techniques in terms of uncertainty reduction and improved predictability, (3) illustrate the study of trends based on extreme quantiles that can provide insight beyond standard mean- or median-based trend estimates, and (4) present an advanced method of quantifying regional trends based on the inter-site correlations of multisite data. All demonstrations are based on time series of observed trace gases relevant to atmospheric chemistry, but the methods can be applied to other environmental data sets.
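
As a small illustration of two of these ideas (not the authors' code, and using a simulated series rather than real observations), the sketch below fits a trend to monthly data while carrying harmonic seasonal covariates, and compares the mean trend with a 95th-percentile trend from quantile regression.

```python
# Illustrative sketch: trend estimation with a seasonal covariate, for the mean
# and for the upper tail (95th percentile), on a simulated monthly series.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_months = 240                     # 20 years of monthly values
t = np.arange(n_months) / 12.0     # time in years

# Simulated ozone-like series: weak trend + seasonal cycle + noise.
y = 40 + 0.15 * t + 5 * np.sin(2 * np.pi * t) + rng.normal(0, 3, n_months)

# Design matrix: intercept, linear trend, and harmonic seasonal terms.
X = np.column_stack([np.ones(n_months), t,
                     np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])

mean_fit = sm.OLS(y, X).fit()            # mean trend with seasonal covariates
q95_fit = sm.QuantReg(y, X).fit(q=0.95)  # trend of the 95th percentile

print("mean trend      (units/yr):", round(mean_fit.params[1], 3))
print("95th pct. trend (units/yr):", round(q95_fit.params[1], 3))
```

In real applications the error term would also need to account for autocorrelation (for example via generalised least squares or a block bootstrap), which is one of the uncertainties discussed above.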

