occAssess: An R package for assessing potential biases in species occurrence data

2021 ◽  
Author(s):  
Robin James Boyd ◽  
Gary Powney ◽  
Claire Carvell ◽  
Oliver Pescott

Species occurrence records from a variety of sources are increasingly aggregated into heterogeneous databases and made available to ecologists for immediate analytical use. However, these data are typically biased, i.e. they are not a representative sample of the target population of interest, meaning that the information they provide may not be an accurate reflection of reality. It is therefore crucial that species occurrence data are properly scrutinised before they are used for research. In this article, we introduce occAssess, an R package that enables quick and easy screening of species occurrence data for potential biases. The package contains a number of discrete functions, each of which returns a measure of the potential for bias in one or more of the taxonomic, temporal, spatial and environmental dimensions. The outputs are provided visually (as ggplot2 objects) and do not include a formal recommendation as to whether data are of sufficient quality for any given inferential use. Instead, they should be used as ancillary information and viewed in the context of the question that is being asked, and the methods that are being used to answer it. We demonstrate the utility of occAssess by applying it to data on two key pollinator taxa in South America: leaf-nosed bats (Phyllostomidae) and hoverflies (Syrphidae). In this worked example, we briefly assess the degree to which various aspects of data coverage appear to have changed over time. We then discuss additional ways in which the package could be used, highlight its limitations, and point to where it could be improved in the future. Going forward, we hope that occAssess will help to improve the quality, and transparency, of assessments of species occurrence data as a necessary first step where they are being used for ecological research at large scales.
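The kind of screening described above can be illustrated with a minimal sketch. It is written in Python rather than R and does not use or reproduce the occAssess API; the function name `coverage_by_period`, the record layout and the 1-degree grid are all assumptions made for illustration. It counts, for each time period, how many records exist and how many distinct grid cells they occupy, which is one simple way to see whether temporal and spatial coverage have changed over time.

```python
from collections import defaultdict


def coverage_by_period(records, periods, cell_size=1.0):
    """Count records and distinct occupied grid cells per time period.

    records: iterable of (year, lon, lat) tuples (hypothetical layout).
    periods: list of (start_year, end_year) tuples, both ends inclusive.
    cell_size: grid resolution in degrees (an illustrative choice).
    """
    counts = defaultdict(int)
    cells = defaultdict(set)
    for year, lon, lat in records:
        for start, end in periods:
            if start <= year <= end:
                counts[(start, end)] += 1
                # Floor division assigns each point to a square grid cell.
                cell = (int(lon // cell_size), int(lat // cell_size))
                cells[(start, end)].add(cell)
                break
    return {p: {"n_records": counts[p], "n_cells": len(cells[p])} for p in periods}


# Made-up records, not data from the paper.
records = [
    (1955, -60.2, -3.1),
    (1958, -60.7, -3.4),
    (1999, -47.9, -15.8),
    (2005, -43.2, -22.9),
    (2010, -43.9, -22.1),
    (2012, -70.6, -33.4),
]
periods = [(1950, 1979), (1980, 2019)]
summary = coverage_by_period(records, periods)
for period, stats in summary.items():
    print(period, stats)
```

A summary like this is only ancillary information, in the spirit of the abstract: a rise in records or occupied cells between periods suggests changing coverage, but says nothing by itself about whether the data suit a particular analysis.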

2015 ◽  
Author(s):  
Alexander Zizka ◽  
Alexandre Antonelli

1. Large-scale species occurrence data from geo-referenced observations and collected specimens are crucial for analyses in ecology, evolution and biogeography. Despite the rapidly growing availability of such data, their use in evolutionary analyses is often hampered by tedious manual classification of point occurrences into operational areas, leading to a lack of reproducibility and concerns regarding data quality. 2. Here we present speciesgeocodeR, a user-friendly R package for data cleaning, data exploration and data visualization of species point occurrences using discrete operational areas, and linking them to analyses invoking phylogenetic trees. 3. The three core functions of the package are 1) automated and reproducible data cleaning, 2) rapid and reproducible classification of point occurrences into discrete operational areas in an adequate format for subsequent biogeographic analyses, and 3) a comprehensive summary and visualization of species distributions to explore large datasets and ensure data quality. In addition, speciesgeocodeR facilitates the access and analysis of publicly available species occurrence data, widely used operational areas and elevation ranges. Other functionalities include the implementation of minimum occurrence thresholds and the visualization of coexistence patterns and range sizes. speciesgeocodeR is accompanied by a richly illustrated and easy-to-follow tutorial and help functions.


Author(s):  
Alexander Zizka ◽  
Alexandre Antonelli ◽  
Daniele Silvestro

Geo-referenced species occurrences from public databases have become essential to biodiversity research and conservation. However, geographical biases are widely recognized as a factor limiting the usefulness of such data for understanding species diversity and distribution. In particular, differences in sampling intensity across a landscape due to differences in human accessibility are ubiquitous but may differ in strength among taxonomic groups and datasets. Although several factors have been described to influence human access (such as presence of roads, rivers, airports and cities), quantifying their specific and combined effects on recorded occurrence data remains challenging. Here we present sampbias, an algorithm and software for quantifying the effect of accessibility biases in species occurrence datasets. Sampbias uses a Bayesian approach to estimate how sampling rates vary as a function of proximity to one or multiple bias factors. The results are comparable among bias factors and datasets. We demonstrate the use of sampbias on a dataset of mammal occurrences from the island of Borneo, showing a high biasing effect of cities and a moderate effect of roads and airports. Sampbias is implemented as a well-documented, open-access and user-friendly R package that we hope will become a standard tool for anyone working with species occurrences in ecology, evolution, conservation and related fields.


2020 ◽  
Vol 15 (2) ◽  
pp. 69-80
Author(s):  
Jane Elith ◽  
Catherine Graham ◽  
Roozbeh Valavi ◽  
Meinrad Abegg ◽  
Caroline Bruce ◽  
...  

Species distribution models (SDMs) are widely used to predict and study distributions of species. Many different modeling methods and associated algorithms are used and continue to emerge. It is important to understand how different approaches perform, particularly when applied to species occurrence records that were not gathered in structured surveys (e.g. opportunistic records). This need motivated a large-scale, collaborative effort, published in 2006, that aimed to create objective comparisons of algorithm performance. As a benchmark, and to facilitate future comparisons of approaches, here we publish that dataset: point location records for 226 anonymized species from six regions of the world, with accompanying predictor variables in raster (grid) and point formats. A particularly interesting characteristic of this dataset is that independent presence-absence survey data are available for evaluation alongside the presence-only species occurrence data intended for modeling. The dataset is available on Open Science Framework and as an R package and can be used as a benchmark for modeling approaches and for testing new ways to evaluate the accuracy of SDMs.


2020 ◽  
Author(s):  
Arthur Vinicius Rodrigues ◽  
Gabriel Nakamura ◽  
Leandro Duarte

There is a large volume of occurrence records available in biodiversity databases, but researchers should ensure their quality before using them in scientific studies. A problem that might compromise the quality of occurrence data is species misidentification. We address this issue by presenting naturaList, an R package designed to classify species occurrence data according to identification reliability. naturaList allows users to classify species occurrences into up to six levels of confidence in species identification, and to filter occurrence data accordingly. The highest level of confidence is assigned to records identified by a specialist, whose name must be provided by the user. The other five levels of confidence are derived from the occurrence data. We demonstrate naturaList functions using occurrences of Alsophila setosa, a tree fern species from the Atlantic Forest, as an example. We classified and filtered data in grid cells in order to maintain only the highest-level records in each cell. Then we selected only those records classified in the two highest levels of confidence. From 323 occurrences of Alsophila setosa displaying geographic coordinates, 69 (21%) were identified by a specialist. After filtering the highest-level records inside grid cells, 102 records remained. From these grid cell filtered data, 38 occurrences (37%) were classified into the highest confidence level. Three records were removed using an interactive map module, because they fell in the sea or outside the native range of the species. Since we selected only records classified in the two highest levels of confidence, the final dataset contained 94 occurrence records. naturaList guarantees the reproducibility of occurrence data processing and cleaning. Macroecologists, biogeographers and taxonomists might benefit from using the naturaList package to evaluate the quality of species identification in occurrence data and to identify sites that need evaluation of the taxonomic classification of species.
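The two filtering steps in this abstract (keep the highest-level records in each grid cell, then keep only the two highest levels) can be sketched as follows. This is a Python illustration, not the naturaList implementation: the function name `filter_by_grid_cell`, the record fields, the numeric coding of levels (1 = most reliable) and the 0.5-degree cell size are all assumptions, as is the choice to keep every record tied at the best level within a cell.

```python
def filter_by_grid_cell(records, cell_size=0.5):
    """Keep, within each grid cell, only the records at the best confidence level.

    records: list of dicts with 'id', 'lon', 'lat' and 'level' keys, where
    level 1 is assumed to be the most reliable identification.
    """
    by_cell = {}
    for rec in records:
        cell = (int(rec["lon"] // cell_size), int(rec["lat"] // cell_size))
        by_cell.setdefault(cell, []).append(rec)

    kept = []
    for cell_records in by_cell.values():
        best_level = min(r["level"] for r in cell_records)
        # Ties at the best level are all retained (an illustrative choice).
        kept.extend(r for r in cell_records if r["level"] == best_level)
    return kept


# Made-up records; 'a' and 'b' fall in the same 0.5-degree cell.
records = [
    {"id": "a", "lon": -49.10, "lat": -26.20, "level": 1},
    {"id": "b", "lon": -49.20, "lat": -26.30, "level": 3},
    {"id": "c", "lon": -50.70, "lat": -27.60, "level": 4},
    {"id": "d", "lon": -51.30, "lat": -29.10, "level": 2},
]

grid_filtered = filter_by_grid_cell(records)
# Second step: retain only the two highest confidence levels.
final = [r for r in grid_filtered if r["level"] <= 2]
print([r["id"] for r in final])
```

In this toy input, record 'b' is dropped by the grid step because a better-identified record shares its cell, and record 'c' is dropped by the level threshold even though it is the best record in its own cell.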


2021 ◽  
Author(s):  
Robin J. Boyd ◽  
Gary D. Powney ◽  
Claire Carvell ◽  
Oliver L. Pescott

Ecography ◽  
2015 ◽  
Vol 38 (5) ◽  
pp. 541-545 ◽  
Author(s):  
Matthew E. Aiello-Lammens ◽  
Robert A. Boria ◽  
Aleksandar Radosavljevic ◽  
Bruno Vilela ◽  
Robert P. Anderson

Author(s):  
Michael K. Young ◽  
Daniel J. Isaak ◽  
Kevin S. McKelvey ◽  
Michael K. Schwartz ◽  
Kellie J. Carim ◽  
...  

2017 ◽  
Vol 28 (1) ◽  
pp. 309-320 ◽  
Author(s):  
Scott Powers ◽  
Valerie McGuire ◽  
Leslie Bernstein ◽  
Alison J Canchola ◽  
Alice S Whittemore

Personal predictive models for disease development play important roles in chronic disease prevention. The performance of these models is evaluated by applying them to the baseline covariates of participants in external cohort studies, with model predictions compared to subjects' subsequent disease incidence. However, the covariate distribution among participants in a validation cohort may differ from that of the population for which the model will be used. Since estimates of predictive model performance depend on the distribution of covariates among the subjects to which it is applied, such differences can cause misleading estimates of model performance in the target population. We propose a method for addressing this problem by weighting the cohort subjects to make their covariate distribution better match that of the target population. Simulations show that the method provides accurate estimates of model performance in the target population, while un-weighted estimates may not. We illustrate the method by applying it to evaluate an ovarian cancer prediction model targeted to US women, using cohort data from participants in the California Teachers Study. The methods can be implemented using open-source code for public use as the R-package RMAP (Risk Model Assessment Package) available at http://stanford.edu/~ggong/rmap/ .
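The idea of weighting cohort subjects so that their covariate distribution matches that of the target population can be illustrated with a minimal sketch. It uses simple post-stratification on a single categorical covariate, which is a stand-in chosen for illustration and not necessarily the weighting scheme used by the authors or by RMAP; the function names, strata and numbers are all hypothetical.

```python
from collections import Counter


def poststratification_weights(cohort_strata, target_shares):
    """Weight each subject by target share / cohort share of their stratum."""
    n = len(cohort_strata)
    cohort_counts = Counter(cohort_strata)
    return [target_shares[s] / (cohort_counts[s] / n) for s in cohort_strata]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Made-up validation cohort: older subjects are under-represented
# relative to a target population that is half 'young', half 'old'.
strata = ["young", "young", "young", "old"]
outcomes = [0, 0, 1, 1]             # observed disease incidence (0/1)
predicted = [0.2, 0.2, 0.2, 0.6]    # model-assigned risks
target_shares = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(strata, target_shares)

unweighted_observed = sum(outcomes) / len(outcomes)
weighted_observed = weighted_mean(outcomes, weights)
weighted_expected = weighted_mean(predicted, weights)

print(unweighted_observed, weighted_observed, weighted_expected)
```

Comparing the weighted observed incidence with the weighted mean predicted risk gives a calibration check for the target population, whereas the unweighted figures describe only the cohort as sampled.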


2018 ◽  
Vol 93 ◽  
pp. 333-343 ◽  
Author(s):  
Charlotte L. Outhwaite ◽  
Richard E. Chandler ◽  
Gary D. Powney ◽  
Ben Collen ◽  
Richard D. Gregory ◽  
...  

Ecology ◽  
2003 ◽  
Vol 84 (1) ◽  
pp. 242-251 ◽  
Author(s):  
Raphaël Pélissier ◽  
Pierre Couteron ◽  
Stéphane Dray ◽  
Daniel Sabatier
