sampbias, a method for quantifying geographic sampling biases in species distribution data

Abstract: Geo-referenced species occurrences from public databases have become essential to biodiversity research and conservation. However, geographical biases are widely recognized as a factor limiting the usefulness of such data for understanding species diversity and distribution. In particular, differences in sampling intensity across a landscape due to differences in human accessibility are ubiquitous but may differ in strength among taxonomic groups and datasets. Although several factors have been described to influence human access (such as presence of roads, rivers, airports and cities), quantifying their specific and combined effects on recorded occurrence data remains challenging. Here we present sampbias, an algorithm and software for quantifying the effect of accessibility biases in species occurrence datasets. Sampbias uses a Bayesian approach to estimate how sampling rates vary as a function of proximity to one or multiple bias factors. The results are comparable among bias factors and datasets. We demonstrate the use of sampbias on a dataset of mammal occurrences from the island of Borneo, showing a high biasing effect of cities and a moderate effect of roads and airports. Sampbias is implemented as a well-documented, open-access and user-friendly R package that we hope will become a standard tool for anyone working with species occurrences in ecology, evolution, conservation and related fields.
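The idea of a sampling rate that depends on proximity to bias factors can be illustrated with a minimal sketch. This is not the sampbias implementation: it assumes, purely for illustration, that the expected number of records in a grid cell decays exponentially with the distance to each bias factor and that counts are Poisson-distributed. The function names, the parameters `q` and `weights`, and the toy distances and counts are all hypothetical, and the sketch only evaluates a likelihood rather than performing Bayesian estimation.

```python
import math

def expected_rate(q, weights, distances):
    """Expected records in one cell under an ASSUMED exponential decay:
    q * exp(-sum_j w_j * d_j), where d_j is the distance to bias factor j."""
    return q * math.exp(-sum(w * d for w, d in zip(weights, distances)))

def poisson_log_likelihood(counts, q, weights, cell_distances):
    """Sum of Poisson log-probabilities of the observed counts per cell."""
    total = 0.0
    for n, dists in zip(counts, cell_distances):
        lam = expected_rate(q, weights, dists)
        total += n * math.log(lam) - lam - math.lgamma(n + 1)
    return total

# Hypothetical toy data: distance (km) of three cells to a city and a road,
# and the number of occurrence records observed in each cell.
cell_distances = [(1.0, 5.0), (10.0, 2.0), (50.0, 40.0)]
counts = [12, 7, 0]

# Compare a distance-dependent model with a flat model (no accessibility bias).
ll_biased = poisson_log_likelihood(counts, 15.0, [0.05, 0.02], cell_distances)
ll_flat = poisson_log_likelihood(counts, sum(counts) / len(counts),
                                 [0.0, 0.0], cell_distances)
print(ll_biased, ll_flat)
```

With these toy numbers the distance-dependent parameters give the higher log-likelihood, which is the kind of signal an estimation procedure would pick up when records cluster near cities or roads.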

speciesgeocodeR: An R package for linking species occurrences, user-defined regions and phylogenetic trees for biogeography, ecology and evolution

10.1101/032755 ◽

2015 ◽

Cited By ~ 6

Author(s):

Alexander Zizka ◽

Alexandre Antonelli

Keyword(s):

Data Quality ◽

Phylogenetic Trees ◽

Large Scale ◽

Data Cleaning ◽

R Package ◽

Species Occurrence ◽

Occurrence Data ◽

User Friendly ◽

Species Occurrences

1. Large-scale species occurrence data from geo-referenced observations and collected specimens are crucial for analyses in ecology, evolution and biogeography. Despite the rapidly growing availability of such data, their use in evolutionary analyses is often hampered by tedious manual classification of point occurrences into operational areas, leading to a lack of reproducibility and concerns regarding data quality.
2. Here we present speciesgeocodeR, a user-friendly R package for data cleaning, data exploration and data visualization of species point occurrences using discrete operational areas, and for linking them to analyses invoking phylogenetic trees.
3. The three core functions of the package are (1) automated and reproducible data cleaning, (2) rapid and reproducible classification of point occurrences into discrete operational areas in a format suitable for subsequent biogeographic analyses, and (3) a comprehensive summary and visualization of species distributions to explore large datasets and ensure data quality. In addition, speciesgeocodeR facilitates access to and analysis of publicly available species occurrence data, widely used operational areas and elevation ranges. Other functionalities include the implementation of minimum occurrence thresholds and the visualization of coexistence patterns and range sizes. SpeciesgeocodeR is accompanied by a richly illustrated and easy-to-follow tutorial and help functions.
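The classification of point occurrences into discrete operational areas can be sketched as a point-in-polygon test. The example below is a hedged illustration in Python, not the speciesgeocodeR API: the function names, the two rectangular areas and the occurrence records are hypothetical, coordinates are treated as planar, and areas are assumed not to overlap.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: count how often a ray going east from the point
    crosses the polygon's edges; an odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge spans the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def classify_occurrences(occurrences, areas):
    """Map each species to the set of areas in which it was recorded."""
    result = {}
    for species, lon, lat in occurrences:
        found = result.setdefault(species, set())
        for name, polygon in areas.items():
            if point_in_polygon(lon, lat, polygon):
                found.add(name)
                break  # areas are assumed not to overlap
    return result

# Hypothetical operational areas (lists of lon/lat vertices) and records.
areas = {
    "west": [(0, 0), (5, 0), (5, 5), (0, 5)],
    "east": [(5, 0), (10, 0), (10, 5), (5, 5)],
}
occurrences = [("sp_a", 1, 1), ("sp_a", 7, 2), ("sp_b", 8, 4), ("sp_c", 20, 20)]
classified = classify_occurrences(occurrences, areas)
print(classified)
```

A species-by-area mapping of this kind is the sort of table that biogeographic analyses typically take as input; records falling outside every area (here `sp_c`) end up with an empty set, which makes them easy to flag during data cleaning.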

occAssess: An R package for assessing potential biases in species occurrence data

10.1101/2021.04.19.440441 ◽

2021 ◽

Author(s):

Robin James Boyd ◽

Gary Powney ◽

Claire Carvell ◽

Oliver Pescott

Keyword(s):

Target Population ◽

R Package ◽

Heterogeneous Databases ◽

Ecological Research ◽

Species Occurrence ◽

Worked Example ◽

Occurrence Data ◽

Data Coverage ◽

Discrete Functions ◽

Occurrence Records

Species occurrence records from a variety of sources are increasingly aggregated into heterogeneous databases and made available to ecologists for immediate analytical use. However, these data are typically biased, i.e. they are not a representative sample of the target population of interest, meaning that the information they provide may not be an accurate reflection of reality. It is therefore crucial that species occurrence data are properly scrutinised before they are used for research. In this article, we introduce occAssess, an R package that enables quick and easy screening of species occurrence data for potential biases. The package contains a number of discrete functions, each of which returns a measure of the potential for bias in one or more of the taxonomic, temporal, spatial and environmental dimensions. The outputs are provided visually (as ggplot2 objects) and do not include a formal recommendation as to whether data are of sufficient quality for any given inferential use. Instead, they should be used as ancillary information and viewed in the context of the question that is being asked, and the methods that are being used to answer it. We demonstrate the utility of occAssess by applying it to data on two key pollinator taxa in South America: leaf-nosed bats (Phyllostomidae) and hoverflies (Syrphidae). In this worked example, we briefly assess the degree to which various aspects of data coverage appear to have changed over time. We then discuss additional ways in which the package could be used, highlight its limitations, and point to where it could be improved in the future. Going forward, we hope that occAssess will help to improve the quality, and transparency, of assessments of species occurrence data as a necessary first step where they are being used for ecological research at large scales.
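As a rough illustration of what screening data coverage over time can look like, the sketch below counts records and distinct species per time period. It is a generic Python example under stated assumptions, not one of the occAssess functions: the function name `coverage_by_period`, the record layout and the toy data are hypothetical.

```python
def coverage_by_period(records, periods):
    """Summarise coverage per period.

    records: iterable of (species, year) tuples.
    periods: list of (start_year, end_year) tuples, both ends inclusive.
    """
    summary = []
    for start, end in periods:
        in_period = [(sp, yr) for sp, yr in records if start <= yr <= end]
        summary.append({
            "period": (start, end),
            "n_records": len(in_period),
            "n_species": len({sp for sp, _ in in_period}),
        })
    return summary

# Hypothetical occurrence records: (species, year of observation).
records = [("a", 1990), ("a", 1995), ("b", 1999),
           ("a", 2005), ("b", 2010), ("c", 2012), ("c", 2015)]
summary = coverage_by_period(records, [(1990, 1999), (2000, 2019)])
for row in summary:
    print(row)
```

Large differences between periods in such counts do not prove bias on their own, but they indicate where apparent changes over time may reflect recording effort rather than the species themselves.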

Standardized NEON organismal data for biodiversity research

10.32942/osf.io/8kun3 ◽

2021 ◽

Author(s):

Daijiang Li ◽

Sydne Record ◽

Eric Sokol ◽

Matthew E. Bitters ◽

Melissa Y. Chen ◽

...

Keyword(s):

Spatial Scales ◽

R Package ◽

The United States ◽

Ecological Community ◽

Biodiversity Research ◽

Long Time ◽

Community Data ◽

Patterns And Processes ◽

Taxonomic Groups ◽

Data Design

Understanding patterns and drivers of species distributions and abundances, and thus biodiversity, is a core goal of ecology. Despite advances in recent decades, research into these patterns and processes is currently limited by a lack of standardized, high-quality, empirical data that span large spatial scales and long time periods. The National Ecological Observatory Network (NEON) fills this gap by providing freely available observational data that are generated during robust and consistent organismal sampling of several sentinel taxonomic groups within 81 sites distributed across the United States, and that will be collected for at least 30 years. The breadth and scope of these data provide a unique resource for advancing biodiversity research. To maximize the potential of this opportunity, however, it is critical that NEON data be maximally accessible and easily integrated into investigators’ workflows and analyses. To facilitate its use for biodiversity research and synthesis, we created a workflow to process and format NEON organismal data into the ecocomDP (ecological community data design pattern) format and made it available through the `ecocomDP` R package; we then provided the standardized data as an R data package (`neonDivData`). We briefly summarize sampling designs and data wrangling decisions for the major taxonomic groups included in this effort. Our workflows are open-source so the biodiversity community may: add taxonomic groups; modify the workflow to produce datasets appropriate for their own analytical needs; and regularly update the data packages as more observations become available. Finally, we provide two simple examples of how the standardized data may be used for biodiversity research. By providing a standardized data package, we hope to enhance the utility of NEON organismal data in advancing biodiversity research.

Research applications of primary biodiversity databases in the digital age

10.1101/605071 ◽

2019 ◽

Author(s):

Joan E. Ball-Damerow ◽

Laura Brenskelle ◽

Narayani Barve ◽

Pamela S. Soltis ◽

Petra Sierwald ◽

...

Keyword(s):

Data Quality ◽

Habitat Degradation ◽

Data Types ◽

Species Occurrence ◽

Online Systems ◽

Occurrence Data ◽

Data Compilation ◽

Taxonomic Groups ◽

Extinction Events ◽

Biodiversity Databases

Abstract: We are in the midst of unprecedented change: climate shifts and sustained, widespread habitat degradation have led to dramatic declines in biodiversity rivaling historical extinction events. At the same time, new approaches to publishing and integrating previously disconnected data resources promise to help provide the evidence needed for more efficient and effective conservation and management. Stakeholders have invested considerable resources to contribute to online databases of species occurrences and genetic barcodes. However, estimates suggest that only 10% of biocollections are available in digital form. The biocollections community must therefore continue to promote digitization efforts, which in part requires demonstrating compelling applications of the data. Our overarching goal is therefore to determine trends in use of mobilized species occurrence data since 2010, as online systems have grown and now provide over one billion records. To do this, we characterized 501 papers that use openly accessible biodiversity databases. Our standardized tagging protocol was based on key topics of interest, including: database(s) used, taxa addressed, general uses of data, other data types linked to species occurrence data, and data quality issues addressed. We found that the most common uses of online biodiversity databases have been to estimate species distribution and richness, to outline data compilation and publication, and to assist in developing species checklists or describing new species. Only 69% of papers in our dataset addressed one or more aspects of data quality, which is low considering common errors and biases known to exist in opportunistic datasets. Globally, we find that biodiversity databases are still in the initial stages of data compilation. Novel and integrative applications are restricted to certain taxonomic groups and regions with higher numbers of quality records. Continued data digitization, publication, enhancement, and quality control efforts are necessary to make biodiversity science more efficient and relevant in our fast-changing world.

occAssess: An R package for assessing potential biases in species occurrence data

Ecology and Evolution ◽

10.1002/ece3.8299 ◽

2021 ◽

Author(s):

Robin J. Boyd ◽

Gary D. Powney ◽

Claire Carvell ◽

Oliver L. Pescott

Keyword(s):

R Package ◽

Species Occurrence ◽

Occurrence Data

Geographical and temporal distribution of hawkmoth (Lepidoptera: Sphingidae) species in Africa

Biodiversity Data Journal ◽

10.3897/bdj.9.e70912 ◽

2021 ◽

Vol 9 ◽

Author(s):

Esther Kioko ◽

Alex Musyoki ◽

Augustine Luanga ◽

Mwinzi Kioko ◽

Esther Mwangi ◽

...

Keyword(s):

Species Diversity ◽

Central Africa ◽

Temporal Distribution ◽

Conservation Strategies ◽

Distribution Data ◽

Species Occurrence ◽

Continuous Growth ◽

Occurrence Data ◽

National Museums

Most adult hawkmoths are nocturnal, although some genera are day-flying. Hawkmoth species show a wide variety of life-history traits: most adults are nectarivorous (with some honey-feeding exceptions), but there are also species that do not feed at all. The nectarivorous species are an important component of tropical ecosystems, where the adults are major pollinators of both crops and wild flora. Pollinators are in decline worldwide and baseline data are needed to inform their conservation strategies. Species occurrence data from museum collections have been shown to be of great value as a tool for prioritising conservation actions in Africa. The National Museums of Kenya (NMK) have a large and active entomology collection that is in continuous growth. The NMK’s collection of hawkmoths had not been digitised prior to 2017. The moth family Sphingidae includes about 1,602 species and 205 genera worldwide (Kitching et al. 2018), with the majority of these species occurring in Africa. These moth species can also be used as indicators in biodiversity assessments, as they can be easily sampled and identified. However, hawkmoths have rarely been surveyed over the long term for this purpose. Long-term datasets are of unquestionable significance for understanding and monitoring temporal changes in biodiversity. These hawkmoth data address one of the most significant challenges to insect conservation, the lack of baseline information concerning species diversity and distribution. They also provide key historic hawkmoth species diversity and distribution data that can be used to monitor hawkmoth populations in the face of climate change and other environmental degradation issues that are facing the world today.
The publication of the hawkmoth species occurrence records in GBIF has enhanced data visibility to a wider audience, promoting availability for use. The hawkmoth (Lepidoptera: Sphingidae) collection at the National Museums of Kenya was digitised from 2017 to 2020, and this paper presents details of species occurrence records as held in the insect collection at the NMK, Nairobi, Kenya. The collection holds 5,095 voucher specimens consisting of 88 genera and 208 species. The collection covers the period between 1904 and 2020. The geographical distribution of the hawkmoths housed at the NMK covers East Africa at 81.41%, West Africa at 7.20%, Southern Africa at 6.89%, Central Africa at 4.02% and North Africa at 0.2%.

Species occurrence data from the Range-Wide Bull Trout eDNA Project

Forest Service Research Data Archive ◽

10.2737/rds-2017-0038 ◽

2017 ◽

Cited By ~ 1

Author(s):

Michael K. Young ◽

Daniel J. Isaak ◽

Kevin S. McKelvey ◽

Michael K. Schwartz ◽

Kellie J. Carim ◽

...

Keyword(s):

Bull Trout ◽

Species Occurrence ◽

Occurrence Data

DIscBIO: A User-Friendly Pipeline for Biomarker Discovery in Single-Cell Transcriptomics

International Journal of Molecular Sciences ◽

10.3390/ijms22031399 ◽

2021 ◽

Vol 22 (3) ◽

pp. 1399

Author(s):

Salim Ghannoum ◽

Waldir Leoncio Netto ◽

Damiano Fantini ◽

Benjamin Ragan-Kelley ◽

Amirabbas Parizadeh ◽

...

Keyword(s):

Single Cell ◽

Biomarker Discovery ◽

Enrichment Analysis ◽

Myxoid Liposarcoma ◽

R Package ◽

Differential Analysis ◽

A Cell ◽

Reproducible Analysis ◽

Transcriptomic Level ◽

User Friendly

The growing attention toward the benefits of single-cell RNA sequencing (scRNA-seq) is leading to a myriad of computational packages for the analysis of different aspects of scRNA-seq data. For researchers without advanced programing skills, it is very challenging to combine several packages in order to perform the desired analysis in a simple and reproducible way. Here we present DIscBIO, an open-source, multi-algorithmic pipeline for easy, efficient and reproducible analysis of cellular sub-populations at the transcriptomic level. The pipeline integrates multiple scRNA-seq packages and allows biomarker discovery with decision trees and gene enrichment analysis in a network context using single-cell sequencing read counts through clustering and differential analysis. DIscBIO is freely available as an R package. It can be run either in command-line mode or through a user-friendly computational pipeline using Jupyter notebooks. We showcase all pipeline features using two scRNA-seq datasets. The first dataset consists of circulating tumor cells from patients with breast cancer. The second one is a cell cycle regulation dataset in myxoid liposarcoma. All analyses are available as notebooks that integrate R code with explanatory text, output data and images in a sequential narrative. R users can use the notebooks to understand the different steps of the pipeline and as a guide for exploring their own scRNA-seq data. We also provide a cloud version using Binder that allows the execution of the pipeline without the need to download R, Jupyter or any of the packages used by the pipeline. The cloud version can serve as a tutorial for training purposes, especially for those who are not R users or have limited programing skills. However, in order to do meaningful scRNA-seq analyses, all users will need to understand the implemented methods and their possible options and limitations.

BREC: an R package/Shiny app for automatically identifying heterochromatin boundaries and estimating local recombination rates along chromosomes

BMC Bioinformatics ◽

10.1186/s12859-021-04233-1 ◽

2021 ◽

Vol 22 (S6) ◽

Author(s):

Yasmine Mansour ◽

Annie Chateau ◽

Anna-Sophie Fiston-Lavier

Keyword(s):

Data Quality ◽

Data Science ◽

Fruit Fly ◽

R Package ◽

Model Organisms ◽

Data Quality Control ◽

Recombination Rates ◽

Functional Dynamics ◽

Shiny App ◽

User Friendly

Abstract. Background: Meiotic recombination is a vital biological process playing an essential role in genome's structural and functional dynamics. Genomes exhibit highly various recombination profiles along chromosomes associated with several chromatin states. However, eu-heterochromatin boundaries are not available nor easily provided for non-model organisms, especially for newly sequenced ones. Hence, we miss accurate local recombination rates necessary to address evolutionary questions. Results: Here, we propose an automated computational tool, based on the Marey maps method, allowing to identify heterochromatin boundaries along chromosomes and estimating local recombination rates. Our method, called BREC (heterochromatin Boundaries and RECombination rate estimates) is non-genome-specific, running even on non-model genomes as long as genetic and physical maps are available. BREC is based on pure statistics and is data-driven, implying that good input data quality remains a strong requirement. Therefore, a data pre-processing module (data quality control and cleaning) is provided. Experiments show that BREC handles different markers' density and distribution issues. Conclusions: BREC's heterochromatin boundaries have been validated with cytological equivalents experimentally generated on the fruit fly Drosophila melanogaster genome, for which BREC returns congruent corresponding values. Also, BREC's recombination rates have been compared with previously reported estimates. Based on the promising results, we believe our tool has the potential to help bring data science into the service of genome biology and evolution. We introduce BREC within an R-package and a Shiny web-based user-friendly application yielding a fast, easy-to-use, and broadly accessible resource. The BREC R-package is available at the GitHub repository https://github.com/GenomeStructureOrganization.

Application Natura 2000 Data For The Invasive Plants Spread Prediction

Scientia Agriculturae Bohemica ◽

10.1515/sab-2015-0031 ◽

2015 ◽

Vol 46 (4) ◽

pp. 159-166 ◽

Cited By ~ 1

Author(s):

J. Pěknicová ◽

D. Petrus ◽

K. Berchová-Bímová

Keyword(s):

Environmental Factors ◽

Invasive Plants ◽

Species Distribution ◽

Natura 2000 ◽

Distribution Model ◽

Distribution Data ◽

Habitat Types ◽

Distribution Models ◽

Occurrence Data ◽

Heracleum Mantegazzianum

Abstract: The distribution of invasive plants depends on several environmental factors, e.g. on the distance from the vector of spreading, invaded community composition, land-use, etc. The species distribution models, a research tool for invasive plants spread prediction, involve the combination of environmental factors, occurrence data, and statistical approach. For the construction of the presented distribution model, the occurrence data on invasive plants (Solidago sp., Fallopia sp., Robinia pseudoacacia, and Heracleum mantegazzianum) and Natura 2000 habitat types from the Protected Landscape Area Kokořínsko have been intersected in ArcGIS and statistically analyzed. The data analysis was focused on (1) verification of the accuracy of the Natura 2000 habitat map layer, and the accordance with the habitats occupied by invasive species and (2) identification of a suitable scale of intersection between the habitat and species distribution. Data suitability was evaluated for the construction of the model on local scale. Based on the data, the invaded habitat types were described and the optimal scale grid was evaluated. The results show the suitability of Natura 2000 habitat types for modelling; however, more input data (e.g. on soil types, elevation) are needed.
