speciesgeocodeR: An R package for linking species occurrences, user-defined regions and phylogenetic trees for biogeography, ecology and evolution

Mapping Intimacies ◽

10.1101/032755 ◽

2015 ◽

Cited By ~ 6

Author(s):

Alexander Zizka ◽

Alexandre Antonelli

Keyword(s):

Data Quality ◽

Phylogenetic Trees ◽

Large Scale ◽

Data Cleaning ◽

R Package ◽

Species Occurrence ◽

Occurrence Data ◽

User Friendly ◽

Species Occurrences

1. Large-scale species occurrence data from geo-referenced observations and collected specimens are crucial for analyses in ecology, evolution and biogeography. Despite the rapidly growing availability of such data, their use in evolutionary analyses is often hampered by tedious manual classification of point occurrences into operational areas, leading to a lack of reproducibility and concerns regarding data quality. 2. Here we present speciesgeocodeR, a user-friendly R-package for data cleaning, data exploration and data visualization of species point occurrences using discrete operational areas, and for linking them to analyses invoking phylogenetic trees. 3. The three core functions of the package are (i) automated and reproducible data cleaning, (ii) rapid and reproducible classification of point occurrences into discrete operational areas in an adequate format for subsequent biogeographic analyses, and (iii) a comprehensive summary and visualization of species distributions to explore large datasets and ensure data quality. In addition, speciesgeocodeR facilitates access to and analysis of publicly available species occurrence data, widely used operational areas and elevation ranges. Other functionalities include the implementation of minimum occurrence thresholds and the visualization of coexistence patterns and range sizes. SpeciesgeocodeR is accompanied by a richly illustrated and easy-to-follow tutorial and help functions.
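The classification step described in this abstract, assigning each point occurrence to a discrete operational area, can be illustrated with a small sketch. This is not speciesgeocodeR's implementation (the package is written in R and its internals are not described here); it is a minimal Python example that assumes areas are simple polygons given as lists of (longitude, latitude) vertices and uses a standard ray-casting test. The function names, the `not_classified` label and the toy data are all hypothetical.

```python
from collections import defaultdict


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: cast a ray east from the point and count edge crossings."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def classify_occurrences(occurrences, areas):
    """Return {species: {area_name: count}}; unmatched points go to 'not_classified'."""
    table = defaultdict(lambda: defaultdict(int))
    for species, lon, lat in occurrences:
        label = "not_classified"  # hypothetical label for points outside every area
        for name, polygon in areas.items():
            if point_in_polygon(lon, lat, polygon):
                label = name
                break
        table[species][label] += 1
    return {sp: dict(counts) for sp, counts in table.items()}


# Toy operational areas (two adjacent squares) and toy occurrence records.
areas = {
    "west": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "east": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
occurrences = [
    ("Species a", 2.5, 3.0),
    ("Species a", 12.0, 4.0),
    ("Species b", 15.0, 8.0),
    ("Species b", 30.0, 5.0),
]
result = classify_occurrences(occurrences, areas)
```

A species-by-area count table of this kind is the sort of output that could then feed a minimum occurrence threshold or a biogeographic analysis; real data would additionally need handling of points lying exactly on area boundaries.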


sampbias, a method for quantifying geographic sampling biases in species distribution data

10.1101/2020.01.13.903757 ◽

2020 ◽

Cited By ~ 2

Author(s):

Alexander Zizka ◽

Alexandre Antonelli ◽

Daniele Silvestro

Keyword(s):

R Package ◽

Distribution Data ◽

Species Occurrence ◽

Biodiversity Research ◽

Occurrence Data ◽

Standard Tool ◽

Moderate Effect ◽

Taxonomic Groups ◽

User Friendly ◽

Species Occurrences

Geo-referenced species occurrences from public databases have become essential to biodiversity research and conservation. However, geographical biases are widely recognized as a factor limiting the usefulness of such data for understanding species diversity and distribution. In particular, differences in sampling intensity across a landscape due to differences in human accessibility are ubiquitous but may differ in strength among taxonomic groups and datasets. Although several factors have been described to influence human access (such as presence of roads, rivers, airports and cities), quantifying their specific and combined effects on recorded occurrence data remains challenging. Here we present sampbias, an algorithm and software for quantifying the effect of accessibility biases in species occurrence datasets. Sampbias uses a Bayesian approach to estimate how sampling rates vary as a function of proximity to one or multiple bias factors. The results are comparable among bias factors and datasets. We demonstrate the use of sampbias on a dataset of mammal occurrences from the island of Borneo, showing a high biasing effect of cities and a moderate effect of roads and airports. Sampbias is implemented as a well-documented, open-access and user-friendly R package that we hope will become a standard tool for anyone working with species occurrences in ecology, evolution, conservation and related fields.


BREC: an R package/Shiny app for automatically identifying heterochromatin boundaries and estimating local recombination rates along chromosomes

BMC Bioinformatics ◽

10.1186/s12859-021-04233-1 ◽

2021 ◽

Vol 22 (S6) ◽

Author(s):

Yasmine Mansour ◽

Annie Chateau ◽

Anna-Sophie Fiston-Lavier

Keyword(s):

Data Quality ◽

Data Science ◽

Fruit Fly ◽

R Package ◽

Model Organisms ◽

Data Quality Control ◽

Recombination Rates ◽

Functional Dynamics ◽

Shiny App ◽

User Friendly

Background: Meiotic recombination is a vital biological process that plays an essential role in a genome's structural and functional dynamics. Genomes exhibit highly variable recombination profiles along chromosomes, associated with several chromatin states. However, eu-heterochromatin boundaries are neither available nor easily obtained for non-model organisms, especially newly sequenced ones. Hence, we lack the accurate local recombination rates needed to address evolutionary questions. Results: Here, we propose an automated computational tool, based on the Marey maps method, that identifies heterochromatin boundaries along chromosomes and estimates local recombination rates. Our method, called BREC (heterochromatin Boundaries and RECombination rate estimates), is non-genome-specific, running even on non-model genomes as long as genetic and physical maps are available. BREC is based on pure statistics and is data-driven, implying that good input data quality remains a strong requirement. Therefore, a data pre-processing module (data quality control and cleaning) is provided. Experiments show that BREC handles different marker density and distribution issues. Conclusions: BREC's heterochromatin boundaries have been validated with cytological equivalents experimentally generated on the fruit fly Drosophila melanogaster genome, for which BREC returns congruent corresponding values. Also, BREC's recombination rates have been compared with previously reported estimates. Based on the promising results, we believe our tool has the potential to help bring data science into the service of genome biology and evolution. We introduce BREC within an R-package and a Shiny web-based user-friendly application yielding a fast, easy-to-use, and broadly accessible resource. The BREC R-package is available at the GitHub repository https://github.com/GenomeStructureOrganization.
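A Marey map plots each marker's genetic position (cM) against its physical position (Mb), and the local recombination rate is the slope of that curve. The abstract does not spell out BREC's estimator, so the sketch below is only an illustration of the general idea using plain finite differences between consecutive markers; the function name and the toy marker values are hypothetical, and a real tool would typically smooth or fit the curve rather than difference raw points.

```python
def local_recombination_rates(markers):
    """Estimate local recombination rates from a Marey map.

    markers: list of (physical_position_mb, genetic_position_cm) pairs.
    Returns a list of (interval_midpoint_mb, rate_cm_per_mb) pairs, one per
    interval between consecutive markers.
    """
    ordered = sorted(markers)  # order markers by physical position
    rates = []
    for (p1, g1), (p2, g2) in zip(ordered, ordered[1:]):
        if p2 == p1:
            continue  # skip markers at the same physical position
        midpoint = (p1 + p2) / 2
        slope = (g2 - g1) / (p2 - p1)  # cM per Mb over this interval
        rates.append((midpoint, slope))
    return rates


# Toy Marey map: the genetic map is flat between 4 Mb and 8 Mb.
markers = [(0.0, 0.0), (2.0, 6.0), (4.0, 10.0), (8.0, 10.0), (10.0, 16.0)]
rates = local_recombination_rates(markers)
```

In this toy map the interval centred at 6 Mb has a rate of zero; stretches of very low slope like this are the kind of signal a boundary-detection step could look for, although how BREC locates boundaries is not stated in the abstract.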


Introducing bdclean: a user friendly biodiversity data cleaning pipeline

Biodiversity Information Science and Standards ◽

10.3897/biss.2.25564 ◽

2018 ◽

Vol 2 ◽

pp. e25564

Author(s):

Tomer Gueta ◽

Vijay Barve ◽

Thiloshon Nagarajah ◽

Ashwin Agrawal ◽

Yohay Carmel

Keyword(s):

Data Cleaning ◽

Control Process ◽

R Package ◽

Modular Approach ◽

Data Validation ◽

Biodiversity Data ◽

Quality Control Process ◽

Cleaning Procedures ◽

R Packages ◽

User Friendly

A new R package for biodiversity data cleaning, 'bdclean', was initiated in the Google Summer of Code (GSoC) 2017 and is available on GitHub. Several R packages have great data validation and cleaning functions, but 'bdclean' provides features to manage a complete pipeline for biodiversity data cleaning, from data quality explorations to cleaning procedures and reporting. Users are able to go through the quality control process in a very structured, intuitive, and effective way. A modular approach to data cleaning functionality should make this package extensible for many biodiversity data cleaning needs. Under GSoC 2018, 'bdclean' will go through a comprehensive upgrade. New features will be highlighted in the demonstration.


BREC: An R package/Shiny app for automatically identifying heterochromatin boundaries and estimating local recombination rates along chromosomes

10.1101/2020.06.29.178095 ◽

2020 ◽

Author(s):

Yasmine Mansour ◽

Annie Chateau ◽

Anna-Sophie Fiston-Lavier

Keyword(s):

Data Quality ◽

Data Science ◽

Fruit Fly ◽

R Package ◽

Model Organisms ◽

Data Quality Control ◽

Recombination Rates ◽

Functional Dynamics ◽

Shiny App ◽

User Friendly

Motivation: Meiotic recombination is a vital biological process that plays an essential role in genomes' structural and functional dynamics. Genomes exhibit highly variable recombination profiles along chromosomes, associated with several chromatin states. However, eu-heterochromatin boundaries are neither available nor easily obtained for non-model organisms, especially newly sequenced ones. Hence, we lack the accurate local recombination rates needed to address evolutionary questions. Results: Here, we propose an automated computational tool, based on the Marey maps method, that identifies heterochromatin boundaries along chromosomes and estimates local recombination rates. Our method, called BREC (heterochromatin Boundaries and RECombination rate estimates), is non-genome-specific, running even on non-model genomes as long as genetic and physical maps are available. BREC is based on pure statistics and is data-driven, implying that good input data quality remains a strong requirement. Therefore, a data pre-processing module (data quality control and cleaning) is provided. Experiments show that BREC handles different marker density and distribution issues. BREC's heterochromatin boundaries have been validated with cytological equivalents experimentally generated on the fruit fly Drosophila melanogaster genome, for which BREC returns congruent corresponding values. Also, BREC's recombination rates have been compared with previously reported estimates. Based on the promising results, we believe our tool has the potential to help bring data science into the service of genome biology and evolution. We introduce BREC within an R-package and a Shiny web-based user-friendly application yielding a fast, easy-to-use, and broadly accessible resource. Availability: The BREC R-package is available at the GitHub repository https://github.com/ymansour21/BREC.


occAssess: An R package for assessing potential biases in species occurrence data

10.1101/2021.04.19.440441 ◽

2021 ◽

Author(s):

Robin James Boyd ◽

Gary Powney ◽

Claire Carvell ◽

Oliver Pescott

Keyword(s):

Target Population ◽

R Package ◽

Heterogeneous Databases ◽

Ecological Research ◽

Species Occurrence ◽

Worked Example ◽

Occurrence Data ◽

Data Coverage ◽

Discrete Functions ◽

Occurrence Records

Species occurrence records from a variety of sources are increasingly aggregated into heterogeneous databases and made available to ecologists for immediate analytical use. However, these data are typically biased, i.e. they are not a representative sample of the target population of interest, meaning that the information they provide may not be an accurate reflection of reality. It is therefore crucial that species occurrence data are properly scrutinised before they are used for research. In this article, we introduce occAssess, an R package that enables quick and easy screening of species occurrence data for potential biases. The package contains a number of discrete functions, each of which returns a measure of the potential for bias in one or more of the taxonomic, temporal, spatial and environmental dimensions. The outputs are provided visually (as ggplot2 objects) and do not include a formal recommendation as to whether data are of sufficient quality for any given inferential use. Instead, they should be used as ancillary information and viewed in the context of the question that is being asked, and the methods that are being used to answer it. We demonstrate the utility of occAssess by applying it to data on two key pollinator taxa in South America: leaf-nosed bats (Phyllostomidae) and hoverflies (Syrphidae). In this worked example, we briefly assess the degree to which various aspects of data coverage appear to have changed over time. We then discuss additional ways in which the package could be used, highlight its limitations, and point to where it could be improved in the future. Going forward, we hope that occAssess will help to improve the quality and transparency of assessments of species occurrence data as a necessary first step where they are being used for ecological research at large scales.


Mercator: a pipeline for multi-method, unsupervised visualization and distance generation

Bioinformatics ◽

10.1093/bioinformatics/btab037 ◽

2021 ◽

Author(s):

Zachary B Abrams ◽

Caitlin E Coombes ◽

Suli Li ◽

Kevin R Coombes

Keyword(s):

Large Scale ◽

R Package ◽

High Dimensional ◽

Vast Number ◽

Large Scale Data ◽

User Friendly ◽

Exploratory Pattern ◽

Scale Data ◽

Selection Of ◽

Publication Quality

Summary: Unsupervised machine learning provides tools for researchers to uncover latent patterns in large-scale data, based on calculated distances between observations. Methods to visualize high-dimensional data based on these distances can elucidate subtypes and interactions within multi-dimensional and high-throughput data. However, researchers can select from a vast number of distance metrics and visualizations, each with their own strengths and weaknesses. The Mercator R package facilitates selection of a biologically meaningful distance from 10 metrics, together appropriate for binary, categorical and continuous data, and visualization with 5 standard and high-dimensional graphics tools. Mercator provides a user-friendly pipeline for informaticians or biologists to perform unsupervised analyses, from exploratory pattern recognition to production of publication-quality graphics. Availability and implementation: Mercator is freely available at the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html).
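To make concrete why the choice of distance depends on data type, the sketch below computes a pairwise distance matrix with two textbook metrics: Jaccard distance for binary feature vectors and Euclidean distance for continuous ones. The abstract does not list Mercator's ten metrics, so treating these two as representative is an assumption, and the Python function names and toy data are hypothetical rather than part of the Mercator R package.

```python
import math


def jaccard_distance(a, b):
    """Jaccard distance for binary vectors: 1 - |shared 1s| / |positions with any 1|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def euclidean_distance(a, b):
    """Euclidean distance for continuous vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(rows, metric):
    """Full pairwise distance matrix for a list of observations."""
    return [[metric(r, s) for s in rows] for r in rows]


# Toy binary observations (e.g. presence/absence of four features).
binary_rows = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]]
jaccard_matrix = distance_matrix(binary_rows, jaccard_distance)

# Toy continuous observations.
continuous_rows = [[0.0, 0.0], [3.0, 4.0]]
euclidean_matrix = distance_matrix(continuous_rows, euclidean_distance)
```

Jaccard ignores positions where both observations are 0, which is often desirable for sparse presence/absence data, whereas Euclidean distance treats every coordinate equally; a distance matrix from either metric can be passed to the same downstream clustering or visualization step.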


Presence-only and Presence-absence Data for Comparing Species Distribution Modeling Methods

Biodiversity Informatics ◽

10.17161/bi.v15i2.13384 ◽

2020 ◽

Vol 15 (2) ◽

pp. 69-80

Author(s):

Jane Elith ◽

Catherine Graham ◽

Roozbeh Valavi ◽

Meinrad Abegg ◽

Caroline Bruce ◽

...

Keyword(s):

Species Distribution ◽

Large Scale ◽

R Package ◽

Open Science ◽

Species Occurrence ◽

Distribution Models ◽

Algorithm Performance ◽

Modeling Methods ◽

Science Framework ◽

Occurrence Records

Species distribution models (SDMs) are widely used to predict and study distributions of species. Many different modeling methods and associated algorithms are used and continue to emerge. It is important to understand how different approaches perform, particularly when applied to species occurrence records that were not gathered in structured surveys (e.g. opportunistic records). This need motivated a large-scale, collaborative effort, published in 2006, that aimed to create objective comparisons of algorithm performance. As a benchmark, and to facilitate future comparisons of approaches, here we publish that dataset: point location records for 226 anonymized species from six regions of the world, with accompanying predictor variables in raster (grid) and point formats. A particularly interesting characteristic of this dataset is that independent presence-absence survey data are available for evaluation alongside the presence-only species occurrence data intended for modeling. The dataset is available on Open Science Framework and as an R package and can be used as a benchmark for modeling approaches and for testing new ways to evaluate the accuracy of SDMs.


metabaR: an R package for the evaluation and improvement of DNA metabarcoding data quality

10.1101/2020.08.28.271817 ◽

2020 ◽

Author(s):

Lucie Zinger ◽

Clément Lionnet ◽

Anne-Sophie Benoiston ◽

Julian Donald ◽

Céline Mercier ◽

...

Keyword(s):

Data Quality ◽

Large Scale ◽

Environmental Gradients ◽

R Package ◽

Data Curation ◽

Environmental Research ◽

List Type ◽

Sequencing Platform ◽

Dna Metabarcoding ◽

Ecological Patterns

DNA metabarcoding is becoming the tool of choice for biodiversity studies across taxa and large-scale environmental gradients. Yet, the artefacts present in metabarcoding datasets often preclude a proper interpretation of ecological patterns. Bioinformatic pipelines removing experimental noise have been designed to address this issue. However, these often only partially target produced artefacts, or are marker specific. In addition, assessments of data curation quality and the appropriateness of filtering thresholds are seldom available in existing pipelines, partly due to the lack of appropriate visualisation tools. Here, we present metabaR, an R package that provides a comprehensive suite of tools to effectively curate DNA metabarcoding data after basic bioinformatic analyses. In particular, metabaR uses experimental negative or positive controls to identify different types of artefactual sequences, i.e. reagent contaminants and tag-jumps. It also flags potentially dysfunctional PCRs based on PCR replicate similarities when those are available. Finally, metabaR provides tools to visualise DNA metabarcoding data characteristics in their experimental context as well as their distribution, and to facilitate assessment of the appropriateness of data curation filtering thresholds. metabaR is applicable to any DNA metabarcoding experimental design but is most powerful when the design includes experimental controls and replicates. More generally, the simplicity and flexibility of the package make it applicable to any DNA marker and to data generated with any sequencing platform and pre-analysed with any bioinformatic pipeline. Its outputs are easily usable for downstream analyses with any ecological R package. metabaR complements existing bioinformatics pipelines by providing scientists with a variety of functions with customisable methods that will allow the user to effectively clean DNA metabarcoding data and avoid serious misinterpretations. It thus offers a promising platform for automatised data quality assessments of DNA metabarcoding data for environmental research and biomonitoring.


Mapping the Impact of Digitisation for Poorly Documented Countries: Mozambique as a case study

Biodiversity Information Science and Standards ◽

10.3897/biss.3.37025 ◽

2019 ◽

Vol 3 ◽

Author(s):

Isabel Neves ◽

Maria da Luz Mathias ◽

Cristiane Bastos-Silveira

Keyword(s):

Data Cleaning ◽

Data Sources ◽

Data Driven ◽

Biodiversity Data ◽

Species Occurrence ◽

Terrestrial Mammals ◽

Management Actions ◽

Occurrence Data ◽

The Impact ◽

Conservation And Management

Despite the rise in the global availability of biodiversity data through digitisation, essential regions of the world remain poorly documented (Peterson et al. 2015). Research-neglected regions that lack quality information are mainly the species-rich and developing nations (Gaikwad and Chavan 2006). Mozambique is an African country without wide-ranging knowledge of its fauna's diversity and distribution (Neves et al. 2018). Undeniably, this country's knowledge gaps constitute a significant impediment to the improvement of effective conservation measures. Primary species occurrence data across dispersed data sources can be a cost-effective resource for boosting knowledge about a country's biodiversity. Aiming to aggregate a comprehensive dataset of Mozambique's terrestrial mammals, we compiled primary species occurrence data from dispersed data sources. The produced dataset not only gathered digitalised accessible knowledge (DAK) from the Global Biodiversity Information Facility (GBIF) and natural history collections, but also retrieved and digitalised species occurrence data enclosed in grey and scientific literature. Particularly for poorly documented countries, filling data gaps is crucial for new and broad insights for biodiversity research and preservation. Thus, quantification of the effects of data digitisation and mobilisation goes beyond the specific goals of organisations, institutions or data-sharing resources. The impact of data digitisation should be disseminated, not only by the number of publications and times data are accessed (Nelson and Ellis 2018), but also by the actual achievements in regions covered by DAK. To highlight the impact of further data digitisation in a poorly documented country, we examine the effective gain of further digitisation and data cleaning on the terrestrial mammals from Mozambique. We demonstrate the increase in the overall knowledge, not merely in terms of number of species, number of records, and the country's coverage, but from the production of outputs with potential value for data-driven conservation research and planning. More than 17000 records were compiled. The digitisation of data in literature as well as data cleaning and quality improvements resulted in a substantial increase in the amount of DAK, which acknowledges Mozambique's high species diversity (Fig. 1). The digitisation and data mobilisation hereby described allowed for the update of the country's terrestrial mammals checklist (Neves et al. 2018). The final dataset also expands the knowledge of the most poorly documented provinces, allowing generation of a data-driven proposal of priority areas to survey (in review). Also, an assessment of Mozambique's conservation network effectiveness for mammal protection was performed, and additional relevant areas were suggested (in prep.). The dataset compiled is an important "stepping stone" towards an enhanced knowledge of Mozambique's fauna. Biodiversity conservation and management are challenging in developing countries rich in natural resources, which often must deal with a lack of internal capacity for applied research and conservation actions. Considering that digitisation and mobilisation of biodiversity data are resourceful processes for improving knowledge, collaborative work between institutions of those countries and international data-provider communities could, in the short term, successfully improve the information baseline to support decision-making in future conservation and management actions.


Research applications of primary biodiversity databases in the digital age

10.1101/605071 ◽

2019 ◽

Author(s):

Joan E. Ball-Damerow ◽

Laura Brenskelle ◽

Narayani Barve ◽

Pamela S. Soltis ◽

Petra Sierwald ◽

...

Keyword(s):

Data Quality ◽

Habitat Degradation ◽

Data Types ◽

Species Occurrence ◽

Online Systems ◽

Occurrence Data ◽

Data Compilation ◽

Taxonomic Groups ◽

Extinction Events ◽

Biodiversity Databases

We are in the midst of unprecedented change: climate shifts and sustained, widespread habitat degradation have led to dramatic declines in biodiversity rivaling historical extinction events. At the same time, new approaches to publishing and integrating previously disconnected data resources promise to help provide the evidence needed for more efficient and effective conservation and management. Stakeholders have invested considerable resources to contribute to online databases of species occurrences and genetic barcodes. However, estimates suggest that only 10% of biocollections are available in digital form. The biocollections community must therefore continue to promote digitization efforts, which in part requires demonstrating compelling applications of the data. Our overarching goal is therefore to determine trends in use of mobilized species occurrence data since 2010, as online systems have grown and now provide over one billion records. To do this, we characterized 501 papers that use openly accessible biodiversity databases. Our standardized tagging protocol was based on key topics of interest, including: database(s) used, taxa addressed, general uses of data, other data types linked to species occurrence data, and data quality issues addressed. We found that the most common uses of online biodiversity databases have been to estimate species distribution and richness, to outline data compilation and publication, and to assist in developing species checklists or describing new species. Only 69% of papers in our dataset addressed one or more aspects of data quality, which is low considering common errors and biases known to exist in opportunistic datasets. Globally, we find that biodiversity databases are still in the initial stages of data compilation. Novel and integrative applications are restricted to certain taxonomic groups and regions with higher numbers of quality records. 
Continued data digitization, publication, enhancement, and quality control efforts are necessary to make biodiversity science more efficient and relevant in our fast-changing world.
