Reviewing taxonomic bias in a megadiverse country: primary biodiversity data, cultural salience, and scientific interest of South African animals

2022 ◽  
pp. 1-11
Author(s):  
Fortunate M. Phaka ◽  
Maarten P.M. Vanhove ◽  
Louis H. du Preez ◽  
Jean Hugé

Taxonomic bias, resulting in some taxa receiving more attention than others, has been shown to persist throughout history. Such bias in primary biodiversity data needs to be addressed because the data are vital to environmental management. This study reviews taxonomic bias in South African primary biodiversity data obtained from the Global Biodiversity Information Facility (GBIF). The focus was specifically on animal classes, and regression analysis was used to assess the influence of scientific interest and cultural salience on taxonomic bias. A higher resolution analysis of the two explanatory variables’ influence on taxonomic bias is conducted using a generalised linear model on a subset of herpetofaunal families from the focal classes. Furthermore, the potential effects of cultural salience and scientific interest on a taxon’s extinction risk are investigated. The findings show that taxonomic bias in South Africa’s primary biodiversity data has similarities with global scale taxonomic bias. Among animal classes, there is strong bias towards birds while classes such as Polychaeta and Maxillopoda are under-represented. Cultural salience has a stronger influence on taxonomic bias than scientific interest. It is, however, unclear how these explanatory variables may influence the extinction risk of taxa. We recommend that taxonomic bias can be reduced if primary biodiversity data collection has a range of targets that guide (but do not limit) accumulation of species occurrence records per habitat. Within this range, a lower target of species occurrence records accommodates species that are difficult to detect. The upper target means occurrence records for any species are less urgent but nonetheless useful and thus data collection efforts can focus on species with fewer occurrence records.

Author(s):  
Barnaby Walker ◽  
Tarciso Leão ◽  
Steven Bachman ◽  
Eve Lucas ◽  
Eimear Nic Lughadha

Extinction risk assessments are increasingly important to many stakeholders (Bennun et al. 2017) but there remain large gaps in our knowledge about the status of many species. The IUCN Red List of Threatened Species (IUCN 2019, hereafter Red List) is the most comprehensive assessment of extinction risk. However, it includes assessments of just 7% of all vascular plants, while 18% of all assessed animals lack sufficient data to assign a conservation status. The wide availability of species occurrence information through digitised natural history collections and aggregators such as the Global Biodiversity Information Facility (GBIF), coupled with machine learning methods, provides an opportunity to fill these gaps in our knowledge. Machine learning approaches have already been proposed to guide conservation assessment efforts (Nic Lughadha et al. 2018), assign a conservation status to species with insufficient data for a full assessment (Bland et al. 2014), and predict the number of threatened species across the world (Pelletier et al. 2018). The wide range in sources of species occurrence records can lead to data quality issues, such as missing, imprecise, or mistaken information. These data quality issues may be compounded in databases that aggregate information from multiple sources: many such records derive from field observations (78% for plant species in GBIF; Meyer et al. 2016) largely unsupported by voucher specimens that would allow confirmation or correction of their identification. Even where voucher specimens do exist, different taxonomic or geographic information can be held for a single collection event represented by duplicate specimens deposited in different natural history collections. Tools are available to help clean species occurrence data, but these cannot deal with problems like specimen misidentification, which previous work (Nic Lughadha et al. 2019) has shown to have a large impact on preliminary assessments of conservation status. 
Machine learning models based on species occurrence records have been reported to predict with high accuracy the conservation status of species. However, given the black-box nature of some of the better machine learning models, it is unclear how well these accuracies apply beyond the data on which the models were trained. Practices for training machine learning models differ between studies, but more interrogation of these models is required if we are to know how much to trust their predictions. To address these problems, we compare predictions made by a machine learning model when trained on specimen occurrence records that have benefitted from minimal or more thorough cleaning, with those based on records from an expert-curated database. We then explore different techniques to interrogate machine learning models and quantify the uncertainty in their predictions.


Author(s):  
Christine Driller ◽  
Markus Koch ◽  
Giuseppe Abrami ◽  
Wahed Hemati ◽  
Andy Lücking ◽  
...  

The storage of data in public repositories such as the Global Biodiversity Information Facility (GBIF) or the National Center for Biotechnology Information (NCBI) is nowadays stipulated in the policies of many publishers in order to facilitate data replication or proliferation. Species occurrence records contained in legacy printed literature are no exception to this. The extent of their digital and machine-readable availability, however, is still far from matching the existing data volume (Thessen and Parr 2014). Yet precisely these data are becoming more and more relevant to the investigation of the ongoing loss of biodiversity. In order to extract species occurrence records at a larger scale from available publications, one has to apply specialised text mining tools. However, such tools are in short supply, especially for scientific literature in the German language. The Specialised Information Service Biodiversity Research, BIOfid (Koch et al. 2017), aims to address this shortage, inter alia, by preparing a searchable text corpus semantically enriched by a new kind of multi-label annotation. For this purpose, we feed manual annotations into automatic, machine-learning annotators. This mixture of automatic and manual methods is needed because BIOfid approaches a new application area with respect to language (mainly German of the 19th century), text type (biological reports), and linguistic focus (technical and everyday language). We will present current results on the performance of BIOfid’s semantic search engine and the application of independent natural language processing (NLP) tools. Most of these are freely available online, such as TextImager (Hemati et al. 2016). We will show how TextImager is tied into the BIOfid pipeline and how it is made scalable (e.g. extendible by further modules) and usable on different systems (docker containers). 
Further, we will provide a short introduction to generating machine-learning training data using TextAnnotator (Abrami et al. 2019) for multi-label annotation. Annotation reproducibility can be assessed by the implementation of inter-annotator agreement methods (Abrami et al. 2020). Beyond taxon recognition and entity linking, we place particular emphasis on location and time information. For this purpose, our annotation tag-set combines general categories and biology-specific categories (including taxonomic names) with location and time ontologies. The application of the annotation categories is regimented by annotation guidelines (Lücking et al. 2020). Within the next years, our work deliverable will be a semantically accessible and data-extractable text corpus of around two million pages. In this way, BIOfid is creating a new valuable resource that expands our knowledge of biodiversity and its determinants.


2018 ◽  
Vol 2 ◽  
pp. e25488
Author(s):  
Anne-Sophie Archambeau ◽  
Fabien Cavière ◽  
Kourouma Koura ◽  
Marie-Elise Lecoq ◽  
Sophie Pamerlon ◽  
...  

The Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. It has developed an open and free platform for sharing and exploring biodiversity data. All the modules are publicly available for reuse and customization on its GitHub account (https://github.com/AtlasOfLivingAustralia). GBIF Benin, hosted at the University of Abomey-Calavi, has published more than 338 000 occurrence records from 87 datasets and 2 checklists. Through the GBIF Capacity Enhancement Support Programme (https://www.gbif.org/programme/82219/capacity-enhancement-support-programme), GBIF Benin, with the help of GBIF France, is in the process of deploying the Beninese data portal using the GBIF France back-end architecture. Benin is the first African country to implement this module of the ALA infrastructure. In this presentation, we will give an overview of the registry and the occurrence search engine using the Beninese data portal. We will begin with the administration interface and how to manage metadata, then continue with the user interface of the registry and how to find Beninese occurrences through the hub.


2016 ◽  
Vol 12 (3) ◽  
pp. 20150824 ◽  
Author(s):  
Elizabeth H. Boakes ◽  
Richard A. Fuller ◽  
Philip J. K. McGowan ◽  
Georgina M. Mace

Identifying local extinctions is integral to estimating species richness and geographic range changes and informing extinction risk assessments. However, the species occurrence records underpinning these estimates are frequently compromised by a lack of recorded species absences making it impossible to distinguish between local extinction and lack of survey effort—for a rigorously compiled database of European and Asian Galliformes, approximately 40% of half-degree cells contain records from before but not after 1980. We investigate the distribution of these cells, finding differences between the Palaearctic (forests, low mean human influence index (HII), outside protected areas (PAs)) and Indo-Malaya (grassland, high mean HII, outside PAs). Such cells also occur more in less peaceful countries. We show that different interpretations of these cells can lead to large over/under-estimations of species richness and extent of occurrences, potentially misleading prioritization and extinction risk assessment schemes. To avoid mistakes, local extinctions inferred from sightings records need to account for the history of survey effort in a locality.
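The cell-level comparison described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the `(lat, lon, year)` record layout, and the choice to count records from 1980 itself as "after" are all assumptions made for illustration.

```python
import math


def half_degree_cell(lat, lon, size=0.5):
    """Return the (row, col) index of the grid cell that contains a point."""
    return (math.floor(lat / size), math.floor(lon / size))


def cells_lacking_recent_records(records, cutoff_year=1980):
    """Find cells with records before the cutoff but none from the cutoff on.

    records: iterable of (lat, lon, year) tuples (hypothetical layout).
    Assumption: a record dated exactly cutoff_year counts as recent.
    """
    before, after = set(), set()
    for lat, lon, year in records:
        cell = half_degree_cell(lat, lon)
        if year < cutoff_year:
            before.add(cell)
        else:
            after.add(cell)
    # Cells seen only in the earlier period: local extinction or lack of survey?
    return before - after


records = [
    (27.3, 85.1, 1950),  # old record
    (27.4, 85.2, 1995),  # recent record in the same half-degree cell
    (30.1, 79.6, 1920),  # old record with no later record in its cell
]
print(cells_lacking_recent_records(records))
```

Such a list only flags ambiguous cells; as the abstract stresses, deciding whether a flagged cell reflects a local extinction requires information on survey effort that the occurrence records alone do not contain.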


2019 ◽  
Vol 7 ◽  
Author(s):  
Stefano Mammola ◽  
Pedro Cardoso ◽  
Dorottya Angyal ◽  
Gergely Balázs ◽  
Theo Blick ◽  
...  

Spiders (Arachnida: Araneae) are widespread in subterranean ecosystems worldwide and represent an important component of subterranean trophic webs. Yet, global-scale diversity patterns of subterranean spiders are still mostly unknown. In the frame of the CAWEB project, a European joint network of cave arachnologists, we collected data on cave-dwelling spider communities across Europe in order to explore their continental diversity patterns. Two main datasets were compiled: one listing all subterranean spider species recorded in numerous subterranean localities across Europe and another with high resolution data about the subterranean habitat in which they were collected. From these two datasets, we further generated a third dataset with individual geo-referenced occurrence records for all these species. Data from 475 geo-referenced subterranean localities (caves, mines and other artificial subterranean sites, interstitial habitats) are herein made available. For each subterranean locality, information about the composition of the spider community is provided, along with local geomorphological and habitat features. Altogether, these communities account for > 300 unique taxonomic entities and 2,091 unique geo-referenced occurrence records, that are made available via the Global Biodiversity Information Facility (GBIF) (Mammola and Cardoso 2019). This dataset is unique in that it covers both a large geographic extent (from 35° south to 67° north) and contains high-resolution local data on geomorphological and habitat features. Given that this kind of high-resolution data are rarely associated with broad-scale datasets used in macroecology, this dataset has high potential for helping researchers in tackling a range of biogeographical and macroecological questions, not necessarily uniquely related to arachnology or subterranean biology.


Author(s):  
Michael Trizna ◽  
Torsten Dikow

Taxonomic revisions contain crucial biodiversity data in the material examined sections for each species. In entomology, material examined lists minimally include the collecting locality, date of collection, and the number of specimens of each collection event. Insect species might be represented in taxonomic revisions by only a single specimen or hundreds to thousands of specimens. Furthermore, revisions of insect genera might treat small genera with few species or include tens to hundreds of species. Summarizing data from such large and complex material examined lists and revisions is cumbersome, time-consuming, and prone to errors. However, providing data on the seasonal incidence, abundance, and collecting period of species is an important way to mobilize primary biodiversity data to understand a species' occurrence or rarity. Here, we present SpOccSum (Species Occurrence Summary), a tool to easily obtain metrics of seasonal incidence from specimen occurrence data in taxonomic revisions. SpOccSum is written in Python (Python Software Foundation 2019) and accessible through the Anaconda Python/R Data Science Platform as a Jupyter Notebook (Kluyver et al. 2016). The tool takes a simple list of specimen data containing species name, locality, date of collection (preferably separated by day, month, and year), and number of specimens in CSV format and generates a series of tables and graphs summarizing: number of specimens per species, number of specimens collected per month, number of unique collection events, and the earliest and most recent collecting year of each species. The results can be exported as graphics or as CSV-formatted tables and can easily be included in manuscripts for publication. 
An example of an early version of the summary produced by SpOccSum can be viewed in Tables 1, 2 from Markee and Dikow (2018). To accommodate seasonality in the Northern and Southern Hemispheres, users can choose to start the data display with either January or July. When geographic coordinates are available and species have widespread distributions spanning, for example, the equator, the user can itemize particular regions such as North of Tropic of Cancer (23.5˚N), Tropic of Cancer to the Equator, Equator to Tropic of Capricorn, and South of Tropic of Capricorn (23.5˚S). Other features currently in development include the ability to produce distribution maps from the provided data (when geographic coordinates are included) and the option to export specimen occurrence data as a Darwin-Core Archive ready for upload to the Global Biodiversity Information Facility (GBIF).
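The kind of summary described above can be sketched in a few lines of Python using only the standard library. This is not SpOccSum's code: the column names (`species`, `locality`, `day`, `month`, `year`, `count`), the function names, and the definition of a collection event as a unique locality and date are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def summarize(csv_text):
    """Summarize specimen records per species from CSV text."""
    raw = defaultdict(lambda: {"specimens": 0, "per_month": [0] * 12,
                               "events": set(), "years": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = raw[row["species"]]
        n = int(row["count"])
        entry["specimens"] += n
        entry["per_month"][int(row["month"]) - 1] += n
        # Assumption: one collection event = one locality on one date.
        entry["events"].add((row["locality"], row["year"], row["month"], row["day"]))
        entry["years"].append(int(row["year"]))
    return {
        species: {
            "specimens": e["specimens"],
            "per_month": e["per_month"],  # index 0 = January
            "collection_events": len(e["events"]),
            "earliest_year": min(e["years"]),
            "latest_year": max(e["years"]),
        }
        for species, e in raw.items()
    }


def month_order(start_month=1):
    """Month numbers in display order, e.g. starting in July (7)."""
    return [(start_month - 1 + i) % 12 + 1 for i in range(12)]


data = """species,locality,day,month,year,count
A b,Site1,3,7,1990,2
A b,Site1,3,7,1990,1
A b,Site2,10,1,2005,4
C d,Site3,1,12,1950,1
"""
for species, stats in summarize(data).items():
    print(species, stats)
print(month_order(7))
```

The `month_order` helper mirrors the option of starting the display in January or July; reordering `per_month` for display is then a matter of indexing it with `month - 1` for each month in the returned order.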


Author(s):  
Erica Krimmel ◽  
Austin Mast ◽  
Deborah Paul ◽  
Robert Bruhn ◽  
Nelson Rios ◽  
...  

Genomic evidence suggests that the causative virus of COVID-19 (SARS-CoV-2) was introduced to humans from horseshoe bats (family Rhinolophidae) (Andersen et al. 2020) and that species in this family as well as in the closely related Hipposideridae and Rhinonycteridae families are reservoirs of several SARS-like coronaviruses (Gouilh et al. 2011). Specimens collected over the past 400 years and curated by natural history collections around the world provide an essential reference as we work to understand the distributions, life histories, and evolutionary relationships of these bats and their viruses. While the importance of biodiversity specimens to emerging infectious disease research is clear, empowering disease researchers with specimen data is a relatively new goal for the collections community (DiEuliis et al. 2016). Recognizing this, a team from Florida State University is collaborating with partners at GEOLocate, Bionomia, University of Florida, the American Museum of Natural History, and Arizona State University to produce a deduplicated, georeferenced, vetted, and versioned data product of the world's specimens of horseshoe bats and relatives for researchers studying COVID-19. The project will serve as a model for future rapid data product deployments about biodiversity specimens. The project underscores the value of biodiversity data aggregators iDigBio and the Global Biodiversity Information Facility (GBIF), which are sources for 58,617 and 79,862 records, respectively, as of July 2020, of horseshoe bat and relative specimens held by over one hundred natural history collections. Although much of the specimen-based biodiversity data served by iDigBio and GBIF is high quality, it can be considered raw data and therefore often requires additional wrangling, standardizing, and enhancement to be fit for specific applications. 
The project will create efficiencies for the coronavirus research community by producing an enhanced, research-ready data product, which will be versioned and published through Zenodo, an open-access repository (see doi.org/10.5281/zenodo.3974999). In this talk, we highlight lessons learned from the initial phases of the project, including deduplicating specimen records, standardizing country information, and enhancing taxonomic information. We also report on our progress to date, related to enhancing information about agents (e.g., collectors or determiners) associated with these specimens, and to georeferencing specimen localities. We seek also to explore how much we can use the added agent information (i.e., ORCID iDs and Wikidata Q identifiers) to inform our georeferencing efforts and to support crediting those collecting and doing identifications. The project will georeference approximately one third of our specimen records, based on those lacking geospatial coordinates but containing textual locality descriptions. We furthermore provide an overview of our holistic approach to enhancing specimen records, which we hope will maximize the value of the bat specimens at the center of what has been recently termed the "extended specimen network" (Lendemer et al. 2020). The centrality of the physical specimen in the network reinforces the importance of archived materials for reproducible research. Recognizing this, we view the collections providing data to iDigBio and GBIF as essential partners, as we expect that they will be responsible for the long-term management of enhanced data associated with the physical specimens they curate. We hope that this project can provide a model for better facilitating the reintegration of enhanced data back into local specimen data management systems.


2020 ◽  
Vol 21 (8) ◽  
Author(s):  
Iyan Robiansyah ◽  
Wita Wardani

Abstract. Robiansyah I, Wardani W. 2020. Increasing accuracy: The advantage of using open access species occurrence database in the Red List assessment. Biodiversitas 21: 3658-3664. IUCN Red List is the most widely used instrument to assess and advise the extinction risk of a species. One of the criteria used in IUCN Red List is geographical range of the species assessed (criterion B) in the form of extent of occurrence (EOO) and/or area of occupancy (AOO). While this criterion is presumed to be the easiest to be completed as it is based mainly on species occurrence data, there are some assessments that failed to maximize freely available databases. Here, we reassessed the conservation status of Cibotium arachnoideum, a tree fern distributed in Sumatra and Borneo. This species was previously assessed by Praptosuwiryo (2020, Biodiversitas 21 (4): 1379-1384) which classified the species as Endangered (EN) under criteria B2ab(i,ii,iii); C2a(ii). Using additional data from herbarium specimens recorded in the Global Biodiversity Information Facility (GBIF) website and from peer-reviewed scientific papers, in the present paper we show that C. arachnoideum has a larger extent of occurrence (EOO) and area of occupancy (AOO), more locations and different conservation status compared to those in Praptosuwiryo (2020). Our results are supported by the predicted suitable habitat map of C. arachnoideum produced by MaxEnt modelling method. Based on our assessment, we propose the category of Vulnerable (VU) C2a(i) as the global conservation status for C. arachnoideum. Our study implies the advantage of using open access databases to increase the accuracy of extinction risk assessment under the IUCN Red List criteria in regions like Indonesia, where adequate taxonomical information is not always readily available.
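To make the two range metrics concrete, the following Python sketch computes them from occurrence points. It is not the authors' workflow. It assumes the coordinates have already been projected to planar kilometres, that AOO is counted on a 2 km grid, and that EOO is the area of the minimum convex polygon around the points; all function names are hypothetical.

```python
import math


def aoo_km2(points_km, cell_km=2.0):
    """Area of occupancy: number of occupied grid cells times cell area."""
    cells = {(math.floor(x / cell_km), math.floor(y / cell_km))
             for x, y in points_km}
    return len(cells) * cell_km ** 2


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def eoo_km2(points_km):
    """Extent of occurrence: area of the convex hull (shoelace formula)."""
    hull = convex_hull(points_km)
    if len(hull) < 3:
        return 0.0
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))
    return abs(twice_area) / 2.0


# Hypothetical occurrence points in projected kilometres.
points = [(0, 0), (10, 0), (10, 10), (0, 10), (5, 5), (0.5, 0.5)]
print(aoo_km2(points), eoo_km2(points))
```

Because both metrics can only grow or stay the same as records are added, extra occurrence points from open-access databases can move a species across a criterion B threshold, which is the effect the abstract reports.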


Author(s):  
Raul Sierra-Alcocer ◽  
Christopher Stephens ◽  
Juan Barrios ◽  
Constantino González‐Salazar ◽  
Juan Carlos Salazar Carrillo ◽  
...  

SPECIES (Stephens et al. 2019) is a tool to explore spatial correlations in biodiversity occurrence databases. The main idea behind the SPECIES project is that the geographical correlations between the distributions of taxa records carry useful information. The problem, however, is that if we have thousands of species (Mexico's National System of Biodiversity Information has records of around 70,000 species), then we have millions of potential associations, and exploring them is far from easy. Our goal with SPECIES is to facilitate the discovery and application of meaningful relations hiding in our data. The main variables in SPECIES are the geographical distributions of species occurrence records. Other types of variables, like the climatic variables from WorldClim (Hijmans et al. 2005), are explanatory data that serve for modeling. The system offers two modes of analysis. In the first, the user defines a target species and a selection of species and abiotic variables; the system then computes the spatial correlations between the target species and each of the other species and abiotic variables. The request from the user can be as small as comparing one species to another, or as large as comparing one species to all the species in the database. A user may wonder, for example, which species are usual neighbors of the jaguar; this mode could help answer that question. The second mode of analysis gives a network perspective: the user defines two groups of taxa (and/or environmental variables), and the output is a correlation network in which the weight of a link between two nodes represents the spatial correlation between the variables that the nodes represent. For example, one group of taxa could be hummingbirds (Trochilidae family) and the second flowers of the Lamiaceae family. This output would help the user analyze which pairs of hummingbird and flower are highly correlated in the database. 
The SPECIES data architecture is optimized to support fast hypothesis prototyping and testing with the analysis of thousands of biotic and abiotic variables. It has a visualization web interface that presents descriptive results to the user at different levels of detail. The methodology in SPECIES is relatively simple: it partitions the geographical space with a regular grid and treats a species occurrence distribution as a present/not present boolean variable over the cells. Given two species (or one species and one abiotic variable), it measures whether the number of co-occurrences between the two is more (or less) than expected. More co-occurrences than expected signal a positive relation, whereas fewer would be evidence of disjoint distributions. SPECIES provides an open web application programming interface (API) to request the computation of correlations and statistical dependencies between variables in the database. Users can create applications that consume this 'statistical web service' or use it directly to further analyze the results in frameworks like R or Python. The project includes an interactive web application that does exactly that: it requests analysis from the web service and lets the user experiment and visually explore the results. We believe this approach can be used, on the one hand, to augment the services provided by data repositories and, on the other, to facilitate the creation of specialized applications that are clients of these services. This scheme supports big-data-driven research for a wide range of backgrounds because end users need neither the technical know-how nor the infrastructure to handle large databases. Currently, SPECIES hosts all records from Mexico's National Biodiversity Information System (CONABIO 2018) and a subset of Global Biodiversity Information Facility data that covers the contiguous USA (GBIF.org 2018b) and Colombia (GBIF.org 2018a). 
It also includes discretizations of environmental variables from WorldClim, from the Environmental Rasters for Ecological Modeling project (Title and Bemmels 2018), from CliMond (Kriticos et al. 2012), and topographic variables (USGS EROS Center 1997b, USGS EROS Center 1997a). The long-term plan, however, is to incrementally include more data, especially all data from the Global Biodiversity Information Facility. The code of the project is open source, and the repositories are available online (Front-end, Web Services Application Programming Interface, Database Building scripts). This presentation is a demonstration of SPECIES' functionality and its overall design.
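The grid-based co-occurrence idea in this abstract can be illustrated with a short Python sketch. It is not the SPECIES implementation and does not use its API. The abstract does not state which statistic SPECIES uses, so the expectation under statistical independence used here is an assumption, as are the function names and the one-degree cell size.

```python
import math


def occupied_cells(points, cell_deg=1.0):
    """Reduce (lat, lon) occurrence points to the set of grid cells they occupy."""
    return {(math.floor(lat / cell_deg), math.floor(lon / cell_deg))
            for lat, lon in points}


def cooccurrence(cells_a, cells_b, n_cells):
    """Compare observed co-occurrences with the count expected by chance.

    n_cells is the total number of cells in the study grid. The expected
    value assumes the two presence/absence variables are independent
    (an illustrative baseline, not necessarily the one SPECIES uses).
    """
    observed = len(cells_a & cells_b)
    expected = len(cells_a) * len(cells_b) / n_cells
    return observed, expected


# Hypothetical occurrence points for two taxa on a 100-cell grid.
taxon_a = occupied_cells([(0.2, 0.3), (0.7, 1.5), (1.1, 0.4), (1.9, 1.8)])
taxon_b = occupied_cells([(0.5, 0.5), (0.1, 1.2), (5.5, 5.5)])
observed, expected = cooccurrence(taxon_a, taxon_b, n_cells=100)
print(observed, expected)
```

An observed count well above the expected one points towards a positive spatial relation, and a count well below it towards disjoint distributions; judging how large a difference is meaningful would need a significance measure that this sketch leaves out.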

