SpOccSum: An easy-to-use Python tool to summarize species occurrence data from material examined lists in taxonomic revisions

Author(s):  
Michael Trizna ◽  
Torsten Dikow

Taxonomic revisions contain crucial biodiversity data in the material examined sections for each species. In entomology, material examined lists minimally include the collecting locality, date of collection, and the number of specimens of each collection event. Insect species might be represented in taxonomic revisions by only a single specimen or by hundreds to thousands of specimens. Furthermore, revisions of insect genera might treat small genera with few species or include tens to hundreds of species. Summarizing data from such large and complex material examined lists and revisions is cumbersome, time-consuming, and prone to errors. However, providing data on the seasonal incidence, abundance, and collecting period of species is an important way to mobilize primary biodiversity data to understand a species' occurrence or rarity. Here, we present SpOccSum (Species Occurrence Summary), a tool to easily obtain metrics of seasonal incidence from specimen occurrence data in taxonomic revisions. SpOccSum is written in Python (Python Software Foundation 2019) and accessible through the Anaconda Python/R Data Science Platform as a Jupyter Notebook (Kluyver et al. 2016). The tool takes a simple list of specimen data in CSV format containing species name, locality, date of collection (preferably separated into day, month, and year), and number of specimens, and generates a series of tables and graphs summarizing: number of specimens per species, number of specimens collected per month, number of unique collection events, as well as the earliest and most recent collecting year of each species. The results can be exported as graphics or as CSV-formatted tables and can easily be included in manuscripts for publication. An example of an early version of the summary produced by SpOccSum can be viewed in Tables 1 and 2 of Markee and Dikow (2018). To accommodate seasonality in the Northern and Southern Hemispheres, users can choose to start the data display with either January or July. When geographic coordinates are available and species have widespread distributions spanning, for example, the equator, the user can itemize particular regions such as north of the Tropic of Cancer (23.5˚N), Tropic of Cancer to the Equator, Equator to Tropic of Capricorn, and south of the Tropic of Capricorn (23.5˚S). Other features currently in development include the ability to produce distribution maps from the provided data (when geographic coordinates are included) and the option to export specimen occurrence data as a Darwin Core Archive ready for upload to the Global Biodiversity Information Facility (GBIF).
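The kind of summary SpOccSum produces can be approximated with a few lines of pandas. The sketch below is illustrative only; the column names (species, locality, day, month, year, specimen_count) are assumptions for demonstration, not SpOccSum's actual CSV schema or options.

```python
# Illustrative sketch only: SpOccSum's actual column names and outputs may differ.
# Assumes a CSV with columns: species, locality, day, month, year, specimen_count.
import pandas as pd

records = pd.read_csv("material_examined.csv")

# Total specimens per species
total_specimens = records.groupby("species")["specimen_count"].sum()

# Specimens collected per month (seasonal incidence), one row per species
specimens_per_month = (
    records.groupby(["species", "month"])["specimen_count"]
    .sum()
    .unstack(fill_value=0)
)

# Unique collection events plus earliest and most recent collecting year
event_cols = ["locality", "day", "month", "year"]
summary = records.groupby("species").agg(
    earliest_year=("year", "min"),
    latest_year=("year", "max"),
)
summary["collection_events"] = (
    records.drop_duplicates(subset=["species"] + event_cols)
    .groupby("species")
    .size()
)
summary["total_specimens"] = total_specimens

# Export as CSV-formatted tables, ready to include in a manuscript
summary.to_csv("species_summary.csv")
specimens_per_month.to_csv("specimens_per_month.csv")
```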

Author(s):  
Scott A Chamberlain ◽  
Carl Boettiger

Background. The number of individuals of each species in a given location forms the basis for many sub-fields of ecology and evolution. Data on individuals, including which species they belong to and where they are found, can be used to address a large number of research questions. The Global Biodiversity Information Facility (hereafter, GBIF) is the largest aggregator of such data. Programmatic clients for GBIF would make research dealing with GBIF data much easier and more reproducible. Methods. We have developed clients to access GBIF data for each of the R, Python, and Ruby programming languages: rgbif, pygbif, and gbifrb, respectively. Results. For all clients we describe their design and utility, and demonstrate some use cases. Discussion. Programmatic access to GBIF will facilitate more open and reproducible science; the three GBIF clients described herein are a significant contribution towards this goal.
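As a brief illustration of the Python client, the following sketch uses pygbif to resolve a name against the GBIF taxonomic backbone and retrieve a page of matching occurrence records. The taxon and parameter choices are examples only; consult the pygbif documentation for the full set of options.

```python
# Minimal pygbif sketch; parameter names mirror the GBIF occurrence API.
from pygbif import species, occurrences

# Match a name against the GBIF backbone to obtain a taxon key
backbone = species.name_backbone(name="Puma concolor")
taxon_key = backbone["usageKey"]

# Retrieve a page of georeferenced occurrence records for that taxon
resp = occurrences.search(taxonKey=taxon_key, hasCoordinate=True, limit=50)
print(resp["count"], "records matched; showing", len(resp["results"]))
for rec in resp["results"][:5]:
    print(rec.get("species"), rec.get("country"), rec.get("eventDate"))
```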


Author(s):  
Gerald Guala

Biodiversity Information Serving Our Nation (BISON - bison.usgs.gov) is the US Node application for the Global Biodiversity Information Facility (GBIF) and the most comprehensive source of species occurrence data for the United States of America. It currently contains more than 460 million records and provides significant augmentation and integration of US occurrence data in terrestrial, marine, and freshwater systems. Publicly released in 2013, BISON has built a large community of stakeholders, who over the years have submitted many questions via email ([email protected]), presentations, and other means. In this presentation, some of the most common questions will be addressed in detail, for example: why not all BISON data are in GBIF; how BISON differs from GBIF; what the relationship is between BISON and other US providers to GBIF; and what the exact role of the Integrated Taxonomic Information System (ITIS - www.itis.gov) is in BISON.


Author(s):  
Yi-Ming Gan ◽  
Maxime Sweetlove ◽  
Anton Van de Putte

The Antarctic Biodiversity portal (biodiversity.aq) is a gateway to a wide variety of Antarctic biodiversity information and tools. Launched in 2005 as the Scientific Committee on Antarctic Research (SCAR) - Marine Biodiversity Information Network (SCAR-MarBIN, scarmarbin.be) and the Register of Antarctic Marine Species (RAMS, marinespecies.org/rams/), the system has grown in scope from purely marine to include terrestrial information. Biodiversity.aq is a SCAR product, currently supported by Belspo (Belgian Science Policy) as one of the Belgian contributions to the European Lifewatch-European Research Infrastructure Consortium (Lifewatch-ERIC). The goal of Lifewatch is to provide access to: distributed observatories/sensor networks; interoperable databases and existing (data) networks using accepted standards; high-performance computing (HPC) and grid power, including state-of-the-art cloud and big-data technologies; and software and tools for visualization, analysis, and modelling. Here we provide an overview of the most recent advances in the biodiversity.aq online ecosystem, a number of use cases, and an overview of future directions. Some of the most notable components are: The Register of Antarctic Species (RAS, ras.biodiversity.aq), a component of the Lifewatch Taxonomic Backbone, provides an authoritative and comprehensive list of names of marine and terrestrial species in Antarctica and the Southern Ocean. It serves as a reference guide for users to interpret taxonomic literature, as valid names and other names in use are both provided. The Integrated Publishing Toolkit (IPT, ipt.biodiversity.aq) allows disseminating Antarctic biodiversity data into global initiatives such as the Ocean Biogeographic Information System (OBIS, obis.org), as the Antarctic node of OBIS (Ant-OBIS, formerly known as SCAR-MarBIN), and the Global Biodiversity Information Facility (GBIF, gbif.org), as the Antarctic Biodiversity Information Facility (AntaBIF). Data that can be made available include metadata, species checklists, species occurrence data and, more recently, sampling-event-based data. Data from these international portals can be accessed through data.biodiversity.aq. Through SCAR, biodiversity.aq builds on an international network of experts that provides expert knowledge on taxonomy, species distribution, and ecology.
It provides a strong and tested platform for sharing, integrating, discovering and analysing Antarctic biodiversity information originating from a variety of sources into a distributed system.
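To give a sense of how Antarctic records published through the IPT and aggregated by GBIF and OBIS can be consumed programmatically, the sketch below queries GBIF via pygbif with a simple southern-latitude filter. The taxon, latitude cutoff, and printed fields are illustrative assumptions, not part of the biodiversity.aq services themselves.

```python
# Illustrative sketch: pulling Southern Ocean occurrence records from GBIF (where
# AntaBIF-published datasets end up) using pygbif. The latitude range filter below
# is an assumption for demonstration, not biodiversity.aq's own API.
from pygbif import occurrences

resp = occurrences.search(
    scientificName="Euphausia superba",   # Antarctic krill, as an example taxon
    decimalLatitude="-90,-60",            # restrict to high southern latitudes
    hasCoordinate=True,
    limit=100,
)
print(resp["count"], "records south of 60°S")
for rec in resp["results"][:5]:
    print(rec.get("datasetKey"), rec.get("decimalLatitude"), rec.get("eventDate"))
```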


2020 ◽  
Vol 15 (4) ◽  
pp. 411-437 ◽  
Author(s):  
Marcos Zárate ◽  
Germán Braun ◽  
Pablo Fillottrani ◽  
Claudio Delrieux ◽  
Mirtha Lewis

Great progress has recently been made in digitizing the world's available biodiversity and biogeography data, but managing data from many different providers and research domains remains a challenge. A review of the current landscape of metadata standards and ontologies in the biodiversity sciences suggests that existing standards, such as the Darwin Core terminology, are inadequate for describing biodiversity data in a semantically meaningful and computationally useful way. As a contribution to filling this gap, we present an ontology-based system, called BiGe-Onto, designed to manage biodiversity and biogeography data together. As data sources, we use two internationally recognized repositories: the Global Biodiversity Information Facility (GBIF) and the Ocean Biogeographic Information System (OBIS). The BiGe-Onto system is composed of (i) the BiGe-Onto architecture, (ii) a conceptual model called BiGe-Onto specified in OntoUML, (iii) an operational version of BiGe-Onto encoded in OWL 2, and (iv) an integrated dataset for exploitation through a SPARQL endpoint. We show use cases that allow researchers to answer questions requiring information from both domains.
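The sketch below shows the general shape of querying such a SPARQL endpoint from Python with SPARQLWrapper. The endpoint URL and the use of plain Darwin Core properties are placeholders, not BiGe-Onto's actual vocabulary or service address; consult the published ontology for the real terms.

```python
# Hypothetical example of querying an ontology-backed occurrence dataset over SPARQL.
# Endpoint URL, prefixes, and properties are placeholders, not BiGe-Onto's real terms.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://example.org/bige-onto/sparql"  # placeholder endpoint

query = """
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
SELECT ?occurrence ?sciName ?country
WHERE {
  ?occurrence dwc:scientificName ?sciName ;
              dwc:country ?country .
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["sciName"]["value"], "-", row["country"]["value"])
```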


Author(s):  
Arthur Chapman ◽  
Lee Belbin ◽  
Paula Zermoglio ◽  
John Wieczorek ◽  
Paul Morris ◽  
...  

The quality of biodiversity data publicly accessible via aggregators such as GBIF (Global Biodiversity Information Facility), the ALA (Atlas of Living Australia), iDigBio (Integrated Digitized Biocollections), and OBIS (Ocean Biogeographic Information System) is often questioned, especially by the research community. The Data Quality Interest Group, established by Biodiversity Information Standards (TDWG) and GBIF, has been engaged in four main activities: developing a framework for the assessment and management of data quality using a fitness-for-use approach; defining a core set of standardised tests and associated assertions based on Darwin Core terms; gathering and classifying user stories to form contextual-themed use cases, such as species distribution modelling, agrobiodiversity, and invasive species; and developing a standardised format for building and managing controlled vocabularies of values. Using the developed framework, data quality profiles have been built from use cases to represent user needs. Quality assertions can then be used to filter data suitable for a purpose. The assertions can also be used to provide feedback to data providers and custodians to assist in improving data quality at the source. In a case study, two different implementations of tests and assertions based around the Darwin Core "Event Date" terms were run against GBIF data to demonstrate that the tests are implementation agnostic, can be run on large aggregated datasets, and can make biodiversity data more fit for typical research uses.
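As a rough illustration of what a standardised test and assertion might look like in code, the sketch below checks the year of a dwc:eventDate against a plausible range. The function name, parameters, and response structure are simplifications and do not reproduce the formal TDWG/GBIF test specifications.

```python
# Simplified illustration of a Darwin Core eventDate quality test; the real TDWG/GBIF
# data-quality tests have formal names, parameters, and response vocabularies that
# this sketch does not attempt to reproduce.
from datetime import date

def validate_eventdate_inrange(event_date: str,
                               earliest: int = 1600,
                               latest: int = date.today().year) -> dict:
    """Assert that the year of an ISO 8601 eventDate falls within a plausible range."""
    if not event_date:
        return {"status": "INTERNAL_PREREQUISITES_NOT_MET",
                "comment": "dwc:eventDate is empty"}
    try:
        year = int(event_date[:4])
    except ValueError:
        return {"status": "RUN_HAS_RESULT", "result": "NOT_COMPLIANT",
                "comment": "dwc:eventDate is not interpretable"}
    compliant = earliest <= year <= latest
    return {"status": "RUN_HAS_RESULT",
            "result": "COMPLIANT" if compliant else "NOT_COMPLIANT",
            "comment": f"year {year} checked against [{earliest}, {latest}]"}

print(validate_eventdate_inrange("1863-05-09"))
print(validate_eventdate_inrange("0863-05-09"))
```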


Author(s):  
Edward Gilbert ◽  
Corinna Gries ◽  
Nico Franz ◽  
Leslie R. Landrum ◽  
Thomas H. Nash III

The SEINet Portal Network has a complex social and development history spanning nearly two decades. Initially established as a basic online search engine for a select handful of biological collections curated within the southwestern United States, SEINet has since matured into a biodiversity data network incorporating more than 330 institutions and 1,900 individual data contributors. Participating institutions manage and publish over 14 million specimen records, 215,000 observations, and 8 million images. Approximately 70% of the collections use the data portal as their primary "live" specimen management platform. The SEINet interface now supports 13 regional data portals distributed across the United States and northern Mexico (http://symbiota.org/docs/seinet/). Through many collaborative efforts, it has matured into a tool for biodiversity data exploration, which includes species inventories, interactive identification keys, specimen and field images, taxonomic information, species distribution maps, and taxonomic descriptions. SEINet's initial developmental goals were to construct a read-only interface that integrated specimen records harvested from a handful of distributed natural history databases. Intermittent network connectivity and inconsistent data exchange protocols frequently restricted data persistence. National funding opportunities supported a complete redesign towards a centralized data cache model with periodic "snapshot" updates from the original data sources. A service-based management infrastructure was integrated into the interface to mobilize small- to medium-sized collections (<1 million specimen records) that commonly lack the consistent infrastructure and technical expertise to maintain a standards-compliant specimen database. These developments were the precursors to the Symbiota software project (Gries et al. 2014). Through further development of Symbiota, SEINet transformed into a robust specimen management system specifically geared toward specimen digitization, with features including data entry from label images, harvesting data from specimen duplicates, batch georeferencing, data validation and cleaning, progress reporting, and additional tools to improve the efficiency of the digitization process. The central developmental paradigm focused on data mobilization through the production of: a versatile import module capable of ingesting a diverse range of data structures; a robust toolkit to assist in digitizing and managing specimen data and images; and a Darwin Core Archive (DwC-A) compliant data publishing and export toolkit to facilitate data distribution to global aggregators such as the Global Biodiversity Information Facility (GBIF) and iDigBio. User interfaces consist of a decentralized network of regional data portals, all connecting to a centralized shared data source. Each of the 13 data portals is configured to present a regional perspective specifically tailored to the needs of the local research community.
This infrastructure has supported the formation of regional consortia, which provide network support to aid local institutions in digitizing and publishing their collections within the network. The community-based infrastructure creates a sense of ownership – perhaps even good-natured competition – among the data providers and provides extra incentive to improve data quality and expand the network. Certain areas of development remain challenging in spite of the project's overall success. For instance, data managers continuously struggle to maintain a current local taxonomic thesaurus used for name validation, data cleaning, and the resolution of taxonomic discrepancies commonly encountered when integrating collection datasets. We will discuss the successes and challenges associated with the long-term sustainability model and explore potential future paths for SEINet that support the long-term goal of maintaining a data provider in full compliance with the FAIR principles of making datasets findable, accessible, interoperable, and reusable (Wilkinson et al. 2016).
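For readers unfamiliar with the publishing side, the sketch below shows one minimal way to read occurrence records out of a Darwin Core Archive of the kind such networks export. Real archives declare their core file and term mapping in meta.xml, so the hard-coded core file name and archive name here are assumptions for brevity.

```python
# Minimal sketch of reading occurrence records from a published Darwin Core Archive
# (DwC-A). Real archives describe their core file and column mapping in meta.xml;
# for brevity this assumes a tab-delimited core file named "occurrence.txt".
import csv
import io
import zipfile

def read_dwca_occurrences(archive_path: str, core_file: str = "occurrence.txt"):
    with zipfile.ZipFile(archive_path) as zf:
        with zf.open(core_file) as fh:
            text = io.TextIOWrapper(fh, encoding="utf-8")
            for row in csv.DictReader(text, delimiter="\t"):
                yield row

# "seinet_export.zip" is a hypothetical file name used only for illustration
for i, rec in enumerate(read_dwca_occurrences("seinet_export.zip")):
    print(rec.get("scientificName"), rec.get("stateProvince"))
    if i >= 4:
        break
```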


Author(s):  
Raul Sierra-Alcocer ◽  
Christopher Stephens ◽  
Juan Barrios ◽  
Constantino González‐Salazar ◽  
Juan Carlos Salazar Carrillo ◽  
...  

SPECIES (Stephens et al. 2019) is a tool to explore spatial correlations in biodiversity occurrence databases. The main idea behind the SPECIES project is that the geographical correlations between the distributions of taxon records carry useful information. The problem, however, is that if we have thousands of species (Mexico's National System of Biodiversity Information has records of around 70,000 species), then we have millions of potential associations, and exploring them is far from easy. Our goal with SPECIES is to facilitate the discovery and application of meaningful relations hiding in our data. The main variables in SPECIES are the geographical distributions of species occurrence records. Other types of variables, like the climatic variables from WorldClim (Hijmans et al. 2005), serve as explanatory data for modeling. The system offers two modes of analysis. In the first, the user defines a target species and a selection of species and abiotic variables; the system then computes the spatial correlations between the target species and each of the other species and abiotic variables. The request from the user can be as small as comparing one species to another, or as large as comparing one species to all the species in the database. A user may wonder, for example, which species are the usual neighbors of the jaguar; this mode helps answer such questions. The second mode of analysis gives a network perspective: the user defines two groups of taxa (and/or environmental variables), and the output is a correlation network in which the weight of a link between two nodes represents the spatial correlation between the variables that the nodes represent. For example, one group of taxa could be hummingbirds (family Trochilidae) and the second flowers of the family Lamiaceae. This output would help the user analyze which hummingbird and flower pairs are highly correlated in the database. The SPECIES data architecture is optimized to support fast hypothesis prototyping and testing with thousands of biotic and abiotic variables, and a visualization web interface presents descriptive results to the user at different levels of detail. The methodology in SPECIES is relatively simple: it partitions the geographical space with a regular grid and treats a species' occurrence distribution as a present/not-present boolean variable over the cells. Given two species (or one species and one abiotic variable), it measures whether the number of co-occurrences between the two is more (or less) than expected. More co-occurrences than expected signal a positive relation, whereas fewer than expected are evidence of disjoint distributions. SPECIES provides an open web application programming interface (API) to request the computation of correlations and statistical dependencies between variables in the database. Users can create applications that consume this 'statistical web service' or use it directly and further analyze the results in frameworks like R or Python. The project includes an interactive web application that does exactly that: it requests analyses from the web service and lets the user experiment and visually explore the results. We believe this approach can be used, on the one hand, to augment the services provided by data repositories and, on the other, to facilitate the creation of specialized applications that are clients of these services.
This scheme supports big-data-driven research for users from a wide range of backgrounds, because end users need neither the technical know-how nor the infrastructure to handle large databases. Currently, SPECIES hosts all records from Mexico's National Biodiversity Information System (CONABIO 2018) and a subset of Global Biodiversity Information Facility data covering the contiguous USA (GBIF.org 2018b) and Colombia (GBIF.org 2018a). It also includes discretizations of environmental variables from WorldClim, the Environmental Rasters for Ecological Modeling project (Title and Bemmels 2018), and CliMond (Kriticos et al. 2012), as well as topographic variables (USGS EROS Center 1997a, USGS EROS Center 1997b). The long-term plan, however, is to incrementally include more data, especially all data from the Global Biodiversity Information Facility. The code of the project is open source, and the repositories are available online (front end, web services API, database building scripts). This presentation is a demonstration of SPECIES' functionality and its overall design.
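The grid-based co-occurrence idea can be sketched in a few lines: binarize two variables over the cells of a grid and compare the observed number of co-occupied cells with the number expected under independence. The score below is a generic binomial z-score on synthetic data, not necessarily the exact statistic SPECIES implements.

```python
# Toy sketch of grid-based co-occurrence scoring: presence/absence of two variables
# over N grid cells, compared against the expectation under independence. This is a
# generic binomial z-score on synthetic data, not SPECIES' exact statistic.
import numpy as np

rng = np.random.default_rng(0)
n_cells = 10_000
species_a = rng.random(n_cells) < 0.05          # presence/absence of species A per cell
species_b = (species_a & (rng.random(n_cells) < 0.6)) | (rng.random(n_cells) < 0.02)

n_a = species_a.sum()
n_b = species_b.sum()
observed = (species_a & species_b).sum()

# Under independence, each of the n_a cells occupied by A contains B with
# probability p_b = n_b / n_cells.
p_b = n_b / n_cells
expected = n_a * p_b
std = np.sqrt(n_a * p_b * (1 - p_b))
score = (observed - expected) / std

print(f"observed={observed}, expected={expected:.1f}, score={score:.2f}")
# A strongly positive score suggests co-occurrence beyond chance; a strongly
# negative score suggests disjoint distributions.
```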


2018 ◽  
Vol 2 ◽  
pp. e26369
Author(s):  
Michael Trizna

As rapid advances in sequencing technology result in more branches of the tree of life being illuminated, there has actually been a decrease in the percentage of sequence records that are backed by voucher specimens (Trizna 2018b). The good news is that there are tools (Trizna 2017, NCBI 2005, Biocode LLC 2014) that enable well-databased museum vouchers to automatically validate and format specimen and collection metadata for high-quality sequence records. Another problem is that there are millions of existing sequence records known to contain either incorrect or incomplete specimen data. I will show an end-to-end example of sequencing specimens from a museum, depositing their sequence records in NCBI's (National Center for Biotechnology Information) GenBank database, and then providing updates to GenBank as the museum database revises identifications. I will also talk about linking records from specimen databases. Over one million records in the Global Biodiversity Information Facility (GBIF; Trizna 2018a) contain a value for the Darwin Core term "associatedSequences", and I will examine what these entries currently contain and how best to format them to ensure that a tight connection is made to sequence records.
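As a small illustration of the formatting question, the sketch below pulls accession-like tokens out of free-text associatedSequences values and turns them into resolvable NCBI links. The regular expression is a simplification of INSDC accession formats and is offered only as an assumption for demonstration, not as a recommendation from the abstract.

```python
# Illustrative sketch: normalizing Darwin Core associatedSequences values into
# resolvable GenBank links. The accession pattern below is a simplification of
# INSDC accession formats and of the free-text variants found in real GBIF records.
import re

ACCESSION_RE = re.compile(r"\b([A-Z]{1,2}\d{5,8}|[A-Z]{4,6}\d{8,10})\b")

def genbank_links(associated_sequences: str) -> list:
    """Extract accession-like tokens and return NCBI Nucleotide URLs."""
    accessions = ACCESSION_RE.findall(associated_sequences or "")
    return [f"https://www.ncbi.nlm.nih.gov/nuccore/{acc}" for acc in accessions]

# Example values resembling what associatedSequences fields often contain
print(genbank_links("GenBank: MG012345"))
print(genbank_links("KX123456 | KX123457"))
print(genbank_links("https://www.ncbi.nlm.nih.gov/nuccore/AB123456"))
```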

