Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data

Author(s):  
Arthur Chapman ◽  
Lee Belbin ◽  
Paula Zermoglio ◽  
John Wieczorek ◽  
Paul Morris ◽  
...  

The quality of biodiversity data publicly accessible via aggregators such as GBIF (Global Biodiversity Information Facility), the ALA (Atlas of Living Australia), iDigBio (Integrated Digitized Biocollections), and OBIS (Ocean Biogeographic Information System) is often questioned, especially by the research community. The Data Quality Interest Group, established by Biodiversity Information Standards (TDWG) and GBIF, has been engaged in four main activities: developing a framework for the assessment and management of data quality using a fitness for use approach; defining a core set of standardised tests and associated assertions based on Darwin Core terms; gathering and classifying user stories to form contextual-themed use cases, such as species distribution modelling, agrobiodiversity, and invasive species; and developing a standardised format for building and managing controlled vocabularies of values. Using the developed framework, data quality profiles have been built from use cases to represent user needs. Quality assertions can then be used to filter data suitable for a purpose. The assertions can also be used to provide feedback to data providers and custodians to assist in improving data quality at the source. A case study using two different implementations of tests and assertions based around the Darwin Core "Event Date" terms was also run against GBIF data to demonstrate that the tests are implementation agnostic, can be run on large aggregated datasets, and can make biodiversity data more fit for typical research uses.
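As a rough illustration of the kind of test and assertion described above, the Python sketch below checks a single Darwin Core dwc:eventDate value against ISO 8601 and returns a simple assertion. It is a minimal, hedged example: the function name, the assertion fields, and the restriction to single dates (no date ranges) are assumptions made for illustration, not the published test suite.

```python
# Minimal sketch of a Darwin Core eventDate validation, assuming a single
# ISO 8601-formatted dwc:eventDate; not the official TDWG test implementation.
from datetime import date


def validate_event_date(record: dict) -> dict:
    """Return a simple assertion for the dwc:eventDate term of one record."""
    value = record.get("eventDate", "")
    if not value:
        return {"test": "EVENTDATE_NOTEMPTY", "status": "FAIL",
                "comment": "dwc:eventDate is empty"}
    try:
        # Accept a single ISO 8601 date; ranges such as "1980-05-01/1980-05-10"
        # would need additional handling in a real implementation.
        date.fromisoformat(value)
        return {"test": "EVENTDATE_STANDARD", "status": "PASS", "comment": ""}
    except ValueError:
        return {"test": "EVENTDATE_STANDARD", "status": "FAIL",
                "comment": f"'{value}' is not a valid ISO 8601 date"}


print(validate_event_date({"eventDate": "1980-05-01"}))   # PASS
print(validate_event_date({"eventDate": "01/05/1980"}))   # FAIL
```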

2020 ◽  
Vol 15 (4) ◽  
pp. 411-437 ◽  
Author(s):  
Marcos Zárate ◽  
Germán Braun ◽  
Pablo Fillottrani ◽  
Claudio Delrieux ◽  
Mirtha Lewis

Great progress has recently been made in digitizing the world's available Biodiversity and Biogeography data, but managing data from many different providers and research domains still remains a challenge. A review of the current landscape of metadata standards and ontologies in Biodiversity sciences suggests that existing standards, such as the Darwin Core terminology, are inadequate for describing Biodiversity data in a semantically meaningful and computationally useful way. As a contribution to filling this gap, we present an ontology-based system, called BiGe-Onto, designed to jointly manage data from Biodiversity and Biogeography. As data sources, we use two internationally recognized repositories: the Global Biodiversity Information Facility (GBIF) and the Ocean Biogeographic Information System (OBIS). The BiGe-Onto system is composed of (i) the BiGe-Onto architecture, (ii) a conceptual model called BiGe-Onto specified in OntoUML, (iii) an operational version of BiGe-Onto encoded in OWL 2, and (iv) an integrated dataset for its exploitation through a SPARQL endpoint. We show use cases that allow researchers to answer questions drawing on information from both domains.
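The sketch below shows, in hedged form, how such a SPARQL endpoint could be queried from Python. The endpoint URL is a placeholder and, for simplicity, plain Darwin Core term IRIs stand in for the BiGe-Onto vocabulary, so this illustrates the access pattern rather than the actual ontology.

```python
# Hedged sketch: querying an integrated occurrence dataset through a SPARQL
# endpoint, in the spirit of the BiGe-Onto exploitation described above.
# The endpoint URL and the use of bare Darwin Core IRIs are placeholders,
# not the actual BiGe-Onto vocabulary.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://example.org/bige-onto/sparql"  # placeholder endpoint

query = """
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
SELECT ?occurrence ?scientificName ?decimalLatitude ?decimalLongitude
WHERE {
  ?occurrence dwc:scientificName   ?scientificName ;
              dwc:decimalLatitude  ?decimalLatitude ;
              dwc:decimalLongitude ?decimalLongitude .
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for binding in results["results"]["bindings"]:
    print(binding["scientificName"]["value"],
          binding["decimalLatitude"]["value"],
          binding["decimalLongitude"]["value"])
```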


Author(s):  
Michael Trizna ◽  
Torsten Dikow

Taxonomic revisions contain crucial biodiversity data in the material examined sections for each species. In entomology, material examined lists minimally include the collecting locality, date of collection, and the number of specimens of each collection event. Insect species might be represented in taxonomic revisions by only a single specimen or by hundreds to thousands of specimens. Furthermore, revisions of insect genera might treat small genera with few species or include tens to hundreds of species. Summarizing data from such large and complex material examined lists and revisions is cumbersome, time-consuming, and prone to errors. However, providing data on the seasonal incidence, abundance, and collecting period of species is an important way to mobilize primary biodiversity data to understand a species's occurrence or rarity. Here, we present SpOccSum (Species Occurrence Summary)—a tool to easily obtain metrics of seasonal incidence from specimen occurrence data in taxonomic revisions. SpOccSum is written in Python (Python Software Foundation 2019) and accessible through the Anaconda Python/R Data Science Platform as a Jupyter Notebook (Kluyver et al. 2016). The tool takes a simple list of specimen data containing species name, locality, date of collection (preferably separated into day, month, and year), and number of specimens in CSV format and generates a series of tables and graphs summarizing: number of specimens per species, number of specimens collected per month, number of unique collection events, as well as the earliest and most recent collecting year of each species. The results can be exported as graphics or as CSV-formatted tables and can easily be included in manuscripts for publication. An example of an early version of the summary produced by SpOccSum can be viewed in Tables 1, 2 from Markee and Dikow (2018). To accommodate seasonality in the Northern and Southern Hemispheres, users can choose to start the data display with either January or July. When geographic coordinates are available and species have widespread distributions spanning, for example, the equator, the user can itemize particular regions such as North of the Tropic of Cancer (23.5°N), Tropic of Cancer to the Equator, Equator to Tropic of Capricorn, and South of the Tropic of Capricorn (23.5°S). Other features currently in development include the ability to produce distribution maps from the provided data (when geographic coordinates are included) and the option to export specimen occurrence data as a Darwin Core Archive ready for upload to the Global Biodiversity Information Facility (GBIF).
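A hedged sketch of the kind of summary described above is shown below using pandas. This is not the published SpOccSum code, and the input file name and CSV column names (species, month, year, specimen_count) are assumptions for illustration.

```python
# Hedged sketch of summarizing material examined data per species and per
# month; illustrative only, with assumed column names, not the SpOccSum tool.
import pandas as pd

records = pd.read_csv("material_examined.csv")

# Specimens per species, number of collection events, and the earliest and
# most recent collecting year of each species.
per_species = records.groupby("species").agg(
    specimens=("specimen_count", "sum"),
    collection_events=("species", "size"),
    earliest_year=("year", "min"),
    latest_year=("year", "max"),
)

# Specimens collected per month (seasonal incidence), one row per species.
per_month = (records.groupby(["species", "month"])["specimen_count"]
             .sum()
             .unstack(fill_value=0))

per_species.to_csv("specimens_per_species.csv")
per_month.to_csv("specimens_per_month.csv")
```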


2018 ◽  
Vol 2 ◽  
pp. e25310
Author(s):  
Fhatani Ranwashe

Georeferencing helps to fill biodiversity information gaps, allowing biodiversity data to be represented spatially so that valuable assessments can be conducted. The South African National Biodiversity Institute (SANBI) has embarked on a number of projects that have required the georeferencing of biodiversity data to assist in assessments for red-listing of species and in measuring species protection levels. Data quality is an important aspect of biodiversity information. Due to a lack of standardisation in collection and recording methods, historical biodiversity data collections present a challenge when it comes to ascertaining fitness for use or determining the quality of data. The quality of historical locality information recorded in biodiversity data collections faces particular scrutiny regarding fitness for use, as this information is critical in performing assessments. A lack of descriptive locality information, or ambiguous locality information, renders most historical biodiversity records unfit for use. Georeferencing should essentially improve the quality of biodiversity data, but how do you measure the fitness for use of georeferenced data? Through the use of the Darwin Core term coordinateUncertaintyInMeters, georeferenced data can be queried to investigate and determine the quality of the georeferenced data produced. My presentation will cover the scope of ascertaining georeferenced data quality through the use of the Darwin Core term coordinateUncertaintyInMeters, the impacts of using a controlled vocabulary in representing coordinateUncertaintyInMeters, and will highlight how SANBI's georeferencing efforts have contributed to data quality within the management of biodiversity information.
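The following hedged sketch shows one way georeferenced data could be queried on coordinateUncertaintyInMeters to gauge fitness for use. The input file name and the 10 km threshold are illustrative assumptions, not SANBI's actual workflow.

```python
# Hedged sketch: assessing fitness for use of georeferenced records via the
# Darwin Core term coordinateUncertaintyInMeters. File name and threshold are
# illustrative assumptions.
import pandas as pd

occurrences = pd.read_csv("georeferenced_records.csv")

# Non-numeric or missing uncertainty values become NaN.
uncertainty = pd.to_numeric(
    occurrences["coordinateUncertaintyInMeters"], errors="coerce")

total = len(occurrences)
missing = uncertainty.isna().sum()
fit_for_use = (uncertainty <= 10_000).sum()   # e.g. usable for coarse assessments

print(f"Records with no usable uncertainty value: {missing}/{total}")
print(f"Records with uncertainty <= 10 km:        {fit_for_use}/{total}")
```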


Author(s):  
Edward Gilbert ◽  
Corinna Gries ◽  
Nico Franz ◽  
Leslie R. Landrum ◽  
Thomas H. Nash III

The SEINet Portal Network has a complex social and development history spanning nearly two decades. Initially established as a basic online search engine for a select handful of biological collections curated within the southwestern United States, SEINet has since matured into a biodiversity data network incorporating more than 330 institutions and 1,900 individual data contributors. Participating institutions manage and publish over 14 million specimen records, 215,000 observations, and 8 million images. Approximately 70% of the collections make use of the data portal as their primary "live" specimen management platform. The SEINet interface now supports 13 regional data portals distributed across the United States and northern Mexico (http://symbiota.org/docs/seinet/). Through many collaborative efforts, it has matured into a tool for biodiversity data exploration, which includes species inventories, interactive identification keys, specimen and field images, taxonomic information, species distribution maps, and taxonomic descriptions. SEINet’s initial developmental goals were to construct a read-only interface that integrated specimen records harvested from a handful of distributed natural history databases. Intermittent network connectivity and inconsistent data exchange protocols frequently restricted data persistence. National funding opportunities supported a complete redesign towards the development of a centralized data cache model with periodic "snapshot" updates from original data sources. A service-based management infrastructure was integrated into the interface to mobilize small- to medium-sized collections (<1 million specimen records) that commonly lack consistent infrastructure and technical expertise to maintain a standards-compliant specimen database. These developments were the precursors to the Symbiota software project (Gries et al. 2014). Through further development of Symbiota, SEINet transformed into a robust specimen management system specifically geared toward specimen digitization, with features including data entry from label images, harvesting data from specimen duplicates, batch georeferencing, data validation and cleaning, generating progress reports, and additional tools to improve the efficiency of the digitization process. The central developmental paradigm focused on data mobilization through the production of: a versatile import module capable of ingesting a diverse range of data structures, a robust toolkit to assist in digitizing and managing specimen data and images, and a Darwin Core Archive (DwC-A) compliant data publishing and export toolkit to facilitate data distribution to global aggregators such as the Global Biodiversity Information Facility (GBIF) and iDigBio. User interfaces consist of a decentralized network of regional data portals, all connecting to a centralized shared data source. Each of the 13 data portals is configured to present a regional perspective specifically tailored to represent the needs of the local research community. 
This infrastructure has supported the formation of regional consortia, which provide network support to aid local institutions in digitizing and publishing their collections within the network. The community-based infrastructure creates a sense of ownership – perhaps even good-natured competition – among the data providers and provides extra incentive to improve data quality and expand the network. Certain areas of development remain challenging in spite of the project's overall success. For instance, data managers continuously struggle to maintain a current local taxonomic thesaurus used for name validation, data cleaning, and resolving taxonomic discrepancies commonly encountered when integrating collection datasets. We will discuss the successes and challenges associated with the long-term sustainability model and explore potential future paths for SEINet that support the long-term goal of maintaining a data provider that is in full compliance with the FAIR principles of making datasets findable, accessible, interoperable, and reusable (Wilkinson et al. 2016).
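As a hedged illustration of the DwC-A publishing pathway mentioned above, the sketch below reads the occurrence core of a Darwin Core Archive of the kind a portal might export. The archive name is illustrative, the core file location is taken from the archive's meta.xml as the DwC-A specification describes, and a comma delimiter with a header row of Darwin Core term names is assumed.

```python
# Hedged sketch: reading the occurrence core of a Darwin Core Archive (DwC-A).
# The archive name is illustrative; the delimiter and header conventions are
# declared in meta.xml and are assumed to be CSV with term-name headers here.
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

ARCHIVE = "seinet_export.zip"  # illustrative file name

with zipfile.ZipFile(ARCHIVE) as dwca:
    # meta.xml declares where the core occurrence file lives in the archive.
    meta = ET.fromstring(dwca.read("meta.xml"))
    ns = {"dwc": "http://rs.tdwg.org/dwc/text/"}
    core_file = meta.find("dwc:core/dwc:files/dwc:location", ns).text
    with dwca.open(core_file) as core:
        reader = csv.DictReader(io.TextIOWrapper(core, encoding="utf-8"))
        for i, row in enumerate(reader):
            print(row.get("scientificName"), row.get("eventDate"))
            if i >= 4:      # show only the first few records
                break
```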


2018 ◽  
Vol 2 ◽  
pp. e25608 ◽  
Author(s):  
Lee Belbin ◽  
Arthur Chapman ◽  
John Wieczorek ◽  
Paula Zermoglio ◽  
Alex Thompson ◽  
...  

Task Group 2 of the TDWG Data Quality Interest Group aims to provide a standard suite of tests and resulting assertions that can assist with filtering occurrence records for as many applications as possible. Currently, ‘data aggregators’ such as the Global Biodiversity Information Facility (GBIF), the Atlas of Living Australia (ALA) and iDigBio run their own suites of tests over the records they receive and report the results of these tests (the assertions); there is, however, no standard reporting mechanism. We reasoned that the availability of an internationally agreed set of tests would encourage implementations by the aggregators, and at the data sources (museums, herbaria and others), so that issues could be detected and corrected early in the process. All the tests are limited to Darwin Core terms. The ~95 tests, refined from over 250 in use around the world, were classified into four output types: validations, notifications, amendments and measures. Validations test one or more Darwin Core terms, for example, that dwc:decimalLatitude is in a valid range (i.e. between -90 and +90 inclusive). Notifications report a status that a user of the record should know about, for example, if there is a user annotation associated with the record. Amendments are made to one or more Darwin Core terms when the information across the record can be improved, for example, if there is no value for dwc:scientificName, it can be filled in from a valid dwc:taxonID. Measures report values that may be useful for assessing the overall quality of a record, for example, the number of validation tests passed. Evaluation of the tests was complex and time-consuming, but the important parameters of each test have been consistently documented. Each test has a globally unique identifier, a label, an output type, a resource type, the Darwin Core terms used, a description, a dimension (from the Framework on Data Quality from TG1), an example, references, implementations (if any), test prerequisites and notes. For each test, generic code is being written that should be easy for institutions to implement – be they aggregators or data custodians. A valuable product of the work of TG2 has been a set of general principles. One example is “Darwin Core terms are either: literal verbatim (e.g., dwc:verbatimLocality) and cannot be assumed capable of validation, open-ended (e.g., dwc:behavior) and cannot be assumed capable of validation, or bounded by an agreed vocabulary or extents, and therefore capable of validation (e.g., dwc:countryCode)”. Another is “criteria for including tests are that they are informative, relatively simple to implement, mandatory for amendments and have power in that they will not likely result in 0% or 100% of all record hits.” A third: “Do not ascribe precision where it is unknown.” GBIF, the ALA and iDigBio have committed to implementing the tests once they have been finalized. We are confident that many museums and herbaria will also implement the tests over time. We anticipate that demonstration code and a test dataset that will validate the code will be available on project completion.
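A minimal Python sketch of one such validation, following the dwc:decimalLatitude example in the abstract, is shown below. The function name and the response structure are illustrative assumptions, not the normative TG2 report format.

```python
# Minimal sketch of a validation output type, following the dwc:decimalLatitude
# range example above. The status/result vocabulary used here is illustrative.
def validation_decimallatitude_inrange(record: dict) -> dict:
    """VALIDATION: dwc:decimalLatitude must be a number between -90 and +90."""
    value = record.get("decimalLatitude")
    if value in (None, ""):
        return {"status": "PREREQUISITES_NOT_MET",
                "comment": "dwc:decimalLatitude is empty"}
    try:
        latitude = float(value)
    except ValueError:
        return {"status": "RUN_HAS_RESULT", "result": "NOT_COMPLIANT",
                "comment": "dwc:decimalLatitude is not numeric"}
    compliant = -90.0 <= latitude <= 90.0
    return {"status": "RUN_HAS_RESULT",
            "result": "COMPLIANT" if compliant else "NOT_COMPLIANT",
            "comment": ""}


print(validation_decimallatitude_inrange({"decimalLatitude": "-35.28"}))
print(validation_decimallatitude_inrange({"decimalLatitude": "123.4"}))
```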


Author(s):  
Matt Woodburn ◽  
Gabriele Droege ◽  
Sharon Grant ◽  
Quentin Groom ◽  
Janeen Jones ◽  
...  

The utopian vision is of a future where a digital representation of each object in our collections is accessible through the internet and sustainably linked to other digital resources. This is a long-term goal, however, and in the meantime there is an urgent need to share data about our collections at a higher level with a range of stakeholders (Woodburn et al. 2020). To sustainably achieve this, and to aggregate this information across all natural science collections, the data need to be standardised (Johnston and Robinson 2002). To this end, the Biodiversity Information Standards (TDWG) Collection Descriptions (CD) Interest Group has developed a data standard for describing collections, which is approaching formal review for ratification as a new TDWG standard. It proposes 20 classes (Suppl. material 1) and over 100 properties that can be used to describe, categorise, quantify, link and track digital representations of natural science collections, from high-level approximations to detailed breakdowns depending on the purpose of a particular implementation. The wide range of use cases identified for representing collection description data means that a flexible approach to the standard and the underlying modelling concepts is essential. These are centered around the ‘ObjectGroup’ (Fig. 1), a class that may represent any group (of any size) of physical collection objects, which have one or more common characteristics. This generic definition of the ‘collection’ in ‘collection descriptions’ is an important factor in making the standard flexible enough to support the breadth of use cases. For any use case or implementation, only a subset of classes and properties within the standard are likely to be relevant. In some cases, this subset may have little overlap with those selected for other use cases. This additional need for flexibility means that very few classes and properties, representing the core concepts, are proposed to be mandatory. Metrics, facts and narratives are represented in a normalised structure using an extended MeasurementOrFact class, so that these can be user-defined rather than constrained to a set identified by the standard. Finally, rather than a rigid underlying data model as part of the normative standard, documentation will be developed to provide guidance on how the classes in the standard may be related and quantified according to relational, dimensional and graph-like models. So, in summary, the standard has, by design, been made flexible enough to be used in a number of different ways. The corresponding risk is that it could be used in ways that may not deliver what is needed in terms of outputs, manageability and interoperability with other resources of collection-level or object-level data. To mitigate this, it is key for any new implementer of the standard to establish how it should be used in that particular instance, and define any necessary constraints within the wider scope of the standard and model. This is the concept of the ‘collection description scheme,’ a profile that defines elements such as: which classes and properties should be included, which should be mandatory, and which should be repeatable; which controlled vocabularies and hierarchies should be used to make the data interoperable; how the collections should be broken down into individual ObjectGroups and interlinked, and how the various classes should be related to each other. 
Various factors might influence these decisions, including the types of information that are relevant to the use case, whether quantitative metrics need to be captured and aggregated across collection descriptions, and how many resources can be dedicated to amassing and maintaining the data. This process has particular relevance to the Distributed System of Scientific Collections (DiSSCo) consortium, the design of which incorporates use cases for storing, interlinking and reporting on the collections of its member institutions. These include helping users of the European Loans and Visits System (ELViS) (Islam 2020) to discover specimens for physical and digital loans by providing descriptions and breakdowns of the collections of holding institutions, and monitoring digitisation progress across European collections through a dynamic Collections Digitisation Dashboard. In addition, DiSSCo will be part of a global collections data ecosystem requiring interoperation with other infrastructures such as the GBIF (Global Biodiversity Information Facility) Registry of Scientific Collections, the CETAF (Consortium of European Taxonomic Facilities) Registry of Collections and Index Herbariorum. In this presentation, we will introduce the draft standard and discuss the process of defining new collection description schemes using the standard and data model, and focus on DiSSCo requirements as examples of real-world collection descriptions use cases.
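To make the ObjectGroup and MeasurementOrFact concepts concrete, the hedged sketch below instantiates a single object group with two attached metrics. The class and property names are simplified paraphrases chosen for illustration, not the normative terms of the draft standard.

```python
# Hedged sketch of an ObjectGroup with attached MeasurementOrFact metrics under
# a simple collection description scheme. Names are simplified paraphrases, not
# the normative terms of the draft TDWG Collection Descriptions standard.
from dataclasses import dataclass, field


@dataclass
class MeasurementOrFact:
    measurement_type: str          # e.g. "objectCount", "digitisedPercentage"
    measurement_value: str
    measurement_unit: str = ""


@dataclass
class ObjectGroup:
    title: str
    discipline: str                # controlled vocabulary term, per the scheme
    geographic_origin: str
    preservation_method: str
    metrics: list[MeasurementOrFact] = field(default_factory=list)


entomology_pinned = ObjectGroup(
    title="Pinned insects, Southern Africa",
    discipline="Entomology",
    geographic_origin="Southern Africa",
    preservation_method="Pinned",
    metrics=[
        MeasurementOrFact("objectCount", "120000"),
        MeasurementOrFact("digitisedPercentage", "35", "%"),
    ],
)
print(entomology_pinned.title, len(entomology_pinned.metrics))
```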


2018 ◽  
Vol 2 ◽  
pp. e26369
Author(s):  
Michael Trizna

As rapid advances in sequencing technology result in more branches of the tree of life being illuminated, there has actually been a decrease in the percentage of sequence records that are backed by voucher specimens (Trizna 2018b). The good news is that there are tools (Trizna 2017, NCBI 2005, Biocode LLC 2014) that enable well-databased museum vouchers to automatically validate and format specimen and collection metadata for high quality sequence records. A remaining problem is that there are millions of existing sequence records that are known to contain either incorrect or incomplete specimen data. I will show an end-to-end example of sequencing specimens from a museum, depositing their sequence records in NCBI's (National Center for Biotechnology Information) GenBank database, and then providing updates to GenBank as the museum database revises identifications. I will also talk about linking records from specimen databases. Over one million records in the Global Biodiversity Information Facility (GBIF) (Trizna 2018a) contain a value in the Darwin Core term "associatedSequences", and I will examine what is currently contained in these entries, and how best to format them to ensure that a tight connection is made to sequence records.
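The hedged sketch below shows one way raw associatedSequences values could be normalised into candidate GenBank accessions. The delimiters and the accession pattern are assumptions about commonly seen formats, not a description of what GBIF actually contains.

```python
# Hedged sketch: normalising values found in the Darwin Core term
# dwc:associatedSequences into candidate GenBank-style accessions. The
# delimiters and the accession pattern are illustrative assumptions.
import re

# Typical GenBank nucleotide accession shapes, e.g. "MK012345" or "AF123456".
ACCESSION = re.compile(r"\b[A-Z]{1,2}\d{5,8}\b")


def extract_accessions(associated_sequences: str) -> list[str]:
    """Split a raw associatedSequences value and keep accession-like tokens."""
    tokens = re.split(r"[|;,\s]+", associated_sequences or "")
    hits = []
    for token in tokens:
        # Strip URL prefixes such as .../nuccore/MK012345
        token = token.rsplit("/", 1)[-1]
        match = ACCESSION.search(token.upper())
        if match:
            hits.append(match.group())
    return hits


print(extract_accessions(
    "GenBank: MK012345 | https://www.ncbi.nlm.nih.gov/nuccore/AF123456"))
```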


Author(s):  
Gil Nelson ◽  
Deborah L Paul

Integrated Digitized Biocollections (iDigBio) is the United States’ (US) national resource and coordinating center for biodiversity specimen digitization and mobilization. It was established in 2011 through the US National Science Foundation’s (NSF) Advancing Digitization of Biodiversity Collections (ADBC) program, an initiative that grew from a working group of museum-based and other biocollections professionals working in concert with NSF to make collections' specimen data accessible for science, education, and public consumption. The working group, Network Integrated Biocollections Alliance (NIBA), released two reports (Beach et al. 2010, American Institute of Biological Sciences 2013) that provided the foundation for iDigBio and ADBC. iDigBio is restricted in focus to the ingestion of data generated by public, non-federal museum and academic collections. Its focus is on specimen-based (as opposed to observational) occurrence records. iDigBio currently serves about 118 million transcribed specimen-based records and 29 million specimen-based media records from approximately 1600 datasets. These digital objects have been contributed by about 700 collections representing nearly 400 institutions, making iDigBio the most comprehensive biodiversity data aggregator in the US. Currently, iDigBio, DiSSCo (Distributed System of Scientific Collections), GBIF (Global Biodiversity Information Facility), and the Atlas of Living Australia (ALA) are collaborating on a global framework to harmonize technologies towards standardizing and synchronizing ingestion strategies, data models and standards, cyberinfrastructure, APIs (application programming interface), specimen record identifiers, etc. in service to a developing consolidated global data product that can provide a common source for the world’s digital biodiversity data. The collaboration strives to harness and combine the unique strengths of its partners in ways that ensure the individual needs of each partner’s constituencies are met, design pathways for accommodating existing and emerging aggregators, simultaneously strengthen and enhance access to the world’s biodiversity data, and underscore the scope and importance of worldwide biodiversity informatics activities. Collaborators will share technology strategies and outputs, align conceptual understandings, and establish and draw from an international knowledge base. These collaborators, along with Biodiversity Information Standards (TDWG), will join iDigBio and the Smithsonian National Museum of Natural History as they host Biodiversity 2020 in Washington, DC. Biodiversity 2020 will combine an international celebration of the worldwide progress made in biodiversity data accessibility in the 21st century with a biodiversity data conference that extends the life of Biodiversity Next. It will provide a venue for the GBIF governing board meeting, TDWG annual meeting, and the annual iDigBio Summit as well as three days of plenary and concurrent sessions focused on the present and future of biodiversity data generation, mobilization, and use.


ZooKeys ◽  
2018 ◽  
Vol 751 ◽  
pp. 129-146 ◽  
Author(s):  
Robert Mesibov

A total of ca. 800,000 occurrence records from the Australian Museum (AM), Museums Victoria (MV) and the New Zealand Arthropod Collection (NZAC) were audited for changes in selected Darwin Core fields after processing by the Atlas of Living Australia (ALA; for AM and MV records) and the Global Biodiversity Information Facility (GBIF; for AM, MV and NZAC records). Formal taxon names in the genus- and species-groups were changed in 13–21% of AM and MV records, depending on dataset and aggregator. There was little agreement between the two aggregators on processed names: names were changed in two to three times as many records by one aggregator alone as in records with names changed by both aggregators. The type status of specimen records did not change with name changes, resulting in confusion as to the name with which a type was associated. Data losses of up to 100% were found in some fields after processing, apparently due to programming errors. The taxonomic usefulness of occurrence records could be improved if aggregators included both the original and the processed taxonomic data items for each record. It is recommended that end-users check original and processed records for data loss and name replacements after processing by aggregators.
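A hedged sketch of the kind of audit described above is given below: comparing provider-supplied and aggregator-processed values of selected Darwin Core fields. The file names and column names are illustrative assumptions, not the files used in the study.

```python
# Hedged sketch: auditing changes to Darwin Core fields between provider and
# aggregator versions of the same records. File and column names are assumed.
import pandas as pd

original = pd.read_csv("provider_records.csv")     # as supplied by the museum
processed = pd.read_csv("aggregator_records.csv")  # as served by the aggregator

merged = original.merge(processed, on="occurrenceID",
                        suffixes=("_original", "_processed"))

changed = merged["scientificName_original"] != merged["scientificName_processed"]
lost = merged["locality_processed"].isna() & merged["locality_original"].notna()

print(f"Records with scientificName changed:      {changed.mean():.1%}")
print(f"Records with locality lost in processing: {lost.mean():.1%}")
```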


2016 ◽  
Vol 11 ◽  
Author(s):  
Alex Asase ◽  
A. Townsend Peterson

Providing comprehensive, informative, primary, research-grade biodiversity information represents an important focus of biodiversity informatics initiatives. Recent efforts within Ghana have digitized >90% of primary biodiversity data records associated with specimen sheets in Ghanaian herbaria; additional herbarium data are available from other institutions via biodiversity informatics initiatives such as the Global Biodiversity Information Facility. However, data on the plants of Ghana have not yet been integrated and assessed to establish how complete site inventories are, so that appropriate levels of confidence can be applied. In this study, we assessed inventory completeness and identified gaps in the current Digital Accessible Knowledge (DAK) of the plants of Ghana, to prioritize areas for future surveys and inventories. We evaluated inventory completeness at ½° spatial resolution using summary statistics, and characterized gaps in coverage in terms of geographic distance and climatic difference from well-documented sites across the country. The southwestern and southeastern parts of the country held many well-known grid cells; the largest spatial gaps were found in central and northern parts of the country. Climatic difference showed contrasting patterns, with a dramatic gap in coverage in central-northern Ghana. This study provides a detailed case study of how to prioritize new botanical surveys and inventories based on existing DAK.
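As a hedged illustration of the gridding step underlying such an analysis, the sketch below bins occurrence records into half-degree cells and counts records per cell as a crude first proxy for sampling effort; the published study uses dedicated completeness estimators rather than raw counts. The file name, column names, and the 50-record threshold are illustrative assumptions.

```python
# Hedged sketch: binning occurrences into 0.5-degree grid cells and flagging
# sparsely sampled cells. A simplification of the completeness analysis above;
# file, column names, and the threshold are illustrative assumptions.
import pandas as pd

occ = pd.read_csv("ghana_plant_records.csv")  # decimalLatitude, decimalLongitude

# Snap coordinates to the lower-left corner of a 0.5-degree cell.
occ["cell_lat"] = (occ["decimalLatitude"] // 0.5) * 0.5
occ["cell_lon"] = (occ["decimalLongitude"] // 0.5) * 0.5

records_per_cell = (occ.groupby(["cell_lat", "cell_lon"])
                       .size()
                       .rename("n_records")
                       .reset_index())

# Cells with very few records are candidates for new surveys.
undersampled = records_per_cell[records_per_cell["n_records"] < 50]
print(undersampled.sort_values("n_records").head())
```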

