Data Quality Task Group 2: Tests and Assertions

2018 ◽  
Vol 2 ◽  
pp. e25608 ◽  
Author(s):  
Lee Belbin ◽  
Arthur Chapman ◽  
John Wieczorek ◽  
Paula Zermoglio ◽  
Alex Thompson ◽  
...  

Task Group 2 of the TDWG Data Quality Interest Group aims to provide a standard suite of tests and resulting assertions that can assist with filtering occurrence records for as many applications as possible. Currently, 'data aggregators' such as the Global Biodiversity Information Facility (GBIF), the Atlas of Living Australia (ALA) and iDigBio run their own suites of tests over the records they receive and report the results of these tests (the assertions); there is, however, no standard reporting mechanism. We reasoned that the availability of an internationally agreed set of tests would encourage implementations by the aggregators, and at the data sources (museums, herbaria and others), so that issues could be detected and corrected early in the process. All the tests are limited to Darwin Core terms. The ~95 tests, refined from over 250 in use around the world, were classified into four output types: validations, notifications, amendments and measures. Validations test one or more Darwin Core terms, for example, that dwc:decimalLatitude is in a valid range (i.e. between -90 and +90 inclusive). Notifications report a status that a user of the record should know about, for example, if there is a user annotation associated with the record. Amendments are made to one or more Darwin Core terms when the information across the record can be improved, for example, if there is no value for dwc:scientificName, it can be filled in from a valid dwc:taxonID. Measures report values that may be useful for assessing the overall quality of a record, for example, the number of validation tests passed. Evaluation of the tests was complex and time-consuming, but the important parameters of each test have been consistently documented. Each test has a globally unique identifier, a label, an output type, a resource type, the Darwin Core terms used, a description, a dimension (from the Framework on Data Quality from TG1), an example, references, implementations (if any), test prerequisites and notes. For each test, generic code is being written that should be easy for institutions to implement – be they aggregators or data custodians. A valuable product of the work of TG2 has been a set of general principles. One example is "Darwin Core terms are either: literal verbatim (e.g., dwc:verbatimLocality) and cannot be assumed capable of validation, open-ended (e.g., dwc:behavior) and cannot be assumed capable of validation, or bounded by an agreed vocabulary or extents, and therefore capable of validation (e.g., dwc:countryCode)". Another is "the criteria for including tests are that they are informative, relatively simple to implement, mandatory for amendments and have power in that they will not likely result in 0% or 100% of all record hits." A third: "Do not ascribe precision where it is unknown." GBIF, the ALA and iDigBio have committed to implementing the tests once they have been finalized. We are confident that many museums and herbaria will also implement the tests over time. We anticipate that demonstration code and a test dataset that will validate the code will be available on project completion.
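To make the validation idea concrete, the following is a minimal sketch of a range check on dwc:decimalLatitude like the one described above. It assumes a record is a plain dict keyed by Darwin Core term names; the function name and the response structure (status/result strings loosely echoing the TG2 framework) are illustrative assumptions, not the TG2 reference implementation.

```python
# Hedged sketch of a validation-type test: dwc:decimalLatitude must be numeric
# and within [-90, 90]. Response vocabulary is illustrative only.

def validation_decimallatitude_inrange(record):
    """Check that dwc:decimalLatitude is present, numeric and within [-90, 90]."""
    value = record.get("decimalLatitude")
    if value is None or str(value).strip() == "":
        return {"status": "INTERNAL_PREREQUISITES_NOT_MET",
                "comment": "dwc:decimalLatitude is empty"}
    try:
        latitude = float(value)
    except ValueError:
        return {"status": "RUN_HAS_RESULT", "result": "NOT_COMPLIANT",
                "comment": "dwc:decimalLatitude is not numeric"}
    compliant = -90.0 <= latitude <= 90.0
    return {"status": "RUN_HAS_RESULT",
            "result": "COMPLIANT" if compliant else "NOT_COMPLIANT"}

print(validation_decimallatitude_inrange({"decimalLatitude": "95.3"}))  # NOT_COMPLIANT
```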

Author(s):  
Lee Belbin ◽  
Arthur Chapman ◽  
John Wieczorek ◽  
Paula Zermoglio ◽  
Paul Morris

The ‘Data Quality Tests and Assertions’ Task Group 2 (https://www.tdwg.org/community/bdq/tg-2/) has taken another year to clarify the 102 tests (https://github.com/tdwg/bdq/issues?q=is%3Aissue+is%3Aopen+label%3ATest). The original mandate, to develop a core suite of tests that could be widely applied from data collection to user evaluation of aggregated data, seemed straightforward. Two years down the track, we have proven that to be incorrect. Among the final tests are complexities that none of the core group anticipated, for example, the need for a definition of ‘empty’, or the ‘Expected response’ from a test under various scenarios. The record-based tests apply to Darwin Core terms (https://dwc.tdwg.org/terms/) and have been classified as of type validation (66), amendment (29), notification (3) or measure (5). Validations test one or more Darwin Core terms against known characteristics, for example, VALIDATION_MONTH_NOTSTANDARD. Amendments may be applied to Darwin Core terms where we can unambiguously offer an improvement to the record, for example, AMENDMENT_MONTH_STANDARDIZED. Notifications are made where we believe a flag will help alert users to an issue that needs evaluation, for example, NOTIFICATION_DATAGENERALIZATIONS_NOTEMPTY. Measures are summaries of test outcomes at the record level, for example, MEASURE_AMENDMENTS_PROPOSED. We note that 41 tests require some parameters to be established at the time of test implementation, 20 tests require access to a currently accepted vocabulary and 3 tests rely on ISO/DCMI standards. The dependency on vocabularies to circumscribe permissible values for Darwin Core terms led to the establishment, by Paula Zermoglio, of DQ Task Group 4 (https://github.com/tdwg/bdq/tree/master/Vocabularies). A vocabulary of 154 terms associated with the tests and assertions has been developed. At the time of writing this abstract, test data and a demonstration code implementation of each test are yet to be completed. We hope these will be finalized by the time of this presentation.
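As a companion to the validation sketch above, here is a minimal sketch of an amendment-type test in the spirit of AMENDMENT_MONTH_STANDARDIZED: propose a standard integer month (1–12) when dwc:month is supplied as a name or abbreviation. The lookup table, function name and response shape are illustrative assumptions, not the TG2 specification.

```python
# Hedged sketch of an amendment-type test: standardize dwc:month.
MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]
MONTHS = {name.lower(): i for i, name in enumerate(MONTH_NAMES, start=1)}
MONTHS.update({name[:3]: i for name, i in list(MONTHS.items())})  # add 3-letter forms

def amendment_month_standardized(record):
    """Propose a standardized dwc:month, or report that no change is needed/possible."""
    raw = str(record.get("month", "")).strip()
    if raw == "":
        return {"status": "INTERNAL_PREREQUISITES_NOT_MET"}
    if raw.isdigit() and 1 <= int(raw) <= 12:
        return {"status": "NOT_AMENDED", "comment": "month is already standard"}
    proposed = MONTHS.get(raw.lower().rstrip("."))
    if proposed is None:
        return {"status": "NOT_AMENDED", "comment": f"cannot interpret month '{raw}'"}
    return {"status": "AMENDED", "result": {"month": proposed}}

print(amendment_month_standardized({"month": "October"}))  # proposes month = 10
```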


Author(s):  
Arthur Chapman ◽  
Lee Belbin ◽  
Paula Zermoglio ◽  
John Wieczorek ◽  
Paul Morris ◽  
...  

The quality of biodiversity data publicly accessible via aggregators such as GBIF (Global Biodiversity Information Facility), the ALA (Atlas of Living Australia), iDigBio (Integrated Digitized Biocollections), and OBIS (Ocean Biogeographic Information System) is often questioned, especially by the research community. The Data Quality Interest Group, established by Biodiversity Information Standards (TDWG) and GBIF, has been engaged in four main activities: developing a framework for the assessment and management of data quality using a fitness-for-use approach; defining a core set of standardised tests and associated assertions based on Darwin Core terms; gathering and classifying user stories to form contextual-themed use cases, such as species distribution modelling, agrobiodiversity, and invasive species; and developing a standardised format for building and managing controlled vocabularies of values. Using the developed framework, data quality profiles have been built from use cases to represent user needs. Quality assertions can then be used to filter data suitable for a purpose. The assertions can also be used to provide feedback to data providers and custodians to assist in improving data quality at the source. A case study, using two different implementations of tests and assertions based on the Darwin Core "Event Date" terms, was also run against GBIF data to demonstrate that the tests are implementation agnostic, can be run on large aggregated datasets, and can make biodiversity data more fit for typical research uses.
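The case study's point is that an "Event Date" check can be expressed independently of any one platform. A minimal, implementation-agnostic sketch of such a check follows; the function name, the ISO 8601 requirement on a single date value, and the plausibility bounds are assumptions for illustration rather than the tests used in the study.

```python
# Hedged sketch of an eventDate check: parseable as an ISO 8601 date and
# within a plausible range. Bounds and behaviour are illustrative assumptions.
from datetime import date, datetime

def eventdate_looks_valid(event_date, earliest=date(1600, 1, 1)):
    """Return True if dwc:eventDate parses as an ISO 8601 date in a plausible range."""
    try:
        parsed = datetime.fromisoformat(str(event_date)).date()
    except ValueError:
        return False
    return earliest <= parsed <= date.today()

print(eventdate_looks_valid("2016-05-14"))   # True
print(eventdate_looks_valid("14/05/2016"))   # False: not ISO 8601
```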


2018 ◽  
Vol 2 ◽  
pp. e26369
Author(s):  
Michael Trizna

As rapid advances in sequencing technology result in more branches of the tree of life being illuminated, there has actually been a decrease in the percentage of sequence records that are backed by voucher specimens (Trizna 2018b). The good news is that there are tools (Trizna 2017, NCBI 2005, Biocode LLC 2014) that enable well-databased museum vouchers to automatically validate and format specimen and collection metadata for high-quality sequence records. Another problem is that there are millions of existing sequence records that are known to contain either incorrect or incomplete specimen data. I will show an end-to-end example of sequencing specimens from a museum, depositing their sequence records in NCBI's (National Center for Biotechnology Information) GenBank database, and then providing updates to GenBank as the museum database revises identifications. I will also talk about linking records from specimen databases. Over one million records in the Global Biodiversity Information Facility (GBIF) (Trizna 2018a) contain a value in the Darwin Core term "associatedSequences", and I will examine what is currently contained in these entries, and how best to format them to ensure that a tight connection is made to sequence records.
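As a rough illustration of the kind of inspection mentioned in the last sentence, the sketch below pulls GenBank-style accession numbers out of a dwc:associatedSequences value. The regular expression and the separator handling are assumptions for illustration, not a recommended or standard format.

```python
# Hedged sketch: extract candidate GenBank-style accessions from an
# associatedSequences value. Regex and separators are illustrative assumptions.
import re

ACCESSION = re.compile(r"\b[A-Z]{1,2}_?\d{5,8}(\.\d+)?\b")

def extract_accessions(associated_sequences):
    """Return candidate GenBank-style accessions found in an associatedSequences value."""
    candidates = []
    for part in re.split(r"[|;,\s]+", str(associated_sequences)):
        match = ACCESSION.search(part)
        if match:
            candidates.append(match.group(0))
    return candidates

print(extract_accessions(
    "GenBank: MK123456 | https://www.ncbi.nlm.nih.gov/nuccore/KX590673.1"))
# ['MK123456', 'KX590673.1']
```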


2018 ◽  
Vol 2 ◽  
pp. e25317
Author(s):  
Stijn Van Hoey ◽  
Peter Desmet

The ability to communicate and assess the quality and fitness for use of data is crucial to ensure maximum utility and re-use. Data consumers have certain requirements for the data they seek and need to be able to check if a data set conforms to these requirements. Data publishers aim to provide data with the highest possible quality and need to be able to identify potential errors that can be addressed with the information available at hand. The development and adoption of data publication guidelines is one approach to define and meet those requirements. However, the use of a guideline, the mapping decisions, and the requirements a dataset is expected to meet are generally not communicated with the provided data. Moreover, these guidelines are typically intended for humans only. In this talk, we will present 'whip': a proposed syntax for data specifications. With whip, one can define column-based constraints for tabular (tidy) data using a number of rules, e.g. how data is structured following Darwin Core, how a term uses controlled vocabulary values, or what the expected minimum and maximum values are. These rules are human- and machine-readable, which communicates the specifications and allows them to be validated automatically in pipelines for data publication and quality assessment, such as Kurator. Whip can be formatted as a (YAML) text file that can be provided with the published data, communicating the specifications a dataset is expected to meet. The scope of these specifications can be specific to a dataset, but they can also be used to express the expected data quality and fitness for use of a publisher, consumer or community, allowing bottom-up and top-down adoption. As such, these specifications are complementary to the core set of data quality tests currently under development by the TDWG Biodiversity Data Quality Task Group 2. Whip rules are currently generic, but more specific ones can be defined to address requirements for biodiversity information.
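To give a feel for column-based constraints of the kind whip expresses, here is a small sketch in Python. The real whip specification is a YAML file; the rule names used below (allowed, min, max, empty) and the checking logic are simplified assumptions, not the published whip vocabulary or implementation.

```python
# Hedged sketch of whip-style column constraints checked against one record.
specification = {
    "basisOfRecord": {"allowed": ["HumanObservation", "PreservedSpecimen"], "empty": False},
    "individualCount": {"min": 1, "max": 100},
}

def check_record(record, spec):
    """Return (term, message) tuples for values that break the specification."""
    problems = []
    for term, rules in spec.items():
        value = record.get(term, "")
        if not rules.get("empty", True) and str(value).strip() == "":
            problems.append((term, "must not be empty"))
            continue
        if "allowed" in rules and value not in rules["allowed"]:
            problems.append((term, f"'{value}' not in controlled vocabulary"))
        if "min" in rules or "max" in rules:
            try:
                number = float(value)
            except (TypeError, ValueError):
                problems.append((term, f"'{value}' is not numeric"))
                continue
            if "min" in rules and number < rules["min"]:
                problems.append((term, f"{value} below minimum {rules['min']}"))
            if "max" in rules and number > rules["max"]:
                problems.append((term, f"{value} above maximum {rules['max']}"))
    return problems

print(check_record({"basisOfRecord": "Specimen", "individualCount": "250"}, specification))
```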


Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Lien Reyserhove ◽  
Peter Desmet ◽  
Damiano Oldoni ◽  
Tim Adriaens ◽  
Diederik Strubbe ◽  
...  

Species checklists are a crucial source of information for research and policy. Unfortunately, many traditional species checklists vary wildly in their content, format, availability and maintenance. The fact that these are not open, findable, accessible, interoperable and reusable (FAIR) severely hampers fast and efficient information flow to policy and decision-making that are required to tackle the current biodiversity crisis. Here, we propose a reproducible, semi-automated workflow to transform traditional checklist data into a FAIR and open species registry. We showcase our workflow by applying it to the publication of the Manual of Alien Plants, a species checklist specifically developed for the Tracking Invasive Alien Species (TrIAS) project. Our approach combines source data management, reproducible data transformation to Darwin Core using R, version control, data documentation and publication to the Global Biodiversity Information Facility (GBIF). This checklist publication workflow is openly available for data holders and applicable to species registries varying in thematic, taxonomic or geographical scope and could serve as an important tool to open up research and strengthen environmental decision-making.
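The TrIAS workflow itself is written in R; the following is only a hedged, minimal sketch of the central transformation step it describes (mapping a source checklist row to Darwin Core Taxon terms before publication to GBIF). The source column names and the dataset identifier are invented for illustration.

```python
# Hedged sketch of a checklist-row to Darwin Core Taxon mapping.
def map_to_dwc_taxon(source_row, dataset_id="alien-plants-checklist"):
    """Map one source checklist row to a Darwin Core Taxon record."""
    return {
        "taxonID": f"{dataset_id}:taxon:{source_row['id']}",   # stable, dataset-scoped ID
        "scientificName": source_row["latin_name"].strip(),
        "kingdom": "Plantae",
        "taxonRank": source_row.get("rank", "species"),
        "nomenclaturalCode": "ICN",
    }

print(map_to_dwc_taxon(
    {"id": "42", "latin_name": " Heracleum mantegazzianum ", "rank": "species"}))
```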


ZooKeys ◽  
2018 ◽  
Vol 751 ◽  
pp. 129-146 ◽  
Author(s):  
Robert Mesibov

A total of ca. 800,000 occurrence records from the Australian Museum (AM), Museums Victoria (MV) and the New Zealand Arthropod Collection (NZAC) were audited for changes in selected Darwin Core fields after processing by the Atlas of Living Australia (ALA; for AM and MV records) and the Global Biodiversity Information Facility (GBIF; for AM, MV and NZAC records). Formal taxon names in the genus- and species-groups were changed in 13–21% of AM and MV records, depending on dataset and aggregator. There was little agreement between the two aggregators on processed names: names were changed in two to three times as many records by one aggregator alone as in records with names changed by both aggregators. The type status of specimen records did not change with name changes, resulting in confusion as to the name with which a type was associated. Data losses of up to 100% were found in some fields after processing, apparently due to programming errors. The taxonomic usefulness of occurrence records could be improved if aggregators included both the original and the processed taxonomic data items for each record. It is recommended that end-users check original and processed records for data loss and name replacements after processing by aggregators.
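For readers wanting to run the recommended check themselves, here is a hedged sketch of the kind of audit described above: compare selected Darwin Core fields in a provider's original record with the aggregator-processed version and flag replacements and losses. The field list and the example record values are assumptions for illustration, not the paper's audit procedure.

```python
# Hedged sketch: field-by-field comparison of original vs aggregator-processed records.
AUDITED_FIELDS = ["scientificName", "genus", "family", "typeStatus", "locality"]

def audit_record(original, processed, fields=AUDITED_FIELDS):
    """Return per-field differences between an original and a processed record."""
    changes = {}
    for field in fields:
        before = (original.get(field) or "").strip()
        after = (processed.get(field) or "").strip()
        if before and not after:
            changes[field] = ("DATA LOST", before, after)
        elif before != after:
            changes[field] = ("CHANGED", before, after)
    return changes

print(audit_record({"scientificName": "Ommatoiulus moreleti", "locality": "Hobart"},
                   {"scientificName": "Ommatoiulus moreletii", "locality": ""}))
```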


2018 ◽  
Vol 2 ◽  
pp. e26083
Author(s):  
Teresa Mayfield

At an institution without a permanent collections manager or curators, who has time to publish data or research issues with that data? Collections with little or no institutional support often benefit from passionate volunteers who continually seek ways to keep them relevant. The University of Texas at El Paso Biodiversity Collections (UTEP-BC) has been cared for in this manner by a small group of dedicated faculty and emeritus curators who have managed, with no budget, to care for the specimens, perform and publish research about them, and publish a good portion of the collections data. An IMLS grant allowed these dedicated volunteers to hire a Collections Manager who would migrate the already published data from the collections and add unpublished specimen records from the in-house developed FileMaker Pro database to a new collection management system (Arctos) that would allow for better records management and ease of publication. Arctos is a publicly searchable web-based system, but most collections also see the benefit of participation with biodiversity data aggregators such as the Global Biodiversity Information Facility (GBIF), iDigBio, and a multitude of discipline-specific aggregators. Publication of biodiversity data to aggregators is loaded with hidden pathways, acronyms, and tech-speak with which a curator, registrar, or collections manager may not be familiar. After navigating the process to publish the data, the reward is feedback! Now data can be improved, and everyone wins, right? In the case of UTEP-BC data, the feedback sits idle as the requirements of the grant under which the Collections Manager was hired take precedence. It will likely remain buried until long after the grant has run its course. Fortunately, the selection of Arctos as a collection management system allowed the UTEP-BC Collections Manager to confer with others publishing biodiversity data to the data aggregators. Members of the Arctos Community have carried on multiple conversations about publishing to aggregators and how to handle the resulting data quality flags. These conversations provide a synthesis of the challenges experienced by collections in over 20 institutions when publishing biodiversity data to aggregators and responding (or not) to their data quality flags. This presentation will cover the experiences and concerns of one Collections Manager as well as those of the Arctos Community related to publishing data to aggregators, deciphering their data quality flags, and developing appropriate responses to those flags.


Author(s):  
David Shorthouse ◽  
Roderic Page

Through the Bloodhound proof-of-concept (https://bloodhound-tracker.net), an international audience of collectors and determiners of natural history specimens is engaged in the emotive act of claiming their specimens and attributing other specimens to living and deceased mentors and colleagues. Behind the scenes, these claims build links between Open Researcher and Contributor Identifiers (ORCID, https://orcid.org) or Wikidata identifiers for people and Global Biodiversity Information Facility (GBIF) specimen identifiers, predicated on the Darwin Core terms recordedBy (collected) and identifiedBy (determined). Here we additionally describe the socio-technical challenge of unequivocally resolving people names in legacy specimen data and propose lightweight and reusable solutions. The unique identifiers for the affiliations of active researchers are obtained from ORCID, whereas the unique identifiers for institutions where specimens are actively curated are resolved through Wikidata. By constructing closed loops of links between person, specimen, and institution, an interesting suite of potential metrics emerges, all due to the activities of employees and their network of professional relationships. This approach balances a desire for individuals to receive formal recognition for their efforts in natural history collections with an institutional-level need to alter budgets in response to easily obtained numeric trends in national and international reach. If handled in a coordinated fashion, this reporting technique may be a significant new driver for specimen digitization efforts, on par with Altmetric (https://www.altmetric.com), an important new tool that tracks the impact of publications and delights administrators and authors alike.
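One way to picture the "closed loop" of links described above is as a simple triple of person, specimen and institution identifiers tied to the Darwin Core role that generated the claim. The sketch below is a hedged illustration only; all identifiers are invented placeholders and the metric shown is just one example of the kind the abstract hints at.

```python
# Hedged sketch of person-specimen-institution link records and a toy metric.
from dataclasses import dataclass

@dataclass
class Attribution:
    person_id: str        # e.g. an ORCID iD or Wikidata Q-number (placeholder below)
    specimen_id: str      # e.g. a GBIF occurrence identifier (placeholder below)
    institution_id: str   # e.g. the Wikidata Q-number of the curating institution
    dwc_role: str         # "recordedBy" (collected) or "identifiedBy" (determined)

claims = [
    Attribution("https://orcid.org/0000-0000-0000-0000",
                "https://www.gbif.org/occurrence/123456789",
                "Q000000", "identifiedBy"),
]

# Example metric: number of attributed specimens per institution.
per_institution = {}
for claim in claims:
    per_institution[claim.institution_id] = per_institution.get(claim.institution_id, 0) + 1
print(per_institution)
```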


2018 ◽  
Vol 2 ◽  
pp. e25738 ◽  
Author(s):  
Arturo Ariño ◽  
Daniel Noesgaard ◽  
Angel Hjarding ◽  
Dmitry Schigel

Standards set up by Biodiversity Information Standards-Taxonomic Databases Working Group (TDWG), initially developed as a way to share taxonomic data, greatly facilitated the establishment of the Global Biodiversity Information Facility (GBIF) as the largest index to digitally-accessible primary biodiversity information records (PBR) held by many institutions around the world. The level of detail and coverage of the body of standards that later became the Darwin Core terms enabled increasingly precise retrieval of relevant records useful for increasing digitally-accessible knowledge (DAK), which, in turn, may have helped answer ecologically relevant questions. After more than a decade of data accrual and release, an increasing number of papers and reports cite GBIF either as a source of data or as a pointer to the original datasets. GBIF has curated a list of over 5,000 such citations, which were examined for content and tagged with additional keywords describing that content. The list now provides a window on what users want to accomplish using such DAK. We performed a preliminary word frequency analysis of this literature, starting with the titles of works that refer to GBIF as a resource. Through a standardization and mapping of terms, we examined how the facility-enabled data seem to have been used by scientists and other practitioners through time: what concepts/issues are pervasive, which taxon groups are mostly addressed, and whether data concentrate around specific geographical or biogeographical regions. We hoped to cast light on which types of ecological problems the community believes are amenable to study through the judicious use of this data commons and found that, indeed, a few themes were distinctly more frequently mentioned than others. Among those, generally perceived issues such as climate change and its effect on biodiversity at global and regional scales seemed prevalent. The taxonomic groups were also unevenly mentioned, with birds and plants being the most frequently named. However, the entire list of potential subjects that might have used GBIF-enabled data is now quite wide, showing that the availability of well-structured data has spawned a widening spectrum of possible use cases. Among them, some enjoy early and continuous presence (e.g. species, biodiversity, climate) while others have started to show up only later, once a critical mass of data seemed to have been attained (e.g. ecosystems, suitability, endemism). Biodiversity information in the form of standards-compliant DAK may thus already have become a commodity enabling insight into an increasingly more complex and diverse body of science. Paraphrasing Tennyson, more things were wrought by data than TDWG dreamt of.
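The analysis described above is essentially a word-frequency count over titles after normalization. A minimal sketch of that idea follows; the stop-word list and the example titles are illustrative assumptions, not the study's corpus or method.

```python
# Hedged sketch of a word-frequency analysis over titles of GBIF-citing works.
from collections import Counter
import re

STOP_WORDS = {"the", "of", "and", "in", "a", "for", "on", "to", "with", "under"}

def title_word_frequencies(titles):
    """Count lower-cased words in titles, skipping a small stop-word list."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts

titles = ["Climate change and the distribution of invasive species",
          "Species distribution modelling of birds under climate change"]
print(title_word_frequencies(titles).most_common(5))
```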


Author(s):  
Yanina Sica ◽  
Paula Zermoglio

Biodiversity inventories, i.e., recording multiple species at a specific place and time, are routinely performed and offer high-quality data for characterizing biodiversity and its change. Digitization, sharing and reuse of incidental point records (i.e., records that are not readily associated with systematic sampling or monitoring, typically museum specimens and many observations from citizen science projects) has been the focus for many years in the biodiversity data community. Only more recently has attention been directed towards mobilizing data from both new and longstanding inventories and monitoring efforts. These kinds of studies provide very rich data that can enable inferences about species absence, but their reliability depends on the methodology implemented, the survey effort and completeness. The information about these elements has often been regarded as metadata and captured in an unstructured manner, thus making its full use very challenging. Unlocking and integrating inventory data requires data standards that can facilitate capture and sharing of data with the appropriate depth. The Darwin Core standard (Wieczorek et al. 2012) currently enables reporting some of the information contained in inventories, particularly using Darwin Core Event terms such as samplingProtocol, sampleSizeValue, sampleSizeUnit and samplingEffort. However, it is limited in its ability to accommodate spatial, temporal, and taxonomic scopes, and other key aspects of the inventory sampling process, such as direct or inferred measures of sampling effort and completeness. The lack of a standardized way to share inventory data has hindered their mobilization, integration, and broad reuse. In an effort to overcome these limitations, a framework was developed to standardize inventory data reporting: Humboldt Core (Guralnick et al. 2018). Humboldt Core identified three types of inventories (single, elementary, and summary inventories) and proposed a series of terms to report their content. These terms were organized in six categories: dataset and identification; geospatial and habitat scope; temporal scope; taxonomic scope; methodology description; and completeness and effort. Although Humboldt Core was originally planned as a new TDWG standard, and is currently implemented in Map of Life (https://mol.org/humboldtcore/), ratification was not pursued at the time, thus limiting broader community adoption. In 2021 the TDWG Humboldt Core Task Group was established to review how to best integrate the terms proposed in the original publication with existing standards and implementation schemas. The first goal of the task group was to determine whether a new, separate standard was needed or if an extension to Darwin Core could accommodate the terms necessary to describe the relevant information elements. Since the different types of inventories can be thought of as Events with different nesting levels (events within events, e.g., plots within sites), and after an initial mapping to existing Darwin Core terms, it was deemed appropriate to start from a Darwin Core Event Core and build an extension to include Humboldt Core terms. The task group members are currently revising all original Humboldt Core terms, reformulating definitions, comments, and examples, and discarding or adding new terms where needed. We are also gathering real datasets to test the use of the extension once an initial list of revised terms is ready, before undergoing a public review period as established by the TDWG process.
Through the ratification of Humboldt Core as a TDWG extension, we expect to provide the community with a solution to share and use inventory data, which improves biodiversity data discoverability, interoperability and reuse while lowering the reporting burden at different levels (data collection, integration and sharing).
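To make the Event-nesting idea tangible, here is a hedged sketch of the shape such inventory data might take: a Darwin Core Event with standard sampling terms, a nested child plot event, and a few scope/effort fields of the kind Humboldt Core proposes. The extension term names shown are placeholders; the task group is still revising the actual terms.

```python
# Hedged sketch of a nested Event record with Humboldt-style scope fields.
site_event = {
    "eventID": "site-001",
    "samplingProtocol": "point count",
    "sampleSizeValue": 10,
    "sampleSizeUnit": "minutes",
    "samplingEffort": "2 observers x 10 minutes",
    # Illustrative, Humboldt-style scope/completeness fields (not ratified terms):
    "taxonomicScope": "Aves",
    "isSamplingEffortReported": True,
    "isAbsenceReported": True,
}

plot_event = {
    "eventID": "site-001-plot-A",
    "parentEventID": "site-001",   # events nested within events (plots within sites)
    "samplingProtocol": "point count",
}

print(site_event["eventID"], "->", plot_event["eventID"])
```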

