A community-developed extension to Darwin Core for reporting the chronometric age of specimens

2021
Author(s):
Laura Brenskelle
John Wieczorek
Edward Davis
Kitty Emery
Neill J. Wallis
...  

Darwin Core, the data standard used for sharing modern biodiversity and paleodiversity occurrence records, has previously lacked proper mechanisms for reporting what is known about the estimated age range of specimens from deep time. This has led data providers to put these data in fields where users cannot easily find them, which impedes the reuse and improvement of these data by other researchers. Here we describe the development of the Chronometric Age Extension to Darwin Core, a ratified, community-developed extension that enables the reporting of ages of specimens from deep time and the evidence supporting these estimates. The extension standardizes reporting about the methods or assays used to determine an age and other critical information such as uncertainty. It gives data providers flexibility about the level of detail reported, focusing on the minimum information needed for reuse while still allowing for significant detail if providers have it. Providing a standardized format for reporting these data will make them easier to find and search and enable researchers to pinpoint specimens of interest for data improvement or accumulate more data for broad temporal studies. The Chronometric Age Extension was also the first community-managed vocabulary to undergo the new Biodiversity Information Standards (TDWG) review and ratification process, thus providing a blueprint for future Darwin Core extension development.
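
To make the shape of such records concrete, the sketch below shows a single chronometric age expressed as term/value pairs. The term names are drawn from the ratified extension, but the values and the Python representation are invented for illustration; the published term list remains the authoritative reference.

```python
# A hypothetical chronometric age record, expressed as term/value pairs
# from the Chronometric Age Extension. All values are invented.
chronometric_age = {
    "chronometricAgeProtocol": "radiocarbon AMS",      # method/assay used
    "earliestChronometricAge": 1550,                   # older bound of the range
    "earliestChronometricAgeReferenceSystem": "BP",
    "latestChronometricAge": 1420,                     # younger bound of the range
    "latestChronometricAgeReferenceSystem": "BP",
    "chronometricAgeUncertaintyInYears": 30,           # reported uncertainty
    "materialDated": "charcoal",
}
```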

ZooKeys
2018
Vol 751
pp. 129-146
Author(s):  
Robert Mesibov

A total of ca 800,000 occurrence records from the Australian Museum (AM), Museums Victoria (MV) and the New Zealand Arthropod Collection (NZAC) were audited for changes in selected Darwin Core fields after processing by the Atlas of Living Australia (ALA; for AM and MV records) and the Global Biodiversity Information Facility (GBIF; for AM, MV and NZAC records). Formal taxon names in the genus- and species-groups were changed in 13–21% of AM and MV records, depending on dataset and aggregator. There was little agreement between the two aggregators on processed names: two to three times as many records had names changed by one aggregator alone as had names changed by both. The type status of specimen records did not change when names changed, creating confusion as to which name a type was associated with. Data losses of up to 100% were found in some fields after processing, apparently due to programming errors. The taxonomic usefulness of occurrence records could be improved if aggregators included both the original and the processed taxonomic data items for each record. It is recommended that end-users check original and processed records for data loss and name replacements after processing by aggregators.
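
The recommended end-user check can be done with very little tooling. The sketch below compares selected Darwin Core fields between a provider's original record and an aggregator's processed copy; the records, field list, and values are invented examples, not data from the audit.

```python
# Hypothetical original vs. aggregator-processed record: report name
# replacements and field-level data loss. All values are invented.
original = {"genus": "Tasmaniosoma", "specificEpithet": "armatum",
            "typeStatus": "holotype", "locality": "Mt Example, Tasmania"}
processed = {"genus": "Tasmaniosoma", "specificEpithet": "compitale",
             "typeStatus": "holotype", "locality": ""}

for field in ("genus", "specificEpithet", "typeStatus", "locality"):
    before, after = original.get(field), processed.get(field)
    if before and not after:
        print(f"{field}: data lost in processing (was {before!r})")
    elif before != after:
        print(f"{field}: {before!r} -> {after!r}")
```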


Author(s):  
José Augusto Salim
Antonio Saraiva

For biologists and biodiversity data managers who are unfamiliar with the data standardization practices of information science, the complex software needed to create standardized datasets can be a barrier to sharing data. Since the ratification of the Darwin Core Standard (DwC) (Darwin Core Task Group 2009) by Biodiversity Information Standards (TDWG) in 2009, many datasets have been published and shared through a variety of data portals. In the early stages of biodiversity data sharing, the protocol Distributed Generic Information Retrieval (DiGIR), progenitor of DwC, and later the protocols BioCASe and TDWG Access Protocol for Information Retrieval (TAPIR) (De Giovanni et al. 2010) were introduced for discovery, search and retrieval of distributed data, simplifying data exchange between information systems. Although these protocols are still in use, they are known to be inefficient for transferring large amounts of data (GBIF 2017). Because of that, in 2011 the Global Biodiversity Information Facility (GBIF) introduced the Darwin Core Archive (DwC-A), which allows more efficient data transfer and has become the preferred format for publishing data in the GBIF network. DwC-A is a structured collection of text files that uses DwC terms to produce a single, self-contained dataset. Many tools for assisting data sharing using DwC-A have been introduced, such as the Integrated Publishing Toolkit (IPT) (Robertson et al. 2014), the Darwin Core Archive Assistant (GBIF 2010) and the Darwin Core Archive Validator. Despite promoting and facilitating data sharing, such tools are difficult for many users, mainly because of the lack of information science training in the biodiversity curriculum (Convention on Biological Diversity 2012, Enke et al. 2012). Most users are, however, very familiar with spreadsheets for storing and organizing their data, yet adopting the available solutions requires data transformation and training in information science and, more specifically, biodiversity informatics. For an example of how spreadsheets can simplify data sharing, see Stoev et al. (2016). In order to provide a more "familiar" approach to data sharing using DwC-A, we introduce a new tool as a Google Sheets Add-on. The Add-on, called the Darwin Core Archive Assistant Add-on, can be installed in the user's Google Account from the G Suite Marketplace and used in conjunction with the Google Sheets application. The Add-on assists in mapping spreadsheet columns/fields to DwC terms (Fig. 1), similar to IPT, but with the advantage that it does not require the user to export the spreadsheet and import it into other software. Additionally, the Add-on facilitates the creation of a star schema in accordance with DwC-A, through the definition of a "CORE_ID" (e.g. occurrenceID, eventID, taxonID) field between sheets of a document (Fig. 2). The Add-on also provides an Ecological Metadata Language (EML) (Jones et al. 2019) editor (Fig. 3) with minimal fields to be filled in (i.e., the mandatory fields required by IPT), and helps users generate and share DwC-Archives stored in the user's Google Drive, which can be downloaded as a DwC-A or automatically uploaded to another public storage resource such as the user's Zenodo account (Fig. 4).
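
For readers unfamiliar with the archive format itself, the sketch below assembles a minimal DwC-A by hand, which is essentially what the Add-on automates: a zip containing a data table plus a meta.xml that maps each column to a Darwin Core term URI. File names, columns, and values are illustrative; a production archive would normally also carry an eml.xml metadata document.

```python
# Minimal hand-rolled Darwin Core Archive: one occurrence table and the
# meta.xml descriptor that maps its columns to DwC term URIs.
import zipfile

occurrence_tsv = (
    "occurrenceID\tscientificName\teventDate\n"
    "urn:example:occ:1\tPuma concolor\t2021-05-04\n"
)

meta_xml = """<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        ignoreHeaderLines="1"
        rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
  </core>
</archive>
"""

with zipfile.ZipFile("dwca.zip", "w") as z:
    z.writestr("occurrence.txt", occurrence_tsv)
    z.writestr("meta.xml", meta_xml)  # eml.xml omitted in this sketch
```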
We expect that the Google Sheets Add-on introduced here, in conjunction with IPT, will promote biodiversity data sharing in a standardized format, as it requires minimal training and simplifies the process of data sharing from the user's perspective, particularly for users who are not familiar with IPT but have historically worked with spreadsheets. Although a DwC-A generated by the Add-on still needs to be published using IPT, the Add-on provides a simpler interface (i.e., a spreadsheet) for mapping datasets to DwC than IPT does. Even though IPT includes many more features than the Darwin Core Archive Assistant Add-on, we expect that the Add-on can be a "starting point" for users unfamiliar with biodiversity informatics before they move on to more advanced data publishing tools. On the other hand, Zenodo integration allows users to share and cite their standardized datasets without publishing them via IPT, which can be useful for users without access to an IPT installation. We are working on new features, and future releases will include automatic generation of globally unique identifiers for shared records, support for additional data standards and DwC extensions, and integration with the GBIF and IPT REST APIs.


Author(s):  
Peter Brenton

The Humboldt extension to the Darwin Core standard's Event Core has been proposed to provide a standard framework for capturing important information about the context in which biodiversity occurrence observations and samples are recorded. This information includes methods and effort, which are critical for determining species abundance and other measures of population dynamics, as well as completeness of survey coverage. As this set of terms is being developed, we are using real-world use cases to ensure that the terms can address all known situations. We are also considering approaches to implementation of the new standard to maximise opportunities for uptake and adoption. In this presentation I provide an example of how the Humboldt extension will be implemented in the Atlas of Living Australia's (ALA) BioCollect application. BioCollect is a cloud-based multi-project platform for all types of biodiversity and ecological field data collection and is particularly suited to capturing fully described, complex, protocol-based systematic surveys. For example, BioCollect supports a wide array of customised survey event-based data schemas, which can be configured for different kinds of stratified (and other) sampling protocols. These schemas can record sampling effort at the event level, and event effort can be aggregated across a dataset to provide a calculated measure of effort for the whole dataset. Such data-driven approaches to providing useful dataset-level metadata can also be applied to measures of taxonomic completeness as well as spatial and temporal coverage. In addition, BioCollect automatically parses biodiversity occurrence records from event records for harvest by the ALA. In this process, the semantic relationship between the occurrence records and their respective event records is preserved, and linkages between them enable cross-navigation for improved contextual interpretation. The BioCollect application demonstrates one approach to a practical implementation of the Humboldt extension.
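
As a toy illustration of that roll-up (field names here are invented placeholders, not the extension's actual vocabulary), event-level effort values can be summed into a dataset-level figure:

```python
# Hypothetical event records with event-level sampling effort;
# aggregating them yields a dataset-level effort measure.
events = [
    {"eventID": "e1", "samplingEffortValue": 2.0, "samplingEffortUnit": "trap-nights"},
    {"eventID": "e2", "samplingEffortValue": 3.5, "samplingEffortUnit": "trap-nights"},
    {"eventID": "e3", "samplingEffortValue": 1.5, "samplingEffortUnit": "trap-nights"},
]

total = sum(e["samplingEffortValue"] for e in events)
print(f"dataset-level effort: {total} trap-nights")  # 7.0 trap-nights
```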


2017
Vol 1
pp. e20126
Author(s):  
Laura Brenskelle
John Wieczorek
Robert Guralnick
Kitty Emery
Michelle LeFebvre

2018
Vol 2
pp. e25990
Author(s):
Manuel Vargas
María Mora
William Ulate
José Cuadra

The Atlas of Living Costa Rica (http://www.crbio.cr/) is a biodiversity data portal, based on the Atlas of Living Australia (ALA), which provides integrated, free, and open access to data and information about Costa Rican biodiversity in order to support science, education, and conservation. It is managed by the Biodiversity Informatics Research Center (CRBio) and the National Biodiversity Institute (INBio). Currently, the Atlas of Living Costa Rica includes nearly 8 million georeferenced species occurrence records, mediated by the Global Biodiversity Information Facility (GBIF), which come from more than 900 databases and have been published by research centers in 36 countries. Half of those records are published by Costa Rican institutions. In addition, CRBio is making a special effort to enrich and share more than 5000 species pages, developed by INBio, about Costa Rican vertebrates, arthropods, molluscs, nematodes, plants and fungi. These pages contain information elements pertaining to, for instance, morphological descriptions, distribution, habitat, conservation status, management, nomenclature and multimedia. This effort is aligned with collaborations established by Costa Rica with other countries, such as Spain, Mexico, Colombia and Brazil, to standardize this type of information through Plinian Core (https://github.com/PlinianCore), a set of vocabulary terms that can be used to describe different aspects of biological species. The Biodiversity Information Explorer (BIE) is one of the modules made available by ALA; it indexes taxonomic and species content and provides a search interface for it. We will present how CRBio is implementing BIE as part of the Atlas of Living Costa Rica in order to share all the information elements contained in the Costa Rican species pages.


2018
Vol 2
pp. e27087
Author(s):
Donald Hobern
Andrea Hahn
Tim Robertson

For more than a decade, the biodiversity informatics community has recognised the importance of stable, resolvable identifiers to enable unambiguous references to data objects and the associated concepts and entities, including museum/herbarium specimens and, more broadly, all records serving as evidence of species occurrence in time and space. Early efforts built on the Darwin Core institutionCode, collectionCode and catalogNumber terms, treated as a triple and expected to uniquely identify a specimen. Following a review of current technologies for globally unique identifiers, TDWG adopted Life Science Identifiers (LSIDs) (Pereira et al. 2009). Unfortunately, the key stakeholders in the LSID consortium soon withdrew support for the technology, leaving TDWG committed to a moribund technology. Subsequently, publishers of biodiversity data have adopted a range of technologies to provide unique identifiers, including (among others) HTTP Uniform Resource Identifiers (URIs), Universally Unique Identifiers (UUIDs), Archival Resource Keys (ARKs), and Handles. Each of these technologies has merit, but they do not provide consistent guarantees of persistence or resolvability. More importantly, the heterogeneity of these solutions hampers delivery of services that can treat all of these data objects as part of a consistent linked-open-data domain. The geoscience community has established the System for Earth Sample Registration (SESAR), which enables collections to publish standard metadata records for their samples and to associate each of these with an International Geo Sample Number (IGSN, http://www.geosamples.org/igsnabout). IGSNs follow a standard format, distribute responsibility for uniqueness between SESAR and the publishing collections, and support resolution via HTTP URI or Handles. Each IGSN resolves to a standard metadata page, roughly equivalent in detail to a Darwin Core specimen record. The standardisation of identifiers has allowed the community to secure support from some journal publishers for promotion and use of IGSNs within articles. The biodiversity informatics community encompasses a much larger number of publishers and greater pre-existing variation in identifier formats. Nevertheless, it would be possible to deliver a shared global identifier scheme with the same features as IGSNs by building on the aggregation services offered by the Global Biodiversity Information Facility (GBIF). The GBIF data index includes normalised Darwin Core metadata for all data records from registered data sources and could serve as a platform for resolution of HTTP URIs and/or Handles for all specimens and all occurrence records. The most significant trade-off requiring consideration would be between autonomy for collections and other publishers in how they format identifiers within their own data, and the benefits that may arise from greater consistency and predictability in the form of resolvable identifiers.
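
The contrast at the heart of this proposal can be sketched in a few lines. The "triplet" below is unique only by convention and cannot be dereferenced, whereas an aggregator-assigned key yields a resolvable HTTP URI (the GBIF occurrence URL pattern is real; the key and triplet values are placeholders):

```python
# Legacy Darwin Core "triplet" vs. a resolvable HTTP URI.
institution_code = "AM"          # institutionCode (placeholder)
collection_code = "Herpetology"  # collectionCode (placeholder)
catalog_number = "R.12345"       # catalogNumber (placeholder)

# Unique by convention only; not resolvable on its own.
triplet = f"{institution_code}:{collection_code}:{catalog_number}"

# A GBIF-indexed record gets a numeric key, giving a stable, resolvable URI.
gbif_key = 1234567890            # placeholder key
uri = f"https://www.gbif.org/occurrence/{gbif_key}"

print(triplet, "->", uri)
```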


2018
Vol 2
pp. e25728
Author(s):
Abigail Benson
Ward Appeltans
Lenore Bajona
Samuel Bosch
Paul Cowley
...  

The Ocean Biogeographic Information System (OBIS) began in 2000 as the repository for data from the Census of Marine Life. Since that time, OBIS has expanded its goals beyond simply hosting data to supporting more aspects of marine conservation (De Pooter et al. 2017). In order to accomplish those goals, the OBIS secretariat, in partnership with its European node (EurOBIS) hosted at the Flanders Marine Institute (VLIZ, Belgium) and the Intergovernmental Oceanographic Commission (IOC) Committee on International Oceanographic Data and Information Exchange (IODE, 23rd session, March 2015, Brugge), established a 2-year pilot project to address a particularly problematic issue: environmental data collected as part of marine biological research were being disassociated from the biological data. OBIS-Event-Data is the solution developed from that pilot project, which devised a method for keeping environmental data together with the biological data (De Pooter et al. 2017). OBIS is seeking early adopters of the new OBIS-Event-Data standard from among the marine biodiversity monitoring communities, to further validate the data standard and to develop data products and scientific applications supporting the enhancement of Biological and Ecosystem Essential Ocean Variables (EOVs) in the framework of the Global Ocean Observing System (GOOS) and the Marine Biodiversity Observation Network of the Group on Earth Observations (GEO BON MBON). After the successful 2-year IODE pilot project OBIS-ENV-DATA, the IOC established a new 2-year IODE pilot project, OBIS-Event-Data for Scientific Applications (2017-2019). The OBIS-Event-Data standard, building on Darwin Core, provides a technical solution for combined biological and environmental data and incorporates details about sampling methods and effort, including event hierarchy. It also standardizes the parameters involved in biological, environmental, and sampling details using an international standard controlled vocabulary (British Oceanographic Data Centre, Natural Environment Research Council). A workshop organized by IODE/OBIS in April brought together major animal tagging and tracking networks, such as the Ocean Tracking Network (OTN), the Animal Telemetry Network (ATN), the Integrated Marine Observing System (IMOS), the European Tracking Network (ETN) and the Acoustic Tracking Array Platform (ATAP), to test the OBIS-Event-Data standard through the development of data products and science applications. The workshop also contributed to the further maturation of the GOOS EOV on fish, as well as the EOV on birds, mammals and turtles. We will present the outcomes and lessons learned from this workshop on problems, solutions, and applications of using Darwin Core/OBIS-Event-Data for bio-logging data.
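
The core of the OBIS-Event-Data pattern is an Event core with Occurrence and ExtendedMeasurementOrFact records hanging off it, so that environmental measurements and sampling effort stay linked to the biology. The sketch below shows that linkage with invented IDs and values; real datasets would use controlled-vocabulary URIs for the measurement types.

```python
# Sketch of OBIS-Event-Data linkage: one event, one occurrence nested
# under it, and measurement-or-fact rows tied to the same eventID.
event = {
    "eventID": "cruise1:station4",
    "eventDate": "2017-06-01",
    "samplingProtocol": "bongo net tow",
}

occurrence = {
    "eventID": "cruise1:station4",            # link back to the event
    "occurrenceID": "cruise1:station4:occ1",
    "scientificName": "Calanus finmarchicus",
}

measurements = [
    # environmental measurement attached at the event level
    {"eventID": "cruise1:station4", "measurementType": "water temperature",
     "measurementValue": "8.2", "measurementUnit": "degrees Celsius"},
    # sampling effort, also attached at the event level
    {"eventID": "cruise1:station4", "measurementType": "tow duration",
     "measurementValue": "10", "measurementUnit": "minutes"},
]
```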


Author(s):  
Ian Engelbrecht
Hester Steyn

RESTful APIs (REpresentational State Transfer Application Programming Interfaces) are the most commonly used mechanism for biodiversity informatics databases to provide open access to their content. In its simplest form, an API provides an interface based on the HTTP protocol whereby any client can perform an action on a data resource identified by a URL, using an HTTP verb (GET, POST, PUT, DELETE) to specify the intended action. For example, a GET request to a particular URL (informally called an endpoint) will return data to the client, typically in JSON format, which the client converts to the format it needs. A client can be either custom-written software or a commonly used program for data analysis such as R (programming language), Microsoft Excel (everybody's favorite data management tool), OpenRefine, or business intelligence software. APIs are therefore a valuable mechanism for making biodiversity data FAIR (findable, accessible, interoperable, reusable). There is currently no standard specifying how RESTful APIs should be designed, resulting in a variety of URL and response data formats across APIs. This presents a challenge for API users who are not technically proficient or familiar with programming if they have to work with many different and inconsistent data sources. We undertook a brief review of eight existing APIs that provide data about taxa, to assess consistency and the extent to which the Darwin Core standard (Wieczorek et al. 2021) for data exchange is applied. We assessed each API based on aspects of URL construction and the format of the response data (Fig. 1). While only cursory and limited in scope, our survey suggests that consistency across APIs is low. For example, some APIs use nouns for their endpoints (e.g. 'taxon' or 'species'), emphasising their content, whereas others use verbs (e.g. 'search'), emphasising their functionality. Response data seldom use Darwin Core terms (two out of eight examples), and a wide range of terms can be used to represent the same concept (e.g. six different terms are used for dwc:scientificNameAuthorship). Terms that can be considered metadata for a response, such as pagination details, also vary considerably. Interestingly, the public interfaces of the majority of APIs assessed do not provide POST, PUT or DELETE endpoints that modify the database; POST is only used for providing more detailed request bodies than are possible with GET. This indicates that biodiversity informatics platforms use APIs primarily for data sharing. An API design guideline is a document that provides a set of rules or recommendations for how APIs should be designed in order to improve their consistency and usability. API design guidelines are typically created by organizations to standardize API development within the organization, or as a guideline for programmers using an organization's software to build APIs (e.g., Microsoft and Google). The API Stylebook is an online resource that provides access to a wide range of existing design guidelines, and there is an abundance of other resources available online. This presentation will cover some of the general concepts of API design, demonstrate some examples of how existing APIs vary, and discuss potential options to encourage standardization. We hope our analysis, the available body of knowledge on API design, and the collective experience of the biodiversity informatics community working with APIs may help answer the question "Does TDWG need an API design guideline?"
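
As a concrete taste of the kind of call surveyed here, the sketch below issues a GET request to one real public endpoint (GBIF's species-match service) and prints a few response fields. Note that the response keys are GBIF's own rather than Darwin Core terms, which is exactly the inconsistency discussed above; other APIs will differ in both URL structure and vocabulary.

```python
# GET request to a public biodiversity API endpoint; the response is
# JSON whose keys are defined by GBIF, not by Darwin Core.
import requests

resp = requests.get(
    "https://api.gbif.org/v1/species/match",
    params={"name": "Puma concolor"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

# e.g. authorship is folded into "scientificName" rather than being
# exposed under dwc:scientificNameAuthorship.
print(data.get("scientificName"), data.get("rank"), data.get("status"))
```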


Author(s):  
Holly Little
Talia Karim
Erica Krimmel

As we atomize and expand the digital representation of specimen information through data standards, it is critical to evaluate the implementation of these developments, including how well they serve discipline-specific needs. In particular, fossil specimens often present challenges because they require information to be captured that is seemingly parallel to, but not entirely aligned with, that of their extant counterparts. Previous work to evaluate data sharing practices of paleontology collections has shown an imbalance in the use of Darwin Core (DwC) (Wieczorek et al. 2012) terms and many instances of underutilized terms (Little 2018). To expand upon that broad assessment and encourage better adoption of evolving standards and data practices by fossil collections, a more in-depth review of term usage is necessary. Here we review specific DwC terms that are underutilized or that present challenges for fossil occurrence records, and we examine the subsequent impact on data discovery of paleo specimens. We conclude by sharing options for improving standards implementation within a paleo context.

We see key patterns and challenges in the current implementation of DwC in paleo collections, as evidenced by evaluations of the typical mappings found in occurrence records for fossil specimens, data flags applied by aggregators, and discussions within the paleo collections community. These can be organized into three broad groupings.

Group 1: Some DwC terms (or classes of terms) are clear to implement but are underutilized due to issues that are also found within the neontological community. Example: Location. In the case of terms related to the Location class, paleontology needs a way to deal with sensitive locality information. The sensitivity here typically relates to laws restricting the sharing of locality information to protect fossil sites, versus neontological requirements to protect threatened, rare, or endangered species; the end goal of fuzzing locality information without making the specimen record completely undiscoverable or unusable is the same. Better education is needed at the paleo data-provider level about standards for recording and sharing information in this category, which could be based on existing neontological community standards.

Group 2: A second group of DwC terms often seems clear to implement, but the terminology used to describe and define them may be unfamiliar to paleontologists or read as unnecessary for fossil occurrences. This uncertainty about the applicability of a term to paleo data can often result in data not being mapped or fully shared. Example: recordedBy (= collector). In these cases, a simple translation of the definition into verbiage familiar to paleontologists, or the inclusion of paleo-oriented examples in the DwC documentation, can make implementation clear.

Group 3: A third group of issues relates to DwC terms, classes, and/or extensions that are more complicated in the context of fossil versus neontological data. In some cases the use of these terms is complicated for neontological data as well, but perhaps for different reasons. The terms affected by these challenges can sometimes have the same general use, but additional layers of uncertainty or ambiguity are present because of the nature of fossil preservation or because a term has a different meaning within the discipline of paleontology. Examples: Resource Relationship/Interactions, Individual Count, Preparations, Taxon.
Review of these terms and their related classes and/or extensions has revealed that they might require qualification, further explanation, additional vocabulary terms, or even special handling instructions for when data are ingested and normalized at the aggregator level. This group of issues is more complicated to resolve, but the problems are not intractable and can progress toward solutions through further discussion within the community, active participation in the standards development and review process, and development of clear guidelines. Strategically assessing these terms and generating discipline-specific guidelines for the paleo community can improve the mobilization and discovery of fossil occurrence data. Documenting these paleo data practices not only helps data providers, it also increases the utility of these data within the broader research community by clearly outlining how the terms were used. Overall, this discipline-focused approach to understanding the implementation of data standards like DwC at the term level helps to increase knowledge sharing across the paleo community, improves data quality and standards adoption, and moves these datasets toward alignment with best practices like the FAIR (Findable, Accessible, Interoperable, Reusable) data principles.
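
To ground the three groupings, the sketch below lays out a hypothetical fossil occurrence as Darwin Core term/value pairs. The term names are real DwC terms; the values, and the choice of which terms to show, are invented for illustration.

```python
# Hypothetical fossil occurrence illustrating the three groups above.
fossil_occurrence = {
    "dwc:basisOfRecord": "FossilSpecimen",
    # Group 1: Location terms, fuzzed to protect the fossil site
    "dwc:decimalLatitude": 39.7,        # coordinates deliberately truncated
    "dwc:decimalLongitude": -105.2,
    "dwc:informationWithheld": "precise locality withheld to protect site",
    # Group 2: clear once translated for paleontologists
    "dwc:recordedBy": "A. Collector",   # i.e., the collector
    # Group 3: complicated by fossil preservation
    "dwc:individualCount": None,        # ambiguous for fragmentary material
    "dwc:preparations": "cast; consolidant applied",
}
```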


2014
Vol 2
pp. e1039
Author(s):
Ed Baker
Simon Rycroft
Vincent Smith
