Technical Considerations for a Transactional Model to Realize the Digital Extended Specimen

Author(s):  
Nelson Rios ◽  
Sharif Islam ◽  
James Macklin ◽  
Andrew Bentley

Technological innovations over the past two decades have given rise to the online availability of more than 150 million specimen and species-lot records from biological collections around the world through large-scale biodiversity data-aggregator networks. In the present landscape of biodiversity informatics, collections data are captured and managed locally in a wide variety of databases and collection management systems and then shared online as point-in-time Darwin Core archive snapshots. Data providers may publish periodic revisions to these data files, which are retrieved, processed and re-indexed by data aggregators. This workflow has resulted in data latencies and lags of months to years for some data providers. The Darwin Core Standard (Wieczorek et al. 2012) provides guidelines for representing biodiversity information digitally, yet varying institutional practices and a lack of interoperability between collection management systems continue to limit semantic uniformity, particularly with regard to the actual content of data within each field. Although some initiatives have begun to link data elements, our ability to comprehensively link all of the extended data associated with a specimen, or related specimens, is still limited by the low uptake and usage of persistent identifiers. The concept now under consideration is to create a Digital Extended Specimen (DES) that adheres to the Findable, Accessible, Interoperable and Reusable (FAIR) data management and stewardship principles and is the cumulative digital representation of all data, derivatives and products associated with a physical specimen, individually distinguished and linked by persistent identifiers on the Internet to create a web of knowledge. Biodiversity data aggregators that mobilize data across multiple institutions routinely perform data transformations in an attempt to provide a clean and consistent interpretation of the data. These aggregators are typically unable to interact directly with institutional data repositories, thereby limiting potentially fruitful opportunities for annotation, versioning and repatriation. The ability to track such data transactions and satisfy the accompanying legal implications (e.g. the Nagoya Protocol) is becoming a necessary component of data publication, yet existing standards do not adequately address it. Furthermore, no mechanisms exist to assess the “trustworthiness” of data, which is critical to scientific integrity and reproducibility, or to provide attribution metrics that collections can use to advocate for their contribution to, and effectiveness in supporting, such research. Since the introduction of Darwin Core Archives (Wieczorek et al. 2012), little has changed in the underlying mechanisms for publishing natural science collections data, and we are now at a point where new innovations are required to meet current demands for continued digitization, access, research and management. One solution may involve changing the biodiversity data publication paradigm to one based on the atomized transactions relevant to each individual data record. These transactions, when summed over time, allow us to realize the most recently accepted revision as well as historical and alternative perspectives.
In order to realize the Digital Extended Specimen ideals and the linking of data elements, this transactional model combined with open and FAIR data protocols, application programming interfaces (APIs), repositories, and workflow engines can provide the building blocks for the next generation of natural science collections and biodiversity data infrastructures and services. These and other related topics have been the focus of phase 2 of the global consultation on converging Digital Specimens and Extended Specimens. Based on these discussions, this presentation will explore a conceptual solution leveraging elements from distributed version control, cryptographic ledgers and shared redundant storage to overcome many of the shortcomings of contemporary approaches.
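As a rough, illustrative sketch of the transactional idea (not the authors' design), the Python below models a specimen record as an append-only, hash-chained log of atomized transactions; replaying the log yields the most recently accepted state while preserving historical and alternative assertions. All class, field and identifier names are hypothetical.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Transaction:
    """One atomized change to a single data record (hypothetical structure)."""
    record_id: str           # persistent identifier of the specimen record
    changes: Dict[str, Any]  # field-level assertions, e.g. {"dwc:country": "Peru"}
    agent: str               # who asserted the change (e.g. an ORCID iD)
    timestamp: float = field(default_factory=time.time)
    prev_hash: str = ""      # hash of the previous transaction (ledger link)

    def digest(self) -> str:
        payload = json.dumps(
            [self.record_id, self.changes, self.agent, self.timestamp, self.prev_hash],
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TransactionLog:
    """Append-only log; summing transactions over time yields the current revision."""

    def __init__(self) -> None:
        self.entries: List[Transaction] = []

    def append(self, record_id: str, changes: Dict[str, Any], agent: str) -> Transaction:
        prev = self.entries[-1].digest() if self.entries else ""
        tx = Transaction(record_id, changes, agent, prev_hash=prev)
        self.entries.append(tx)
        return tx

    def current_state(self, record_id: str) -> Dict[str, Any]:
        """Replay all transactions for a record to obtain its latest accepted state."""
        state: Dict[str, Any] = {}
        for tx in self.entries:
            if tx.record_id == record_id:
                state.update(tx.changes)
        return state


# Example: an initial digitization event followed by a later georeference annotation.
log = TransactionLog()
log.append("ark:/12345/spec-001", {"dwc:scientificName": "Puma concolor"}, "collector-orcid")
log.append("ark:/12345/spec-001", {"dwc:decimalLatitude": -12.05}, "annotator-orcid")
print(log.current_state("ark:/12345/spec-001"))
```

Distributed version control and shared redundant storage would sit on top of such a log; the hash chain simply makes tampering with the recorded history detectable.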

2018 ◽  
Vol 2 ◽  
pp. e25635
Author(s):  
Mikko Heikkinen ◽  
Falko Glöckler ◽  
Markus Englund

The DINA Symposium (“DIgital information system for NAtural history data”, https://dina-project.net) ends with a plenary session involving the audience to discuss the interplay of collection management and software tools. The discussion will touch on different areas and issues such as: (1) Collection management using modern technology: How should and could collections be managed using current technology? What is the ultimate objective of using a new collection management system? How should traditional management processes be changed? (2) Development and community: Why are there so many collection management systems? Why is it so difficult to create one system that fits everyone’s requirements? How could the community of developers and collection staff be built around the DINA project in the future? (3) Features and tools: How can needs that are common to all collections be identified? What are the new tools and technologies that could facilitate collection management? How could those tools be implemented as DINA-compliant services? (4) Data: What data must be captured about collections and specimens? What criteria need to be applied in order to distinguish essential and “nice-to-have” information? How should established data standards (e.g. Darwin Core and ABCD (Access to Biological Collection Data)) be used to share data from rich and diverse data models? In addition to the plenary discussion around these questions, we will agree on a streamlined format for continuing the discussion in order to write a white paper on these questions. The results and outcome of the session will constitute the basis of the paper and will be subsequently refined.


Author(s):  
David Shorthouse

Bionomia, https://bionomia.net, previously called Bloodhound Tracker, was launched in August 2018 with the aim of illustrating the breadth and depth of expertise required to collect and identify natural history specimens represented in the Global Biodiversity Information Facility (GBIF). This required that specimens and people be uniquely identified and that a granular expression of actions (e.g. "collected", "identified") be adopted. The Darwin Core standard presently combines agents and their actions into the conflated terms recordedBy and identifiedBy, whose values are typically unresolved and unlinked text strings. Bionomia consists of tools, web services, and a responsive website, which are all used to efficiently guide users to resolve and unequivocally link people to specimens via the first-class actions collected or identified. It also shields users from the complexity of stitching together and seamlessly integrating the services of four giant initiatives: ORCID, Wikidata, GBIF, and Zenodo. All of these initiatives are financially sustainable and well used by many stakeholders well outside this narrow use case. As a result, the links between person and specimen made by users of Bionomia are given every opportunity to persist, to represent credit for effort, and to flow into collection management systems as meaningful new entries. To date, 13M links between people and specimens have been made, including 2M negative associations, on 12.5M specimen records. These links were made either by the collectors themselves or by 84 people who have attributed specimen records to their peers, mentors and others they revere.
Integration with ORCID and Wikidata
People are identified in Bionomia through synchronization with ORCID and Wikidata by reusing their unique identifiers and drawing in their metadata. ORCID identifiers are used by living researchers to link their identities to their research outputs. ORCID services include OAuth2 pass-through authentication for use by developers and web services for programmatic access to its store of public profiles. These contain elements of metadata such as full name, aliases, keywords, countries, education, employment history, affiliations, and links to publications. Bionomia seeds its search directory of people by periodically querying ORCID for specific user-assigned keywords as well as directly through account creation via OAuth2 authentication. Deceased people are uniquely identified in Bionomia through integration with Wikidata by caching unique 'Q' numbers (identifiers), full names and aliases, countries, occupations, as well as birth and death dates. Profiles are seeded from Wikidata through daily queries for properties that are likely to be assigned to collectors of natural history specimens, such as "Entomologists of the World ID" (= P5370) or "Harvard Index of Botanists ID" (= P6264). Because Wikidata items may be merged, Bionomia captures these merge events, re-associates previously made links to specimen records, and mirrors Wikidata's redirect behaviour. A Wikidata property called "Bionomia ID" (= P6944), whose values are either ORCID identifiers or Wikidata 'Q' numbers, helps facilitate additional integration and reuse.
Integration with GBIF
Specimen data are downloaded wholesale as Darwin Core Archives from GBIF every two weeks. The purpose of this schedule is to maintain a reasonable synchrony with source data that balances computation time with the expectations of users who desire the most up-to-date view of their specimen records.
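As a simplified illustration of what processing such a download can look like (not Bionomia's actual pipeline), the sketch below tallies the unresolved recordedBy strings in a Darwin Core Archive. It assumes the core file is a tab-delimited occurrence.txt with a header row; a real archive declares its file names and column mappings in meta.xml.

```python
import csv
import io
import zipfile
from collections import Counter


def tally_recorded_by(archive_path: str) -> Counter:
    """Count distinct recordedBy strings in a Darwin Core Archive.

    Simplified: assumes a tab-delimited 'occurrence.txt' core with a header row;
    production code should read meta.xml for the real file name and column mapping.
    """
    counts: Counter = Counter()
    with zipfile.ZipFile(archive_path) as dwca:
        with dwca.open("occurrence.txt") as raw:
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8"), delimiter="\t")
            for row in reader:
                name = (row.get("recordedBy") or "").strip()
                if name:
                    counts[name] += 1
    return counts


if __name__ == "__main__":
    # "gbif-download.zip" is a placeholder path for a previously downloaded archive.
    for name, n in tally_recorded_by("gbif-download.zip").most_common(10):
        print(f"{n:6d}  {name}")
```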
Collectors with ORCID accounts who have elected to receive notices are informed via email when the authors of newly published papers have made use of their specimen records downloaded from GBIF.
Integration with Zenodo
Finally, users of Bionomia may integrate their ORCID OAuth2 authentication with Zenodo, an industry-recognized archive for research data, which enjoys support from the Conseil Européen pour la Recherche Nucléaire (CERN). At the user's request, their specimen data represented as CSV (comma-separated values) and JSON-LD (JavaScript Object Notation for Linked Data) documents are pushed into Zenodo, a DataCite DOI is assigned, and a formatted citation appears on their Bionomia profile. New versions of these files are pushed to Zenodo on the user's behalf when new specimen records are linked to them. If users have configured their ORCID account to listen for new entries in DataCite, a new work entry will also be made in their ORCID profile, thus sealing a perpetual, semi-automated loop between GBIF and ORCID that tidily showcases their efforts at collecting and identifying natural history specimens.
Technologies Used
Bionomia uses Apache Spark via scripts written in Scala, a human name parser written in Ruby called dwc_agent, queues of jobs executed through Sidekiq, scores of pairwise similarities in the structure of human names stored in Neo4j, data persistence in MySQL, and a search layer in Elasticsearch. Here, I expand on lessons learned in the construction and maintenance of Bionomia, emphasize the criticality of recognizing the early efforts made by a fledgling community of enthusiasts, and describe useful tools and services that may be integrated into collection management systems to help churn strings of unresolved, unlinked collector and determiner names into actionable identifiers that are gateways to rich sources of information.
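As a hedged sketch of the Wikidata profile seeding described above (not Bionomia's code), the query below asks the public Wikidata Query Service for items carrying a Harvard Index of Botanists ID (P6264), together with their labels and life dates (P569/P570).

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?item ?itemLabel ?birth ?death WHERE {
  ?item wdt:P6264 ?botanistId .          # Harvard Index of Botanists ID
  OPTIONAL { ?item wdt:P569 ?birth . }   # date of birth
  OPTIONAL { ?item wdt:P570 ?death . }   # date of death
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
"""


def seed_candidates() -> None:
    """Fetch candidate collector profiles from the Wikidata Query Service."""
    resp = requests.get(
        WDQS,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "profile-seeding-sketch/0.1"},  # polite, identifiable client
        timeout=60,
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        qid = row["item"]["value"].rsplit("/", 1)[-1]  # e.g. 'Q1234567'
        label = row.get("itemLabel", {}).get("value", "")
        death = row.get("death", {}).get("value", "")
        print(qid, label, death)


if __name__ == "__main__":
    seed_candidates()
```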


2021 ◽  
Vol 6 (2) ◽  
pp. 18
Author(s):  
Alireza Sassani ◽  
Omar Smadi ◽  
Neal Hawkins

Pavement markings are essential elements of transportation infrastructure with critical impacts on safety and mobility. They provide road users with the necessary information to adjust driving behavior or make calculated decisions about commuting. The visibility of pavement markings for drivers can be the boundary between a safe trip and a disastrous accident. Consequently, transportation agencies at the local or national levels allocate sizeable budgets to the upkeep of the pavement markings under their jurisdiction. Infrastructure asset management systems (IAMS) are often biased toward high-capital-cost assets such as pavements and bridges, and do not provide structured asset management (AM) plans for low-cost assets such as pavement markings. However, recent advances in transportation asset management (TAM) have promoted an integrated approach involving the pavement marking management system (PMMS). A PMMS brings all data items and processes under a comprehensive AM plan and enables managing pavement markings more efficiently. Pavement marking operations depend on location, conditions, and AM policies, which highly diversifies pavement marking management practices among agencies and makes it difficult to create a holistic image of the system. Most of the available resources for pavement marking management focus on practices instead of strategies. Therefore, there is a lack of comprehensive guidelines and model frameworks for developing PMMS. This study utilizes the existing body of knowledge to build a guideline for developing and implementing PMMS. First, by adapting the core AM concepts to pavement marking management, a model framework for PMMS is created, and the building blocks and elements of the framework are introduced. Then, the caveats and practical points in PMMS implementation are discussed based on the US transportation agencies’ experiences and the relevant literature. This guideline is intended to facilitate PMMS development for the agencies and pave the way for future pavement marking management tools and databases.


Author(s):  
Lyubomir Penev ◽  
Teodor Georgiev ◽  
Viktor Senderov ◽  
Mariya Dimitrova ◽  
Pavel Stoev

As one of the first advocates of open access and open data in the field of biodiversity publishing, Pensoft has adopted a multiple data publishing model, resulting in the ARPHA-BioDiv toolbox (Penev et al. 2017). ARPHA-BioDiv consists of several data publishing workflows and tools described in the Strategies and Guidelines for Publishing of Biodiversity Data and elsewhere:
(1) Data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details.
(2) Data are deposited in trusted repositories and/or supplementary files and described in data papers; data papers may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata.
(3) Integrated narrative and data publishing is realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article.
(4) Data are published in structured, semantically enriched, full-text XMLs, so that several data elements can thereafter easily be harvested by machines.
(5) Linked Open Data (LOD) are extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph.
The above-mentioned approaches are supported by a whole ecosystem of additional workflows and tools, for example: (1) pre-publication data auditing, involving both human and machine data quality checks (workflow 2); (2) web-service integration with data repositories and data centres, such as the Global Biodiversity Information Facility (GBIF), Barcode of Life Data Systems (BOLD), Integrated Digitized Biocollections (iDigBio), Data Observation Network for Earth (DataONE), Long Term Ecological Research (LTER), PlutoF, Dryad, and others (workflows 1, 2); (3) semantic markup of the article texts in the TaxPub format, facilitating further extraction, distribution and re-use of sub-article elements and data (workflows 3, 4); (4) server-to-server import of specimen data from GBIF, BOLD, iDigBio and PlutoF into manuscript text (workflow 3); (5) automated conversion of EML metadata into data paper manuscripts (workflow 2); (6) export of Darwin Core Archives and automated deposition in GBIF (workflow 3); (7) submission of individual images and supplementary data under their own DOIs to the Biodiversity Literature Repository, BLR (workflows 1-3); (8) conversion of key data elements from TaxPub articles and taxonomic treatments extracted by Plazi into RDF handled by OpenBiodiv (workflow 5). These approaches represent different aspects of the prospective scholarly publishing of biodiversity data which, in combination with the text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, lay the groundwork for an entire data publishing ecosystem for biodiversity, supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as GBIF, BLR, Plazi TreatmentBank and OpenBiodiv, and to various end users.
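To make workflow 2 concrete, the sketch below pulls a few common fields from an EML metadata file into a data-paper manuscript skeleton. It is not the ARPHA implementation; the element paths are simplified assumptions, and real EML documents are richer and more strictly namespaced.

```python
import xml.etree.ElementTree as ET


def eml_to_manuscript_stub(eml_path: str) -> dict:
    """Pull a few common EML fields into a data-paper manuscript skeleton.

    Simplified sketch: the element paths below are assumptions, not a complete
    mapping of the EML schema.
    """
    root = ET.parse(eml_path).getroot()
    dataset = root.find(".//{*}dataset")
    title = (dataset.findtext("{*}title", default="") or "").strip()
    authors = [
        " ".join(filter(None, [
            person.findtext("{*}individualName/{*}givenName"),
            person.findtext("{*}individualName/{*}surName"),
        ]))
        for person in dataset.findall("{*}creator")
    ]
    abstract = " ".join(
        (p.text or "").strip() for p in dataset.findall("{*}abstract/{*}para")
    )
    return {"title": title, "authors": authors, "abstract": abstract}


if __name__ == "__main__":
    # "dataset-metadata.xml" is a placeholder for an EML file exported by a repository.
    stub = eml_to_manuscript_stub("dataset-metadata.xml")
    print(stub["title"], "-", ", ".join(stub["authors"]))
```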


2017 ◽  
Vol 12 (1) ◽  
pp. 88-105 ◽  
Author(s):  
Sünje Dallmeier-Tiessen ◽  
Varsha Khodiyar ◽  
Fiona Murphy ◽  
Amy Nurnberger ◽  
Lisa Raymond ◽  
...  

The data curation community has long encouraged researchers to document collected research data during active stages of the research workflow, to provide robust metadata earlier, and support research data publication and preservation. Data documentation with robust metadata is one of a number of steps in effective data publication. Data publication is the process of making digital research objects ‘FAIR’, i.e. findable, accessible, interoperable, and reusable; attributes increasingly expected by research communities, funders and society. Research data publishing workflows are the means to that end. Currently, however, much published research data remains inconsistently and inadequately documented by researchers. Documentation of data closer in time to data collection would help mitigate the high cost that repositories associate with the ingest process. More effective data publication and sharing should in principle result from early interactions between researchers and their selected data repository. This paper describes a short study undertaken by members of the Research Data Alliance (RDA) and World Data System (WDS) working group on Publishing Data Workflows. We present a collection of recent examples of data publication workflows that connect data repositories and publishing platforms with research activity ‘upstream’ of the ingest process. We re-articulate previous recommendations of the working group, to account for the varied upstream service components and platforms that support the flow of contextual and provenance information downstream. These workflows should be open and loosely coupled to support interoperability, including with preservation and publication environments. Our recommendations aim to stimulate further work on researchers’ views of data publishing and the extent to which available services and infrastructure facilitate the publication of FAIR data. We also aim to stimulate further dialogue about, and definition of, the roles and responsibilities of research data services and platform providers for the ‘FAIRness’ of research data publication workflows themselves.


Author(s):  
G Deepank ◽  
R Tharun Raj ◽  
Aditya Verma

Electronic medical records represent rich data repositories loaded with valuable patient information. As artificial intelligence and machine learning become more popular in the field of medicine by the day, the ways to integrate them are constantly evolving. One such way is processing the clinical notes and records that are maintained by doctors and other medical professionals. Natural language processing can ingest this data and read more deeply into it than any human. Deep learning techniques such as entity extraction, which involves identifying and returning key data elements from an electronic medical record, and question answering with models such as BERT, when applied to these medical records, can help create bespoke and efficient treatment plans for patients, which can help in a swift and carefree recovery.
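As an illustrative sketch rather than the authors' system, the snippet below applies off-the-shelf Hugging Face pipelines to a toy clinical note: a token-classification model stands in for clinical entity extraction, and a BERT-family model answers an extractive question. The model names are examples only; a production system would use clinically trained models and de-identified data.

```python
from transformers import pipeline

# Toy clinical note; a real system would work with de-identified EMR text.
note = (
    "Patient reports chest pain for two days. History of hypertension. "
    "Started on aspirin 81 mg daily; follow-up echocardiogram scheduled."
)

# Entity extraction: a general-purpose NER model stands in for a clinical one.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)
for entity in ner(note):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 2))

# Extractive question answering over the same note with a BERT-family model.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
answer = qa(question="What medication was started?", context=note)
print(answer["answer"], round(float(answer["score"]), 2))
```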


Author(s):  
Lauren Weatherdon

Ensuring that we have the data and information necessary to make informed decisions is a core requirement in an era of increasing complexity and anthropogenic impact. With cumulative challenges such as the decline in biodiversity and accelerating climate change, the need for spatially-explicit and methodologically-consistent data that can be compiled to produce useful and reliable indicators of biological change and ecosystem health is growing. Technological advances—including satellite imagery—are beginning to make this a reality, yet uptake of biodiversity information standards and scaling of data to ensure its applicability at multiple levels of decision-making are still in progress. The complementary Essential Biodiversity Variables (EBVs) and Essential Ocean Variables (EOVs), combined with Darwin Core and other data and metadata standards, provide the underpinnings necessary to produce data that can inform indicators. However, perhaps the largest challenge in developing global, biological change indicators is achieving consistent and holistic coverage over time, with recognition of biodiversity data as global assets that are critical to tracking progress toward the UN Sustainable Development Goals and Targets set by the international community (see Jensen and Campbell (2019) for discussion). Through this talk, I will describe some of the efforts towards producing and collating effective biodiversity indicators, such as those based on authoritative datasets like the World Database on Protected Areas (https://www.protectedplanet.net/), and work achieved through the Biodiversity Indicators Partnership (https://www.bipindicators.net/). I will also highlight some of the characteristics of effective indicators, and global biodiversity reporting and communication needs as we approach 2020 and beyond.


2019 ◽  
Vol 2 ◽  
Author(s):  
Lyubomir Penev

"Data ownership" is actually an oxymoron, because there could not be a copyright (ownership) on facts or ideas, hence no data onwership rights and law exist. The term refers to various kinds of data protection instruments: Intellectual Property Rights (IPR) (mostly copyright) asserted to indicate some kind of data ownership, confidentiality clauses/rules, database right protection (in the European Union only), or personal data protection (GDPR) (Scassa 2018). Data protection is often realised via different mechanisms of "data hoarding", that is witholding access to data for various reasons (Sieber 1989). Data hoarding, however, does not put the data into someone's ownership. Nonetheless, the access to and the re-use of data, and biodiversuty data in particular, is hampered by technical, economic, sociological, legal and other factors, although there should be no formal legal provisions related to copyright that may prevent anyone who needs to use them (Egloff et al. 2014, Egloff et al. 2017, see also the Bouchout Declaration). One of the best ways to provide access to data is to publish these so that the data creators and holders are credited for their efforts. As one of the pioneers in biodiversity data publishing, Pensoft has adopted a multiple-approach data publishing model, resulting in the ARPHA-BioDiv toolbox and in extensive Strategies and Guidelines for Publishing of Biodiversity Data (Penev et al. 2017a, Penev et al. 2017b). ARPHA-BioDiv consists of several data publishing workflows: Deposition of underlying data in an external repository and/or its publication as supplementary file(s) to the related article which are then linked and/or cited in-tex. Supplementary files are published under their own DOIs to increase citability). Description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the systme allows for data papers to be submitted both as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata. Import of structured data into the article text from tables or via web services and their susequent download/distribution from the published article as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal. Publication of data in structured, semanticaly enriched, full-text XMLs where data elements are machine-readable and easy-to-harvest. Extraction of Linked Open Data (LOD) from literature, which is then converted into interoperable RDF triples (in accordance with the OpenBiodiv-O ontology) (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph Deposition of underlying data in an external repository and/or its publication as supplementary file(s) to the related article which are then linked and/or cited in-tex. Supplementary files are published under their own DOIs to increase citability). Description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the systme allows for data papers to be submitted both as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata. Import of structured data into the article text from tables or via web services and their susequent download/distribution from the published article as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal. 
Publication of data in structured, semanticaly enriched, full-text XMLs where data elements are machine-readable and easy-to-harvest. Extraction of Linked Open Data (LOD) from literature, which is then converted into interoperable RDF triples (in accordance with the OpenBiodiv-O ontology) (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph In combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, these approaches show different angles to the future of biodiversity data publishing and, lay the foundations of an entire data publishing ecosystem in the field, while also supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as Global Biodiversity Information Facility (GBIF), Biodiversity Literature Repository (BLR), Plazi TreatmentBank, OpenBiodiv, as well as to various end users.
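A minimal sketch of the LOD step using rdflib, with a placeholder namespace instead of the actual OpenBiodiv-O terms: it shows how statements extracted from an article might be expressed as RDF triples and serialized for loading into a knowledge graph such as OpenBiodiv. The DOI, treatment identifier and property names are all illustrative.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

# Placeholder namespace; the real vocabulary would be OpenBiodiv-O and external ontologies.
EX = Namespace("https://example.org/openbiodiv-sketch/")

g = Graph()
g.bind("ex", EX)
g.bind("dcterms", DCTERMS)

article = URIRef("https://doi.org/10.3897/example.article")  # illustrative DOI
treatment = URIRef("https://example.org/treatment/123")      # illustrative identifier

g.add((article, RDF.type, EX.Article))
g.add((article, DCTERMS.title, Literal("Example taxonomic paper")))
g.add((treatment, RDF.type, EX.TaxonomicTreatment))
g.add((treatment, EX.mentionsTaxon, Literal("Harmonia manillana")))
g.add((article, EX.containsTreatment, treatment))

# Serialize as Turtle, ready to be loaded into a triple store / knowledge graph.
print(g.serialize(format="turtle"))
```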


2021 ◽  
Vol 35 (1) ◽  
pp. 1-20
Author(s):  
Breda M. Zimkus ◽  
Linda S. Ford ◽  
Paul J. Morris

A growing number of domestic and international legal issues are confronting biodiversity collections, which require immediate access to information documenting the legal aspects of specimen ownership and restrictions regarding use. The Nagoya Protocol, which entered into force in 2014, established a legal framework for access and benefit-sharing of genetic resources and has notable implications for collecting, researchers working with specimens, and biodiversity collections. Herein, we discuss how this international protocol mandates operating changes within US biodiversity collections. Given the new legal landscape, it is clear that digital solutions for tracking records at all stages of a specimen's life cycle are needed. We outline how the Harvard Museum of Comparative Zoology (MCZ) has made changes to its procedures and museum-wide database, MCZbase (an independent instance of the Arctos collections management system), linking legal compliance documentation to specimens and transactions (i.e., accessions, loans). We used permits, certificates, and agreements associated with MCZ specimens accessioned in 2018 as a means to assess a new module created to track compliance documentation, a controlled vocabulary categorizing these documents, and the automatic linkages established among documentation, specimens, and transactions. While the emphasis of this work was a single-year test case, its successful implementation may be informative to policies and collection management systems at other institutions.
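As a generic illustration of the linkage pattern described here (not the Arctos/MCZbase schema), the sketch below builds a small SQLite model in which compliance documents carry a controlled-vocabulary category and are joined many-to-many to both specimens and transactions such as accessions and loans. All table, column and catalog values are hypothetical.

```python
import sqlite3

# In-memory database for illustration; names are hypothetical, not the MCZbase schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE specimen (
    specimen_id    INTEGER PRIMARY KEY,
    catalog_number TEXT NOT NULL
);
CREATE TABLE coll_transaction (
    transaction_id INTEGER PRIMARY KEY,
    kind           TEXT CHECK (kind IN ('accession', 'loan')),
    opened_on      TEXT
);
CREATE TABLE compliance_document (
    document_id INTEGER PRIMARY KEY,
    category    TEXT CHECK (category IN ('permit', 'certificate', 'agreement')),
    issued_by   TEXT,
    expires_on  TEXT
);
-- Many-to-many links so one permit can cover many specimens and transactions.
CREATE TABLE document_specimen (
    document_id INTEGER REFERENCES compliance_document(document_id),
    specimen_id INTEGER REFERENCES specimen(specimen_id),
    PRIMARY KEY (document_id, specimen_id)
);
CREATE TABLE document_transaction (
    document_id    INTEGER REFERENCES compliance_document(document_id),
    transaction_id INTEGER REFERENCES coll_transaction(transaction_id),
    PRIMARY KEY (document_id, transaction_id)
);
""")

# Example: one collecting permit linked to an accession and to a specimen.
conn.execute("INSERT INTO specimen VALUES (1, 'EX-HERP-0001')")
conn.execute("INSERT INTO coll_transaction VALUES (1, 'accession', '2018-05-14')")
conn.execute("INSERT INTO compliance_document VALUES (1, 'permit', 'Example authority', '2023-12-31')")
conn.execute("INSERT INTO document_specimen VALUES (1, 1)")
conn.execute("INSERT INTO document_transaction VALUES (1, 1)")

row = conn.execute("""
    SELECT s.catalog_number, d.category
    FROM specimen s
    JOIN document_specimen ds ON ds.specimen_id = s.specimen_id
    JOIN compliance_document d ON d.document_id = ds.document_id
""").fetchone()
print(row)  # ('EX-HERP-0001', 'permit')
```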

