Liberating links between datasets using lightweight data publishing: an example using plant names and the taxonomic literature

2018 ◽  
Vol 6 ◽  
pp. e27539 ◽  
Author(s):  
Roderic Page

Constructing a biodiversity knowledge graph will require making millions of cross links between diversity entities in different datasets. Researchers trying to bootstrap the growth of the biodiversity knowledge graph by constructing databases of links between these entities lack obvious ways to publish these sets of links. One appealing and lightweight approach is to create a "datasette", a database that is wrapped together with a simple web server that enables users to query the data. Datasettes can be packaged into Docker containers and hosted online with minimal effort. This approach is illustrated using a dataset of links between globally unique identifiers for plant taxonomic names, and identifiers for the taxonomic articles that published those names.

2018 ◽  
Author(s):  
Roderic D. M. Page

Constructing a biodiversity knowledge graph will require making millions of cross links between diversity entities in different datasets. Researchers trying to bootstrap the growth of the biodiversity knowledge graph by constructing databases of links between these entities lack obvious ways to publish these sets of links. One appealing and lightweight approach is to create a “datasette”, a database that is wrapped together with a simple web server that enables users to query the data. Datasettes can be packaged into Docker containers and hosted online with minimal effort. This approach is illustrated using a dataset of links between globally unique identifiers for plant taxonomic names, and identifiers for the taxonomic articles that published those names.
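As a minimal sketch of the lightweight publishing step described above, the snippet below builds a small SQLite database of name-to-article links that a tool such as Datasette can then serve as a queryable website and package into a Docker container. The table layout and the identifiers are invented for illustration; they are not the dataset described in the paper.

```python
# Build a small SQLite database of links between plant name identifiers and
# article identifiers. Column names and example identifiers are illustrative
# only, not the actual dataset described in the paper.
import sqlite3

conn = sqlite3.connect("links.db")
conn.execute("CREATE TABLE IF NOT EXISTS links (name_id TEXT, article_doi TEXT)")
conn.execute(
    "INSERT INTO links VALUES (?, ?)",
    ("urn:lsid:ipni.org:names:0000000-0", "10.9999/example.doi"),  # invented identifiers
)
conn.commit()
conn.close()

# The resulting database can then be served as a simple query interface,
# e.g. with Datasette ("datasette serve links.db"), and wrapped in a Docker
# container for hosting.
```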


Author(s):  
Lyubomir Penev ◽  
Teodor Georgiev ◽  
Viktor Senderov ◽  
Mariya Dimitrova ◽  
Pavel Stoev

As one of the first advocates of open access and open data in the field of biodiversity publishing, Pensoft has adopted a multiple data publishing model, resulting in the ARPHA-BioDiv toolbox (Penev et al. 2017). ARPHA-BioDiv consists of several data publishing workflows and tools described in the Strategies and Guidelines for Publishing of Biodiversity Data and elsewhere:

1. Data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details.
2. Data deposited in trusted repositories and/or as supplementary files are described in data papers; data papers may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata.
3. Integrated narrative and data publishing is realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article.
4. Data are published in structured, semantically enriched, full-text XMLs, so that several data elements can thereafter easily be harvested by machines.
5. Linked Open Data (LOD) are extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph.
The above-mentioned approaches are supported by a whole ecosystem of additional workflows and tools, for example: (1) pre-publication data auditing, involving both human and machine data quality checks (workflow 2); (2) web-service integration with data repositories and data centres, such as the Global Biodiversity Information Facility (GBIF), Barcode of Life Data Systems (BOLD), Integrated Digitized Biocollections (iDigBio), Data Observation Network for Earth (DataONE), Long Term Ecological Research (LTER), PlutoF, Dryad, and others (workflows 1, 2); (3) semantic markup of the article texts in the TaxPub format, facilitating further extraction, distribution and re-use of sub-article elements and data (workflows 3, 4); (4) server-to-server import of specimen data from GBIF, BOLD, iDigBio and PlutoF into manuscript text (workflow 3); (5) automated conversion of EML metadata into data paper manuscripts (workflow 2); (6) export of Darwin Core Archives and automated deposition in GBIF (workflow 3); (7) submission of individual images and supplementary data under their own DOIs to the Biodiversity Literature Repository, BLR (workflows 1-3); (8) conversion of key data elements from TaxPub articles and taxonomic treatments extracted by Plazi into RDF handled by OpenBiodiv (workflow 5). These approaches represent different aspects of the prospective scholarly publishing of biodiversity data, which, in combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, lay the ground for an entire data publishing ecosystem for biodiversity, supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as GBIF, BLR, Plazi TreatmentBank, OpenBiodiv, and various end users.
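To make the LOD step (workflow 5) concrete, the sketch below expresses a single literature-derived statement as RDF triples using rdflib. The namespace, identifiers and property names are invented placeholders; the real OpenBiodiv graph follows the OpenBiodiv-O ontology (Senderov et al. 2018).

```python
# Sketch: one article-treatment-name statement as RDF triples.
# All URIs, the namespace and the taxon name are hypothetical examples.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/openbiodiv/")          # placeholder namespace
article = URIRef("https://doi.org/10.9999/example.doi")   # invented article DOI
treatment = EX["treatment/123"]                           # invented treatment identifier

g = Graph()
g.add((treatment, RDF.type, EX.TaxonomicTreatment))       # "this is a taxonomic treatment"
g.add((treatment, EX.isPartOf, article))                  # "...published in this article"
g.add((treatment, EX.mentionsName, Literal("Aus bus")))   # "...and it mentions this taxon name"

print(g.serialize(format="turtle"))
```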


2019 ◽  
Vol 2 ◽  
Author(s):  
Lyubomir Penev

"Data ownership" is actually an oxymoron, because there could not be a copyright (ownership) on facts or ideas, hence no data onwership rights and law exist. The term refers to various kinds of data protection instruments: Intellectual Property Rights (IPR) (mostly copyright) asserted to indicate some kind of data ownership, confidentiality clauses/rules, database right protection (in the European Union only), or personal data protection (GDPR) (Scassa 2018). Data protection is often realised via different mechanisms of "data hoarding", that is witholding access to data for various reasons (Sieber 1989). Data hoarding, however, does not put the data into someone's ownership. Nonetheless, the access to and the re-use of data, and biodiversuty data in particular, is hampered by technical, economic, sociological, legal and other factors, although there should be no formal legal provisions related to copyright that may prevent anyone who needs to use them (Egloff et al. 2014, Egloff et al. 2017, see also the Bouchout Declaration). One of the best ways to provide access to data is to publish these so that the data creators and holders are credited for their efforts. As one of the pioneers in biodiversity data publishing, Pensoft has adopted a multiple-approach data publishing model, resulting in the ARPHA-BioDiv toolbox and in extensive Strategies and Guidelines for Publishing of Biodiversity Data (Penev et al. 2017a, Penev et al. 2017b). ARPHA-BioDiv consists of several data publishing workflows: Deposition of underlying data in an external repository and/or its publication as supplementary file(s) to the related article which are then linked and/or cited in-tex. Supplementary files are published under their own DOIs to increase citability). Description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the systme allows for data papers to be submitted both as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata. Import of structured data into the article text from tables or via web services and their susequent download/distribution from the published article as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal. Publication of data in structured, semanticaly enriched, full-text XMLs where data elements are machine-readable and easy-to-harvest. Extraction of Linked Open Data (LOD) from literature, which is then converted into interoperable RDF triples (in accordance with the OpenBiodiv-O ontology) (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph Deposition of underlying data in an external repository and/or its publication as supplementary file(s) to the related article which are then linked and/or cited in-tex. Supplementary files are published under their own DOIs to increase citability). Description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the systme allows for data papers to be submitted both as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata. Import of structured data into the article text from tables or via web services and their susequent download/distribution from the published article as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal. 
Publication of data in structured, semanticaly enriched, full-text XMLs where data elements are machine-readable and easy-to-harvest. Extraction of Linked Open Data (LOD) from literature, which is then converted into interoperable RDF triples (in accordance with the OpenBiodiv-O ontology) (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph In combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, these approaches show different angles to the future of biodiversity data publishing and, lay the foundations of an entire data publishing ecosystem in the field, while also supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as Global Biodiversity Information Facility (GBIF), Biodiversity Literature Repository (BLR), Plazi TreatmentBank, OpenBiodiv, as well as to various end users.
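Once such LOD is loaded into a triple store, it can be consumed through a SPARQL endpoint. The sketch below shows this querying step in general terms; the endpoint URL is a placeholder rather than OpenBiodiv's actual address, and the query simply lists a handful of typed resources.

```python
# Sketch of querying a biodiversity knowledge graph over SPARQL.
# The endpoint URL is a placeholder, not a real OpenBiodiv address.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://example.org/sparql")  # placeholder endpoint URL
endpoint.setQuery("""
    SELECT ?resource ?type WHERE {
        ?resource a ?type .
    }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["resource"]["value"], row["type"]["value"])
```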


2002 ◽  
Vol 12 (4-5) ◽  
pp. 359-374 ◽  
Author(s):  
SIMON MARLOW

Server applications, and in particular network-based server applications, place a unique combination of demands on a programming language: lightweight concurrency, high I/O throughput, and fault tolerance are all important. This paper describes a prototype web server written in Concurrent Haskell (with extensions), and presents two useful results: firstly, a conforming server could be written with minimal effort, leading to an implementation in less than 1500 lines of code; and secondly, the naive implementation produced reasonable performance. Furthermore, making minor modifications to a few time-critical components improved performance to a level acceptable for anything but the most heavily loaded web servers.


2018 ◽  
Author(s):  
Roderic D. M. Page

Enormous quantities of biodiversity data are being made available online, but much of this data remains isolated in its own silos. One approach to breaking these silos is to map local, often database-specific identifiers to shared global identifiers. This mapping can then be used to construct a knowledge graph, where entities such as taxa, publications, people, places, specimens, sequences, and institutions are all part of a single, shared knowledge space. Motivated by the 2018 GBIF Ebbe Nielsen Challenge I explore the feasibility of constructing a “biodiversity knowledge graph” for the Australian fauna. The steps involved in constructing the graph are described, and examples of its application are discussed. A web interface to the knowledge graph (called “Ozymandias”) is available at https://ozymandias-demo.herokuapp.com.
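The following toy sketch illustrates the local-to-global identifier mapping step that underpins such a graph: local records keyed by database-specific identifiers are rewritten in terms of shared global identifiers so that cross-links from different silos can be merged. All identifiers and field names are invented for illustration and are not taken from Ozymandias.

```python
# Toy illustration of mapping local, database-specific identifiers to shared
# global identifiers. Every identifier below is an invented example.

# Local records, each keyed by a database-specific identifier.
local_taxa = {"taxa-db:1234": {"name": "Examplesaurus primus"}}        # hypothetical taxon record
local_refs = {"lit-db:5678": {"title": "An example revision"}}         # hypothetical publication record

# Linking table: local identifier -> shared global identifier.
id_map = {
    "taxa-db:1234": "https://example.org/taxon/abc",                   # invented global taxon URI
    "lit-db:5678": "https://doi.org/10.9999/example.doi",              # invented DOI
}

# Cross-links expressed with local identifiers...
local_links = [("taxa-db:1234", "publishedIn", "lit-db:5678")]

# ...rewritten with global identifiers, ready to load into a knowledge graph.
global_links = [(id_map[s], p, id_map[o]) for s, p, o in local_links]
print(global_links)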


Author(s):  
Roderic Page

This talk explores different strategies for assembling the “biodiversity knowledge graph” (Page 2016). The first is a centralised, crowd-sourced approach using Wikidata as the foundation. Wikidata is becoming increasingly attractive as a knowledge graph for the life sciences (Waagmeester et al. 2020), and I will discuss some of its strengths and limitations, particularly as a source of bibliographic and taxonomic information. For example, Wikidata’s handling of taxonomy is somewhat problematic given the lack of clear separation of taxa and their names. A second approach is to build biodiversity knowledge graphs from scratch, such as OpenBioDiv (Penev et al. 2019) and my own Ozymandias (Page 2019). These approaches use either generalised vocabularies such as schema.org, or domain specific ones such as TaxPub (Catapano 2010) and the Semantic Publishing and Referencing Ontologies (SPAR) (Peroni and Shotton 2018), and to date tend to have restricted focus, whether geographic (e.g., Australian animals in Ozymandias) or temporal (recent taxonomic literature, OpenBioDiv). A growing number of data sources are now using schema.org to describe their data, including ORCID and Zenodo, and efforts to extend schema.org into biology (Bioschemas) suggest we may soon be able to build comprehensive knowledge graphs using just schema.org and its derivatives. A third approach is not to build an entire knowledge graph, but instead focus on constructing small pieces of the graph tightly linked to supporting evidence, for example via annotations. Annotations are increasingly used to mark up both the biomedical literature (e.g., Kim et al. 2015, Venkatesan et al. 2017) and the biodiversity literature (Batista-Navarro et al. 2017). One could argue that taxonomic databases are essentially lists of annotations (“this name appears in this publication on this page”), which suggests we could link literature projects such as the Biodiversity Heritage Library (BHL) to taxonomic databases via annotations. Given that the International Image Interoperability Framework (IIIF) provides a framework for treating publications themselves as a set of annotations (e.g., page images) upon which other annotations can be added (Zundert 2018), this suggests ways that knowledge graphs could lead directly to visualising the links between taxonomy and the taxonomic literature. All three approaches will be discussed, accompanied by working examples.
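As a small illustration of the Wikidata-based approach, the sketch below queries the public Wikidata Query Service for a few items modelled as taxa. The endpoint and the terms used (P31 "instance of", Q16521 "taxon", P225 "taxon name") are real Wikidata identifiers; the query itself is only an example and says nothing about how taxa and names should ideally be separated.

```python
# Query the Wikidata Query Service for a handful of taxon items and their names.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?item ?taxonName WHERE {
  ?item wdt:P31 wd:Q16521 ;      # item is an instance of "taxon"
        wdt:P225 ?taxonName .    # and carries a scientific (taxon) name
}
LIMIT 5
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "biodiversity-kg-example/0.1"},  # WDQS asks clients to identify themselves
)
for row in response.json()["results"]["bindings"]:
    print(row["item"]["value"], row["taxonName"]["value"])
```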


2019 ◽  
Vol 5 ◽  
Author(s):  
Joel Sachs ◽  
Roderic Page ◽  
Steven J Baskauf ◽  
Jocelyn Pender ◽  
Beatriz Lujan-Toro ◽  
...  

Knowledge graphs have the potential to unite disconnected digitized biodiversity data, and there are a number of efforts underway to build biodiversity knowledge graphs. More generally, the recent popularity of knowledge graphs, driven in part by the advent and success of the Google Knowledge Graph, has breathed life into the ongoing development of semantic web infrastructure and prototypes in the biodiversity informatics community. We describe a one-week training event and hackathon that focused on applying three specific knowledge graph technologies – the Neptune graph database, Metaphactory, and Wikidata – to a diverse set of biodiversity use cases. We give an overview of the training, the projects that were advanced throughout the week, and the critical discussions that emerged. We believe that the main barriers to adoption of biodiversity knowledge graphs are a lack of understanding of knowledge graphs and a lack of adoption of shared unique identifiers. Furthermore, we believe an important advance for knowledge graph development is the emergence of Wikidata as an identifier broker and as a scoping tool. To remedy the current barriers to biodiversity knowledge graph development, we recommend continued discussions at workshops and at conferences, which we expect to increase awareness and adoption of knowledge graph technologies.


Author(s):  
Quentin Groom ◽  
Chloé Besombes ◽  
Josh Brown ◽  
Simon Chagnoux ◽  
Teodor Georgiev ◽  
...  

The concept of building a network of relationships between entities, a knowledge graph, is one of the most effective methods to understand the relations between data. By organizing data, we facilitate the discovery of complex patterns not otherwise evident in the raw data. Each datum at the nodes of a knowledge graph needs a persistent identifier (PID) to reference it unambiguously. In the biodiversity knowledge graph, people are key elements (Page 2016). They collect and identify specimens, they publish, observe, work with each other and they name organisms. Yet biodiversity informatics has been slow to adopt PIDs for people, and people are currently represented in collection management systems as text strings in various formats. These text strings often do not separate individuals within a collecting team, and little biographical information is collected to disambiguate collectors. In March 2019 we organised an international workshop to find solutions to the problem of PIDs for people in collections, with the aim of identifying people unambiguously across the world's natural history collections in all of their various roles. Stakeholders from 11 countries were represented, including libraries, collections, publishers, developers and name registers. We want to identify people for many reasons. Cross-validation of information about a specimen with biographical information about its collector can be used to clean data. Mapping specimens from individual collectors across multiple herbaria can geolocate specimens accurately. By linking literature to specimens through their authors and collectors we can create collaboration networks, leading to a much better understanding of the scientific contribution of collectors and their institutions. For taxonomists, it will be easier to identify nomenclatural type and syntype material, essential for reliable typification. Overall, it will mean that geographically dispersed specimens can be treated much more like a single distributed infrastructure of specimens, as is envisaged in the European research infrastructure Distributed System of Scientific Collections (DiSSCo). There are several person identifier systems in use. For example, the Virtual International Authority File (VIAF) is a widely used system for published authors. The International Standard Name Identifier (ISNI) has broader scope and incorporates VIAF. The ORCID identifier system provides self-registration of living researchers. Wikidata also has identifiers for people, which have the advantage of being easy to add to and correct. There are also national systems, such as the French and German authority files, and considerable sharing of identifiers, particularly on Wikidata. This creates an integrated network of identifiers that could act as a brokerage system. Attendees agreed that no single identifier system should be recommended; however, some are more appropriate for particular circumstances. Some difficulties still have to be resolved before these identifier schemes can be used for biodiversity: 1) duplicate entries in the same identifier system; 2) handling collector teams and preserving the order of collectors; 3) how to integrate identifiers with standards such as Darwin Core and ABCD, and in the Global Biodiversity Information Facility; and 4) many living and dead collectors are known only from their specimens, so they may not pass the notability standards required by many authority systems.
The participants of the workshop are now working on a number of fronts to make progress on the adoption of PIDs for people in collections. This includes extending pilots that have already been trialled, working with identifier systems to make them more suitable for specimen collectors, and talking to service providers to encourage them to use ORCID iDs to identify their users. It was concluded that resolving the problem of person identifiers for collections is largely not a matter of lacking a solution, but of implementing solutions that already exist.
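As a small example of what a PID for a person buys you, the sketch below resolves an ORCID iD to its public record. It assumes the ORCID public API v3.0 endpoint and its JSON field layout; the iD used is ORCID's well-known documentation example, not a real collector.

```python
# Resolve an ORCID iD to the person's public name record.
# Assumes the ORCID public API v3.0 endpoint and JSON field names;
# the iD below is ORCID's documented example record.
import requests

orcid_id = "0000-0002-1825-0097"  # ORCID's example iD, not a real collector
response = requests.get(
    f"https://pub.orcid.org/v3.0/{orcid_id}/record",
    headers={"Accept": "application/json"},
)
record = response.json()
name = record["person"]["name"]
print(name["given-names"]["value"], name["family-name"]["value"])
```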

