A Workflow for the Semantic Annotation of Field Books and Specimen Labels

2018 ◽  
Vol 2 ◽  
pp. e25839
Author(s):  
Lise Stork ◽  
Andreas Weber ◽  
Eulàlia Miracle ◽  
Katherine Wolstencroft

Geographical and taxonomical referencing of specimens and documented species observations from within and across natural history collections is vital for ongoing species research. However, much of the historical data, such as field books, diaries and specimens, is challenging to work with: it is computationally inaccessible, refers to historical place names and taxonomies, and is written in a variety of languages. In order to address these challenges and elucidate historical species observation data, we developed a workflow to (i) crowd-source semantic annotations from handwritten species observations, (ii) transform them into RDF (Resource Description Framework) and (iii) store and link them in a knowledge base. Instead of full transcription, we directly annotate digital field book scans with key concepts that are based on Darwin Core standards. Our workflow stresses the importance of verbatim annotation. The interpretation of the historical content, such as resolving a historical taxon to a current one, can be done by individual researchers after the content is published as linked open data. Through the storage of annotation provenance, i.e. who created the annotation and when, we allow multiple interpretations of the content to exist in parallel, stimulating scientific discourse. The semantic annotation process is supported by a web application, the Semantic Field Book (SFB)-Annotator, driven by an application ontology. The ontology formally describes the content and metadata required to semantically annotate species observations. It is based on the Darwin Core standard (DwC), Uberon and the GeoNames ontology. The provenance of annotations is stored using the Web Annotation Data Model. Adhering to the principles of FAIR (Findable, Accessible, Interoperable & Reusable) and Linked Open Data, the content of the specimen collections can be interpreted homogeneously and aggregated across datasets. This work is part of the Making Sense project: makingsenseproject.org.
The project aims to disclose the content of a natural history collection: a 17,000-page account of the exploration of the Indonesian Archipelago between 1820 and 1850 (Natuurkundige Commissie voor Nederlands-Indie). With a knowledge base, researchers are given easy access to the primary sources of natural history collections. For their research, they can aggregate species observations, construct rich queries to browse through the data and add their own interpretations regarding the meaning of the historical content.
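The annotation-to-RDF step described above can be illustrated with a minimal sketch. This is not the SFB-Annotator's actual serialisation: the IRIs, the use of oa:hasTarget for the scan, and the Dublin Core provenance properties are assumptions, but the pattern of pairing a verbatim Darwin Core value with who/when provenance follows the workflow described.

```python
from datetime import date

# Namespaces: Darwin Core terms and the Web Annotation vocabulary.
DWC = "http://rs.tdwg.org/dwc/terms/"
OA = "http://www.w3.org/ns/oa#"
DCT = "http://purl.org/dc/terms/"

def annotation_to_ntriples(ann_iri, page_iri, term, verbatim, creator, created):
    """Serialise one verbatim field-book annotation as N-Triples,
    keeping provenance (who, when) alongside the Darwin Core term."""
    return "\n".join([
        f'<{ann_iri}> <{OA}hasTarget> <{page_iri}> .',
        f'<{ann_iri}> <{DWC}{term}> "{verbatim}" .',
        f'<{ann_iri}> <{DCT}creator> "{creator}" .',
        f'<{ann_iri}> <{DCT}created> "{created}" .',
    ])

triples = annotation_to_ntriples(
    "http://example.org/anno/1",      # hypothetical annotation IRI
    "http://example.org/scan/p17",    # hypothetical scan IRI
    "verbatimLocality", "Buitenzorg, Java",
    "A. Researcher", date(2018, 5, 1).isoformat())
print(triples)
```

Because each annotation carries its own creator and date, two researchers can attach conflicting interpretations to the same scan and both remain queryable, which is the parallel-interpretation behaviour the abstract describes.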

Author(s):  
David Fichtmueller ◽  
Walter G. Berendsohn ◽  
Gabriele Droege ◽  
Falko Glöckler ◽  
Anton Güntsch ◽  
...  

The TDWG standard ABCD (Access to Biological Collections Data task group 2007) was aimed at harmonizing the terminologies used for modelling biological collection information and is used as a comprehensive data format for transferring collection and observation data between software components. The project ABCD 3.0 (A community platform for the development and documentation of the ABCD standard for natural history collections) was financed by the German Research Council (DFG). It addressed the transformation of ABCD into a semantic-web-compliant ontology by deconstructing the XML schema into individually addressable RDF (Resource Description Framework) resources published via the TDWG Terms Wiki (https://terms.tdwg.org/wiki/ABCD_2). In a second step, the informal properties and concept relations described by the original ABCD schema were transformed into a machine-readable ontology and revised (Güntsch et al. 2016). The project was successfully finished in January 2019. The ABCD 3 setup allows for the creation of standard-conforming application schemas. The XML variant of ABCD 3.0 was restructured, simplified and made more consistent in terms of element names and types compared to version 2.x. The XML elements are connected to their semantic concepts using the W3C SAWSDL (Semantic Annotations for WSDL and XML Schema) standard. The creation of specialized application schemas is encouraged; the first use case was the application schema for zoology. It will also be possible to generate application schemas that break the traditional unit-centric structure of ABCD. Further achievements of the project include the creation of a Wikibase instance as the editing platform, with related tools for maintenance queries, such as checking for inconsistencies in the ontology, and automated export into RDF. This allows for fast iterations of new or updated versions, e.g. when additional mappings to other standards are added.
The setup is agnostic to the data standard created; it can therefore also be used to create or model other standards. Mappings to other standards such as Darwin Core (https://dwc.tdwg.org/) and Audubon Core (https://tdwg.github.io/ac/) are now machine-readable as well. All XPaths (XML Paths) of ABCD 3.0 XML have been mapped to all variants of ABCD 2.06 and 2.1, which will ease the transition to the new standard. The ABCD 3 Ontology will also be uploaded to the GFBio Terminology Server (Karam et al. 2016), where individual concepts can be easily searched or queried, allowing for better interactive modelling of ABCD concepts. The ABCD documentation now adheres to TDWG's Standards Documentation Standard (SDS, https://www.tdwg.org/standards/sds/) and is located at https://abcd.tdwg.org/. The new site is hosted on GitHub: https://github.com/tdwg/abcd/tree/gh-pages.
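The XPath mapping mentioned above amounts to a lookup from a legacy ABCD 2.x path to an addressable ABCD 3 concept. A minimal sketch follows; the specific XPaths and concept IRIs here are invented placeholders, not the official mappings published at https://abcd.tdwg.org/.

```python
# Placeholder mapping table from ABCD 2.x XPaths to ABCD 3 concept IRIs.
# The real, machine-readable mappings are maintained by the project.
ABCD2_XPATH_TO_CONCEPT = {
    "/DataSets/DataSet/Units/Unit/UnitID":
        "http://rs.tdwg.org/abcd/terms/unitID",
    "/DataSets/DataSet/Units/Unit/SourceInstitutionID":
        "http://rs.tdwg.org/abcd/terms/sourceInstitutionID",
}

def concept_for(xpath):
    """Resolve a legacy ABCD 2.x XPath to its ABCD 3 concept IRI, if mapped."""
    return ABCD2_XPATH_TO_CONCEPT.get(xpath)

print(concept_for("/DataSets/DataSet/Units/Unit/UnitID"))
```

A transition tool for existing ABCD 2.06/2.1 documents could walk each element's path through such a table and emit the corresponding semantic concept, which is what makes the mappings useful beyond documentation.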


Author(s):  
David Shorthouse ◽  
Roderic Page

Through the Bloodhound proof-of-concept, https://bloodhound-tracker.net, an international audience of collectors and determiners of natural history specimens is engaged in the emotive act of claiming their specimens and attributing other specimens to living and deceased mentors and colleagues. Behind the scenes, these claims build links between Open Researcher and Contributor Identifiers (ORCID, https://orcid.org) or Wikidata identifiers for people and Global Biodiversity Information Facility (GBIF) specimen identifiers, predicated on the Darwin Core terms recordedBy (collected) and identifiedBy (determined). Here we additionally describe the socio-technical challenge of unequivocally resolving people's names in legacy specimen data and propose lightweight and reusable solutions. The unique identifiers for the affiliations of active researchers are obtained from ORCID, whereas the unique identifiers for institutions where specimens are actively curated are resolved through Wikidata. By constructing closed loops of links between person, specimen, and institution, an interesting suite of potential metrics emerges, all due to the activities of employees and their network of professional relationships. This approach balances the desire of individuals to receive formal recognition for their efforts in natural history collections with the institutional-level need to adjust budgets in response to easily obtained numeric trends in national and international reach. If handled in a coordinated fashion, this reporting technique may be a significant new driver for specimen digitization efforts, on par with Altmetric (https://www.altmetric.com), an important tool that tracks the impact of publications and delights administrators and authors alike.
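The closed person-specimen-institution loop can be sketched as data. The records below are hypothetical (the ORCID is the well-known example identifier and the institution QID is a placeholder), but the shape of each claim, a person identifier tied to a GBIF specimen via a Darwin Core role, follows the abstract, and a simple per-institution tally stands in for the kind of metric described.

```python
from collections import Counter

# Hypothetical claims in the spirit of the Bloodhound links.
claims = [
    {"person": "https://orcid.org/0000-0002-1825-0097",
     "role": "recordedBy", "specimen": "gbif:occ-123",
     "institution": "wd:Q_example_museum"},
    {"person": "https://orcid.org/0000-0002-1825-0097",
     "role": "identifiedBy", "specimen": "gbif:occ-456",
     "institution": "wd:Q_example_museum"},
]

# One candidate metric from the closed loop: specimens attributed per
# institution, split by Darwin Core role.
per_institution = Counter((c["institution"], c["role"]) for c in claims)
print(per_institution)
```

Aggregating such tallies over time would give an institution the "numeric trends in national and international reach" the abstract mentions, without any new data entry beyond the claims themselves.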


Author(s):  
Zhengzhe Wu ◽  
Jere Kahanpää ◽  
Pasi Sihvonen ◽  
Anne Koivunen ◽  
Hannu Saarenmaa

Digitisation of natural history collections draws increasing attention. Digitised specimens not only facilitate the long-term preservation of biodiversity information but also make that information easier to access and share. There are more than two billion specimens in the world's natural history collections, and pinned insect specimens make up more than half of them (Tegelberg et al. 2014, Tegelberg et al. 2017). However, it is still a challenge to digitise pinned insect specimens with current state-of-the-art systems. Imaging pinned insects is slow because they are essentially 3D objects and the associated labels are pinned under the insect specimen. During the imaging process, the labels are often removed manually, which slows down the whole process. How can we avoid handling the labels pinned under often fragile and valuable specimens in order to increase the speed of digitisation? In our work (Saarenmaa et al. 2019) for task T3.1.2 of the ICEDIG (https://www.icedig.eu) project, we first briefly reviewed the state-of-the-art approaches to small insect digitisation. Then recent promising technological advances in imaging were presented, some of which have not yet been used for insect digitisation. It seems that no single approach will be enough to digitise all insect collections efficiently. The approach has to be optimized based on the features of the specimens and their associated labels. To obtain a breakthrough in insect digitisation, it is necessary to combine existing and new technologies in novel workflows. To explore the options, we identified six approaches for digitising pinned insects with the goal of minimal manipulation of labels:

1. Minimal labels: image selected individual specimens without removing labels from the pin, using two cameras. This method suits small insects with only one or a few well-spaced labels.
2. Multiple webcams: similar to the minimal-labels approach, but with multiple webcams at different positions. This has been implemented in a prototype system with 12 cameras (Hereld et al. 2017) and in the ALICE system with six DSLR cameras (Price et al. 2018).
3. Imaging of units: similar to the multiple-webcams approach, but imaging the entire unit ("units" are small boxes or trays contained in drawers of collection cabinets and are used in most major insect collections).
4. Camera on a robot arm: image the individual specimen or the unit with a camera mounted on a robot arm to capture a large number of images from different views.
5. Camera on rails: similar to the robot-arm approach, but with the camera mounted on rails to capture the unit. A 3D model of the insects and/or units can be created, from which the labels are then extracted. This is being prototyped by the ENTODIG-3D system (Ylinampa and Saarenmaa 2019).
6. Terahertz time-gated multispectral imaging: image the individual specimen with terahertz time-gated multispectral imaging devices.

Experiments on approaches 2 and 5 are in progress and the preliminary results will be presented.


Author(s):  
Lise Stork ◽  
Andreas Weber ◽  
Katherine Wolstencroft

Biodiversity research expeditions to the globe’s most biodiverse areas have been conducted for several hundred years. Natural history museums contain a wealth of historical materials from such expeditions, but they are stored in a fragmented way. As a consequence links between the various resources, e.g., specimens, illustrations and field notes, are often lost and are not easily re-established. Natural history museums have started to use persistent identifiers for physical collection objects, such as specimens, as well as associated information resources, such as web pages and multimedia. As a result, these resources can more easily be linked, using Linked Open Data (LOD), to information sources on the web. Specimens can be linked to taxonomic backbones of data providers, e.g., the Encyclopedia Of Life (EOL), the Global Biodiversity Information Facility (GBIF), or publications with Digital Object Identifiers (DOI). For the content of biodiversity expedition archives, (e.g. field notes), no such formalisations exist. However, linking the specimens to specific handwritten notes taken in the field can increase their scientific value. Specimens are generally accompanied by a label containing the location of the site where the specimen was collected, the collector’s name and the classification. Field notes often augment the basic metadata found with specimens with important details concerning, for instance, an organism’s habitat and morphology. Therefore, inter-collection interoperability of multimodal resources is just as important as intra-collection interoperability of unimodal resources. The linking of field notes and illustrations to specimens entails a number of challenges: historical handwritten content is generally difficult to read and interpret, especially due to changing taxonomic systems, nomenclature and collection practices. 
It is vital that:
- the content is structured in a similar way to the specimens, so that links can more easily be re-established, either manually or in an automated way;
- for consolidation, the content is enriched with outgoing links to semantic resources, such as GeoNames or the Virtual International Authority File (VIAF); and
- this process is a transparent one: how links are established, why and by whom, should be stored to encourage scholarly discussion and to promote the attribution of efforts.

In order to address some of these issues, we have built a tool, the Semantic Field Book Annotator (SFB-A), that allows for the direct annotation of digitised (scanned) pages of field books and illustrations with Linked Open Data (LOD). The tool guides the user through the annotation process, so that semantic links are automatically generated in a formalised way. These annotations and links are subsequently stored in an RDF triplestore. As the use of the Darwin Core standard is considered best practice among collection managers for the digitisation of their specimens, our tool is equipped with an ontology based on Darwin Core terms, the NHC-Ontology, which extends the Darwin Semantic Web (DSW) ontology. The tool can annotate any image, be it an image of a specimen with a textual label, an illustration with a textual label or a handwritten species description. Interoperability of annotations between the various resources within a collection is therefore ensured.
Terms in the ontology are structured using the OWL Web Ontology Language. This allows for more complex tasks, such as OWL reasoning and semantic queries, and facilitates the creation of a richer knowledge base that is more amenable to research.
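The semantic queries mentioned above would typically be SPARQL queries against the triplestore. The sketch below shows one such query; the dwc: prefix is the Darwin Core terms namespace, but the graph shape (annotations carrying a dwc:verbatimLocality literal directly) is an assumption rather than the NHC-Ontology's actual modelling.

```python
# A hedged example of a SPARQL query against an annotation triplestore.
SPARQL_LOCALITIES = """
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
SELECT ?annotation ?locality
WHERE {
  ?annotation dwc:verbatimLocality ?locality .
}
"""

def is_select_query(q):
    """Trivial structural check before sending a query to an endpoint."""
    q = q.upper()
    return "SELECT" in q and "WHERE" in q

print(is_select_query(SPARQL_LOCALITIES))
```

Run against a populated store, a query like this would return every field-book page annotated with a verbatim locality, which is exactly the kind of cross-resource aggregation the ontology is meant to enable.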


Informatics ◽  
2021 ◽  
Vol 8 (3) ◽  
pp. 50
Author(s):  
Wirapong Chansanam ◽  
Kanyarat Kwiecien ◽  
Marut Buranarach ◽  
Kulthida Tuamsuk

This research was aimed at constructing a thesaurus of the ethnic groups in the Mekong River Basin: a compilation of controlled vocabularies in both Thai and English, with a digital platform that enables semantic search and linked open data. The research method involved four steps: (1) organization of knowledge content; (2) construction of the thesaurus; (3) development of a digital thesaurus platform; and (4) evaluation. The concepts and theories used in the research comprised knowledge organization, thesaurus construction, digital platform development, and system evaluation. The tool for developing the digital thesaurus was the TemaTres web application. The research results are: (1) 4273 principal terms related to the ethnic groups were compiled and classified into a hierarchy up to eight levels deep; of these, 2596 were found to have hierarchical relationships and 6858 had associative relationships; (2) the digital thesaurus platform was able to manage the controlled vocabularies related to the Mekong ethnic groups by storing both Thai and English vocabularies. When a term is retrieved, its details are displayed: broader terms, narrower terms, related terms, cross references, and scope notes. Thus, semantic search is viable through applications, linked open data technology, and web services.
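The relation types described above (broader/narrower hierarchy, related terms, scope notes) can be modelled very simply. The entries below are invented examples, not terms from the Mekong thesaurus, and the structure is a generic sketch rather than TemaTres's internal schema.

```python
# Minimal thesaurus structure with the standard relation types.
thesaurus = {
    "ethnic group": {"broader": [], "narrower": ["highland group"],
                     "related": ["language family"],
                     "scope_note": "Top-level concept (invented example)."},
    "highland group": {"broader": ["ethnic group"], "narrower": [],
                       "related": [], "scope_note": ""},
}

def broader_chain(term):
    """Walk broader-term links from a term to the top of the hierarchy."""
    chain = []
    while thesaurus[term]["broader"]:
        term = thesaurus[term]["broader"][0]
        chain.append(term)
    return chain

print(broader_chain("highland group"))
```

Walking the broader chain like this is what lets a semantic search expand a query upward or downward through the eight-level hierarchy instead of matching only the literal term entered.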


ZooKeys ◽  
2012 ◽  
Vol 209 ◽  
pp. 75-86 ◽  
Author(s):  
Riitta Tegelberg ◽  
Jaana Haapala ◽  
Tero Mononen ◽  
Mika Pajari ◽  
Hannu Saarenmaa

Digitarium is a joint initiative of the Finnish Museum of Natural History and the University of Eastern Finland. It was established in 2010 as a dedicated shop for the large-scale digitisation of natural history collections. Digitarium offers service packages based on the digitisation process, including tagging, imaging, data entry, georeferencing, filtering, and validation. During the process, all specimens are imaged, and distance workers take care of the data entry from the images. The customer receives the data in Darwin Core Archive format, as well as images of the specimens and their labels. Digitarium also offers the option of publishing images through Morphbank, sharing data through GBIF, and archiving data for long-term storage. Service packages can also be designed on demand to respond to the specific needs of the customer. The paper also discusses logistics, costs, and intellectual property rights (IPR) issues related to the work that Digitarium undertakes.
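Since Digitarium delivers data in Darwin Core Archive format, a customer-side consumer needs little more than a zip reader. The sketch below is a deliberately minimal reader: it assumes a tab-delimited core file with a header row of Darwin Core terms, whereas a full reader would parse the archive's meta.xml for the actual delimiter and term-to-column mapping.

```python
import csv
import io
import zipfile

def read_dwca_core(path_or_buf, core_name="occurrence.txt"):
    """Minimal Darwin Core Archive reader: yields core records as dicts.
    Assumes a tab-delimited core file with a header row; meta.xml is
    ignored in this sketch."""
    with zipfile.ZipFile(path_or_buf) as z:
        with z.open(core_name) as f:
            text = io.TextIOWrapper(f, encoding="utf-8")
            yield from csv.DictReader(text, delimiter="\t")

# Build a toy archive in memory to demonstrate the reader.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("occurrence.txt",
               "occurrenceID\tscientificName\n123\tPapilio machaon\n")
records = list(read_dwca_core(buf))
print(records[0]["scientificName"])  # Papilio machaon
```

In practice one would read the archive file delivered by the digitisation service directly from disk; the in-memory archive here only keeps the example self-contained.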


Author(s):  
Abraham Nieva de la Hidalga ◽  
Nicolas Cazenave ◽  
Donat Agosti ◽  
Zhengzhe Wu ◽  
Mathias Dillen ◽  
...  

Digitisation of Natural History Collections (NHC) has evolved from transcription of specimen catalogues in databases to web portals providing access to data, digital images, and 3D models of specimens. These portals increase global accessibility to specimens and help preserve the physical specimens by reducing their handling. The size of the NHC requires developing high-throughput digitisation workflows, as well as research into novel acquisition systems, image standardisation, curation, preservation, and publishing. Nowadays, herbarium sheet digitisation workflows (and fast digitisation stations) can digitise up to 6,000 specimens per day. Operating those digitisation stations in parallel, can increase the digitisation capacity. The high-resolution images obtained from these specimens, and their volume require substantial bandwidth, and disk space and tapes for storage of original digitised materials, as well as availability of computational processing resources for generating derivatives, information extraction, and publishing. While large institutions have dedicated digitisation teams that manage the whole workflow from acquisition to publishing, other institutions cannot dedicate resources to support all digitisation activities, in particular long-term storage. National and European e-infrastructures can provide an alternative solution by supporting different parts of the digitisation workflows. In the context of the Innovation and consolidation for large scale digitisation of natural heritage (ICEDIG Project 2018), three different e-infrastructures providing long-term storage have been analysed through three pilot studies: EUDAT-CINES, Zenodo, and National Infrastructures. 
The EUDAT-CINES pilot centred on transferring large digitised herbarium collections from the National Museum of Natural History France (MNHN) to the storage infrastructure provided by the Centre Informatique National de l'Enseignement Supérieur (CINES 2014), a European trusted digital repository. The upload, processing, and access services are supported by a combination of services provided by the European Collaborative Data Infrastructure (EUDAT CDI 2019) and CINES. The Zenodo pilot included the upload of herbarium collections from Meise Botanic Garden (APM) and other European herbaria into the Zenodo repository (Zenodo 2019). The upload, processing and access services are supported by Zenodo services, accessed by APM. The National Infrastructures pilot facilitated the upload of digital assets derived from specimens of herbarium and entomology collections held at the Finnish Museum of Natural History (LUOMUS) into the Finnish Biodiversity Information Facility (FinBIF 2019). This pilot concentrates on simplifying the integration of digitisation facilities into Finnish national e-infrastructures, using services developed by LUOMUS to access FinBIF resources. The data models employed in the pilots allow data schemas to be defined according to the types of collection and specimen images stored. For EUDAT-CINES, data were composed of the specimen data and its business metadata (those that the institution making the deposit, in this case MNHN, considers relevant for the data objects being stored), enhanced by archiving metadata added during the archiving process (institution, licensing, identifiers, project, archiving date, etc.). EUDAT uses ePIC identifiers (ePIC 2019) to identify each deposit. The Zenodo pilot was designed to allow the definition of specimen data and metadata supporting indexing and access to resources.
Zenodo uses DataCite Digital Object Identifiers (DOI) and the underlying data types as the main identifiers for the resources, augmented with fields based on standard TDWG vocabularies. FinBIF compiles Finnish biodiversity information into one single service for open-access sharing. In FinBIF, HTTP-URI-based identifiers are used for all data, which link the specimen data with other information, such as images. The pilot infrastructure design reports describe features, capacities, functions and costs for each model in three specific contexts relevant to the implementation of the Distributed Systems of Scientific Collections (DiSSCo 2019) research infrastructure, informing the options for long-term storage and archiving of digitised specimen data. The explored options allow preservation of assets and support easy access. In a wider context, the results provide a template for service evaluation in the European Open Science Cloud (EOSC 2019), which can guide similar efforts.
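For the Zenodo pilot, a deposit's metadata is a JSON document sent to the Zenodo REST API. The builder below is a sketch: the top-level field names (title, upload_type, creators, access_right, license) are standard Zenodo deposit metadata, but packing Darwin Core terms such as catalogNumber into keywords is an assumption here, not the pilot's documented schema.

```python
def zenodo_metadata(title, creators, institution, barcode):
    """Build a deposit-metadata payload for a digitised herbarium sheet.
    The catalogNumber-in-keywords convention is illustrative only."""
    return {"metadata": {
        "title": title,
        "upload_type": "image",
        "image_type": "photo",
        "creators": [{"name": c, "affiliation": institution}
                     for c in creators],
        "keywords": ["herbarium sheet", f"catalogNumber:{barcode}"],
        "access_right": "open",
        "license": "cc-by",
    }}

payload = zenodo_metadata("Herbarium sheet BR0000123",
                          ["Curator, A."], "Meise Botanic Garden",
                          "BR0000123")
print(payload["metadata"]["title"])
```

Posting this payload to the depositions endpoint (with an access token) and then uploading the image file would complete a deposit; Zenodo mints the DataCite DOI on publication.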


Author(s):  
Jeremy Miller ◽  
Donat Agosti ◽  
Marcus Guidoti ◽  
Francisco Andres Rivera Quiroz

Citing the specimens used to describe new species or augment existing taxa is integral to the scholarship of taxonomic and related biodiversity-oriented publications. These so-called material citations (Darwin Core Term MaterialCitation), linked to the natural history collections in which they are archived, are the mechanism by which readers may return to the source material upon which reported observations are based. This is integral to the scientific nature of the project of documenting global biodiversity. Material citation records typically contain such information as the location and date associated with the collection of a specimen, along with other data, and taxonomic identification. Thus, material citations are a key line of evidence for biodiversity informatics, along with other evidence classes such as database records of specimens archived in natural history collections, human observations not linked to specimens, and DNA sequences that may or may not be linked to a specimen. Natural history collections are not completely databased and records of some occurrences are only available as material citations. In other cases, material citations can be linked to the record of the physical specimen in a collections database. Taxonomic treatments, sections of publications documenting the features or distribution of a related group of organisms (Catapano 2019), may contain citations of DNA sequences, which can be linked to database records. There is potential for bidirectional linking that could contribute data elements or entire records to collections and DNA databases, based on content found in material citations. We compare material citations data to other major sources of biodiversity records (preserved specimens, human observations, and material samples). We present pilot project data that reconcile material citations with their database records, and track all material citations across the taxonomic history of a species.
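Reconciling a material citation with candidate occurrence records comes down to comparing the shared fields. The sketch below uses Darwin Core field names, but the additive scoring rule is an invented heuristic, not the pilot project's actual matching method, and the records are fabricated examples.

```python
def match_score(citation, record):
    """Score how well an occurrence record matches a material citation
    on collector, date and locality (simple illustrative heuristic)."""
    score = 0
    if citation["recordedBy"] == record.get("recordedBy"):
        score += 1
    if citation["eventDate"] == record.get("eventDate"):
        score += 1
    if citation["locality"].lower() in record.get("locality", "").lower():
        score += 1
    return score

citation = {"recordedBy": "J. Miller", "eventDate": "1999-03-14",
            "locality": "Leiden"}
records = [
    {"id": "occ1", "recordedBy": "J. Miller", "eventDate": "1999-03-14",
     "locality": "Leiden, Netherlands"},
    {"id": "occ2", "recordedBy": "D. Agosti", "eventDate": "2001-01-01",
     "locality": "Bern"},
]
best = max(records, key=lambda r: match_score(citation, r))
print(best["id"])  # occ1
```

Real reconciliation would also need fuzzy name matching and date-range handling, since material citations often abbreviate collector names and report imprecise dates; a threshold on the score would separate confident links from candidates needing review.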


Author(s):  
Serena Sorrentino ◽  
Sonia Bergamaschi ◽  
Elisa Fusari ◽  
Domenico Beneventano
