Semantic Annotation of Botanical Collection Data

Author(s):  
Dominik Röpert ◽  
Fabian Reimeier ◽  
Jörg Holetschek ◽  
Anton Güntsch

Herbarium specimens have been digitized at the Botanical Garden and Botanical Museum, Berlin (BGBM) since the year 2000. As part of the digitization process, specimen data have been recorded manually for specific basic data elements. Additional elements were usually added later based on the digital images. During the last twenty years, data were transcribed exactly as they were written on the labels, a widely used procedure in European herbaria. This approach led to a large number of orthographic variations especially with regard to person and place names. To improve interoperability between records within our own collection database and across collection databases provided by the community, we have started to enrich our metadata with Linked Open Data (LOD)-based links to semantic resources starting with collectors and geographic entities. Preferred resources for semantic enrichment (e.g., WikiData, GeoNames) have been agreed on by members of the Consortium of European Taxonomic Facilities (CETAF) in order to exploit the potential of semantically enriched collection data in the best possible way. To be able to annotate many collection records in a relatively short time, priority was given to concepts (e.g., specific collector names) that occur on many specimen labels and that have an existing and easy-to-find semantic representation in an external resource. With this approach, we were able to annotate 52,000 specimen records in just a few weeks of working time of a student assistant. The integration of our semantic annotation workflows with other data integration, cleaning, and import processes at the BGBM is carried out using an OpenRefine-based platform with specific extensions for services and functions related to label transcription activities (Kirchhoff et al. 2018). 
Our semantically enriched collection data will contribute to a “Botany Pilot,” which is presently being developed by member organizations of CETAF to demonstrate the potential of Linked Open Collection Data and their integration with existing semantic resources.
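The orthographic variation in collector names mentioned above is exactly what makes automated linking hard. The following is a minimal sketch of the kind of name-normalisation step that can precede a lookup in a semantic resource such as Wikidata; the lookup table and entity URI below are placeholders for illustration, not BGBM data or a real QID.

```python
import unicodedata
from typing import Optional

def normalize(name: str) -> str:
    """Lowercase, strip diacritics and punctuation, and sort name tokens
    so orthographic variants of the same collector compare equal."""
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower())
    return " ".join(sorted(cleaned.split()))

def match_collector(label_name: str, known: dict) -> Optional[str]:
    """Return the semantic identifier for a transcribed collector name,
    or None if no known variant matches."""
    target = normalize(label_name)
    for canonical, identifier in known.items():
        if normalize(canonical) == target:
            return identifier
    return None

# Placeholder lookup: canonical collector name -> Wikidata entity URI.
collectors = {"Carl Ludwig Willdenow": "http://www.wikidata.org/entity/Q000000"}

print(match_collector("Willdenow, Carl Ludwig", collectors))
```

In practice such a matcher only pre-filters candidates; the final link to the external resource still needs human confirmation, which is why prioritising frequent names pays off.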

Author(s):  
Stefano Pacifico ◽  
Janez Starc ◽  
Janez Brank ◽  
Luka Bradesko ◽  
Marko Grobelnik

Author(s):  
Lyubomir Penev ◽  
Teodor Georgiev ◽  
Viktor Senderov ◽  
Mariya Dimitrova ◽  
Pavel Stoev

As one of the first advocates of open access and open data in the field of biodiversity publishing, Pensoft has adopted a multiple data publishing model, resulting in the ARPHA-BioDiv toolbox (Penev et al. 2017). ARPHA-BioDiv consists of several data publishing workflows and tools described in the Strategies and Guidelines for Publishing of Biodiversity Data and elsewhere:

Data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details.
Data are deposited in trusted repositories and/or supplementary files and described in data papers; data papers may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata.
Integrated narrative and data publishing is realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article.
Data are published in structured, semantically enriched, full-text XMLs, so that several data elements can thereafter easily be harvested by machines.
Linked Open Data (LOD) are extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph.

The above-mentioned approaches are supported by a whole ecosystem of additional workflows and tools, for example: (1) pre-publication data auditing, involving both human and machine data quality checks (workflow 2); (2) web-service integration with data repositories and data centres, such as the Global Biodiversity Information Facility (GBIF), Barcode of Life Data Systems (BOLD), Integrated Digitized Biocollections (iDigBio), Data Observation Network for Earth (DataONE), Long Term Ecological Research (LTER), PlutoF, Dryad, and others (workflows 1, 2); (3) semantic markup of the article texts in the TaxPub format, facilitating further extraction, distribution and re-use of sub-article elements and data (workflows 3, 4); (4) server-to-server import of specimen data from GBIF, BOLD, iDigBio and PlutoF into manuscript text (workflow 3); (5) automated conversion of EML metadata into data paper manuscripts (workflow 2); (6) export of Darwin Core Archives and automated deposition in GBIF (workflow 3); (7) submission of individual images and supplementary data under their own DOIs to the Biodiversity Literature Repository, BLR (workflows 1-3); (8) conversion of key data elements from TaxPub articles and taxonomic treatments extracted by Plazi into RDF handled by OpenBiodiv (workflow 5).
These approaches represent different aspects of the prospective scholarly publishing of biodiversity data which, in combination with the text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, lay the groundwork for an entire data publishing ecosystem for biodiversity, supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as GBIF, BLR, Plazi TreatmentBank and OpenBiodiv, as well as to various end users.
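As an illustration of the LOD extraction step, a record mined from an article can be serialised as RDF triples. The following is a minimal sketch in N-Triples form; it uses generic Dublin Core properties and invented example URIs for brevity, not the actual OpenBiodiv-O terms.

```python
def article_to_ntriples(article_uri: str, title: str, taxon_uris: list) -> str:
    """Serialise a few statements about a published article as N-Triples.
    Dublin Core properties stand in here for the OpenBiodiv-O ontology."""
    escaped = title.replace('"', '\\"')
    lines = [f'<{article_uri}> <http://purl.org/dc/terms/title> "{escaped}" .']
    for taxon in taxon_uris:
        lines.append(f'<{article_uri}> <http://purl.org/dc/terms/subject> <{taxon}> .')
    return "\n".join(lines)

triples = article_to_ntriples(
    "http://example.org/article/1",
    "A new species of Magnolia",
    ["http://example.org/taxon/magnolia-sinica"],
)
print(triples)
```

Once in triple form, such statements can be loaded into any RDF store and queried alongside the rest of a knowledge graph.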


Author(s):  
Indira Lanza-Cruz ◽  
Rafael Berlanga ◽  
María José Aramburu

Social Business Intelligence (SBI) enables companies to capture strategic information from public social networks. Contrary to traditional Business Intelligence (BI), SBI has to face the high dynamicity of both social network contents and the company's analytical requests, as well as an enormous amount of noisy data. Effective exploitation of these continuous data sources requires efficient processing of the streamed data so that they can be semantically shaped into insightful facts. In this paper, we propose a multidimensional formalism to represent and evaluate social indicators directly from fact streams derived, in turn, from social network data. This formalism relies on two main aspects: the semantic representation of facts via Linked Open Data and the support of OLAP-like multidimensional analysis models. Contrary to traditional BI formalisms, we start the process by modeling the required social indicators according to the strategic goals of the company. From these specifications, all the required fact streams are modeled and deployed to trace the indicators. The main advantages of this approach are the easy definition of on-demand social indicators and the treatment of changing dimensions and metrics through streamed facts. We demonstrate its usefulness with a real use case in the automotive sector.
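A fact stream of this kind can be reduced to indicator values with an OLAP-style roll-up over the chosen dimensions. The following is a minimal sketch; the dimension and measure names are invented for illustration and do not come from the paper.

```python
from collections import defaultdict

def roll_up(fact_stream, dimensions, measure):
    """Aggregate a stream of fact dicts into indicator values grouped by
    the requested dimensions (a minimal OLAP-style roll-up)."""
    totals = defaultdict(float)
    for fact in fact_stream:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)

# Invented example facts derived from social network posts.
facts = [
    {"brand": "A", "day": "mon", "mentions": 3},
    {"brand": "A", "day": "tue", "mentions": 5},
    {"brand": "B", "day": "mon", "mentions": 2},
]
print(roll_up(facts, ["brand"], "mentions"))  # {('A',): 8.0, ('B',): 2.0}
```

Because grouping keys are computed per fact, the same function tolerates dimensions appearing or changing mid-stream, which is the "changing dimensions" property the abstract emphasises.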


2019 ◽  
Vol 2 ◽  
Author(s):  
Lyubomir Penev

"Data ownership" is actually an oxymoron: there can be no copyright (ownership) of facts or ideas, hence no data ownership rights or law exist. The term refers to various kinds of data protection instruments: Intellectual Property Rights (IPR) (mostly copyright) asserted to indicate some kind of data ownership, confidentiality clauses/rules, database right protection (in the European Union only), or personal data protection (GDPR) (Scassa 2018). Data protection is often realised via different mechanisms of "data hoarding", that is, withholding access to data for various reasons (Sieber 1989). Data hoarding, however, does not put the data into someone's ownership. Nonetheless, access to and re-use of data, and biodiversity data in particular, are hampered by technical, economic, sociological, legal and other factors, although no formal legal provisions related to copyright should prevent those who need the data from using them (Egloff et al. 2014, Egloff et al. 2017; see also the Bouchout Declaration). One of the best ways to provide access to data is to publish them so that the data creators and holders are credited for their efforts. As one of the pioneers in biodiversity data publishing, Pensoft has adopted a multiple-approach data publishing model, resulting in the ARPHA-BioDiv toolbox and in extensive Strategies and Guidelines for Publishing of Biodiversity Data (Penev et al. 2017a, Penev et al. 2017b). ARPHA-BioDiv consists of several data publishing workflows:

Deposition of underlying data in an external repository and/or their publication as supplementary file(s) to the related article, which are then linked and/or cited in-text; supplementary files are published under their own DOIs to increase citability.
Description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the system allows data papers to be submitted either as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata.
Import of structured data into the article text from tables or via web services, and their subsequent download/distribution from the published article, as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal.
Publication of data in structured, semantically enriched, full-text XMLs where data elements are machine-readable and easy to harvest.
Extraction of Linked Open Data (LOD) from literature, which is then converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph.

In combination with the text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, these approaches show different angles on the future of biodiversity data publishing and lay the foundations of an entire data publishing ecosystem in the field, while also supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as the Global Biodiversity Information Facility (GBIF), the Biodiversity Literature Repository (BLR), Plazi TreatmentBank and OpenBiodiv, as well as to various end users.
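The EML-to-manuscript conversion mentioned above can be pictured as a simple metadata transformation. The following is a toy sketch over a simplified, namespace-free EML fragment; real EML documents are namespaced and much richer, and the manuscript format here is invented for illustration.

```python
import xml.etree.ElementTree as ET

# A simplified EML fragment (real EML uses XML namespaces and more elements).
eml = """<eml><dataset>
  <title>Moss occurrences of the Harz</title>
  <creator><individualName><surName>Example</surName></individualName></creator>
  <abstract><para>Occurrence records collected 2010-2015.</para></abstract>
</dataset></eml>"""

def eml_to_manuscript(eml_text: str) -> str:
    """Turn dataset metadata into a minimal data-paper manuscript skeleton."""
    root = ET.fromstring(eml_text)
    title = root.findtext("dataset/title")
    author = root.findtext("dataset/creator/individualName/surName")
    abstract = root.findtext("dataset/abstract/para")
    return f"# {title}\n\nAuthor: {author}\n\nAbstract: {abstract}\n"

print(eml_to_manuscript(eml))
```

The appeal of this workflow is that the metadata authors already wrote for a repository deposit becomes the first draft of a citable publication.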


2021 ◽  
Vol 48 (3) ◽  
pp. 231-247
Author(s):  
Xu Tan ◽  
Xiaoxi Luo ◽  
Xiaoguang Wang ◽  
Hongyu Wang ◽  
Xilong Hou

Digital images of cultural heritage (CH) contain rich semantic information. However, today’s semantic representations of CH images fail to fully reveal the content entities and context within these vital surrogates. This paper draws on the fields of image research and digital humanities to propose a systematic methodology and a technical route for semantic enrichment of CH digital images. This new methodology systematically applies a series of procedures including: semantic annotation, entity-based enrichment, establishing internal relations, event-centric enrichment, defining hierarchy relations between properties, text annotation, and finally, named entity recognition, in order to ultimately provide fine-grained contextual semantic content disclosure. The feasibility and advantages of the proposed semantic enrichment methods for semantic representation are demonstrated via a visual display platform for digital images of CH, built to represent the Wutai Mountain Map, a typical Dunhuang mural. This study proves that semantic enrichment offers a promising new model for exposing content at a fine-grained level and establishing a rich semantic network centered on the content of digital images of CH.
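One way to picture the annotation layers described above is as a small data model linking image regions, entities and events. The following is a hypothetical sketch; the class and field names are ours, invented for illustration, not the paper's schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class EntityAnnotation:
    """A content entity recognised in a region of a CH image."""
    region: Tuple[int, int, int, int]  # (x, y, width, height) in pixels
    label: str                         # surface label, e.g. "pagoda"
    entity_uri: str                    # link to an external semantic resource

@dataclass
class EventAnnotation:
    """Event-centric enrichment tying several entities together."""
    event_type: str
    participants: List[EntityAnnotation] = field(default_factory=list)

# Invented example: an entity and an event it participates in.
pagoda = EntityAnnotation((120, 40, 200, 310), "pagoda",
                          "http://example.org/entity/pagoda")
pilgrimage = EventAnnotation("pilgrimage", [pagoda])
print(pilgrimage.event_type, len(pilgrimage.participants))
```

Separating entity-level from event-level records is what allows the fine-grained, context-preserving disclosure the abstract describes.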


2020 ◽  
Vol 10 (17) ◽  
pp. 5882
Author(s):  
Federico Desimoni ◽  
Sergio Ilarri ◽  
Laura Po ◽  
Federica Rollo ◽  
Raquel Trillo-Lado

Modern cities face pressing problems with transportation systems including, but not limited to, traffic congestion, safety, health, and pollution. To tackle them, public administrations have implemented roadside infrastructures such as cameras and sensors to collect data about environmental and traffic conditions. In the case of traffic sensor data, not only are the real-time data essential; historical values also need to be preserved and published. When real-time and historical data of smart cities become available, everyone can join an evidence-based debate on the city’s future evolution. The TRAFAIR (Understanding Traffic Flows to Improve Air Quality) project seeks to understand how traffic affects urban air quality. The project develops a platform to provide real-time and predicted values on air quality in several cities in Europe, encompassing tasks such as the deployment of low-cost air quality sensors, data collection and integration, modeling and prediction, the publication of open data, and the development of applications for end users and public administrations. This paper explicitly focuses on the modeling and semantic annotation of traffic data. We present the tools and techniques used in the project and validate our strategies for data modeling and its semantic enrichment over two cities: Modena (Italy) and Zaragoza (Spain). An experimental evaluation shows that our approach to publishing Linked Data is effective.
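For the semantic annotation step, a single traffic measurement can be published as Linked Data using the W3C SOSA vocabulary. The following is a minimal sketch producing N-Triples; the observation and sensor URIs are invented, and TRAFAIR's actual modelling may differ.

```python
def observation_to_ntriples(obs_uri, sensor_uri, value, timestamp):
    """Serialise one traffic count as a SOSA-style observation in N-Triples."""
    sosa = "http://www.w3.org/ns/sosa/"
    xsd = "http://www.w3.org/2001/XMLSchema#"
    rdf_type = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    return "\n".join([
        f'<{obs_uri}> <{rdf_type}> <{sosa}Observation> .',
        f'<{obs_uri}> <{sosa}madeBySensor> <{sensor_uri}> .',
        f'<{obs_uri}> <{sosa}hasSimpleResult> "{value}"^^<{xsd}integer> .',
        f'<{obs_uri}> <{sosa}resultTime> "{timestamp}"^^<{xsd}dateTime> .',
    ])

nt = observation_to_ntriples(
    "http://example.org/obs/42",
    "http://example.org/sensor/modena-7",
    128,
    "2020-05-01T08:00:00Z",
)
print(nt)
```

Typed literals for the result and the timestamp are what make historical observations queryable by time range once they are loaded into a triple store.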


Oryx ◽  
2015 ◽  
Vol 50 (3) ◽  
pp. 446-449 ◽  
Author(s):  
Bin Wang ◽  
Yongpeng Ma ◽  
Gao Chen ◽  
Congren Li ◽  
Zhiling Dao ◽  
...  

Magnolia sinica, a Critically Endangered tree endemic to Yunnan, China, is one of the 20 plant species with extremely small populations approved by the Yunnan government for urgent rescue action before 2015. Information on the geographical distribution and population size of this species had not previously been reported, hindering effective conservation. We therefore carried out a survey of the literature and of herbarium specimens, followed by a detailed field survey and morphological measurements and observations of surviving individuals. We located 52 individuals in the wild, in eight localities. Two distinguishing morphological characters (tepal colour and tepal number) were revised based on observations of all remaining wild individuals that produced flowers and on one 30-year-old flowering plant in Kunming Botanical Garden. The survival rate of individuals propagated from seed for ex situ conservation at the Garden was 100% over 5 years; of 100 individuals transplanted to each of two reinforcement sites, 20 and 18, respectively, were alive after 6 years. We propose two groups of measures to protect M. sinica: (1) in situ conservation, population monitoring, and public engagement, and (2) ex situ conservation with reinforcement or reintroduction.


Author(s):  
Felicitas Löffler ◽  
Birgitta König-Ries

Semantic annotations of datasets are very useful to support quality assurance, discovery, interpretability, linking and integration of datasets. However, providing such annotations manually is often a time-consuming task. If the process is to be at least partially automated and still provide good semantic annotations, precise information extraction is needed. The recognition of entity names (e.g., person, organization, location) from textual resources is the first step before linking the identified term or phrase to other semantic resources such as concepts in ontologies. A multitude of tools and techniques has been developed for information extraction. One of the big players is the text mining framework GATE (Cunningham et al. 2013), which supports annotation rules, semantic techniques and machine learning approaches. We will run GATE's default ANNIE pipeline on collection datasets to automatically detect persons, locations and times. We will also present extensions to extract organisms (Naderi et al. 2011), environmental terms, data parameters and biological processes, and show how to link them to ontologies and LOD resources, e.g., DBPedia (Sateli and Witte 2015). We would like to discuss the results with the conference participants and welcome comments and feedback on the current solution. The audience is also welcome to provide their own datasets in preparation for this session.
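For readers unfamiliar with such pipelines, the gazetteer-lookup stage that ANNIE performs can be imitated in a few lines. The following is a toy stand-in, not GATE itself; the entity lists are invented for illustration.

```python
import re

# Tiny gazetteer: entity type -> known surface forms (illustrative only).
GAZETTEER = {
    "PERSON": ["Alexander von Humboldt"],
    "LOCATION": ["Berlin", "Yunnan"],
}

def annotate(text):
    """Return (entity_type, surface, start, end) tuples found in text,
    sorted by position, via exact gazetteer matching."""
    hits = []
    for etype, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(re.escape(name), text):
                hits.append((etype, name, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])

print(annotate("Specimens collected by Alexander von Humboldt near Berlin."))
```

Real pipelines add tokenisation, part-of-speech tagging and disambiguation on top of this lookup, which is why frameworks like GATE exist; the character offsets returned here are what a subsequent linking step would attach ontology concepts to.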


Author(s):  
É. B. Héthelyi ◽  
L. Gy. Szabó ◽  
K. Korány

About 150 species of the genus Nepeta live in the temperate zone. Our investigations covered the volatile oil-containing species of the genus endemic in Hungary, Nepeta cataria and Nepeta parviflora; the latter is a relict of the ancient steppe flora and endemic in Hungary as well. Phytochemical examination of the volatile oil-containing plant material has also been carried out. Catnip growing in the Botanical Garden of the PTE Department of Botany contained 0.67% volatile oil in May and 0.14% in November. The chemical character of the volatile oils was measured by gas chromatography/mass spectrometry, and the components citronellol, citral-A, citral-B and geraniol were identified. The composition of the oil of the November samples shifted towards citronellol (65%). In both samples, the compounds (+)-cis-p-menthane-3,8-diol and (-)-trans-p-menthane-3,8-diol, which have insecticidal and repellent activity, were found in amounts of 2-2.5% and 4-4.5%, respectively. The catnip sample originating from Germany contained small amounts of anethole, citronellol, neral, geraniol and geranial (6-13%), and possibly two isomers of nepetalactone at 23-31%. The Nepeta parviflora endemic in the Nagyvolgy valley near Nagykaracsony consisted of the same compounds in the investigated years (1998-2000). Its limonene, methyl chavicol, β-caryophyllene, β-selinene, β-cubebene, davanone and germacrene-D constituents have been identified. In the year 2000, different GC percentages of these compounds were detected in the different organs of the plants. The closely related species Nepeta cataria var. citriodora contained 83% citral, and N. glechoma (= Glechoma hederacea) contained 41% α-cubebene, 20% patchoulenol and 7.7% spathulenol, respectively. These compounds were identified by gas chromatography and gas chromatography/mass spectrometry.


2012 ◽  
Vol 7 ◽  
Author(s):  
Kiril Simov ◽  
Petya Osenova

The paper describes the construction of a Bulgarian-English treebank aligned at the word and semantic levels. We consider manual word-level alignment easier and more reliable than manual alignment at the syntactic and semantic levels. Thus, after manual word-level alignment, we apply an automatic procedure for the construction of semantic-level alignments. Our work presents the main steps of this automatic procedure, which exploits the syntactic analyses of both sentences, the morphosyntactic annotation and the manual word-level alignment to produce the semantic annotation of the sentences and the semantic alignment. Last but not least, a method for the identification of potential errors is discussed, using the automatically constructed semantic analyses of the Bulgarian sentences and their comparison to the semantic representations of the English sentences.
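The projection step at the heart of such a procedure can be illustrated with a toy function: given a manual word-level alignment and semantic role labels on the source side, the labels are carried over to the aligned target words. The indices and role labels below are invented for illustration, not taken from the treebank.

```python
def project_roles(word_alignment, source_roles):
    """Project semantic role labels from source words to target words
    through a word-level alignment given as (src_idx, tgt_idx) pairs."""
    target_roles = {}
    for src, tgt in word_alignment:
        if src in source_roles:
            target_roles[tgt] = source_roles[src]
    return target_roles

# Toy alignment for a Bulgarian-English sentence pair (word indices).
alignment = [(0, 0), (1, 2), (2, 1)]
roles_bg = {0: "Agent", 2: "Patient"}
print(project_roles(alignment, roles_bg))  # {0: 'Agent', 1: 'Patient'}
```

Mismatches between the projected labels and an independent analysis of the target sentence are precisely the "potential errors" the comparison method in the abstract is designed to flag.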

