Robustifying Scholia: paving the way for knowledge discovery and research assessment through Wikidata

Research Ideas and Outcomes ◽

10.3897/rio.5.e35820 ◽

2019 ◽

Vol 5 ◽

Cited By ~ 2

Author(s):

Lane Rasberry ◽

Egon Willighagen ◽

Finn Nielsen ◽

Daniel Mietchen

Keyword(s):

Knowledge Discovery ◽

Large Scale ◽

Knowledge Workers ◽

Open Data ◽

Linked Open Data ◽

Data Curation ◽

Research Assessment ◽

Knowledge Graph ◽

Incomplete Datasets ◽

Active Contributor

Knowledge workers like researchers, students, journalists, research evaluators or funders need tools to explore what is known, how it was discovered, who made which contributions, and where the scholarly record has gaps. Existing tools and services of this kind are not available as Linked Open Data, but Wikidata is. It has the technology, active contributor base, and content to build a large-scale knowledge graph for scholarship, also known as WikiCite. Scholia visualizes this graph in an exploratory interface with profiles and links to the literature. However, it is just a working prototype. This project aims to "robustify Scholia" with back-end development and testing based on pilot corpora. The main objective at this stage is to attain stability in challenging cases such as server throttling and handling of large or incomplete datasets. Further goals include integrating Scholia with data curation and manuscript writing workflows, serving more languages, generating usage stats, and documentation.

Download Full-text

The Pensoft Data Publishing Workflow: The FAIRway from articles to Linked Open Data

Biodiversity Information Science and Standards ◽

10.3897/biss.3.35902 ◽

2019 ◽

Vol 3 ◽

Author(s):

Lyubomir Penev ◽

Teodor Georgiev ◽

Viktor Senderov ◽

Mariya Dimitrova ◽

Pavel Stoev

Keyword(s):

Open Data ◽

Structured Data ◽

Linked Open Data ◽

Data Publishing ◽

Knowledge Graph ◽

Supplementary File ◽

Biodiversity Data ◽

Text Format ◽

Biodiversity Knowledge ◽

Data Elements

As one of the first advocates of open access and open data in the field of biodiversity publishiing, Pensoft has adopted a multiple data publishing model, resulting in the ARPHA-BioDiv toolbox (Penev et al. 2017). ARPHA-BioDiv consists of several data publishing workflows and tools described in the Strategies and Guidelines for Publishing of Biodiversity Data and elsewhere: Data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details. Data deposited in trusted repositories and/or supplementary files and described in data papers; data papers may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata. Integrated narrative and data publishing realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article. Data published in structured, semanticaly enriched, full-text XMLs, so that several data elements can thereafter easily be harvested by machines. Linked Open Data (LOD) extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph. Data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details. Data deposited in trusted repositories and/or supplementary files and described in data papers; data papers may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata. Integrated narrative and data publishing realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article. Data published in structured, semanticaly enriched, full-text XMLs, so that several data elements can thereafter easily be harvested by machines. Linked Open Data (LOD) extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph. The above mentioned approaches are supported by a whole ecosystem of additional workflows and tools, for example: (1) pre-publication data auditing, involving both human and machine data quality checks (workflow 2); (2) web-service integration with data repositories and data centres, such as Global Biodiversity Information Facility (GBIF), Barcode of Life Data Systems (BOLD), Integrated Digitized Biocollections (iDigBio), Data Observation Network for Earth (DataONE), Long Term Ecological Research (LTER), PlutoF, Dryad, and others (workflows 1,2); (3) semantic markup of the article texts in the TaxPub format facilitating further extraction, distribution and re-use of sub-article elements and data (workflows 3,4); (4) server-to-server import of specimen data from GBIF, BOLD, iDigBio and PlutoR into manuscript text (workflow 3); (5) automated conversion of EML metadata into data paper manuscripts (workflow 2); (6) export of Darwin Core Archive and automated deposition in GBIF (workflow 3); (7) submission of individual images and supplementary data under own DOIs to the Biodiversity Literature Repository, BLR (workflows 1-3); (8) conversion of key data elements from TaxPub articles and taxonomic treatments extracted by Plazi into RDF handled by OpenBiodiv (workflow 5). These approaches represent different aspects of the prospective scholarly publishing of biodiversity data, which in a combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, lay the ground of an entire data publishing ecosystem for biodiversity, supplying FAIR (Findable, Accessible, Interoperable and Reusable data to several interoperable overarching infrastructures, such as GBIF, BLR, Plazi TreatmentBank, OpenBiodiv and various end users.

Download Full-text

Constructing Biomedical Knowledge Graph Based on SemMedDB and Linked Open Data

2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm.2018.8621568 ◽

2018 ◽

Cited By ~ 6

Author(s):

Qing Cong ◽

Zhiyong Feng ◽

Fang Li ◽

Li Zhang ◽

Guozheng Rao ◽

...

Keyword(s):

Open Data ◽

Linked Open Data ◽

Knowledge Graph ◽

Biomedical Knowledge

Download Full-text

Evolving a rapid prototyping environment for visually and analytically exploring large-scale Linked Open Data

2011 IEEE Symposium on Large Data Analysis and Visualization ◽

10.1109/ldav.2011.6092338 ◽

2011 ◽

Cited By ~ 1

Author(s):

Marc Downie ◽

Paul Kaiser ◽

Dylan Enloe ◽

Peter Fox ◽

James Hendler ◽

...

Keyword(s):

Rapid Prototyping ◽

Large Scale ◽

Open Data ◽

Linked Open Data

Download Full-text

A Compact Hybrid Evolutionary Algorithm for Large Scale Instance Matching in Linked Open Data Cloud

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213017500130 ◽

2017 ◽

Vol 26 (04) ◽

pp. 1750013 ◽

Cited By ~ 1

Author(s):

Xingsi Xue ◽

Jianhua Liu

Keyword(s):

Evolutionary Algorithm ◽

Large Scale ◽

Open Data ◽

Linked Open Data ◽

Matching Task ◽

Matching Problem ◽

Matching Problems ◽

Hybrid Evolutionary Algorithm ◽

Instance Matching ◽

Real World Datasets

Establishing correct links among the coreference ontology instances is critical to the success of Linked Open Data (LOD) cloud. However, because of the high level heterogeneity and large scale instance set, matching the coreference instances in LOD cloud is an error prone and time consuming task. To this end, in this work, we present an asymmetrical profile-based similarity measure for instance matching task, construct new optimal models for schema-level and instance-level matching problems, and propose a compact hybrid evolutionary algorithm based ontology matching approach to solve the large scale instance matching problem in LOD cloud. Finally, the experimental results of comprising our approach with the states of the art systems on the instance matching track of OAEI 2015 and real-world datasets show the effectiveness of our approach.

Download Full-text

Medieval manuscripts and their migrations: Using SPARQL to investigate the research potential of an aggregated Knowledge Graph

Digital Medievalist ◽

10.16995/dm.8064 ◽

2021 ◽

Author(s):

Hanno Wijsman ◽

Toby Burrows ◽

Laura Cleaver ◽

Doug Emery ◽

Eero Hyvönen ◽

...

Keyword(s):

Query Language ◽

Open Data ◽

Literacy Skills ◽

Linked Open Data ◽

Knowledge Graph ◽

Manuscript Culture ◽

The People ◽

Computer Scientists ◽

Data Environment ◽

Manuscript Description

Although the RDF query language SPARQL has a reputation for being opaque and difficult for traditional humanists to learn, it holds great potential for opening up vast amounts of Linked Open Data to researchers willing to take on its challenges. This is especially true in the field of premodern manuscripts studies as more and more datasets relating to the study of manuscript culture are made available online. This paper explores the results of a two-year long process of collaborative learning and knowledge transfer between the computer scientists and humanities researchers from the Mapping Manuscript Migrations (MMM) project to learn and apply SPARQL to the MMM dataset. The process developed into a wider investigation of the use of SPARQL to analyse the data, refine research questions, and assess the research potential of the MMM aggregated dataset and its Knowledge Graph. Through an examination of a series of six SPARQL query case studies, this paper will demonstrate how the process of learning and applying SPARQL to query the MMM dataset returned three important and unexpected results: 1) a better understanding of a complex and imperfect dataset in a Linked Open Data environment, 2) a better understanding of how manuscript description and associated data involving the people and institutions involved in the production, reception, and trade of premodern manuscripts needs to be presented to better facilitate computational research, and 3) an awareness of need to further develop data literacy skills among researchers in order to take full advantage of the wealth of unexplored data now available to them in the Semantic Web.

Download Full-text

Anytime Large-Scale Analytics of Linked Open Data

Lecture Notes in Computer Science - The Semantic Web – ISWC 2019 ◽

10.1007/978-3-030-30793-6_33 ◽

2019 ◽

pp. 576-592 ◽

Cited By ~ 2

Author(s):

Arnaud Soulet ◽

Fabian M. Suchanek

Keyword(s):

Large Scale ◽

Open Data ◽

Linked Open Data

Download Full-text

Ein Knowledge Graph für wissenschaftliche Sammlungen : Generierung von Linked Open Data für heterogene museale Sammlungen auf der Basis des ASCH-Modells

10.15771/ma_2019_1 ◽

2019 ◽

Author(s):

◽

Antje Niemann

Keyword(s):

Open Data ◽

Linked Open Data ◽

Knowledge Graph

Download Full-text

Chapter 9. Digging into the Mensural Music Knowledge Graph. Renaissance Polyphony meets Linked Open Data

10.5771/9783956506611-167 ◽

2021 ◽

pp. 167-180

Author(s):

Richard P. Smiraglia ◽

James Bradford Young ◽

Marnix van Berchum

Keyword(s):

Open Data ◽

Linked Open Data ◽

Knowledge Graph ◽

Renaissance Polyphony

Download Full-text

Protein ontology on the semantic web for knowledge discovery

Scientific Data ◽

10.1038/s41597-020-00679-9 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Chuming Chen ◽

Hongzhan Huang ◽

Karen E. Ross ◽

Julie E. Cowart ◽

Cecilia N. Arighi ◽

...

Keyword(s):

Semantic Web ◽

Knowledge Discovery ◽

Open Data ◽

Data Access ◽

Linked Open Data ◽

Biological Knowledge ◽

Disease Associations ◽

Fair Principles ◽

Description Framework ◽

Resource Description

Abstract The Protein Ontology (PRO) provides an ontological representation of protein-related entities, ranging from protein families to proteoforms to complexes. Protein Ontology Linked Open Data (LOD) exposes, shares, and connects knowledge about protein-related entities on the Semantic Web using Resource Description Framework (RDF), thus enabling integration with other Linked Open Data for biological knowledge discovery. For example, proteins (or variants thereof) can be retrieved on the basis of specific disease associations. As a community resource, we strive to follow the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles, disseminate regular updates of our data, support multiple methods for accessing, querying and downloading data in various formats, and provide documentation both for scientists and programmers. PRO Linked Open Data can be browsed via faceted browser interface and queried using SPARQL via YASGUI. RDF data dumps are also available for download. Additionally, we developed RESTful APIs to support programmatic data access. We also provide W3C HCLS specification compliant metadata description for our data. The PRO Linked Open Data is available at https://lod.proconsortium.org/.

Download Full-text

Early Steps Towards Web Scale Information Extraction with LODIE

AI Magazine ◽

10.1609/aimag.v36i1.2567 ◽

2015 ◽

Vol 36 (1) ◽

pp. 55-64 ◽

Cited By ~ 1

Author(s):

Anna Lisa Gentile ◽

Ziqi Zhang ◽

Fabio Ciravegna

Keyword(s):

Information Extraction ◽

Information Needs ◽

Large Scale ◽

Open Data ◽

Linked Open Data ◽

Extraction Techniques ◽

Wrapper Induction ◽

Textual Data ◽

Core Idea ◽

Structured Representation

Information extraction (IE) is the technique for transforming unstructured textual data into structured representation that can be understood by machines. The exponential growth of the Web generates an exceptional quantity of data for which automatic knowledge capture is essential. This work describes the methodology for web scale information extraction in the LODIE project (linked open data information extraction) and highlights results from the early experiments carried out in the initial phase of the project. LODIE aims to develop information extraction techniques able to scale at web level and adapt to user information needs. The core idea behind LODIE is the usage of linked open data, a very large-scale information resource, as a ground-breaking solution for IE, which provides invaluable annotated data on a growing number of domains. This article has two objectives. First, describing the LODIE project as a whole and depicting its general challenges and directions. Second, describing some initial steps taken towards the general solution, focusing on a specific IE subtask, wrapper induction.

Download Full-text