Semantic representation of scientific literature: bringing claims, contributions and named entities onto the Linked Open Data cloud

PeerJ Computer Science ◽

10.7717/peerj-cs.37 ◽

2015 ◽

Vol 1 ◽

pp. e37 ◽

Cited By ~ 14

Author(s):

Bahar Sateli ◽

René Witte

Keyword(s):

Text Mining ◽

Knowledge Base ◽

Language Processing ◽

Digital Libraries ◽

Scientific Literature ◽

Open Data ◽

Semantic Knowledge ◽

Linked Open Data ◽

Named Entities ◽

Semantic Annotations

Motivation. Finding relevant scientific literature is one of the essential tasks researchers are facing on a daily basis. Digital libraries and web information retrieval techniques provide rapid access to a vast amount of scientific literature. However, no further automated support is available that would enable fine-grained access to the knowledge ‘stored’ in these documents. The emerging domain of Semantic Publishing aims at making scientific knowledge accessible to both humans and machines, by adding semantic annotations to content, such as a publication’s contributions, methods, or application domains. However, despite the promises of better knowledge access, the manual annotation of existing research literature is prohibitively expensive for wide-spread adoption. We argue that a novel combination of three distinct methods can significantly advance this vision in a fully-automated way: (i) Natural Language Processing (NLP) for Rhetorical Entity (RE) detection; (ii) Named Entity (NE) recognition based on the Linked Open Data (LOD) cloud; and (iii) automatic knowledge base construction for both NEs and REs using semantic web ontologies that interconnect entities in documents with the machine-readable LOD cloud.

Results. We present a complete workflow to transform scientific literature into a semantic knowledge base, based on the W3C standards RDF and RDFS. A text mining pipeline, implemented based on the GATE framework, automatically extracts rhetorical entities of type Claims and Contributions from full-text scientific literature. These REs are further enriched with named entities, represented as URIs to the linked open data cloud, by integrating the DBpedia Spotlight tool into our workflow. Text mining results are stored in a knowledge base through a flexible export process that provides for a dynamic mapping of semantic annotations to LOD vocabularies through rules stored in the knowledge base. We created a gold standard corpus from computer science conference proceedings and journal articles, where Claim and Contribution sentences are manually annotated with their respective types using LOD URIs. The performance of the RE detection phase is evaluated against this corpus, where it achieves an average F-measure of 0.73. We further demonstrate a number of semantic queries that show how the generated knowledge base can provide support for numerous use cases in managing scientific literature.

Availability. All software presented in this paper is available under open source licenses at http://www.semanticsoftware.info/semantic-scientific-literature-peerj-2015-supplements. Development releases of individual components are additionally available on our GitHub page at https://github.com/SemanticSoftwareLab.
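
The workflow described in this abstract (rhetorical entities typed as Claim or Contribution, linked to DBpedia URIs, stored as RDF and then queried) can be illustrated with a small sketch. This is not the authors' GATE pipeline or their export vocabulary: the `http://example.org/kb/` namespace, the property names `hasRhetoricalEntity`, `text`, and `mentions`, and the sample sentences are all hypothetical, and plain Python tuples stand in for a real triple store.

```python
# Hypothetical namespace and property names, for illustration only.
EX = "http://example.org/kb/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DBPEDIA = "http://dbpedia.org/resource/"


def build_triples(doc_id, rhetorical_entities):
    """Turn extracted rhetorical entities into (subject, predicate, object) triples.

    rhetorical_entities: list of dicts with keys 'type' ('Claim' or
    'Contribution'), 'text', and 'entities' (DBpedia resource names).
    """
    triples = set()
    doc_uri = EX + "doc/" + doc_id
    for i, entity in enumerate(rhetorical_entities):
        re_uri = f"{doc_uri}#re{i}"
        triples.add((doc_uri, EX + "hasRhetoricalEntity", re_uri))
        triples.add((re_uri, RDF_TYPE, EX + entity["type"]))
        triples.add((re_uri, EX + "text", entity["text"]))
        # Each named entity is represented by its LOD URI, not by its surface text.
        for name in entity["entities"]:
            triples.add((re_uri, EX + "mentions", DBPEDIA + name))
    return triples


def docs_contributing_to(triples, dbpedia_name):
    """Return documents that have a Contribution mentioning the given DBpedia resource."""
    target = DBPEDIA + dbpedia_name
    contributions = {s for s, p, o in triples
                     if p == RDF_TYPE and o == EX + "Contribution"}
    mentioning = {s for s, p, o in triples
                  if p == EX + "mentions" and o == target}
    hits = contributions & mentioning
    return sorted(s for s, p, o in triples
                  if p == EX + "hasRhetoricalEntity" and o in hits)


# Invented sample input standing in for text mining output.
kb = build_triples("paper1", [
    {"type": "Contribution", "text": "We present a text mining pipeline.",
     "entities": ["Text_mining"]},
    {"type": "Claim", "text": "Our approach outperforms the baseline.",
     "entities": []},
])
kb |= build_triples("paper2", [
    {"type": "Claim", "text": "Text mining is widely used.",
     "entities": ["Text_mining"]},
])

print(docs_contributing_to(kb, "Text_mining"))
```

Only `paper1` is returned here, because `paper2` mentions the same DBpedia resource in a Claim rather than a Contribution. In a real setting the same question would more likely be posed as a SPARQL query against an RDF store.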

Peer Review #1 of "Semantic representation of scientific literature: bringing claims, contributions and named entities onto the Linked Open Data cloud (v0.1)"

10.7287/peerj-cs.37v0.1/reviews/1 ◽

2015 ◽

Author(s):

MA Sultan

Keyword(s):

Peer Review ◽

Scientific Literature ◽

Semantic Representation ◽

Open Data ◽

Linked Open Data ◽

Named Entities

Peer Review #2 of "Semantic representation of scientific literature: bringing claims, contributions and named entities onto the Linked Open Data cloud (v0.1)"

10.7287/peerj-cs.37v0.1/reviews/2 ◽

2015 ◽

Keyword(s):

Peer Review ◽

Scientific Literature ◽

Semantic Representation ◽

Open Data ◽

Linked Open Data ◽

Named Entities

ARALD: Arabic Annotation Using Linked Data

Ingénierie des systèmes d'information ◽

10.18280/isi.260201 ◽

2021 ◽

Vol 26 (2) ◽

pp. 143-149

Author(s):

Abdelghani Bouziane ◽

Djelloul Bouchiha ◽

Redha Rebhi ◽

Giulio Lorenzini ◽

Noureddine Doumi ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Knowledge Base ◽

Language Processing ◽

Linked Data ◽

Open Data ◽

Arabic Language ◽

Linked Open Data ◽

Main Challenge ◽

The Web

The evolution of the traditional Web into the semantic Web makes the machine a first-class citizen on the Web and increases the discovery and accessibility of unstructured Web-based data. This development makes it possible to use Linked Data technology as the background knowledge base for unstructured data, especially texts, now available in massive quantities on the Web. Given any text, the main challenge is identifying the most relevant information in DBpedia with minimal effort and time. However, DBpedia annotation tools, such as DBpedia Spotlight, mainly target the English and Latin DBpedia versions. The situation for the Arabic language is less bright: Arabic Web content does not reflect the importance of the language. We have therefore developed an approach to annotate Arabic texts with Linked Open Data, particularly DBpedia. This approach uses natural language processing and machine learning techniques to interlink Arabic text with Linked Open Data. Despite the high complexity of the domain-independent knowledge base and the limited resources for Arabic natural language processing, the evaluation results of our approach were encouraging.

Large-scale literature mining to assess the relation between anti-cancer drugs and cancer types

Journal of Translational Medicine ◽

10.1186/s12967-021-02941-z ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Chris Bauer ◽

Ralf Herwig ◽

Matthias Lienhard ◽

Paul Prasse ◽

Tobias Scheffer ◽

...

Keyword(s):

Text Mining ◽

Knowledge Base ◽

Survival Data ◽

Scientific Literature ◽

Entity Recognition ◽

Literature Mining ◽

Cancer Drugs ◽

Classical Text ◽

Anti Cancer ◽

Cancer Types

Background: There is a huge body of scientific literature describing the relation between tumor types and anti-cancer drugs. The vast amount of scientific literature makes it impossible for researchers and physicians to extract all relevant information manually.

Methods: In order to cope with the large amount of literature, we applied an automated text mining approach to assess the relations between the 30 most frequent cancer types and 270 anti-cancer drugs. We applied two different approaches: classical text mining based on named entity recognition, and an AI-based approach employing word embeddings. The consistency of the literature mining results was validated with three independent methods: first, using data from FDA approvals; second, using experimentally measured IC-50 cell line data; and third, using clinical patient survival data.

Results: We demonstrated that the automated text mining was able to successfully assess the relation between cancer types and anti-cancer drugs. All validation methods showed a good correspondence between the results from literature mining and the independent confirmatory approaches. The relations between the most frequent cancer types and the drugs employed for their treatment were visualized in a large heatmap. All results are accessible in an interactive web-based knowledge base at https://knowledgebase.microdiscovery.de/heatmap.

Conclusions: Our approach is able to assess the relations between compounds and cancer types in an automated manner. Both cancer types and compounds could be grouped into different clusters. Researchers can use the interactive knowledge base to inspect the presented results and follow their own research questions, for example the identification of novel indication areas for known drugs.
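
As a rough illustration of what "classical text mining based on named entity recognition" can look like, the sketch below counts how often a drug name and a cancer type appear in the same sentence. The abstract does not state that the authors scored relations this way; sentence-level co-occurrence is an assumption chosen for simplicity, and the two small dictionaries and the example sentences are invented stand-ins for the 270 drugs, 30 cancer types, and the real literature.

```python
import re
from collections import Counter
from itertools import product

# Tiny hypothetical dictionaries standing in for the full drug and cancer lists.
DRUGS = {"imatinib", "cisplatin"}
CANCERS = {"leukemia", "lung cancer"}


def find_terms(sentence, vocabulary):
    """Return the vocabulary terms that occur as whole words or phrases."""
    lowered = sentence.lower()
    return {term for term in vocabulary
            if re.search(r"\b" + re.escape(term) + r"\b", lowered)}


def cooccurrence_counts(abstracts):
    """Count (drug, cancer) pairs that are mentioned in the same sentence."""
    counts = Counter()
    for text in abstracts:
        # Naive sentence split on end punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            drugs = find_terms(sentence, DRUGS)
            cancers = find_terms(sentence, CANCERS)
            counts.update(product(sorted(drugs), sorted(cancers)))
    return counts


# Invented example sentences, not taken from any publication.
abstracts = [
    "Imatinib is effective in chronic myeloid leukemia. Cisplatin was not studied.",
    "Cisplatin remains a standard therapy for lung cancer.",
    "Imatinib improved survival in leukemia patients.",
]

counts = cooccurrence_counts(abstracts)
for (drug, cancer), n in sorted(counts.items()):
    print(drug, cancer, n)
```

A matrix of such counts, with one row per cancer type and one column per drug, is the kind of data a heatmap can display. Raw counts would usually need normalization before they are compared across frequent and rare terms.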

MELHISSA: a multilingual entity linking architecture for historical press articles

International Journal on Digital Libraries ◽

10.1007/s00799-021-00319-6 ◽

2021 ◽

Author(s):

Elvys Linhares Pontes ◽

Luis Adrián Cabrera-Diego ◽

Jose G. Moreno ◽

Emanuela Boros ◽

Ahmed Hamdi ◽

...

Keyword(s):

Language Processing ◽

Digital Libraries ◽

Character Recognition ◽

Optical Character Recognition ◽

Historical Documents ◽

Entity Linking ◽

Named Entities ◽

European Languages ◽

Meta Information ◽

The Impact

Digital libraries have a key role in cultural heritage as they provide access to our culture and history by indexing books and historical documents (newspapers and letters). Digital libraries use natural language processing (NLP) tools to process these documents and enrich them with meta-information, such as named entities. Despite recent advances in these NLP models, most of them are built for specific languages and contemporary documents, and are not optimized for handling historical material that may, for instance, contain language variations and optical character recognition (OCR) errors. In this work, we focused on the entity linking (EL) task, which is fundamental to the indexation of documents in digital libraries. We developed a Multilingual Entity Linking architecture for HIstorical preSS Articles that is composed of multilingual analysis, OCR correction, and filter analysis to alleviate the impact of historical documents in the EL task. The source code is publicly available. Experimentation has been done over two historical documents covering five European languages (English, Finnish, French, German, and Swedish). Results have shown that our system improved the global performance for all languages and datasets by achieving an F-score@1 of up to 0.681 and an F-score@5 of up to 0.787.

Towards Robust Text Classification with Semantics-Aware Recurrent Neural Architecture

Machine Learning and Knowledge Extraction ◽

10.3390/make1020034 ◽

2019 ◽

Vol 1 (2) ◽

pp. 575-589 ◽

Cited By ~ 1

Author(s):

Blaž Škrlj ◽

Jan Kralj ◽

Nada Lavrač ◽

Senja Pollak

Keyword(s):

Text Mining ◽

Language Processing ◽

Text Classification ◽

Deep Neural Networks ◽

Semantic Knowledge ◽

Text Documents ◽

Neural Architecture ◽

Classification Tasks ◽

And Gender ◽

Semantic Resources

Deep neural networks are becoming ubiquitous in text mining and natural language processing, but semantic resources, such as taxonomies and ontologies, are yet to be fully exploited in a deep learning setting. This paper presents an efficient semantic text mining approach, which converts semantic information related to a given set of documents into a set of novel features that are used for learning. The proposed Semantics-aware Recurrent deep Neural Architecture (SRNA) enables the system to learn simultaneously from the semantic vectors and from the raw text documents. We test the effectiveness of the approach on three text classification tasks: news topic categorization, sentiment analysis and gender profiling. The experiments show that the proposed approach outperforms the approach without semantic knowledge, with highest accuracy gain (up to 10%) achieved on short document fragments.

LODDO: Using Linked Open Data Description Overlap to Measure Semantic Relatedness between Named Entities

The Semantic Web - Lecture Notes in Computer Science ◽

10.1007/978-3-642-29923-0_18 ◽

2012 ◽

pp. 268-283 ◽

Cited By ~ 2

Author(s):

Wenlei Zhou ◽

Haofen Wang ◽

Jiansong Chao ◽

Weinan Zhang ◽

Yong Yu

Keyword(s):

Open Data ◽

Semantic Relatedness ◽

Linked Open Data ◽

Named Entities ◽

Data Description

Evaluating the quality of linked open data in digital libraries

Journal of Information Science ◽

10.1177/0165551520930951 ◽

2020 ◽

pp. 016555152093095

Author(s):

Gustavo Candela ◽

Pilar Escobar ◽

Rafael C Carrasco ◽

Manuel Marco-Such

Keyword(s):

Digital Libraries ◽

Open Data ◽

Quality Measures ◽

Linked Open Data ◽

Data Sets ◽

Design And Implementation ◽

Bibliographic Data ◽

Description Framework ◽

Resource Description

Cultural heritage institutions have recently started to share their metadata as Linked Open Data (LOD) in order to disseminate and enrich them. The publication of large bibliographic data sets as LOD is a challenge that requires the design and implementation of custom methods for the transformation, management, querying and enrichment of the data. In this report, the methodology defined by previous research for the evaluation of the quality of LOD is analysed and adapted to the specific case of Resource Description Framework (RDF) triples containing standard bibliographic information. The specified quality measures are reported in the case of four highly relevant libraries.

Enriching city entities in the EKOSS failure cases knowledge base with Linked Open Data

2010 International Conference on Computer Information Systems and Industrial Management Applications (CISIM) ◽

10.1109/cisim.2010.5643505 ◽

2010 ◽

Author(s):

Weisen Guo ◽

Steven B. Kraines

Keyword(s):

Knowledge Base ◽

Open Data ◽

Linked Open Data

Building a specialized lexicon for breast cancer clinical trial subject eligibility analysis

Health Informatics Journal ◽

10.1177/1460458221989392 ◽

2021 ◽

Vol 27 (1) ◽

pp. 146045822198939

Author(s):

Euisung Jung ◽

Hemant Jain ◽

Atish P Sinha ◽

Carmelo Gaudioso

Keyword(s):

Breast Cancer ◽

Clinical Trial ◽

Clinical Trials ◽

Text Mining ◽

Language Processing ◽

Snomed Ct ◽

Lexical Resources ◽

Named Entities ◽

Domain Experts ◽

Trial Subject

A natural language processing (NLP) application requires sophisticated lexical resources to support its processing goals. Different solutions, such as dictionary lookup and MetaMap, have been proposed in the healthcare informatics literature to identify disease terms with more than one word (multi-gram disease named entities). Although a lot of work has been done on the identification of protein- and gene-named entities in the biomedical field, not much research has been done on the recognition and resolution of terminologies in clinical trial subject eligibility analysis. In this study, we develop a specialized lexicon for improving NLP and text mining analysis in the breast cancer domain, and evaluate it by comparing it with the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT). We use a hybrid methodology, which combines the knowledge of domain experts, terms from multiple online dictionaries, and the mining of text from sample clinical trials. Use of our methodology introduces 4243 unique lexicon items, which increase bigram entity match by 38.6% and trigram entity match by 41%. Our lexicon, which adds a significant number of new terms, is very useful for automatically matching patients to clinical trials based on eligibility criteria. Beyond clinical trial matching, the specialized lexicon developed in this study could serve as a foundation for future healthcare text mining applications.
