Leveraging Concepts in Open Access Publications

2020 ◽  
Vol 2019 ◽  
Author(s):  
Andrea Bertino ◽  
Luca Foppiano ◽  
Laurent Romary ◽  
Pierre Mounier

This paper addresses the integration of a Named Entity Recognition and Disambiguation (NERD) service within a group of open access (OA) publishing digital platforms and considers its potential impact on both research and scholarly publishing. The software powering this service, called entity-fishing, was initially developed by Inria in the context of the EU FP7 project CENDARI and provides automatic entity recognition and disambiguation using the Wikipedia and Wikidata data sets. The application is distributed under an open-source licence, and it has been deployed as a web service in DARIAH's infrastructure hosted by the French HumaNum. In the paper, we focus on the specific issues related to its integration into five OA platforms specialized in the publication of scholarly monographs in the social sciences and humanities (SSH), as part of the work carried out within the EU H2020 project HIRMEOS (High Integration of Research Monographs in the European Open Science infrastructure). In the first section, we give a brief overview of the current status and evolution of OA publications, considering specifically the challenges that OA monographs are encountering. In the second section, we show how the HIRMEOS project aims to address these challenges by optimizing five OA digital platforms for the publication of monographs from the SSH and ensuring their interoperability. In sections three and four, we give a comprehensive description of the entity-fishing service, focusing on its concrete applications in real use cases, together with some further ideas on how to exploit the annotations generated. We show that entity-fishing annotations can improve both the research and the publishing process. In the last section, we briefly present further possible application scenarios that could be made available through infrastructural projects.

2021 ◽  
pp. 1-13
Author(s):  
Xia Li ◽  
Qinghua Wen ◽  
Zengtao Jiao ◽  
Jiangtao Zhang

Abstract The China Conference on Knowledge Graph and Semantic Computing (CCKS) 2020 Evaluation Task 3 presented clinical named entity recognition and event extraction for Chinese electronic medical records. Two annotated data sets and additional resources for these two subtasks were provided to participants. The competition attracted 354 teams, 46 of which submitted valid results. Pre-trained language models were widely applied in this evaluation task, and data augmentation and external resources also proved helpful.


2017 ◽  
Author(s):  
Lars Juhl Jensen

Abstract Most BioCreative tasks to date have focused on assessing the quality of text-mining annotations in terms of precision and recall. Interoperability, speed, and stability are, however, other important factors to consider for practical applications of text mining. The new BioCreative/BeCalm TIPS task focuses purely on these. To participate in this task, I implemented a BeCalm API within the real-time tagging server also used by the Reflect and EXTRACT tools. In addition to retrieval of patent abstracts, PubMed abstracts, and PubMed Central open-access articles as required in the TIPS task, the BeCalm API implementation facilitates retrieval of documents from other sources specified as custom request parameters. As in earlier tests, the tagger proved to be both highly efficient and stable, consistently processing requests of 5000 abstracts in less than half a minute, including retrieval of the document text.


2019 ◽  
Vol 28 (1) ◽  
pp. 15-30 ◽  
Author(s):  
Rakesh Patra ◽  
Sujan Kumar Saha

Abstract In this paper, we present a novel word clustering technique to capture contextual similarity among words. Related word clustering techniques in the literature rely on word statistics collected from a small, fixed word window. For example, the Brown clustering algorithm is based on bigram statistics of the words. However, in sequence labeling tasks such as named entity recognition (NER), words in the longer context also carry valuable information. To capture this longer context, we propose a new word clustering algorithm that uses parse information of the sentences and a non-fixed word window. The proposed algorithm, named variable window clustering, performs better than Brown clustering in our experiments. Additionally, to use two different clustering techniques simultaneously in a classifier, we propose a cluster merging technique that performs an output-level merging of two sets of clusters. To test the effectiveness of the approaches, we use two different NER data sets, namely Hindi and BioCreative II Gene Mention Recognition. A baseline NER system is developed using a conditional random fields classifier, and the clusters from the individual techniques as well as from the merged technique are then incorporated to improve the classifier. Experimental results demonstrate that the cluster merging technique is quite promising.


2020 ◽  
Vol 49 (D1) ◽  
pp. D613-D621 ◽  
Author(s):  
Marvin Martens ◽  
Ammar Ammar ◽  
Anders Riutta ◽  
Andra Waagmeester ◽  
Denise N Slenter ◽  
...  

Abstract WikiPathways (https://www.wikipathways.org) is a biological pathway database known for its collaborative nature and open science approaches. With the core idea of the scientific community developing and curating biological knowledge in pathway models, WikiPathways lowers all barriers for accessing and using its content. A growing number of content creators, initiatives, projects and tools have started using WikiPathways. Central to this growth and increased use of WikiPathways are the various communities that focus on particular subsets of molecular pathways, such as those for rare diseases and lipid metabolism. Knowledge from published pathway figures, extracted using optical character recognition and named entity recognition, helps prioritize pathway development. We show the growth of WikiPathways over the last three years, highlight the new communities and collaborations of pathway authors and curators, and describe various technologies to connect to external resources and initiatives. The road toward a sustainable, community-driven pathway database goes through integration with other resources such as Wikidata and allowing more use, curation and redistribution of WikiPathways content.


2017 ◽  
Author(s):  
David Westergaard ◽  
Hans-Henrik Stærfeldt ◽  
Christian Tønsberg ◽  
Lars Juhl Jensen ◽  
Søren Brunak

Abstract Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.


Data ◽  
2021 ◽  
Vol 6 (8) ◽  
pp. 84
Author(s):  
Jenny Heddes ◽  
Pim Meerdink ◽  
Miguel Pieters ◽  
Maarten Marx

We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Noticing that available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, with roughly half of them containing one or more named datasets. A distinguishing feature of this set is the many sentences using enumerations, conjunctions and ellipses, resulting in long BI+ tag sequences. On all measures, the SciBERT NER tagger performed best and most robustly. Our baseline rule-based tagger performed remarkably well, and better than several state-of-the-art methods. The gold standard dataset, with links and offsets from each sentence to the (openly accessible) articles, together with the annotation guidelines and all code used in the experiments, is available on GitHub.
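To illustrate what a long BI+ tag sequence looks like (one B tag followed by one or more I tags), here is a minimal sketch in Python. The example sentence, its tokenization, and the choice to annotate an elliptical enumeration as a single mention are assumptions made for illustration only; the abstract does not state how the authors annotate such cases.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Label tokens with B/I/O given half-open (start, end) token spans."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"  # first token of a mention
        for i in range(start + 1, end):
            tags[i] = "I"  # remaining tokens of the same mention
    return tags


# Hypothetical sentence with an ellipsis: "-100" stands for "CIFAR-100".
tokens = ["We", "train", "on", "CIFAR-10", "and", "-100", "."]
# Assumption: the whole enumeration is treated as one dataset mention.
spans = [(3, 6)]
tags = bio_tags(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Under this assumed annotation, the conjunction is tagged I, so the mention spans three tokens rather than one.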


2021 ◽  
Vol 40 (5) ◽  
pp. 8899-8914
Author(s):  
Keming Kang ◽  
Shengwei Tian ◽  
Long Yu

To address the limited ability of deep learning models to learn from small amounts of data in Chinese named entity recognition (NER), this paper proposes a method for named entity recognition of local adverse drug reactions based on Adversarial Transfer Learning, and constructs a neural network model, ASAIBC, consisting of Adversarial Transfer Learning, Self-Attention, an independently recurrent neural network (IndRNN), bi-directional long short-term memory (BiLSTM) and a conditional random field (CRF). Only a few open labeled data sets exist for the Chinese NER task. Therefore, this article introduces an Adversarial Transfer Learning network to fully exploit the boundary information shared between the Chinese word segmentation (CWS) task and the NER task, while filtering out information specific to CWS. Combined with the Self-Attention mechanism and IndRNN, the expressive power of the features is enhanced, allowing the model to attend to the important information of different entities at different levels. Together with better capture of dependencies in long sentences, this further strengthens the recognition ability of the model. Since the ASAIBC model outperforms traditional algorithms on the WeiBoNER and MSRA data sets, the paper then conducts an experiment on a manually labeled data set for Xinjiang local named entity recognition of adverse drug reactions (XJADRNER), obtaining accuracy, precision, recall and F-score values of 98.97%, 91.01%, 90.21% and 90.57%, respectively. These experimental results show that the ASAIBC model can significantly improve NER performance for local adverse drug reactions in Xinjiang.


2021 ◽  
Author(s):  
Lisa Langnickel ◽  
Juliane Fluck

Intense research has been done in the area of biomedical natural language processing. Since the breakthrough of transfer learning-based methods, BERT models have been used in a variety of biomedical and clinical applications. For the available data sets, these models show excellent results - partly exceeding the inter-annotator agreements. However, biomedical named entity recognition applied to COVID-19 preprints shows a performance drop compared to the results on available test data. The question arises how well trained models are able to predict on completely new data, i.e. to generalize. Using disease named entity recognition as an example, we investigate the robustness of different machine learning-based methods - including transfer learning - and show that current state-of-the-art methods work well for a given training set and the corresponding test set but exhibit a significant lack of generalization when applied to new data. We therefore argue that there is a need for larger annotated data sets for training and testing.


2017 ◽  
Author(s):  
Bennett Kleinberg ◽  
Maximilian Mozes ◽  
Yaloe van der Toolen ◽  
Bruno Verschuere

Background: The shift towards open science implies that researchers should share their data. Often there is a dilemma between publicly sharing data and protecting their subjects' confidentiality. Moreover, the case of unstructured text data (e.g. stories) poses an additional dilemma: anonymizing texts without deteriorating their content for secondary research. Existing text anonymization systems either deteriorate the content of the original or have not been tested empirically. We propose and empirically evaluate NETANOS: named entity-based text anonymization for open science. NETANOS is an open-source context-preserving anonymization system that identifies and modifies named entities (e.g. persons, locations, times, dates). The aim is to assist researchers in sharing their raw text data. Method & Results: NETANOS anonymizes critical, contextual information through a stepwise named entity recognition (NER) implementation: it identifies contextual information (e.g. "Munich") and then replaces it with a context-preserving category label (e.g. "Location_1"). We assessed how well participants could re-identify several travel stories (e.g. locations, names) that were presented in the original (“Max”), human anonymized (“Max” → “Person1”), NETANOS (“Max” → “Person1”), and in a context-deteriorating state (“Max” → “XXX”). Bayesian testing revealed that the NETANOS anonymization was practically equivalent to the human baseline anonymization. Conclusions: Named entity recognition can be applied to the anonymization of critical, identifiable information in text data. The proposed stepwise anonymization procedure provides a fully automated, fast system for text anonymization. NETANOS might be an important step to address researchers' dilemmas when sharing text data within the open science movement.
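The replacement step described above (an entity such as "Munich" becomes a category label such as "Location_1") can be sketched as follows. This is not the NETANOS implementation: the `anonymize` function, the example story, and the hard-coded entity list (standing in for the output of a real NER system) are all hypothetical, and the label format follows the "Location_1" example from the abstract.

```python
import re
from typing import Dict, List, Tuple


def anonymize(text: str, entities: List[Tuple[str, str]]) -> str:
    """Replace each entity with an indexed category label, e.g. Location_1.

    `entities` is a list of (surface form, category) pairs. The same
    surface form always maps to the same label, which preserves context.
    """
    counters: Dict[str, int] = {}
    labels: Dict[str, str] = {}
    for surface, category in entities:
        if surface not in labels:
            counters[category] = counters.get(category, 0) + 1
            labels[surface] = f"{category}_{counters[category]}"
    # Replace longer surface forms first so substrings do not clash.
    for surface in sorted(labels, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(surface)}\b", labels[surface], text)
    return text


# Hypothetical input; a real pipeline would obtain `entities` from NER.
story = "Max flew to Munich. In Munich, Max met Anna."
entities = [("Max", "Person"), ("Munich", "Location"), ("Anna", "Person")]
result = anonymize(story, entities)
print(result)
```

Keeping a stable index per entity lets a reader still follow who did what and where, which is the context-preserving property that a blanket "XXX" replacement loses.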

