Supporting Named Entity Recognition and Syntactic Analysis with Full-Text Queries

Author(s):  
Luísa Coheur ◽  
Ana Guimarães ◽  
Nuno Mamede
Information ◽  
2019 ◽  
Vol 10 (5) ◽  
pp. 178 ◽  
Author(s):  
Denis Maurel ◽  
Enza Morale ◽  
Nicolas Thouvenin ◽  
Patrice Ringot ◽  
Angel Turri

Istex is a database of twenty million full-text scientific papers bought by the French Government for the use of academic libraries. Papers are usually searched for by title, authors, keywords, or possibly the abstract. To enable new types of queries on Istex, we implemented a named entity recognition system that runs on all papers, and we offer users the possibility to run searches on these entities. After presenting the French Istex project, we detail in this paper the named entity recognition with CasEN, a cascade of graphs implemented on the Unitex software. CasEN exists in French, but not in English. The first challenge was to build a new cascade in a short time. Its evaluation showed good Precision, even though Recall was not very good. Precision was very important for this project, to ensure that a query did not return unwanted papers. The second challenge was deploying Unitex to parse around twenty million documents. We used a dockerized application. Finally, we also explain how to query the resulting named entities on the Istex website.


2017 ◽  
Author(s):  
David Westergaard ◽  
Hans-Henrik Stærfeldt ◽  
Christian Tønsberg ◽  
Lars Juhl Jensen ◽  
Søren Brunak

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.


2021 ◽  
Author(s):  
Arslan Erdengasileng ◽  
Keqiao Li ◽  
Qing Han ◽  
Shubo Tian ◽  
Jian Wang ◽  
...  

Identification and indexing of chemical compounds in full-text articles are essential steps in biomedical article categorization, information extraction, and biological text mining. The BioCreative Challenge was established to evaluate methods for biological text mining and information extraction. Track 2 of BioCreative VII (summer 2021) consists of two subtasks: chemical identification and chemical indexing in full-text PubMed articles. The chemical identification subtask itself has two parts: chemical named entity recognition (NER) and chemical normalization. In this paper, we present our work on developing a hybrid pipeline for chemical named entity recognition, chemical normalization, and chemical indexing in full-text PubMed articles. Specifically, we applied BERT-based methods for chemical NER and chemical indexing, and a sieve-based dictionary matching method for chemical normalization. For subtask 1, we used PubMedBERT with data augmentation on the chemical NER task. Several chemical-MeSH dictionaries, including MeSH.XML, SUPP.XML, MRCONSO.RRF, and PubTator chemical annotations, were used in a specific order to get the best performance on chemical normalization. We achieved F1 scores of 0.86 and 0.7668 on chemical NER and chemical normalization, respectively. For subtask 2, we formulated the task as a binary prediction problem for each individual chemical compound name. We then used a BERT-based model with engineered features and achieved a strict F1 score of 0.4825 on the test set, which is substantially higher than the median F1 score (0.3971) of all the submissions.
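The sieve-based dictionary matching mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two toy dictionaries, the sieve names, the placeholder identifiers, and the `normalize` function are all assumptions made for illustration. The only idea taken from the abstract is that dictionaries are consulted in a fixed order and the first match wins.

```python
from typing import Optional

# Hypothetical toy dictionaries, ordered from most to least trusted.
# The identifiers are placeholders, not real MeSH IDs.
SIEVES: list[tuple[str, dict[str, str]]] = [
    ("primary", {"aspirin": "ID:0001"}),
    ("supplementary", {"acetylsalicylic acid": "ID:0001", "caffeine": "ID:0002"}),
]


def normalize(mention: str) -> Optional[tuple[str, str]]:
    """Return (sieve_name, identifier) from the first sieve that matches."""
    key = " ".join(mention.lower().split())  # lowercase and collapse whitespace
    for name, dictionary in SIEVES:
        if key in dictionary:
            return name, dictionary[key]
    return None  # no sieve matched; the mention stays unnormalized
```

Because earlier sieves take priority, the ordering of the dictionaries directly affects which identifier a mention receives, which is presumably why the abstract stresses using them "in a specific order".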


2020 ◽  
Author(s):  
Omar ASBAYOU

This paper shows how location named entity (LNE) extraction and annotation, which is part of our named entity recognition (NER) systems, is an important task in managing large amounts of data. In this paper, we explain the linguistic approach behind our rule-based LNE recognition and classification system, which is based on syntactico-semantic patterns. To reach good results, we have taken into account morpho-syntactic information provided by morpho-syntactic analysis based on the DIINAR database, and the syntactico-semantic classification of both location name trigger words (TW) and extensions. Formally, different trigger word senses imply different syntactic entity structures. We also show the semantic data that our LNE recognition and classification system can provide to both information extraction (IE) and information retrieval (IR). The XML database output of the LNE system constitutes an important resource for IE and IR. A future project will improve this output in order to exploit it in computer-assisted translation (CAT).
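The trigger-word idea in this abstract can be sketched with a toy rule. This is an illustration only: the English trigger words, the class labels, and the `extract_locations` function are hypothetical, and the sketch omits the morpho-syntactic analysis and the DIINAR resources that the described system relies on. It shows just one thing: a trigger word both signals a location name and determines its class.

```python
import re

# Hypothetical English trigger words mapped to location classes.
TRIGGER_CLASSES = {"city of": "CITY", "river": "RIVER", "mount": "MOUNTAIN"}

# A trigger word (matched case-insensitively) followed by one or more
# capitalized tokens, which form the location name.
PATTERN = re.compile(
    r"\b(?P<trigger>(?i:city of|river|mount))\s+"
    r"(?P<name>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"
)


def extract_locations(text: str) -> list[tuple[str, str]]:
    """Return (name, class) pairs for every trigger-word match in the text."""
    results = []
    for match in PATTERN.finditer(text):
        label = TRIGGER_CLASSES[match.group("trigger").lower()]
        results.append((match.group("name"), label))
    return results
```

A real rule-based system would replace the single regular expression with patterns over analyzed tokens, so that each trigger word sense can select its own entity structure.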


2020 ◽  
Author(s):  
Shintaro Tsuji ◽  
Andrew Wen ◽  
Naoki Takahashi ◽  
Hongjian Zhang ◽  
Katsuhiko Ogasawara ◽  
...  

BACKGROUND Named entity recognition (NER) plays an important role in extracting the features of descriptions for mining free-text radiology reports. However, the performance of existing NER tools is limited because the number of entities they recognize depends on dictionary lookup. In particular, the recognition of compound terms is very complicated because they follow a variety of patterns.
OBJECTIVE The objective of the study is to develop and evaluate an NER tool focused on compound terms, using RadLex, for mining free-text radiology reports.
METHODS We leveraged the clinical Text Analysis and Knowledge Extraction System (cTAKES) to develop customized pipelines using both RadLex and SentiWordNet (a general-purpose dictionary, GPD). We manually annotated 400 radiology reports for compound terms (Cts) in noun phrases and used them as the gold standard for the performance evaluation (precision, recall, and F-measure). We also created a compound-term-enhanced dictionary (CtED) by analyzing false negatives (FNs) and false positives (FPs), and applied it to another 100 radiology reports for validation. In addition, we evaluated the stem terms of compound terms by defining two measures: an occurrence ratio (OR) and a matching ratio (MR).
RESULTS The F-measure of cTAKES+RadLex+GPD was 32.2% (Precision 92.1%, Recall 19.6%), and that of the pipeline combined with the CtED was 67.1% (Precision 98.1%, Recall 51.0%). The OR indicated that the stem terms "effusion", "node", "tube", and "disease" were used frequently, but that Cts were still not being captured. The MR showed that 71.9% of stem terms matched those of the ontologies, and that RadLex improved the MR by about 22% over the cTAKES default dictionary. The OR and MR revealed that the characteristics of stem terms have the potential to help generate synonymous phrases using ontologies.
CONCLUSIONS We developed a RadLex-based customized pipeline for parsing radiology reports and demonstrated that the CtED and stem term analysis have the potential to improve dictionary-based NER performance toward expanding vocabularies.
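As a quick check of how the reported precision and recall relate to the F-measure, here is a small sketch. It assumes the abstract uses the balanced F-measure (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` is chosen for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, expressed as fractions.
baseline = f_measure(0.921, 0.196)  # cTAKES+RadLex+GPD
enhanced = f_measure(0.981, 0.510)  # with the CtED
print(f"baseline={baseline:.3f} enhanced={enhanced:.3f}")
```

Under that assumption, both values agree with the reported 32.2% and 67.1% to within the rounding of the published precision and recall. The calculation also shows why the baseline score is low despite high precision: the harmonic mean is pulled toward the weaker of the two values, here the 19.6% recall.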

