A BERT-Based Hybrid System for Chemical Identification and Indexing in Full-Text Articles

2021 ◽  
Author(s):  
Arslan Erdengasileng ◽  
Keqiao Li ◽  
Qing Han ◽  
Shubo Tian ◽  
Jian Wang ◽  
...  

Identification and indexing of chemical compounds in full-text articles are essential steps in biomedical article categorization, information extraction, and biological text mining. The BioCreative Challenge was established to evaluate methods for biological text mining and information extraction. Track 2 of BioCreative VII (summer 2021) consists of two subtasks: chemical identification and chemical indexing in full-text PubMed articles. The chemical identification subtask in turn has two parts: chemical named entity recognition (NER) and chemical normalization. In this paper, we present a hybrid pipeline for chemical NER, chemical normalization, and chemical indexing in full-text PubMed articles. Specifically, we applied BERT-based methods for chemical NER and chemical indexing, and a sieve-based dictionary matching method for chemical normalization. For subtask 1, we used PubMedBERT with data augmentation for chemical NER. For chemical normalization, several chemical-MeSH dictionaries, including MeSH.XML, SUPP.XML, MRCONSO.RRF, and PubTator chemical annotations, were applied in a specific order to obtain the best performance. We achieved F1 scores of 0.86 and 0.7668 on chemical NER and chemical normalization, respectively. We formulated subtask 2 as a binary prediction problem for each individual chemical compound name. We then used a BERT-based model with engineered features and achieved a strict F1 score of 0.4825 on the test set, which is substantially higher than the median F1 score (0.3971) of all submissions.
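The sieve-based dictionary matching mentioned in this abstract can be illustrated with a minimal sketch. It assumes each dictionary is a plain mapping from a normalized mention string to an identifier and that the sieves are tried in a fixed priority order, with the first hit winning. The function name `normalize_mention`, the sieve names, and the `TOY:` identifiers are all hypothetical; the abstract does not describe the authors' actual matching rules or dictionary order.

```python
from typing import Dict, Optional, Sequence, Tuple

# A sieve is a (name, dictionary) pair; the dictionary maps a
# normalized mention string to an identifier.
Sieve = Tuple[str, Dict[str, str]]


def normalize_mention(
    mention: str, sieves: Sequence[Sieve]
) -> Optional[Tuple[str, str]]:
    """Return (identifier, sieve_name) from the first sieve that
    contains the mention, or None if no sieve matches."""
    # Assumed normalization: lowercase and collapse whitespace.
    key = " ".join(mention.lower().split())
    for name, dictionary in sieves:
        identifier = dictionary.get(key)
        if identifier is not None:
            return identifier, name
    return None


# Toy dictionaries with made-up identifiers; real ones would be built
# from resources such as those listed in the abstract.
SIEVES = [
    ("primary", {"aspirin": "TOY:D1"}),
    ("supplementary", {"aspirin": "TOY:C9", "remdesivir": "TOY:C2"}),
]

print(normalize_mention("  Aspirin ", SIEVES))
print(normalize_mention("Remdesivir", SIEVES))
print(normalize_mention("unobtainium", SIEVES))
```

Because earlier sieves take precedence, the order of the list encodes which resource is trusted most, which is one plausible reading of "a specific order" in the abstract.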

2017 ◽  
Author(s):  
David Westergaard ◽  
Hans-Henrik Stærfeldt ◽  
Christian Tønsberg ◽  
Lars Juhl Jensen ◽  
Søren Brunak

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics over this period. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.


2021 ◽  
Vol 189 ◽  
pp. 292-299
Author(s):  
Caroline Sabty ◽  
Islam Omar ◽  
Fady Wasfalla ◽  
Mohamed Islam ◽  
Slim Abdennadher

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Nícia Rosário-Ferreira ◽  
Victor Guimarães ◽  
Vítor S. Costa ◽  
Irina S. Moreira

Background: Blood cancers (BCs) are responsible for over 720,000 deaths worldwide each year. Their prevalence and mortality rate underline the relevance of BC-related research. Although several resources establish Disease-Disease Associations (DDAs), this knowledge is scattered and not readily accessible to the scientific community. Here, we propose SicknessMiner, a biomedical Text-Mining (TM) approach towards the centralization of DDAs. Our methodology encompasses Named Entity Recognition (NER) and Named Entity Normalization (NEN) steps, and the retrieved DDAs were compared qualitatively and quantitatively against the DisGeNET resource. Results: We obtained DDAs via co-mention using SicknessMiner, or via gene- or variant-disease similarity on DisGeNET. SicknessMiner retrieved around 92% of the DisGeNET results, and nearly 15% of the SicknessMiner results were specific to our pipeline. Conclusions: SicknessMiner is a valuable tool for extracting disease-disease relationships from a raw input corpus.
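As a rough illustration of association extraction by co-mention, the sketch below counts how often two normalized disease names appear in the same document. It assumes the NER and NEN steps have already produced one list of normalized disease names per document; the function name `co_mention_counts` and the toy data are hypothetical and not taken from SicknessMiner.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Sequence, Tuple


def co_mention_counts(
    doc_entities: Iterable[Sequence[str]],
) -> Counter:
    """Count, for every unordered pair of diseases, the number of
    documents in which both are mentioned."""
    counts: Counter = Counter()
    for entities in doc_entities:
        # Deduplicate within a document and sort so each pair has
        # one canonical (alphabetical) orientation.
        unique = sorted(set(entities))
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


# Toy input: each inner list holds the normalized diseases of one document.
docs = [
    ["leukemia", "lymphoma"],
    ["lymphoma", "leukemia", "anemia", "leukemia"],
    ["myeloma"],
]

for pair, n in co_mention_counts(docs).most_common():
    print(pair, n)
```

Raw document-level counts are only a starting point; a real pipeline would likely weight or filter pairs before treating them as associations.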

