ChEMU shared task: chemical entity recognition and event extraction of chemical reactions from patents

Chemical patents are an essential source of information about novel chemicals and chemical reactions. However, with the increasing volume of such patents, mining information about these chemicals and reactions has become a time-intensive and laborious endeavor. In this study, we present a system to extract chemical reaction events from patents automatically. Our approach consists of two steps: 1) named entity recognition (NER), the automatic identification of chemical reaction parameters from the corresponding text, and 2) event extraction (EE), the automatic classification and linking of entities based on their relationships to each other. For our NER system, we evaluate bidirectional long short-term memory (BiLSTM)-based and bidirectional encoder representations from transformers (BERT)-based methods. For our EE system, we evaluate BERT-based, convolutional neural network (CNN)-based, and rule-based methods. We evaluate our NER and EE components independently and as an end-to-end system, reporting precision, recall, and F1 score. Our results show that the BiLSTM-based method performed best at identifying the entities, and the CNN-based method performed best at extracting events.
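The precision, recall, and F1 scores mentioned in this abstract can be illustrated with a minimal sketch. It assumes exact-match scoring, where a predicted entity counts as correct only if its character offsets and label both match a gold annotation; the abstract does not state which matching criterion was used. The `Span` type, the `prf1` helper, and the example labels are hypothetical names introduced here for illustration only.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A labelled entity mention (hypothetical structure for illustration)."""
    start: int
    end: int
    label: str


def prf1(gold: set, pred: set) -> tuple:
    """Return (precision, recall, F1) under exact span-and-label matching."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative annotations; the label names are assumptions, not taken from the abstract.
gold = {
    Span(0, 7, "REACTION_PRODUCT"),
    Span(12, 20, "SOLVENT"),
    Span(25, 30, "TEMPERATURE"),
}
pred = {
    Span(0, 7, "REACTION_PRODUCT"),
    Span(12, 20, "REAGENT_CATALYST"),  # right offsets, wrong label: counted as an error
    Span(25, 30, "TEMPERATURE"),
}

precision, recall, f1 = prf1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under this exact-match assumption, a span with correct boundaries but the wrong label lowers both precision and recall. A relaxed criterion based on overlapping spans would score the same predictions differently.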

ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents

Lecture Notes in Computer Science - Advances in Information Retrieval ◽

10.1007/978-3-030-45442-5_74 ◽

2020 ◽

pp. 572-579 ◽

Cited By ~ 1

Author(s):

Dat Quoc Nguyen ◽

Zenan Zhai ◽

Hiyori Yoshikawa ◽

Biaoyan Fang ◽

Christian Druckenbrodt ◽

...

Keyword(s):

Chemical Reactions ◽

Named Entity Recognition ◽

Event Extraction ◽

Entity Recognition ◽

Named Entity

Overview of ChEMU 2020: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents

Lecture Notes in Computer Science - Experimental IR Meets Multilinguality, Multimodality, and Interaction ◽

10.1007/978-3-030-58219-7_18 ◽

2020 ◽

pp. 237-254

Author(s):

Jiayuan He ◽

Dat Quoc Nguyen ◽

Saber A. Akhondi ◽

Christian Druckenbrodt ◽

Camilo Thorne ◽

...

Keyword(s):

Chemical Reactions ◽

Named Entity Recognition ◽

Event Extraction ◽

Entity Recognition ◽

Named Entity

From Tokenization to Self-Supervision: Building a High-Performance Information Extraction System for Chemical Reactions in Patents

Frontiers in Research Metrics and Analytics ◽

10.3389/frma.2021.691105 ◽

2021 ◽

Vol 6 ◽

Author(s):

Jingqi Wang ◽

Yuankai Ren ◽

Zhi Zhang ◽

Hua Xu ◽

Yaoyun Zhang

Keyword(s):

Information Extraction ◽

Chemical Reactions ◽

Chemical Reaction ◽

High Performance ◽

Event Extraction ◽

Entity Recognition ◽

Language Models ◽

Accurate Information ◽

Free Text ◽

Semantic Roles

Chemical reactions and experimental conditions are fundamental information for chemical research and pharmaceutical applications. However, the latest information on chemical reactions is usually embedded in the free text of patents. The rapid accumulation of chemical patents calls for automatic tools based on natural language processing (NLP) techniques for efficient and accurate information extraction. This work describes the participation of the Melax Tech team in the CLEF 2020 ChEMU Task of Chemical Reaction Extraction from Patent. The task consisted of two subtasks: (1) named entity recognition to identify compounds and different semantic roles in the chemical reaction and (2) event extraction to identify event triggers of chemical reactions and their relations with the semantic roles recognized in subtask 1. To build a high-performance end-to-end system, multiple strategies tailored to chemical patents were applied and evaluated, ranging from optimized tokenization and self-supervised pre-training of patent language models to domain knowledge-based rules. Our hybrid approaches combining different strategies achieved state-of-the-art results in both subtasks, with the top-ranked F1 of 0.957 for entity recognition and the top-ranked F1 of 0.9536 for event extraction, indicating that the proposed approaches are promising.

ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents

Frontiers in Research Metrics and Analytics ◽

10.3389/frma.2021.654438 ◽

2021 ◽

Vol 6 ◽

Author(s):

Jiayuan He ◽

Dat Quoc Nguyen ◽

Saber A. Akhondi ◽

Christian Druckenbrodt ◽

Camilo Thorne ◽

...

Keyword(s):

Information Extraction ◽

Chemical Reactions ◽

Language Processing ◽

Named Entity Recognition ◽

Event Extraction ◽

Entity Recognition ◽

Named Entity ◽

Fundamental Information ◽

Essential Chemical ◽

Reaction Processes

Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic properties of chemical patents. The Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020), was introduced to support the development of advanced text mining techniques for chemical patents. The ChEMU 2020 lab proposed two fundamental information extraction tasks focusing on chemical reaction processes described in chemical patents: (1) chemical named entity recognition, requiring identification of essential chemical entities and their roles in chemical reactions, as well as reaction conditions; and (2) event extraction, which aims at identification of event steps relating the entities involved in chemical reactions. The ChEMU 2020 lab received 37 team registrations and 46 runs. Overall, the performance of submissions for these tasks exceeded our expectations, with the top systems outperforming strong baselines. We further show the methods to be robust to variations in sampling of the test data. We provide a detailed overview of the ChEMU 2020 corpus and its annotation, showing that inter-annotator agreement is very strong. We also present the methods adopted by participants, provide a detailed analysis of their performance, and carefully consider the potential impact of data leakage on interpretation of the results. The ChEMU 2020 Lab has shown the viability of automated methods to support information extraction of key information in chemical patents.

Active learning for ontological event extraction incorporating named entity recognition and unknown word handling

Journal of Biomedical Semantics ◽

10.1186/s13326-016-0059-z ◽

2016 ◽

Vol 7 (1) ◽

Cited By ~ 2

Author(s):

Xu Han ◽

Jung-jae Kim ◽

Chee Keong Kwoh

Keyword(s):

Active Learning ◽

Named Entity Recognition ◽

Event Extraction ◽

Entity Recognition ◽

Unknown Word ◽

Named Entity

Exploring the Adaptation of Recurrent Neural Network Approaches for Extracting Drug–Drug Interactions from Biomedical Text

International Journal of Machine Learning and Computing ◽

10.18178/ijmlc.2021.11.4.1046 ◽

2021 ◽

Vol 11 (4) ◽

pp. 267-273

Author(s):

Wen-Juan Hou ◽

Bamfa Ceesay

Keyword(s):

Text Processing ◽

Named Entity Recognition ◽

Event Extraction ◽

Entity Recognition ◽

Biomedical Text ◽

Automatic Extraction ◽

Named Entity ◽

Structured Information ◽

Network Approaches ◽

Form Information

Information extraction (IE) is the process of automatically identifying structured information from unstructured or partially structured text. IE processes can involve several activities, such as named entity recognition, event extraction, relationship discovery, and document classification, with the overall goal of translating text into a more structured form. A drug–drug interaction (DDI) is a change in the effect of a drug when it is taken in combination with a second drug. DDIs can delay, decrease, or enhance absorption of drugs and thus decrease or increase their efficacy or cause adverse effects. Recent research has produced several adaptations of recurrent neural networks (RNNs) for extracting information from text. In this study, we highlight significant challenges of using RNNs in biomedical text processing and propose an approach to the automatic extraction of DDIs that aims to overcome some of these challenges. Our results show that the system is competitive with other systems for the task of extracting DDIs.

Robust Multilingual Named Entity Recognition with Shallow Semi-supervised Features (Extended Abstract)

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/703 ◽

2017 ◽

Cited By ~ 1

Author(s):

Rodrigo Agerri ◽

German Rigau

Keyword(s):

Reproducibility Of Results ◽

State Of The Art ◽

Named Entity Recognition ◽

Local Information ◽

Entity Recognition ◽

Shared Task ◽

Competitive System ◽

Named Entity ◽

Text Understanding ◽

Domain Models

We present a multilingual Named Entity Recognition approach based on a robust and general set of features across languages and datasets. Our system combines shallow local information with clustering semi-supervised features induced on large amounts of unlabeled text. Understanding via empirical experimentation how to effectively combine various types of clustering features allows us to seamlessly export our system to other datasets and languages. The result is a simple but highly competitive system which obtains state-of-the-art results across five languages and twelve datasets. The results are reported on standard shared task evaluation data such as CoNLL for English, Spanish and Dutch. Furthermore, and despite the lack of linguistically motivated features, we also report best results for languages such as Basque and German. In addition, we demonstrate that our method also obtains very competitive results even when the amount of supervised data is cut by half, alleviating the dependency on manually annotated data. Finally, the results show that our emphasis on clustering features is crucial to develop robust out-of-domain models. The system and models are freely available to facilitate their use and guarantee the reproducibility of results.

Overview of CCKS 2020 Task 3: Named Entity Recognition and Event Extraction in Chinese Electronic Medical Records

Data Intelligence ◽

10.1162/dint_a_00093 ◽

2021 ◽

pp. 1-13

Author(s):

Xia Li ◽

Qinghua Wen ◽

Zengtao Jiao ◽

Jiangtao Zhang

Keyword(s):

Electronic Medical Records ◽

Medical Records ◽

Named Entity Recognition ◽

Event Extraction ◽

Entity Recognition ◽

Language Models ◽

Data Sets ◽

External Resources ◽

Named Entity ◽

Evaluation Task

The China Conference on Knowledge Graph and Semantic Computing (CCKS) 2020 Evaluation Task 3 presented clinical named entity recognition and event extraction for Chinese electronic medical records. Two annotated data sets and some additional resources for these two subtasks were provided to participants. This evaluation competition attracted 354 teams, 46 of which successfully submitted valid results. Pre-trained language models were widely applied in this evaluation task. Data augmentation and external resources were also helpful.

Recognition of Chemical Entities using Pattern Matching and Functional Group Classification

International Journal of Intelligent Information Technologies ◽

10.4018/ijiit.2016100102 ◽

2016 ◽

Vol 12 (4) ◽

pp. 21-44 ◽

Cited By ~ 3

Author(s):

R. Hema ◽

T. V. Geetha

Keyword(s):

Pattern Matching ◽

Functional Group ◽

Named Entity Recognition ◽

Chemical Compounds ◽

Chemical Entity ◽

Entity Recognition ◽

Matching Method ◽

Named Entity ◽

One Way Anova

The two main challenges in chemical entity recognition are: (i) new chemical compounds are constantly being synthesized, and (ii) chemical representation is highly ambiguous, since a single chemical entity can be described by different nomenclatures. Therefore, the identification and maintenance of chemical terminologies is a difficult task. Since most existing text mining methods follow term-based approaches, they suffer from the problems of polysemy and synonymy. A Named Entity Recognition (NER) system based on pattern matching in the chemical domain is therefore developed to extract chemical entities from chemical documents. The TF-IDF and PMI association measures are used to filter out non-chemical terms. An F-score of 92.19% is achieved for chemical NER. The proposed method is compared with the baseline method and other existing approaches. As the final step, the filtered chemical entities are classified into sixteen functional groups. The classification is done using an SVM one-against-all multiclass classification approach and achieves an accuracy of 87%. One-way ANOVA is used to compare the quality of the pattern matching method with that of other existing chemical NER methods.
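The two association measures named in this abstract can be sketched from their textbook definitions. The abstract does not give the exact formulation, smoothing, or thresholds the authors used, so the functions below (`pmi`, `tf_idf`) and all counts are illustrative assumptions rather than the paper's implementation.

```python
import math


def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) ).

    Counts are co-occurrence and marginal frequencies out of `total` observations.
    Returns -inf when any count is zero (no smoothing is applied in this sketch).
    """
    if min(count_xy, count_x, count_y, total) <= 0:
        return float("-inf")
    return math.log2((count_xy * total) / (count_x * count_y))


def tf_idf(term_count: int, doc_length: int, docs_with_term: int, num_docs: int) -> float:
    """Textbook TF-IDF: relative term frequency times log inverse document frequency."""
    if doc_length <= 0 or docs_with_term <= 0 or num_docs <= 0:
        return 0.0
    term_frequency = term_count / doc_length
    inverse_doc_frequency = math.log(num_docs / docs_with_term)
    return term_frequency * inverse_doc_frequency


# Hypothetical counts: a candidate term and a chemical context word.
independent = pmi(count_xy=10, count_x=100, count_y=100, total=1000)
associated = pmi(count_xy=8, count_x=10, count_y=16, total=1000)
weight = tf_idf(term_count=3, doc_length=100, docs_with_term=10, num_docs=1000)
print(independent, associated, weight)
```

A filter of the kind described could then keep only candidate terms whose scores exceed some threshold; how such a threshold was chosen is not stated in the abstract.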
