Domain-Transferable Method for Named Entity Recognition Task

DeIDNER Corpus: Annotation of Clinical Discharge Summary Notes for Named Entity Recognition Using BRAT Tool

Studies in Health Technology and Informatics - Public Health and Informatics ◽

10.3233/shti210195 ◽

2021 ◽

Author(s):

Mahanazuddin Syed ◽

Shaymaa Al-Shukri ◽

Shorabuddin Syed ◽

Kevin Sexton ◽

Melody L. Greer ◽

...

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Discharge Summary ◽

Training Data ◽

Entity Recognition ◽

Named Entities ◽

Named Entity ◽

Domain Specific ◽

Model Training ◽

Clinical Domain

Named Entity Recognition (NER) aims to identify and classify entities into predefined categories is a critical pre-processing task in Natural Language Processing (NLP) pipeline. Readily available off-the-shelf NER algorithms or programs are trained on a general corpus and often need to be retrained when applied on a different domain. The end model’s performance depends on the quality of named entities generated by these NER models used in the NLP task. To improve NER model accuracy, researchers build domain-specific corpora for both model training and evaluation. However, in the clinical domain, there is a dearth of training data because of privacy reasons, forcing many studies to use NER models that are trained in the non-clinical domain to generate NER feature-set. Thus, influencing the performance of the downstream NLP tasks like information extraction and de-identification. In this paper, our objective is to create a high quality annotated clinical corpus for training NER models that can be easily generalizable and can be used in a downstream de-identification task to generate named entities feature-set.

Download Full-text

An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Data ◽

10.3390/data6070071 ◽

2021 ◽

Vol 6 (7) ◽

pp. 71

Author(s):

Gonçalo Carnaz ◽

Mário Antunes ◽

Vitor Beires Nogueira

Keyword(s):

Machine Learning ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Automatic Identification ◽

Named Entities ◽

Related Data ◽

Named Entity ◽

Chain Of Custody ◽

Evidence Collection

Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. It is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, being some of which difficult to identify manually. There exists a wide set of dedicated tools, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained with the classification of the annotated named-entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.

Download Full-text

Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202286 ◽

2021 ◽

pp. 1-12

Author(s):

Yingwen Fu ◽

Nankai Lin ◽

Xiaotian Lin ◽

Shengyi Jiang

Keyword(s):

Language Processing ◽

State Of The Art ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Neural Models ◽

Performance Models ◽

Named Entity ◽

High Resource ◽

Benchmark Datasets

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art researches on NER are based on pre-trained language models (PLMs) or classic neural models. However, these researches are mainly oriented to high-resource languages such as English. While for Indonesian, related resources (both in dataset and technology) are not yet well-developed. Besides, affix is an important word composition for Indonesian language, indicating the essentiality of character and token features for token-wise Indonesian NLP tasks. However, features extracted by currently top-performance models are insufficient. Aiming at Indonesian NER task, in this paper, we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA establishes competitive performance on IDNER and three benchmark datasets.

Download Full-text

A Survey of Arabic Named Entity Recognition and Classification

Computational Linguistics ◽

10.1162/coli_a_00178 ◽

2014 ◽

Vol 40 (2) ◽

pp. 469-510 ◽

Cited By ~ 62

Author(s):

Khaled Shaalan

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Relevant Information ◽

Arabic Language ◽

Entity Recognition ◽

Named Entities ◽

Linguistic Resources ◽

Named Entity ◽

To Receive ◽

Made In

As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic languages family, make dealing with NER a challenge. The performance of an Arabic NER component affects the overall performance of the NLP system in a positive manner. This article attempts to describe and detail the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, a review of the state of the art of Arabic NER research is discussed. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.

Download Full-text

A Probability based Classification of Named Entities for Malayalam Language combining Word, Part of Speech and Lexicalized features

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.a1968.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 839-842

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Supervised Machine Learning ◽

Named Entities ◽

Named Entity ◽

Domain Specific ◽

Part Of Speech ◽

Classification Probability ◽

Malayalam Language

Named Entity Recognition is the process wherein named entities which are designators of a sentence are identified. Designators of a sentence are domain specific. The proposed system identifies named entities in Malayalam language belonging to tourism domain which generally includes names of persons, places, organizations, dates etc. The system uses word, part of speech and lexicalized features to find the probability of a word belonging to a named entity category and to do the appropriate classification. Probability is calculated based on supervised machine learning using word and part of speech features present in a tagged training corpus and using certain rules applied based on lexicalized features.

Download Full-text

SCIENTIFIC NAMED ENTITY RECOGNITION WITH THE HELP OF MODERN METHODS

Bulletin Series of Physics & Mathematical Sciences ◽

10.51889/2021-3.1728-7901.11 ◽

2021 ◽

Vol 75 (3) ◽

pp. 94-99

Author(s):

A.M. Yelenov ◽

◽

A.B. Jaxylykova ◽

Keyword(s):

Machine Learning ◽

Language Processing ◽

Named Entity Recognition ◽

Recognition Task ◽

Entity Recognition ◽

Support Vector ◽

Scientific Article ◽

Natural Languages ◽

Named Entity ◽

Learning Area

This research focuses on a comparative study of the Named Entity Recognition task for scientific article texts. Natural language processing could be considered as one of the cornerstones in the machine learning area which devotes its attention to the problems connected with the understanding of different natural languages and linguistic analysis. It was already shown that current deep learning techniques have a good performance and accuracy in such areas as image recognition, pattern recognition, computer vision, that could mean that such technology probably would be successful in the neuro-linguistic programming area too and lead to a dramatic increase on the research interest on this topic. For a very long time, quite trivial algorithms have been used in this area, such as support vector machines or various types of regression, basic encoding on text data was also used, which did not provide high results. The following dataset was used to process the experiment models: Dataset Scientific Entity Relation Core. The algorithms used were Long short-term memory, Random Forest Classifier with Conditional Random Fields, and Named-entity recognition with Bidirectional Encoder Representations from Transformers. In the findings, the metrics scores of all models were compared to each other to make a comparison. This research is devoted to the processing of scientific articles, concerning the machine learning area, because the subject is not investigated on enough properly level.The consideration of this task can help machines to understand natural languages better, so that they can solve other neuro-linguistic programming tasks better, enhancing scores in common sense.

Download Full-text

A Sequence-to-Set Network for Nested Named Entity Recognition

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/542 ◽

2021 ◽

Author(s):

Zeqi Tan ◽

Yongliang Shen ◽

Shuai Zhang ◽

Weiming Lu ◽

Yueting Zhuang

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Recognition Task ◽

Search Space ◽

Entity Recognition ◽

Bipartite Matching ◽

Named Entity ◽

Proposed Model ◽

Fixed Set ◽

Sequence Method

Named entity recognition (NER) is a widely studied task in natural language processing. Recently, a growing number of studies have focused on the nested NER. The span-based methods, considering the entity recognition as a span classification task, can deal with nested entities naturally. But they suffer from the huge search space and the lack of interactions between entities. To address these issues, we propose a novel sequence-to-set neural network for nested NER. Instead of specifying candidate spans in advance, we provide a fixed set of learnable vectors to learn the patterns of the valuable spans. We utilize a non-autoregressive decoder to predict the final set of entities in one pass, in which we are able to capture dependencies between entities. Compared with the sequence-to-sequence method, our model is more suitable for such unordered recognition task as it is insensitive to the label order. In addition, we utilize the loss function based on bipartite matching to compute the overall training loss. Experimental results show that our proposed model achieves state-of-the-art on three nested NER corpora: ACE 2004, ACE 2005 and KBP 2017. The code is available at https://github.com/zqtan1024/sequence-to-set.

Download Full-text

Techniques for Named Entity Recognition

Advances in Human and Social Aspects of Technology - Collaboration and the Semantic Web ◽

10.4018/978-1-4666-0894-8.ch011 ◽

2012 ◽

pp. 191-217 ◽

Cited By ~ 1

Author(s):

Girish Keshav Palshikar

Keyword(s):

Semantic Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entities ◽

Named Entity ◽

Domain Specific ◽

Number Of Factors ◽

Web Contents ◽

Biological Domain

While building and using a fully semantic understanding of Web contents is a distant goal, named entities (NEs) provide a small, tractable set of elements carrying a well-defined semantics. Generic named entities are names of persons, locations, organizations, phone numbers, and dates, while domain-specific named entities includes names of for example, proteins, enzymes, organisms, genes, cells, et cetera, in the biological domain. An ability to automatically perform named entity recognition (NER) – i.e., identify occurrences of NE in Web contents – can have multiple benefits, such as improving the expressiveness of queries and also improving the quality of the search results. A number of factors make building highly accurate NER a challenging task. Given the importance of NER in semantic processing of text, this chapter presents a detailed survey of NER techniques for English text.

Download Full-text

Techniques for Named Entity Recognition

Bioinformatics ◽

10.4018/978-1-4666-3604-0.ch022 ◽

2013 ◽

pp. 400-426 ◽

Cited By ~ 2

Author(s):

Girish Keshav Palshikar

Keyword(s):

Semantic Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entities ◽

Named Entity ◽

Domain Specific ◽

Number Of Factors ◽

Web Contents ◽

Biological Domain

While building and using a fully semantic understanding of Web contents is a distant goal, named entities (NEs) provide a small, tractable set of elements carrying a well-defined semantics. Generic named entities are names of persons, locations, organizations, phone numbers, and dates, while domain-specific named entities includes names of for example, proteins, enzymes, organisms, genes, cells, et cetera, in the biological domain. An ability to automatically perform named entity recognition (NER) – i.e., identify occurrences of NE in Web contents – can have multiple benefits, such as improving the expressiveness of queries and also improving the quality of the search results. A number of factors make building highly accurate NER a challenging task. Given the importance of NER in semantic processing of text, this chapter presents a detailed survey of NER techniques for English text.

Download Full-text

Biomedical Named Entity Recognition via Knowledge Guidance and Question Answering

ACM Transactions on Computing for Healthcare ◽

10.1145/3465221 ◽

2021 ◽

Vol 2 (4) ◽

pp. 1-24

Author(s):

Pratyay Banerjee ◽

Kuntal Kumar Pal ◽

Murthy Devarakonda ◽

Chitta Baral

Keyword(s):

Question Answering ◽

State Of The Art ◽

Named Entity Recognition ◽

Entity Recognition ◽

Neural Models ◽

Named Entity ◽

Input Text ◽

The Impact ◽

Biomedical Named Entity Recognition ◽

Entity Class

In this work, we formulated the named entity recognition (NER) task as a multi-answer knowledge guided question-answer task (KGQA) and showed that the knowledge guidance helps to achieve state-of-the-art results for 11 of 18 biomedical NER datasets. We prepended five different knowledge contexts—entity types, questions, definitions, and examples—to the input text and trained and tested BERT-based neural models on such input sequences from a combined dataset of the 18 different datasets. This novel formulation of the task (a) improved named entity recognition and illustrated the impact of different knowledge contexts, (b) reduced system confusion by limiting prediction to a single entity-class for each input token (i.e., B , I , O only) compared to multiple entity-classes in traditional NER (i.e., B entity 1, B entity 2, I entity 1, I , O ), (c) made detection of nested entities easier, and (d) enabled the models to jointly learn NER-specific features from a large number of datasets. We performed extensive experiments of this KGQA formulation on the biomedical datasets, and through the experiments, we showed when knowledge improved named entity recognition. We analyzed the effect of the task formulation, the impact of the different knowledge contexts, the multi-task aspect of the generic format, and the generalization ability of KGQA. We also probed the model to better understand the key contributors for these improvements.

Download Full-text