Indirectly Named Entity Recognition

Alexis Kauffmann; François-Claude Rey; Iana Atanassova; Arnaud Gaudinat; Peter Greenfield; Hélène Madinier; Sylviane Cardey

doi:10.4995/jclr.2021.15922

Indirectly Named Entity Recognition

Journal of Computer-Assisted Linguistic Research ◽

10.4995/jclr.2021.15922 ◽

2021 ◽

Vol 5 (1) ◽

pp. 27-46

Author(s):

Alexis Kauffmann ◽

François-Claude Rey ◽

Iana Atanassova ◽

Arnaud Gaudinat ◽

Peter Greenfield ◽

...

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Future Research ◽

Proof Of Concept ◽

Future Perspectives ◽

Named Entities ◽

Multiword Expressions ◽

Named Entity ◽

French Texts

We define indirectly named entities as multiword expressions that refer to known named entities by means of periphrasis. While named entity recognition is a classical task in natural language processing, little attention has been paid to indirectly named entities and their treatment. In this paper, we address this gap by describing issues related to the detection and understanding of indirectly named entities in texts. We introduce a proof of concept for retrieving both lexicalised and non-lexicalised indirectly named entities in French texts. We also show example cases where this proof of concept is applied, and discuss future perspectives. We have initiated the creation of a first lexicon of 712 indirectly named entity entries, which is available for future research.
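
The lexicalised case described above can be pictured as a lookup of known periphrases in a lexicon. The following is a minimal sketch under stated assumptions, not the authors' proof of concept: the three lexicon entries, the `normalise` helper, and the `find_indirect_entities` function are all hypothetical and chosen only for illustration.

```python
import re
import unicodedata

# Hypothetical mini-lexicon: periphrasis -> named entity it denotes.
# These entries are illustrative and are not taken from the authors' lexicon.
LEXICON = {
    "la ville lumière": "Paris",
    "l'hexagone": "France",
    "la ville rose": "Toulouse",
}


def normalise(text):
    """Lower-case, unify apostrophes and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text).lower().replace("’", "'")
    return re.sub(r"\s+", " ", text).strip()


def find_indirect_entities(text, lexicon=None):
    """Return (start, end, surface form, referent) for each lexicon match.

    Offsets refer to the normalised text, not to the raw input.
    """
    lexicon = LEXICON if lexicon is None else lexicon
    norm = normalise(text)
    hits = []
    for phrase, referent in lexicon.items():
        # Word-boundary guards so that a phrase is not matched inside a longer word.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        for match in re.finditer(pattern, norm):
            hits.append((match.start(), match.end(), match.group(0), referent))
    return sorted(hits)


hits = find_indirect_entities("Il a quitté la Ville Lumière pour la Ville rose.")
referents = [hit[3] for hit in hits]
```

A lookup of this kind only covers lexicalised expressions; non-lexicalised periphrases, which the abstract also mentions, would need a different mechanism that this sketch does not attempt.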

An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Data ◽

10.3390/data6070071 ◽

2021 ◽

Vol 6 (7) ◽

pp. 71

Author(s):

Gonçalo Carnaz ◽

Mário Antunes ◽

Vitor Beires Nogueira

Keyword(s):

Machine Learning ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Automatic Identification ◽

Named Entities ◽

Related Data ◽

Named Entity ◽

Chain Of Custody ◽

Evidence Collection

Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents supports the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained for the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
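
The reported F1-score of 0.733 is lower than the harmonic mean of the reported mean precision and mean recall (about 0.763). One plausible reading, stated here as an assumption rather than something the abstract confirms, is that each metric was averaged separately over entity classes. The sketch below illustrates that effect; the per-class counts are hypothetical and are not taken from the corpus.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = harmonic_mean(precision, recall)
    return precision, recall, f1


def harmonic_mean(p, r):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Harmonic mean of the mean precision and mean recall reported in the abstract.
f1_of_reported_means = harmonic_mean(0.808, 0.722)

# Hypothetical per-class counts (tp, fp, fn), for illustration only.
per_class_counts = {"CLASS_A": (9, 1, 1), "CLASS_B": (1, 1, 3)}
scores = [precision_recall_f1(*counts) for counts in per_class_counts.values()]

mean_precision = sum(s[0] for s in scores) / len(scores)
mean_recall = sum(s[1] for s in scores) / len(scores)
mean_f1 = sum(s[2] for s in scores) / len(scores)  # average of per-class F1

# Averaging F1 per class generally differs from taking the harmonic mean
# of the averaged precision and recall.
f1_of_toy_means = harmonic_mean(mean_precision, mean_recall)
```

In the toy data the average of the per-class F1 values falls below the harmonic mean of the averaged precision and recall, which is the same direction as the gap between 0.733 and 0.763.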

A Survey of Arabic Named Entity Recognition and Classification

Computational Linguistics ◽

10.1162/coli_a_00178 ◽

2014 ◽

Vol 40 (2) ◽

pp. 469-510 ◽

Cited By ~ 62

Author(s):

Khaled Shaalan

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Relevant Information ◽

Arabic Language ◽

Entity Recognition ◽

Named Entities ◽

Linguistic Resources ◽

Named Entity ◽

To Receive ◽

Made In

As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic language family, make dealing with NER a challenge. The performance of an Arabic NER component affects the overall performance of the NLP system in a positive manner. This article attempts to describe and detail the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in the Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, the state of the art of Arabic NER research is reviewed. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.

Evaluating named entity recognition tools for extracting social networks from novels

PeerJ Computer Science ◽

10.7717/peerj-cs.189 ◽

2019 ◽

Vol 5 ◽

pp. e189 ◽

Cited By ~ 2

Author(s):

Niels Dekker ◽

Tobias Kuhn ◽

Marieke van Erp

Keyword(s):

Social Networks ◽

Social Interactions ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Computer Assisted ◽

Early 20th Century ◽

Automatic Extraction ◽

Named Entities ◽

Named Entity

The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations for the construction of these networks, but many of these tools are not specifically created for the literary domain. Furthermore, many of the studies on information extraction from literature typically focus on 19th and early 20th century source material. Because of this, it is unclear if these techniques are as suitable to modern-day literature as they are to those older novels. We present a study in which we evaluate natural language processing tools for the automatic extraction of social networks from novels as well as their network structure. We find that there are no significant differences between old and modern novels but that both are subject to a large amount of variance. Furthermore, we identify several issues that complicate named entity recognition in our set of novels and we present methods to remedy these. We see this work as a step in creating more culturally-aware AI systems.
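
Once named entities have been recognised, a social network can be built from them in several ways. The sketch below uses one simple option, counting how often two characters are mentioned in the same text window, as an assumption; the abstract does not state which construction the authors used. The `cooccurrence_network` function and the character names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(mentions_per_window):
    """Count, for every pair of entities, the windows in which both are mentioned.

    `mentions_per_window` is a list of lists: one list of recognised entity
    names per text window (for example a sentence or a paragraph).
    Returns a Counter mapping (entity_a, entity_b) to an edge weight,
    with the pair stored in alphabetical order.
    """
    edges = Counter()
    for mentions in mentions_per_window:
        # Each entity counts once per window; sorting gives a stable pair order.
        for a, b in combinations(sorted(set(mentions)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical NER output for three windows of a novel.
windows = [["Alice", "Bob"], ["Bob", "Alice", "Carol"], ["Carol"]]
network = cooccurrence_network(windows)
```

Errors in the recognition step propagate directly into such a graph: a missed mention removes edge weight, and a character recognised under two different names is split into two nodes.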

Improving classification of low-resource COVID-19 literature by using Named Entity Recognition

Genomics & Informatics ◽

10.5808/gi.21018 ◽

2021 ◽

Vol 19 (3) ◽

pp. e22

Author(s):

Oscar Lithgow-Serrano ◽

Joseph Cornelius ◽

Vani Kanjirangat ◽

Carlos-Francisco Méndez-Cruz ◽

Fabio Rinaldi

Keyword(s):

Classification Scheme ◽

Named Entity Recognition ◽

Entity Recognition ◽

Biological Databases ◽

Proof Of Concept ◽

Baseline Model ◽

Named Entities ◽

Named Entity ◽

Automatic Document Classification

Automatic document classification for highly interrelated classes is a demanding task that becomes more challenging when there is little labeled data for training. Such is the case of the coronavirus disease 2019 (COVID-19) Clinical repository, a repository of classified and translated academic articles related to COVID-19 and relevant to clinical practice, where a 3-way classification scheme is being applied to COVID-19 literature. During the 7th Biomedical Linked Annotation Hackathon (BLAH7), we performed experiments to explore the use of named-entity recognition (NER) to improve the classification. We processed the literature with OntoGene’s Biomedical Entity Recogniser (OGER) and used the resulting identified Named Entities (NE) and their links to major biological databases as extra input features for the classifier. We compared the results with a baseline model without the OGER-extracted features. In these proof-of-concept experiments, we observed a clear gain in COVID-19 literature classification. In particular, the NE’s origin was useful for classifying document types, and the NE’s type for classifying clinical specialties. Due to the limitations of the small dataset, we can only conclude that our results suggest that NER would benefit this classification task. In order to accurately estimate this benefit, further experiments with a larger dataset would be needed.
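
The idea of feeding recognised entities to a classifier as extra input can be sketched as a feature-building step. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `build_features` function, the feature-name prefixes, and the entity types and database names in the example are hypothetical, and no actual OGER output is reproduced.

```python
from collections import Counter


def build_features(tokens, entities):
    """Combine bag-of-words counts with counts of entity types and origins.

    `tokens` is a list of words from one document.
    `entities` is a list of (entity_type, source_database) pairs, standing in
    for the output of an entity recogniser.
    """
    features = Counter("word=" + token.lower() for token in tokens)
    for entity_type, source in entities:
        features["ne_type=" + entity_type] += 1   # what kind of entity it is
        features["ne_source=" + source] += 1      # which database it links to
    return features


# Hypothetical document and hypothetical recogniser output.
doc_tokens = ["Fever", "and", "fever"]
doc_entities = [("disease", "DB_A"), ("chemical", "DB_B"), ("disease", "DB_A")]
features = build_features(doc_tokens, doc_entities)
```

Dropping the `ne_type=` and `ne_source=` keys yields the word-only feature set, which corresponds to the kind of baseline the abstract compares against.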

Statistical Method for Named Entity Recognition in Telugu, an Indian Language

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b3500.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 4211-4216

Keyword(s):

Language Processing ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Entity Recognition ◽

Semantic Features ◽

Indian Language ◽

Named Entities ◽

Maximum Entropy Models ◽

Named Entity ◽

Proper Nouns

One of the important tasks of Natural Language Processing (NLP) is Named Entity Recognition (NER). The primary operation of NER is to identify proper nouns, i.e., to locate all the named entities in the text and tag them with named entity categories such as Entity, Time expression and Numeric expression. In previous work, NER for the Telugu language was addressed with Conditional Random Fields (CRF) and Maximum Entropy models; however, these failed to handle ambiguous named entity tags for the same named entity. This paper presents a hybrid statistical system for Named Entity Recognition in the Telugu language in which named entities are identified by both a dictionary-based approach and a statistical Hidden Markov Model (HMM). The proposed method uses a lexicon-lookup dictionary and contexts based on semantic features for predicting named entity tags. Further, the HMM is used to resolve the named entity ambiguities in the predicted named entity tags. The present work reports an average accuracy of 86.3% for finding the named entities.
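
The combination of a lexicon lookup with sequence decoding can be sketched as follows. This is a simplified illustration, not the paper's system: the lexicon proposes candidate tags for each token, and Viterbi decoding over tag-transition probabilities picks one tag where the lexicon is ambiguous. Emissions are taken as uniform over the lexicon candidates, which is a simplification of a full HMM. The tag set, the English toy lexicon, and all probabilities are hypothetical.

```python
import math

# Hypothetical lexicon: token -> candidate tags (ambiguous entries list several).
LEXICON = {
    "washington": ["PERSON", "LOCATION"],
    "george": ["PERSON"],
    "lives": ["O"],
    "in": ["O"],
}

# Hypothetical transition probabilities P(tag | previous tag); a real system
# would estimate them from an annotated corpus. "<s>" marks the sentence start.
TRANS = {
    "<s>":      {"O": 0.60, "PERSON": 0.30, "LOCATION": 0.10},
    "O":        {"O": 0.60, "PERSON": 0.15, "LOCATION": 0.25},
    "PERSON":   {"O": 0.50, "PERSON": 0.45, "LOCATION": 0.05},
    "LOCATION": {"O": 0.70, "PERSON": 0.10, "LOCATION": 0.20},
}


def candidates(token):
    """Tags proposed by the lexicon; unknown tokens default to 'O'."""
    return LEXICON.get(token.lower(), ["O"])


def viterbi(tokens):
    """Return the most probable tag sequence allowed by the lexicon."""
    # best[tag] = (log-probability, path) of the best sequence ending in tag.
    best = {"<s>": (0.0, [])}
    for token in tokens:
        cands = candidates(token)
        emit = math.log(1.0 / len(cands))  # uniform over lexicon candidates
        step = {}
        for tag in cands:
            score, path = max(
                (prev_score + math.log(TRANS[prev][tag]) + emit, prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[tag] = (score, path + [tag])
        best = step
    return max(best.values())[1]


tags_person = viterbi(["George", "Washington"])
tags_place = viterbi(["lives", "in", "Washington"])
```

With these toy numbers the ambiguous token is tagged as a person after another person name and as a location after ordinary words, which shows how context can resolve an ambiguity that a dictionary alone cannot.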

Myanmar named entity corpus and its use in syllable-based neural named entity recognition

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i2.pp1544-1551 ◽

2020 ◽

Vol 10 (2) ◽

pp. 1544 ◽

Cited By ~ 1

Author(s):

Hsu Myat Mo ◽

Khin Mar Soe

Keyword(s):

Neural Network ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Future Research ◽

Network Architectures ◽

Entity Extraction ◽

Named Entity ◽

Named Entity Extraction ◽

Network Approaches

The Myanmar language is a low-resource language, and this is one of the main reasons why Myanmar Natural Language Processing has lagged behind that of other languages. Currently, there is no publicly available named entity corpus for the Myanmar language. As part of this work, the first manually annotated named entity tagged corpus for the Myanmar language was developed and proposed to support the evaluation of named entity extraction. At present, our named entity corpus contains approximately 170,000 named entities and 60,000 sentences. This work also contributes the first evaluation of various deep neural network architectures on Myanmar Named Entity Recognition. Experimental results of the 10-fold cross-validation revealed that syllable-based neural sequence models without additional feature engineering can give better results than a baseline CRF model. This work also aims to discover the effectiveness of neural network approaches to textual processing for the Myanmar language, as well as to promote future research on this understudied language.

Evaluating social network extraction for classic and modern fiction literature

10.7287/peerj.preprints.27263 ◽

2018 ◽

Author(s):

Niels Dekker ◽

Tobias Kuhn ◽

Marieke van Erp

Keyword(s):

Social Networks ◽

Science Fiction ◽

19th Century ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Computer Assisted ◽

Named Entities ◽

Named Entity ◽

Modern Fiction

The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations for the construction of these networks, but many of these tools are not specifically created for the literary domain. Furthermore, many of the studies on information extraction from literature typically focus on 19th century source material. Because of this, it is unclear if these techniques are as suitable to modern-day science fiction and fantasy literature as they are to those 19th century classics. We present a study that compares classic literature with modern literature in terms of the performance of natural language processing tools for the automatic extraction of social networks, as well as in terms of their network structure. We find that there are no significant differences between the two sets of novels but that both are subject to a high amount of variance. Furthermore, we identify several issues that complicate named entity recognition in modern novels and we present methods to remedy these.

Techniques for Named Entity Recognition on Arabic-English Code-Mixed Data

International Journal of Robotic Computing ◽

10.35708/tai1868-126245 ◽

2019 ◽

pp. 44-63

Author(s):

Caroline Sabty ◽

Ahmed Sherif ◽

Mohamed Elmahdy ◽

Slim Abdennadher

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Arab Countries ◽

Entity Recognition ◽

Mixed Data ◽

Word Embeddings ◽

Named Entities ◽

Named Entity ◽

Code Mixing ◽

Social Media Platforms

As a result of globalization and better quality of education, a significant percentage of the population in Arab countries has become bilingual or multilingual. This has raised the frequency of code-switching and code-mixing among Arabs in daily communication. Consequently, a huge amount of Code-Mixed (CM) content can be found on different social media platforms. Such data could be analyzed and used in different Natural Language Processing (NLP) tasks to tackle the challenges emerging from this multilingual phenomenon. Named-Entity Recognition (NER) is one of the major tasks for several NLP systems. It is the process of identifying named entities in text. However, there is a lack of annotated CM data and resources for such a task. This work aims at collecting and building the first annotated CM Arabic-English corpus for NER. Furthermore, we constructed a baseline NER system using deep neural networks and word embeddings for Arabic-English CM text. Moreover, we investigated the usage of different types of classical and contextual pre-trained word embeddings in our system. The best-performing NER system achieved an F1-score of 77.69% by combining classical and contextual word embeddings.

Domain-Transferable Method for Named Entity Recognition Task

10.5121/csit.2020.101407 ◽

2020 ◽

Author(s):

Vladislav Mikhailov ◽

Tatiana Shavrina

Keyword(s):

Language Processing ◽

Question Answering ◽

Named Entity Recognition ◽

Recognition Task ◽

Entity Recognition ◽

Neural Models ◽

Named Entities ◽

Named Entity ◽

Domain Specific ◽

Human Effort

Named Entity Recognition (NER) is a fundamental task in the fields of natural language processing and information extraction. NER has been widely used as a standalone tool or an essential component in a variety of applications such as question answering, dialogue assistants and knowledge graph development. However, training reliable NER models requires a large amount of labelled data, which is expensive to obtain, particularly in specialized domains. This paper describes a method to learn a domain-specific NER model for an arbitrary set of named entities when domain-specific supervision is not available. We assume that the supervision can be obtained with no human effort, and that neural models can learn from each other. The code, data and models are publicly available.

Using Named Entities for Recognizing Family Relationships

10.5753/kdmile.2021.17457 ◽

2021 ◽

Author(s):

E. Oliveira ◽

G. Dias ◽

J. Lima ◽

J. P. C. Pirovani

Keyword(s):

Language Processing ◽

Family Relationships ◽

Named Entity Recognition ◽

Entity Recognition ◽

Hard Problem ◽

Named Entities ◽

Named Entity ◽

Hybrid Approaches ◽

Local Grammar ◽

F Measure

The objective of Named Entity Recognition is to automatically identify and classify entities like persons, places, organizations, and so forth. It is an area of Natural Language Processing and Information Extraction. Named Entity Recognition is important because it is a fundamental preprocessing step for several applications, such as relation extraction. However, it is a hard problem to solve, as several categories of named entities are written similarly and appear in similar contexts. To accomplish it, we can use some hybrid approaches. Nevertheless, in this present study, we use a linguistic flavor by applying Local Grammar and Cascade of Transducers. Local Grammars are used to represent the rules of a particular linguistic structure. They are often built manually to describe the entities we aim to recognize. In our study, we adapted a Local Grammar to improve the Recognition of Named Entities. The results show an improvement of up to 7% on the F-measure metric in relation to the previous Local Grammar. Also, we built another Local Grammar to recognize family relationships from the improved Local Grammar. We present a practical application for the extracted relationships using Prolog.
