A Survey of Arabic Named Entity Recognition and Classification

2014 ◽  
Vol 40 (2) ◽  
pp. 469-510 ◽  
Author(s):  
Khaled Shaalan

As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic language family, make dealing with NER a challenge. A well-performing Arabic NER component improves the overall performance of the NLP system. This article attempts to describe and detail the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in the Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, a review of the state of the art of Arabic NER research is provided. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.

Data ◽  
2021 ◽  
Vol 6 (7) ◽  
pp. 71
Author(s):  
Gonçalo Carnaz ◽  
Mário Antunes ◽  
Vitor Beires Nogueira

Criminal investigation collects and analyzes the facts related to a crime, from which the investigators can deduce evidence to be used in court. It is a multidisciplinary applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. There exists a wide set of dedicated tools, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained for the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
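The reported mean scores can be read with a small worked example in mind. The harmonic mean of the mean precision (0.808) and mean recall (0.722) is about 0.763, not 0.733, which would be consistent with F1 having been averaged per entity class; the abstract does not state the averaging scheme, so this is only one plausible reading. The sketch below shows exact-match, per-type scoring with macro-averaging. The entity types, spans, and function names are hypothetical and are not taken from the paper.

```python
from collections import Counter

def entity_prf(gold, pred):
    """Per-type precision, recall and F1 over exact-match (start, end, type) spans."""
    tp, fp, fn = Counter(), Counter(), Counter()
    gold_set, pred_set = set(gold), set(pred)
    for span in pred_set:
        if span in gold_set:
            tp[span[2]] += 1   # predicted span matches a gold span exactly
        else:
            fp[span[2]] += 1   # predicted span has no exact gold match
    for span in gold_set - pred_set:
        fn[span[2]] += 1       # gold span that was never predicted
    scores = {}
    for t in set(tp) | set(fp) | set(fn):
        p = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        r = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[t] = (p, r, f)
    return scores

def macro_average(scores):
    """Unweighted mean of per-type (precision, recall, F1)."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))

# Hypothetical toy annotations: (start_token, end_token, entity_type)
gold = [(0, 2, "PERSON"), (5, 6, "PLACE"), (9, 10, "PLATE"), (12, 14, "PERSON")]
pred = [(0, 2, "PERSON"), (5, 6, "PLACE"), (9, 11, "PLATE"), (12, 14, "PLACE")]

scores = entity_prf(gold, pred)
macro_p, macro_r, macro_f1 = macro_average(scores)
# Here macro_p = macro_r = 0.5, yet macro_f1 = 4/9 (about 0.444):
# the mean of per-type F1 scores is generally not the harmonic mean
# of the mean precision and mean recall.
```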


Author(s):  
Abdelsalam A. Almarimi ◽  
Ezzedin M. Enbiah

Named Entity Recognition (NER) is a computational linguistic concept that is used to find and classify proper nouns in a text such as person names, geographical locations, and organizations. Such a concept is fundamental in the field of natural language processing. In Libya, many private and public institutions struggle to translate entity names properly from Arabic into English. Therefore, in this paper, we are concerned with analyzing Arabic articles to extract and recognize entity names. A recognition system is developed for recognizing names of persons, academic institutions, and cities in Libya. First, a training corpus and dictionaries are built for the entity names targeted in this research. Then, the aspects of the entity names are studied, and their patterns and rules are designed. The implementation is then performed using the NooJ linguistic development environment. The recognition of person names, Libyan cities, and academic institutions was carried out. Statistics show the frequencies of person names, academic institutions, and cities in our training corpus. The obtained results are promising and meet the research goals for tackling the problem of Arabic named entity recognition.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Zhi Liu ◽  
Changyong Luo ◽  
Zeyu Zheng ◽  
Yan Li ◽  
Dianzheng Fu ◽  
...  

Intelligent traditional Chinese medicine (TCM) has become a popular research field thanks to the rapid progress of deep learning technology. Important achievements have been made in such representative tasks as automatic diagnosis of TCM syndromes and diseases and generation of TCM herbal prescriptions. However, one unavoidable issue that still hinders its progress is the lack of labeled samples, i.e., the TCM medical records. As an efficient tool, named entity recognition (NER) models trained on various TCM resources can effectively alleviate this problem and continuously increase the number of labeled TCM samples. In this work, on the basis of in-depth analysis, we argue that the performance of the TCM named entity recognition model can be improved by using character-level representation and tagging, and we propose a novel word-character integrated self-attention module. With the help of TCM doctors and experts, we define 5 classes of TCM named entities and construct a comprehensive NER dataset containing the standard content of the publications and the clinical medical records. The experimental results on this dataset demonstrate the effectiveness of the proposed module.
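To make "character-level tagging" concrete, here is a minimal sketch that converts entity spans into one tag per character. It assumes the common BIO labeling scheme, which the abstract does not specify; the label name FORMULA and the example sentence are hypothetical and are not among the paper's five (unnamed here) entity classes.

```python
def char_bio_tags(text, spans):
    """Return one BIO tag per character for (start, end, label) spans, end exclusive."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return tags

# Hypothetical example: "the patient takes Mahuang decoction"
text = "患者服用麻黄汤"
spans = [(4, 7, "FORMULA")]                 # characters 4..6 form the entity
for ch, tag in zip(text, char_bio_tags(text, spans)):
    print(ch, tag)
```

Tagging characters rather than words avoids depending on a word segmenter, whose errors on domain terms would otherwise propagate into the entity boundaries.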


2019 ◽  
Vol 5 ◽  
pp. e189 ◽  
Author(s):  
Niels Dekker ◽  
Tobias Kuhn ◽  
Marieke van Erp

The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations for the construction of these networks, but many of these tools are not specifically created for the literary domain. Furthermore, many of the studies on information extraction from literature typically focus on 19th and early 20th century source material. Because of this, it is unclear if these techniques are as suitable to modern-day literature as they are to those older novels. We present a study in which we evaluate natural language processing tools for the automatic extraction of social networks from novels as well as their network structure. We find that there are no significant differences between old and modern novels but that both are subject to a large amount of variance. Furthermore, we identify several issues that complicate named entity recognition in our set of novels and we present methods to remedy these. We see this work as a step in creating more culturally-aware AI systems.


2019 ◽  
Vol 8 (2) ◽  
pp. 4211-4216

One of the important tasks of Natural Language Processing (NLP) is Named Entity Recognition (NER). The primary operation of NER is to identify proper nouns, i.e., to locate all the named entities in the text and tag them with named entity categories such as Entity, Time expression, and Numeric expression. In previous work, NER for the Telugu language was addressed with Conditional Random Fields (CRF) and Maximum Entropy models; however, these failed to handle ambiguous named entity tags for the same named entity. This paper presents a hybrid statistical system for Named Entity Recognition in the Telugu language in which named entities are identified by both a dictionary-based approach and a statistical Hidden Markov Model (HMM). The proposed method uses a lexicon-lookup dictionary and contexts based on semantic features for predicting named entity tags. The HMM is then used to resolve ambiguities in the predicted named entity tags. The present work reports an average accuracy of 86.3% for finding the named entities.
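The combination of lexicon lookup with sequence decoding can be sketched as follows. This is a simplified illustration, not the paper's system: the lexicon proposes candidate tags for each token, and a Viterbi search over hand-set transition probabilities picks one tag where the lexicon is ambiguous. The romanized tokens, tag set, probabilities, and the uniform emission model are all assumptions made for the example.

```python
import math

# Hypothetical toy lexicon: token -> candidate tags (ambiguous entries have several).
LEXICON = {
    "rama": {"PERSON"},
    "krishna": {"PERSON", "LOCATION"},   # ambiguous: a given name or a river
    "river": {"O"},
    "visited": {"O"},
    "the": {"O"},
}
TAGS = ["PERSON", "LOCATION", "O"]

# Hypothetical transition probabilities P(tag | previous tag); "<s>" marks sentence start.
TRANS = {
    "<s>":      {"PERSON": 0.4, "LOCATION": 0.2, "O": 0.4},
    "PERSON":   {"PERSON": 0.3, "LOCATION": 0.1, "O": 0.6},
    "LOCATION": {"PERSON": 0.1, "LOCATION": 0.2, "O": 0.7},
    "O":        {"PERSON": 0.2, "LOCATION": 0.3, "O": 0.5},
}

def candidates(token):
    """Lexicon lookup; unknown tokens may take any tag."""
    return LEXICON.get(token.lower(), set(TAGS))

def viterbi(tokens):
    """Most probable tag sequence, restricted to lexicon candidates per token."""
    best = {"<s>": (0.0, [])}            # tag -> (log-prob of best path, path)
    for token in tokens:
        allowed = candidates(token)
        emit = math.log(1.0 / len(allowed))   # uniform emission over candidates (assumption)
        nxt = {}
        for tag in allowed:
            score, path = max(
                (prev_score + math.log(TRANS[prev][tag]), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            nxt[tag] = (score + emit, path + [tag])
        best = nxt
    return max(best.values())[1]

print(viterbi("Rama visited the Krishna river".split()))
# ['PERSON', 'O', 'O', 'LOCATION', 'O']
print(viterbi("Rama Krishna visited".split()))
# ['PERSON', 'PERSON', 'O']
```

The same surface form receives different tags in the two sentences because the decoder scores whole tag sequences rather than tokens in isolation.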


2018 ◽  
Author(s):  
Niels Dekker ◽  
Tobias Kuhn ◽  
Marieke van Erp

The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations for the construction of these networks, but many of these tools are not specifically created for the literary domain. Furthermore, many of the studies on information extraction from literature typically focus on 19th century source material. Because of this, it is unclear if these techniques are as suitable to modern-day science fiction and fantasy literature as they are to those 19th century classics. We present a study to compare classic literature to modern literature in terms of performance of natural language processing tools for the automatic extraction of social networks as well as their network structure. We find that there are no significant differences between the two sets of novels but that both are subject to a high amount of variance. Furthermore, we identify several issues that complicate named entity recognition in modern novels and we present methods to remedy these.


Author(s):  
Caroline Sabty ◽  
Ahmed Sherif ◽  
Mohamed Elmahdy ◽  
Slim Abdennadher

As a result of globalization and better quality of education, a significant percentage of the population in Arab countries has become bilingual or multilingual. This has raised the frequency of code-switching and code-mixing among Arabs in daily communication. Consequently, a huge amount of Code-Mixed (CM) content can be found on different social media platforms. Such data could be analyzed and used in different Natural Language Processing (NLP) tasks to tackle the challenges emerging due to this multilingual phenomenon. Named-Entity Recognition (NER) is one of the major tasks for several NLP systems. It is the process of identifying named entities in text. However, there is a lack of annotated CM data and resources for this task. This work aims at collecting and building the first annotated CM Arabic-English corpus for NER. Furthermore, we constructed a baseline NER system using deep neural networks and word embeddings for Arabic-English CM text. Moreover, we investigated the usage of different types of classical and contextual pre-trained word embeddings in our system. The best-performing NER system achieved an F1-score of 77.69% by combining classical and contextual word embeddings.


2020 ◽  
Author(s):  
Vladislav Mikhailov ◽  
Tatiana Shavrina

Named Entity Recognition (NER) is a fundamental task in the fields of natural language processing and information extraction. NER has been widely used as a standalone tool or an essential component in a variety of applications such as question answering, dialogue assistants and knowledge graphs development. However, training reliable NER models requires a large amount of labelled data which is expensive to obtain, particularly in specialized domains. This paper describes a method to learn a domain-specific NER model for an arbitrary set of named entities when domain-specific supervision is not available. We assume that the supervision can be obtained with no human effort, and neural models can learn from each other. The code, data and models are publicly available.


2021 ◽  
Author(s):  
E. Oliveira ◽  
G. Dias ◽  
J. Lima ◽  
J. P. C. Pirovani

The objective of Named Entity Recognition is to automatically identify and classify entities such as persons, places, organizations, and so forth. It is an area of Natural Language Processing and Information Extraction. Named Entity Recognition is important because it is a fundamental preprocessing step for several applications, such as relation extraction. However, it is a hard problem to solve, as several categories of named entities are written similarly and they appear in similar contexts. To accomplish it, we can use some hybrid approaches. Nevertheless, in the present study, we take a linguistic approach by applying Local Grammar and a Cascade of Transducers. Local Grammars are used to represent the rules of a particular linguistic structure. They are often built manually to describe the entities we aim to recognize. In our study, we adapted a Local Grammar to improve the Recognition of Named Entities. The results show an improvement of up to 7% on the F-measure metric in relation to the previous Local Grammar. Also, we built another Local Grammar to recognize family relationships from the improved Local Grammar. We present a practical application for the extracted relationships using Prolog.


Data ◽  
2018 ◽  
Vol 3 (4) ◽  
pp. 53 ◽  
Author(s):  
Maria Mitrofan ◽  
Verginica Barbu Mititelu ◽  
Grigorina Mitrofan

Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most of the resources used in biomedical NLP tasks are in English. Little effort has been reported for other languages, including Romanian, and thus access to such language resources is poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed in three medical subdomains: cardiology, diabetes and endocrinology, extracted from books, journals and blog posts. To automatically annotate the corpus with POS tags, we used a Romanian tag set which has 715 labels, while labels for diseases, anatomy, procedures, and chemicals and drugs were manually annotated for bioNER with a Cohen's kappa coefficient of 92.8%, revealing 1,877 medical named entities. The automatic annotation of the corpus has been manually checked. The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.
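For readers unfamiliar with the agreement measure cited above, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for token-level labels. The label names and the two toy annotation sequences are hypothetical and are not drawn from the MoNERo corpus.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0   # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical token-level annotations from two annotators
annotator_a = ["DISEASE", "DISEASE", "ANATOMY", "CHEMICAL", "O", "O", "O", "PROCEDURE"]
annotator_b = ["DISEASE", "ANATOMY", "ANATOMY", "CHEMICAL", "O", "O", "O", "O"]

kappa = cohen_kappa(annotator_a, annotator_b)
# Observed agreement is 6/8 = 0.75, chance agreement is 17/64,
# so kappa = 31/47, roughly 0.66.
print(round(kappa, 3))
```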

