Statistical Method for Named Entity Recognition in Telugu, an Indian Language

Named Entity Recognition (NER) is one of the important tasks of Natural Language Processing (NLP). The primary operation of NER is to identify proper nouns, i.e., to locate all the named entities in the text and tag them with named entity categories such as Entity, Time expression and Numeric expression. In previous work, NER for Telugu was addressed with Conditional Random Fields (CRF) and Maximum Entropy models; however, these failed to handle ambiguous named entity tags for the same named entity. This paper presents a hybrid statistical system for NER in Telugu in which named entities are identified by both a dictionary-based approach and a statistical Hidden Markov Model (HMM). The proposed method uses a lexicon-lookup dictionary and contexts based on semantic features to predict named entity tags. The HMM is then used to resolve ambiguities in the predicted tags. The present work reports an average accuracy of 86.3% for finding the named entities.
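
To illustrate how an HMM can resolve an ambiguous tag, here is a minimal sketch of Viterbi decoding over a toy model. The tag set, the English placeholder words, and every probability below are hypothetical values chosen for illustration; none of them come from the paper, which works on Telugu text and does not publish its model in this abstract.

```python
import math

# Hypothetical toy model: "Washington" is ambiguous between PER and LOC.
TAGS = ["O", "PER", "LOC"]
START_P = {"O": 0.6, "PER": 0.3, "LOC": 0.1}
TRANS_P = {
    "O": {"O": 0.6, "PER": 0.1, "LOC": 0.3},
    "PER": {"O": 0.3, "PER": 0.6, "LOC": 0.1},
    "LOC": {"O": 0.6, "PER": 0.1, "LOC": 0.3},
}
EMIT_P = {
    "O": {"in": 0.9, "visited": 0.1},
    "PER": {"Mr": 0.5, "Washington": 0.5},
    "LOC": {"Washington": 0.5, "Delhi": 0.5},
}


def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence under a first-order HMM.

    Words unseen for a tag get the small probability `floor` so that
    log-probabilities stay finite.
    """
    # best[i][t] = (log-prob of the best path ending in tag t at i, previous tag)
    best = [{}]
    for t in tags:
        emit = emit_p[t].get(tokens[0], floor)
        best[0][t] = (math.log(start_p[t]) + math.log(emit), None)
    for i in range(1, len(tokens)):
        best.append({})
        for t in tags:
            emit = math.log(emit_p[t].get(tokens[i], floor))
            best[i][t] = max(
                (best[i - 1][p][0] + math.log(trans_p[p][t]) + emit, p)
                for p in tags
            )
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


print(viterbi(["Mr", "Washington"], TAGS, START_P, TRANS_P, EMIT_P))  # ['PER', 'PER']
print(viterbi(["in", "Washington"], TAGS, START_P, TRANS_P, EMIT_P))  # ['O', 'LOC']
```

In this toy model the emission probabilities for "Washington" are identical for PER and LOC, so the decision is made entirely by the transition probabilities from the preceding tag.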

An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Data ◽

10.3390/data6070071 ◽

2021 ◽

Vol 6 (7) ◽

pp. 71

Author(s):

Gonçalo Carnaz ◽

Mário Antunes ◽

Vitor Beires Nogueira

Keyword(s):

Machine Learning ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Automatic Identification ◽

Named Entities ◽

Related Data ◽

Named Entity ◽

Chain Of Custody ◽

Evidence Collection

Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. It is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. There exists a wide set of dedicated tools, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained for the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
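
The reported mean F1-score (0.733) is lower than the harmonic mean of the reported mean precision and recall (about 0.763). One possible explanation, which is an assumption here and not stated in the abstract, is that F1 was computed per entity class and then averaged. The sketch below shows the F1 formula and, with hypothetical per-class values, how the two ways of averaging can differ.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# F1 computed directly from the mean precision and recall in the abstract.
F1_OF_MEANS = f1_score(0.808, 0.722)
print(round(F1_OF_MEANS, 3))  # 0.763

# Hypothetical per-class (precision, recall) pairs, for illustration only.
PER_CLASS = [(0.9, 0.9), (0.9, 0.3)]
MEAN_OF_F1 = mean([f1_score(p, r) for p, r in PER_CLASS])
F1_OF_CLASS_MEANS = f1_score(
    mean([p for p, _ in PER_CLASS]), mean([r for _, r in PER_CLASS])
)
print(round(MEAN_OF_F1, 3), round(F1_OF_CLASS_MEANS, 3))  # 0.675 0.72
```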

A Survey of Arabic Named Entity Recognition and Classification

Computational Linguistics ◽

10.1162/coli_a_00178 ◽

2014 ◽

Vol 40 (2) ◽

pp. 469-510 ◽

Cited By ~ 62

Author(s):

Khaled Shaalan

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Relevant Information ◽

Arabic Language ◽

Entity Recognition ◽

Named Entities ◽

Linguistic Resources ◽

Named Entity ◽

To Receive ◽

Made In

As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic language family, make dealing with NER a challenge. A well-performing Arabic NER component improves the overall performance of the NLP system. This article attempts to describe and detail the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in the Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, the state of the art of Arabic NER research is reviewed. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.

The sale of heritage on eBay: Market trends and cultural value

Big Data & Society ◽

10.1177/2053951720968865 ◽

2020 ◽

Vol 7 (2) ◽

pp. 205395172096886

Author(s):

Mark Altaweel ◽

Tasoula Georgiou Hadjitofi

Keyword(s):

Language Processing ◽

Large Scale ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Entity Recognition ◽

Cultural Value ◽

Named Entity ◽

Online Marketplace ◽

Large Scale Analysis ◽

Market Trends

The marketisation of heritage has been a major topic of interest among heritage specialists studying how the online marketplace shapes sales. Missing from that debate is a large-scale analysis seeking to understand market trends on popular selling platforms such as eBay. Sites such as eBay can indicate which heritage items are of interest to the wider public, and thus what is potentially of greater cultural value, while also demonstrating monetary value trends. To better understand the sale of heritage on eBay’s international site, this work applies named entity recognition using conditional random fields, a method within natural language processing, and word dictionaries that inform on market trends. The methods demonstrate how Western markets, particularly the US and UK, have dominated sales for different cultures. Roman, Egyptian, Viking (Norse/Dane) and Near East objects are sold the most. Surprisingly, Cyprus and Egypt, two countries with relatively strict prohibition against the sale of heritage items, make the top 10 selling countries on eBay. Objects such as jewellery, statues and figurines, and religious items sell in relatively greater numbers, while masks and vessels (e.g. vases) sell at generally higher prices. Metal, stone and terracotta are commonly sold materials. Objects made of rarer materials, such as ivory, papyrus or wood, have relatively higher prices. A few sellers dominate the market: in some months, the top 10 sellers control 40% of sales. The tool used for the study is freely provided, demonstrating benefits in an automated approach to understanding sale trends.

An Experimental Study of Hybrid Machine Learning Models for Extracting Named Entities

10.29007/dp5m ◽

2019 ◽

Author(s):

Lei Jiang ◽

Elena Bolshakova

Keyword(s):

Neural Network ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Network Models ◽

Entity Recognition ◽

Neural Network Models ◽

Named Entities ◽

Hybrid Neural Network ◽

Named Entity ◽

Two Hybrid

The paper describes two hybrid neural network models for named entity recognition (NER) in texts, as well as the results of experiments with them. The first model, Bi-LSTM-CRF, is known and used for NER, while the second, Gated-CNN-CRF, is proposed in this work. It combines a convolutional neural network (CNN), gated linear units, and conditional random fields (CRF). Both models were tested for NER on datasets in three languages: English, Russian, and Chinese. All resulting precision, recall and F1-measure scores for both models are close to the state of the art for NER, and on the English dataset CoNLL-2003 the Gated-CNN-CRF model achieves an F1-measure of 92.66, outperforming the known result.
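
As a rough illustration of the gating idea, the sketch below applies a gated linear unit to the outputs of two one-dimensional convolutions over the same input, using the commonly used formulation in which one convolution's output is multiplied element-wise by the sigmoid of the other's. The abstract does not describe how the CNN, the gates and the CRF are wired together, so the function names, kernel sizes and scalar inputs here are assumptions for illustration only.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def conv1d(seq, kernel, bias=0.0):
    """Valid (unpadded) 1-D convolution of a scalar sequence with a kernel."""
    k = len(kernel)
    return [
        sum(seq[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(seq) - k + 1)
    ]


def gated_linear_unit(values, gates):
    """Element-wise GLU: values[i] * sigmoid(gates[i])."""
    if len(values) != len(gates):
        raise ValueError("values and gates must have the same length")
    return [v * sigmoid(g) for v, g in zip(values, gates)]


def gated_conv1d(seq, value_kernel, gate_kernel):
    """Two convolutions over the same input: one carries content, one gates it."""
    return gated_linear_unit(conv1d(seq, value_kernel), conv1d(seq, gate_kernel))


# With an all-zero gate kernel every gate is sigmoid(0) = 0.5.
print(gated_conv1d([1.0, 2.0, 3.0], [1.0, 1.0], [0.0, 0.0]))  # [1.5, 2.5]
```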

POS Tagging and NER System for Kannada Using Conditional Random Fields

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2021100101 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1-13

Author(s):

Arpitha Swamy ◽

Srinath S.

Keyword(s):

Random Fields ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Model Testing ◽

Entity Recognition ◽

Parts Of Speech ◽

Named Entity ◽

Pos Tagging ◽

Proper Nouns ◽

Pos Tagger

Parts-of-speech (POS) tagging assigns a POS tag to every word in a text, and named entity recognition (NER) identifies the proper nouns in a text and classifies them into predefined categories. A POS tagger and an NER system for Kannada text are proposed, both based on conditional random fields (CRFs). The dataset used for POS tagging consists of 147K tokens, of which 103K tokens are used for training and the remaining tokens for testing. The proposed CRF model for POS tagging of Kannada text obtained a precision of 91.3%, a recall of 91.6%, and an F-score of 91.4%. To develop the NER system for Kannada, the required data was created manually using a modified tag set containing 40 labels. The dataset used for the NER system consists of 16.5K tokens, of which 70% are used for training the model and the remaining 30% for testing. The developed NER model obtained a precision of 94%, a recall of 93.9%, and an F1-measure of 93.9%.
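
A CRF tagger is usually trained on one feature dictionary per token. The abstract does not list the features the authors used, so the sketch below shows only a typical, hypothetical feature template (word form, affixes, a digit flag and neighbouring words), with English placeholder tokens standing in for Kannada text.

```python
def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        "word": word,
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "is_digit": word.isdigit(),
        # Sentence-boundary markers stand in for missing neighbours.
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """Feature dictionaries for every token of one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Placeholder sentence, for illustration only.
FEATURES = sentence_features(["Ravi", "visited", "Mysuru", "in", "2021"])
print(FEATURES[2])
```

Lists of such dictionaries, paired with one label per token, are the input format accepted by several CRF toolkits; which toolkit the authors used is not stated in the abstract.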

Research on Chinese Named Entity Recognition Based on Ontology

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.195-196.1180 ◽

2012 ◽

Vol 195-196 ◽

pp. 1180-1185

Author(s):

Wei Li Chang ◽

Fang Luo ◽

Ji Lai Qian

Keyword(s):

Language Processing ◽

Conditional Random Fields ◽

Critical Role ◽

Named Entity Recognition ◽

Recall Rate ◽

Entity Recognition ◽

Named Entity ◽

Part Of Speech ◽

Speech Features ◽

Precision Rate

Chinese Named Entity Recognition (NER) plays a critical role in many Natural Language Processing (NLP) applications, such as Information Extraction and Machine Translation, and remains a challenging task because of the characteristics of Chinese. This paper proposes a method for Chinese NER that combines a Conditional Random Fields (CRFs) model with domain ontology as a semantic feature, in addition to word and part-of-speech features. Experiments compared the two kinds of feature templates; the precision and recall of Chinese NER rose to 90.86% and 88.23%, respectively, showing the remarkable performance of the proposed approach. Combining ontology with the CRF method effectively increased the precision and recall of Chinese NER.

Evaluating named entity recognition tools for extracting social networks from novels

PeerJ Computer Science ◽

10.7717/peerj-cs.189 ◽

2019 ◽

Vol 5 ◽

pp. e189 ◽

Cited By ~ 2

Author(s):

Niels Dekker ◽

Tobias Kuhn ◽

Marieke van Erp

Keyword(s):

Social Networks ◽

Social Interactions ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Computer Assisted ◽

Early 20th Century ◽

Automatic Extraction ◽

Named Entities ◽

Named Entity

The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations for the construction of these networks, but many of these tools are not specifically created for the literary domain. Furthermore, many of the studies on information extraction from literature typically focus on 19th and early 20th century source material. Because of this, it is unclear if these techniques are as suitable to modern-day literature as they are to those older novels. We present a study in which we evaluate natural language processing tools for the automatic extraction of social networks from novels as well as their network structure. We find that there are no significant differences between old and modern novels but that both are subject to a large amount of variance. Furthermore, we identify several issues that complicate named entity recognition in our set of novels and we present methods to remedy these. We see this work as a step in creating more culturally-aware AI systems.
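
One simple way to turn recognised character names into a social network is to link two characters whenever they are mentioned in the same sentence. The abstract does not say how edges are built in the study, so the co-occurrence rule, the function name and the example names below are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(sentence_entities):
    """Count how often each pair of characters shares a sentence.

    `sentence_entities` holds one list of recognised character names per
    sentence. Returns a Counter mapping (name_a, name_b) to an edge weight,
    with each pair stored in alphabetical order.
    """
    edges = Counter()
    for entities in sentence_entities:
        # set() ignores repeated mentions within one sentence.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical NER output for four sentences.
SENTENCES = [
    ["Elizabeth", "Darcy"],
    ["Darcy", "Bingley", "Elizabeth"],
    ["Jane"],
    ["Bingley", "Jane"],
]
NETWORK = cooccurrence_network(SENTENCES)
print(NETWORK[("Darcy", "Elizabeth")])  # 2
```

Errors in the NER step propagate directly into such a graph: a missed or misspelled name removes or splits a node, which is one reason the quality of the entity recogniser matters for the resulting network structure.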

Evaluating social network extraction for classic and modern fiction literature

10.7287/peerj.preprints.27263 ◽

2018 ◽

Author(s):

Niels Dekker ◽

Tobias Kuhn ◽

Marieke van Erp

Keyword(s):

Social Networks ◽

Science Fiction ◽

19th Century ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Computer Assisted ◽

Named Entities ◽

Named Entity ◽

Modern Fiction

The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations for the construction of these networks, but many of these tools are not specifically created for the literary domain. Furthermore, many of the studies on information extraction from literature typically focus on 19th century source material. Because of this, it is unclear if these techniques are as suitable to modern-day science fiction and fantasy literature as they are to those 19th century classics. We present a study to compare classic literature to modern literature in terms of performance of natural language processing tools for the automatic extraction of social networks as well as their network structure. We find that there are no significant differences between the two sets of novels but that both are subject to a high amount of variance. Furthermore, we identify several issues that complicate named entity recognition in modern novels and we present methods to remedy these.

Conditional Random Fields for Biomedical Named Entity Recognition Revisited

10.21203/rs.3.rs-36431/v1 ◽

2020 ◽

Author(s):

Xie-Yuan Xie

Keyword(s):

Random Fields ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Entity Recognition ◽

Biomedical Domain ◽

Minimal Set ◽

Named Entities ◽

Named Entity ◽

Biomedical Texts ◽

Biomedical Named Entity Recognition

Named Entity Recognition (NER) is a key task that automatically extracts Named Entities (NEs) from text. Names of persons and places, dates, and times are examples of NEs. We apply Conditional Random Fields (CRFs) to NER in the biomedical domain. Examples of NEs in biomedical texts are genes and proteins. We used a minimal set of features to train the CRF model and obtained good results for biomedical texts.

A Software Tool for Biomedical Information Extraction (And Beyond)

Health Information Systems ◽

10.4018/978-1-60566-988-5.ch061 ◽

2011 ◽

pp. 975-985

Author(s):

Burr Settles

Keyword(s):

Open Source ◽

Language Processing ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Software Tool ◽

Cell Types ◽

Entity Recognition ◽

Named Entity ◽

Software Distribution ◽

Classification Information

ABNER (A Biomedical Named Entity Recognizer) is an open-source software tool for text mining in the molecular biology literature. It processes unstructured biomedical documents in order to discover and annotate mentions of genes, proteins, cell types, and other entities of interest. This task, known as named entity recognition (NER), is an important first step for many larger information management goals in biomedicine, namely extraction of biochemical relationships, document classification, information retrieval, and the like. To accomplish this task, ABNER uses state-of-the-art machine learning models for sequence labeling called conditional random fields (CRFs). The software distribution comes bundled with two models that are pre-trained on standard evaluation corpora. ABNER can run as a stand-alone application with a graphical user interface, or be accessed as a Java API allowing it to be re-trained with new labeled corpora and incorporated into other, higher-level applications. This chapter describes the software and its features, presents an overview of the underlying technology, and provides a discussion of some of the more advanced natural language processing systems for which ABNER has been used as a component. ABNER is open-source and freely available from http://pages.cs.wisc.edu/~bsettles/abner/
