Research on Chinese Named Entity Recognition Based on Ontology

Chinese Named Entity Recognition (NER) plays a critical role in many Natural Language Processing (NLP) applications, such as Information Extraction and Machine Translation, yet it remains a challenging task because of the characteristics of the Chinese language. This paper proposes a Chinese NER method that combines a Conditional Random Fields (CRFs) model with a domain ontology, used as a semantic feature alongside word and part-of-speech features. Experiments compared the two kinds of feature templates; with the proposed approach, the precision and recall of Chinese NER rose to 90.86% and 88.23%, respectively. Combining the ontology with the CRFs method thus effectively increased both the precision and the recall of Chinese NER.
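
To make the idea of a feature template with word, part-of-speech, and ontology features concrete, here is a minimal sketch of per-token feature extraction of the kind a CRF toolkit typically consumes. The toy ontology, the concept names, the window size, and the function name `token_features` are all illustrative assumptions; the abstract does not specify the paper's actual templates.

```python
from typing import Dict, List, Tuple

# Hypothetical toy ontology mapping a word to a domain concept (illustrative only).
TOY_ONTOLOGY: Dict[str, str] = {"北京": "City", "大学": "Organization"}


def token_features(sent: List[Tuple[str, str]], i: int,
                   ontology: Dict[str, str]) -> Dict[str, str]:
    """Build word, POS, and ontology-concept features for token i,
    using an assumed context window of one token on each side."""
    feats: Dict[str, str] = {}
    for offset in (-1, 0, 1):
        j = i + offset
        if 0 <= j < len(sent):
            word, pos = sent[j]
            feats[f"word[{offset}]"] = word
            feats[f"pos[{offset}]"] = pos
            # Semantic feature: the ontology concept, or NONE if the word is unknown.
            feats[f"onto[{offset}]"] = ontology.get(word, "NONE")
        else:
            # Mark positions that fall outside the sentence.
            feats[f"word[{offset}]"] = "<PAD>"
    return feats


# A segmented, POS-tagged sentence as (word, POS) pairs; the tags are placeholders.
sentence = [("我", "r"), ("在", "p"), ("北京", "ns"), ("大学", "n")]
features = [token_features(sentence, i, TOY_ONTOLOGY) for i in range(len(sentence))]
```

Dropping the `onto[...]` entries yields the word-and-POS-only template, which is one way the two kinds of templates could be compared.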

Advances in Computational Linguistics and Text Processing Frameworks

Advances in Computer and Electrical Engineering - Handbook of Research on Engineering Innovations and Technology Management in Organizations ◽

10.4018/978-1-7998-2772-6.ch012 ◽

2020 ◽

pp. 217-244

Author(s):

Ayush Srivastav ◽

Hera Khan ◽

Amit Kumar Mishra

Keyword(s):

Neural Networks ◽

Natural Language Processing ◽

Natural Language ◽

Computational Linguistics ◽

Language Processing ◽

Text Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity ◽

Part Of Speech

The chapter gives an account of the major methodologies and advances in the field of Natural Language Processing. It discusses the most popular models that have been used for Natural Language Processing over time, along with their applications to specific tasks. The chapter begins with the fundamental concepts of regex and tokenization. It then provides insight into text preprocessing and its methodologies, such as Stemming and Lemmatization and Stop Word Removal, followed by Part-of-Speech tagging and Named Entity Recognition. Further, the chapter elaborates on the concept of Word Embedding, its various types, and some common frameworks such as word2vec, GloVe, and fastText. A brief description of classification algorithms used in Natural Language Processing is provided next, followed by Neural Networks and their advanced forms, such as Recursive Neural Networks and Seq2seq models, that are used in Computational Linguistics. A brief description of chatbots and Memory Networks concludes the chapter.
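
As a small illustration of the first two steps mentioned above, regex tokenization and stop word removal, the following sketch uses only Python's standard `re` module. The regular expression, the stop-word set, and the sample sentence are assumptions chosen for the example, not material from the chapter.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list (illustrative only).
STOP_WORDS = {"the", "and", "a", "of"}


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Keep alphanumeric tokens that are not in the stop-word list."""
    return [tok for tok in tokens if tok.isalnum() and tok not in STOP_WORDS]


tokens = tokenize("NER finds names, dates and places.")
content_tokens = remove_stop_words(tokens)
```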

An Arabic Dataset for Disease Named Entity Recognition with Multi-Annotation Schemes

Data ◽

10.3390/data5030060 ◽

2020 ◽

Vol 5 (3) ◽

pp. 60

Author(s):

Nasser Alshammari ◽

Saad Alanazi

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Definite Article ◽

Linguistic Features ◽

Annotation Scheme ◽

Named Entity ◽

Part Of Speech ◽

Annotation Process ◽

Arabic Natural Language Processing

This article outlines a novel data descriptor that provides the Arabic natural language processing community with a dataset dedicated to named entity recognition for diseases. The dataset comprises more than 60 thousand words, which were annotated manually by two independent annotators using the inside–outside (IO) annotation scheme. To ensure the reliability of the annotation process, the inter-annotator agreement rate was calculated; it reached 95.14%. Because little research in the literature has studied Arabic multi-annotation schemes, a distinguishing and novel aspect of this dataset is the inclusion of six more annotation schemes, which allow researchers to explore and compare the effects of these schemes on the performance of Arabic named entity recognizers. These annotation schemes are IOE, IOB, BIES, IOBES, IE, and BI. Additionally, five linguistic features are provided for each record in the dataset: part-of-speech tags, stopwords, gazetteers, lexical markers, and the presence of the definite article.
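
To illustrate how annotation schemes of this kind relate to one another, the sketch below converts IO tags to IOB and then to IOBES. It assumes that IOB means the IOB2 variant (every entity starts with `B-`), that each contiguous run of same-type `I-` tags in the IO input is a single entity (plain IO cannot separate adjacent entities), and that the label `Disease` and both function names are hypothetical. It is not the dataset authors' conversion procedure.

```python
from typing import List


def io_to_iob(tags: List[str]) -> List[str]:
    """Convert IO tags to IOB2, treating each contiguous same-type run as one entity."""
    out: List[str] = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            out.append("O")
        else:
            etype = tag.split("-", 1)[1]
            # An entity starts after an O tag or after an entity of another type.
            starts = prev == "O" or prev.split("-", 1)[1] != etype
            out.append(("B-" if starts else "I-") + etype)
        prev = tag
    return out


def iob_to_iobes(tags: List[str]) -> List[str]:
    """Convert IOB2 tags to IOBES by marking entity ends (E-) and singletons (S-)."""
    out: List[str] = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + etype  # does the same entity go on?
        if prefix == "B":
            out.append(("B-" if continues else "S-") + etype)
        else:
            out.append(("I-" if continues else "E-") + etype)
    return out


io_tags = ["O", "I-Disease", "I-Disease", "O", "I-Disease"]
iob_tags = io_to_iob(io_tags)
iobes_tags = iob_to_iobes(iob_tags)
```

The conversion from IOB2 to IOBES loses no information, whereas the step from IO to IOB relies on the stated single-run assumption.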

Named Entity Recognition for the Indonesian Language: Combining Contextual, Morphological and Part-of-Speech Features into a Knowledge Engineering Approach

Discovery Science - Lecture Notes in Computer Science ◽

10.1007/11563983_7 ◽

2005 ◽

pp. 57-69 ◽

Cited By ~ 11

Author(s):

Indra Budi ◽

Stéphane Bressan ◽

Gatot Wahyudi ◽

Zainal A. Hasibuan ◽

Bobby A. A. Nazief

Keyword(s):

Knowledge Engineering ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity ◽

Engineering Approach ◽

Part Of Speech ◽

Speech Features

The sale of heritage on eBay: Market trends and cultural value

Big Data & Society ◽

10.1177/2053951720968865 ◽

2020 ◽

Vol 7 (2) ◽

pp. 205395172096886

Author(s):

Mark Altaweel ◽

Tasoula Georgiou Hadjitofi

Keyword(s):

Language Processing ◽

Large Scale ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Entity Recognition ◽

Cultural Value ◽

Named Entity ◽

Online Marketplace ◽

Large Scale Analysis ◽

Market Trends

The marketisation of heritage has been a major topic of interest among heritage specialists studying how the online marketplace shapes sales. Missing from that debate is a large-scale analysis seeking to understand market trends on popular selling platforms such as eBay. Sites such as eBay can indicate which heritage items are of interest to the wider public, and thus what is potentially of greater cultural value, while also demonstrating monetary value trends. To better understand the sale of heritage on eBay’s international site, this work applies named entity recognition using conditional random fields, a method within natural language processing, and word dictionaries that inform on market trends. The methods demonstrate how Western markets, particularly the US and UK, have dominated sales for different cultures. Roman, Egyptian, Viking (Norse/Dane) and Near East objects are sold the most. Surprisingly, Cyprus and Egypt, two countries with relatively strict prohibitions against the sale of heritage items, are among the top 10 selling countries on eBay. Objects such as jewellery, statues and figurines, and religious items sell in relatively greater numbers, while masks and vessels (e.g. vases) sell at generally higher prices. Metal, stone and terracotta are commonly sold materials. Objects made of rarer materials, such as ivory, papyrus or wood, have relatively higher prices. A few sellers dominate the market: in some months, 40% of sales are controlled by the top 10 sellers. The tool used for the study is freely provided, demonstrating the benefits of an automated approach to understanding sale trends.

Statistical Method for Named Entity Recognition in Telugu, an Indian Language

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b3500.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 4211-4216

Keyword(s):

Language Processing ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Entity Recognition ◽

Semantic Features ◽

Indian Language ◽

Named Entities ◽

Maximum Entropy Models ◽

Named Entity ◽

Proper Nouns

Named Entity Recognition (NER) is one of the important tasks of Natural Language Processing (NLP). The primary operation of NER is to identify proper nouns, i.e. to locate all the named entities in a text and tag them with named entity categories such as Entity, Time expression and Numeric expression. Previous work addressed NER for the Telugu language with Conditional Random Fields (CRF) and Maximum Entropy models; however, these approaches failed to handle ambiguous named entity tags for the same named entity. This paper presents a hybrid statistical system for NER in Telugu in which named entities are identified by both a dictionary-based approach and a statistical Hidden Markov Model (HMM). The proposed method uses a lexicon-lookup dictionary and contexts based on semantic features to predict named entity tags. An HMM is then used to resolve ambiguities in the predicted named entity tags. The work reports an average accuracy of 86.3% in finding named entities.
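
One common way an HMM resolves a tag ambiguity is Viterbi decoding, which picks the most probable tag sequence given start, transition, and emission probabilities. The sketch below is a generic illustration under that assumption; the abstract does not say how the paper decodes its HMM. The tag set, the English placeholder tokens, every probability value, and the smoothing floor are hypothetical.

```python
import math
from typing import Dict, List

FLOOR = 1e-6  # assumed smoothing value for words a state never emitted


def viterbi(obs: List[str], states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable state sequence for obs (log-space Viterbi)."""
    def emit(state: str, word: str) -> float:
        return math.log(emit_p[state].get(word, FLOOR))

    scores = {s: math.log(start_p[s]) + emit(s, obs[0]) for s in states}
    backpointers = []
    for word in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best previous state for reaching s at this position.
            best_prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = scores[best_prev] + math.log(trans_p[best_prev][s]) + emit(s, word)
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model: "Washington" is equally likely to be emitted by PER and LOC,
# so only the transition probabilities can break the tie.
states = ["O", "PER", "LOC"]
start_p = {"O": 0.6, "PER": 0.2, "LOC": 0.2}
trans_p = {
    "O": {"O": 0.6, "PER": 0.1, "LOC": 0.3},
    "PER": {"O": 0.8, "PER": 0.1, "LOC": 0.1},
    "LOC": {"O": 0.8, "PER": 0.1, "LOC": 0.1},
}
emit_p = {
    "O": {"in": 0.5},
    "PER": {"Washington": 0.5},
    "LOC": {"Washington": 0.5},
}
best_tags = viterbi(["in", "Washington"], states, start_p, trans_p, emit_p)
```

In this toy model the lexicon alone cannot choose between PER and LOC for the second token; the higher assumed O-to-LOC transition probability decides it.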

A Software Tool for Biomedical Information Extraction (And Beyond)

Health Information Systems ◽

10.4018/978-1-60566-988-5.ch061 ◽

2011 ◽

pp. 975-985

Author(s):

Burr Settles

Keyword(s):

Open Source ◽

Language Processing ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Software Tool ◽

Cell Types ◽

Entity Recognition ◽

Named Entity ◽

Software Distribution ◽

Classification Information

ABNER (A Biomedical Named Entity Recognizer) is an open-source software tool for text mining in the molecular biology literature. It processes unstructured biomedical documents in order to discover and annotate mentions of genes, proteins, cell types, and other entities of interest. This task, known as named entity recognition (NER), is an important first step for many larger information management goals in biomedicine, namely extraction of biochemical relationships, document classification, information retrieval, and the like. To accomplish this task, ABNER uses state-of-the-art machine learning models for sequence labeling called conditional random fields (CRFs). The software distribution comes bundled with two models that are pre-trained on standard evaluation corpora. ABNER can run as a stand-alone application with a graphical user interface, or be accessed as a Java API allowing it to be re-trained with new labeled corpora and incorporated into other, higher-level applications. This chapter describes the software and its features, presents an overview of the underlying technology, and provides a discussion of some of the more advanced natural language processing systems for which ABNER has been used as a component. ABNER is open-source and freely available from http://pages.cs.wisc.edu/~bsettles/abner/

Study on Affect of Part-of-Speech on the Performance of Chinese Named Entity Recognition Based on the Conditional Random Fields

2011 International Conference on Computational and Information Sciences ◽

10.1109/iccis.2011.261 ◽

2011 ◽

Author(s):

Qiu Sha ◽

Duan Bo ◽

Wang Fuyan ◽

Shen Haoro ◽

A. Yuan

Keyword(s):

Random Fields ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity ◽

Part Of Speech

Indonesian Sentence Boundary Detection using Deep Learning Approaches

Knowledge Engineering and Data Science ◽

10.17977/um018v4i12021p38-48 ◽

2021 ◽

Vol 4 (1) ◽

pp. 38

Author(s):

Joan Santoso ◽

Esther Irawati Setiawan ◽

Christian Nathaniel Purwanto ◽

Fachrul Kurniawan

Keyword(s):

Deep Learning ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Learning Approaches ◽

Named Entity ◽

Sequence Labeling ◽

Part Of Speech ◽

Sentence Patterns ◽

Sentence Boundary

Detecting sentence boundaries is one of the crucial pre-processing steps in natural language processing, because the border between one sentence and the next can be ambiguous. Because there are multiple separators and dynamic sentence patterns, relying on a full stop to mark the end of a sentence is sometimes inappropriate. This research uses a deep learning approach to split an Indonesian news document into sentences, so there is no need to define any handcrafted features or rules. As in Part of Speech Tagging and Named Entity Recognition, we use sequence labeling to determine sentence boundaries. Two labels are used: O for a non-boundary token and E for the last token in a sentence. To do this, we use the Bi-LSTM approach, which has been widely used in sequence labeling. We show that our approach works for Indonesian text using pre-trained Indonesian embeddings, as in previous studies. This study achieved an F1-score of 98.49 percent, a significant improvement over previous studies.
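
The O/E labeling described above can be sketched without any model: one helper turns gold sentences into a flat token sequence with O/E labels (the form a sequence labeler would be trained on), and another splits a token stream back into sentences from predicted labels. The function names and the sample tokens are illustrative assumptions; the Bi-LSTM itself is not shown.

```python
from typing import List, Tuple


def label_boundaries(sentences: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Flatten tokenized sentences and label each token O, or E if it ends a sentence."""
    tokens: List[str] = []
    labels: List[str] = []
    for sent in sentences:
        for k, tok in enumerate(sent):
            tokens.append(tok)
            labels.append("E" if k == len(sent) - 1 else "O")
    return tokens, labels


def split_sentences(tokens: List[str], labels: List[str]) -> List[List[str]]:
    """Rebuild sentences from a token stream and its O/E labels."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for tok, lab in zip(tokens, labels):
        current.append(tok)
        if lab == "E":
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no E label form a final sentence
        sentences.append(current)
    return sentences


# The full stop after "Dr" is labeled O, so it does not end the sentence.
gold = [["Dr", ".", "Budi", "datang", "."], ["Dia", "pulang", "."]]
tokens, labels = label_boundaries(gold)
restored = split_sentences(tokens, labels)
```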

A Comparison of Architectures and Pretraining Methods for Contextualized Multilingual Word Embeddings

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6443 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9090-9097

Author(s):

Niels Van der Heijden ◽

Samira Abnar ◽

Ekaterina Shutova

Keyword(s):

Language Processing ◽

State Of The Art ◽

Named Entity Recognition ◽

Entity Recognition ◽

Word Embeddings ◽

Named Entity ◽

Pos Tagging ◽

Part Of Speech ◽

Joint Training ◽

Comprehensive Comparison

The lack of annotated data in many languages is a well-known challenge within the field of multilingual natural language processing (NLP). Therefore, many recent studies focus on zero-shot transfer learning and joint training across languages to overcome data scarcity for low-resource languages. In this work we (i) perform a comprehensive comparison of state-of-the-art multilingual word and sentence encoders on the tasks of named entity recognition (NER) and part of speech (POS) tagging; and (ii) propose a new method for creating multilingual contextualized word embeddings, compare it to multiple baselines and show that it performs at or above state-of-the-art level in zero-shot transfer settings. Finally, we show that our method allows for better knowledge sharing across languages in a joint training setting.

Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language

Data ◽

10.3390/data3040053 ◽

2018 ◽

Vol 3 (4) ◽

pp. 53 ◽

Cited By ~ 1

Author(s):

Maria Mitrofan ◽

Verginica Barbu Mititelu ◽

Grigorina Mitrofan

Keyword(s):

Language Processing ◽

Gold Standard ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Resources ◽

Named Entities ◽

Named Entity ◽

Pos Tagging ◽

Part Of Speech ◽

Biomedical Named Entity Recognition

Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most of the resources used in biomedical NLP tasks are in English. Little effort has been reported for other languages, including Romanian, and thus access to such language resources is poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed in three medical subdomains (cardiology, diabetes and endocrinology), extracted from books, journals and blog posts. To annotate the corpus automatically with POS tags, we used a Romanian tag set that has 715 labels. Labels for diseases, anatomy, procedures, and chemicals and drugs were annotated manually for bioNER, with a Cohen's kappa coefficient of 92.8%; this annotation revealed 1877 medical named entities. The automatic annotation of the corpus has been manually checked. The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.
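
For readers unfamiliar with the agreement measure mentioned above, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from two label sequences; the label names and the six-token example are hypothetical and unrelated to the MoNERo data.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["DISO", "O", "O", "CHEM", "O", "O"]
annotator_2 = ["DISO", "O", "O", "O", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on five of six tokens (about 83%), but because most tokens are `O`, chance agreement is high and the kappa is noticeably lower than the raw agreement rate.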
