A Comparison of Architectures and Pretraining Methods for Contextualized Multilingual Word Embeddings

Proceedings of the AAAI Conference on Artificial Intelligence, doi:10.1609/aaai.v34i05.6443, 2020, Vol 34 (05), pp. 9090-9097

Author(s): Niels Van der Heijden, Samira Abnar, Ekaterina Shutova

Keyword(s): Language Processing, State Of The Art, Named Entity Recognition, Entity Recognition, Word Embeddings, Named Entity, Pos Tagging, Part Of Speech, Joint Training, Comprehensive Comparison

The lack of annotated data in many languages is a well-known challenge within the field of multilingual natural language processing (NLP). Therefore, many recent studies focus on zero-shot transfer learning and joint training across languages to overcome data scarcity for low-resource languages. In this work we (i) perform a comprehensive comparison of state-of-the-art multilingual word and sentence encoders on the tasks of named entity recognition (NER) and part of speech (POS) tagging; and (ii) propose a new method for creating multilingual contextualized word embeddings, compare it to multiple baselines and show that it performs at or above state-of-the-art level in zero-shot transfer settings. Finally, we show that our method allows for better knowledge sharing across languages in a joint training setting.

Named entity recognition in texts with the help of part of speech tagging

Bulletin of Taras Shevchenko National University of Kyiv. Series: Physics and Mathematics, doi:10.17721/1812-5409.2018/4.11, 2018, pp. 74-83

Author(s): M. Bevza

Keyword(s): State Of The Art, Named Entity Recognition, Recognition Task, Entity Recognition, Named Entity, Part Of Speech Tagging, Pos Tagging, Part Of Speech, Recent Developments, Future Work

We analyze neural network architectures that yield state-of-the-art results on the named entity recognition (NER) task and propose a number of new architectures for improving results even further. We have analyzed a number of ideas and approaches that researchers have used to achieve state-of-the-art results in a variety of NLP tasks. In this work, we present a few architectures that we consider most likely to improve on the existing state-of-the-art solutions for the NER and part-of-speech (POS) tagging tasks. The architectures are inspired by recent developments in multi-task learning. This work tests the hypothesis that NER and POS tagging are related tasks and that adding information about POS tags as input to the network can help achieve better NER results; conversely, information about NER tags can help solve the task of POS tagging. This work also contains the implementation of the network and the results of the experiments, together with conclusions and future work.

Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language

Data, doi:10.3390/data3040053, 2018, Vol 3 (4), pp. 53, Cited By ~ 1

Author(s): Maria Mitrofan, Verginica Barbu Mititelu, Grigorina Mitrofan

Keyword(s): Language Processing, Gold Standard, Named Entity Recognition, Entity Recognition, Language Resources, Named Entities, Named Entity, Pos Tagging, Part Of Speech, Biomedical Named Entity Recognition

Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most of the resources used in biomedical NLP tasks are in English. Little effort has been reported for other languages, including Romanian, and thus access to such language resources is poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed across three medical subdomains (cardiology, diabetes and endocrinology), extracted from books, journals and blog posts. To automatically annotate the corpus with POS tags, we used a Romanian tag set with 715 labels. Labels for diseases, anatomy, procedures, and chemicals and drugs were manually annotated for bioNER with a Cohen's kappa coefficient of 92.8%, revealing 1877 medical named entities. The automatic annotation of the corpus has been manually checked. The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.
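
The agreement figure reported above is a Cohen's kappa coefficient. As a rough illustration of how such a score is computed for two annotators, here is a minimal Python sketch. The label names (`DISO`, `CHEM`, `ANAT`) and the toy annotations are hypothetical and are not taken from MoNERo; the abstract does not say how the authors computed their score.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy, hypothetical token labels from two annotators.
ann_1 = ["DISO", "O", "CHEM", "O", "ANAT", "O"]
ann_2 = ["DISO", "O", "CHEM", "O", "O", "O"]
print(round(cohen_kappa(ann_1, ann_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters in NER because the `O` label dominates most corpora.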

South China Sea Conflicts Classification Using Named Entity Recognition (NER) and Part-of-Speech (POS) Tagging

International Journal of Innovative Computing, doi:10.11113/ijic.v10n1.255, 2020, Vol 10 (1)

Author(s): Nur Rafeeqkha Sulaiman, Maheyzah Md Siraj

Keyword(s): South China Sea, Language Processing, South China, Named Entity Recognition, Online News, Entity Recognition, China Sea, Named Entity, Pos Tagging, Part Of Speech

The Internet connects everyone to everything globally, and its existence makes daily tasks easier to complete. Thanks to the Internet, information is being digitized and spread openly to the public. Online news articles not only provide useful and reliable information and reports; they also ease information extraction and gathering for research purposes, especially in Natural Language Processing (NLP) and Machine Learning (ML). Topics regarding the South China Sea have been popular lately because of rising conflicts over several countries' claims to the islands in the sea. Gathering data from the Internet and online sources is easy, but manually processing a huge amount of data and identifying only the useful information takes much longer. Important features can be extracted from a text document using one feature extraction method or a combination of them. News articles relating to the conflicts in the South China Sea need to be classified and their relevant information extracted. In this paper, a model is proposed that uses Named Entity Recognition (NER) to search for and classify important information regarding the conflicts. To do so, a combination of Part-of-Speech (POS) tagging and NER is needed to extract the type of conflict from the news. This study also aims to classify news using the Conditional Random Field (CRF) algorithm and Multinomial Naïve Bayes (MNB) as classification methods, training and testing on the data.

Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition

Journal of Intelligent & Fuzzy Systems, doi:10.3233/jifs-202286, 2021, pp. 1-12

Author(s): Yingwen Fu, Nankai Lin, Xiaotian Lin, Shengyi Jiang

Keyword(s): Language Processing, State Of The Art, Named Entity Recognition, Entity Recognition, Language Models, Neural Models, Performance Models, Named Entity, High Resource, Benchmark Datasets

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art research on NER is based on pre-trained language models (PLMs) or classic neural models. However, this research is mainly oriented to high-resource languages such as English, while for Indonesian, related resources (both datasets and technology) are not yet well developed. Besides, affixes are an important part of word composition in Indonesian, which makes character and token features essential for token-wise Indonesian NLP tasks. However, the features extracted by current top-performing models are insufficient. Targeting the Indonesian NER task, in this paper we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.

Advances in Computational Linguistics and Text Processing Frameworks

Advances in Computer and Electrical Engineering - Handbook of Research on Engineering Innovations and Technology Management in Organizations, doi:10.4018/978-1-7998-2772-6.ch012, 2020, pp. 217-244

Author(s): Ayush Srivastav, Hera Khan, Amit Kumar Mishra

Keyword(s): Neural Networks, Natural Language Processing, Natural Language, Computational Linguistics, Language Processing, Text Processing, Named Entity Recognition, Entity Recognition, Named Entity, Part Of Speech

The chapter provides an eloquent account of the major methodologies and advances in the field of Natural Language Processing. The most popular models that have been used over time for Natural Language Processing are discussed along with their applications to specific tasks. The chapter begins with the fundamental concepts of regex and tokenization. It provides insight into text preprocessing and its methodologies, such as Stemming and Lemmatization and Stop Word Removal, followed by Part-of-Speech tagging and Named Entity Recognition. Further, the chapter elaborates on the concept of Word Embedding, its various types, and some common frameworks such as word2vec, GloVe, and fastText. A brief description of classification algorithms used in Natural Language Processing is provided next, followed by Neural Networks and their advanced forms, such as Recursive Neural Networks and Seq2seq models, that are used in Computational Linguistics. A brief description of chatbots and Memory Networks concludes the chapter.

An Arabic Dataset for Disease Named Entity Recognition with Multi-Annotation Schemes

Data, doi:10.3390/data5030060, 2020, Vol 5 (3), pp. 60

Author(s): Nasser Alshammari, Saad Alanazi

Keyword(s): Language Processing, Named Entity Recognition, Entity Recognition, Definite Article, Linguistic Features, Annotation Scheme, Named Entity, Part Of Speech, Annotation Process, Arabic Natural Language Processing

This article outlines a novel data descriptor that provides the Arabic natural language processing community with a dataset dedicated to named entity recognition tasks for diseases. The dataset comprises more than 60 thousand words, which were annotated manually by two independent annotators using the inside–outside (IO) annotation scheme. To ensure the reliability of the annotation process, the inter-annotator agreement rate was calculated, and it reached 95.14%. Because little research in the literature has studied Arabic multi-annotation schemes, a distinguishing and novel aspect of this dataset is the inclusion of six more annotation schemes, which bridge the gap by allowing researchers to explore and compare the effects of these schemes on the performance of Arabic named entity recognizers. These annotation schemes are IOE, IOB, BIES, IOBES, IE, and BI. Additionally, five linguistic features, including part-of-speech tags, stopwords, gazetteers, lexical markers, and the presence of the definite article, are provided for each record in the dataset.
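
To make the relationship between annotation schemes concrete, the following Python sketch derives IOB-style and IOBES-style tags from IO tags. It rests on several assumptions that the abstract does not confirm: tags are written as `I-TYPE` or `O`, the IOB variant marks every entity start with `B-`, and adjacent tokens of the same type belong to one entity (the IO scheme alone cannot separate them). The tag `DIS` is a hypothetical label name.

```python
def io_to_iob(tags):
    """Convert IO tags (e.g. 'I-DIS', 'O') to IOB tags with B- on each entity start."""
    out, prev_type = [], None
    for tag in tags:
        if tag == "O":
            out.append("O")
            prev_type = None
            continue
        ent_type = tag.split("-", 1)[1]
        # Assumption: a run of same-type tokens is a single entity.
        prefix = "I" if ent_type == prev_type else "B"
        out.append(f"{prefix}-{ent_type}")
        prev_type = ent_type
    return out


def iob_to_iobes(tags):
    """Convert IOB tags to IOBES: last token gets E-, single-token entities get S-."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, ent_type = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{ent_type}"  # does the entity go on?
        if prefix == "B":
            out.append(f"B-{ent_type}" if continues else f"S-{ent_type}")
        else:
            out.append(f"I-{ent_type}" if continues else f"E-{ent_type}")
    return out


io_tags = ["O", "I-DIS", "I-DIS", "O", "I-DIS"]
iob_tags = io_to_iob(io_tags)
print(iob_tags)
print(iob_to_iobes(iob_tags))
```

Richer schemes encode entity boundaries explicitly, which is why the choice of scheme can affect recognizer performance and is worth comparing.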

CWPC_BiAtt: Character–Word–Position Combined BiLSTM-Attention for Chinese Named Entity Recognition

Information, doi:10.3390/info11010045, 2020, Vol 11 (1), pp. 45, Cited By ~ 1

Author(s): Shardrom Johnson, Sherlock Shen, Yuanchen Liu

Keyword(s): Language Processing, Short Term Memory, Conditional Random Field, Named Entity Recognition, Attention Mechanism, Entity Recognition, Position Information, Named Entity, Pos Tagging, Word Position

Usually taken as linguistic features by Part-Of-Speech (POS) tagging, Named Entity Recognition (NER) is a major task in Natural Language Processing (NLP). In this paper, we put forward a new comprehensive-embedding that considers three aspects, namely character-embedding, word-embedding, and pos-embedding, stitched together in that order to capture their dependencies. Based on it, we propose a new Character–Word–Position Combined BiLSTM-Attention (CWPC_BiAtt) model for the Chinese NER task. Passing the comprehensive-embedding through the Bidirectional Long Short-Term Memory (BiLSTM) layer captures the connection between historical and future information, after which we employ the attention mechanism to capture the connection between the content of the sentence at the current position and that at any other location. Finally, we utilize a Conditional Random Field (CRF) to decode the entire tagging sequence. Experiments show that the CWPC_BiAtt model we propose is well suited to the NER task on the Microsoft Research Asia (MSRA) dataset and the Weibo NER corpus. High precision and recall were obtained, which verified the stability of the model. Position-embedding in comprehensive-embedding can compensate for the attention mechanism by providing position information for the disordered sequence, which shows that comprehensive-embedding has completeness. Looking at the entire model, our proposed CWPC_BiAtt has three distinct characteristics: completeness, simplicity, and stability. Our proposed CWPC_BiAtt model achieved the highest F-score, reaching state-of-the-art performance on the MSRA dataset and the Weibo NER corpus.

A Hierarchical Multi-Task Approach for Learning Embeddings from Semantic Tasks

Proceedings of the AAAI Conference on Artificial Intelligence, doi:10.1609/aaai.v33i01.33016949, 2019, Vol 33, pp. 6949-6956, Cited By ~ 8

Author(s): Victor Sanh, Thomas Wolf, Sebastian Ruder

Keyword(s): Language Processing, Semantic Information, State Of The Art, Named Entity Recognition, Relation Extraction, Entity Recognition, Inductive Bias, Named Entity, Task Learning, Hidden States

Much effort has been devoted to evaluating whether multi-task learning can be leveraged to learn rich representations that can be used in various Natural Language Processing (NLP) downstream applications. However, there is still a lack of understanding of the settings in which multi-task learning has a significant effect. In this work, we introduce a hierarchical model trained in a multi-task learning setup on a set of carefully selected semantic tasks. The model is trained in a hierarchical fashion to introduce an inductive bias by supervising a set of low-level tasks at the bottom layers of the model and more complex tasks at the top layers of the model. This model achieves state-of-the-art results on a number of tasks, namely Named Entity Recognition, Entity Mention Detection and Relation Extraction, without hand-engineered features or external NLP tools like syntactic parsers. The hierarchical training supervision induces a set of shared semantic representations at lower layers of the model. We show that as we move from the bottom to the top layers of the model, the hidden states of the layers tend to represent more complex semantic information.

Research on Chinese Named Entity Recognition Based on Ontology

Applied Mechanics and Materials, doi:10.4028/www.scientific.net/amm.195-196.1180, 2012, Vol 195-196, pp. 1180-1185

Author(s): Wei Li Chang, Fang Luo, Ji Lai Qian

Keyword(s): Language Processing, Conditional Random Fields, Critical Role, Named Entity Recognition, Recall Rate, Entity Recognition, Named Entity, Part Of Speech, Speech Features, Precision Rate

Chinese Named Entity Recognition (NER) plays a critical role in many Natural Language Processing (NLP) applications, such as Information Extraction and Machine Translation, yet it remains a challenging task because of its characteristics. This paper proposes a method for Chinese NER that combines a Conditional Random Fields (CRFs) model with domain ontology as a semantic feature, in addition to word and part-of-speech features. Experiments were conducted to compare the two kinds of feature templates, and the precision and recall of Chinese NER rose to 90.86% and 88.23%, respectively, demonstrating the strong performance of the proposed approach. Combining ontology with the CRFs method effectively increased the precision and recall of Chinese NER.
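
As a rough illustration of what a CRF feature template with an ontology feature might look like, here is a minimal Python sketch that builds a per-token feature dictionary from the word, its part-of-speech tag, and an ontology class. Everything here is hypothetical: the `token_features` helper, the one-token context window, the `ontology` lookup table, and the toy sentence are assumptions for illustration and are not the paper's actual templates.

```python
def token_features(words, pos_tags, i, ontology):
    """Feature dict for token i: word, POS tag and ontology class in a +/-1 window."""
    feats = {"bias": 1.0}
    for offset in (-1, 0, 1):
        j = i + offset
        if 0 <= j < len(words):
            feats[f"word[{offset}]"] = words[j]
            feats[f"pos[{offset}]"] = pos_tags[j]
            # Semantic feature: ontology class of the word, or NONE if unknown.
            feats[f"onto[{offset}]"] = ontology.get(words[j], "NONE")
        else:
            feats[f"pad[{offset}]"] = True  # position falls outside the sentence
    return feats


# Toy, hypothetical inputs: a segmented sentence, POS tags and a tiny ontology.
words = ["我", "在", "北京"]
pos_tags = ["PRON", "ADP", "PROPN"]
ontology = {"北京": "Location"}
print(token_features(words, pos_tags, 2, ontology))
```

Dictionaries of this shape are a common input format for CRF toolkits; dropping the `onto[...]` keys would give the word-and-POS-only template that an ontology-augmented template could be compared against.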

An ERNIE-Based Joint Model for Chinese Named Entity Recognition

Applied Sciences, doi:10.3390/app10165711, 2020, Vol 10 (16), pp. 5711

Author(s): Yu Wang, Yining Sun, Zuchang Ma, Lisheng Gao, Yang Xu

Keyword(s): Language Processing, Text Classification, Knowledge Integration, Named Entity Recognition, Training Model, Joint Model, Entity Recognition, Named Entity, Sentence Level, Joint Training

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) and the initial step in building a Knowledge Graph (KG). Recently, BERT (Bidirectional Encoder Representations from Transformers), a pre-training model, has achieved state-of-the-art (SOTA) results in various NLP tasks, including NER. However, Chinese NER is still a more challenging task for BERT because there are no physical separations between Chinese words, and BERT can only obtain the representations of Chinese characters. Chinese NER cannot be handled well with character-level representations, because the meaning of a Chinese word is quite different from that of the characters that make up the word. ERNIE (Enhanced Representation through kNowledge IntEgration), an improved pre-training model based on BERT, is more suitable for Chinese NER because it is designed to learn language representations enhanced by the knowledge masking strategy. However, the potential of ERNIE has not been fully explored: ERNIE utilizes only the token-level features and ignores the sentence-level feature when performing the NER task. In this paper, we propose ERNIE-Joint, a joint model based on ERNIE. ERNIE-Joint can utilize both the sentence-level and token-level features by jointly training the NER and text classification tasks. In order to use the raw NER datasets for joint training and avoid additional annotations, we perform the text classification task according to the number of entities in the sentences. The experiments are conducted on two datasets, MSRA-NER and Weibo, which contain Chinese news data and Chinese social media data, respectively. The results demonstrate that ERNIE-Joint not only outperforms BERT and ERNIE but also achieves SOTA results on both datasets.
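
The idea of deriving a sentence-level classification label from the number of entities, without extra annotation, can be sketched in a few lines of Python. This is only an illustration under assumptions: it presumes BIO-style tags where each entity starts with `B-`, and the bucketing rule (`max_bucket`) is hypothetical, since the abstract does not specify how entity counts map to classes.

```python
def count_entities(bio_tags):
    """Count entity spans in a BIO-tagged sentence (one span per B- tag)."""
    return sum(tag.startswith("B-") for tag in bio_tags)


def sentence_label(bio_tags, max_bucket=3):
    """Hypothetical class label: the entity count, capped at max_bucket."""
    return min(count_entities(bio_tags), max_bucket)


# Toy, hypothetical tag sequence for one sentence.
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(count_entities(tags), sentence_label(tags))
```

Because the label is computed from the existing NER tags, the same dataset can supervise both the token-level and the sentence-level objective.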
