Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition

Usually taken as linguistic features by Part-Of-Speech (POS) tagging, Named Entity Recognition (NER) is a major task in Natural Language Processing (NLP). In this paper, we put forward a new comprehensive-embedding that combines three aspects, namely character-embedding, word-embedding, and position-embedding, concatenated in that order so as to capture their dependencies. Based on it, we propose a new Character–Word–Position Combined BiLSTM-Attention (CWPC_BiAtt) model for the Chinese NER task. Passing the comprehensive-embedding through the Bidirectional Long Short-Term Memory (BiLSTM) layer captures the connection between historical and future information; the attention mechanism then captures the connection between the content of the sentence at the current position and that at any other location. Finally, we utilize a Conditional Random Field (CRF) to decode the entire tagging sequence. Experiments show that the CWPC_BiAtt model we propose is well qualified for the NER task on the Microsoft Research Asia (MSRA) dataset and the Weibo NER corpus. High precision and recall were obtained, which verified the stability of the model. Position-embedding in the comprehensive-embedding compensates for the attention mechanism by providing position information for the unordered sequence, which shows that the comprehensive-embedding is complete. Looking at the entire model, our proposed CWPC_BiAtt has three distinct characteristics: completeness, simplicity, and stability. It achieved the highest F-score, reaching state-of-the-art performance on the MSRA dataset and the Weibo NER corpus.
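The two ideas in this abstract that lend themselves to a small illustration are the ordered concatenation of character, word, and position embeddings and the attention step that relates each position to every other one. The following is a minimal sketch, not the authors' implementation: the BiLSTM and CRF layers are omitted, the scaled dot-product scoring is an assumption (the abstract does not specify the attention form), and all function names and vector sizes are hypothetical.

```python
import math

def concat_embeddings(char_vec, word_vec, pos_vec):
    """Concatenate the three embeddings in a fixed order: character, word, position."""
    return list(char_vec) + list(word_vec) + list(pos_vec)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(sequence):
    """Assumed scaled dot-product self-attention over a list of equal-length vectors.

    Every position attends to every position, so the output at each step
    mixes information from the whole sentence.
    """
    dim = len(sequence[0])
    output = []
    for query in sequence:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * vec[i] for w, vec in zip(weights, sequence))
            for i in range(dim)
        ]
        output.append(mixed)
    return output

# Toy three-token sentence; the numbers are made up for illustration.
tokens = [
    concat_embeddings([0.1, 0.2], [0.3], [0.0]),
    concat_embeddings([0.4, 0.1], [0.2], [0.5]),
    concat_embeddings([0.0, 0.3], [0.9], [1.0]),
]
attended = self_attention(tokens)
```

Because attention alone treats the sequence as unordered, the position component in each concatenated vector is the only place where token order enters this sketch, which matches the role the abstract assigns to position-embedding.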


Named entity recognition for Polish

Poznan Studies in Contemporary Linguistics ◽

10.1515/psicl-2019-0010 ◽

2019 ◽

Vol 55 (2) ◽

pp. 239-269

Author(s):

Michał Marcińczuk ◽

Aleksander Wawer

Keyword(s):

Open Source ◽

State Of The Art ◽

Proper Names ◽

Named Entity Recognition ◽

Entity Recognition ◽

Coarse Grained ◽

Named Entity ◽

Current State ◽

Annotated Corpora ◽

Available Resources

In this article we discuss the current state of the art in named entity recognition for Polish. We present publicly available resources and open-source tools for named entity recognition. The overview includes various kinds of resources, i.e. guidelines, annotated corpora (NKJP, KPWr, CEN, PST) and lexicons (NELexiconS, PNET, Gazetteer). We present the major NER tools for Polish (Sprout, NERF, Liner2, Parallel LSTM-CRFs and PolDeepNer) and discuss their performance on the reference datasets. In the article we cover identification of named entity mentions in running text, local and global entity categorization, fine- and coarse-grained categorization, and lemmatization of proper names.


POS Tagging and NER System for Kannada Using Conditional Random Fields

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2021100101 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1-13

Author(s):

Arpitha Swamy ◽

Srinath S.

Keyword(s):

Random Fields ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Model Testing ◽

Entity Recognition ◽

Parts Of Speech ◽

Named Entity ◽

Pos Tagging ◽

Proper Nouns ◽

Pos Tagger

Parts-of-speech (POS) tagging assigns a POS tag to every word in a text, and named entity recognition (NER) identifies the proper nouns in a text and classifies them into predefined categories. A POS tagger and an NER system for Kannada text are proposed, both based on conditional random fields (CRFs). The dataset used for POS tagging consists of 147K tokens, of which 103K are used for training and the remainder for testing. The proposed CRF model for POS tagging of Kannada text obtained 91.3% precision, 91.6% recall, and a 91.4% F-score. To develop the NER system for Kannada, the required data was created manually using a modified tag set containing 40 labels. The dataset used for the NER system consists of 16.5K tokens, of which 70% are used for training the model and the remaining 30% for testing. The developed NER model obtained 94% precision, 93.9% recall, and a 93.9% F1-measure.
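The reported F-scores can be checked against the reported precision and recall, since the F1-measure is their harmonic mean. The sketch below does that arithmetic and also reproduces the 70/30 split described for the NER data; the function names are hypothetical, and the token count of 16,500 is the rounded "16.5K" figure from the abstract, not an exact corpus size.

```python
def f_score(precision, recall):
    """F1-measure: the harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def split_sizes(total_tokens, train_fraction):
    """Return (train, test) token counts for a given training fraction."""
    train = round(total_tokens * train_fraction)
    return train, total_tokens - train

# Figures quoted in the abstract, in percent.
pos_f1 = f_score(91.3, 91.6)   # POS tagger
ner_f1 = f_score(94.0, 93.9)   # NER system

# Approximate NER split, assuming exactly 16,500 tokens.
ner_train, ner_test = split_sizes(16_500, 0.70)
```

Both values agree with the abstract to one decimal place (about 91.45 and 93.95 before rounding), so the reported scores are internally consistent.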


Named entity recognition in texts with the help of part of speech tagging

Bulletin of Taras Shevchenko National University of Kyiv. Series: Physics and Mathematics ◽

10.17721/1812-5409.2018/4.11 ◽

2018 ◽

pp. 74-83

Author(s):

M. Bevza

Keyword(s):

State Of The Art ◽

Named Entity Recognition ◽

Recognition Task ◽

Entity Recognition ◽

Named Entity ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Recent Developments ◽

Future Work

We analyze neural network architectures that yield state-of-the-art results on the named entity recognition task and propose a number of new architectures for improving results even further. We have analyzed a number of ideas and approaches that researchers have used to achieve state-of-the-art results in a variety of NLP tasks. In this work, we present a few architectures which we consider most likely to improve the existing state-of-the-art solutions for the named entity recognition and part-of-speech tagging tasks. The architectures are inspired by recent developments in multi-task learning. This work tests the hypothesis that NER and POS tagging are related tasks and that adding information about POS tags as input to the network can help achieve better NER results. Conversely, information about NER tags can help solve the task of POS tagging. This work also contains the implementation of the network and the results of the experiments, together with conclusions and future work.


A Software Tool for Biomedical Information Extraction (And Beyond)

Health Information Systems ◽

10.4018/978-1-60566-988-5.ch061 ◽

2011 ◽

pp. 975-985

Author(s):

Burr Settles

Keyword(s):

Open Source ◽

Language Processing ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Software Tool ◽

Cell Types ◽

Entity Recognition ◽

Named Entity ◽

Software Distribution ◽

Classification Information

ABNER (A Biomedical Named Entity Recognizer) is an open-source software tool for text mining in the molecular biology literature. It processes unstructured biomedical documents in order to discover and annotate mentions of genes, proteins, cell types, and other entities of interest. This task, known as named entity recognition (NER), is an important first step for many larger information management goals in biomedicine, namely extraction of biochemical relationships, document classification, information retrieval, and the like. To accomplish this task, ABNER uses state-of-the-art machine learning models for sequence labeling called conditional random fields (CRFs). The software distribution comes bundled with two models that are pre-trained on standard evaluation corpora. ABNER can run as a stand-alone application with a graphical user interface, or be accessed as a Java API allowing it to be re-trained with new labeled corpora and incorporated into other, higher-level applications. This chapter describes the software and its features, presents an overview of the underlying technology, and provides a discussion of some of the more advanced natural language processing systems for which ABNER has been used as a component. ABNER is open-source and freely available from http://pages.cs.wisc.edu/~bsettles/abner/


Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00374 ◽

2021 ◽

Vol 9 ◽

pp. 410-428

Author(s):

Edoardo M. Ponti ◽

Ivan Vulić ◽

Ryan Cotterell ◽

Marinela Parovic ◽

Roi Reichart ◽

...

Keyword(s):

Latent Variables ◽

Named Entity Recognition ◽

Training Data ◽

Entity Recognition ◽

Language Varieties ◽

Named Entity ◽

Shot Classification ◽

Pos Tagging ◽

Part Of Speech ◽

Cross Lingual

Most combinations of NLP tasks and language varieties lack in-domain examples for supervised training because of the paucity of annotated data. How can neural models make sample-efficient generalizations from task–language combinations with available data to low-resource ones? In this work, we propose a Bayesian generative model for the space of neural parameters. We assume that this space can be factorized into latent variables for each language and each task. We infer the posteriors over such latent variables based on data from seen task–language combinations through variational inference. This enables zero-shot classification on unseen combinations at prediction time. For instance, given training data for named entity recognition (NER) in Vietnamese and for part-of-speech (POS) tagging in Wolof, our model can perform accurate predictions for NER in Wolof. In particular, we experiment with a typologically diverse sample of 33 languages from 4 continents and 11 families, and show that our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods. Our code is available at github.com/cambridgeltl/parameter-factorization.
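The factorization idea can be illustrated with a deliberately simplified toy: one latent vector per task, one per language, and a shared generator that turns any (task, language) pair into model parameters. This is only a sketch of the compositional structure, not the authors' method. It uses fixed point values where the paper infers posteriors by variational inference, and every name and number below is hypothetical.

```python
# Hypothetical point-estimate latents; the paper infers posterior distributions instead.
TASK_LATENTS = {"ner": [1.0, 0.0], "pos": [0.0, 1.0]}
LANGUAGE_LATENTS = {"vietnamese": [0.5, -0.5], "wolof": [-1.0, 2.0]}

# Hypothetical shared generator: a 3 x 4 linear map from [task ; language] to parameters.
GENERATOR = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.5],
    [1.0, 1.0, 1.0, 1.0],
]

def generate_parameters(task, language):
    """Compose a task latent and a language latent into a parameter vector."""
    latent = TASK_LATENTS[task] + LANGUAGE_LATENTS[language]
    return [sum(w * x for w, x in zip(row, latent)) for row in GENERATOR]

# Combinations with training data, as in the abstract's example.
seen = {("ner", "vietnamese"), ("pos", "wolof")}

# NER in Wolof was never observed, yet both of its latents were learned elsewhere.
unseen = ("ner", "wolof")
zero_shot_params = generate_parameters(*unseen)
```

The point of the toy is that the unseen pair needs no latent of its own: the task latent comes from NER in Vietnamese and the language latent from POS tagging in Wolof.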


A Comparison of Architectures and Pretraining Methods for Contextualized Multilingual Word Embeddings

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6443 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9090-9097

Author(s):

Niels Van der Heijden ◽

Samira Abnar ◽

Ekaterina Shutova

Keyword(s):

Language Processing ◽

State Of The Art ◽

Named Entity Recognition ◽

Entity Recognition ◽

Word Embeddings ◽

Named Entity ◽

Pos Tagging ◽

Part Of Speech ◽

Joint Training ◽

Comprehensive Comparison

The lack of annotated data in many languages is a well-known challenge within the field of multilingual natural language processing (NLP). Therefore, many recent studies focus on zero-shot transfer learning and joint training across languages to overcome data scarcity for low-resource languages. In this work we (i) perform a comprehensive comparison of state-of-the-art multilingual word and sentence encoders on the tasks of named entity recognition (NER) and part of speech (POS) tagging; and (ii) propose a new method for creating multilingual contextualized word embeddings, compare it to multiple baselines and show that it performs at or above state-of-the-art level in zero-shot transfer settings. Finally, we show that our method allows for better knowledge sharing across languages in a joint training setting.


A single-model approach for Arabic segmentation, POS tagging, and named entity recognition

2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP) ◽

10.1109/icnlsp.2018.8374393 ◽

2018 ◽

Cited By ~ 3

Author(s):

Abed Alhakim Freihat ◽

Gabor Bella ◽

Hamdy Mubarak ◽

Fausto Giunchiglia

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Single Model ◽

Named Entity ◽

Pos Tagging ◽

Model Approach


Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language

Data ◽

10.3390/data3040053 ◽

2018 ◽

Vol 3 (4) ◽

pp. 53 ◽

Cited By ~ 1

Author(s):

Maria Mitrofan ◽

Verginica Barbu Mititelu ◽

Grigorina Mitrofan

Keyword(s):

Language Processing ◽

Gold Standard ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Resources ◽

Named Entities ◽

Named Entity ◽

Pos Tagging ◽

Part Of Speech ◽

Biomedical Named Entity Recognition

Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most of the resources used in biomedical NLP tasks are in English. Little effort has been reported for other languages, including Romanian, and thus access to such language resources is poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed across three medical subdomains (cardiology, diabetes and endocrinology), extracted from books, journals and blog posts. To annotate the corpus with POS tags automatically, we used a Romanian tag set of 715 labels. Labels for diseases, anatomy, procedures, and chemicals and drugs were annotated manually for bioNER, with a Cohen's kappa coefficient of 92.8%, and revealed the occurrence of 1877 medical named entities. The automatic annotation of the corpus has been manually checked. The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.
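The inter-annotator agreement quoted above is Cohen's kappa, which corrects the raw agreement rate for the agreement two annotators would reach by chance. A minimal sketch of the standard two-annotator formula follows; the label names and the six-token example are invented for illustration and are not taken from MoNERo.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Invented toy annotations: "O" marks a token outside any entity.
annotator_1 = ["DISO", "ANAT", "O", "O", "CHEM", "O"]
annotator_2 = ["DISO", "ANAT", "O", "CHEM", "CHEM", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on five of six tokens, but the kappa value is lower than 5/6 because part of that agreement is expected by chance.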
