Zero-Shot Entity Recognition via Multi-Source Projection and Unlabeled Data

Although modern named entity recognition (NER) systems show impressive performance on standard datasets, they perform poorly when presented with noisy data. In particular, capitalization is a strong signal for entities in many languages, and even state of the art models overfit to this feature, with drastically lower performance on uncapitalized text. In this work, we address the problem of robustness of NER systems in data with noisy or uncertain casing, using a pretraining objective that predicts casing in text, or a truecaser, leveraging unlabeled data. The pretrained truecaser is combined with a standard BiLSTM-CRF model for NER by appending output distributions to character embeddings. In experiments over several datasets of varying domain and casing quality, we show that our new model improves performance in uncased text, even adding value to uncased BERT embeddings. Our method achieves a new state of the art on the WNUT17 shared task dataset.

Download Full-text

Named entity recognition in Bengali using system combination

Lingvisticae Investigationes ◽

10.1075/li.37.1.01ekb ◽

2014 ◽

Vol 37 (1) ◽

pp. 1-22 ◽

Cited By ~ 2

Author(s):

Asif Ekbal ◽

Sivaji Bandyopadhyay

Keyword(s):

Conditional Random Field ◽

Named Entity Recognition ◽

Machine Learning Algorithms ◽

Unlabeled Data ◽

Entity Recognition ◽

Supervised Machine Learning ◽

Support Vector ◽

System Combination ◽

Named Entity ◽

Part Of Speech

This paper reports a voted Named Entity Recognition (NER) system that exploits appropriate unlabeled data. Initially, we develop NER systems using the supervised machine learning algorithms such as Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM). Each of these models makes use of the language independent features in the form of different contextual and orthographic word-level features along with the language dependent features extracted from the Part-of-Speech (POS) tagger and gazetteers. Context patterns generated from the unlabeled data using an active learning method are also used as the features in each of the classifiers. A semi-supervised method is proposed to describe the measures to automatically select effective unlabeled documents as well as sentences from the unlabeled data. Finally, the supervised models are combined together into a final system by defining appropriate weighted voting technique. Experimental results for a resource-poor language like Bengali show the effectiveness of the proposed approach with the overall recall, precision and F-measure values of 93.81%, 92.18% and 92.98%, respectively.

Download Full-text

UniTrans : Unifying Model Transfer and Data Transfer for Cross-Lingual Named Entity Recognition with Unlabeled Data

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/543 ◽

2020 ◽

Author(s):

Qianhui Wu ◽

Zijia Lin ◽

Börje F. Karlsson ◽

Biqing Huang ◽

Jian-Guang Lou

Keyword(s):

Data Transfer ◽

Named Entity Recognition ◽

Unlabeled Data ◽

Entity Recognition ◽

Target Language ◽

Context Information ◽

Prior Work ◽

Named Entity ◽

Model Transfer ◽

Cross Lingual

Prior work in cross-lingual named entity recognition (NER) with no/little labeled data falls into two primary categories: model transfer- and data transfer-based methods. In this paper, we find that both method types can complement each other, in the sense that, the former can exploit context information via language-independent features but sees no task-specific information in the target language; while the latter generally generates pseudo target-language training data via translation but its exploitation of context information is weakened by inaccurate translations. Moreover, prior work rarely leverages unlabeled data in the target language, which can be effortlessly collected and potentially contains valuable information for improved results. To handle both problems, we propose a novel approach termed UniTrans to Unify both model and data Transfer for cross-lingual NER, and furthermore, leverage the available information from unlabeled target-language data via enhanced knowledge distillation. We evaluate our proposed UniTrans over 4 target languages on benchmark datasets. Our experimental results show that it substantially outperforms the existing state-of-the-art methods.

Download Full-text

Developing and Deploying Algorithms for Information Extraction using Classification Measures for Named Entity Recognition

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i10.235248 ◽

2018 ◽

Vol 6 (10) ◽

pp. 235-248

Author(s):

Rehan Khan ◽

A.J. Singh

Keyword(s):

Information Extraction ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity

Download Full-text

Arabic named entity recognition using optimized feature sets

10.3115/1613715.1613755 ◽

2008 ◽

Cited By ~ 38

Author(s):

Yassine Benajiba ◽

Mona Diab ◽

Paolo Rosso

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Feature Sets ◽

Named Entity

Download Full-text

Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition

10.3115/v1/p14-5003 ◽

2014 ◽

Cited By ~ 25

Author(s):

Jana Straková ◽

Milan Straka ◽

Jan Hajič

Keyword(s):

Open Source ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity ◽

Pos Tagging

Download Full-text

Semantische Suche nach wissenschaftlichen Videos – Automatische Verschlagwortung durch Named Entity Recognition

Zeitschrift für Bibliothekswesen und Bibliographie ◽

10.3196/18642950146145154 ◽

2014 ◽

Vol 61 (4-5) ◽

pp. 254-258 ◽

Cited By ~ 2

Author(s):

Margret Plank ◽

Sven Strobel

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity

Download Full-text

Improving Robustness of Speaker Recognition to New Conditions Using Unlabeled Data

10.21437/interspeech.2017-605 ◽

2017 ◽

Cited By ~ 1

Author(s):

Diego Castan ◽

Mitchell McLaren ◽

Luciana Ferrer ◽

Aaron Lawson ◽

Alicia Lozano-Diez

Keyword(s):

Speaker Recognition ◽

Unlabeled Data

Download Full-text

Developing a RadLex-based Named Entity Recognition Tool for Mining Textual Radiology Reports (Preprint)

10.2196/preprints.25378 ◽

2020 ◽

Author(s):

Shintaro Tsuji ◽

Andrew Wen ◽

Naoki Takahashi ◽

Hongjian Zhang ◽

Katsuhiko Ogasawara ◽

...

Keyword(s):

Named Entity Recognition ◽

Noun Phrases ◽

General Purpose ◽

Entity Recognition ◽

Free Text ◽

Clinical Text ◽

Named Entity ◽

Radiology Reports ◽

Two Measures ◽

F Measure

BACKGROUND Named entity recognition (NER) plays an important role in extracting the features of descriptions for mining free-text radiology reports. However, the performance of existing NER tools is limited because the number of entities depends on its dictionary lookup. Especially, the recognition of compound terms is very complicated because there are a variety of patterns. OBJECTIVE The objective of the study is to develop and evaluate a NER tool concerned with compound terms using the RadLex for mining free-text radiology reports. METHODS We leveraged the clinical Text Analysis and Knowledge Extraction System (cTAKES) to develop customized pipelines using both RadLex and SentiWordNet (a general-purpose dictionary, GPD). We manually annotated 400 of radiology reports for compound terms (Cts) in noun phrases and used them as the gold standard for the performance evaluation (precision, recall, and F-measure). Additionally, we also created a compound-term-enhanced dictionary (CtED) by analyzing false negatives (FNs) and false positives (FPs), and applied it for another 100 radiology reports for validation. We also evaluated the stem terms of compound terms, through defining two measures: an occurrence ratio (OR) and a matching ratio (MR). RESULTS The F-measure of the cTAKES+RadLex+GPD was 32.2% (Precision 92.1%, Recall 19.6%) and that of combined the CtED was 67.1% (Precision 98.1%, Recall 51.0%). The OR indicated that stem terms of “effusion”, "node", "tube", and "disease" were used frequently, but it still lacks capturing Cts. The MR showed that 71.9% of stem terms matched with that of ontologies and RadLex improved about 22% of the MR from the cTAKES default dictionary. The OR and MR revealed that the characteristics of stem terms would have the potential to help generate synonymous phrases using ontologies. CONCLUSIONS We developed a RadLex-based customized pipeline for parsing radiology reports and demonstrated that CtED and stem term analysis has the potential to improve dictionary-based NER performance toward expanding vocabularies.

Download Full-text

Improving Topic Coherence Using Entity Extraction Denoising

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2018-0004 ◽

2018 ◽

Vol 110 (1) ◽

pp. 85-101 ◽

Cited By ~ 1

Author(s):

Ronald Cardenas ◽

Kevin Bello ◽

Alberto Coronado ◽

Elizabeth Villota

Keyword(s):

Topic Modeling ◽

Human Perception ◽

Relevant Information ◽

Entity Recognition ◽

Entity Extraction ◽

Fine Grained ◽

Job Advertisement ◽

Coherence Score ◽

Probabilistic Topic Modeling ◽

Promising Solution

Abstract Managing large collections of documents is an important problem for many areas of science, industry, and culture. Probabilistic topic modeling offers a promising solution. Topic modeling is an unsupervised machine learning method and the evaluation of this model is an interesting problem on its own. Topic interpretability measures have been developed in recent years as a more natural option for topic quality evaluation, emulating human perception of coherence with word sets correlation scores. In this paper, we show experimental evidence of the improvement of topic coherence score by restricting the training corpus to that of relevant information in the document obtained by Entity Recognition. We experiment with job advertisement data and find that with this approach topic models improve interpretability in about 40 percentage points on average. Our analysis reveals as well that using the extracted text chunks, some redundant topics are joined while others are split into more skill-specific topics. Fine-grained topics observed in models using the whole text are preserved.

Download Full-text