A Statistical Language Model for Pre-Trained Sequence Labeling: A Case Study on Vietnamese

Author(s):  
Xianwen Liao ◽  
Yongzhong Huang ◽  
Peng Yang ◽  
Lei Chen

In this article, by defining a computable word segmentation unit and studying its probability characteristics, we establish an unsupervised statistical language model (SLM) for a new pre-trained sequence labeling framework. The proposed SLM is an optimization model whose objective is to maximize the total binding force of all candidate word segmentation units in a sentence, without annotated datasets or vocabularies. To solve it, we design a recursive divide-and-conquer dynamic programming algorithm. By integrating SLM with popular sequence labeling models, we perform Vietnamese word segmentation, part-of-speech tagging, and named entity recognition experiments. The results show that SLM effectively improves the performance of sequence labeling tasks. Using less than 10% of the training data and no dictionary, our sequence labeling framework outperforms the state-of-the-art Vietnamese word segmentation toolkit VnCoreNLP in cross-dataset tests. SLM has no hyper-parameters to tune, is completely unsupervised, and is applicable to any other analytic language, giving it good domain adaptability.
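
To make the optimization concrete, here is a minimal sketch of the recursive divide-and-conquer dynamic programming the abstract describes: split a syllable sequence into candidate units so that the total unit score is maximized. The paper's actual binding-force statistic is not given in the abstract, so binding_force below is a hypothetical stand-in.

```python
from functools import lru_cache

def binding_force(unit: tuple) -> float:
    """Hypothetical stand-in for the unit score; the real model derives
    this from unsupervised corpus statistics."""
    return float(len(unit) - 1)  # toy: prefer multi-syllable units

def segment(syllables: tuple, max_len: int = 4):
    """Return (best_score, best_segmentation) over all ways to split
    the syllable sequence into candidate units."""
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(syllables):
            return 0.0, ()
        candidates = []
        for j in range(i + 1, min(i + max_len, len(syllables)) + 1):
            unit = syllables[i:j]
            tail_score, tail_seg = best(j)
            candidates.append((binding_force(unit) + tail_score,
                               (unit,) + tail_seg))
        return max(candidates, key=lambda c: c[0])
    return best(0)

print(segment(("học", "sinh", "đi", "học")))
```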

Author(s):  
Minlong Peng ◽  
Qi Zhang ◽  
Xiaoyu Xing ◽  
Tao Gui ◽  
Jinlan Fu ◽  
...  

Word representation is a key component in neural-network-based sequence labeling systems. However, representations of unseen or rare words trained only on the end task are usually too poor to yield appreciable performance. This is commonly referred to as the out-of-vocabulary (OOV) problem. In this work, we address the OOV problem in sequence labeling using only the training data of the task. To this end, we propose a novel method to predict representations for OOV words from their surface forms (e.g., character sequences) and contexts. The method is specifically designed to avoid the error propagation problem suffered by existing approaches in the same paradigm. To evaluate its effectiveness, we performed extensive empirical studies on four part-of-speech tagging (POS) tasks and four named entity recognition (NER) tasks. Experimental results show that the proposed method achieves better or competitive performance on the OOV problem compared with existing state-of-the-art methods.
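
As an illustration of the paradigm the abstract describes, the sketch below predicts a word's representation from its surface form (character sequence) by regressing onto the pretrained embeddings of in-vocabulary words. The paper's actual architecture, and its use of context, is not specified in the abstract, so this char-LSTM regressor is only a plausible stand-in.

```python
import torch
import torch.nn as nn

class CharToVec(nn.Module):
    """Char-LSTM that maps a word's character sequence to a word vector."""
    def __init__(self, n_chars, char_dim=32, word_dim=100):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, word_dim, batch_first=True)

    def forward(self, char_ids):
        _, (h, _) = self.lstm(self.emb(char_ids))
        return h[-1]  # (batch, word_dim) predicted word vector

model = CharToVec(n_chars=128)
chars = torch.randint(0, 128, (4, 10))   # 4 words, 10 characters each
targets = torch.randn(4, 100)            # pretrained in-vocabulary vectors
loss = nn.functional.mse_loss(model(chars), targets)
loss.backward()                          # learn to mimic known embeddings
```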


2020 ◽  
Author(s):  
Usman Naseem ◽  
Matloob Khushi ◽  
Vinay Reddy ◽  
Sakthivel Rajendran ◽  
Imran Razzak ◽  
...  

Background: In recent years, with the growing volume of biomedical documents and advances in natural language processing algorithms, research on biomedical named entity recognition (BioNER) has increased exponentially. However, BioNER research is challenging because NER in the biomedical domain (i) is often restricted by the limited amount of training data, (ii) must handle entities that can refer to multiple types and concepts depending on their context, and (iii) relies heavily on acronyms that are sub-domain specific. Existing BioNER approaches often neglect these issues and directly adopt state-of-the-art (SOTA) models trained on general corpora, which often yields unsatisfactory results. Results: We propose biomedical ALBERT (A Lite Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), bioALBERT, an effective domain-specific pre-trained language model trained on a huge biomedical corpus and designed to capture context-dependent biomedical NER. We adopted the self-supervised loss function used in ALBERT, which targets modelling inter-sentence coherence, to better learn context-dependent representations, and incorporated parameter-reduction strategies to minimise memory usage and shorten training time for BioNER. In our experiments, bioALBERT outperformed comparative SOTA BioNER models on eight biomedical NER benchmark datasets covering four entity types. Performance increased for (i) disease-type corpora by 7.47% (NCBI-disease) and 10.63% (BC5CDR-disease); (ii) drug/chemical-type corpora by 4.61% (BC5CDR-Chem) and 3.89% (BC4CHEMD); (iii) gene/protein-type corpora by 12.25% (BC2GM) and 6.42% (JNLPBA); and (iv) species-type corpora by 6.19% (LINNAEUS) and 23.71% (Species-800), yielding state-of-the-art results. Conclusions: The performance of the proposed model on four different biomedical entity types shows that it is robust and generalizes well when recognizing biomedical entities in text. We trained four variants of bioALBERT, which are available to the research community for future research.
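
For readers who want to see the fine-tuning setup in code, the sketch below performs token classification with an ALBERT-style encoder via the Hugging Face transformers library. The generic albert-base-v2 checkpoint and the toy label set are placeholders; the released bioALBERT weights and hyper-parameters are not specified in the abstract.

```python
import torch
from transformers import AlbertTokenizerFast, AlbertForTokenClassification

labels = ["O", "B-Disease", "I-Disease"]          # toy tag set
tok = AlbertTokenizerFast.from_pretrained("albert-base-v2")
model = AlbertForTokenClassification.from_pretrained(
    "albert-base-v2", num_labels=len(labels))

# One pre-tokenized training sentence with all-"O" toy labels.
enc = tok(["aspirin", "induced", "asthma"], is_split_into_words=True,
          return_tensors="pt")
tags = torch.zeros_like(enc["input_ids"])          # label id 0 == "O"
out = model(**enc, labels=tags)
out.loss.backward()                                # one fine-tuning step
print(out.logits.shape)                            # (1, seq_len, num_labels)
```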


2017 ◽  
Vol 5 ◽  
pp. 247-261 ◽  
Author(s):  
Gábor Berend

In this paper we propose and carefully evaluate a sequence labeling framework which solely utilizes sparse indicator features derived from dense distributed word representations. The proposed model obtains (near) state-of-the-art performance for both part-of-speech tagging and named entity recognition across a variety of languages. Our model relies only on a few thousand sparse-coding-derived features, without any modification of the word representations employed for the different tasks. The proposed model has favorable generalization properties, as it retains over 89.8% of its average POS tagging accuracy when trained on only 1.2% of the total available training data, i.e., 150 sentences per language.
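
The feature pipeline the abstract describes can be sketched as follows: learn a sparse coding of dense word vectors, then turn the indices of the nonzero coefficients into binary indicator features for the tagger. The dictionary size and sparsity penalty below are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((500, 50))   # stand-in dense word vectors

# Learn an overcomplete dictionary and sparse-code every word vector.
coder = MiniBatchDictionaryLearning(n_components=128, alpha=0.5,
                                    transform_algorithm="lasso_lars",
                                    random_state=0)
codes = coder.fit_transform(embeddings)

def indicator_features(word_idx):
    """Binary indicator features: one string per active basis vector."""
    return [f"basis={k}" for k in np.flatnonzero(codes[word_idx])]

print(indicator_features(0))
```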


2020 ◽  
Vol 34 (05) ◽  
pp. 8401-8408 ◽  
Author(s):  
Shifeng Liu ◽  
Yifang Sun ◽  
Bing Li ◽  
Wei Wang ◽  
Xiang Zhao

To tackle Named Entity Recognition (NER) tasks, supervised methods need sufficient cleanly annotated data, which is labor- and time-consuming to obtain. In contrast, distantly supervised methods alleviate this requirement by acquiring automatically annotated data using dictionaries. Unfortunately, dictionaries hinder the effectiveness of distantly supervised NER due to their limited coverage, especially in specific domains. In this paper, we address the limitations of dictionary usage and mention boundary detection. We generalize distant supervision by extending the dictionary with headword-based non-exact matching, and apply a function to better weight the matched entity mentions. We propose a span-level model, which classifies all possible spans and then infers the selected spans with a proposed dynamic programming algorithm. Experiments on all three benchmark datasets demonstrate that our method outperforms previous state-of-the-art distantly supervised methods.
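
A minimal sketch of headword-based non-exact matching, as the abstract describes it: a candidate span is distantly labeled when its headword matches the headword of a dictionary entry, so "acute lung injury" can match a dictionary that only contains "lung injury". Taking the last token as the headword is a simplifying heuristic, not the paper's method for boundary detection or mention weighting.

```python
def headword(phrase):
    """Simplifying heuristic: treat the last token as the headword."""
    return phrase.split()[-1].lower()

def build_headword_index(dictionary):
    index = {}
    for entry, ent_type in dictionary:
        index.setdefault(headword(entry), set()).add(ent_type)
    return index

dictionary = [("lung injury", "Disease"), ("aspirin", "Chemical")]
index = build_headword_index(dictionary)

def match(span):
    """Candidate entity types for a span under non-exact matching."""
    return index.get(headword(span), set())

print(match("acute lung injury"))   # {'Disease'} despite no exact entry
```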


2015 ◽  
Vol 12 (2) ◽  
pp. 465-486
Author(s):  
Dejan Mancev ◽  
Branimir Todorovic

Structured learning algorithms usually require inference during the training procedure. Due to the exponential size of the output space, the parameter update is performed only on a relatively small collection built from the "best" structures. The k-best MIRA is an example of an online algorithm which seeks optimal parameters by making updates on the k highest-scoring structures at a time. Following the idea of using k-best structures during the learning process, in this paper we introduce four new k-best extensions of max-margin structured algorithms. We discuss their properties and connections, and evaluate all algorithms on two sequence labeling problems: shallow parsing and named entity recognition. The experiments show how the proposed algorithms are affected by changes of k in terms of F-measure and computational time, and that they can improve results over the single-best case. Moreover, restricting them to the single-best case yields a comparison of the existing algorithms.
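
To make the update concrete, the sketch below applies a k-best MIRA-style step. The full algorithm solves one quadratic program over all k margin constraints jointly; handling each constraint with a sequential passive-aggressive step, as done here, is a common simplification rather than the precise variants evaluated in the paper.

```python
import numpy as np

def kbest_mira_update(w, gold_feats, kbest, C=0.1):
    """One pass over the k highest-scoring structures under w; `kbest`
    holds (feature_vector, structured_loss) pairs for those structures."""
    for pred_feats, loss in kbest:
        diff = gold_feats - pred_feats
        violation = loss - w @ diff        # margin shortfall, if positive
        norm_sq = diff @ diff
        if violation > 0 and norm_sq > 0:
            tau = min(C, violation / norm_sq)  # smallest corrective step
            w = w + tau * diff
    return w

w = np.zeros(4)
gold = np.array([1.0, 0.0, 1.0, 0.0])
kbest = [(np.array([0.0, 1.0, 1.0, 0.0]), 1.0),
         (np.array([0.0, 0.0, 1.0, 1.0]), 2.0)]
print(kbest_mira_update(w, gold, kbest))
```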

