Sparse Coding of Neural Word Embeddings for Multilingual Sequence Labeling

2017 · Vol. 5 · pp. 247–261
Author(s):  
Gábor Berend

In this paper we propose and carefully evaluate a sequence labeling framework which solely utilizes sparse indicator features derived from dense distributed word representations. The proposed model obtains (near) state-of-the-art performance for both part-of-speech tagging and named entity recognition for a variety of languages. Our model relies only on a few thousand sparse coding-derived features, without applying any modification to the word representations employed for the different tasks. The proposed model has favorable generalization properties, as it retains over 89.8% of its average POS tagging accuracy when trained on only 1.2% of the total available training data, i.e., 150 sentences per language.
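
As a rough illustration of the recipe this abstract describes, the sketch below uses scikit-learn's dictionary learning to turn dense embeddings into sparse codes whose nonzero coordinates become symbolic indicator features. The dictionary size, sparsity penalty, and feature naming are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: derive sparse indicator features from dense word embeddings
# via dictionary learning (hyperparameters are illustrative assumptions).
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 100))  # stand-in for pretrained vectors

# Learn an overcomplete dictionary and sparse codes (alpha controls sparsity).
dl = DictionaryLearning(n_components=256, transform_algorithm="lasso_lars",
                        transform_alpha=0.5, random_state=0)
codes = dl.fit_transform(embeddings)           # (n_words, 256), mostly zeros

def indicator_features(word_idx):
    """Sparse indicator features: one symbolic feature per active atom."""
    active = np.flatnonzero(codes[word_idx])
    return [f"dict_atom={i}" for i in active]

print(indicator_features(0))                   # a handful of active atoms
```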

Author(s):  
Minlong Peng ◽  
Qi Zhang ◽  
Xiaoyu Xing ◽  
Tao Gui ◽  
Jinlan Fu ◽  
...  

Word representation is a key component in neural-network-based sequence labeling systems. However, the representations of unseen or rare words trained on the end task are usually too poor for appreciable performance. This is commonly referred to as the out-of-vocabulary (OOV) problem. In this work, we address the OOV problem in sequence labeling using only the training data of the task. To this end, we propose a novel method to predict representations for OOV words from their surface forms (e.g., character sequences) and contexts. The method is specifically designed to avoid the error propagation problem suffered by existing approaches in the same paradigm. To evaluate its effectiveness, we performed extensive empirical studies on four part-of-speech (POS) tagging tasks and four named entity recognition (NER) tasks. Experimental results show that the proposed method achieves better or competitive performance on the OOV problem compared with existing state-of-the-art methods.
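
The general paradigm of predicting an embedding for an unseen word from its surface form can be sketched as follows, a minimal regression from character sequences toward known in-vocabulary embeddings. The architecture and loss are illustrative assumptions; the paper's actual method also uses contexts and is designed to avoid the error propagation this naive setup can suffer from.

```python
# Sketch: predict a word embedding from its character sequence, trained to
# regress toward known in-vocabulary embeddings (illustrative assumptions).
import torch
import torch.nn as nn

class CharToEmbedding(nn.Module):
    def __init__(self, n_chars, char_dim=32, hidden=64, emb_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.encoder = nn.LSTM(char_dim, hidden,
                               bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, char_ids):                 # (batch, max_word_len)
        h, _ = self.encoder(self.char_emb(char_ids))
        return self.proj(h.mean(dim=1))          # pooled -> predicted embedding

model = CharToEmbedding(n_chars=128)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Train on in-vocabulary (characters, gold embedding) pairs; at test time
# the same model produces vectors for OOV words.
char_ids = torch.randint(0, 128, (16, 12))       # dummy batch
gold = torch.randn(16, 100)
loss = loss_fn(model(char_ids), gold)
opt.zero_grad(); loss.backward(); opt.step()
```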


Author(s):  
M. Bevza

We analyze neural network architectures that yield state-of-the-art results on the named entity recognition task and propose a number of new architectures for improving those results further. We have analyzed a number of ideas and approaches that researchers have used to achieve state-of-the-art results in a variety of NLP tasks. In this work, we present the architectures that we consider most likely to improve the existing state-of-the-art solutions for the named entity recognition (NER) and part-of-speech (POS) tagging tasks. The architectures are inspired by recent developments in multi-task learning. This work tests the hypothesis that NER and POS tagging are related tasks, so that adding POS tag information as input to the network can help achieve better NER results and, vice versa, NER tag information can help with POS tagging. The work also contains the implementation of the network and the results of the experiments, together with conclusions and future work.
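
A minimal sketch of the multi-task idea being tested: a shared BiLSTM encoder with separate POS and NER heads, so supervision from each task shapes a common representation. Hard parameter sharing with a joint loss is one common variant; the dimensions below are assumptions, and the paper also considers feeding one task's tags directly as input to the other.

```python
# Sketch: hard parameter sharing between POS tagging and NER (illustrative).
import torch
import torch.nn as nn

class SharedTagger(nn.Module):
    def __init__(self, vocab, emb=100, hidden=128, n_pos=17, n_ner=9):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hidden,
                               bidirectional=True, batch_first=True)
        self.pos_head = nn.Linear(2 * hidden, n_pos)   # task-specific heads
        self.ner_head = nn.Linear(2 * hidden, n_ner)

    def forward(self, token_ids):
        h, _ = self.encoder(self.emb(token_ids))       # shared representation
        return self.pos_head(h), self.ner_head(h)

model = SharedTagger(vocab=20000)
ce = nn.CrossEntropyLoss()
tokens = torch.randint(0, 20000, (8, 25))
pos_gold = torch.randint(0, 17, (8, 25))
ner_gold = torch.randint(0, 9, (8, 25))

pos_logits, ner_logits = model(tokens)
# Joint objective: gradients from both tasks update the shared encoder.
loss = ce(pos_logits.reshape(-1, 17), pos_gold.reshape(-1)) \
     + ce(ner_logits.reshape(-1, 9), ner_gold.reshape(-1))
loss.backward()
```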


2021 · Vol. 9 · pp. 410–428
Author(s):  
Edoardo M. Ponti ◽  
Ivan Vulić ◽  
Ryan Cotterell ◽  
Marinela Parovic ◽  
Roi Reichart ◽  
...  

Most combinations of NLP tasks and language varieties lack in-domain examples for supervised training because of the paucity of annotated data. How can neural models make sample-efficient generalizations from task–language combinations with available data to low-resource ones? In this work, we propose a Bayesian generative model for the space of neural parameters. We assume that this space can be factorized into latent variables for each language and each task. We infer the posteriors over such latent variables based on data from seen task–language combinations through variational inference. This enables zero-shot classification on unseen combinations at prediction time. For instance, given training data for named entity recognition (NER) in Vietnamese and for part-of-speech (POS) tagging in Wolof, our model can perform accurate predictions for NER in Wolof. In particular, we experiment with a typologically diverse sample of 33 languages from 4 continents and 11 families, and show that our model yields comparable or better results than state-of-the-art zero-shot cross-lingual transfer methods. Our code is available at github.com/cambridgeltl/parameter-factorization.
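
A point-estimate caricature of the factorization: parameters for a task–language pair are generated from separate task and language latent vectors, so an unseen pair such as NER in Wolof can be composed from latents learned on seen pairs. The sizes and generator network below are assumptions, and the sketch omits the paper's variational inference over the latents.

```python
# Sketch: compose classifier parameters for a (task, language) pair from
# separate latent vectors (point-estimate version; the paper instead infers
# posteriors over these latents with variational inference).
import torch
import torch.nn as nn

N_TASKS, N_LANGS, LATENT, FEAT, N_CLASSES = 2, 33, 64, 300, 10

task_z = nn.Embedding(N_TASKS, LATENT)          # e.g., 0 = POS, 1 = NER
lang_z = nn.Embedding(N_LANGS, LATENT)          # one latent per language
# Generator maps [task; language] latents to a flat classifier weight vector.
generator = nn.Linear(2 * LATENT, FEAT * N_CLASSES)

def classifier_weights(task_id, lang_id):
    z = torch.cat([task_z.weight[task_id], lang_z.weight[lang_id]])
    return generator(z).view(N_CLASSES, FEAT)

# Zero-shot: NER (task 1) in a language seen only with POS training data.
W = classifier_weights(task_id=1, lang_id=5)
features = torch.randn(4, FEAT)                 # encoder output for 4 tokens
logits = features @ W.T                         # (4, N_CLASSES)
```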


Author(s):  
Yuan Zhang ◽  
Hongshen Chen ◽  
Yihong Zhao ◽  
Qun Liu ◽  
Dawei Yin

Sequence tagging is the basis for multiple applications in natural language processing. Despite successes in learning long-term token sequence dependencies with neural networks, tag dependencies have rarely been considered previously. Sequence tagging actually involves complex dependencies and interactions among the input tokens and the output tags. We propose a novel multi-channel model which handles different ranges of token–tag dependencies and their interactions simultaneously. A tag LSTM is augmented to manage the output tag dependencies and word–tag interactions, while three mechanisms are presented to efficiently incorporate token context representations and tag dependencies. Extensive experiments on part-of-speech tagging and named entity recognition tasks show that the proposed model outperforms the BiLSTM-CRF baseline by effectively incorporating the tag dependency feature.
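
One way to make the tag-LSTM idea concrete is the greedy decoder sketched below, in which an LSTM over the embeddings of previously predicted tags is combined with each token's representation before the next tag decision. This is an illustrative reduction; the paper's three multi-channel mechanisms are richer than this single feedback loop.

```python
# Sketch: augment a token BiLSTM with a "tag LSTM" over previously predicted
# tags (greedy decoding; illustrative, not the paper's exact model).
import torch
import torch.nn as nn

class TagLSTMTagger(nn.Module):
    def __init__(self, vocab=20000, n_tags=17, emb=100, hidden=128, tag_dim=32):
        super().__init__()
        self.hidden = hidden
        self.n_tags = n_tags
        self.word_emb = nn.Embedding(vocab, emb)
        self.token_lstm = nn.LSTM(emb, hidden,
                                  bidirectional=True, batch_first=True)
        self.tag_emb = nn.Embedding(n_tags + 1, tag_dim)  # +1: start symbol
        self.tag_lstm = nn.LSTMCell(tag_dim, hidden)      # runs over output tags
        self.out = nn.Linear(2 * hidden + hidden, n_tags)

    def forward(self, token_ids):
        h_tok, _ = self.token_lstm(self.word_emb(token_ids))    # (B, T, 2H)
        B, T, _ = h_tok.shape
        h = h_tok.new_zeros(B, self.hidden)
        c = h_tok.new_zeros(B, self.hidden)
        prev = torch.full((B,), self.n_tags, dtype=torch.long)  # start symbol
        preds = []
        for t in range(T):
            h, c = self.tag_lstm(self.tag_emb(prev), (h, c))    # tag history
            logits = self.out(torch.cat([h_tok[:, t], h], dim=-1))
            prev = logits.argmax(dim=-1)                        # greedy choice
            preds.append(prev)
        return torch.stack(preds, dim=1)                        # (B, T) tags

tags = TagLSTMTagger()(torch.randint(0, 20000, (4, 10)))
```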


2021 · Vol. 4 (1) · p. 38
Author(s):  
Joan Santoso ◽  
Esther Irawati Setiawan ◽  
Christian Nathaniel Purwanto ◽  
Fachrul Kurniawan

Detecting sentence boundaries is one of the crucial pre-processing steps in natural language processing: it determines where one sentence ends and the next begins, a border that is often ambiguous. Because there are multiple possible separators and varied sentence patterns, relying on a full stop at the end of a sentence is sometimes inappropriate. This research uses a deep learning approach to split each sentence from an Indonesian news document, so there is no need to define any handcrafted features or rules. As in part-of-speech tagging and named entity recognition, we cast sentence boundary detection as sequence labeling. Two labels are used: O for non-boundary tokens and E for the last token of a sentence. To do this, we use the Bi-LSTM approach, which has been widely used in sequence labeling. We show that our approach works for Indonesian text using pre-trained Indonesian embeddings, as in previous studies. This study achieves an F1 score of 98.49 percent, which represents a significant improvement over previous studies.
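
The labeling scheme reduces to per-token two-way classification over a Bi-LSTM, as in the sketch below; the vocabulary size, dimensions, and randomly initialized (rather than pretrained Indonesian) embeddings are illustrative assumptions.

```python
# Sketch: sentence boundary detection as sequence labeling with labels
# O (non-boundary) and E (last token of a sentence); illustrative sizes.
import torch
import torch.nn as nn

O, E = 0, 1

class BoundaryTagger(nn.Module):
    def __init__(self, vocab=30000, emb=100, hidden=128):
        super().__init__()
        # In the paper, pretrained Indonesian embeddings would be loaded here.
        self.emb = nn.Embedding(vocab, emb)
        self.bilstm = nn.LSTM(emb, hidden,
                              bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 2)     # two labels: O, E

    def forward(self, token_ids):
        h, _ = self.bilstm(self.emb(token_ids))
        return self.out(h)                      # (batch, seq_len, 2)

model = BoundaryTagger()
logits = model(torch.randint(0, 30000, (2, 40)))
labels = logits.argmax(dim=-1)                  # E marks sentence-final tokens
# The document is split after every token labeled E.
```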

