Low-Resource Text Classification Using Domain-Adversarial Learning

Author(s):  
Daniel Grießhaber ◽  
Ngoc Thang Vu ◽  
Johannes Maucher
2020 ◽  
Vol 62 ◽  
pp. 101056 ◽  

2020 ◽  
Author(s):  
Shuning Jin ◽  
Sam Wiseman ◽  
Karl Stratos ◽  
Karen Livescu

2021 ◽  
Author(s):  
Wilson Wongso ◽  
Henry Lucky ◽  
Derwin Suhartono

Abstract The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefit from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In subsequent analyses, our models benefited strongly from the size of the Sundanese pre-training corpus and did not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.


Information ◽  
2019 ◽  
Vol 10 (12) ◽  
pp. 387 ◽  
Author(s):  
Sardar Parhat ◽  
Mijit Ablimit ◽  
Askar Hamdulla

In this paper, based on a multilingual morphological analyzer, we study short text classification for Uyghur and Kazakh, two similar low-resource languages. The online linguistic resources of these languages are generally noisy, so preprocessing is necessary and can significantly improve accuracy. Uyghur and Kazakh are languages with derivational morphology, in which words are coined by stems concatenated with suffixes. Usually, terms are used to represent text content, while functional parts are excluded as stop words. By extracting stems we can collect the necessary terms and exclude the stop words. The morpheme segmentation tool can split text into morphemes with high (95%) reliability. After preparing both word- and morpheme-based training corpora, we apply a convolutional neural network (CNN) for feature selection and text classification. Experimental results show that the morpheme-based approach outperformed the word-based approach. Word embedding, which is frequently used for text representation both within neural network frameworks and as a standalone value expression, maps language units into a vector space based on context and offers a natural way to extract and predict out-of-vocabulary (OOV) words from context information. Multilingual morphological analysis thus provides a convenient way to process low-resource languages like Uyghur and Kazakh.
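The stem-extraction preprocessing described above can be sketched in a few lines. The suffix list and stop-word set below are toy placeholders invented for illustration, not the paper's actual morphological analyzer, which is statistical and far more sophisticated:

```python
# Toy sketch of morpheme-based preprocessing: greedily strip a
# (hypothetical, greatly simplified) suffix list to recover stems,
# then drop functional stop words before classification.
SUFFIXES = ["lar", "ler", "ni", "ning", "da", "de"]  # toy examples
STOP_STEMS = {"we", "bilen"}                         # toy examples

def stem(word):
    """Strip known suffixes from the end of a word, repeatedly."""
    changed = True
    while changed:
        changed = False
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) > len(suf) + 2:
                word = word[: -len(suf)]
                changed = True
    return word

def preprocess(tokens):
    """Map tokens to stems and exclude stop words."""
    stems = [stem(t) for t in tokens]
    return [s for s in stems if s not in STOP_STEMS]

print(preprocess(["kitablar", "bilen"]))  # → ['kitab']
```

A real analyzer would handle vowel harmony and ambiguous segmentations; the greedy loop here only conveys the idea of reducing derivational surface forms to shared stems.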


Information ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 52
Author(s):  
Awet Fesseha ◽  
Shengwu Xiong ◽  
Eshete Derb Emiru ◽  
Moussa Diallo ◽  
Abdelghani Dahou

This article studies convolutional neural networks for Tigrinya (also referred to as Tigrigna), a Semitic language spoken in Eritrea and northern Ethiopia. Tigrinya is a “low-resource” language, notable for the absence of comprehensive, freely available data. Furthermore, like other Semitic languages, it is characterized as one of the most semantically and syntactically complex languages in the world. To the best of our knowledge, no previous research has been conducted on the state-of-the-art embedding techniques shown here. We investigate which word representation methods perform better for single-label text classification problems, which are common when dealing with morphologically rich and complex languages. Two datasets are used here: a manually annotated one containing 30,000 Tigrinya news texts from various sources with the six categories “sport”, “agriculture”, “politics”, “religion”, “education”, and “health”, and an unannotated corpus containing more than six million words. In this paper, we explore pretrained word embedding architectures using various convolutional neural networks (CNNs) to predict class labels. We construct CNNs with the continuous bag-of-words (CBOW) and skip-gram methods, with and without word2vec and FastText embeddings, to evaluate Tigrinya news articles. We also compare the CNN results with traditional machine learning models and evaluate the results in terms of accuracy, precision, recall, and F1 score. The CBOW CNN with word2vec achieves the best accuracy at 93.41%, significantly improving the accuracy of Tigrinya news classification.
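The CBOW objective the best model relies on predicts each word from its surrounding context window. A minimal sketch of how those (context, target) training pairs are formed, with an invented toy sentence and window size:

```python
# Sketch of CBOW training-pair construction: each word becomes a
# prediction target for its surrounding context window. The sentence
# and window size here are illustrative only.
def cbow_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        context = (tokens[max(0, i - window): i]
                   + tokens[i + 1: i + 1 + window])
        pairs.append((context, target))
    return pairs

sent = ["news", "about", "sport", "in", "tigrinya"]
for ctx, tgt in cbow_pairs(sent):
    print(ctx, "->", tgt)
```

The skip-gram variant simply reverses the direction, predicting each context word from the target; the embeddings learned either way then feed the CNN classifier.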


2020 ◽  
Vol 34 (05) ◽  
pp. 9547-9554
Author(s):  
Mozhi Zhang ◽  
Yoshinari Fujinuma ◽  
Jordan Boyd-Graber

Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (CACO) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. Experiments confirm that character-level knowledge transfer is more data-efficient than word-level transfer between related languages.
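The intuition the framework exploits is that words with similar written forms across related languages share character subsequences, so a character-based embedder maps them to nearby vectors. Jaccard overlap of character trigrams, shown below with invented cognate examples, is a crude stand-in for that similarity, not the model's actual embedder:

```python
# Sketch of cross-lingual subword similarity: cognates in related
# languages share many character n-grams, while unrelated words
# share almost none. Trigram Jaccard overlap illustrates the signal
# a character-based embedder can pick up.
def char_ngrams(word, n=3):
    padded = f"<{word}>"  # boundary markers, as in FastText-style models
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def similarity(w1, w2, n=3):
    a, b = char_ngrams(w1, n), char_ngrams(w2, n)
    return len(a & b) / len(a | b)

# Spanish/Italian cognates overlap far more than unrelated words.
print(similarity("nacional", "nazionale"))
print(similarity("nacional", "perro"))
```

In the actual framework this signal is learned end-to-end, jointly with the word-based classifier, rather than computed as a fixed set overlap.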


Author(s):  
Stuti Mehta ◽  
Suman K. Mitra

Text classification is an extremely important area of Natural Language Processing (NLP). This paper studies various methods for embedding and classification in the Gujarati language. The dataset comprises Gujarati news headlines classified into various categories. Different embedding methods for Gujarati and various classifiers are used to assign the headlines to the given categories. Gujarati is a low-resource language that is not commonly worked on. This paper addresses one of the most important NLP tasks, classification, and in doing so surveys various embedding techniques for Gujarati, since these provide the feature extraction step for classification. The paper first performs embedding to obtain a valid representation of the textual data and then applies existing robust classifiers to the embedded data. Additionally, the paper provides insight into how various NLP tasks can be performed on a low-resource language like Gujarati. Finally, it carries out a comparative analysis of the performance of the various existing embedding and classification methods to determine which combination gives the better outcome.
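The embed-then-classify pipeline described above can be sketched end to end with the simplest possible choices: bag-of-words count vectors as the embedding and a nearest-centroid rule as the classifier. The toy English headlines and labels below are invented for illustration; the paper uses real Gujarati headlines, trained embeddings, and stronger classifiers:

```python
# Minimal sketch of the headline-classification pipeline: embed each
# headline as a bag-of-words vector, average per-class vectors into
# centroids, and assign new headlines to the closest centroid.
from collections import Counter

train = [("cricket match won", "sports"),
         ("election results announced", "politics"),
         ("team scores runs", "sports")]

def embed(text):
    """Bag-of-words embedding: word -> count."""
    return Counter(text.split())

def centroid(vectors):
    """Sum of a class's embedding vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

centroids = {}
for label in {y for _, y in train}:
    centroids[label] = centroid(
        [embed(x) for x, y in train if y == label])

def classify(text):
    """Pick the class whose centroid shares the most word mass."""
    v = embed(text)
    def overlap(c):
        return sum(min(v[w], c[w]) for w in v)
    return max(centroids, key=lambda lab: overlap(centroids[lab]))

print(classify("cricket team won"))  # → sports
```

Swapping the `embed` function for a trained word-embedding average, and the centroid rule for an off-the-shelf classifier, recovers the structure of the experiments the paper compares.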

