Deep Domain Adaptation for Low-Resource Cross-Lingual Text Classification Tasks

Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6500 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9547-9554

Author(s):

Mozhi Zhang ◽

Yoshinari Fujinuma ◽

Jordan Boyd-Graber

Keyword(s):

Knowledge Transfer ◽

Text Classification ◽

Document Classification ◽

Training Data ◽

Target Language ◽

Source Language ◽

Low Resource ◽

Classification Framework ◽

Related Language ◽

Cross Lingual

Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (caco) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. Experiments confirm that character-level knowledge transfer is more data-efficient than word-level transfer between related languages.

Download Full-text

Low-Resource Text Classification via Cross-Lingual Language Model Fine-Tuning

Lecture Notes in Computer Science - Chinese Computational Linguistics ◽

10.1007/978-3-030-63031-7_17 ◽

2020 ◽

pp. 231-246

Author(s):

Xiuhong Li ◽

Zhe Li ◽

Jiabao Sheng ◽

Wushour Slamu

Keyword(s):

Text Classification ◽

Language Model ◽

Fine Tuning ◽

Low Resource ◽

Cross Lingual

Download Full-text

Out-of-Domain Detection for Low-Resource Text Classification Tasks

10.18653/v1/d19-1364 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ming Tan ◽

Yang Yu ◽

Haoyu Wang ◽

Dakuo Wang ◽

Saloni Potdar ◽

...

Keyword(s):

Text Classification ◽

Low Resource ◽

Classification Tasks

Download Full-text

Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification

Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval ◽

10.1145/3397271.3401209 ◽

2020 ◽

Author(s):

Xin Dong ◽

Yaxin Zhu ◽

Yupeng Zhang ◽

Zuohui Fu ◽

Dongkuan Xu ◽

...

Keyword(s):

Text Classification ◽

Adversarial Training ◽

Cross Lingual ◽

Self Learning

Download Full-text

Cross-lingual Text Reuse Detection Using Translation Plus Monolingual Analysis for English-Urdu Language Pair

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3473331 ◽

2022 ◽

Vol 21 (2) ◽

pp. 1-18

Author(s):

Iqra Muneer ◽

Rao Muhammad Adeel Nawab

Keyword(s):

Probabilistic Approach ◽

Detailed Comparison ◽

Longest Common Subsequence ◽

Common Subsequence ◽

Classification Tasks ◽

N Gram ◽

Cross Lingual ◽

Recurrent Architecture ◽

Translation Systems ◽

Language Pair

Cross-Lingual Text Reuse Detection (CLTRD) has recently attracted the attention of the research community due to a large amount of digital text readily available for reuse in multiple languages through online digital repositories. In addition, efficient machine translation systems are freely and readily available to translate text from one language into another, which makes it quite easy to reuse text across languages, and consequently difficult to detect it. In the literature, the most prominent and widely used approach for CLTRD is Translation plus Monolingual Analysis (T+MA). To detect CLTR for English-Urdu language pair, T+MA has been used with lexical approaches, namely, N-gram Overlap, Longest Common Subsequence, and Greedy String Tiling. This clearly shows that T+MA has not been thoroughly explored for the English-Urdu language pair. To fulfill this gap, this study presents an in-depth and detailed comparison of 26 approaches that are based on T+MA. These approaches include semantic similarity approaches (semantic tagger based approaches, WordNet-based approaches), probabilistic approach (Kullback-Leibler distance approach), monolingual word embedding-based approaches siamese recurrent architecture, and monolingual sentence transformer-based approaches for English-Urdu language pair. The evaluation was carried out using the CLEU benchmark corpus, both for the binary and the ternary classification tasks. Our extensive experimentation shows that our proposed approach that is a combination of 26 approaches obtained an F 1 score of 0.77 and 0.61 for the binary and ternary classification tasks, respectively, and outperformed the previously reported approaches [ 41 ] ( F 1 = 0.73) for the binary and ( F 1 = 0.55) for the ternary classification tasks) on the CLEU corpus.

Download Full-text