Phrase Table Combination Based on Symmetrization of Word Alignment for Low-Resource Languages

2021 ◽  
Vol 11 (4) ◽  
pp. 1868
Author(s):  
Sari Dewi Budiwati ◽  
Al Hafiz Akbar Maulana Siagian ◽  
Tirana Noor Fatyanosa ◽  
Masayoshi Aritsugi

Phrase table combination in pivot approaches can be an effective method for dealing with low-resource language pairs. The common practice when generating phrase tables in pivot approaches is to use standard symmetrization, i.e., grow-diag-final-and. Although some researchers have found that non-standard symmetrization can improve bilingual evaluation understudy (BLEU) scores, it has not been commonly employed in pivot approaches. In this study, we propose a strategy that uses non-standard symmetrization of word alignment in phrase table combination. The appropriate symmetrization is selected based on the highest BLEU scores in each direct translation of source–target, source–pivot, and pivot–target for Kazakh–English (Kk–En) and Japanese–Indonesian (Ja–Id). Our experiments show that the proposed strategy outperforms direct translation in Kk–En with absolute improvements of 0.35 (an 11.3% relative improvement) and 0.22 (a 6.4% relative improvement) BLEU points for 3-gram and 5-gram, respectively. For 3-gram in Ja–Id, the proposed strategy shows an absolute gain of up to 0.11 BLEU points (a 0.9% relative improvement) over direct translation. Our proposed strategy using a small phrase table obtains better BLEU scores than a strategy using a large phrase table. The size of the target monolingual data and the feature function weight of the language model (LM) could reduce perplexity scores.
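As a rough illustration of the selection step described above, the following minimal sketch picks, for each translation direction, the symmetrization heuristic whose phrase table gives the highest development-set BLEU score. The heuristic names follow Moses conventions; train_and_score is a hypothetical placeholder for an actual alignment, phrase-extraction, and evaluation run, not part of the paper's code.

```python
# A minimal sketch, assuming a Moses-style pipeline, of the selection step:
# for each translation direction, pick the symmetrization heuristic whose
# phrase table yields the highest BLEU score on a development set.
# train_and_score() is a hypothetical placeholder, not the paper's code.

SYMMETRIZATION_HEURISTICS = [
    "grow-diag-final-and",  # the standard choice
    "grow-diag-final",
    "grow-diag",
    "union",
    "intersection",
]

def train_and_score(src, tgt, heuristic):
    """Placeholder: run word alignment, symmetrize with `heuristic`,
    extract a phrase table, tune, and return the dev-set BLEU score
    (e.g., computed with sacrebleu). Returns a stub value here."""
    return 0.0  # replace with an actual training/evaluation run

def best_symmetrization(src, tgt):
    scores = {h: train_and_score(src, tgt, h) for h in SYMMETRIZATION_HEURISTICS}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Example: the three directions needed for a pivot setup such as Ja-En-Id.
for src, tgt in [("ja", "id"), ("ja", "en"), ("en", "id")]:
    heuristic, bleu = best_symmetrization(src, tgt)
    print(f"{src}-{tgt}: best heuristic = {heuristic} (BLEU {bleu:.2f})")
```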

2021 ◽  
Vol 7 ◽  
pp. e816
Author(s):  
Heng-yang Lu ◽  
Jun Yang ◽  
Cong Hu ◽  
Wei Fang

Background: Fine-grained sentiment analysis is used to interpret consumers' sentiments, from their written comments, towards specific entities on specific aspects. Previous researchers have introduced three main tasks in this field (ABSA, TABSA, MEABSA), covering various kinds of social media data (e.g., review-specific, question-and-answer, and community-based). In this paper, we identify and address two common challenges encountered in these three tasks: the low-resource problem and sentiment polarity bias.
Methods: We propose a unified model called PEA that integrates data augmentation methodology with a pre-trained language model and is suitable for all of the ABSA, TABSA and MEABSA tasks. Two data augmentation methods, entity replacement and dual noise injection, are introduced to address both challenges at the same time. An ensemble method is also introduced to combine the results of the basic RNN-based and BERT-based models.
Results: PEA shows significant improvements on all three fine-grained sentiment analysis tasks compared with state-of-the-art models. It also achieves results comparable to the baseline models while using only 20% of their training data, which demonstrates its extraordinary performance under extreme low-resource conditions.
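To make the first augmentation method concrete, here is a minimal, hypothetical sketch of entity replacement: the target entity in a labelled comment is swapped with another entity of the same type, so the existing aspect and sentiment labels can be reused for a new training example. The entity lexicon and example sentence are illustrative assumptions, not the paper's data or implementation.

```python
import random

# A minimal sketch of entity-replacement data augmentation. The entity
# lexicon and example sentence are illustrative assumptions only.
ENTITY_LEXICON = {
    "RESTAURANT": ["Bella Napoli", "Golden Dragon", "The Corner Cafe"],
}

def replace_entity(text, entity, entity_type, rng=random):
    """Swap the target entity for another of the same type, producing an
    augmented example that keeps the original aspect/sentiment labels."""
    candidates = [e for e in ENTITY_LEXICON[entity_type] if e != entity]
    new_entity = rng.choice(candidates)
    return text.replace(entity, new_entity), new_entity

augmented, new_entity = replace_entity(
    "The pasta at Bella Napoli was excellent but the service was slow.",
    "Bella Napoli",
    "RESTAURANT",
)
print(augmented)  # same aspect labels now attached to the new entity
```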


Author(s):  
Laura K. Murray ◽  
Emily E. Haroz ◽  
Michael D. Pullmann ◽  
Shannon Dorsey ◽  
Jeremy Kane ◽  
...  

The use of transdiagnostic mental health treatments in low-resource settings has been proposed as a possible aid in scaling up mental health services. Modular, multi-problem transdiagnostic treatments can be used to treat a range of mental health problems and are designed to handle comorbidity. Two randomized controlled trials have been completed on one such treatment, the Common Elements Treatment Approach (CETA), delivered by lay counsellors in Iraq and Thailand. This paper uses data from these two clinical trials to explore the delivery of CETA by lay providers, examining the fidelity and flexibility of element use. Data were collected at every therapy session: clients completed a short symptom assessment, and providers described the clinical elements delivered during the session. Analyses included descriptive statistics of delivery, including the selection and sequencing of treatment elements, and the variance in element dose, with clustering at the counsellor level, using multi-level models. Results indicate that lay providers in low-resource settings (with supervision) demonstrated fidelity to the recommended CETA elements, order, and dose, and occasionally added elements and adjusted dosage based on client presentation (i.e., flexibility). This modular approach did not result in significantly longer treatment duration. Our analysis suggests that lay providers were able to learn the decision-making processes of CETA based on client presentation and to adjust treatment as needed with supervision. As modular, multi-problem transdiagnostic treatments continue to be explored in low-resource settings, research should continue to focus on 'unpacking' lay counsellor delivery of these interventions, the decision-making processes involved, and the level of supervision required.
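For readers unfamiliar with the multi-level modelling mentioned above, the snippet below sketches the general shape of such an analysis: element dose modelled with a random intercept for counsellor, capturing between-counsellor variance. The column names and the toy data frame are assumptions for illustration only, not the study's actual data or code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy session-level data: the dose (number of sessions) spent on an element,
# with sessions clustered within counsellors. All values are made up.
df = pd.DataFrame({
    "dose":       [2, 3, 1, 4, 2, 3, 1, 2, 3, 2, 4, 1],
    "element":    ["exposure", "exposure", "relaxation", "exposure",
                   "relaxation", "exposure", "relaxation", "exposure",
                   "exposure", "relaxation", "exposure", "relaxation"],
    "counsellor": ["c1", "c1", "c1", "c1", "c2", "c2", "c2", "c2",
                   "c3", "c3", "c3", "c3"],
})

# Random intercept for counsellor captures between-counsellor variance in dose.
model = smf.mixedlm("dose ~ element", df, groups=df["counsellor"])
result = model.fit()
print(result.summary())
```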


2021 ◽  
Vol 11 (5) ◽  
pp. 1974 ◽  
Author(s):  
Chanhee Lee ◽  
Kisu Yang ◽  
Taesun Whang ◽  
Chanjun Park ◽  
Andrew Matteson ◽  
...  

Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Although language modeling is unsupervised, and collecting data for it is therefore relatively inexpensive, it remains a challenging process for languages with limited resources. This results in a large technological disparity between high- and low-resource languages for numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data-efficient training of pretrained language models. This is achieved by formulating language modeling of low-resource languages as a domain adaptation task using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at a sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves data efficiency by a factor of up to 32 compared to monolingual training.
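As a rough illustration of the parameter-reuse idea (not the paper's exact procedure, and omitting the implicit translation layers), the sketch below loads an English-pretrained RoBERTa, replaces its vocabulary-specific embeddings for a hypothetical Korean vocabulary, and marks only the new language-specific parameters as trainable while the reused Transformer body stays frozen. The vocabulary size and the choice of which parameters to freeze are assumptions.

```python
from transformers import RobertaForMaskedLM

# Reuse the body of an English-pretrained model as the starting point.
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Assume a tokenizer/vocabulary has already been trained for the target
# language; 32,000 is a hypothetical Korean vocabulary size.
new_vocab_size = 32000
model.resize_token_embeddings(new_vocab_size)

# Re-initialize the word embeddings so they are learned from scratch for the
# new language (the LM head decoder shares these weights in RoBERTa).
model.roberta.embeddings.word_embeddings.weight.data.normal_(mean=0.0, std=0.02)

# Freeze the reused encoder body; update only embeddings and the LM head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("roberta.embeddings", "lm_head"))

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```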


Author(s):  
Szymon Roziewski ◽  
Marek Kozłowski

The exponential growth of the internet community has resulted in the production of a vast amount of unstructured data, including web pages, blogs and social media. Such a volume, consisting of hundreds of billions of words, cannot realistically be analyzed by humans. In this work we introduce the tool LanguageCrawl, which allows Natural Language Processing (NLP) researchers to easily build web-scale corpora using the Common Crawl Archive, an open repository of web crawl information that contains petabytes of data. We present three use cases in the course of this work: filtering of Polish websites, the construction of n-gram corpora, and the training of a continuous skipgram language model with hierarchical softmax. Each of them has been implemented within the LanguageCrawl toolkit, with the possibility of adjusting the specified language and n-gram ranks. This paper focuses particularly on high computing efficiency, achieved through highly concurrent multitasking and the use of effective libraries and design. LanguageCrawl has been made publicly available to enrich the current set of NLP resources. We strongly believe that our work will facilitate further NLP research, especially in under-resourced languages, for which the lack of appropriately sized corpora is a serious hindrance to applying data-intensive methods such as deep neural networks.
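To illustrate the third use case, here is a minimal sketch of training a continuous skipgram model with hierarchical softmax using gensim (version 4.x argument names assumed). The toy sentences and hyperparameters are placeholders; in practice the input would be the tokenized Polish corpus filtered from Common Crawl.

```python
from gensim.models import Word2Vec

# Toy sentences standing in for the filtered, tokenized Common Crawl corpus.
sentences = [
    ["ala", "ma", "kota"],
    ["kot", "ma", "ale"],
    ["ala", "lubi", "koty"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality
    window=5,
    min_count=1,
    sg=1,             # continuous skipgram rather than CBOW
    hs=1,             # hierarchical softmax
    negative=0,       # disable negative sampling when using hierarchical softmax
)
print(model.wv.most_similar("ala", topn=2))
```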


Author(s):  
Francis Zheng ◽  
Machel Reid ◽  
Edison Marrese-Taylor ◽  
Yutaka Matsuo

2019 ◽  
Author(s):  
Astik Biswas ◽  
Raghav Menon ◽  
Ewald van der Westhuizen ◽  
Thomas Niesler
