Phrase Table Combination Based on Symmetrization of Word Alignment for Low-Resource Languages

2021 ◽  
Vol 11 (4) ◽  
pp. 1868
Author(s):  
Sari Dewi Budiwati ◽  
Al Hafiz Akbar Maulana Siagian ◽  
Tirana Noor Fatyanosa ◽  
Masayoshi Aritsugi

Phrase table combination in pivot approaches can be an effective method for dealing with low-resource language pairs. The common practice when generating phrase tables in pivot approaches is to use standard symmetrization, i.e., grow-diag-final-and. Although some researchers have found that non-standard symmetrization can improve bilingual evaluation understudy (BLEU) scores, it has not been commonly employed in pivot approaches. In this study, we propose a strategy that uses non-standard symmetrization of word alignment in phrase table combination. The appropriate symmetrization is selected based on the highest BLEU scores in each direct translation of source–target, source–pivot, and pivot–target for Kazakh–English (Kk–En) and Japanese–Indonesian (Ja–Id). Our experiments show that the proposed strategy outperforms direct translation in Kk–En with absolute improvements of 0.35 (an 11.3% relative improvement) and 0.22 (a 6.4% relative improvement) BLEU points for 3-gram and 5-gram, respectively. For 3-gram in Ja–Id, the proposed strategy shows an absolute gain of up to 0.11 BLEU points (a 0.9% relative improvement) over direct translation. Our proposed strategy using a small phrase table obtains better BLEU scores than a strategy using a large phrase table. The size of the target monolingual data and the feature function weight of the language model (LM) could reduce perplexity scores.
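As a rough illustration of the selection step described above, the following minimal sketch picks, for each translation direction, the symmetrization heuristic whose phrase table gives the highest development-set BLEU score. The heuristic names follow Moses conventions; train_and_score is a hypothetical placeholder for an actual alignment, phrase-extraction, and evaluation run, not part of the paper's code.

```python
# A minimal sketch, assuming a Moses-style pipeline, of the selection step:
# for each translation direction, pick the symmetrization heuristic whose
# phrase table yields the highest BLEU score on a development set.
# train_and_score() is a hypothetical placeholder, not the paper's code.

SYMMETRIZATION_HEURISTICS = [
    "grow-diag-final-and",  # the standard choice
    "grow-diag-final",
    "grow-diag",
    "union",
    "intersection",
]

def train_and_score(src, tgt, heuristic):
    """Placeholder: run word alignment, symmetrize with `heuristic`,
    extract a phrase table, tune, and return the dev-set BLEU score
    (e.g., computed with sacrebleu). Returns a stub value here."""
    return 0.0  # replace with an actual training/evaluation run

def best_symmetrization(src, tgt):
    scores = {h: train_and_score(src, tgt, h) for h in SYMMETRIZATION_HEURISTICS}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Example: the three directions needed for a pivot setup such as Ja-En-Id.
for src, tgt in [("ja", "id"), ("ja", "en"), ("en", "id")]:
    heuristic, bleu = best_symmetrization(src, tgt)
    print(f"{src}-{tgt}: best heuristic = {heuristic} (BLEU {bleu:.2f})")
```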

2021 ◽  
Vol 7 ◽  
pp. e816
Author(s):  
Heng-yang Lu ◽  
Jun Yang ◽  
Cong Hu ◽  
Wei Fang

Background: Fine-grained sentiment analysis is used to interpret consumers' sentiments, from their written comments, towards specific entities on specific aspects. Previous researchers have introduced three main tasks in this field (ABSA, TABSA, MEABSA), covering various kinds of social media data (e.g., review-specific, question-and-answer, and community-based). In this paper, we identify and address two common challenges encountered in these three tasks: the low-resource problem and sentiment polarity bias.
Methods: We propose a unified model called PEA that integrates data augmentation methodology with a pre-trained language model and is suitable for all of the ABSA, TABSA and MEABSA tasks. Two data augmentation methods, entity replacement and dual noise injection, are introduced to address both challenges at the same time. An ensemble method is also introduced to combine the results of the basic RNN-based and BERT-based models.
Results: PEA shows significant improvements on all three fine-grained sentiment analysis tasks compared with state-of-the-art models. It also achieves results comparable to the baseline models while using only 20% of their training data, which demonstrates its extraordinary performance under extreme low-resource conditions.
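To make the first augmentation method concrete, here is a minimal, hypothetical sketch of entity replacement: the target entity in a labelled comment is swapped with another entity of the same type, so the existing aspect and sentiment labels can be reused for a new training example. The entity lexicon and example sentence are illustrative assumptions, not the paper's data or implementation.

```python
import random

# A minimal sketch of entity-replacement data augmentation. The entity
# lexicon and example sentence are illustrative assumptions only.
ENTITY_LEXICON = {
    "RESTAURANT": ["Bella Napoli", "Golden Dragon", "The Corner Cafe"],
}

def replace_entity(text, entity, entity_type, rng=random):
    """Swap the target entity for another of the same type, producing an
    augmented example that keeps the original aspect/sentiment labels."""
    candidates = [e for e in ENTITY_LEXICON[entity_type] if e != entity]
    new_entity = rng.choice(candidates)
    return text.replace(entity, new_entity), new_entity

augmented, new_entity = replace_entity(
    "The pasta at Bella Napoli was excellent but the service was slow.",
    "Bella Napoli",
    "RESTAURANT",
)
print(augmented)  # same aspect labels now attached to the new entity
```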


Author(s):  
Laura K. Murray ◽  
Emily E. Haroz ◽  
Michael D. Pullmann ◽  
Shannon Dorsey ◽  
Jeremy Kane ◽  
...  

The use of transdiagnostic mental health treatments in low-resource settings has been proposed as a possible aid in scaling up mental health services. Modular, multi-problem transdiagnostic treatments can be used to treat a range of mental health problems and are designed to handle comorbidity. Two randomized controlled trials have been completed on one such treatment, the Common Elements Treatment Approach (CETA), delivered by lay counsellors in Iraq and Thailand. This paper uses data from these two clinical trials to explore the delivery of CETA by lay providers, examining the fidelity and flexibility of element use. Data were collected at every therapy session: clients completed a short symptom assessment, and providers described the clinical elements delivered during the session. Analyses included descriptive statistics of delivery, including the selection and sequencing of treatment elements, and the variance in element dose, with clustering at the counsellor level, using multi-level models. Results indicate that lay providers in low-resource settings (with supervision) demonstrated fidelity to the recommended CETA elements, order, and dose, and occasionally added elements and adjusted dosage based on client presentation (i.e., flexibility). This modular approach did not result in significantly longer treatment duration. Our analysis suggests that lay providers were able to learn the decision-making processes of CETA based on client presentation and to adjust treatment as needed with supervision. As modular, multi-problem transdiagnostic treatments continue to be explored in low-resource settings, research should continue to focus on 'unpacking' lay counsellor delivery of these interventions, the decision-making processes involved, and the level of supervision required.
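For readers unfamiliar with the multi-level modelling mentioned above, the snippet below sketches the general shape of such an analysis: element dose modelled with a random intercept for counsellor, capturing between-counsellor variance. The column names and the toy data frame are assumptions for illustration only, not the study's actual data or code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy session-level data: the dose (number of sessions) spent on an element,
# with sessions clustered within counsellors. All values are made up.
df = pd.DataFrame({
    "dose":       [2, 3, 1, 4, 2, 3, 1, 2, 3, 2, 4, 1],
    "element":    ["exposure", "exposure", "relaxation", "exposure",
                   "relaxation", "exposure", "relaxation", "exposure",
                   "exposure", "relaxation", "exposure", "relaxation"],
    "counsellor": ["c1", "c1", "c1", "c1", "c2", "c2", "c2", "c2",
                   "c3", "c3", "c3", "c3"],
})

# Random intercept for counsellor captures between-counsellor variance in dose.
model = smf.mixedlm("dose ~ element", df, groups=df["counsellor"])
result = model.fit()
print(result.summary())
```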


2021 ◽  
Vol 11 (5) ◽  
pp. 1974 ◽  
Author(s):  
Chanhee Lee ◽  
Kisu Yang ◽  
Taesun Whang ◽  
Chanjun Park ◽  
Andrew Matteson ◽  
...  

Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Although language modeling is unsupervised, and collecting data for it is therefore relatively inexpensive, it remains a challenging process for languages with limited resources. This results in a large technological disparity between high- and low-resource languages for numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data-efficient training of pretrained language models. This is achieved by formulating language modeling of low-resource languages as a domain adaptation task using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at a sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves data efficiency by a factor of up to 32 compared to monolingual training.
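As a rough illustration of the parameter-reuse idea (not the paper's exact procedure, and omitting the implicit translation layers), the sketch below loads an English-pretrained RoBERTa, replaces its vocabulary-specific embeddings for a hypothetical Korean vocabulary, and marks only the new language-specific parameters as trainable while the reused Transformer body stays frozen. The vocabulary size and the choice of which parameters to freeze are assumptions.

```python
from transformers import RobertaForMaskedLM

# Reuse the body of an English-pretrained model as the starting point.
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Assume a tokenizer/vocabulary has already been trained for the target
# language; 32,000 is a hypothetical Korean vocabulary size.
new_vocab_size = 32000
model.resize_token_embeddings(new_vocab_size)

# Re-initialize the word embeddings so they are learned from scratch for the
# new language (the LM head decoder shares these weights in RoBERTa).
model.roberta.embeddings.word_embeddings.weight.data.normal_(mean=0.0, std=0.02)

# Freeze the reused encoder body; update only embeddings and the LM head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("roberta.embeddings", "lm_head"))

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```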


Author(s):  
Szymon Roziewski ◽  
Marek Kozłowski

The exponential growth of the internet community has resulted in the production of a vast amount of unstructured data, including web pages, blogs and social media. Such a volume, consisting of hundreds of billions of words, cannot realistically be analyzed by humans. In this work we introduce the tool LanguageCrawl, which allows Natural Language Processing (NLP) researchers to easily build web-scale corpora using the Common Crawl Archive, an open repository of web crawl information that contains petabytes of data. We present three use cases in the course of this work: filtering of Polish websites, the construction of n-gram corpora, and the training of a continuous skipgram language model with hierarchical softmax. Each of them has been implemented within the LanguageCrawl toolkit, with the possibility of adjusting the specified language and n-gram ranks. This paper focuses particularly on high computing efficiency, achieved through highly concurrent multitasking and the use of effective libraries and design. LanguageCrawl has been made publicly available to enrich the current set of NLP resources. We strongly believe that our work will facilitate further NLP research, especially in under-resourced languages, for which the lack of appropriately sized corpora is a serious hindrance to applying data-intensive methods such as deep neural networks.
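To illustrate the third use case, here is a minimal sketch of training a continuous skipgram model with hierarchical softmax using gensim (version 4.x argument names assumed). The toy sentences and hyperparameters are placeholders; in practice the input would be the tokenized Polish corpus filtered from Common Crawl.

```python
from gensim.models import Word2Vec

# Toy sentences standing in for the filtered, tokenized Common Crawl corpus.
sentences = [
    ["ala", "ma", "kota"],
    ["kot", "ma", "ale"],
    ["ala", "lubi", "koty"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality
    window=5,
    min_count=1,
    sg=1,             # continuous skipgram rather than CBOW
    hs=1,             # hierarchical softmax
    negative=0,       # disable negative sampling when using hierarchical softmax
)
print(model.wv.most_similar("ala", topn=2))
```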


Author(s):  
Francis Zheng ◽  
Machel Reid ◽  
Edison Marrese-Taylor ◽  
Yutaka Matsuo

2019 ◽  
Author(s):  
Astik Biswas ◽  
Raghav Menon ◽  
Ewald van der Westhuizen ◽  
Thomas Niesler
