Improving Data Augmentation for Low-Resource NMT Guided by POS-Tagging and Paraphrase Embedding

Mieradilijiang Maimaiti; Yang Liu; Huanbo Luan; Zegao Pan; Maosong Sun

doi:10.1145/3464427

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3464427 ◽

2021 ◽

Vol 20 (6) ◽

pp. 1-21

Author(s):

Mieradilijiang Maimaiti ◽

Yang Liu ◽

Huanbo Luan ◽

Zegao Pan ◽

Maosong Sun

Keyword(s):

Data Augmentation ◽

Ranking Algorithm ◽

Original Sequence ◽

Low Resource ◽

POS Tagging ◽

Word Level ◽

Corpus Size ◽

Semantic Errors ◽

Pseudo Data

Data augmentation is used in several text generation tasks. In machine translation, and especially in low-resource language scenarios, many data augmentation methods have been proposed. The most common approaches for generating pseudo data rely on word omission, random sampling, or replacing some words in the text. However, previous methods barely guarantee the quality of the augmented data. In this work, we build the augmented data using paraphrase embedding and POS tagging. Namely, we generate a pseudo monolingual corpus by replacing words that carry one of the four main POS tags (noun, adjective, adverb, and verb), based on both the paraphrase table and word similarity. We select the word-level paraphrase table with the larger corpus size and obtain the word embedding of each word in the table, then calculate the cosine similarity between these words and the tagged words in the original sequence. In addition, we exploit a ranking algorithm to choose highly similar words to reduce semantic errors, and leverage POS-tag-based replacement to mitigate syntactic errors to some extent. Experimental results show that our augmentation method consistently outperforms all previous SOTA methods on seven low-resource language pairs from four corpora by 1.16 to 2.39 BLEU points.
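
The replacement step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tag names (`NOUN`, `VERB`, `ADJ`, `ADV`), the toy embeddings, the toy paraphrase table, the `min_sim` threshold, and all function names are assumptions made for the example, and the POS tags are supplied by hand rather than produced by a tagger.

```python
import math

# Hypothetical tag names for the four content-word classes named in the abstract.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_paraphrase(word, paraphrases, embeddings, min_sim):
    """Rank paraphrase candidates by similarity and return the top one, or None."""
    if word not in embeddings:
        return None
    scored = [
        (cosine(embeddings[word], embeddings[cand]), cand)
        for cand in paraphrases.get(word, [])
        if cand in embeddings
    ]
    scored.sort(reverse=True)  # highest similarity first
    if scored and scored[0][0] >= min_sim:
        return scored[0][1]
    return None


def augment(tokens, tags, paraphrases, embeddings, min_sim=0.8):
    """Replace content words with their most similar paraphrase, if similar enough."""
    out = []
    for tok, tag in zip(tokens, tags):
        repl = None
        if tag in CONTENT_TAGS:
            repl = best_paraphrase(tok, paraphrases, embeddings, min_sim)
        out.append(repl if repl is not None else tok)
    return out


# Toy data, invented purely for illustration.
embeddings = {
    "big": [0.9, 0.1, 0.0],
    "large": [0.88, 0.12, 0.02],
    "tall": [0.5, 0.5, 0.1],
    "house": [0.1, 0.9, 0.2],
    "home": [0.12, 0.88, 0.22],
    "bank": [0.0, 0.2, 0.9],
}
paraphrases = {"big": ["large", "tall"], "house": ["home", "bank"]}
tokens = ["the", "big", "house"]
tags = ["DET", "ADJ", "NOUN"]

augmented = augment(tokens, tags, paraphrases, embeddings)
print(augmented)
```

Function words such as "the" are left untouched because their tag is outside the four content classes, and a candidate is used only when its similarity clears the threshold, which mirrors the abstract's aim of limiting semantic drift.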

Tackling neural machine translation in low-resource settings: a Portuguese case study

10.5753/stil.2021.17807 ◽

2021 ◽

Author(s):

Arthur T. Estrella ◽

João B. O. Souza Filho

Keyword(s):

Machine Translation ◽

Data Augmentation ◽

Word Embeddings ◽

Effective Solution ◽

Computational Power ◽

Limited Data ◽

Neural Machine Translation ◽

Low Resource

Neural machine translation (NMT) nowadays requires an increasing amount of data and computational power, so succeeding in this task with limited data and a single GPU can be challenging. Strategies such as pre-trained word embeddings, subword embeddings, and data augmentation can potentially address some issues faced in low-resource experimental settings, but their impact on translation quality is unclear. This work evaluates some of these strategies in two low-resource experiments and goes beyond reporting BLEU: errors are categorized on the Portuguese-English pair with the help of a translator, considering semantic and syntactic aspects. The BPE subword approach proved to be the most effective solution, allowing a BLEU increase of 59% p.p. compared to the standard Transformer.

Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion

Computational Intelligence and Neuroscience ◽

10.1155/2021/9975078 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Chenggang Mi ◽

Shaolin Zhu ◽

Rui Nie

Keyword(s):

Language Processing ◽

Data Augmentation ◽

Feature Fusion ◽

Training Data ◽

Low Resource ◽

High Resource ◽

Part Of Speech ◽

Word Level ◽

Cross Lingual ◽

Log Linear

Loanword identification has been studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation and cross-lingual information retrieval. However, recent studies on this topic usually focus on high-resource languages (such as Chinese, English, and Russian); for low-resource languages, such as Uyghur and Mongolian, loanword identification tends to perform worse because of limited resources and a lack of annotated data. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource language loanword identification; then, a loanword identification model based on a log-linear RNN is introduced to improve the performance of low-resource loanword identification by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) into one model. Experimental results on loanword identification in Uyghur (in this study, we mainly focus on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) show that our proposed method achieves the best performance compared with several strong baseline systems.

Linguistic Resources for Bhojpuri, Magahi, and Maithili: Statistics about Them, Their Similarity Estimates, and Baselines for Three Applications

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3458250 ◽

2021 ◽

Vol 20 (6) ◽

pp. 1-37

Author(s):

Rajesh Kumar Mundotiya ◽

Manish Kumar Singh ◽

Rahul Kapur ◽

Swasti Mishra ◽

Anil Kumar Singh

Keyword(s):

Language Processing ◽

Language Identification ◽

Identification Algorithm ◽

Additional Contribution ◽

Low Resource ◽

POS Tagging ◽

High Resource ◽

Language Technology ◽

Statistical Measures ◽

Corpus Size

Corpus preparation for low-resource languages and for development of human language technology to analyze or computationally process them is a laborious task, primarily due to the unavailability of expert linguists who are native speakers of these languages and also due to the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, which is a relatively high-resource language, which is why we compare them with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible, without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at character, word, syllable, and morpheme levels. These corpora were also annotated with parts-of-speech (POS) and chunk tags. The basic statistical measures were both absolute and relative and were expected to indicate linguistic properties, such as morphological, lexical, phonological, and syntactic complexities (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to match the corpus size across the languages to avoid the effect of corpus size, but in some cases it turned out that using the full corpus was better, even if sizes were very different. Although the results are not very clear, we tried to draw some conclusions about the languages and the corpora. For POS tagging and chunking, the BIS tagset was used to manually annotate the data. The POS-tagged data sizes are 16,067, 14,669, and 12,310 sentences, respectively, for Bhojpuri, Magahi, and Maithili. The sizes for chunking are 9,695 and 1,954 sentences for Bhojpuri and Maithili, respectively. 
The inter-annotator agreement for these annotations, using Cohen’s Kappa, was 0.92, 0.64, and 0.74, respectively, for the three languages. These (annotated) corpora have been used for developing preliminary automated tools, which include POS tagger, Chunker, and Language Identifier. We have also developed the Bilingual dictionary (Purvanchal languages to Hindi) and a Synset (that can be integrated later in the Indo-WordNet) as additional resources. The main contribution of the work is the creation of basic resources for facilitating further language processing research for these languages, providing some quantitative measures about them and their similarities among themselves and with Hindi. For similarities, we use a somewhat novel measure of language similarity based on an n-gram-based language identification algorithm. An additional contribution is providing baselines for three basic NLP applications (POS tagging, chunking, and language identification) for these closely related languages.
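
For readers unfamiliar with the agreement statistic reported above, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that standard formula; the function name and the toy label sequences are illustrative assumptions and are not taken from the paper or from the BIS tagset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations, invented for illustration only.
annotator_1 = ["N", "N", "V", "V"]
annotator_2 = ["N", "N", "V", "N"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```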

Analyzing Subword Techniques to Improve English to Sinhala Neural Machine Translation

International Journal of Asian Language Processing ◽

10.1142/s2717554520500174 ◽

2021 ◽

pp. 2050017

Author(s):

Rashmini Naranpanawa ◽

Ravinga Perera ◽

Thilakshi Fonseka ◽

Uthayasanker Thayasivam

Keyword(s):

Machine Translation ◽

State Of The Art ◽

Statistical Machine Translation ◽

Translation System ◽

Rare Word ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Low Resource ◽

Word Level ◽

Morphologically Rich Languages

Neural machine translation (NMT) performs much better than statistical machine translation (SMT) models when parallel data are abundant. However, vanilla NMT operates primarily at the word level with a fixed vocabulary. Therefore, low-resource, morphologically rich languages such as Sinhala are strongly affected by the out-of-vocabulary (OOV) and rare-word problems. Recent advances in subword techniques have opened up opportunities for low-resource communities by enabling open-vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system using the Transformer and explore standard subword techniques on top of it to identify which subword approach has the greater effect on the English-Sinhala language pair. Our models demonstrate that subword segmentation strategies combined with state-of-the-art NMT can perform remarkably well when translating English sentences into a morphologically rich language, even without a large parallel corpus.

Improved neural machine translation for low-resource English–Assamese pair

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-219260 ◽

2021 ◽

pp. 1-12

Author(s):

Sahinur Rahman Laskar ◽

Abdullah Faiz Ur Rahman Khilji ◽

Partha Pakray ◽

Sivaji Bandyopadhyay

Keyword(s):

Machine Translation ◽

Data Augmentation ◽

Language Translation ◽

Linguistically Diverse ◽

Neural Machine Translation ◽

Low Resource ◽

Parallel Data ◽

The World ◽

Translation Accuracy ◽

Vocabulary Problems

Language translation is essential to bring the world closer and plays a significant part in building a community among people of different linguistic backgrounds. Machine translation helps remove the language barrier and allows easier communication among linguistically diverse communities. Due to the unavailability of resources, major languages of the world are regarded as low-resource languages. This makes automating translation among such languages, to benefit indigenous speakers, a challenging task. This article investigates neural machine translation for the resource-poor English–Assamese language pair by tackling insufficient data and out-of-vocabulary problems. We also propose a data augmentation-based NMT approach that exploits synthetic parallel data; it significantly improves translation accuracy for English-to-Assamese and Assamese-to-English translation and obtains state-of-the-art results.

Reconstruction of a Symbolic Periodic Sequence from a Sequence with Noise

INFORMACIONNYE TEHNOLOGII ◽

10.17587/it.27.531-541 ◽

2021 ◽

Vol 27 (10) ◽

pp. 531-541

Author(s):

G. N. Zhukova ◽

M. V. Ulyanov

Keyword(s):

Fragment Length ◽

Edit Distance ◽

Periodic Sequence ◽

Levenshtein Distance ◽

Original Sequence ◽

Fixed Length ◽

Approximating Sequence ◽

Period Estimate ◽

Correct Comparison

The problem of constructing a periodic sequence consisting of at least eight periods is considered, based on a given sequence obtained from an unknown periodic sequence, also containing at least eight periods, by introducing noise in the form of deletion, replacement, and insertion of symbols. To construct a periodic sequence that approximates a given one distorted by noise, it is first necessary to estimate the length of the repeating fragment (period). Next, the distorted original sequence is divided into successive sections of equal length; the length takes integer values from 80 to 120% of the period estimate. Each section is compared with each of the remaining sections, and the section that has the minimum edit distance (Levenshtein distance) to any of the remaining sections is selected to build the periodic sequence; minimization is carried out over all sections of a fixed length, and then over all lengths from 80 to 120% of the period estimate. For correct comparison of fragments of different lengths, we consider the ratio between the edit distance and the length of the fragment. The length of a fragment that minimizes the ratio of the edit distance to another fragment of the same length to the fragment length is considered the period of the approximating periodic sequence, and the fragment itself, repeated the required number of times, forms an approximating sequence. The constructed sequence may contain an incomplete repeating fragment at the end. The quality of the approximation is estimated by the ratio of the edit distance from the original distorted sequence to the constructed periodic sequence of the same length and this length.
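
The procedure outlined in this abstract can be sketched as below. This is an illustrative reading, not the authors' code: the function names are hypothetical, the period estimate is assumed to be supplied by the caller, and each section is scored by its mean edit distance to the other sections of the same length. The abstract's wording leaves the exact aggregation open, so the mean is an assumption made here.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def reconstruct(seq, period_estimate):
    """Return (fragment, approximating sequence, quality ratio)."""
    lo = max(1, (4 * period_estimate + 4) // 5)  # ceil of 80% of the estimate
    hi = (6 * period_estimate) // 5              # floor of 120% of the estimate
    best = None  # (normalized score, fragment)
    for length in range(lo, hi + 1):
        # Successive, non-overlapping sections of this length.
        sections = [seq[i:i + length]
                    for i in range(0, len(seq) - length + 1, length)]
        if len(sections) < 2:
            continue
        for i, sec in enumerate(sections):
            others = [s for j, s in enumerate(sections) if j != i]
            mean_dist = sum(levenshtein(sec, o) for o in others) / len(others)
            score = mean_dist / length  # normalize so lengths are comparable
            if best is None or score < best[0]:
                best = (score, sec)
    if best is None:
        raise ValueError("sequence too short for the given period estimate")
    fragment = best[1]
    reps = -(-len(seq) // len(fragment))      # ceiling division
    approx = (fragment * reps)[:len(seq)]     # may end with a partial fragment
    quality = levenshtein(seq, approx) / len(seq)
    return fragment, approx, quality


clean = "abcde" * 8
noisy = clean[:7] + "x" + clean[8:]  # one substituted symbol
fragment, approx, quality = reconstruct(noisy, 5)
print(fragment, quality)
```

In the usage example a single substituted symbol is introduced into eight repetitions of a five-symbol fragment; the final ratio corresponds to the approximation quality measure described in the last sentence of the abstract.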

Does service integration improve technical quality of care in low-resource settings? An evaluation of a model integrating HIV care into family planning services in Kenya

Health Policy and Planning ◽

10.1093/heapol/czx090 ◽

2017 ◽

Vol 32 (suppl_4) ◽

pp. iv91-iv101 ◽

Cited By ~ 9

Author(s):

Richard Mutemwa ◽

Susannah H Mayhew ◽

Charlotte E Warren ◽

Timothy Abuya ◽

Charity Ndwiga ◽

...

Keyword(s):

Quality Of Care ◽

Family Planning ◽

Service Integration ◽

HIV Care ◽

Technical Quality ◽

Family Planning Services ◽

Low Resource Settings ◽

Low Resource

Understanding Syntactic and Semantic Errors in the Composition Writing of Jordanian EFL Learners

International Journal of Applied Linguistics & English Literature ◽

10.7575/aiac.ijalel.v.6n.6p.158 ◽

2017 ◽

Vol 6 (6) ◽

pp. 158

Author(s):

Yazan Shaker Almahameed ◽

May Al-Shaikhli

Keyword(s):

Foreign Language ◽

Language Learners ◽

Semantic Knowledge ◽

Resumptive Pronouns ◽

Foreign Language Learners ◽

Word Level ◽

Sentence Level ◽

Native Speakers Of English ◽

Semantic Errors ◽

Verb Tense

The current study investigated the salient syntactic and semantic errors made by Jordanian learners of English as a foreign language when writing in English. Writing poses a great challenge for both native and non-native speakers of English, since it involves most language sub-systems, such as grammar, vocabulary, spelling and punctuation. A total of 30 Jordanian English foreign language learners participated in the study. The participants were instructed to write a composition of no more than one hundred and fifty words on a selected topic. Essays were collected and analyzed statistically. The results showed that the participants produced eleven types of syntactic errors: verb tense, agreement, auxiliary, conjunctions, word order, resumptive pronouns, null-subject, double-subject, superlative, comparative and possessive pronouns. Among syntactic errors, verb tense errors were the most frequent, at 33%. The results additionally revealed two types of semantic errors: errors at sentence level and errors at word level. Errors at word level far outnumbered errors at sentence level, at 82% and 18%, respectively. It can be concluded that the syntactic and semantic knowledge of Jordanian learners of English is still insufficient.

The Role of Oxytocin in the Prophylaxis of Postpartum Hemorrhage in Low-Resource Settings

Acta Médica Portuguesa ◽

10.20344/amp.14258 ◽

2021 ◽

Vol 34 (13) ◽

Author(s):

Inês Ferreira ◽

Ana Reynolds

Keyword(s):

Postpartum Hemorrhage ◽

Healthcare Professionals ◽

Cultural Barriers ◽

Low Resource Settings ◽

Low Resource ◽

Widespread Access ◽

Political Economic ◽

Demand For Health ◽

Cultural Constraints

Introduction: Postpartum hemorrhage remains one of the leading causes of maternal death globally. Oxytocin is the uterotonic agent of choice for the prophylaxis of this complication. However, its use in low-resource settings is associated with clinical, political, economic and cultural constraints. The goal of this article is to describe the use of oxytocin for postpartum hemorrhage prophylaxis in low-resource settings. Material and Methods: A literature review on the topic was carried out, and 24 articles were included. Results: The information was organized into seven sections: the evaluation of the efficacy of oxytocin compared to other uterotonics, the use of oxytocin in home births, the training of healthcare professionals, the quality of the available oxytocin, the new formulations, the risks associated with the use of uterotonics and the adopted health policies. Discussion: Despite the progress achieved, widespread access to oxytocin for postpartum hemorrhage prophylaxis in low-resource settings is less than desirable. The main difficulties encountered were the shortage of skilled healthcare professionals for oxytocin administration, deficiencies concerning the quality of the drug and the inadequacy of available clinical guidelines. Conclusion: In order to reduce maternal mortality caused by postpartum hemorrhage in low-resource settings, it is essential to improve the knowledge of healthcare professionals, to implement good practices on the use of uterotonics, to optimize resource management and to overcome cultural barriers that prevent the demand for health services.

Performance of colposcopic scoring by modified International Federation of Cervical Pathology and Colposcopy terminology for diagnosing cervical intraepithelial neoplasia in a low-resource setting

South Asian Journal of Cancer ◽

10.4103/sajc.sajc_302_18 ◽

2019 ◽

Vol 08 (04) ◽

pp. 218-220 ◽

Cited By ~ 3

Author(s):

Prabhakaran Nair Rema ◽

Aleyamma Mathew ◽

Shaji Thomas

Keyword(s):

Cervical Intraepithelial Neoplasia ◽

Scoring Systems ◽

Intraepithelial Neoplasia ◽

International Federation ◽

Screening Tests ◽

Substantial Agreement ◽

Low Resource ◽

Resource Setting ◽

Low Resource Setting

Introduction: Colposcopy is a tool to evaluate women with cervical pre-cancer and cancer. Various scoring systems are used to interpret colposcopic findings, but with interobserver variation. To improve the quality of colposcopy, the International Federation of Cervical Pathology and Colposcopy (IFCPC) introduced a colposcopic nomenclature in 2011. Colposcopic scoring helps to select patients who need treatment for cervical intraepithelial neoplasia. Aim of the Study: The study aimed to evaluate the agreement between colposcopic diagnosis with the modified IFCPC terminology and cervical pathology in patients with abnormal screening tests and to assess the utility of this colposcopic scoring system in low-resource settings. Methodology: Patients with abnormal screening tests who underwent colposcopic assessment in the department of Gynaecological oncology were included in the study. Colposcopic scoring was done using the modified IFCPC nomenclature. The results were compared with cytology and the final histopathology. Results: Fifty-six patients were included in the study. Compared with histopathology, the colposcopic scoring showed agreement in 65.7%, indicating substantial agreement, and this was statistically significant (P = 0.0001). Compared with cytology, the colposcopic score showed agreement in 35.6%, indicating fair agreement, and this was also statistically significant (P = 0.001). Conclusion: Colposcopic scoring by modified IFCPC 2011 criteria showed substantial agreement with cervical histopathology. Compared to traditional methods, the 2011 international terminology of colposcopy could improve colposcopic accuracy.
