Improving Data Augmentation for Low-Resource NMT Guided by POS-Tagging and Paraphrase Embedding

Author(s):  
Mieradilijiang Maimaiti ◽  
Yang Liu ◽  
Huanbo Luan ◽  
Zegao Pan ◽  
Maosong Sun

Data augmentation is an approach used in several text generation tasks. In machine translation, particularly in low-resource language scenarios, many data augmentation methods have been proposed. The most common approaches for generating pseudo data rely on word omission, random sampling, or replacing some words in the text. However, previous methods can barely guarantee the quality of the augmented data. In this work, we build the data using paraphrase embedding and POS-Tagging. Namely, we generate a pseudo monolingual corpus by replacing words that carry one of the four main POS tags (noun, adjective, adverb, and verb), based on both the paraphrase table and word similarity. We select the larger word-level paraphrase table and obtain the word embedding of each word in the table, then calculate the cosine similarity between these words and the tagged words in the original sequence. In addition, we use a ranking algorithm to choose highly similar words to reduce semantic errors, and we leverage POS-constrained replacement to mitigate syntactic errors to some extent. Experimental results show that our augmentation method consistently outperforms all previous SOTA methods on low-resource language pairs, across seven language pairs from four corpora, by 1.16 to 2.39 BLEU points.
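The replacement step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tag names, the `paraphrases` and `embeddings` dictionaries, the `augment` function, and the similarity threshold are all hypothetical stand-ins chosen for illustration. Only words tagged as noun, adjective, adverb, or verb are considered, and each is replaced by its most similar paraphrase candidate when the cosine similarity exceeds the threshold.

```python
import math

# Hypothetical tag names for the four content-word classes named in the abstract.
CONTENT_TAGS = {"NOUN", "ADJ", "ADV", "VERB"}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def augment(tagged_sentence, paraphrases, embeddings, threshold=0.8):
    """Replace content words by their most similar paraphrase candidate.

    tagged_sentence: list of (word, pos_tag) pairs.
    paraphrases:     dict mapping a word to a list of candidate words (assumed format).
    embeddings:      dict mapping a word to its embedding vector (assumed format).
    threshold:       assumed minimum similarity; words with no candidate above it are kept.
    """
    output = []
    for word, tag in tagged_sentence:
        best_word, best_sim = word, threshold
        if tag in CONTENT_TAGS and word in embeddings:
            for candidate in paraphrases.get(word, []):
                if candidate in embeddings:
                    sim = cosine(embeddings[word], embeddings[candidate])
                    if sim > best_sim:  # keep the highest-ranked candidate
                        best_word, best_sim = candidate, sim
        output.append(best_word)
    return output


# Toy data, invented for the example.
toy_embeddings = {"big": [1.0, 0.0], "large": [0.9, 0.1], "tiny": [-1.0, 0.0]}
toy_paraphrases = {"big": ["tiny", "large"]}
toy_sentence = [("the", "DET"), ("big", "ADJ"), ("dog", "NOUN")]
augmented = augment(toy_sentence, toy_paraphrases, toy_embeddings)
```

In this toy run, "big" is replaced by "large", while "dog" is left unchanged because it has no embedding or paraphrase entry. A real system would also need to check that the candidate shares the POS tag of the word it replaces.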

2021 ◽  
Author(s):  
Arthur T. Estrella ◽  
João B. O. Souza Filho

Neural machine translation (NMT) nowadays requires an increasing amount of data and computational power, so succeeding in this task with limited data and a single GPU can be challenging. Strategies such as pre-trained word embeddings, subword embeddings, and data augmentation can potentially address some issues faced in low-resource experimental settings, but their impact on translation quality is unclear. This work evaluates some of these strategies in two low-resource experiments and goes beyond reporting BLEU: errors are categorized on the Portuguese-English pair with the help of a translator, considering semantic and syntactic aspects. The BPE subword approach proved to be the most effective solution, allowing a BLEU increase of 59% p.p. compared to the standard Transformer.


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Chenggang Mi ◽  
Shaolin Zhu ◽  
Rui Nie

Loanword identification has been studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation and cross-lingual information retrieval. However, recent studies on this topic usually focus on high-resource languages (such as Chinese, English, and Russian); for low-resource languages, such as Uyghur and Mongolian, loanword identification tends to perform worse because of limited resources and a lack of annotated data. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource language loanword identification; then, a loanword identification model based on a log-linear RNN is introduced to improve the performance of low-resource loanword identification by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) into one model. Experimental results on loanword identification in Uyghur (in this study, we mainly focus on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) show that our proposed method achieves the best performance compared with several strong baseline systems.


Author(s):  
Rajesh Kumar Mundotiya ◽  
Manish Kumar Singh ◽  
Rahul Kapur ◽  
Swasti Mishra ◽  
Anil Kumar Singh

Corpus preparation for low-resource languages and for development of human language technology to analyze or computationally process them is a laborious task, primarily due to the unavailability of expert linguists who are native speakers of these languages and also due to the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, which is a relatively high-resource language, which is why we compare them with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible, without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at character, word, syllable, and morpheme levels. These corpora were also annotated with parts-of-speech (POS) and chunk tags. The basic statistical measures were both absolute and relative and were expected to indicate linguistic properties, such as morphological, lexical, phonological, and syntactic complexities (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to match the corpus size across the languages to avoid the effect of corpus size, but in some cases it turned out that using the full corpus was better, even if sizes were very different. Although the results are not very clear, we tried to draw some conclusions about the languages and the corpora. For POS tagging and chunking, the BIS tagset was used to manually annotate the data. The POS-tagged data sizes are 16,067, 14,669, and 12,310 sentences, respectively, for Bhojpuri, Magahi, and Maithili. The sizes for chunking are 9,695 and 1,954 sentences for Bhojpuri and Maithili, respectively. 
The inter-annotator agreement for these annotations, using Cohen’s Kappa, was 0.92, 0.64, and 0.74, respectively, for the three languages. These (annotated) corpora have been used for developing preliminary automated tools, which include POS tagger, Chunker, and Language Identifier. We have also developed the Bilingual dictionary (Purvanchal languages to Hindi) and a Synset (that can be integrated later in the Indo-WordNet) as additional resources. The main contribution of the work is the creation of basic resources for facilitating further language processing research for these languages, providing some quantitative measures about them and their similarities among themselves and with Hindi. For similarities, we use a somewhat novel measure of language similarity based on an n-gram-based language identification algorithm. An additional contribution is providing baselines for three basic NLP applications (POS tagging, chunking, and language identification) for these closely related languages.
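For readers unfamiliar with the agreement statistic quoted above, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance from their individual label frequencies. The sketch below is a generic illustration with invented labels; the function name `cohens_kappa` is hypothetical and the code is not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label proportions.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented POS labels for four tokens.
annotator_1 = ["N", "N", "V", "V"]
annotator_2 = ["N", "N", "V", "N"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four tokens (observed agreement 0.75), chance agreement is 0.5, and the resulting Kappa is 0.5.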


Author(s):  
Rashmini Naranpanawa ◽  
Ravinga Perera ◽  
Thilakshi Fonseka ◽  
Uthayasanker Thayasivam

Neural machine translation (NMT) is a remarkable approach that performs much better than statistical machine translation (SMT) models when parallel data is abundant. However, vanilla NMT is primarily word-level with a fixed vocabulary. Therefore, low-resource, morphologically rich languages such as Sinhala are strongly affected by the out-of-vocabulary (OOV) and rare-word problems. Recent advancements in subword techniques have opened up opportunities for low-resource communities by enabling open-vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system using the Transformer and explore standard subword techniques on top of it to identify which subword approach has the greater effect on the English-Sinhala language pair. Our models demonstrate that subword segmentation strategies combined with state-of-the-art NMT can perform remarkably well when translating English sentences into a morphologically rich language, even without a large parallel corpus.


2021 ◽  
pp. 1-12
Author(s):  
Sahinur Rahman Laskar ◽  
Abdullah Faiz Ur Rahman Khilji ◽  
Partha Pakray ◽  
Sivaji Bandyopadhyay

Language translation is essential to bring the world closer and plays a significant part in building a community among people of different linguistic backgrounds. Machine translation helps considerably in removing the language barrier and allows easier communication among linguistically diverse communities. Due to the unavailability of resources, many major languages of the world are regarded as low-resource languages. This makes automating translation among such languages, to benefit indigenous speakers, a challenging task. This article investigates neural machine translation for the English–Assamese resource-poor language pair by tackling the insufficient-data and out-of-vocabulary problems. We also propose a data augmentation-based NMT approach that exploits synthetic parallel data, significantly improves translation accuracy for English-to-Assamese and Assamese-to-English translation, and obtains state-of-the-art results.


2021 ◽  
Vol 27 (10) ◽  
pp. 531-541
Author(s):  
G. N. Zhukova ◽  
M. V. Ulyanov

The problem of constructing a periodic sequence consisting of at least eight periods is considered, based on a given sequence obtained from an unknown periodic sequence, also containing at least eight periods, by introducing noise in the form of deletion, replacement, and insertion of symbols. To construct a periodic sequence that approximates the given noisy one, it is first necessary to estimate the length of the repeating fragment (period). The distorted original sequence is then divided into successive sections of equal length, where the length takes integer values from 80 to 120 % of the period estimate. Each section is compared with each of the remaining sections, and the section with the minimum edit distance (Levenshtein distance) to the remaining sections is selected to build the periodic sequence; the minimization is carried out over all sections of a fixed length and then over all lengths from 80 to 120 % of the period estimate. To compare fragments of different lengths correctly, we consider the ratio between the edit distance and the length of the fragment. The length of the fragment that minimizes the ratio of the edit distance (to another fragment of the same length) to the fragment length is taken as the period of the approximating periodic sequence, and the fragment itself, repeated the required number of times, forms the approximating sequence. The constructed sequence may contain an incomplete repeating fragment at the end. The quality of the approximation is estimated by the ratio of the edit distance between the original distorted sequence and the constructed periodic sequence of the same length to this length.
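The procedure in this abstract can be sketched as follows. This is an illustrative reading, not the authors' code: the function names are hypothetical, and one detail is an assumption. The abstract speaks of the minimum edit distance to the remaining sections; the sketch scores each section by its mean edit distance to all other sections of the same length, divided by the section length, because a single pairwise minimum can be zero by coincidence for a wrong length.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and replacement."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # deletion
                           cur[j - 1] + 1,                     # insertion
                           prev[j - 1] + (char_a != char_b)))  # replacement
        prev = cur
    return prev[-1]


def approximate_periodic(seq, period_estimate):
    """Return (fragment, approximating sequence) for a noisy periodic string."""
    low = max(1, (period_estimate * 4) // 5)   # 80 % of the estimate, rounded down
    high = (period_estimate * 6) // 5          # 120 % of the estimate, rounded down
    best_ratio, best_fragment = None, None
    for length in range(low, high + 1):
        # Successive non-overlapping sections of the current length.
        sections = [seq[i:i + length]
                    for i in range(0, len(seq) - length + 1, length)]
        if len(sections) < 2:
            continue
        for i, section in enumerate(sections):
            others = sections[:i] + sections[i + 1:]
            mean_dist = sum(levenshtein(section, o) for o in others) / len(others)
            ratio = mean_dist / length          # normalise by fragment length
            if best_ratio is None or ratio < best_ratio:
                best_ratio, best_fragment = ratio, section
    if best_fragment is None:
        return None
    repeats = len(seq) // len(best_fragment) + 1
    # The last repetition may be incomplete, as noted in the abstract.
    return best_fragment, (best_fragment * repeats)[:len(seq)]


def approximation_quality(seq, approx):
    """Edit distance between the two sequences divided by their common length."""
    return levenshtein(seq, approx) / len(seq)


# Eight periods of "abcde" with one replaced symbol (invented test data).
noisy = list("abcde" * 8)
noisy[7] = "x"
noisy = "".join(noisy)
fragment, approx = approximate_periodic(noisy, 5)
quality = approximation_quality(noisy, approx)
```

On this toy input the sketch recovers the fragment "abcde", and the quality ratio equals one edit over forty symbols.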


2017 ◽  
Vol 32 (suppl_4) ◽  
pp. iv91-iv101 ◽  
Author(s):  
Richard Mutemwa ◽  
Susannah H Mayhew ◽  
Charlotte E Warren ◽  
Timothy Abuya ◽  
Charity Ndwiga ◽  
...  

Author(s):  
Yazan Shaker Almahameed ◽  
May Al-Shaikhli

The current study aimed to investigate the salient syntactic and semantic errors made by Jordanian learners of English as a foreign language when writing in English. Writing poses a great challenge for both native and non-native speakers of English, since it involves most language sub-systems, such as grammar, vocabulary, spelling, and punctuation. A total of 30 Jordanian learners of English as a foreign language participated in the study. The participants were instructed to write a composition of no more than one hundred and fifty words on a selected topic. The essays were collected and analyzed statistically. The results showed that the syntactic errors produced by the participants were varied: eleven types of syntactic errors were committed, namely verb tense, agreement, auxiliary, conjunctions, word order, resumptive pronouns, null subject, double subject, superlative, comparative, and possessive pronouns. Among syntactic errors, verb tense errors were the most frequent, at 33%. The results additionally revealed two types of semantic errors: errors at sentence level and errors at word level. Errors at word level far outnumbered errors at sentence level, at 82% and 18%, respectively. It can be concluded that the syntactic and semantic knowledge of Jordanian learners of English is still insufficient.


2021 ◽  
Vol 34 (13) ◽  
Author(s):  
Inês Ferreira ◽  
Ana Reynolds

Introduction: Postpartum hemorrhage remains one of the leading causes of maternal death globally. Oxytocin is the uterotonic agent of choice for the prophylaxis of this complication. However, its use in low-resource settings is associated with clinical, political, economic and cultural constraints. The goal of this article is to describe the use of oxytocin for postpartum hemorrhage prophylaxis in low-resource settings. Material and Methods: A literature review on the topic was carried out, and 24 articles were included. Results: The information was organized into seven sections: the evaluation of the efficacy of oxytocin compared to other uterotonics, the use of oxytocin in home births, the training of healthcare professionals, the quality of the available oxytocin, the new formulations, the risks associated with the use of uterotonics and the adopted health policies. Discussion: Despite the progress achieved, widespread access to oxytocin for postpartum hemorrhage prophylaxis in low-resource settings is less than desirable. The main difficulties encountered were the shortage of skilled healthcare professionals for oxytocin administration, deficiencies concerning the quality of the drug and the inadequacy of available clinical guidelines. Conclusion: In order to reduce maternal mortality caused by postpartum hemorrhage in low-resource settings, it is essential to improve the knowledge of healthcare professionals, to implement good practices on the use of uterotonics, to optimize resource management and to overcome cultural barriers that prevent the demand for health services.


2019 ◽  
Vol 08 (04) ◽  
pp. 218-220 ◽  
Author(s):  
Prabhakaran Nair Rema ◽  
Aleyamma Mathew ◽  
Shaji Thomas

Abstract Introduction: Colposcopy is a tool to evaluate women with cervical pre-cancer and cancer. Various scoring systems are used to interpret colposcopic findings, but they are subject to interobserver variation. To improve the quality of colposcopy, the International Federation of Cervical Pathology and Colposcopy (IFCPC) introduced a colposcopic nomenclature in 2011. Colposcopic scoring helps to select patients who need treatment for cervical intraepithelial neoplasia. Aim of the Study: The study aimed to evaluate the agreement between colposcopic diagnosis using the modified IFCPC terminology and cervical pathology in patients with abnormal screening tests, and to assess the utility of this colposcopic scoring system in low-resource settings. Methodology: Patients with abnormal screening tests who underwent colposcopic assessment in the department of Gynaecological oncology were included in the study. Colposcopic scoring was done using the modified IFCPC nomenclature. The results were compared with cytology and the final histopathology. Results: 56 patients were included in the study. Colposcopic scoring agreed with histopathology in 65.7% of cases, which indicated substantial agreement and was statistically significant (P = 0.0001). Colposcopic scoring agreed with cytology in 35.6% of cases, indicating fair agreement, which was also statistically significant (P = 0.001). Conclusion: Colposcopic scoring by the modified IFCPC 2011 criteria showed substantial agreement with cervical histopathology. Compared to traditional methods, the 2011 international terminology of colposcopy could improve colposcopic accuracy.

