Adaptive Language Processing Based on Deep Learning in Cloud Computing Platform

Complexity, 2020, Vol 2020, pp. 1-11
Author(s): Wenbin Xu, Chengbo Yin

With the continuous advancement of technology, the amount of information and knowledge disseminated on the Internet every day has been growing severalfold. At the same time, a large amount of bilingual data is produced in the real world. These data are undoubtedly a great asset for statistical machine translation research. For screening a corpus by sentence-pair quality, two strategies are first proposed: one based on the sentence-pair length ratio and one based on word-alignment information. The innovation of these two methods is that no additional linguistic resources, such as bilingual dictionaries or syntactic analyzers, are needed. No manual intervention is required, poor-quality sentence pairs are filtered out automatically, and the methods can be applied to any language pair. Secondly, a domain-adaptive method based on a massive corpus is proposed. It uses a massive-corpus mechanism to carry out automatic multidomain model migration: each domain learns its intradomain model independently, while all domains share the same general model. Through the massive-corpus method, these models can be combined and adjusted to make model learning more accurate. Finally, the adaptive method of massive corpus filtering and statistical machine translation based on a cloud platform is verified. Experiments show that both methods are effective and can improve the quality of statistical machine translation.
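The length-ratio screening strategy can be sketched as follows. This is a minimal illustration only: the threshold, the whitespace tokenization, and the example pairs are assumptions, not the authors' exact settings.

```python
def length_ratio_ok(src, tgt, max_ratio=2.0):
    """Keep a sentence pair only if the token-length ratio is plausible.

    Sentence pairs whose lengths differ wildly are usually misaligned
    or poor translations, so they are screened out.
    """
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio

# Hypothetical sentence pairs: the first is plausible, the second is not.
pairs = [
    ("the cat sat on the mat", "le chat est assis sur le tapis"),
    ("hello", "this is a very long and clearly mismatched translation"),
]
filtered = [p for p in pairs if length_ratio_ok(*p)]
```

The word-alignment-based strategy would add a second filter on top of this one, scoring each surviving pair by how well its words align.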

2016, Vol 1 (1), pp. 45-49
Author(s): Avinash Singh, Asmeet Kour, Shubhnandan S. Jamwal

The objective of this paper is to analyze English–Dogri parallel corpus translation. Machine translation is the translation from one language into another and is one of the biggest applications of Natural Language Processing (NLP). Moses is a statistical machine translation system that allows training translation models for any language pair. We have developed a translation system using a statistical approach that translates English to Dogri and vice versa. The parallel corpus consists of 98,973 sentences. The system achieves an accuracy of 80% when translating English to Dogri and 87% when translating Dogri to English.


2021, Vol 12 (5), pp. 1-51
Author(s): Yu Wang, Yuelin Wang, Kai Dang, Jie Liu, Zhuo Liu

Grammatical error correction (GEC) is an important application of natural language processing techniques, and GEC systems are important intelligent systems that have long been explored in both academic and industrial communities. The past decade has witnessed significant progress in GEC owing to the increasing popularity of machine learning and deep learning. However, no survey yet untangles the large body of research and progress in this field. We present the first survey of GEC, a comprehensive retrospective of the literature in this area. We first define the GEC task and introduce the public datasets and data annotation schema. After that, we discuss six kinds of basic approaches, six commonly applied performance-boosting techniques for GEC systems, and three data augmentation methods. Since GEC is typically viewed as a sister task of Machine Translation (MT), we put more emphasis on statistical machine translation (SMT)-based and neural machine translation (NMT)-based approaches, given their importance. Similarly, some performance-boosting techniques adapted from MT have been successfully combined with GEC systems to enhance final performance. More importantly, after introducing evaluation in GEC, we make an in-depth analysis based on empirical results, in terms of both GEC approaches and GEC systems, to reveal a clearer pattern of progress, with error-type analysis and system recapitulation clearly presented. Finally, we discuss five prospective directions for future GEC research.
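The data annotation schemes mentioned above typically record corrections as token-span edits over the source sentence. A toy version of that edit extraction can be sketched with `difflib`; real GEC annotation tools are considerably more sophisticated, so treat this only as an illustration of the idea.

```python
import difflib

def extract_edits(source_tokens, corrected_tokens):
    """Toy edit extraction: align source and corrected token lists and
    report (start, end, replacement) spans, in the spirit of span-based
    GEC annotation formats (a simplification, not any official scheme)."""
    matcher = difflib.SequenceMatcher(a=source_tokens, b=corrected_tokens)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # Tokens source_tokens[i1:i2] are replaced by this string.
            edits.append((i1, i2, " ".join(corrected_tokens[j1:j2])))
    return edits

src = "She go to school yesterday".split()
cor = "She went to school yesterday".split()
edits = extract_edits(src, cor)  # one edit: "go" -> "went"
```

Framing GEC as MT, as the survey does, amounts to "translating" the source token sequence into the corrected one, with edits like these as the supervision signal.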


Author(s): Maxim Roy

Machine Translation (MT) from Bangla to English has recently become a priority task for the Bangla Natural Language Processing (NLP) community. Statistical Machine Translation (SMT) systems require a significant amount of bilingual data between language pairs to achieve significant translation accuracy. However, Bangla being a low-density language, such resources are not available. In this chapter, the authors discuss how machine learning approaches can help improve translation quality within an SMT system without requiring a huge increase in resources. They provide a novel semi-supervised learning and active learning framework for SMT which utilizes both labeled and unlabeled data. The authors discuss sentence selection strategies in detail and perform detailed experimental evaluations of the sentence selection methods. In the semi-supervised setting, the reversed model approach outperformed all other approaches for Bangla–English SMT, and in the active learning setting, the geometric 4-gram and geometric phrase sentence selection strategies proved most useful based on BLEU score results over baseline approaches. Overall, the authors demonstrate that for a low-density language like Bangla, these machine learning approaches can improve translation quality.
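A geometric n-gram selection strategy of the kind named above can be sketched as scoring each unlabeled sentence by the geometric mean of its n-gram novelty with respect to the already-labeled pool, then labeling the highest-scoring sentences first. The exact scoring in the chapter may differ; this is an assumed, minimal formulation.

```python
from math import prod

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def geometric_ngram_novelty(sentence, seen_ngrams, max_n=4):
    """Score a candidate by the geometric mean, over n = 1..max_n, of the
    fraction of its n-grams not yet covered by the labeled pool."""
    tokens = sentence.split()
    fractions = []
    for n in range(1, max_n + 1):
        grams = ngrams(tokens, n)
        if grams:
            unseen = sum(1 for g in grams if g not in seen_ngrams)
            fractions.append(unseen / len(grams))
    if not fractions:
        return 0.0
    return prod(fractions) ** (1 / len(fractions))

# Labeled pool: n-grams already covered by the bilingual training data.
seen = set()
for n in range(1, 5):
    seen.update(ngrams("the cat sat on the mat".split(), n))
```

A fully covered sentence scores 0.0 and a fully novel one scores 1.0, so selection favors sentences whose translation would teach the SMT system the most new material.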


2019, Vol 28 (3), pp. 479-492
Author(s): Mary Priya Sebastian, G. Santhosh Kumar

Machine translation (MT) from English to foreign languages is a fast-developing area of research, and various translation techniques are discussed in the literature. However, translation from English to Malayalam, a Dravidian language, is still in its early stages, and work in this field has so far not flourished to a great extent. The main reason for this shortcoming is the non-availability of linguistic resources and translation tools for the Malayalam language. A parallel corpus with alignment is one such resource essential for a machine translation system. This paper focuses on a technique that enables the automatic construction of a verb-aligned parallel corpus by exploring the internal structure of the English and Malayalam languages, which in turn facilitates the task of machine translation from English to Malayalam.


2016, Vol 22 (4), pp. 549-573
Author(s): Sanjika Hewavitharana, Stephan Vogel

Mining parallel data from comparable corpora is a promising approach for overcoming data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach designed to align only the parallel sections of a sentence, bypassing the non-parallel sections. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, and (2) a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches using a manually aligned data set and show that the proposed approach outperforms the other two. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic–English and Urdu–English translation systems, which resulted in improvements of up to 1.2 BLEU over the baseline. The main contributions of this paper are two-fold: (1) novel phrase alignment algorithms to extract parallel phrase pairs from comparable sentences, and (2) an evaluation of the utility of the extracted phrases by using them directly in the MT decoder.
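A typical signal for deciding whether a candidate phrase pair is parallel, in both classifier-based and alignment-based approaches, is a lexical translation score computed from a word-level translation table. The sketch below is illustrative only: the feature set and probability table are assumptions, not the paper's actual model.

```python
def lexical_score(src_phrase, tgt_phrase, trans_prob):
    """Average best lexical translation probability of a candidate phrase
    pair -- a common parallelism feature (illustrative; the paper's
    actual features and smoothing differ in detail)."""
    src_tokens, tgt_tokens = src_phrase.split(), tgt_phrase.split()
    total = 0.0
    for t in tgt_tokens:
        # Best source word explaining each target word, with a small
        # floor probability for unseen pairs.
        best = max(trans_prob.get((s, t), 1e-6) for s in src_tokens)
        total += best
    return total / len(tgt_tokens)

# Tiny hypothetical translation table P(t|s).
trans_prob = {("house", "casa"): 0.8, ("white", "blanca"): 0.7}
score_good = lexical_score("white house", "casa blanca", trans_prob)
score_bad = lexical_score("white house", "perro grande", trans_prob)
```

Thresholding such a score separates parallel candidates (high score) from non-parallel ones (floor-level score), which is the core of the candidate-filtering step.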


Author(s): Zakaria El Maazouzi, Badr Eddine EL Mohajir, Mohammed Al Achhab

Achieving high accuracy in automatic translation tasks has been one of the challenging goals for machine translation researchers for decades, and researchers in the field have always been eager to explore new ways of improving machine translation. As a key application in the natural language processing domain, automatic translation has developed many approaches, notably statistical machine translation and, more recently, neural machine translation, which has largely improved translation quality, especially for Latin languages; they have even made it possible for some language pairs to approach human translation quality. In this paper, we present a survey of the state of the art of statistical translation, in which we describe the different existing methodologies and overview recent research studies while pointing out the main strengths and limitations of the different approaches.


Webology, 2021, Vol 18 (Special Issue 02), pp. 208-222
Author(s): Vikas Pandey, M.V. Padmavati, Ramesh Kumar

Machine Translation is a subfield of Natural Language Processing (NLP) used to translate a source language into a target language. In this paper, an attempt has been made to build a Hindi–Chhattisgarhi machine translation system based on a statistical approach. In the state of Chhattisgarh there is a long-awaited need for a Hindi-to-Chhattisgarhi machine translation system, especially for non-Chhattisgarhi-speaking people. To develop the Hindi–Chhattisgarhi statistical machine translation system, the open-source software Moses is used. Moses is a statistical machine translation system that automatically trains a translation model for the Hindi–Chhattisgarhi language pair from a parallel corpus; a corpus is a collection of structured text used to study linguistic properties. The machine translation system works on a parallel corpus of 40,000 Hindi–Chhattisgarhi bilingual sentences, extracted from various domains such as stories, novels, textbooks and newspapers. To overcome translation problems related to proper nouns and unknown words, a transliteration system is also embedded. The system was tested on 1,000 sentences to check the grammatical correctness of the output, and an accuracy of 75% was achieved.
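The transliteration fallback for unknown words can be sketched at the token level as follows. This is a simplified illustration with hypothetical entries, not the paper's system: a real SMT decoder searches over phrases with reordering and language-model scores, and since Hindi and Chhattisgarhi share the Devanagari script, an identity copy is assumed here as the simplest transliteration for out-of-vocabulary proper nouns.

```python
def translate_tokens(tokens, phrase_table, transliterate):
    """Token-level decoding sketch: use the phrase table when a
    translation is known; otherwise fall back to transliteration."""
    out = []
    for tok in tokens:
        if tok in phrase_table:
            out.append(phrase_table[tok])
        else:
            out.append(transliterate(tok))  # OOV: proper noun, rare word
    return out

# Hypothetical one-entry table; "राम" (a name) is out of vocabulary
# and is passed through by the identity transliteration.
phrase_table = {"किताब": "किताब"}
result = translate_tokens(["किताब", "राम"], phrase_table,
                          transliterate=lambda t: t)
```

Without such a fallback, an SMT decoder would either drop unknown words or emit them untranslated with no control, which is why the paper embeds a transliteration component.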


Author(s): Bilous O., Mishchenko A., Datska T., Ivanenko N., et al.

How often students use IT resources is a key factor in the acquisition of skills associated with new technologies. Strategies aimed at increasing student autonomy need to be developed and should offer resources that encourage students to make use of computing tools during class hours. This paper analyzes modern linguistic technologies for intelligent language processing that are necessary for creating and operating highly effective knowledge-management technologies. Computerization of the information sphere has triggered an extensive search for ways to use natural language mechanisms in automated systems of various types. One outcome was the creation of controlled languages based on a set of features that make machine translation more refined. Driven by economic demand, these are not artificial languages like Esperanto but natural languages simplified in terms of vocabulary, grammatical and syntactic structures. More than ever, the tasks of modern computational linguistics include creating software for natural language processing and for information retrieval in large data sets, supporting technical authors in creating professional texts as well as users of computer technology, and hence creating new translation tools. Powerful linguistic resources such as text corpora, terminology databases and ontologies may facilitate more efficient use of modern multilingual information technology, and creating and improving all the methods considered will help make the translator's job more efficient. One of the programs, CLAT, does not aim at producing machine translation but allows technical editors to create flawless, consistent professional texts through integrated punctuation and spelling modules. Other programs under consideration are to be implemented in Ukrainian translation departments.
Moreover, the databases considered in the paper enable the study of the dynamics of the linguistic system and the development of areas of applied research such as terminography, terminology and automated data processing. Effective cooperation of developers, translators and institutes in the creation of innovative linguistic technologies will promote the further development of translation and applied linguistics.


2017, Vol 108 (1), pp. 355-366
Author(s): Ankit Srivastava, Georg Rehm, Felix Sasaki

With the ever-increasing availability of linked multilingual lexical resources, there is renewed interest in extending Natural Language Processing (NLP) applications so that they can make use of the vast set of lexical knowledge bases available in the Semantic Web. Machine Translation is one such case: MT systems can potentially benefit from these resources, since unknown words and ambiguous translations are among the most common sources of error. In this paper, we attempt to minimise these types of errors by interfacing Statistical Machine Translation (SMT) models with Linked Open Data (LOD) resources such as DBpedia and BabelNet. We perform several experiments based on the SMT system Moses and evaluate multiple strategies for exploiting knowledge from multilingual linked data in automatically translating named entities. We conclude with an analysis of best practices for multilingual linked data sets in order to optimise their benefit to multilingual and cross-lingual applications.
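One family of strategies for the named-entity case can be sketched as a pre-translation pass: entities unknown to the SMT model get a target-language label from a linked-data resource and are pinned as forced translations. Moses accepts such constraints via XML-style markup on the input. In this sketch the lookup dictionary, the vocabulary, and the `<ne>` tag name are all stand-ins; a real system would query DBpedia or BabelNet.

```python
def augment_with_lod(sentence, smt_vocab, lod_labels, target_lang="de"):
    """Mark tokens unknown to the SMT system with a forced translation
    taken from a linked-data label lookup (a stand-in here for a
    DBpedia/BabelNet query)."""
    out = []
    for tok in sentence.split():
        if tok in smt_vocab:
            out.append(tok)  # known to the translation model: leave as-is
        elif (tok, target_lang) in lod_labels:
            label = lod_labels[(tok, target_lang)]
            out.append(f'<ne translation="{label}">{tok}</ne>')
        else:
            out.append(tok)  # unknown and no LOD label: pass through
    return " ".join(out)

smt_vocab = {"visited", "the", "city"}
lod_labels = {("Munich", "de"): "München"}
marked = augment_with_lod("visited Munich", smt_vocab, lod_labels)
```

The decoder then translates the marked sentence normally but is constrained to emit the supplied label for the annotated span, removing one common source of unknown-word errors.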


2016, Vol 106 (1), pp. 193-204
Author(s): Víctor M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez

This paper presents ruLearn, an open-source toolkit for the automatic inference of rules for shallow-transfer machine translation from scarce parallel corpora and morphological dictionaries. ruLearn makes rule-based machine translation a very appealing alternative for under-resourced language pairs because it avoids the need for human experts to handcraft transfer rules and, in contrast to statistical machine translation, requires only a small amount of parallel corpora (a few hundred parallel sentences proved to be sufficient). The inference algorithm implemented by ruLearn has recently been published by the same authors in Computer Speech & Language (volume 32). It is able to produce rules whose translation quality is similar to that obtained with hand-crafted rules. ruLearn generates rules ready for use in the Apertium platform, although they can easily be adapted to other platforms. When the rules produced by ruLearn are used together with a hybridisation strategy for integrating linguistic resources from shallow-transfer rule-based machine translation into phrase-based statistical machine translation (published by the same authors in the Journal of Artificial Intelligence Research, volume 55), they help to mitigate data sparseness. This paper also shows how to use ruLearn and describes its implementation.

