scholarly journals Cadlaws – An English–French Parallel Corpus of Legally Equivalent Documents

2021 ◽  
Vol 14 (2) ◽  
pp. 494-508
Author(s):  
Francina Sole-Mauri ◽  
Pilar Sánchez-Gijón ◽  
Antoni Oliver

This article presents Cadlaws, a new English–French corpus built from Canadian legal documents, and describes the corpus construction process and preliminary statistics obtained from it. The corpus contains over 16 million words in each language and includes unique features since it is composed of documents that are legally equivalent in both languages but not the result of a translation. The corpus is built upon enactments co-drafted by two jurists to ensure legal equality of each version and to re­flect the concepts, terms and institutions of two legal traditions. In this article the corpus definition as a parallel corpus instead of a comparable one is also discussed. Cadlaws has been pre-processed for machine translation and baseline Bilingual Evaluation Understudy (bleu), a score for comparing a candidate translation of text to a gold-standard translation of a neural machine translation system. To the best of our knowledge, this is the largest parallel corpus of texts which convey the same meaning in this language pair and is freely available for non-commercial use.

2016 ◽  
Vol 1 (1) ◽  
pp. 45-49
Author(s):  
Avinash Singh ◽  
Asmeet Kour ◽  
Shubhnandan S. Jamwal

The objective behind this paper is to analyze the English-Dogri parallel corpus translation. Machine translation is the translation from one language into another language. Machine translation is the biggest application of the Natural Language Processing (NLP). Moses is statistical machine translation system allow to train translation models for any language pair. We have developed translation system using Statistical based approach which helps in translating English to Dogri and vice versa. The parallel corpus consists of 98,973 sentences. The system gives accuracy of 80% in translating English to Dogri and the system gives accuracy of 87% in translating Dogri to English system.


2022 ◽  
Vol 2022 ◽  
pp. 1-11
Author(s):  
Syed Abdul Basit Andrabi ◽  
Abdul Wahid

Machine translation is an ongoing field of research from the last decades. The main aim of machine translation is to remove the language barrier. Earlier research in this field started with the direct word-to-word replacement of source language by the target language. Later on, with the advancement in computer and communication technology, there was a paradigm shift to data-driven models like statistical and neural machine translation approaches. In this paper, we have used a neural network-based deep learning technique for English to Urdu languages. Parallel corpus sizes of around 30923 sentences are used. The corpus contains sentences from English-Urdu parallel corpus, news, and sentences which are frequently used in day-to-day life. The corpus contains 542810 English tokens and 540924 Urdu tokens, and the proposed system is trained and tested using 70 : 30 criteria. In order to evaluate the efficiency of the proposed system, several automatic evaluation metrics are used, and the model output is also compared with the output from Google Translator. The proposed model has an average BLEU score of 45.83.


This submission describes the study of linguistically motivated features to estimate the translated sentence quality at sentence level on English-Hindi language pair. Several classification algorithms are employed to build the Quality Estimation (QE) models using the extracted features. We used source language text and the MT output to extract these features. Experiments show that our proposed approach is robust and producing competitive results for the DT based QE model on neural machine translation system.


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 02) ◽  
pp. 208-222
Author(s):  
Vikas Pandey ◽  
Dr.M.V. Padmavati ◽  
Dr. Ramesh Kumar

Machine Translation is a subfield of Natural language Processing (NLP) which uses to translate source language to target language. In this paper an attempt has been made to make a Hindi Chhattisgarhi machine translation system which is based on statistical approach. In the state of Chhattisgarh there is a long awaited need for Hindi to Chhattisgarhi machine translation system for converting Hindi into Chhattisgarhi especially for non Chhattisgarhi speaking people. In order to develop Hindi Chhattisgarhi statistical machine translation system an open source software called Moses is used. Moses is a statistical machine translation system and used to automatically train the translation model for Hindi Chhattisgarhi language pair called as parallel corpus. A collection of structured text to study linguistic properties is called corpus. This machine translation system works on parallel corpus of 40,000 Hindi-Chhattisgarhi bilingual sentences. In order to overcome translation problem related to proper noun and unknown words, a transliteration system is also embedded in it. These sentences are extracted from various domains like stories, novels, text books and news papers etc. This system is tested on 1000 sentences to check the grammatical correctness of sentences and it was found that an accuracy of 75% is achieved.


2018 ◽  
Vol 7 (2) ◽  
pp. 690
Author(s):  
Kamal Deep ◽  
Ajit Kumar ◽  
Vishal Goyal

This paper describes the creation process and statistics of Punjabi English (PunEng) parallel corpus. Parallel corpus is the main requirement to develop statistical machine translation as well as neural machine translation. Until now, we do not have any availability of PunEng parallel corpus. In this paper, we have shown difficulties and intensive labor to develop parallel corpus. Methods used for collecting data and the results are discussed, errors during the process of collecting data and how to handle these errors will be described.


2020 ◽  
Vol 2020 ◽  
pp. 1-7 ◽  
Author(s):  
Thien Nguyen ◽  
Huu Nguyen ◽  
Phuoc Tran

Building the first Russian-Vietnamese neural machine translation system, we faced the problem of choosing a translation unit system on which source and target embeddings are based. Available homogeneous translation unit systems with the same translation unit on the source and target sides do not perfectly suit the investigated language pair. To solve the problem, in this paper, we propose a novel heterogeneous translation unit system, considering linguistic characteristics of the synthetic Russian language and the analytic Vietnamese language. Specifically, we decrease the embedding level on the source side by splitting token into subtokens and increase the embedding level on the target side by merging neighboring tokens into supertoken. The experiment results show that the proposed heterogeneous system improves over the existing best homogeneous Russian-Vietnamese translation system by 1.17 BLEU. Our approach could be applied to building translation bots for language pairs with different linguistic characteristics.


Author(s):  
Rashmini Naranpanawa ◽  
Ravinga Perera ◽  
Thilakshi Fonseka ◽  
Uthayasanker Thayasivam

Neural machine translation (NMT) is a remarkable approach which performs much better than the Statistical machine translation (SMT) models when there is an abundance of parallel corpus. However, vanilla NMT is primarily based upon word-level with a fixed vocabulary. Therefore, low resource morphologically rich languages such as Sinhala are mostly affected by the out of vocabulary (OOV) and Rare word problems. Recent advancements in subword techniques have opened up opportunities for low resource communities by enabling open vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system using the transformer and explore standard subword techniques on top of it to identify which subword approach has a greater effect on English Sinhala language pair. Our models demonstrate that subword segmentation strategies along with the state-of-the-art NMT can perform remarkably when translating English sentences into a rich morphology language regardless of a large parallel corpus.


Sign in / Sign up

Export Citation Format

Share Document