parallel corpus

Author(s):  
Rupjyoti Baruah ◽  
Rajesh Kumar Mundotiya ◽  
Anil Kumar Singh

Machine translation (MT) systems have been built using numerous techniques for bridging language barriers. These techniques are broadly categorized into approaches such as Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). End-to-end NMT systems significantly outperform SMT in translation quality on many language pairs, especially those with adequate parallel corpora. We report comparative experiments on baseline MT systems for Assamese to other Indo-Aryan languages (in both translation directions) using traditional Phrase-Based SMT as well as some more successful NMT architectures, namely a basic sequence-to-sequence model with attention, Transformer, and finetuned Transformer. The results are evaluated using the most prominent standard automatic metric, BLEU (BiLingual Evaluation Understudy), as well as other well-known metrics for exploring the performance of different baseline MT systems, since this is the first such work involving Assamese. The evaluation scores are compared for SMT and NMT models to assess their effectiveness on bi-directional language pairs involving Assamese and other Indo-Aryan languages (Bangla, Gujarati, Hindi, Marathi, Odia, Sinhalese, and Urdu). The highest BLEU scores obtained are for Assamese to Sinhalese for SMT (35.63) and for Assamese to Bangla for NMT systems (seq2seq: 50.92, Transformer: 50.01, finetuned Transformer: 50.19). We also try to relate the results to language characteristics, distances, family trees, domains, data sizes, and sentence lengths. We find that the domain is the most important factor affecting the results for the given data domains and sizes. We compare our results with the only existing MT system for Assamese (Bing Translator) and also with pairs involving Hindi.
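
As a rough illustration of the BLEU metric named in the abstract above, the following Python sketch computes a corpus-level score from modified n-gram precisions (up to 4-grams) and a brevity penalty. It is not the evaluation code used by the authors. It assumes whitespace-tokenized input, a single reference per sentence, and no smoothing, and the function names `ngram_counts` and `corpus_bleu` are illustrative. The result is on a 0 to 1 scale; reported figures such as 35.63 conventionally correspond to this value multiplied by 100.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU with one reference per hypothesis (assumption)."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Invented English sentences, used only to exercise the function.
hypothesis = "the cat sat on the mat".split()
reference = "the cat sat on a mat".split()
print(corpus_bleu([hypothesis], [reference]))
```

Counts are pooled over the whole corpus before the precisions are taken, which is how corpus-level BLEU differs from averaging sentence scores.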


2022 ◽  
Vol 0 (0) ◽  
Author(s):  
Martijn van der Klis ◽  
Jos Tellings

This paper reports on the state of the art in the application of multidimensional scaling (MDS) techniques to create semantic maps in linguistic research. MDS refers to a statistical technique that represents objects (lexical items, linguistic contexts, languages, etc.) as points in a space so that close similarity between the objects corresponds to small distances between the corresponding points in the representation. We focus on the use of MDS in combination with parallel corpus data as used in research on cross-linguistic variation. We first introduce the mathematical foundations of MDS and then give an exhaustive overview of past research that employs MDS techniques in combination with parallel corpus data. We propose a set of terminology to succinctly describe the key parameters of a particular MDS application. We then show that this computational methodology is theory-neutral, i.e. it can be employed to answer research questions in a variety of linguistic theoretical frameworks. Finally, we show how this leads to two lines of future developments for MDS research in linguistics.
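
The idea that similar objects should end up as nearby points can be made concrete with a small Python sketch. It minimizes raw stress, the sum over pairs of (point distance minus target dissimilarity) squared, by plain gradient descent in two dimensions. This is only one of several MDS variants and is not taken from the paper; the function names, the step size, the circular starting layout, and the three-item dissimilarity matrix are all assumptions made for illustration.

```python
import math


def circle_layout(n):
    """Deterministic 2D starting layout: n points evenly spaced on the unit circle."""
    return [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
            for i in range(n)]


def raw_stress(points, dissim):
    """Sum over pairs i < j of (distance between points - target dissimilarity)^2."""
    n = len(points)
    return sum((math.dist(points[i], points[j]) - dissim[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def fit_mds(dissim, steps=2000, lr=0.05):
    """Minimize raw stress in 2D with plain gradient descent (illustrative only)."""
    n = len(dissim)
    points = circle_layout(n)
    for _ in range(steps):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dist = math.dist(points[i], points[j]) or 1e-12  # avoid division by zero
                coef = 2.0 * (dist - dissim[i][j]) / dist
                for k in range(2):
                    grads[i][k] += coef * (points[i][k] - points[j][k])
        # Update all points only after every gradient has been computed.
        for i in range(n):
            for k in range(2):
                points[i][k] -= lr * grads[i][k]
    return points


# Invented dissimilarities between three hypothetical items A, B, C.
labels = ["A", "B", "C"]
dissim = [[0.0, 3.0, 4.0],
          [3.0, 0.0, 5.0],
          [4.0, 5.0, 0.0]]

layout = fit_mds(dissim)
for label, (x, y) in zip(labels, layout):
    print(f"{label}: ({x:.3f}, {y:.3f})")
print("stress before:", raw_stress(circle_layout(3), dissim))
print("stress after: ", raw_stress(layout, dissim))
```

In a parallel-corpus setting the dissimilarity matrix would be derived from the data, for example from how often two contexts receive different translations; that step is outside this sketch.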


2022 ◽  
Vol 2022 ◽  
pp. 1-11
Author(s):  
Syed Abdul Basit Andrabi ◽  
Abdul Wahid

Machine translation has been an active field of research for the last several decades. The main aim of machine translation is to remove the language barrier. Earlier research in this field started with direct word-to-word replacement of the source language by the target language. Later on, with advances in computer and communication technology, there was a paradigm shift to data-driven models such as statistical and neural machine translation approaches. In this paper, we use a neural network-based deep learning technique for English-to-Urdu translation. A parallel corpus of around 30923 sentences is used. The corpus contains sentences from an English-Urdu parallel corpus, news, and sentences that are frequently used in day-to-day life. The corpus contains 542810 English tokens and 540924 Urdu tokens, and the proposed system is trained and tested using a 70:30 split. To evaluate the efficiency of the proposed system, several automatic evaluation metrics are used, and the model output is also compared with the output from Google Translator. The proposed model has an average BLEU score of 45.83.
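
A 70:30 division of a corpus of this size can be sketched as follows. The abstract does not say how the split was drawn, so the shuffling, the fixed seed, and the function name `split_parallel_corpus` are assumptions, and the placeholder pairs merely stand in for real English-Urdu sentence pairs.

```python
import random


def split_parallel_corpus(pairs, train_fraction=0.7, seed=0):
    """Shuffle sentence pairs reproducibly and cut them into train and test parts."""
    shuffled = list(pairs)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed is an assumption
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder corpus: (source, target) tuples with dummy content.
corpus = [(f"en sentence {i}", f"ur sentence {i}") for i in range(30923)]
train, test = split_parallel_corpus(corpus)
print(len(train), len(test))
```

With 30923 pairs and a fraction of 0.7, the cut falls at 21646, leaving 9277 pairs for testing. Keeping each source sentence and its translation in one tuple ensures the two sides stay aligned through the shuffle.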


2021 ◽  
pp. 507-525
Author(s):  
Talita Serpa ◽  
Paula Tavares Pinto ◽  
Diva Cardoso De Camargo

There is a growing body of literature that recognises the importance of Social Sciences in Translation Studies, such as the discussions surrounding the translational habitus, developed by Simeoni, Wolf, Inghilleri and Sela-Sheffy. In our research, we associate these ideas with corpus methodologies to analyse terminological usages as part of a professional behaviour. We hypothesise that when translation students have previously encountered the most frequent terms extracted from a parallel corpus, as well as their keyness and contexts, they replicate the same translational strategies in their texts, which can indicate their competencies eligible by their habitus.


Author(s):  
Łukasz Grabowski ◽  
Nicholas Groom

This study uses both parallel and comparable reference corpora in the English-Polish language pair to explore how translators deal with recurrent multi-word items performing specific discoursal functions. We also consider whether the observed tendencies overlap with those found in native texts, and the extent to which the discoursal functions realised by the multi-word items under scrutiny are “preserved” in translation. Capitalizing on findings from earlier research (Granger, 2014; Grabar & Lefer, 2015), we analyzed a pre-selected set of phrases signaling stance-taking and those functioning as textual, discourse-structuring devices originally found in the European Parliament proceedings corpus (Koehn, 2005) and included in the English-Polish parallel corpus Paralela (Pęzik, 2016). Since our goal was to explore whether and to what extent English functionally-defined phrases reflect the same level of formulaicity and regularity in both Polish translations and native Polish texts, the findings provided insights into the translation tendencies of such items, and revealed – using inter-rater agreement metrics – that the discoursal functions of recurrent n-grams may change in translation.


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Bing Li ◽  
Anxie Tuo ◽  
Hanyue Kong ◽  
Sujiao Liu ◽  
Jia Chen

This paper uses a neural network as a predictive model and a genetic algorithm as an online optimization algorithm to simulate the noise processing of a Chinese-English parallel corpus. At the same time, drawing on the powerful random global search mechanism of the genetic algorithm, this paper studies the principle and process of noise processing in the Chinese-English parallel corpus. Aiming at the task of identifying isolated words for unspecified persons, and taking into account the inadequacies of standard genetic algorithms and neural networks, this paper proposes a fast algorithm for training the network using genetic algorithms. Through simulation calculations, the effects on the recognition result of different characteristic parameters, the number of training samples, background noise, and whether a specific person is involved were analyzed, discussed, and compared with the traditional dynamic time comparison method. This paper introduces the idea of reinforcement learning, uses different reward mechanisms to solve the inconsistency of loss function and evaluation index measurement methods, and uses different decoding methods to alleviate the problem of exposure bias. It uses various simple genetic operations and the survival-of-the-fittest selection mechanism to guide the learning process and determine the direction of the search, and it can search multiple regions in the solution space at the same time. In addition, it has the advantage of not being restricted by conditions on the search space (such as being differentiable, continuous, and unimodal). At the same time, a method of using English subword vectors to initialize the parameters of the translation model is given. The research results show that the neural network recognition method based on the genetic algorithm given in this paper learns network weights quickly and is superior to the standard in all aspects. The performance of the algorithm in genetic algorithm and neural network, with high recognition rate and unique application advantages, can achieve a win-win of time and efficiency.


2021 ◽  
Vol 12 (2) ◽  
pp. 192-226
Author(s):  
Luca Bevacqua ◽  
Sharid Loáiciga ◽  
Hannah Rohde ◽  
Christian Hardmeier

Current work on coreference focuses primarily on entities, often leaving unanalysed the use of anaphors to corefer with antecedents such as events and textual segments. Moreover, the anaphoric forms that speakers use for entity and non-entity coreference are not mutually exclusive. This ambiguity has been the subject of recent work in English, with evidence of a split between comprehenders' preferential interpretation of personal versus demonstrative pronouns. In addition, comprehenders are shown to be sensitive to antecedent complexity and aspectual status, two verb-driven cues that signal how an event is being portrayed. Here we extend this work via a comparison across five languages (English, French, German, Italian, and Spanish). With a story-continuation experiment, we test how different referring expressions corefer with entity and event antecedents and whether verbal features such as argument structure and aspect influence this choice. Our results show widely consistent, not categorical biases across languages: entity coreference is favoured for personal pronouns and event coreference for demonstratives. Antecedent complexity increases the rate at which anaphors are taken to corefer with an event antecedent, but portraying an event as completed does not reach statistical significance (though showing quite uniform patterns). Lastly, we report a comparison of the same referring expressions to refer to entity and event antecedents in a trilingual parallel corpus annotated with coreference. Together, the results provide a first crosslingual picture of coreference preferences beyond the restricted entity-only patterns targeted by most existing work on coreference. The five languages are all shown to allow gradable use of pronouns for entity and event coreference, with biases that align with existing generalizations about the link between prominence and the use of reduced referring expressions. The studies also show the feasibility of manipulating targeted verb-driven cues across multiple languages to support crosslingual comparisons.


2021 ◽  
Vol 25 (2) ◽  
pp. 343-368
Author(s):  
Barbara Lewandowska-Tomaszczyk

The focus of the paper is to present arguments in favour of a complex set of areas of reference in cross-linguistic analyses of meanings, aimed in particular at the identification of a set of relevant analytic criteria to perform such a comparison. The arguments are based on lexicographic and corpus linguistic data and specifically on the polysemic concept of integrity in English and its lexical counterparts in Polish. It is generally assumed in Cognitive Linguistics, which is taken as the basic framework of the present study, that meanings, which are defined as convention-based conceptualizations, are not discrete entities, fully determined, even in fuller context, but rather they are dynamic conventional conceptualizations. Therefore, it is considered essential to identify first their basic, prototypical senses and then their broad meanings, which include, apart from the core part, their contextual, culture-specific, and connotational properties, defined in terms of a parametrized set of semasiological as well as onomasiological properties. The study methodology has also been adjusted towards this multifocused analysis of linguistic forms and considers the interdisciplinary (linguistic, psychological, cultural and social) domains to identify the cultural conceptualizations of the analysed forms. In the present case a cognitive corpus-based analysis in monolinguistic English contexts and in the English-to-Polish and Polish-to-English translation data of lexicographic and parallel corpus materials, as well as cultural dimensions will be exemplified to conclude with a parametrized system of cognitive cross-linguistic tertia comparationis to more fully determine their broad linguistic meanings.

