parallel corpora
Recently Published Documents


Total documents: 433 (five years: 121)
H-index: 20 (five years: 3)

2022 ◽  
Vol 29 (2) ◽  
pp. 1-38
Author(s):  
Ana Paula Chaves ◽  
Jesse Egbert ◽  
Toby Hocking ◽  
Eck Doerry ◽  
Marco Aurelio Gerosa

Chatbots are often designed to mimic social roles attributed to humans. However, little is known about the impact of using language that fails to conform to the associated social role. Our research draws on sociolinguistics to investigate how a chatbot’s language choices can adhere to the expected social role the agent performs within a context. We seek to understand whether chatbot design should account for linguistic register. This research analyzes how register differences shape the user’s perception of the human-chatbot interaction. We produced parallel corpora of conversations in the tourism domain with similar content and varying register characteristics and evaluated users’ preferences regarding the chatbot’s linguistic choices in terms of appropriateness, credibility, and user experience. Our results show that register characteristics are strong predictors of users’ preferences, which points to the need to design chatbots with register-appropriate language to improve acceptance and users’ perceptions of chatbot interactions.


2022 ◽  
Author(s):  
Natali Alfonso Burgos ◽  
Karol Kiš ◽  
Peter Bakarac ◽  
Michal Kvasnica ◽  
Giovanni Licitra

We explore a bilingual next-word predictor (NWP) under federated optimization for a mobile application. A character-based LSTM is server-trained on English and Dutch texts from a custom parallel corpus. Its performance serves as the target. We simulate a federated learning environment to assess the feasibility of distributed training for the same model. The popular Federated Averaging (FedAvg) algorithm is used as the aggregation method. We show that the federated LSTM achieves decent performance but remains below the server-trained target. We suggest possible next steps to bridge this performance gap. Furthermore, we explore the effects of language imbalance by varying the ratio of English and Dutch training texts (or clients). The model maintains the performance of the balanced case up to an 80/20 imbalance, beyond which performance decays rapidly. Lastly, we describe the implementation of local client training, word prediction and client-server communication in a custom virtual keyboard for Android platforms. Additionally, homomorphic encryption is applied to provide secure aggregation, guarding the user from malicious servers.
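The aggregation step named in this abstract can be illustrated with a minimal sketch of Federated Averaging. This is not the authors' implementation: the function name `fedavg`, the dictionary-of-lists weight layout, and the example client sizes are all assumptions made for illustration. The sketch shows only the standard FedAvg rule, in which the server averages client parameters weighted by each client's number of training examples.

```python
from typing import Dict, List, Tuple

# Hypothetical weight layout: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def fedavg(client_updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client weights, weighting each client by its example count.

    client_updates holds (weights, num_examples) pairs, one per client.
    All clients are assumed to share the same parameter names and shapes.
    """
    total = sum(n for _, n in client_updates)
    if total <= 0:
        raise ValueError("need at least one client with a positive example count")

    first_weights, _ = client_updates[0]
    averaged: Weights = {
        name: [0.0] * len(values) for name, values in first_weights.items()
    }

    for weights, n in client_updates:
        share = n / total  # this client's fraction of all training examples
        for name, values in weights.items():
            for i, value in enumerate(values):
                averaged[name][i] += share * value
    return averaged


if __name__ == "__main__":
    # Assumed toy setup: one "English" client with 80 examples and one
    # "Dutch" client with 20, mirroring an 80/20 language imbalance.
    english = ({"w": [1.0, 2.0]}, 80)
    dutch = ({"w": [3.0, 6.0]}, 20)
    print(fedavg([english, dutch]))
```

Because the weighting follows example counts, a language that contributes fewer examples has proportionally less influence on the aggregated model, which is one way to picture the language-imbalance experiments the abstract describes.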


2021 ◽  
Vol 45 ◽  
Author(s):  
Roman Roszko

On New Manually Aligned and Tagged Bilingual Parallel Corpora and Their Applications

This article is devoted to the manually aligned and tagged bilingual parallel CLARIN-PL-BIZ corpora of the Baltic and Slavic languages, which are currently being developed. The study discusses the essential features of these corpora that make their applications go far beyond typical corpus analysis. Applications of these corpora include the design of cross-language models for the development of machine translation and artificial intelligence. The article also draws attention to the high potential of these resources as a reference training base for testing natural language processing tools.


2021 ◽  
Vol 111 (6) ◽  
pp. 137-165
Author(s):  
Doris Höhmann

It is well known that linguistic variants play a key role in the acquisition of language skills in the first, second or foreign language, in writing and translation processes, and in communicative interactions in general. Thus, a major research goal is the systematic investigation of intra- and interlinguistic variation. Due to its complexity, its qualitative-quantitative analysis continues to be a challenging issue, but it is becoming increasingly feasible thanks to both the possibility of compiling very large corpora and the availability of high-performing corpus-linguistic tools. The paper discusses a corpus-linguistic pilot study concerning the use of besser, am besten and das Beste as pragmatic markers in a cross-linguistic perspective. In particular, the analysis focusses on selected superlative and comparative constructions on the left periphery used for expressing advice. The data basis consists mainly of very large comparable German and Italian web corpora and, to a lesser extent, of bilingual sentence pairs drawn from parallel corpora. As will be shown, even when the analysis is restricted to a very small segment of microvariation, the modal constructions in both languages appear to be characterized by the combination of numerous overlapping and interplaying variants and by different tendencies in language use.


2021 ◽  
Vol 72 (2) ◽  
pp. 477-487
Author(s):  
Klára Bendová

Text readability metrics assess how much effort a reader must put into comprehending a given text. They are used, for example, to choose appropriate readings for different student proficiency levels, or to make sure that crucial information is efficiently conveyed (e.g., in an emergency). Flesch Reading Ease is so widely used that it is even integrated into Microsoft Word. However, its constants are language-dependent. The original formula was created for English. So far it has been adapted to several European languages, Bangla, and Hindi. This paper describes the Czech adaptation, with the language-dependent constants optimized by a machine-learning algorithm working on parallel corpora pairing Czech with English, Russian, Italian, and French, respectively.
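To make the role of the language-dependent constants concrete, here is a minimal sketch of the Flesch Reading Ease score with its original English constants (206.835, 1.015 and 84.6) exposed as parameters. The function names, the regex-based sentence and word splitting, and the vowel-group syllable counter are assumptions chosen for brevity, not the method used in the paper. The Czech constants are not given in this abstract and are therefore not shown.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption): one syllable per run of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(
    text: str,
    base: float = 206.835,           # English constant
    sentence_weight: float = 1.015,  # English constant
    syllable_weight: float = 84.6,   # English constant
) -> float:
    """Score = base - sentence_weight * (words / sentences)
                    - syllable_weight * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return (
        base
        - sentence_weight * words_per_sentence
        - syllable_weight * syllables_per_word
    )


if __name__ == "__main__":
    print(flesch_reading_ease("The cat sat. The dog ran."))
```

Adapting the formula to another language amounts to replacing the three constants (and the syllable counter) with values fitted for that language, which is why they are passed as arguments here rather than hard-coded.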


2021 ◽  
Vol 12 (4) ◽  
pp. 48-52
Author(s):  
Strilets V. ◽  

Corpus technologies (corpora of English and Ukrainian texts and tools for their processing) represent modern specialized discourse and facilitate searching for and comparing different units of translation, which makes them a useful tool for both practicing and trainee translators. The purpose of this article is to determine the role and place of corpus technologies in teaching specialized translation, using the oil and gas industry as an example. Comparative and parallel text corpora are characterized. The paper presents methods of applying mono- and bilingual comparative and parallel corpora and corpus managers for acquiring knowledge about genre-stylistic features of texts and for developing skills in distinguishing a term and determining its collocation profile and semantic preference, analyzing translation techniques, and translating collocations, complex noun constructions, verbal phrases, and abbreviations. Examples of relevant exercises and tasks that should be performed at the translation training stage are given. Further research should be aimed at integrating corpus-based tasks into the translation practice stage, which involves carrying out a translation project.


2021 ◽  
pp. 191-214
Author(s):  
Dmitrij Dobrovol’skij ◽  
Ludmila Pöppel

Literator ◽  
2021 ◽  
Vol 42 (1) ◽  
Author(s):  
Nomsa J. Skosana ◽  
Respect Mlambo

The scarcity of adequate resources for South African languages poses a huge challenge for their functional development in specialised fields such as science and technology. The study examines the Autshumato Machine Translation (MT) Web Service, created by the Centre for Text Technology at the North-West University. This software supports both formal and informal translations as a machine-aided human translation tool. We investigate the system in terms of its advantages and limitations and suggest possible solutions for South African languages. The results show that the system is essential as it offers high-speed translation and operates as an open-source platform. It also provides multiple translations from sentences, documents and web pages. Only some South African languages are included, and we find this to be a limitation of the system. We also find that the system was trained with a limited amount of data, and this has an adverse effect on the quality of the output. The study suggests that adding specialised parallel corpora from various contemporary fields for all official languages and involving language experts in the pre-editing of training data can be a major step towards improving the quality of the system’s output. The study also recommends that developers consider integrating the system with other natural language processing applications. Finally, the initiatives discussed in this study will help to make this MT system a more effective translation tool for all the official languages of South Africa.

