bilingual corpora
Recently Published Documents


TOTAL DOCUMENTS

77
(FIVE YEARS 10)

H-INDEX

12
(FIVE YEARS 1)

2021 ◽  
pp. 1-28
Author(s):  
Laia FIBLA ◽  
Nuria SEBASTIAN-GALLES ◽  
Alejandrina CRISTIA

Abstract Since there are no systematic pauses delimiting words in speech, the problem of word segmentation is formidable even for monolingual infants. We use computational modeling to assess whether word segmentation is substantially harder in a bilingual than a monolingual setting. Seven algorithms representing different cognitive approaches to segmentation are applied to transcriptions of naturalistic input to young children, carefully processed to generate perfectly matched monolingual and bilingual corpora. We vary the overlap in phonology and lexicon experienced by modeling exposure to languages that are more similar (Catalan and Spanish) or more different (English and Spanish). We find that the greatest variation in performance is due to different segmentation algorithms and the second greatest to language, with bilingualism having effects that are smaller than both algorithm and language effects. Implications of these computational results for experimental and modeling approaches to language acquisition are discussed.


2021 ◽  
pp. 1-24
Author(s):  
Mohamed Chebel ◽  
Chiraz Latiri ◽  
Eric Gaussier

Abstract Bilingual corpora are an essential resource used to cross the language barrier in multilingual natural language processing tasks. Among bilingual corpora, comparable corpora have been the subject of many studies as they are both frequent and easily available. In this paper, we propose to make use of formal concept analysis to first construct concept vectors which can be used to enhance comparable corpora through clustering techniques. We then show how one can extract bilingual lexicons of improved quality from these enhanced corpora. We finally show that the bilingual lexicons obtained can complement existing bilingual dictionaries and improve cross-language information retrieval systems.


2021 ◽  
pp. 136700692110283
Author(s):  
T. Mark Ellison ◽  
Aung Si

Aims and objectives: Numerical indices developed by Guzman et al., which helped characterize code-switching (CS) patterns in Spanish–English bilingual corpora, were tested on a Hindi–English bilingual corpus. Two main research questions were addressed: first, how does Hindi–English compare with Spanish–English, and second, are there measurable differences in broad CS patterns between older and younger speakers? Methodology: Television interviews of Hindi movie (Bollywood) personalities were transcribed and coded for Hindi and English lexemes. Bespoke software in Python was used to calculate the required indices, which provided information on variables such as the level of language mixing, switching probability, and the distributions of single-language spans. Further indices, such as mean span length and an approximate ratio of insertions to alternations, were also calculated. Data and analysis: The indices calculated for the Hindi–English corpus broadly match those calculated for Spanish–English. Statistically significant differences between the older and younger groups were detected for some key indices, with older speakers generally using less English. High levels of intra-group variability may be responsible for some indices not showing statistically significant diachronic change. Conclusions: The Guzman et al. indices suggest that Hindi–English and Spanish–English CS resemble each other in certain ways. There have been broad changes in Hindi–English CS patterns over the last few decades, but there are indications that the CS behaviour of individual speakers might change in different ways. Originality: This is the first study to systematically and quantitatively investigate age-related differences in Hindi–English CS, using naturalistic speech in semi-controlled conditions. Implications: The quantitative indices investigated in this study can be used to compare CS behaviour in different language pairs, and can also help detect diachronic changes in CS patterns.
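The kinds of indices named in this abstract (level of language mixing, switching probability, span lengths) can be illustrated with a minimal Python sketch. It is not the authors' bespoke software. The function name `cs_indices`, the tag labels, and the specific formulas are assumptions: the mixing measure is written as the commonly cited M-index, (1 - Σp²) / ((k - 1) Σp²), and the switching probability as the number of language switches divided by the number of adjacent token pairs.

```python
from collections import Counter
from itertools import groupby


def cs_indices(tags):
    """Compute simple code-switching indices from per-token language tags.

    `tags` is a list such as ["hi", "hi", "en"] (hypothetical labels).
    The formulas are assumed textbook forms, not taken from the abstract.
    """
    n = len(tags)
    if n < 2:
        raise ValueError("need at least two tagged tokens")

    counts = Counter(tags)
    k = len(counts)
    sum_sq = sum((c / n) ** 2 for c in counts.values())

    # Mixing level: 0.0 for monolingual text, 1.0 for perfectly balanced use.
    m_index = 0.0 if k < 2 else (1 - sum_sq) / ((k - 1) * sum_sq)

    # Switching probability: share of adjacent token pairs that change language.
    switches = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
    i_index = switches / (n - 1)

    # Single-language spans: lengths of maximal runs of one language.
    spans = [len(list(run)) for _, run in groupby(tags)]
    mean_span = sum(spans) / len(spans)

    return {
        "m_index": m_index,
        "i_index": i_index,
        "spans": spans,
        "mean_span_length": mean_span,
    }


if __name__ == "__main__":
    print(cs_indices(["hi", "hi", "en", "hi", "en", "en"]))
```

Working from one tag per token keeps the sketch independent of any transcription format; a real pipeline would also need a policy for untagged items such as names or fillers.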


Author(s):  
Mingjun Zhao ◽  
Haijiang Wu ◽  
Di Niu ◽  
Zixuan Wang ◽  
Xiaoli Wang

Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1493
Author(s):  
Hanan A. Hosni Mahmoud ◽  
Hanan Abdullah Mengash

In this paper, we introduce new concepts in the machine translation paradigm. We treat the corpus as a database of frequent word sets. A translation request triggers association rules joining phrases present in the source language and phrases present in the target language. Note that a sequential scan of the corpus for such phrases would increase the response time unpredictably. We therefore pre-process the bilingual corpus into a proposed data structure called Corpus-Trie (CT), which renders a bilingual parallel corpus in a compact form representing frequent item sets. We also present algorithms which utilize the CT to respond to translation requests and explore novel techniques in exhaustive experiments. Experiments were performed on specific language pairs, although the proposed method is not restricted to any specific language. Moreover, the proposed Corpus-Trie can be extended from bilingual corpora to accommodate multi-language corpora. Experiments indicated that the response time of a translation request is logarithmic in the count of unrepeated phrases in the original bilingual corpus (and thus in the Corpus-Trie size). In practical situations, 5–20% of the log of the number of nodes has to be visited. The experimental results indicate that the BLEU score for the proposed CT system increases with the number of phrases in the CT, for both English-Arabic and English-French translations. The proposed CT system was demonstrated to outperform both Omega-T and Apertium in translation quality once the corpus size exceeded 1,600,000 phrases for English-Arabic translation, and 300,000 phrases for English-French translation.
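The general idea of indexing a parallel corpus in a trie, so that a request does not trigger a sequential scan, can be sketched as follows. This is a generic token-level phrase trie written for illustration, not the authors' Corpus-Trie: the abstract does not specify the node layout or the association-rule step, so the class names `TrieNode` and `PhraseTrie`, the `insert` and `longest_match` methods, and the sample phrase pairs are all hypothetical.

```python
class TrieNode:
    """One node per source-language token; translations live at phrase ends."""

    __slots__ = ("children", "translations")

    def __init__(self):
        self.children = {}
        self.translations = []


class PhraseTrie:
    """Hypothetical token-level trie mapping source phrases to target phrases."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, source_phrase, target_phrase):
        node = self.root
        for token in source_phrase.split():
            if token not in node.children:
                node.children[token] = TrieNode()
            node = node.children[token]
        if target_phrase not in node.translations:
            node.translations.append(target_phrase)

    def longest_match(self, tokens, start=0):
        """Return (matched_length, translations) for the longest stored
        phrase beginning at tokens[start]; (0, []) if nothing matches."""
        node = self.root
        best = (0, [])
        for offset, token in enumerate(tokens[start:], 1):
            node = node.children.get(token)
            if node is None:
                break
            if node.translations:
                best = (offset, list(node.translations))
        return best


if __name__ == "__main__":
    trie = PhraseTrie()
    trie.insert("good morning", "bonjour")
    trie.insert("good", "bon")
    print(trie.longest_match("good morning friend".split()))
```

In this sketch a lookup visits at most one node per token of the requested phrase, regardless of corpus size. That differs from the logarithmic bound reported in the abstract, which presumably reflects a different internal organisation.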


2021 ◽  
Vol 22 (4) ◽  
pp. 1079-1087
Author(s):  
V. V. Karapets

The present research examined narrative peculiarities that can belong either to the narrator or to other characters and that violate the initial focalization. The paper focuses on zero focalization and the evaluation of characters. The research objective was to analyze Russian equivalents of the adjective pauvre and the noun bonhomme, which carry subjective evaluation in the naming of a character. The research featured fragments from G. Flaubert’s novel Madame Bovary and its nine Russian translations. They were selected according to the continuous sampling method and then analyzed using the comparative method. The study was based on monolingual and bilingual Russian and French dictionaries. The multiple translations often preserved the original evaluation, especially in the case of pauvre. Pauvre has more Russian equivalents than bonhomme. Even so, lexical isomorphism was not always preserved in translation: in some cases, the evaluation was neutralized, or even reversed, and the Russian variants acquired extra semantic and stylistic meanings. These flaws may change the initial narrator-character or character-narrator point of view. They could have resulted from the lack of Russian equivalents. In early translations, they might be explained by the translator’s choice of obsolete words with an archaic effect or by the addition of extra semantic and stylistic information. In some cases, translators might have ignored the original meaning or could have been misled by the overall peculiarities of the narrative. The research can contribute to bilingual corpora of parallel texts, as well as to textbooks on comparative lexicology.


2020 ◽  
pp. 136700692095672
Author(s):  
Antje Endesfelder Quick ◽  
Dorota Gaskins ◽  
Oksana Bailleul ◽  
Maria Frick ◽  
Elina Palola

Objectives: This study investigates monolingual and code-mixed utterances in four bilingual children with different language combinations (German–English, English–Polish, Finnish–English, and French–Russian) in terms of utterance lengths (MLUs) and complexities, offering a usage-based (UB) explanation based on cognitive mechanisms. Methodology: Utterances from four different child bilingual corpora were extracted and coded for individual monolingual languages and bilingual utterances. Data and analysis: 35,441 utterances produced between the ages of 2 and 4 were analyzed in terms of MLU and syntactic complexity. Findings/conclusions: Results showed that for all children monolingual MLUs and complexities reflect their input situations: the more input in one language, the longer and more complex those utterances were. However, in all four children code-mixed utterances were longer and more complex from the beginning of the recordings. Implications: This is the first study that systematically compares MLU scores and complexities of monolingual and bilingual utterances, taking diverse language combinations into account and offering a UB explanation based on chunking and entrenchment processes as a new alternative for further research in bilingualism.
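The MLU comparison described above can be illustrated with a small Python sketch. It assumes MLU is counted in whitespace-separated words, whereas the study may count morphemes or use other conventions. The function name `mlu_by_type`, the type labels, and the sample utterances are hypothetical.

```python
from collections import defaultdict


def mlu_by_type(coded_utterances):
    """Mean length of utterance (in words) per utterance type.

    `coded_utterances` is an iterable of (label, text) pairs, where the
    label is a coding such as "german", "english", or "mixed" (assumed).
    Empty utterances are skipped.
    """
    lengths = defaultdict(list)
    for label, text in coded_utterances:
        words = text.split()
        if words:
            lengths[label].append(len(words))
    return {label: sum(v) / len(v) for label, v in lengths.items()}


if __name__ == "__main__":
    sample = [
        ("english", "more milk"),
        ("german", "das ist gut"),
        ("mixed", "das ist my big ball"),
        ("english", "I want that one"),
    ]
    print(mlu_by_type(sample))
```

Grouping by coded label mirrors the abstract's comparison of monolingual and code-mixed utterances; per-age breakdowns would need an extra key in each pair.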


Information ◽  
2019 ◽  
Vol 10 (9) ◽  
pp. 267
Author(s):  
Bin Li ◽  
Jianmin Yao

Bilingual web pages are widely used to mine translations of unknown terms. This study focused on an effective solution for obtaining relevant web pages, extracting translations with correct lexical boundaries, and ranking the translation candidates. This research adopted co-occurrence information to obtain the subject terms and then expanded the source query with the translation of the subject terms to collect effective bilingual search engine snippets. Afterwards, valid candidates were extracted from small-sized, noisy bilingual corpora using an improved frequency change measurement that combines adjacent information. This research developed a method that considers surface patterns, frequency–distance, and phonetic features to select an appropriate translation. The experimental results revealed that the proposed method performed remarkably well for mining translations of unknown terms.


Corpora ◽  
2019 ◽  
Vol 14 (1) ◽  
pp. 63-76
Author(s):  
Gabrielle Hodge ◽  
Kazuki Sekine ◽  
Adam Schembri ◽  
Trevor Johnston

The Auslan and Australian English archive and corpus is the first bilingual, multi-modal documentation of a deaf signed language (Auslan, the language of the Australian deaf community) and its ambient spoken language (Australian English). It aims to facilitate the direct comparison of face-to-face, multi-modal talk produced by deaf signers and hearing speakers from the same city. Here, we describe the documentation of the bilingual, multi-modal archive and outline its development pathway into a directly comparable corpus of a signed language and spoken language. We differentiate it from existing bilingual corpora and offer some research questions which the resulting corpus may be best placed to answer. The Auslan and Australian English corpus has the potential to redress several significant misunderstandings in the comparison of signed and spoken languages, especially those that follow from misapplications of the paradigm that multi-modal signed languages are used and structured in ways that are parallel to the uni-modal spoken or written conventions of spoken languages.

