Learning Japanese-English Bilingual Word Embeddings by Using Language Specificity

Author(s):  
Yuting Song ◽  
Biligsaikhan Batjargal ◽  
Akira Maeda

Cross-lingual word embeddings have been gaining attention because they capture the semantic meaning of words across languages, which can be applied to cross-lingual tasks. Most methods learn a single mapping (e.g., a linear mapping) to transform a word embedding space from one language to another. To improve bilingual word embeddings, we propose a method that adds a language-specific mapping. We focus on learning a Japanese-English bilingual word embedding mapping that takes the specificity of the Japanese language into account. We evaluated our method against single-mapping-based models on bilingual lexicon induction between Japanese and English and found it more effective, with significant improvements on words of Japanese origin.
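
The single-mapping baseline that this work extends is commonly solved as an orthogonal Procrustes problem: given embeddings for seed translation pairs, find the rotation that best maps one space onto the other. The tiny synthetic example below (data and dimensions invented for illustration; this is not the authors' code) recovers a hidden rotation exactly:

```python
import numpy as np

# Minimal sketch of the single-mapping baseline: learn an orthogonal map W
# from a "Japanese" space to an "English" space via orthogonal Procrustes,
# using aligned seed pairs. All data here is synthetic.
rng = np.random.default_rng(0)
d = 4                                          # embedding dimension
E_en = rng.normal(size=(5, d))                 # "English" seed embeddings
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # hidden rotation between spaces
E_ja = E_en @ R.T                              # "Japanese" space = rotated English

# Procrustes solution: W = U V^T, where U S V^T = svd(E_ja^T E_en)
U, _, Vt = np.linalg.svd(E_ja.T @ E_en)
W = U @ Vt
err = np.linalg.norm(E_ja @ W - E_en)          # ~0: rotation recovered
```

A language-specific mapping, as proposed here, would add a second transformation applied only to the subset of words where the single map falls short.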

Author(s):  
Hailong Cao ◽  
Tiejun Zhao

Inspired by the observation that word embeddings exhibit isomorphic structure across languages, we propose a novel method to induce a bilingual lexicon from only two sets of word embeddings, trained on monolingual source and target data respectively. This is achieved by formulating the task as point set registration, a more general problem. We show that a transformation from the source to the target embedding space can be learned automatically without any form of cross-lingual supervision. By properly adapting a traditional point set registration model to make it suitable for processing word embeddings, we achieved state-of-the-art performance on the unsupervised bilingual lexicon induction task. Because the point set registration problem is well studied and can be solved by many elegant models, our approach opens up a new opportunity to capture the universal lexical semantic structure across languages.
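
As a hedged illustration of the registration framing (a generic ICP-style loop, not the authors' adapted model), the sketch below alternates nearest-neighbour matching with an orthogonal Procrustes update and recovers a hidden rotation between two unordered 2-D point sets without any seed correspondences:

```python
import numpy as np

# Toy point set registration: source points on a circle, target points are
# the same points rotated by an unknown angle and shuffled. No supervision.
angles = np.linspace(0, 2 * np.pi, 12, endpoint=False)
X = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # source set
theta = np.deg2rad(10)                                   # hidden rotation
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
perm = np.random.default_rng(0).permutation(12)
Y = X[perm] @ R_true.T                                   # target set, order unknown

R = np.eye(2)
for _ in range(5):
    # step 1: match each transformed source point to its nearest target point
    D = np.linalg.norm((X @ R.T)[:, None, :] - Y[None, :, :], axis=2)
    nn = D.argmin(axis=1)
    # step 2: orthogonal Procrustes update on the matched pairs
    U, _, Vt = np.linalg.svd(X.T @ Y[nn])
    R = (U @ Vt).T
```

Real embedding spaces are only approximately isomorphic and high-dimensional, which is why the paper adapts a probabilistic registration model rather than this hard-assignment loop.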


2019 ◽  
Vol 65 ◽  
pp. 569-631 ◽  
Author(s):  
Sebastian Ruder ◽  
Ivan Vulić ◽  
Anders Søgaard

Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.


Digital ◽  
2021 ◽  
Vol 1 (3) ◽  
pp. 145-161
Author(s):  
Kowshik Bhowmik ◽  
Anca Ralescu

This article presents a systematic literature review on quantifying the proximity between independently trained monolingual word embedding spaces. The search was carried out in the broader context of inducing bilingual lexicons from cross-lingual word embeddings, especially for low-resource languages, and the returned articles were then classified. Cross-lingual word embeddings have drawn the attention of researchers in natural language processing (NLP). Although existing methods yield satisfactory results for resource-rich languages and languages related to them, several researchers have pointed out that the same is not true for low-resource and distant languages. In this paper, we report on research into methods proposed to better represent low-resource and distant languages in the cross-lingual word embedding space.


Author(s):  
Aya Kutsuki

Previous research has paid much attention to bilingual children's overall vocabulary acquisition in comparison with their monolingual counterparts. Much less attention has been paid to the types of words acquired and to possible transfer or cross-linguistic effects of the other language on vocabulary development. This study therefore explores similarities and dissimilarities between the vocabularies of simultaneous bilinguals and Japanese monolinguals and considers a possible cross-linguistic similarity effect on word acquisition. Six simultaneous Japanese–English bilingual children (mean age = 34.75 months (2.56)) were language-age-matched with six Japanese monolinguals, and their productive vocabularies were compared in terms of size and categories. Additionally, characteristic acquired words were compared using correspondence analyses. The results showed that, although delayed due to reduced input, young bilinguals have vocabularies similar in word category to those of monolinguals. However, bilingual children's vocabularies reflect their unevenly distributed experience with each language: fewer interactive experiences with speakers of a language may result in lower acquisition of interactive words. Furthermore, there is a cross-linguistic effect on acquisition, likely caused by the formal similarity between Japanese katakana words and English words. Even between languages with great dissimilarities, resources and cues are sought and used to facilitate bilingual vocabulary acquisition.


2018 ◽  
Vol 15 (4) ◽  
pp. 29-44 ◽  
Author(s):  
Yi Zhao ◽  
Chong Wang ◽  
Jian Wang ◽  
Keqing He

With the rapid growth of web services on the internet, web service discovery has become a hot topic in services computing. Faced with heterogeneous and unstructured service descriptions, many service clustering approaches have been proposed to support web service discovery, and many others leverage auxiliary features to enhance the classical LDA model for better clustering performance. However, these extended LDA approaches still have limitations in handling data sparsity and noise words. This article proposes a novel web service clustering approach that incorporates word embeddings into LDA, using relevant words obtained from the embeddings to improve clustering performance. Specifically, word embeddings are trained with Word2vec, and the semantically relevant words of service keywords are then incorporated into the LDA training process. Experiments conducted on a real-world dataset published on ProgrammableWeb show that the proposed approach achieves better clustering performance than several classical approaches.
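
The augmentation step can be illustrated with a toy sketch. The embedding table, vocabulary, and `expand` helper below are invented for illustration (they are not the authors' pipeline, and real vectors would come from Word2vec): each keyword in a service description is expanded with its nearest neighbours under cosine similarity before the documents are handed to LDA.

```python
import numpy as np

# Hand-made stand-in for trained Word2vec vectors (illustrative only).
emb = {
    "weather":  np.array([0.9, 0.1, 0.0]),
    "forecast": np.array([0.8, 0.2, 0.1]),
    "climate":  np.array([0.7, 0.3, 0.0]),
    "payment":  np.array([0.0, 0.1, 0.9]),
    "billing":  np.array([0.1, 0.0, 0.8]),
}

def expand(tokens, k=1):
    """Append the k nearest embedding neighbours of each known token."""
    out = list(tokens)
    for t in tokens:
        if t not in emb:
            continue  # out-of-vocabulary tokens are kept but not expanded
        v = emb[t]
        sims = {w: float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
                for w, u in emb.items() if w != t}
        out += sorted(sims, key=sims.get, reverse=True)[:k]
    return out

doc = ["weather", "api"]
expanded = expand(doc)   # adds "forecast", the nearest neighbour of "weather"
```

The expanded documents carry extra co-occurrence signal, which is what helps LDA cope with short, sparse service descriptions.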


Author(s):  
Miroslav Kubát ◽  
Jan Hůla ◽  
Xinying Chen ◽  
Radek Čech ◽  
Jiří Milička

Abstract: This is a pilot study of the usability of the Context Specificity measure for stylometric purposes. Specifically, the Word2vec word embedding approach, based on measuring lexical context similarity between lemmas, is applied to the analysis of texts belonging to different styles. Three types of Czech texts are investigated: fiction, non-fiction, and journalism. Forty lemmas were observed (10 each for verbs, nouns, adjectives, and adverbs). The aim of the present study is to introduce the concept of Context Specificity and to test whether the measurement is sensitive to different styles. The results show that the proposed Closest Context Specificity (CCS) method is independent of corpus size and has promising potential for analyzing different styles.
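
The abstract does not give the CCS formula, so the sketch below illustrates only the ingredient it builds on, lexical context similarity between lemmas, with a deliberately simple stand-in: count-based context vectors from a tiny corpus, and a toy "closest context" score defined as the cosine similarity to the most similar other lemma. The corpus, window size, and the `ccs` definition are all assumptions for illustration, not the published measure.

```python
from collections import Counter
import math

# Toy corpus; context vectors are raw co-occurrence counts in a +/-1 window.
corpus = ["the cat sat on the mat",
          "the dog sat on the rug",
          "a cat chased a dog"]

ctx = {}
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        neigh = toks[max(0, i - 1):i] + toks[i + 1:i + 2]
        ctx.setdefault(w, Counter()).update(neigh)

def cos(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def ccs(word):
    """Toy 'closest context' score: similarity to the most similar other lemma."""
    return max(cos(ctx[word], ctx[w]) for w in ctx if w != word)
```

On this corpus, "cat" and "dog" occur in near-identical contexts, so their context vectors are highly similar, which is the kind of signal a word2vec model distils into its learned vectors.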


2018 ◽  
Author(s):  
Ruochen Xu ◽  
Yiming Yang ◽  
Naoki Otani ◽  
Yuexin Wu
