Learning Japanese-English Bilingual Word Embeddings by Using Language Specificity

Author(s):  
Yuting Song ◽  
Biligsaikhan Batjargal ◽  
Akira Maeda

Cross-lingual word embeddings have been gaining attention because they capture the semantic meaning of words across languages, which can be applied to cross-lingual tasks. Most methods learn a single mapping (e.g., a linear mapping) to transform a word embedding space from one language to another. To improve bilingual word embeddings, we propose a method that adds a language-specific mapping. We focus on learning a Japanese-English bilingual word embedding mapping that takes the specificity of the Japanese language into account. We evaluated our method against single-mapping-based models on bilingual lexicon induction between Japanese and English and found it more effective, with significant improvements on words of Japanese origin.
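
The single-mapping baseline that this work extends is commonly solved as an orthogonal Procrustes problem: given embeddings for seed translation pairs, find the rotation that best maps one space onto the other. The tiny synthetic example below (data and dimensions invented for illustration; this is not the authors' code) recovers a hidden rotation exactly:

```python
import numpy as np

# Minimal sketch of the single-mapping baseline: learn an orthogonal map W
# from a "Japanese" space to an "English" space via orthogonal Procrustes,
# using aligned seed pairs. All data here is synthetic.
rng = np.random.default_rng(0)
d = 4                                          # embedding dimension
E_en = rng.normal(size=(5, d))                 # "English" seed embeddings
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # hidden rotation between spaces
E_ja = E_en @ R.T                              # "Japanese" space = rotated English

# Procrustes solution: W = U V^T, where U S V^T = svd(E_ja^T E_en)
U, _, Vt = np.linalg.svd(E_ja.T @ E_en)
W = U @ Vt
err = np.linalg.norm(E_ja @ W - E_en)          # ~0: rotation recovered
```

A language-specific mapping, as proposed here, would add a second transformation applied only to the subset of words where the single map falls short.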

Author(s):  
Hailong Cao ◽  
Tiejun Zhao

Inspired by the observation that word embeddings exhibit isomorphic structure across languages, we propose a novel method to induce a bilingual lexicon from only two sets of word embeddings, trained on monolingual source and target data respectively. This is achieved by formulating the task as point set registration, a more general problem. We show that a transformation from the source to the target embedding space can be learned automatically without any form of cross-lingual supervision. By properly adapting a traditional point set registration model to make it suitable for processing word embeddings, we achieved state-of-the-art performance on the unsupervised bilingual lexicon induction task. Because the point set registration problem is well studied and can be solved by many elegant models, our approach opens up a new opportunity to capture the universal lexical semantic structure across languages.
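
As a hedged illustration of the registration framing (a generic ICP-style loop, not the authors' adapted model), the sketch below alternates nearest-neighbour matching with an orthogonal Procrustes update and recovers a hidden rotation between two unordered 2-D point sets without any seed correspondences:

```python
import numpy as np

# Toy point set registration: source points on a circle, target points are
# the same points rotated by an unknown angle and shuffled. No supervision.
angles = np.linspace(0, 2 * np.pi, 12, endpoint=False)
X = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # source set
theta = np.deg2rad(10)                                   # hidden rotation
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
perm = np.random.default_rng(0).permutation(12)
Y = X[perm] @ R_true.T                                   # target set, order unknown

R = np.eye(2)
for _ in range(5):
    # step 1: match each transformed source point to its nearest target point
    D = np.linalg.norm((X @ R.T)[:, None, :] - Y[None, :, :], axis=2)
    nn = D.argmin(axis=1)
    # step 2: orthogonal Procrustes update on the matched pairs
    U, _, Vt = np.linalg.svd(X.T @ Y[nn])
    R = (U @ Vt).T
```

Real embedding spaces are only approximately isomorphic and high-dimensional, which is why the paper adapts a probabilistic registration model rather than this hard-assignment loop.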


2019 ◽  
Vol 65 ◽  
pp. 569-631 ◽  
Author(s):  
Sebastian Ruder ◽  
Ivan Vulić ◽  
Anders Søgaard

Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.


Digital ◽  
2021 ◽  
Vol 1 (3) ◽  
pp. 145-161
Author(s):  
Kowshik Bhowmik ◽  
Anca Ralescu

This article presents a systematic literature review on quantifying the proximity between independently trained monolingual word embedding spaces. The search was carried out in the broader context of inducing bilingual lexicons from cross-lingual word embeddings, especially for low-resource languages, and the returned articles were then classified. Cross-lingual word embeddings have drawn the attention of researchers in natural language processing (NLP). Although existing methods yield satisfactory results for resource-rich languages and languages related to them, several researchers have pointed out that the same is not true for low-resource and distant languages. In this paper, we report on research into methods proposed to better represent low-resource and distant languages in the cross-lingual word embedding space.


Author(s):  
Aya Kutsuki

Previous research has paid much attention to bilingual children's overall vocabulary acquisition in comparison with their monolingual counterparts. Much less attention has been paid to the types of words acquired and to possible transfer or cross-linguistic effects of the other language on vocabulary development. This study therefore explores similarities and dissimilarities between the vocabularies of simultaneous bilinguals and Japanese monolinguals and considers a possible cross-linguistic similarity effect on word acquisition. Six simultaneous Japanese–English bilingual children (mean age = 34.75 months (2.56)) were language-age-matched with six Japanese monolinguals, and their productive vocabularies were compared in terms of size and categories. Additionally, characteristic acquired words were compared using correspondence analyses. The results showed that, although delayed due to reduced input, young bilinguals have vocabularies similar in word category to those of monolinguals. However, bilingual children's vocabularies reflect their unevenly distributed experience with each language: fewer interactive experiences with speakers of a language may result in lower acquisition of interactive words. Furthermore, there is a cross-linguistic effect on acquisition, likely caused by the formal similarity between Japanese katakana words and English words. Even between languages with great dissimilarities, resources and cues are sought and used to facilitate bilingual vocabulary acquisition.


2018 ◽  
Vol 15 (4) ◽  
pp. 29-44 ◽  
Author(s):  
Yi Zhao ◽  
Chong Wang ◽  
Jian Wang ◽  
Keqing He

With the rapid growth of web services on the internet, web service discovery has become a hot topic in services computing. Faced with heterogeneous and unstructured service descriptions, many service clustering approaches have been proposed to support web service discovery, and many others leverage auxiliary features to enhance the classical LDA model for better clustering performance. However, these extended LDA approaches still have limitations in handling data sparsity and noise words. This article proposes a novel web service clustering approach that incorporates word embeddings into LDA, using relevant words obtained from the embeddings to improve clustering performance. Specifically, word embeddings are trained with Word2vec, and the semantically relevant words of service keywords are then incorporated into the LDA training process. Experiments conducted on a real-world dataset published on ProgrammableWeb show that the proposed approach achieves better clustering performance than several classical approaches.
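
The augmentation step can be illustrated with a toy sketch. The embedding table, vocabulary, and `expand` helper below are invented for illustration (they are not the authors' pipeline, and real vectors would come from Word2vec): each keyword in a service description is expanded with its nearest neighbours under cosine similarity before the documents are handed to LDA.

```python
import numpy as np

# Hand-made stand-in for trained Word2vec vectors (illustrative only).
emb = {
    "weather":  np.array([0.9, 0.1, 0.0]),
    "forecast": np.array([0.8, 0.2, 0.1]),
    "climate":  np.array([0.7, 0.3, 0.0]),
    "payment":  np.array([0.0, 0.1, 0.9]),
    "billing":  np.array([0.1, 0.0, 0.8]),
}

def expand(tokens, k=1):
    """Append the k nearest embedding neighbours of each known token."""
    out = list(tokens)
    for t in tokens:
        if t not in emb:
            continue  # out-of-vocabulary tokens are kept but not expanded
        v = emb[t]
        sims = {w: float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
                for w, u in emb.items() if w != t}
        out += sorted(sims, key=sims.get, reverse=True)[:k]
    return out

doc = ["weather", "api"]
expanded = expand(doc)   # adds "forecast", the nearest neighbour of "weather"
```

The expanded documents carry extra co-occurrence signal, which is what helps LDA cope with short, sparse service descriptions.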


Author(s):  
Miroslav Kubát ◽  
Jan Hůla ◽  
Xinying Chen ◽  
Radek Čech ◽  
Jiří Milička

Abstract: This is a pilot study of the usability of the Context Specificity measure for stylometric purposes. Specifically, the Word2vec word embedding approach, based on measuring lexical context similarity between lemmas, is applied to the analysis of texts belonging to different styles. Three types of Czech texts are investigated: fiction, non-fiction, and journalism. Forty lemmas were observed (10 each for verbs, nouns, adjectives, and adverbs). The aim of the present study is to introduce the concept of Context Specificity and to test whether the measurement is sensitive to different styles. The results show that the proposed Closest Context Specificity (CCS) method is independent of corpus size and has promising potential for analyzing different styles.
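
The abstract does not give the CCS formula, so the sketch below illustrates only the ingredient it builds on, lexical context similarity between lemmas, with a deliberately simple stand-in: count-based context vectors from a tiny corpus, and a toy "closest context" score defined as the cosine similarity to the most similar other lemma. The corpus, window size, and the `ccs` definition are all assumptions for illustration, not the published measure.

```python
from collections import Counter
import math

# Toy corpus; context vectors are raw co-occurrence counts in a +/-1 window.
corpus = ["the cat sat on the mat",
          "the dog sat on the rug",
          "a cat chased a dog"]

ctx = {}
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        neigh = toks[max(0, i - 1):i] + toks[i + 1:i + 2]
        ctx.setdefault(w, Counter()).update(neigh)

def cos(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def ccs(word):
    """Toy 'closest context' score: similarity to the most similar other lemma."""
    return max(cos(ctx[word], ctx[w]) for w in ctx if w != word)
```

On this corpus, "cat" and "dog" occur in near-identical contexts, so their context vectors are highly similar, which is the kind of signal a word2vec model distils into its learned vectors.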


2018 ◽  
Author(s):  
Ruochen Xu ◽  
Yiming Yang ◽  
Naoki Otani ◽  
Yuexin Wu
