WEWD: A Combined Approach for Measuring Cross-lingual Semantic Word Similarity Based on Word Embeddings and Word Definitions

Word embeddings have become a standard resource in the toolset of any Natural Language Processing practitioner. While monolingual word embeddings encode information about words in the context of a particular language, cross-lingual embeddings define a multilingual space where word embeddings from two or more languages are integrated together. Current state-of-the-art approaches learn these embeddings by aligning two disjoint monolingual vector spaces through an orthogonal transformation which preserves the structure of the monolingual counterparts. In this work, we propose to apply an additional transformation after this initial alignment step, which aims to bring the vector representations of a given word and its translations closer to their average. Since this additional transformation is non-orthogonal, it also affects the structure of the monolingual spaces. We show that our approach improves both the integration of the monolingual spaces and the quality of the monolingual spaces themselves. Furthermore, because our transformation can be applied to an arbitrary number of languages, we are able to effectively obtain a truly multilingual space. The resulting (monolingual and multilingual) spaces show consistent gains over the current state-of-the-art in standard intrinsic tasks, namely dictionary induction and word similarity, as well as in extrinsic tasks such as cross-lingual hypernym discovery and cross-lingual natural language inference.
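The averaging step described in this abstract can be illustrated with a toy sketch. It assumes the initial orthogonal alignment has already been applied, works in two dimensions only so that the least-squares map has a simple closed form, and uses invented vectors; the abstract does not state the training objective, so the least-squares fit toward the average is an assumption, and the function names are hypothetical.

```python
from math import sqrt

def fit_linear_map_2d(X, Y):
    """Least-squares 2x2 map W minimising ||XW - Y||^2 via the normal equations."""
    a11 = sum(x[0] * x[0] for x in X)
    a12 = sum(x[0] * x[1] for x in X)
    a22 = sum(x[1] * x[1] for x in X)
    det = a11 * a22 - a12 * a12
    inv = [[a22 / det, -a12 / det], [-a12 / det, a11 / det]]  # (X^T X)^-1
    B = [[sum(x[i] * y[j] for x, y in zip(X, Y)) for j in range(2)]
         for i in range(2)]                                   # X^T Y
    return [[sum(inv[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply_map(W, x):
    """Row-vector convention: returns x W."""
    return [x[0] * W[0][j] + x[1] * W[1][j] for j in range(2)]

def mean_pair_distance(A, B):
    """Mean Euclidean distance between corresponding rows of A and B."""
    return sum(sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
               for u, v in zip(A, B)) / len(A)

# Toy, already-aligned vectors for three translation pairs (invented data).
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.8, 0.2], [0.2, 0.8], [1.0, 1.0]]

# Target for each pair: the average of the word vector and its translation.
avg = [[(s + t) / 2 for s, t in zip(u, v)] for u, v in zip(src, tgt)]

# One non-orthogonal map per language, each fitted toward the averages.
W_src = fit_linear_map_2d(src, avg)
W_tgt = fit_linear_map_2d(tgt, avg)

before = mean_pair_distance(src, tgt)
after = mean_pair_distance([apply_map(W_src, u) for u in src],
                           [apply_map(W_tgt, v) for v in tgt])
print(before, after)
```

The toy data were chosen so that an exact fit exists, which makes translation pairs coincide after mapping; on real embeddings the fit would only be approximate.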

Learning Lexical Subspaces in a Distributional Vector Space

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00316 ◽

2020 ◽

Vol 8 ◽

pp. 311-329

Author(s):

Kushal Arora ◽

Aishik Chakraborty ◽

Jackie C. K. Cheung

Keyword(s):

Vector Space ◽

Semantic Relations ◽

Distributional Semantics ◽

Word Embeddings ◽

Word Similarity ◽

Lexical Semantic ◽

Novel Approach ◽

Classification Tasks

In this paper, we propose LexSub, a novel approach towards unifying lexical and distributional semantics. We inject knowledge about lexical-semantic relations into distributional word embeddings by defining subspaces of the distributional vector space in which a lexical relation should hold. Our framework can handle symmetric attract and repel relations (e.g., synonymy and antonymy, respectively), as well as asymmetric relations (e.g., hypernymy and meronymy). In a suite of intrinsic benchmarks, we show that our model outperforms previous approaches on relatedness tasks and on hypernymy classification and detection, while being competitive on word similarity tasks. It also outperforms previous systems on extrinsic classification tasks that benefit from exploiting lexical relational cues. We perform a series of analyses to understand the behaviors of our model. Code available at https://github.com/aishikchakraborty/LexSub.
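As a rough illustration of the idea of a relation-specific subspace, the sketch below projects two word vectors with a hypothetical projection matrix and compares their distance in the full space and in the subspace. The matrix, the vectors, and the function names are invented for illustration; the abstract does not specify how LexSub parameterises or trains its subspaces.

```python
from math import sqrt

def project(P, v):
    """Apply a (k x d) projection matrix P to a d-dimensional vector v."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical "synonymy" subspace: keeps only the first two coordinates.
P_syn = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]

# Invented 3-dimensional embeddings for two near-synonyms.
big = [0.9, 0.1, 0.5]
large = [0.9, 0.1, -0.5]

full_space = distance(big, large)
in_subspace = distance(project(P_syn, big), project(P_syn, large))
print(full_space, in_subspace)
```

An attract relation would push the subspace distance down for related pairs, while a repel relation would push it up; the original distributional space is left free to encode other information.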

How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions

10.18653/v1/p19-1070 ◽

2019 ◽

Cited By ~ 6

Author(s):

Goran Glavaš ◽

Robert Litschko ◽

Sebastian Ruder ◽

Ivan Vulić

Keyword(s):

Word Embeddings ◽

Comparative Analyses ◽

Cross Lingual

Two-Level Transformer and Auxiliary Coherence Modeling for Improved Text Segmentation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6284 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7797-7804

Author(s):

Goran Glavaš ◽

Swapna Somasundaran

Keyword(s):

State Of The Art ◽

Language Transfer ◽

Text Segmentation ◽

Word Embeddings ◽

Neural Architecture ◽

Text Coherence ◽

Sentence Level ◽

Proposed Model ◽

Benchmark Datasets ◽

Cross Lingual

Breaking down the structure of long texts into semantically coherent segments makes the texts more readable and supports downstream applications like summarization and retrieval. Starting from an apparent link between text coherence and segmentation, we introduce a novel supervised model for text segmentation with simple but explicit coherence modeling. Our model – a neural architecture consisting of two hierarchically connected Transformer networks – is a multi-task learning model that couples the sentence-level segmentation objective with the coherence objective that differentiates correct sequences of sentences from corrupt ones. The proposed model, dubbed Coherence-Aware Text Segmentation (CATS), yields state-of-the-art segmentation performance on a collection of benchmark datasets. Furthermore, by coupling CATS with cross-lingual word embeddings, we demonstrate its effectiveness in zero-shot language transfer: it can successfully segment texts in languages unseen in training.
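The coupling of a segmentation objective with a coherence objective can be sketched as a weighted sum of two losses. This is only an illustration under stated assumptions: the abstract does not say which loss functions CATS uses, so the binary cross-entropy, the margin-based contrast, the `weight` parameter, and all function names here are hypothetical choices.

```python
from math import log

def segmentation_loss(probs, labels):
    """Mean binary cross-entropy over per-sentence segment-boundary predictions."""
    eps = 1e-12  # guards against log(0)
    total = sum(y * log(p + eps) + (1 - y) * log(1 - p + eps)
                for p, y in zip(probs, labels))
    return -total / len(labels)

def coherence_loss(score_correct, score_corrupt, margin=1.0):
    """Hinge loss: the correct sequence should outscore the corrupt one by `margin`."""
    return max(0.0, margin - (score_correct - score_corrupt))

def joint_loss(probs, labels, score_correct, score_corrupt, weight=0.5):
    """Multi-task objective: segmentation loss plus weighted coherence loss."""
    return (segmentation_loss(probs, labels)
            + weight * coherence_loss(score_correct, score_corrupt))

# Invented example: three sentences, the second one starts a new segment.
loss = joint_loss(probs=[0.1, 0.8, 0.2], labels=[0, 1, 0],
                  score_correct=0.2, score_corrupt=0.1)
print(loss)
```

When the correct sequence already outscores the corrupt one by at least the margin, the coherence term vanishes and only the segmentation term drives training.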

Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity

Computational Linguistics ◽

10.1162/coli_a_00391 ◽

2020 ◽

pp. 1-51

Author(s):

Ivan Vulić ◽

Simon Baker ◽

Edoardo Maria Ponti ◽

Ulla Petti ◽

Ira Leviant ◽

...

Keyword(s):

Semantic Similarity ◽

Large Scale ◽

Representation Learning ◽

Data Sets ◽

Word Embeddings ◽

Data Set ◽

Lexical Representations ◽

Language Data ◽

Weakly Supervised ◽

Cross Lingual

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses that can help guide future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
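The figure of 66 crosslingual data sets is consistent with pairing each of the 12 languages with every other one: 12 × 11 / 2 = 66. The sketch below checks that count and shows how a model could be scored against human similarity ratings. Spearman rank correlation is assumed here as the metric because it is the usual choice for word similarity benchmarks; the abstract itself does not name one, and the scores in the example are invented.

```python
from itertools import combinations
from math import sqrt

def count_language_pairs(n_languages):
    """Number of unordered language pairs among n_languages."""
    return sum(1 for _ in combinations(range(n_languages), 2))

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

n_pairs = count_language_pairs(12)

# Invented human ratings and model cosine similarities for four word pairs.
human = [5.8, 1.2, 3.4, 0.3]
model = [0.71, 0.20, 0.65, 0.05]
rho = spearman(human, model)
print(n_pairs, rho)
```

Because the metric depends only on rank order, a model is rewarded for ordering the concept pairs as annotators do, regardless of the absolute scale of its similarity scores.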

Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings

Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '15 ◽

10.1145/2766462.2767752 ◽

2015 ◽

Cited By ~ 73

Author(s):

Ivan Vulić ◽

Marie-Francine Moens

Keyword(s):

Information Retrieval ◽

Word Embeddings ◽

Retrieval Models ◽

Cross Lingual

Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing

10.18653/v1/n19-1162 ◽

2019 ◽

Cited By ~ 11

Author(s):

Tal Schuster ◽

Ori Ram ◽

Regina Barzilay ◽

Amir Globerson

Keyword(s):

Word Embeddings ◽

Dependency Parsing ◽

Cross Lingual

Why Overfitting Isn’t Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries

10.18653/v1/2020.acl-main.201 ◽

2020 ◽

Author(s):

Mozhi Zhang ◽

Yoshinari Fujinuma ◽

Michael J. Paul ◽

Jordan Boyd-Graber

Keyword(s):

Word Embeddings ◽

Cross Lingual

Learning Cross-lingual Word Embeddings via Matrix Co-factorization

10.3115/v1/p15-2093 ◽

2015 ◽

Cited By ~ 7

Author(s):

Tianze Shi ◽

Zhiyuan Liu ◽

Yang Liu ◽

Maosong Sun

Keyword(s):

Word Embeddings ◽

Cross Lingual

Multi-Sense Embeddings per Word

10.31219/osf.io/udfhn ◽

2020 ◽

Author(s):

Masashi Sugiyama

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Research Area ◽

Word Embedding ◽

The Other ◽

Word Embeddings ◽

Word Similarity ◽

Better Than ◽

Non Parametric

Word embeddings have recently been used successfully in many natural language processing problems, and how to efficiently train a robust and accurate word embedding system is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn a separate vector for each sense of a word. Therefore, in this project, we explore two multi-sense word embedding models: the Multi-Sense Skip-gram (MSSG) model and the Non-parametric Multi-Sense Skip-gram (NP-MSSG) model. Furthermore, we propose an extension of the Multi-Sense Skip-gram model, called the Incremental Multi-Sense Skip-gram (IMSSG) model, which learns the vectors of all senses of a word incrementally. We evaluate all the systems on a word similarity task and show that IMSSG performs better than the other models.
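The sense-selection step commonly associated with multi-sense skip-gram models can be sketched as follows: the context words are averaged, and the sense whose context-cluster centre is most similar to that average is chosen. In the non-parametric variant, a new sense is opened when no existing cluster is similar enough. The abstract gives none of these details, so the threshold value, the vectors, and the function names are assumptions, and the proposed IMSSG extension is not covered here.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def choose_sense(context_vectors, cluster_centres, new_sense_threshold=None):
    """Return the index of the best-matching sense for the given context.

    With a threshold (non-parametric style), return len(cluster_centres)
    to signal that a new sense should be created when no centre is
    similar enough to the averaged context.
    """
    context = mean_vector(context_vectors)
    sims = [cosine(context, centre) for centre in cluster_centres]
    best = max(range(len(sims)), key=sims.__getitem__)
    if new_sense_threshold is not None and sims[best] < new_sense_threshold:
        return len(cluster_centres)
    return best

# Invented cluster centres for two senses of one word.
centres = [[1.0, 0.0], [0.0, 1.0]]

fixed = choose_sense([[0.9, 0.1], [0.8, 0.3]], centres)
grown = choose_sense([[-1.0, -1.0]], centres, new_sense_threshold=0.5)
print(fixed, grown)
```

In the first call the context is closest to the first centre; in the second, neither centre reaches the threshold, so the returned index points one past the existing senses.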
