Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks

Author(s):  
Ruan Chaves Rodrigues ◽  
Jéssica Rodrigues ◽  
Pedro Vitor Quinta de Castro ◽  
Nádia Felix Felipe da Silva ◽  
Anderson Soares
Author(s):  
Jose Manuel Gomez-Perez ◽  
Ronald Denaux ◽  
Andres Garcia-Silva

2018 ◽  
Vol 30 (23) ◽  
pp. e4489 ◽  
Author(s):  
Fengqi Yan ◽  
Qiaoqing Fan ◽  
Mingming Lu

2019 ◽  
Vol 9 (18) ◽  
pp. 3648
Author(s):  
Casper S. Shikali ◽  
Zhou Sijie ◽  
Liu Qihe ◽  
Refuoe Mokhosi

Deep learning has extensively been used in natural language processing with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, which is a low resource and widely spoken language in East and Central Africa. This study proposed novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the concern of word representation for agglutinative and syllabic-based languages. Inspired by the learning methodology of Swahili in beginner classes, we encoded respective syllables instead of characters, character n-grams or morphemes of words and generated quality word embeddings using a convolutional neural network. The quality of WEFSE was demonstrated by the state-of-art results in the syllable-aware language model on both the small dataset (31.229 perplexity value) and the medium dataset (45.859 perplexity value), outperforming character-aware language models. We further evaluated the word embeddings using word analogy task. To the best of our knowledge, syllabic alphabets have not been used to compose the word representation vectors. Therefore, the main contributions of the study are a syllabic alphabet, WEFSE, a syllabic-aware language model and a word analogy dataset for Swahili.


2020 ◽  
pp. 1-51
Author(s):  
Ivan Vulić ◽  
Simon Baker ◽  
Edoardo Maria Ponti ◽  
Ulla Petti ◽  
Ira Leviant ◽  
...  

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-Simlex -style resources for additional languages.We make these contributions—the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses which can be be helpful in guiding future developments in multilingual lexical semantics and representation learning—available via aWeb site that will encourage community effort in further expansion of Multi-Simlex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.


2017 ◽  
Author(s):  
El Moatez Billah Nagoudi ◽  
Didier Schwab

2021 ◽  
Author(s):  
Tammie Borders ◽  
Svitlana Volkova

2020 ◽  
Vol 34 (05) ◽  
pp. 7456-7463 ◽  
Author(s):  
Zied Bouraoui ◽  
Jose Camacho-Collados ◽  
Steven Schockaert

One of the most remarkable properties of word embeddings is the fact that they capture certain types of semantic and syntactic relationships. Recently, pre-trained language models such as BERT have achieved groundbreaking results across a wide range of Natural Language Processing tasks. However, it is unclear to what extent such models capture relational knowledge beyond what is already captured by standard word embeddings. To explore this question, we propose a methodology for distilling relational knowledge from a pre-trained language model. Starting from a few seed instances of a given relation, we first use a large text corpus to find sentences that are likely to express this relation. We then use a subset of these extracted sentences as templates. Finally, we fine-tune a language model to predict whether a given word pair is likely to be an instance of some relation, when given an instantiated template for that relation as input.


Sign in / Sign up

Export Citation Format

Share Document