Changing the Geometry of Representations: α-Embeddings for NLP Tasks

Word embeddings based on a conditional model are commonly used in Natural Language Processing (NLP) tasks to embed the words of a dictionary in a low dimensional linear space. Their computation is based on the maximization of the likelihood of a conditional probability distribution for each word of the dictionary. These distributions form a Riemannian statistical manifold, where word embeddings can be interpreted as vectors in the tangent space of a specific reference measure on the manifold. A novel family of word embeddings, called α-embeddings have been recently introduced as deriving from the geometrical deformation of the simplex of probabilities through a parameter α, using notions from Information Geometry. After introducing the α-embeddings, we show how the deformation of the simplex, controlled by α, provides an extra handle to increase the performances of several intrinsic and extrinsic tasks in NLP. We test the α-embeddings on different tasks with models of increasing complexity, showing that the advantages associated with the use of α-embeddings are present also for models with a large number of parameters. Finally, we show that tuning α allows for higher performances compared to the use of larger models in which additionally a transformation of the embeddings is learned during training, as experimentally verified in attention models.

Download Full-text

A Survey on Bias in Deep NLP

Applied Sciences ◽

10.3390/app11073184 ◽

2021 ◽

Vol 11 (7) ◽

pp. 3184

Author(s):

Ismael Garrido-Muñoz ◽

Arturo Montejo-Ráez ◽

Fernando Martínez-Santiago ◽

L. Alfonso Ureña-López

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Natural Language Processing ◽

Probability Distribution ◽

Natural Language ◽

Network Design ◽

Language Processing ◽

Deep Neural Networks ◽

Learning Processes ◽

Relevant Issue

Deep neural networks are hegemonic approaches to many machine learning areas, including natural language processing (NLP). Thanks to the availability of large corpora collections and the capability of deep architectures to shape internal language mechanisms in self-supervised learning processes (also known as “pre-training”), versatile and performing models are released continuously for every new network design. These networks, somehow, learn a probability distribution of words and relations across the training collection used, inheriting the potential flaws, inconsistencies and biases contained in such a collection. As pre-trained models have been found to be very useful approaches to transfer learning, dealing with bias has become a relevant issue in this new scenario. We introduce bias in a formal way and explore how it has been treated in several networks, in terms of detection and correction. In addition, available resources are identified and a strategy to deal with bias in deep NLP is proposed.

Download Full-text

Isometric Signal Processing under Information Geometric Framework

Entropy ◽

10.3390/e21040332 ◽

2019 ◽

Vol 21 (4) ◽

pp. 332 ◽

Cited By ~ 1

Author(s):

Hao Wu ◽

Yongqiang Cheng ◽

Hongqiang Wang

Keyword(s):

Signal Processing ◽

Probability Distribution ◽

Geometric Structure ◽

Information Geometry ◽

Sufficient Condition ◽

Necessary And Sufficient Condition ◽

Statistical Manifold ◽

Intrinsic Parameter ◽

Geometric Framework ◽

Necessary And Sufficient

Information geometry is the study of the intrinsic geometric properties of manifolds consisting of a probability distribution and provides a deeper understanding of statistical inference. Based on this discipline, this letter reports on the influence of the signal processing on the geometric structure of the statistical manifold in terms of estimation issues. This letter defines the intrinsic parameter submanifold, which reflects the essential geometric characteristics of the estimation issues. Moreover, the intrinsic parameter submanifold is proven to be a tighter one after signal processing. In addition, the necessary and sufficient condition of invariant signal processing of the geometric structure, i.e., isometric signal processing, is given. Specifically, considering the processing with the linear form, the construction method of linear isometric signal processing is proposed, and its properties are presented in this letter.

Download Full-text

Multi-Sense Embeddings per Word

10.31219/osf.io/udfhn ◽

2020 ◽

Author(s):

Masashi Sugiyama

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Research Area ◽

Word Embedding ◽

The Other ◽

Word Embeddings ◽

Word Similarity ◽

Better Than ◽

Non Parametric

Recently, word embeddings have been used in many natural language processing problems successfully and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn vectors for all senses of word separately. Therefore, in this project, we have explored two multi-sense word embedding models, including Multi-Sense Skip-gram (MSSG) model and Non-parametric Multi-sense Skip Gram model (NP-MSSG). Furthermore, we propose an extension of the Multi-Sense Skip-gram model called Incremental Multi-Sense Skip-gram (IMSSG) model which could learn the vectors of all senses per word incrementally. We evaluate all the systems on word similarity task and show that IMSSG is better than the other models.

Download Full-text

When Word Embeddings Become Endangered

Multilingual Facilitation ◽

10.31885/9789515150257.24 ◽

2021 ◽

pp. 275-288

Author(s):

Khalid Alnajjar

Keyword(s):

Natural Language Processing ◽

Sentiment Analysis ◽

Language Processing ◽

Endangered Languages ◽

Word Embeddings ◽

Analysis Model ◽

Resource Poor ◽

Semantic Spaces ◽

Cross Lingual ◽

Analysis Models

Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages as such resources are so scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and universal dependencies. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the universal dependencies and aligned to match the semantic spaces of the big languages; resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing cross-lingual word embeddings. The evaluation conducted shows that our word embeddings for endangered languages are well-aligned with the resource-rich languages, and they are suitable for training task-specific models as demonstrated by our sentiment analysis models which achieved high accuracies. All our cross-lingual word embeddings and sentiment analysis models will be released openly via an easy-to-use Python library.

Download Full-text

Computationally Efficient Learning of Quality Controlled Word Embeddings for Natural Language Processing

2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) ◽

10.1109/isvlsi.2019.00033 ◽

2019 ◽

Author(s):

Mohammed Alawad ◽

Georgia Tourassi

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Word Embeddings ◽

Computationally Efficient ◽

Efficient Learning

Download Full-text

Learning adaptive representations for entity recognition in the biomedical domain

Journal of Biomedical Semantics ◽

10.1186/s13326-021-00238-0 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Ivano Lauriola ◽

Fabio Aiolli ◽

Alberto Lavelli ◽

Fabio Rinaldi

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Machine Learning Algorithms ◽

Entity Recognition ◽

Machine Learning Techniques ◽

Hybrid Architecture ◽

Biomedical Domain ◽

Word Embeddings

Abstract Background Named Entity Recognition is a common task in Natural Language Processing applications, whose purpose is to recognize named entities in textual documents. Several systems exist to solve this task in the biomedical domain, based on Natural Language Processing techniques and Machine Learning algorithms. A crucial step of these applications is the choice of the representation which describes data. Several representations have been proposed in the literature, some of which are based on a strong knowledge of the domain, and they consist of features manually defined by domain experts. Usually, these representations describe the problem well, but they require a lot of human effort and annotated data. On the other hand, general-purpose representations like word-embeddings do not require human domain knowledge, but they could be too general for a specific task. Results This paper investigates methods to learn the best representation from data directly, by combining several knowledge-based representations and word embeddings. Two mechanisms have been considered to perform the combination, which are neural networks and Multiple Kernel Learning. To this end, we use a hybrid architecture for biomedical entity recognition which integrates dictionary look-up (also known as gazetteers) with machine learning techniques. Results on the CRAFT corpus clearly show the benefits of the proposed algorithm in terms of F1 score. Conclusions Our experiments show that the principled combination of general, domain specific, word-, and character-level representations improves the performance of entity recognition. We also discussed the contribution of each representation in the final solution.

Download Full-text

Evaluation of Dimensionality Reduction and Truncation Techniques for Word Embeddings

10.5753/eniac.2018.4477 ◽

2018 ◽

Author(s):

Paulo Henrique Calado Aoun ◽

Andre C. A. Nascimento ◽

Adenilton J. Da Silva

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Dimensionality Reduction ◽

Mobile Devices ◽

Language Processing ◽

Word Embeddings ◽

Reduction Strategies ◽

Truncation Techniques

The use of word embeddings is becoming very common in many Natural Language Processing tasks. Most of the time, these require computacional resources that can not be found in most part of the current mobile devices. In this work, we evaluate a combination of numeric truncation and dimensionality reduction strategies in order to obtain smaller vectorial representations without substancial losses in performance.

Download Full-text

NMT Multi-Sense Embeddings per Word

10.31219/osf.io/k623t ◽

2019 ◽

Author(s):

William Jin

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Research Area ◽

Word Embedding ◽

The Other ◽

Word Embeddings ◽

Word Similarity ◽

Better Than ◽

Non Parametric

Download Full-text

Domain specific word embeddings for natural language processing in radiology

Journal of Biomedical Informatics ◽

10.1016/j.jbi.2020.103665 ◽

2021 ◽

Vol 113 ◽

pp. 103665

Author(s):

Timothy L. Chen ◽

Max Emerling ◽

Gunvant R. Chaudhari ◽

Yeshwant R. Chillakuru ◽

Youngho Seo ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Word Embeddings ◽

Domain Specific

Download Full-text

Automated Dictionary Creation for Analyzing Text: An Illustration from Stereotype Content

10.31234/osf.io/afm8k ◽

2019 ◽

Author(s):

Gandalf Nicolas ◽

Xuechunzi Bai ◽

Susan Fiske

Keyword(s):

Social Psychology ◽

Natural Language Processing ◽

Language Processing ◽

Convergent Validity ◽

R Package ◽

Word Embeddings ◽

Psychological Constructs ◽

Stereotype Content ◽

Automated Methods ◽

Initial List

Recent advances in natural language processing provide new approaches to analyze psychological open-ended data. However, many of these methods require translating to the needs of psychologists working with text. Here, we introduce automated methods to create and validate extensive dictionaries of psychological constructs using Wordnet and word embeddings. Specifically, we first expand an initial list of seed words by using Wordnet to obtain synonyms, antonyms, and other semantically related terms. Next, we evaluate dictionary reliability by using word embeddings trained on independent sources. Finally, we evaluate the dictionaries’ convergent validity against traditional scale ratings and human judgments. We illustrate these innovations by creating stereotype content dictionaries, a construct in social psychology that lacks specialized and validated dictionaries for coding open-ended data. These dictionaries achieved over 80% coverage of new responses, compared to 20% coverage by the seed-word-only approach. Cosine similarity with word embeddings confirmed that the dictionaries are more similar within the same concept than across concepts. Moreover, open-ended responses predicted both traditional scale ratings and human judgments about the dictionary topic. The R package Automated Dictionary Creation for Analyzing Text (ADCAT; https://github.com/gandalfnicolas/ADCAT) allows anyone to create novel dictionaries for constructs of interest and to access the stereotype content dictionaries.

Download Full-text