Dissecting word embeddings and language models in natural language processing

2021 ◽  
Vol 24 (5) ◽  
pp. 1509-1515
Author(s):  
Vivek Kumar Verma ◽  
Mrigank Pandey ◽  
Tarun Jain ◽  
Pradeep Kumar Tiwari
2020 ◽  
Vol 34 (05) ◽  
pp. 7456-7463 ◽  
Author(s):  
Zied Bouraoui ◽  
Jose Camacho-Collados ◽  
Steven Schockaert

One of the most remarkable properties of word embeddings is the fact that they capture certain types of semantic and syntactic relationships. Recently, pre-trained language models such as BERT have achieved groundbreaking results across a wide range of Natural Language Processing tasks. However, it is unclear to what extent such models capture relational knowledge beyond what is already captured by standard word embeddings. To explore this question, we propose a methodology for distilling relational knowledge from a pre-trained language model. Starting from a few seed instances of a given relation, we first use a large text corpus to find sentences that are likely to express this relation. We then use a subset of these extracted sentences as templates. Finally, we fine-tune a language model to predict whether a given word pair is likely to be an instance of some relation, when given an instantiated template for that relation as input.


2021 ◽  
Author(s):  
Oscar Nils Erik Kjell ◽  
H. Andrew Schwartz ◽  
Salvatore Giorgi

The language that individuals use for expressing themselves contains rich psychological information. Recent significant advances in Natural Language Processing (NLP) and Deep Learning (DL), namely transformers, have resulted in large performance gains in tasks related to understanding natural language such as machine translation. However, these state-of-the-art methods have not yet been made easily accessible for psychology researchers, nor designed to be optimal for human-level analyses. This tutorial introduces text (www.r-text.org), a new R-package for analyzing and visualizing human language using transformers, the latest techniques from NLP and DL. Text is both a modular solution for accessing state-of-the-art language models and an end-to-end solution catered for human-level analyses. Hence, text provides user-friendly functions tailored to test hypotheses in social sciences for both relatively small and large datasets. This tutorial describes useful methods for analyzing text, providing functions with reliable defaults that can be used off-the-shelf as well as providing a framework for the advanced users to build on for novel techniques and analysis pipelines. The reader learns about six methods: 1) textEmbed: to transform text to traditional or modern transformer-based word embeddings (i.e., numeric representations of words); 2) textTrain: to examine the relationships between text and numeric/categorical variables; 3) textSimilarity and 4) textSimilarityTest: to computing semantic similarity scores between texts and significance test the difference in meaning between two sets of texts; and 5) textProjection and 6) textProjectionPlot: to examine and visualize text within the embedding space according to latent or specified construct dimensions (e.g., low to high rating scale scores).


2020 ◽  
Author(s):  
Masashi Sugiyama

Recently, word embeddings have been used in many natural language processing problems successfully and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn vectors for all senses of word separately. Therefore, in this project, we have explored two multi-sense word embedding models, including Multi-Sense Skip-gram (MSSG) model and Non-parametric Multi-sense Skip Gram model (NP-MSSG). Furthermore, we propose an extension of the Multi-Sense Skip-gram model called Incremental Multi-Sense Skip-gram (IMSSG) model which could learn the vectors of all senses per word incrementally. We evaluate all the systems on word similarity task and show that IMSSG is better than the other models.


2021 ◽  
Vol 50 (3) ◽  
pp. 27-28
Author(s):  
Immanuel Trummer

Introduction. We have seen significant advances in the state of the art in natural language processing (NLP) over the past few years [20]. These advances have been driven by new neural network architectures, in particular the Transformer model [19], as well as the successful application of transfer learning approaches to NLP [13]. Typically, training for specific NLP tasks starts from large language models that have been pre-trained on generic tasks (e.g., predicting obfuscated words in text [5]) for which large amounts of training data are available. Using such models as a starting point reduces task-specific training cost as well as the number of required training samples by orders of magnitude [7]. These advances motivate new use cases for NLP methods in the context of databases.


2021 ◽  
Vol 69 (11) ◽  
pp. 940-951
Author(s):  
Maximilian Both ◽  
Jochen Müller ◽  
Christian Diedrich

Zusammenfassung Systeme im Bereich Industrie 4.0 sollen interoperabel miteinander agieren können. Damit dies automatisiert realisiert werden kann, müssen sie semantisch interoperabel sein. Hierfür fokussiert der aktuelle Industrie 4.0 Forschungsansatz einen semantisch homogenen Sprachraum. In diesem Paper wird eine Methode vorgestellt, die diesen Ansatz um heterogene Semantik erweitert. Die Abbildung unbekannter Vokabulare auf eine Zielontologie ermöglicht die Interaktionen heterogener Verwaltungsschalen. Basis der Abbildung sind Methoden aus dem Bereich Natural Language Processing. Hierzu werden auf ISO Standards vortrainierte language models und sentence embeddings kombiniert. Dies führt zu einer vielversprechenden Genauigkeit bei dem erstellten Evaluationsdatensatz, welcher unterschiedliche Semantiken für Identifikation- und Design-Teilmodelle des Projektes Pumpe 4.0 enthält.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Ivano Lauriola ◽  
Fabio Aiolli ◽  
Alberto Lavelli ◽  
Fabio Rinaldi

Abstract Background Named Entity Recognition is a common task in Natural Language Processing applications, whose purpose is to recognize named entities in textual documents. Several systems exist to solve this task in the biomedical domain, based on Natural Language Processing techniques and Machine Learning algorithms. A crucial step of these applications is the choice of the representation which describes data. Several representations have been proposed in the literature, some of which are based on a strong knowledge of the domain, and they consist of features manually defined by domain experts. Usually, these representations describe the problem well, but they require a lot of human effort and annotated data. On the other hand, general-purpose representations like word-embeddings do not require human domain knowledge, but they could be too general for a specific task. Results This paper investigates methods to learn the best representation from data directly, by combining several knowledge-based representations and word embeddings. Two mechanisms have been considered to perform the combination, which are neural networks and Multiple Kernel Learning. To this end, we use a hybrid architecture for biomedical entity recognition which integrates dictionary look-up (also known as gazetteers) with machine learning techniques. Results on the CRAFT corpus clearly show the benefits of the proposed algorithm in terms of F1 score. Conclusions Our experiments show that the principled combination of general, domain specific, word-, and character-level representations improves the performance of entity recognition. We also discussed the contribution of each representation in the final solution.


2014 ◽  
Vol 22 (1) ◽  
pp. 135-161 ◽  
Author(s):  
M. MELERO ◽  
M.R. COSTA-JUSSÀ ◽  
P. LAMBERT ◽  
M. QUIXAL

AbstractWe present research aiming to build tools for the normalization of User-Generated Content (UGC). We argue that processing this type of text requires the revisiting of the initial steps of Natural Language Processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user-generated texts) presents a number of nonstandard communicative and linguistic characteristics – often closer to oral and colloquial language than to edited text. We present a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews, and blogs, and describe its main characteristics. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging. Our aim with this paper is to seize the power of already existing spell and grammar correction engines and endow them with automatic normalization capabilities in order to pave the way for the application of standard Natural Language Processing tools to typical UGC text. Particularly, we propose a strategy for automatically normalizing UGC by adding a module on top of a pre-existing spell-checker that selects the most plausible correction from an unranked list of candidates provided by the spell-checker. To build this selector module we train four language models, each one containing a different type of linguistic information in a trade-off with its generalization capabilities. Our experiments show that the models trained on truecase and lowercase word forms are more discriminative than the others at selecting the best candidate. We have also experimented with a parametrized combination of the models by both optimizing directly on the selection task and doing a linear interpolation of the models. The resulting parametrized combinations obtain results close to the best performing model but do not improve on those results, as measured on the test set. The precision of the selector module in ranking number one the expected correction proposal on the test corpora reaches 82.5% for Twitter text (baseline 57%) and 88% for non-Twitter text (baseline 64%).


2021 ◽  
Author(s):  
Jihyeon Roh ◽  
Sungjin Park ◽  
Bo-Kyeong Kim ◽  
Sang-Hoon Oh ◽  
Soo-Young Lee

2018 ◽  
Author(s):  
Paulo Henrique Calado Aoun ◽  
Andre C. A. Nascimento ◽  
Adenilton J. Da Silva

The use of word embeddings is becoming very common in many Natural Language Processing tasks. Most of the time, these require computacional resources that can not be found in most part of the current mobile devices. In this work, we evaluate a combination of numeric truncation and dimensionality reduction strategies in order to obtain smaller vectorial representations without substancial losses in performance.


Sign in / Sign up

Export Citation Format

Share Document