Predicting Human Similarity Judgments with Distributional Models: The Value of Word Associations

Author(s):  
Simon De Deyne ◽  
Amy Perfors ◽  
Daniel J. Navarro

To represent the meaning of a word, most models use external language resources, such as text corpora, to derive the distributional properties of word usage. In this study, we propose that internal language models, which are more closely aligned with the mental representations of words, can be used to derive new theoretical questions about the structure of the mental lexicon. A comparison with internal models also puts into perspective a number of assumptions underlying recently proposed distributional text-based models and could provide important insights for cognitive science, including linguistics and artificial intelligence. We focus on word-embedding models, which have been proposed to learn aspects of word meaning in a manner similar to humans, and contrast them with internal language models derived from a new extensive data set of word associations. An evaluation using relatedness judgments shows that internal language models consistently outperform current state-of-the-art text-based external language models. This suggests alternative approaches to representing word meaning using properties that are not encoded in text.
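The relatedness evaluation described above amounts to scoring word pairs by embedding similarity and correlating those scores with human judgments. The sketch below illustrates this with invented toy vectors and made-up human ratings (none of these values come from the study), using cosine similarity and a hand-rolled Spearman rank correlation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(x, y):
    """Spearman rank correlation (no tie correction; illustration only)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1.0
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Toy embeddings (hypothetical values, not from any real model).
emb = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
    "bus": [0.2, 0.7, 0.4],
}
pairs = [("cat", "dog"), ("car", "bus"), ("cat", "car"), ("dog", "bus")]
human = [9.0, 8.5, 2.0, 1.5]  # made-up human relatedness ratings
model = [cosine(emb[a], emb[b]) for a, b in pairs]
print(round(spearman(model, human), 3))
```

The same comparison works for any similarity source, including association-derived measures in place of cosine over text-trained vectors.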

2018 ◽  
Author(s):  
Simon De Deyne ◽  
Danielle Navarro ◽  
Guillem Collell ◽  
Amy Perfors

One of the main limitations of natural language-based approaches to meaning is that they are not grounded. In this study, we evaluate how well different kinds of models account for people’s representations of both concrete and abstract concepts. The models are both unimodal (language-based only) models and multimodal distributional semantic models (which additionally incorporate perceptual and/or affective information). The language-based models include both external (based on text corpora) and internal (derived from word associations) language models. We present two new studies and a re-analysis of a series of previous studies demonstrating that unimodal performance is substantially higher for internal models, especially when comparisons at the basic level are considered. For multimodal models, our findings suggest that additional visual and affective features lead to only slightly more accurate mental representations of word meaning than what is already encoded in internal language models; however, for abstract concepts, visual and affective features improve the predictions of external text-based models. Our work presents new evidence that the grounding problem includes abstract words as well and is therefore more widespread than previously suggested. Implications for both embodied and distributional views are discussed.


2021 ◽  
Vol 47 (1) ◽  
pp. 141-179
Author(s):  
Matej Martinc ◽  
Senja Pollak ◽  
Marko Robnik-Šikonja

Abstract We present a set of novel neural supervised and unsupervised approaches for determining the readability of documents. In the unsupervised setting, we leverage neural language models, whereas in the supervised setting, three different neural classification architectures are tested. We show that the proposed neural unsupervised approach is robust, transferable across languages, and allows adaptation to a specific readability task and data set. By systematic comparison of several neural architectures on a number of benchmark and new labeled readability data sets in two languages, this study also offers a comprehensive analysis of different neural approaches to readability classification. We expose their strengths and weaknesses, compare their performance to current state-of-the-art classification approaches to readability, which in most cases still rely on extensive feature engineering, and propose possibilities for improvements.
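The unsupervised setting rests on the idea that a language model's perplexity over a document can serve as a readability signal: text that the model finds predictable is presumed easier to read. A minimal stand-in for that idea, using an add-alpha-smoothed unigram model in place of the neural language models the paper actually employs:

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, doc_tokens, alpha=1.0):
    """Perplexity of a document under an add-alpha-smoothed unigram model.
    A toy stand-in for the neural language models used for readability."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(doc_tokens)
    total = len(train_tokens)
    log_prob = 0.0
    for tok in doc_tokens:
        p = (counts[tok] + alpha) / (total + alpha * len(vocab))
        log_prob += math.log(p)
    return math.exp(-log_prob / len(doc_tokens))

train = "the cat sat on the mat the dog sat on the rug".split()
easy = "the cat sat on the mat".split()
hard = "heterogeneous perplexity estimation".split()
# Familiar wording yields lower perplexity, i.e. higher predicted readability.
print(unigram_perplexity(train, easy) < unigram_perplexity(train, hard))
```

A real system would rank candidate documents by the perplexity of a large neural language model rather than a unigram count model, but the ranking logic is the same.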


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Sven Lißner ◽  
Stefan Huber

Abstract Background GPS-based cycling data are increasingly available for traffic planning these days. However, the recorded data often contain more information than bicycle trips alone: examples include tracks recorded while using modes of transport other than the bicycle, or long periods at work locations where tracking continues. Collected bicycle GPS data therefore need to be processed adequately before they can be used for transportation planning. Results The article presents a multi-level approach to bicycle-specific data processing. The data processing model comprises several steps (data filtering, smoothing, trip segmentation, transport mode recognition, driving mode detection) to finally obtain a correct data set that contains only bicycle trips. The validation reveals a sound accuracy of the model at its current state (82–88%).
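One of the filtering steps can be illustrated with a speed-plausibility check: drop GPS fixes whose implied speed exceeds what a cyclist can sustain. The sketch below is our own simplification with a hypothetical 12 m/s (~43 km/h) threshold, not the authors' pipeline, using the haversine distance between consecutive kept points:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in metres."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def plausible_bike_points(track, vmax_ms=12.0):
    """Keep only points whose speed relative to the last kept point
    is plausible for cycling. A toy stand-in for the filtering and
    mode-recognition steps described in the article."""
    kept = [track[0]]
    for lat, lon, t in track[1:]:
        plat, plon, pt = kept[-1]
        speed = haversine_m(plat, plon, lat, lon) / max(t - pt, 1e-9)
        if speed <= vmax_ms:
            kept.append((lat, lon, t))
    return kept

# (lat, lon, unix time): the third point implies a jump of over 100 m/s,
# clearly a GPS glitch or a different transport mode.
track = [(51.05, 13.73, 0), (51.0501, 13.73, 10),
         (51.06, 13.73, 20), (51.0502, 13.73, 30)]
print(len(plausible_bike_points(track)))
```

A production pipeline would of course combine such thresholds with smoothing and segment-level classification rather than judging single points.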


2021 ◽  
Author(s):  
Keith Bradford Critzer ◽  
Douglas Andrew Colbert

Abstract This paper presents a broad overview of the current state of the oil and gas engineering, procurement, and construction (EPC) contractor base following a period of challenging market conditions, subsequent owner/operator investment deferments, and the resulting financial impacts to the contractor base. These factors have caused a reduced tolerance for oil and gas volatility and a reduced appetite for lump sum contract risk. This paper identifies alternative contracting approaches to traditional competitively bid lump sum contracting. These alternative approaches result in a better understanding and assignment of risk between owner/operator and contractor, encourage continued participation by contractors in the oil and gas sector, and increase the probability of successful project outcomes.


2021 ◽  
Vol 11 (22) ◽  
pp. 10596
Author(s):  
Chung-Hong Lee ◽  
Hsin-Chang Yang ◽  
Yenming J. Chen ◽  
Yung-Lin Chuang

Recently, detecting real-world events in real time from Twitter messages via algorithmic computation has emerged as a new paradigm in data science applications. During a high-impact event, people may want to know the latest information about its development in order to better understand the situation and possible trends and to make decisions. However, in emergencies, governments and enterprises are often unable to notify people in time for early warning and risk avoidance. A sensible solution is to integrate real-time event monitoring and intelligence-gathering functions into their decision support systems. Such a system can provide real-time event summaries, which are updated whenever important new events are detected. In this work, we therefore combine a Twitter-based real-time event detection algorithm with pre-trained language models for summarizing emergent events. We used an online text-stream clustering algorithm and a self-adaptive method to gather Twitter data and detect emerging events. Subsequently, we used the XSum data set with a pre-trained language model, namely the T5 model, to train the summarization model. ROUGE metrics were used to compare the summary performance of various models. We then used the trained model to summarize the incoming Twitter data for experimentation. In particular, we provide a real-world case study, the COVID-19 pandemic, to verify the applicability of the proposed method. Finally, we conducted a survey with human judges on the resulting example summaries to assess their quality. The case study and experimental results demonstrate that our summarization method gives users a feasible way to quickly understand updates in specific event intelligence based on a real-time summary of the event story.
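The ROUGE metrics mentioned above score a candidate summary by its n-gram overlap with a reference. A minimal ROUGE-1 F1 sketch is shown below; the full ROUGE toolkit also applies stemming and reports ROUGE-2 and ROUGE-L, which this omits, and the example sentences are invented:

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Unigram-overlap ROUGE-1 F1, the simplest member of the ROUGE
    family used to compare summarization models."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

ref = "new covid cases rise sharply in the region"
cand = "covid cases rise in the region"
print(round(rouge1_f(cand, ref), 3))
```

In practice one would report ROUGE over a whole test set (e.g. XSum) and average the per-summary scores, exactly as the model comparison in the paper requires.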


2019 ◽  
Vol 1 (3) ◽  
Author(s):  
A. Aziz Altowayan ◽  
Lixin Tao

We consider the following problem: given neural language models (embeddings), each of which is trained on an unknown data set, how can we determine which model would provide a better result when used for feature representation in a downstream task such as text classification or entity recognition? In this paper, we assess the word similarity measure by analyzing its impact on word embeddings learned from various datasets and how they perform in a simple classification task. Word representations were learned and assessed under the same conditions. For training word vectors, we used the implementation of Continuous Bag of Words described in [1]. To assess the quality of the vectors, we applied the analogy-questions test for word similarity described in the same paper. Further, to measure the retrieval rate of an embedding model, we introduced a new metric (Average Retrieval Error), which measures the percentage of missing words in the model. We observe that high accuracy on syntactic and semantic similarities between word pairs is not an indicator of better classification results. This observation can be justified by the fact that a domain-specific corpus contributes more to performance than a general-purpose corpus. For reproducibility, we release our experiment scripts and results.
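The Average Retrieval Error can be sketched directly from its description as the percentage of task words missing from an embedding's vocabulary. The exact formula here is our reading of the abstract rather than a verified reproduction, and the vocabularies are invented:

```python
def average_retrieval_error(task_vocab, embedding_vocab):
    """Percentage of task words absent from the embedding's vocabulary,
    following the abstract's description of the metric (our reading,
    not a verified reproduction of the paper's formula)."""
    missing = [w for w in task_vocab if w not in embedding_vocab]
    return 100.0 * len(missing) / len(task_vocab)

emb = {"king", "queen", "man", "woman"}        # toy embedding vocabulary
task = ["king", "queen", "throne", "crown", "man"]
print(average_retrieval_error(task, emb))      # two of five words missing
```

A lower score means the embedding covers more of the downstream task's vocabulary, which is one reason a smaller domain-specific corpus can beat a larger general-purpose one.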


2020 ◽  
Vol 21 (26) ◽  
Author(s):  
Marin Laak

The Estonian Literary Museum has been a pioneer in the field of digital humanities since the 1990s, when computer culture began to spread more widely. In managing its valuable data collections, its mission has been to make them available to the public. The cultural heritage was opened up to a wider audience in two directions: content-based searchable databases and relation-based data environments. The aim of this article is to show the present-day possibilities of computational literary studies and the creation of the related literary language resources in collaboration with corpus linguists. In the article, I analyse the possibilities of using cultural-heritage content environments and data collections as machine-readable language resources. As the first such experiments, annotated language corpora of correspondence and of criticism have been completed in the query system KORP. Using the problem of influence criticism at the beginning of the 20th century as an example, the study demonstrates the potential of literary language corpora for research on cultural heritage.   Estonia can soon expect an explosive growth in digital heritage and text resources due to the current project of mass digitisation of national cultural heritage (printed books, archival documents, photos, art, audiovisual, and ethnographic artifacts) (2019–2023). This will give new opportunities for different fields of digital humanities and make digitised heritage accessible to everyone in the form of open data. The project will focus on the usage of the heritage, on the needs of education, e-learning, and the creative industry, including digital creative arts. The aim of this article is to examine some research possibilities that opened up for literary history due to the digitisation of literary works and archival sources and to put them in the general context of digital humanities. Although the field of digital humanities is broad, the meaning of DH is often reduced to methods of computational language-centered analyses, mainly based on using different tools and software languages (R, Stylo, Python, Gephi, topic modelling, etc.).
While corpus-based research is already a professional standard in linguistics, literary scholars are still more used to working with traditional methods. This article introduces two digital literary history projects belonging to the field of digital humanities and analyses them as language resources for creating text corpora, and introduces some results of the case study of Estonian criticism from the Young Estonia movement up to the 1920s, carried out using the literary text corpora in the corpus query system KORP (https://korp.keeleressursid.ee) by the Centre of Estonian Language Resources. During the past twenty years, I have mainly focussed on developing large-scale implementation projects for digital representation of Estonian literary history. The objective of these experimental projects has been to develop principally new non-linear models of Estonian literary history for the digital environment. These activities were based on my research of the intertextual relations between authors, literary works, and critical texts using traditional methods. The first content-based literary history project, “ERNI. Estonian Literary History in Texts 1924–1925” (www2.kirmus.ee/erni), was based on a hypertextual network of literary source texts and reviews. We re-conceptualised literary history as a non-linear narrative and a gallery with many entrances. The task of the project was also to ensure its usability in education: a significant number of study materials have been added in cooperation with schoolteachers. In 2004, we initiated our long-term and still running project “Kreutzwald’s Century: the Estonian Cultural History Web” (http://kreutzwald.kirmus.ee) at the Estonian Literary Museum. The objective of this project was to make literary sources of the period accessible as a dynamic, interactive information environment.
This was a hybrid project which synthesised the classical study of Estonian literary history, the needs of the digital media user, and the expanding digital resources from different memory institutions; its underlying idea was to link together all the works of fiction of an author, as well as their biography, manuscripts, and photos, and to make them visible for the user on five interactive time axes. The project uses a specially created platform. Today, this platform is extensively used by schoolteachers: in 2020 (Jan.–Dec.) it had about 8,986,555 clicks, and over seven years (Dec. 2013–Dec. 2020) it has collected 64,627,380 clicks. To find out how we can fit such content-based models of literary heritage into the context of Digital Humanities, we need to compare the previous modelling practices with our current experimental project in the corpus query system KORP. Our interdisciplinary project “Literary Studies Meet Corpus Linguistics” (2017–2020) concentrated on studying literary history sources with linguistic methods. As the result of the project, two literary text corpora were created: “Epistolary text corpus of Estonian writers Johannes Semper and Johannes Vares-Barbarus” and “Corpus of the Estonian literary criticism, Noor-Eesti and the 1920s”. Both of them were pilot projects in the field, which started with converting the digitised archival and printed sources into machine-readable format before text and data mining for corpus creation. The query system KORP allows us to organise the language data by all the categories used in the corpus, for example, to learn who mentioned the name of the French writer André Gide, and in what context. The second currently running project is the morphologically annotated corpus of literary criticism. This corpus contains texts of literary reviews and criticism in different genres, drawn from the projects ERNI and “Kreutzwald’s Century”.
The first results in studying the dynamics of literary values can already be seen. A query in KORP about the word ‘mõju’ (‘influence’) revealed that the manifesto “More of European culture!” of the Young Estonia group, voiced in 1905, was replaced during the independent Estonian Republic by the valuing of a specific national character. The corpus query showed a change in the meaning of the word: in the criticism contemporary to Young Estonia, the word ‘mõju’ was only associated with the historical pressure from Russian and German cultures. The foundation for modern comparative literary studies at the University of Tartu was laid in the 1920s by the professorship in Estonian literature.


2021 ◽  
Author(s):  
Roshan Rao ◽  
Jason Liu ◽  
Robert Verkuil ◽  
Joshua Meier ◽  
John F. Canny ◽  
...  

Abstract Unsupervised protein language models trained across millions of diverse sequences learn the structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
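The interleaved row and column attention can be pictured as two axial self-attention passes over an MSA tensor of shape (sequences, positions, features). The sketch below uses identity projections in place of learned query/key/value weights, so it is a shape-level illustration of the axial idea, not the published architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head self-attention with identity Q/K/V (illustration only).
    x has shape (batch, length, d)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def msa_block(msa):
    """One interleaved row/column attention step over an MSA tensor of
    shape (rows=sequences, cols=positions, d). A minimal sketch of the
    axial-attention idea, not the published model."""
    msa = self_attention(msa)                     # row attention: across positions
    msa = self_attention(msa.transpose(1, 0, 2))  # column attention: across sequences
    return msa.transpose(1, 0, 2)                 # restore (rows, cols, d)

rng = np.random.default_rng(0)
msa = rng.normal(size=(4, 7, 8))  # 4 aligned sequences, 7 positions, dim 8
out = msa_block(msa)
print(out.shape)
```

Tying attention across rows and columns this way is what lets the model scale to deep alignments: cost grows with rows + columns per axis rather than with the full rows × columns sequence length of a flattened input.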

