A Comparison of Approaches for Measuring the Semantic Similarity of Short Texts Based on Word Embeddings

Karlo Babić; Francesco Guerra; Sanda Martinčić-Ipšić; Ana Meštrović

doi:10.31341/jios.44.2.2

Journal of information and organizational sciences ◽

10.31341/jios.44.2.2 ◽

2020 ◽

Vol 44 (2) ◽

pp. 231-246

Author(s):

Karlo Babić ◽

Francesco Guerra ◽

Sanda Martinčić-Ipšić ◽

Ana Meštrović

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Semantic Similarity ◽

Language Processing ◽

Similarity Measures ◽

Vital Role ◽

The Other ◽

Word Embeddings ◽

Spearman Correlation ◽

Word Senses

Measuring the semantic similarity of texts has a vital role in various tasks from the field of natural language processing. In this paper, we describe a set of experiments we carried out to evaluate and compare the performance of different approaches for measuring the semantic similarity of short texts. We perform a comparison of four models based on word embeddings: two variants of Word2Vec (one based on Word2Vec trained on a specific dataset and the second extending it with embeddings of word senses), FastText, and TF-IDF. Since these models provide word vectors, we experiment with various methods that calculate the semantic similarity of short texts based on word vectors. More precisely, for each of these models, we test five methods for aggregating word embeddings into a text embedding. We introduce three methods by making variations of two commonly used similarity measures. One method is an extension of the cosine similarity based on centroids, and the other two methods are variations of the Okapi BM25 function. We evaluate all approaches on two publicly available datasets, SICK and Lee, in terms of the Pearson and Spearman correlation. The results indicate that the extended methods perform better than the original ones in most cases.
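
The centroid-based cosine similarity mentioned in this abstract can be sketched as follows. This is only a minimal illustration of the commonly used baseline (average the word vectors of each text, then take the cosine of the two averages), not the authors' extended method. The embedding table `TOY_EMBEDDINGS` and the function names are hypothetical and invented for the example.

```python
import math

# Hypothetical 2-dimensional embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "truck": [0.1, 0.9],
}


def centroid(vectors):
    """Return the component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_similarity(tokens_a, tokens_b, embeddings=TOY_EMBEDDINGS):
    """Cosine similarity between the centroids of two tokenized texts.

    Tokens without an embedding are skipped (an assumption of this sketch).
    """
    vectors_a = [embeddings[t] for t in tokens_a if t in embeddings]
    vectors_b = [embeddings[t] for t in tokens_b if t in embeddings]
    if not vectors_a or not vectors_b:
        return 0.0
    return cosine(centroid(vectors_a), centroid(vectors_b))
```

With these toy vectors, `text_similarity(["cat", "dog"], ["car", "truck"])` is much lower than `text_similarity(["cat", "dog"], ["dog", "cat"])`, since the first pair of centroids points in nearly orthogonal directions.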

Multi-Sense Embeddings per Word

10.31219/osf.io/udfhn ◽

2020 ◽

Author(s):

Masashi Sugiyama

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Research Area ◽

Word Embedding ◽

The Other ◽

Word Embeddings ◽

Word Similarity ◽

Better Than ◽

Non Parametric

Recently, word embeddings have been used successfully in many natural language processing problems, and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn vectors for all senses of a word separately. Therefore, in this project, we have explored two multi-sense word embedding models: the Multi-Sense Skip-gram (MSSG) model and the Non-parametric Multi-Sense Skip-gram (NP-MSSG) model. Furthermore, we propose an extension of the Multi-Sense Skip-gram model, called the Incremental Multi-Sense Skip-gram (IMSSG) model, which can learn the vectors of all senses per word incrementally. We evaluate all the systems on a word similarity task and show that IMSSG is better than the other models.

Knowledge-based sentence semantic similarity: algebraical properties

Progress in Artificial Intelligence ◽

10.1007/s13748-021-00248-0 ◽

2021 ◽

Author(s):

Mourad Oussalah ◽

Muhidin Mohamed

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Semantic Similarity ◽

Language Processing ◽

Similarity Measures ◽

Canonical Extension ◽

Similarity Score ◽

Semantic Similarity Measure ◽

Sentence Similarity

Determining the extent to which two text snippets are semantically equivalent is a well-researched topic in the areas of natural language processing, information retrieval and text summarization. The sentence-to-sentence similarity scoring is extensively used in both generic and query-based summarization of documents as a significance or a similarity indicator. Nevertheless, most of these applications utilize the concept of semantic similarity measure only as a tool, without paying attention to the inherent properties of such tools that ultimately restrict the scope and technical soundness of the underlying applications. This paper aims to help fill this gap. It investigates three popular WordNet hierarchical semantic similarity measures, namely path-length, Wu and Palmer, and Leacock and Chodorow, in terms of both algebraical and intuitive properties, highlighting their inherent limitations and theoretical constraints. We have especially examined properties related to range and scope of the semantic similarity score, incremental monotonicity evolution, monotonicity with respect to hyponymy/hypernymy relationship as well as a set of interactive properties. Extension from word semantic similarity to sentence similarity has also been investigated using a pairwise canonical extension. Properties of the underlying sentence-to-sentence similarity are examined and scrutinized. Next, to overcome inherent limitations of WordNet semantic similarity in terms of accounting for various Part-of-Speech word categories, a WordNet “All word-To-Noun conversion” that makes use of the Categorial Variation Database (CatVar) is put forward and evaluated using a publicly available dataset with a comparison with some state-of-the-art methods. The findings demonstrate the feasibility of the proposal and open up new opportunities in information retrieval and natural language processing tasks.
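
Two of the hierarchical measures named in this abstract, path-length and Wu and Palmer, can be illustrated on a small hand-made taxonomy. The sketch below does not use WordNet: the `PARENT` tree and all function names are hypothetical. It assumes the usual textbook definitions, with the root at depth 1, path similarity as 1 / (1 + number of edges on the shortest path), and Wu and Palmer as 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the lowest common subsumer.

```python
# Hypothetical toy taxonomy (child -> parent); this is not WordNet data.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is deepest
        if node in ancestors_a:
            return node
    raise ValueError("nodes do not share a root")


def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    lcs_depth = depth(lowest_common_subsumer(a, b))
    edges = (depth(a) - lcs_depth) + (depth(b) - lcs_depth)
    return 1.0 / (1.0 + edges)


def wu_palmer(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs_depth = depth(lowest_common_subsumer(a, b))
    return 2.0 * lcs_depth / (depth(a) + depth(b))
```

In this toy tree, "dog" and "cat" share the subsumer "animal" and therefore score higher under both measures than "dog" and "car", whose only common subsumer is the root.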

NMT Multi-Sense Embeddings per Word

10.31219/osf.io/k623t ◽

2019 ◽

Author(s):

William Jin

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Research Area ◽

Word Embedding ◽

The Other ◽

Word Embeddings ◽

Word Similarity ◽

Better Than ◽

Non Parametric

Getting in Shape: Word Embedding SubSpaces

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/761 ◽

2019 ◽

Author(s):

Tianyuan Zhou ◽

João Sedoc ◽

Jordan Rodu

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Semantic Similarity ◽

Language Processing ◽

Theoretical Framework ◽

Word Embedding ◽

Word Embeddings ◽

Empirical Results ◽

Linear Alignment ◽

The Relationship

Many tasks in natural language processing require the alignment of word embeddings. Embedding alignment relies on the geometric properties of the manifold of word vectors. This paper focuses on supervised linear alignment and studies the relationship between the shape of the target embedding. We assess the performance of aligned word vectors on semantic similarity tasks and find that the isotropy of the target embedding is critical to the alignment. Furthermore, aligning with an isotropic noise can deliver satisfactory results. We provide a theoretical framework and guarantees which aid in the understanding of empirical results.

Similarity of Sentences With Contradiction Using Semantic Similarity Measures

The Computer Journal ◽

10.1093/comjnl/bxaa100 ◽

2020 ◽

Author(s):

M Krishna Siva Prasad ◽

Poonam Sharma

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Semantic Similarity ◽

Language Processing ◽

Word Order ◽

Similarity Measures ◽

Semantic Features ◽

Short Text ◽

Sentence Similarity ◽

Expert Ratings

Short text or sentence similarity is crucial in various natural language processing activities. Traditional measures for sentence similarity consider word order, semantic features and role annotations of text to derive the similarity. These measures do not suit short texts or sentences with negation. Hence, this paper proposes an approach to determine the semantic similarity of sentences and also presents an algorithm to handle negation. In sentence similarity, word pair similarity plays a significant role. Hence, this paper also discusses the similarity between word pairs. Existing semantic similarity measures do not handle antonyms accurately. Hence, this paper proposes an algorithm to handle antonyms. This paper also presents an antonym dataset with 111 word pairs and corresponding expert ratings. The existing semantic similarity measures are tested on the dataset. The correlation results showed that the expert ratings are in line with the correlation obtained from the semantic similarity measures. The sentence similarity is handled by proposing two algorithms. The first algorithm deals with typical sentences, and the second algorithm deals with contradiction in sentences. The SICK dataset, which has sentences with negation, is considered for handling the sentence similarity. The algorithm helped in improving the results of sentence similarity.
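
Comparing a similarity measure against expert ratings, as described in this abstract, is typically done with a correlation coefficient. The sketch below shows the standard Pearson and Spearman coefficients in plain Python; the abstract does not say which coefficient the authors used, so this is an assumption, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))
```

For example, calling `spearman(model_scores, expert_ratings)` on two lists of equal length rewards a measure that orders the word pairs the same way as the experts, even when the raw scores are on a different scale.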

Extracting Word Synonyms from Text using Neural Approaches

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/1/6 ◽

2019 ◽

pp. 45-51 ◽

Cited By ~ 1

Author(s):

Nora Mohammed

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Semantic Similarity ◽

Language Processing ◽

Word Association ◽

Research Problem ◽

Neural Network ◽

Computational Techniques ◽

Word Embeddings ◽

Feed Forward

Extracting synonyms from textual corpora using computational techniques is an interesting research problem in the Natural Language Processing (NLP) domain. Neural techniques (such as Word2Vec) have been recently utilized to produce distributional word representations (also known as word embeddings) that capture semantic similarity/relatedness between words based on linear context. Nevertheless, using these techniques for synonym extraction poses many challenges due to the fact that similarity between vector word representations does not indicate only synonymy between words, but also other sense relations as well as word association or relatedness. In this paper, we tackle this problem using a novel 2-step approach. We first build distributional word embeddings using Word2Vec, then use the induced word embeddings as an input to train a feed-forward neural network using an annotated dataset to distinguish between synonyms and other semantically related words.

Explore different methods to build Multi-Sense Embeddings

10.31219/osf.io/4mvpx ◽

2020 ◽

Author(s):

Masashi Sugiyama

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Research Area ◽

Word Embedding ◽

The Other ◽

Word Embeddings ◽

Word Similarity ◽

Better Than ◽

Building Methods

In our last work, we did an exclusive survey related to multi-sense embedding building methods. In this work, we extend our previous work and try to improve the current methods. Recently, word embeddings have been used successfully in many natural language processing problems, and how to train a robust and accurate word embedding system efficiently is a popular research area. Since many, if not all, words have more than one sense, it is necessary to learn vectors for all senses of a word separately. Therefore, in this project, we have explored two multi-sense word embedding models: the Multi-Sense Skip-gram (MSSG) model and the Non-parametric Multi-Sense Skip-gram (NP-MSSG) model. Furthermore, we propose an extension of the Multi-Sense Skip-gram model, called the Incremental Multi-Sense Skip-gram (IMSSG) model, which can learn the vectors of all senses per word incrementally. We evaluate all the systems on a word similarity task and show that IMSSG is better than the other models.

Measuring associational thinking through word embeddings

Artificial Intelligence Review ◽

10.1007/s10462-021-10056-6 ◽

2021 ◽

Author(s):

Carlos Periñán-Pascual

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Vector Space ◽

Semantic Similarity ◽

Language Processing ◽

Weighted Average ◽

Correlation Coefficients ◽

Word Embeddings ◽

Similarity Coefficients ◽

Major Focus

The development of a model to quantify semantic similarity and relatedness between words has been the major focus of many studies in various fields, e.g. psychology, linguistics, and natural language processing. Unlike the measures proposed by most previous research, this article is aimed at estimating automatically the strength of associative words that can be semantically related or not. We demonstrate that the performance of the model depends not only on the combination of independently constructed word embeddings (namely, corpus- and network-based embeddings) but also on the way these word vectors interact. The research concludes that the weighted average of the cosine-similarity coefficients derived from independent word embeddings in a double vector space tends to yield high correlations with human judgements. Moreover, we demonstrate that evaluating word associations through a measure that relies on not only the rank ordering of word pairs but also the strength of associations can reveal some findings that go unnoticed by traditional measures such as Spearman’s and Pearson’s correlation coefficients.
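
The weighted average of cosine-similarity coefficients from two independent embedding spaces, which this abstract reports as its main conclusion, can be sketched as below. The abstract gives neither the weighting scheme nor the embeddings, so the `weight` parameter, the toy vectors, and the function names are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(word_a, word_b, corpus_emb, network_emb, weight=0.5):
    """Weighted average of the cosine similarities in two embedding spaces.

    `weight` is applied to the corpus-based score and (1 - weight) to the
    network-based score; the value 0.5 is an arbitrary default.
    """
    corpus_score = cosine(corpus_emb[word_a], corpus_emb[word_b])
    network_score = cosine(network_emb[word_a], network_emb[word_b])
    return weight * corpus_score + (1.0 - weight) * network_score


# Hypothetical toy vectors: the two spaces disagree about this word pair.
CORPUS_EMB = {"doctor": [1.0, 0.0], "nurse": [0.0, 1.0]}
NETWORK_EMB = {"doctor": [1.0, 1.0], "nurse": [1.0, 1.0]}
```

Here `combined_similarity("doctor", "nurse", CORPUS_EMB, NETWORK_EMB, weight=0.25)` mixes a corpus-based score of 0 with a network-based score of 1, so the result is close to 0.75.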

A Natural Language Processing Approach to Measuring Treatment Adherence and Consistency Using Semantic Similarity

AERA Open ◽

10.1177/23328584211028615 ◽

2021 ◽

Vol 7 ◽

pp. 233285842110286

Author(s):

Kylie L. Anglin ◽

Vivian C. Wong ◽

Arielle Boguslav

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Semantic Similarity ◽

Language Processing ◽

Intervention Implementation ◽

Proof Of Concept ◽

Coaching Intervention ◽

Processing Techniques ◽

Teacher Coaching ◽

The Impact

Though there is widespread recognition of the importance of implementation research, evaluators often face intense logistical, budgetary, and methodological challenges in their efforts to assess intervention implementation in the field. This article proposes a set of natural language processing techniques called semantic similarity as an innovative and scalable method of measuring implementation constructs. Semantic similarity methods are an automated approach to quantifying the similarity between texts. By applying semantic similarity to transcripts of intervention sessions, researchers can use the method to determine whether an intervention was delivered with adherence to a structured protocol, and the extent to which an intervention was replicated with consistency across sessions, sites, and studies. This article provides an overview of semantic similarity methods, describes their application within the context of educational evaluations, and provides a proof of concept using an experimental study of the impact of a standardized teacher coaching intervention.

Evolution of Semantic Similarity—A Survey

ACM Computing Surveys ◽

10.1145/3440755 ◽

2021 ◽

Vol 54 (2) ◽

pp. 1-37

Author(s):

Dhivya Chandrasekaran ◽

Vijay Mago

Keyword(s):

Natural Language ◽

Semantic Similarity ◽

Language Processing ◽

Hybrid Methods ◽

Research Work ◽

Similarity Measures ◽

Text Data ◽

Knowledge Based ◽

Open Research ◽

Research Problems

Estimating the semantic similarity between text data is one of the challenging and open research problems in the field of Natural Language Processing (NLP). The versatility of natural language makes it difficult to define rule-based methods for determining semantic similarity measures. To address this issue, various semantic similarity methods have been proposed over the years. This survey article traces the evolution of such methods beginning from traditional NLP techniques such as kernel-based methods to the most recent research work on transformer-based models, categorizing them based on their underlying principles as knowledge-based, corpus-based, deep neural network–based methods, and hybrid methods. Discussing the strengths and weaknesses of each method, this survey provides a comprehensive view of existing systems in place for new researchers to experiment and develop innovative ideas to address the issue of semantic similarity.
