Correction of Vector Representations of Words to Improve the Semantic Similarity

Author(s):  
Alexey Kolosov ◽  
Archil Maysuradze

BMC Genomics ◽  
2019 ◽  
Vol 20 (S9) ◽  
Author(s):  
Xiaoshi Zhong ◽  
Rama Kaalia ◽  
Jagath C. Rajapakse

Abstract
Background: Semantic similarity between Gene Ontology (GO) terms is a fundamental measure for many bioinformatics applications, such as determining the functional similarity between genes or proteins. Most previous research exploited information content to estimate the semantic similarity between GO terms; recently, some research has exploited word embeddings to learn vector representations for GO terms from a large-scale corpus. In this paper, we propose a novel method, named GO2Vec, that exploits graph embeddings to learn vector representations for GO terms from the GO graph. GO2Vec combines information from both the GO graph and GO annotations, and its learned vectors can be applied to a variety of bioinformatics applications, such as calculating the functional similarity between proteins and predicting protein-protein interactions.
Results: We conducted two kinds of experiments to evaluate the quality of GO2Vec: (1) functional similarity between proteins on the Collaborative Evaluation of GO-based Semantic Similarity Measures (CESSM) dataset, and (2) prediction of protein-protein interactions on the Yeast and Human datasets from the STRING database. Experimental results demonstrate the effectiveness of GO2Vec over the information content-based measures and the word embedding-based measures.
Conclusion: Our experimental results demonstrate the effectiveness of using graph embeddings to learn vector representations from undirected GO and GOA graphs. They also demonstrate that GO annotations provide useful information for computing the similarity between GO terms and between proteins.
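
To make the downstream use of such vectors concrete, here is a minimal sketch (not the authors' code) of how protein functional similarity can be scored once GO term vectors have been learned by some graph-embedding method such as node2vec; the embeddings and annotation sets below are placeholders:

```python
# Sketch, under assumptions: `term_vecs` maps GO term IDs to vectors
# learned by any graph-embedding method on the GO/GOA graph. A protein
# is represented as the mean of its annotated term vectors, and protein
# pairs are scored by cosine similarity. All data here are made up.
import numpy as np

term_vecs = {
    "GO:0006915": np.random.rand(64),  # placeholder embeddings
    "GO:0008219": np.random.rand(64),
    "GO:0016049": np.random.rand(64),
}

def protein_vector(go_terms, term_vecs):
    """Represent a protein as the mean of its annotated GO term vectors."""
    vecs = [term_vecs[t] for t in go_terms if t in term_vecs]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

p1 = protein_vector({"GO:0006915", "GO:0008219"}, term_vecs)
p2 = protein_vector({"GO:0008219", "GO:0016049"}, term_vecs)
print(f"functional similarity: {cosine(p1, p2):.3f}")
```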


2021 ◽  
Vol 5 (2) ◽  
pp. 126-130
Author(s):  
Сергій Олізаренко ◽  
В’ячеслав Радченко

The paper presents the results of developing a method for determining the semantic similarity of texts of arbitrary length based on their vector representations. These vector representations are obtained with a multilingual Transformer model, and the direct problem of determining the semantic similarity of arbitrary-length texts is treated as a classification problem over pairs of text sequences using the Transformer model. A comparative analysis was performed to select the Transformer model best suited to this class of problems. The main stages of the method are: fine-tuning the Transformer model on the second pretraining objective (next sentence prediction), and selecting and implementing a summarization method for text sequences longer than 512 (1024) tokens, so that the semantic similarity of texts of arbitrary length can be determined.
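
For illustration, a hedged sketch of the general pipeline described above, using the sentence-transformers library; the model name is an assumption (not necessarily the one the authors chose), and long inputs are naively truncated here, whereas the paper applies summarization to sequences beyond the 512 (1024) token limit:

```python
# Minimal sketch: embed two texts with a multilingual Transformer and
# score their semantic similarity by cosine. Model choice is assumed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

text_a = "The agreement enters into force next year."
text_b = "Угода набуде чинності наступного року."  # Ukrainian paraphrase

emb_a, emb_b = model.encode([text_a, text_b], convert_to_tensor=True)
score = util.cos_sim(emb_a, emb_b).item()  # cosine similarity in [-1, 1]
print(f"semantic similarity: {score:.3f}")
```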


Author(s):  
Aida Hakimova ◽  
Michael Charnine ◽  
Aleksey Klokov ◽  
Evgenii Sokolov

This paper is devoted to the development of a methodology for evaluating the semantic similarity of texts in different languages. The study is based on the hypothesis that the proximity of vector representations of terms in a semantic space can be interpreted as semantic similarity in a cross-lingual environment. Each text is associated with a vector in a single multilingual semantic vector space, and the semantic similarity of texts is determined by the proximity of the corresponding vectors. We propose a quantitative indicator called the Index of Semantic Textual Similarity (ISTS) that measures the degree of semantic similarity of multilingual texts on the basis of identified cross-lingual implicit semantic links. Its parameters are set based on the correlation with the presence of a formal reference between documents. The measure of semantic similarity expresses the existence of common terms, phrases, or word combinations in two documents. The optimal parameters of the algorithm for identifying implicit links are selected on a thematic collection by maximizing the correlation between explicit and implicit links. The developed algorithm can facilitate the search for closely related documents in the analysis of multilingual patent documentation.
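
As an illustration of the underlying idea, the following sketch (assumptions only, not the authors' implementation) declares an implicit link between two documents when the cosine proximity of their vectors in a shared multilingual space exceeds a threshold; in the paper, such a threshold would be tuned by maximizing correlation with explicit references on a thematic collection:

```python
# Sketch with placeholder data: documents embedded in one multilingual
# semantic space; an "implicit link" is declared when cosine proximity
# exceeds a hypothetical threshold.
import numpy as np

rng = np.random.default_rng(0)
doc_vecs = rng.standard_normal((4, 128))      # 4 documents, shared space
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

THRESHOLD = 0.2  # hypothetical value; would be tuned against explicit links

sims = doc_vecs @ doc_vecs.T                  # pairwise cosine similarities
implicit_links = (sims > THRESHOLD) & ~np.eye(len(doc_vecs), dtype=bool)
for i, j in zip(*np.nonzero(np.triu(implicit_links))):
    print(f"implicit link: doc {i} <-> doc {j} (sim={sims[i, j]:.2f})")
```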


2018 ◽  
Vol 2 (2) ◽  
pp. 70-82 ◽  
Author(s):  
Binglu Wang ◽  
Yi Bu ◽  
Win-bin Huang

Abstract
In the field of scientometrics, the principal purpose of author co-citation analysis (ACA) is to map knowledge domains by quantifying the relationship between co-cited author pairs. However, traditional ACA has been criticized because its input is insufficiently informative: it simply counts authors' co-citation frequencies. To address this issue, this paper introduces a new method, named Document- and Keyword-Based Author Co-Citation Analysis (DKACA), that reconstructs the raw co-citation matrices by taking into account document-unit counts and the keywords of references. Building on traditional ACA, DKACA counts co-citation pairs by document units instead of by authors, from the global network perspective. Moreover, by incorporating keyword information from cited papers, DKACA captures the semantic similarity between co-cited papers. For validation, we used network visualization and MDS measurement to evaluate the effectiveness of DKACA. Results suggest that the proposed DKACA method not only reveals insights that were previously unknown but also improves the performance and accuracy of knowledge domain mapping, providing a new basis for further studies.
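
A rough sketch of the DKACA counting idea under stated assumptions: each citing document contributes one co-citation per co-cited pair, weighted by the keyword similarity of the cited papers (Jaccard overlap is used here as a stand-in for the paper's measure); all records below are toy data:

```python
# Sketch: document-unit co-citation counting weighted by keyword overlap.
# Field names and records are hypothetical.
from itertools import combinations
from collections import defaultdict

# Each citing document lists cited papers as (author, keyword-set) pairs.
citing_docs = [
    [("Smith", {"citation", "network"}), ("Lee", {"network", "mapping"})],
    [("Smith", {"citation", "network"}), ("Chen", {"topic", "model"})],
]

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

cocitation = defaultdict(float)
for doc in citing_docs:
    # Document-unit counting: each citing document contributes once per pair.
    for (a1, kw1), (a2, kw2) in combinations(doc, 2):
        pair = tuple(sorted((a1, a2)))
        cocitation[pair] += jaccard(kw1, kw2)  # keyword-weighted count

for pair, weight in cocitation.items():
    print(pair, round(weight, 3))
```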


2017 ◽  
Author(s):  
Sabrina Jaeger ◽  
Simone Fulle ◽  
Samo Turk

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to Word2vec models, where vectors of closely related words lie in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can then be encoded as vectors by summing the vectors of their individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised model on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as a reference compound representation. Mol2vec can easily be combined with ProtVec, which applies the same Word2vec concept to protein sequences, resulting in a proteochemometric approach that is alignment-independent and can thus also be used for proteins with low sequence similarity.
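
A minimal sketch of the encoding step only, with made-up data: a compound is represented as the sum of its substructure vectors, directly analogous to summing word vectors. Real Mol2vec derives substructure identifiers from Morgan fingerprints via RDKit; plain string tokens stand in for them here:

```python
# Sketch: train Word2vec on "sentences" of substructure identifiers,
# then encode a compound as the sum of its substructure vectors.
import numpy as np
from gensim.models import Word2Vec

# One "sentence" of substructure tokens per compound (toy corpus).
corpus = [["sub_1", "sub_2", "sub_3"], ["sub_2", "sub_4"], ["sub_1", "sub_4"]]
model = Word2Vec(corpus, vector_size=32, window=5, min_count=1, sg=1)

def compound_vector(substructures, model):
    """Encode a compound by summing the vectors of its substructures."""
    return np.sum([model.wv[s] for s in substructures if s in model.wv], axis=0)

vec = compound_vector(["sub_1", "sub_2"], model)
print(vec.shape)  # (32,) -> dense feature vector for supervised models
```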


2020 ◽  
Author(s):  
Kun Sun

Expectations or predictions about upcoming content play an important role during language comprehension and processing. One important aspect of recent studies of language comprehension and processing concerns the estimation of upcoming words in a sentence or discourse. Many studies have used eye-tracking data to explore computational and cognitive models of contextual word prediction and word processing, and eye-tracking data have previously been widely explored to investigate the factors that influence word prediction. However, these studies are problematic on several levels, including the stimuli, the corpora, and the statistical tools they applied. Although various computational models have been proposed for simulating contextual word predictions, past studies have usually relied on a single computational model, which often cannot give an adequate account of cognitive processing in language comprehension. To avoid these problems, this study uses a large, natural, coherent discourse as the stimulus for collecting reading-time data. It trains two state-of-the-art computational models, surprisal and semantic (dis)similarity from word vectors obtained by linear discriminative learning (LDL), measuring knowledge of both the syntagmatic and the paradigmatic structure of language. We develop a 'dynamic approach' to computing semantic (dis)similarity; this is the first time these two computational models have been combined. The models are evaluated using advanced statistical methods, and, to test the efficiency of our approach, a recently developed cosine method of computing semantic (dis)similarity from word vectors is compared with our 'dynamic' approach. The two computational models and fixed-effect statistical models can be used to cross-verify the findings, ensuring that the results are reliable. All results support that surprisal and semantic similarity are opposed in predicting the reading time of words, although both make good predictions. Additionally, our 'dynamic' approach performs better than the popular cosine method. The findings of this study are therefore significant for a better understanding of how humans process words in a real-world context and how they make predictions in language cognition and processing.
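
To make the two predictors concrete, here is a toy sketch with placeholder numbers: surprisal computed from a word's contextual probability, and semantic dissimilarity as one minus the cosine between a word's vector and the mean of the preceding context vectors. This is a deliberate simplification; the paper's dynamic approach builds on linear discriminative learning rather than this plain averaging:

```python
# Toy sketch of the two reading-time predictors; all numbers are fake.
import numpy as np

def surprisal(p_word):
    """Surprisal in bits: -log2 P(word | context)."""
    return -np.log2(p_word)

def semantic_dissimilarity(word_vec, context_vecs):
    """1 - cosine between a word vector and the mean context vector
    (a simplification of the paper's LDL-based dynamic approach)."""
    ctx = np.mean(context_vecs, axis=0)
    cos = word_vec @ ctx / (np.linalg.norm(word_vec) * np.linalg.norm(ctx))
    return 1.0 - cos

rng = np.random.default_rng(1)
context = rng.standard_normal((5, 50))   # vectors of the 5 preceding words
word = rng.standard_normal(50)
print(f"surprisal: {surprisal(0.03):.2f} bits")
print(f"dissimilarity: {semantic_dissimilarity(word, context):.3f}")
```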

