NTNU: Measuring Semantic Similarity with Sublexical Feature Representations and Soft Cardinality

Author(s):  
André Lynum ◽  
Partha Pakray ◽  
Björn Gambäck ◽  
Sergio Jimenez
2018 ◽  
Vol 2 (2) ◽  
pp. 70-82 ◽  
Author(s):  
Binglu Wang ◽  
Yi Bu ◽  
Win-bin Huang

In the field of scientometrics, the principal purpose of author co-citation analysis (ACA) is to map knowledge domains by quantifying the relationship between co-cited author pairs. However, traditional ACA has been criticized because its input is insufficiently informative: it simply counts authors' co-citation frequencies. To address this issue, this paper introduces a new method, Document- and Keyword-Based Author Co-Citation Analysis (DKACA), which reconstructs the raw co-citation matrices by taking into account document-unit counts and the keywords of references. Building on traditional ACA, DKACA counts co-citation pairs by document units rather than by authors, from a global network perspective. Moreover, by incorporating keyword information from the cited papers, DKACA captures the semantic similarity between co-cited papers. To validate the method, we use network visualization and MDS measurement to evaluate the effectiveness of DKACA. Results suggest that the proposed DKACA method not only reveals previously unknown insights but also improves the performance and accuracy of knowledge domain mapping, providing a new basis for further studies.
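
A minimal sketch of the document-unit counting with keyword-based weighting described above. The abstract does not specify the exact weighting scheme; the Jaccard overlap of keyword sets and the toy data layout below are assumptions, not the authors' implementation:

```python
from itertools import combinations

# Hypothetical data layout: each citing paper lists its cited papers,
# and each cited paper carries a set of keywords.
citing_papers = [
    {"refs": ["p1", "p2", "p3"]},
    {"refs": ["p1", "p3"]},
]
keywords = {
    "p1": {"co-citation", "scientometrics"},
    "p2": {"word embeddings", "scientometrics"},
    "p3": {"knowledge mapping"},
}

def jaccard(a, b):
    """Keyword overlap between two cited papers (0.0 if both sets are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

cocitation = {}
for paper in citing_papers:
    # Count each co-cited pair once per citing document (document-unit counting),
    # weighted by the keyword similarity of the two cited papers.
    for u, v in combinations(sorted(set(paper["refs"])), 2):
        w = 1.0 + jaccard(keywords.get(u, set()), keywords.get(v, set()))
        cocitation[(u, v)] = cocitation.get((u, v), 0.0) + w

print(cocitation)  # reconstructed (weighted) co-citation matrix entries
```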


2017 ◽  
Author(s):  
Sabrina Jaeger ◽  
Simone Fulle ◽  
Samo Turk

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to Word2vec models, where vectors of closely related words lie in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can then be encoded as vectors by summing the vectors of their individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as a reference compound representation. Mol2vec can easily be combined with ProtVec, which applies the same Word2vec concept to protein sequences, resulting in a proteochemometric approach that is alignment-independent and can thus also be easily applied to proteins with low sequence similarity.
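
A minimal sketch of the Mol2vec idea using gensim's Word2Vec. The substructure identifiers below are placeholders; the real method derives Morgan substructure identifiers with RDKit and trains on a far larger compound corpus:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy "corpus of compounds": each compound is a sentence whose tokens are
# substructure identifiers (placeholders here, Morgan substructure IDs in
# the actual method).
corpus = [
    ["847433064", "2246728737", "864942730"],
    ["847433064", "864942730", "1510328189"],
    ["2246728737", "1510328189", "847433064"],
]

# Train the unsupervised skip-gram model once, as with Word2vec on text.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

def mol2vec(substructures, model):
    """Encode a compound by summing the vectors of its substructures."""
    vecs = [model.wv[s] for s in substructures if s in model.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(model.vector_size)

compound_vec = mol2vec(["847433064", "864942730"], model)
print(compound_vec.shape)  # (100,): a dense representation with no bit collisions
```

The summed compound vector can then serve as the feature input to any supervised model, in place of a sparse fingerprint.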


2020 ◽  
Author(s):  
Kun Sun

Expectations or predictions about upcoming content play an important role in language comprehension and processing. One important aspect of recent studies of language comprehension and processing concerns the estimation of upcoming words in a sentence or discourse. Many studies have used eye-tracking data to explore computational and cognitive models of contextual word prediction and word processing, and eye-tracking data has previously been widely used to investigate the factors that influence word prediction. However, these studies are problematic on several levels, including the stimuli, corpora, and statistical tools they applied. Although various computational models have been proposed for simulating contextual word prediction, past studies have usually relied on a single computational model, which often cannot give an adequate account of cognitive processing in language comprehension. To avoid these problems, this study uses a large natural and coherent discourse as the stimulus for collecting reading-time data. It trains two state-of-the-art computational models (surprisal, and semantic (dis)similarity computed from word vectors by linear discriminative learning (LDL)), measuring knowledge of both the syntagmatic and the paradigmatic structure of language. We develop a `dynamic approach' to computing semantic (dis)similarity; this is the first time these two computational models have been merged. The models are evaluated using advanced statistical methods. In addition, to test the efficiency of our approach, a recently developed cosine method of computing semantic (dis)similarity from word vectors is used for comparison with our `dynamic' approach. The two computational models and the fixed-effect statistical models can be used to cross-verify the findings, ensuring that the results are reliable. All results support the conclusion that surprisal and semantic similarity work in opposite directions in predicting the reading times of words, although both make good predictions. Additionally, our `dynamic' approach performs better than the popular cosine method. The findings of this study are therefore of significance for a better understanding of how humans process words in a real-world context and how they make predictions in language cognition and processing.
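
A rough sketch of the two per-word predictors discussed above, under strong simplifying assumptions: surprisal from an add-one-smoothed bigram model, and a cosine-based semantic dissimilarity against the averaged context (the paper's LDL-based `dynamic' approach is more involved and is not reproduced here):

```python
import math
import numpy as np

def bigram_surprisal(prev_word, word, bigram_counts, unigram_counts, vocab_size):
    """Surprisal = -log2 P(word | prev_word), with add-one smoothing."""
    num = bigram_counts.get((prev_word, word), 0) + 1
    den = unigram_counts.get(prev_word, 0) + vocab_size
    return -math.log2(num / den)

def cosine_dissimilarity(word_vec, context_vecs):
    """1 - cosine similarity between a word vector and its averaged context."""
    context = np.mean(context_vecs, axis=0)
    cos = np.dot(word_vec, context) / (np.linalg.norm(word_vec) * np.linalg.norm(context))
    return 1.0 - cos

# Toy usage with made-up counts and vectors:
print(bigram_surprisal("the", "cat", {("the", "cat"): 3}, {"the": 10}, vocab_size=1000))
print(cosine_dissimilarity(np.array([1.0, 0.0]), [np.array([1.0, 1.0])]))
```

Both quantities are then entered as predictors of per-word reading times in a regression model, which is where the opposing directions of their effects would show up.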


2014 ◽  
Vol 6 (2) ◽  
pp. 46-51
Author(s):  
Galang Amanda Dwi P. ◽  
Gregorius Edwadr ◽  
Agus Zainal Arifin

Nowadays, a large amount of information fails to reach readers because of the misclassification of text-based documents, and misclassified data can also give readers the wrong information. The method proposed in this paper aims to classify documents into the correct groups, where each document has a membership value in several different classes. Semantic similarity is used to find the degree of similarity between two documents. In fact, no document is entirely unrelated to any other, although their relationship may be close to 0. The method calculates the similarity between two documents by taking into account the similarity of words and their synonyms. After all inter-document similarity values have been obtained, a matrix is created and used as a semi-supervised factor. The output of the method is the membership value of each document; the greatest membership value for each document indicates the group to which it belongs. The classification result computed by the method shows a good value of 90%. Index Terms - Fuzzy Co-clustering, Heuristic, Semantic Similarity, Semi-supervised Learning.
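
A minimal sketch of the synonym-aware inter-document similarity matrix described above. The synonym table and the Jaccard overlap are stand-ins; the abstract does not fully specify the paper's word-level similarity measure:

```python
import numpy as np

# Hypothetical synonym table standing in for a full thesaurus lookup.
synonyms = {"car": {"automobile"}, "big": {"large"}}

def expand(words):
    """Augment a document's word set with known synonyms."""
    out = set(words)
    for w in words:
        out |= synonyms.get(w, set())
    return out

def doc_similarity(doc_a, doc_b):
    """Word-overlap (Jaccard) similarity over synonym-expanded word sets."""
    a, b = expand(doc_a), expand(doc_b)
    return len(a & b) / len(a | b)

docs = [{"big", "car"}, {"large", "automobile"}, {"fuzzy", "clustering"}]
n = len(docs)
S = np.array([[doc_similarity(docs[i], docs[j]) for j in range(n)] for i in range(n)])
print(S)  # fed into the fuzzy co-clustering step as the semi-supervised factor
```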


2012 ◽  
Vol 38 (2) ◽  
pp. 229-235 ◽  
Author(s):  
Wen-Qing LI ◽  
Xin SUN ◽  
Chang-You ZHANG ◽  
Ye FENG
