Semantic Sentence Similarity: Size does not Always Matter

Author(s): Danny Merkx, Stefan L. Frank, Mirjam Ernestus
2021 · Vol 11 (12) · pp. 5743
Author(s): Pablo Gamallo

This article describes a compositional model based on syntactic dependencies, designed to build contextualized word vectors by following linguistic principles related to the concept of selectional preferences. The proposed compositional strategy has been evaluated on a syntactically controlled, multilingual dataset and compared with Transformer BERT-like models such as Sentence-BERT, the state of the art in sentence similarity. For this purpose, we created two new test datasets for Portuguese and Spanish, based on the existing English dataset, containing expressions with noun-verb-noun transitive constructions. The results show that the linguistics-based compositional approach is competitive with Transformer models.
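The abstract does not include code; the following is a minimal toy sketch of the general idea of dependency-based composition, where a head word's vector is shifted toward its dependents before sentences are compared by cosine similarity. The embeddings, the `contextualize` combination rule, and the mixing weight `alpha` are all illustrative assumptions, not the author's actual model.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def contextualize(head, dependent, alpha=0.5):
    # Shift the head vector toward its dependent: a crude stand-in for
    # restricting the head's meaning via selectional preferences.
    return (1 - alpha) * head + alpha * dependent

def sentence_vector(subj, verb, obj):
    # Compose a noun-verb-noun transitive construction: contextualize the
    # verb by both arguments, then average the contextualized vectors.
    v_ctx = contextualize(contextualize(verb, subj), obj)
    return (contextualize(subj, verb) + v_ctx + contextualize(obj, verb)) / 3

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["dog", "cat", "chases", "bites", "ball"]}

s1 = sentence_vector(emb["dog"], emb["chases"], emb["cat"])
s2 = sentence_vector(emb["dog"], emb["bites"], emb["cat"])
s3 = sentence_vector(emb["cat"], emb["chases"], emb["ball"])
print(cosine(s1, s2), cosine(s1, s3))
```

Sentences sharing arguments (s1, s2) end up closer than sentences with different arguments (s1, s3), which is the behaviour a compositional similarity model is evaluated on.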


2021 · pp. 1-10
Author(s): Hye-Jeong Song, Tak-Sung Heo, Jong-Dae Kim, Chan-Young Park, Yu-Seop Kim

Sentence similarity evaluation is a significant task used in machine translation, classification, and information extraction in the field of natural language processing. Given two sentences, an accurate judgment must be made as to whether their meanings are equivalent, even when the words and contexts of the sentences differ. To this end, existing studies have measured the similarity of sentences by focusing on the analysis of words, morphemes, and letters. To measure sentence similarity, this study uses sentence embeddings from Sent2Vec as well as morpheme-level word embeddings. Vectors representing words are input to a one-dimensional convolutional neural network (1D-CNN) with kernels of various sizes and to a bidirectional long short-term memory network (Bi-LSTM). Self-attention is applied to the features produced by the Bi-LSTM. The vectors from the 1D-CNN and from self-attention are then passed through global max pooling and global average pooling, respectively, to extract summary values. The vectors generated by this process are concatenated with the vector generated by Sent2Vec into a single representation. This vector is input to a softmax layer, which finally determines the similarity between the two sentences. The proposed model improves accuracy by up to 5.42 percentage points compared with conventional sentence similarity estimation models.
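The pooling-and-concatenation stage described above can be sketched with numpy as follows. All dimensions (sequence length, filter count, LSTM size, Sent2Vec size) are illustrative assumptions, and random arrays stand in for the actual network outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, n_filters, lstm_dim, sent_dim = 20, 32, 64, 100

cnn_features = rng.normal(size=(seq_len, n_filters))      # 1D-CNN output per position
attn_features = rng.normal(size=(seq_len, 2 * lstm_dim))  # self-attended Bi-LSTM output
sent2vec = rng.normal(size=sent_dim)                      # Sent2Vec sentence embedding

# Global max pooling over the 1D-CNN feature maps.
cnn_pooled = cnn_features.max(axis=0)
# Global average pooling over the self-attended Bi-LSTM features.
attn_pooled = attn_features.mean(axis=0)

# Concatenate everything into the single sentence representation
# that would feed the softmax layer.
sentence_repr = np.concatenate([cnn_pooled, attn_pooled, sent2vec])
print(sentence_repr.shape)  # (n_filters + 2*lstm_dim + sent_dim,)
```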


Author(s): Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, Percy Liang

We propose a new generative language model for sentences that first samples a prototype sentence from the training corpus and then edits it into a new sentence. Compared to traditional language models that generate from scratch either left-to-right or by first sampling a latent sentence vector, our prototype-then-edit model improves perplexity on language modeling and generates higher quality outputs according to human evaluation. Furthermore, the model gives rise to a latent edit vector that captures interpretable semantics such as sentence similarity and sentence-level analogies.
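The two-stage generative process can be illustrated with a deliberately simple toy: sample a prototype from a corpus, then edit it. In the paper the editor is a neural model conditioned on a latent edit vector; here a fixed word-substitution table stands in for the edit step, purely to show the prototype-then-edit structure. The corpus and substitution table are made up for illustration.

```python
import random

corpus = [
    "the cat sat on the mat",
    "a dog ran in the park",
    "she read a book by the fire",
]
# Stand-in for the learned editor: a fixed word-substitution table.
edits = {"cat": "kitten", "dog": "puppy", "book": "novel"}

def generate(rng):
    prototype = rng.choice(corpus)                        # stage 1: sample a prototype
    edited = [edits.get(w, w) for w in prototype.split()] # stage 2: edit it
    return prototype, " ".join(edited)

rng = random.Random(0)
proto, new = generate(rng)
print(proto, "->", new)
```

Because every output is anchored to a real prototype, generations stay fluent, and the "difference" between prototype and output (the edit) is itself a meaningful object, which is what the latent edit vector captures.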


2018 · Vol 1 (1)
Author(s): Danny Steveson, Halim Agung, Fendra Mulia

Plagiarism is a very frequent problem in many settings, including schools. The content of papers and assignments submitted by students often contains plagiarism, which reflects declining creativity among students in expressing their own ideas and opinions in the tasks they are given. To address this problem, this research uses the Rabin-Karp algorithm, a string-search algorithm that uses hashing to find any of a set of string patterns in a text. Using this application, the user can compare one document with another; the application reports sentence similarity, then breaks the result down per word and per hash, and computes the average percentage. Testing in this research was done by taking 50 samples and comparing the percentage produced by the Rabin-Karp algorithm with the percentage obtained manually, comparing one document against another. Based on the results, it can be concluded that the Rabin-Karp algorithm can be implemented in a plagiarism-detection application, as evidenced by the test over 50 samples, in which 43 samples succeeded, with a difference of 14.22%.
Keywords: document, Rabin-Karp algorithm, Dice-Sørensen index, plagiarism, sentence, word
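The abstract names but does not show the algorithm. A minimal Python implementation of Rabin-Karp rolling-hash search follows; the base and modulus are common illustrative choices, not values from the paper.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return the indices of all occurrences of `pattern` in `text`,
    found with a Rabin-Karp rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Weight of the leading character, base^(m-1) mod `mod`.
    high = pow(base, m - 1, mod)
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # On a hash hit, verify the substring to rule out collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches

print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

A plagiarism checker along these lines hashes n-grams of one document and scans the other for matching hashes, counting hits to produce a similarity percentage.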

