Cross Lingual Semantic Search by Improving Semantic Similarity and Relatedness Measures

Author(s):  
Nitish Aggarwal
2020 ◽  
Vol 12 (4) ◽  
pp. 67
Author(s):  
Anna Formica ◽  
Elaheh Pourabbas ◽  
Francesco Taglino

This paper presents SemSime, a method based on semantic similarity for searching over a set of digital resources previously annotated with concepts from a weighted reference ontology. SemSime is an enhancement of SemSim: it uses a frequency approach for weighting the ontology, and it refines both the user request and the digital resources by adding rating scores. These scores are High, Medium, and Low. In the user request they indicate the preference the user assigns to each concept representing a search criterion, whereas in the annotation of the digital resources they represent the level of quality with which each concept describes the resource. SemSime has been evaluated, and the experimental results show that it performs better than both SemSim and an evolution of it referred to as SemSimRV.


Informatics ◽  
2019 ◽  
Vol 6 (2) ◽  
pp. 19 ◽  
Author(s):  
Rajat Pandit ◽  
Saptarshi Sengupta ◽  
Sudip Kumar Naskar ◽  
Niladri Sekhar Dash ◽  
Mohini Mohan Sardar

Semantic similarity is a long-standing problem in natural language processing (NLP). It is a topic of great interest because understanding it can provide insight into how human beings comprehend meaning and make associations between words. However, when the problem is viewed from the standpoint of machine understanding, particularly for under-resourced languages, it poses a different problem altogether. In this paper, semantic similarity is explored in Bangla, a less-resourced language. To ameliorate the situation in such languages, the most rudimentary method (path-based) and the latest state-of-the-art method (Word2Vec) for semantic similarity calculation were augmented using cross-lingual resources in English, and the results obtained are truly astonishing. Two semantic similarity approaches were explored in Bangla, namely the path-based and the distributional models, and their cross-lingual counterparts were synthesized in light of the English WordNet and corpora. The proposed methods were evaluated on a dataset comprising 162 Bangla word pairs, which were annotated by five expert raters. The correlation scores obtained between the four metrics and the human evaluation scores demonstrate the marked enhancement that the cross-lingual approach brings to semantic similarity calculation for Bangla.
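The cross-lingual path-based idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the hypernym graph is a toy stand-in for the English WordNet, the similarity formula 1 / (1 + shortest path length) is one common path-based formulation, and the romanized Bangla entries in `TRANSLATIONS` are a hypothetical three-word lexicon.

```python
from collections import deque

# Toy undirected hypernym graph standing in for the English WordNet (assumption).
EDGES = [
    ("entity", "animal"),
    ("entity", "vehicle"),
    ("animal", "dog"),
    ("animal", "cat"),
    ("vehicle", "car"),
]

# Hypothetical romanized Bangla-to-English lexicon, for illustration only.
TRANSLATIONS = {"kukur": "dog", "biral": "cat", "gari": "car"}


def build_graph(edges):
    """Turn an edge list into an adjacency map with edges in both directions."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the edge count, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(graph, a, b):
    """Path-based similarity: 1 / (1 + shortest path length), 0.0 if no path."""
    dist = shortest_path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


def cross_lingual_path_similarity(graph, word_a, word_b, translations):
    """Translate both source-language words, then score them in the English graph."""
    en_a = translations.get(word_a)
    en_b = translations.get(word_b)
    if en_a is None or en_b is None:
        return 0.0
    return path_similarity(graph, en_a, en_b)


GRAPH = build_graph(EDGES)
```

Calling `cross_lingual_path_similarity(GRAPH, "kukur", "biral", TRANSLATIONS)` scores the pair through its English translations, which is the sense in which the English resource compensates for a sparse source-language taxonomy.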


2020 ◽  
pp. 1-51
Author(s):  
Ivan Vulić ◽  
Simon Baker ◽  
Edoardo Maria Ponti ◽  
Ulla Petti ◽  
Ira Leviant ◽  
...  

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses that can be helpful in guiding future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
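Benchmarks of this kind are typically used by comparing a model's similarity scores for each concept pair against the human ratings. The abstract does not name the metric, so the sketch below assumes Spearman's rank correlation, a common choice for word-similarity evaluation. The scores in the usage lines are invented toy values, not Multi-SimLex data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def spearman(model_scores, human_scores):
    """Spearman's rho: Pearson correlation computed on tie-averaged ranks."""
    if len(model_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(model_scores), average_ranks(human_scores))


# Toy usage: one model score and one human rating per concept pair.
model_scores = [0.91, 0.15, 0.48, 0.70]
human_scores = [5.2, 0.8, 4.1, 3.0]
rho = spearman(model_scores, human_scores)
```

Working on ranks rather than raw values means the model's score scale (for example cosine in [-1, 1]) does not need to match the annotators' rating scale.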


2019 ◽  
Vol 9 (16) ◽  
pp. 3318
Author(s):  
Azmat Anwar ◽  
Xiao Li ◽  
Yating Yang ◽  
Yajuan Wang

Although considerable effort has been devoted to building commonsense knowledge bases (CKBs), they are still not available for many low-resource languages such as Uyghur because of the high cost of construction. To address this issue, we proposed a cross-lingual knowledge-projection method that constructs a Uyghur CKB by projecting ConceptNet’s Chinese facts into Uyghur. We used a Chinese–Uyghur bilingual dictionary to obtain high-quality translations of the entities in facts and employed a back-translation method to eliminate entity-translation ambiguity. Moreover, to tackle the inner relation ambiguity in translated facts, we made a hand-crafted rule to convert the structured facts into natural-language phrases and built the Chinese–Uyghur lingual phrases based on the similarity of phrases that corresponded to the bilingual semantic similarity scoring model. Experimental results show that the accuracy of our semantic similarity scoring model reached 94.75% for our task, and that we successfully projected 55,872 Chinese facts into Uyghur, obtaining 67,375 Uyghur facts within a very short period.
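The abstract does not spell out how back-translation resolves entity ambiguity. One plausible reading, sketched here as an assumption rather than the authors' procedure, is a round-trip check: a target-language candidate is kept only if translating it back yields the original source term. The dictionary entries are placeholder tokens, not real Chinese or Uyghur words.

```python
def back_translation_filter(source_term, forward_dict, backward_dict):
    """Keep only the candidates whose back-translations include the source term."""
    candidates = forward_dict.get(source_term, [])
    return [
        candidate
        for candidate in candidates
        if source_term in backward_dict.get(candidate, [])
    ]


# Hypothetical bilingual dictionary in both directions (placeholder tokens).
FORWARD = {
    "src_bank": ["tgt_bank_money", "tgt_river_bank"],
    "src_apple": ["tgt_apple"],
}
BACKWARD = {
    "tgt_bank_money": ["src_bank"],
    "tgt_river_bank": ["src_shore"],
    "tgt_apple": ["src_apple"],
}

kept = back_translation_filter("src_bank", FORWARD, BACKWARD)
```

Here `tgt_river_bank` is dropped because its back-translation does not return `src_bank`. The check trades recall for precision, since a correct candidate can be lost when the reverse dictionary is incomplete.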


Author(s):  
Aida Hakimova ◽  
Michael Charnine ◽  
Aleksey Klokov ◽  
Evgenii Sokolov

This paper develops a methodology for evaluating the semantic similarity of texts in different languages. The study is based on the hypothesis that the proximity of vector representations of terms in semantic space can be interpreted as semantic similarity in a cross-lingual environment. Each text is associated with a vector in a single multilingual semantic vector space, and the semantic similarity of two texts is determined by the proximity of the corresponding vectors. We propose a quantitative indicator called the Index of Semantic Textual Similarity (ISTS) that measures the degree of semantic similarity of multilingual texts on the basis of identified cross-lingual semantic implicit links. The setting of parameters is based on the correlation with the presence of a formal reference between documents. The measure of semantic similarity expresses the existence of two common terms, phrases or word combinations. Optimal parameters of the algorithm for identifying implicit links are selected on the thematic collection by maximizing the correlation of explicit and implicit connections. The developed algorithm can facilitate the search for close documents in the analysis of multilingual patent documentation.
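The underlying hypothesis, that texts close together in a shared multilingual vector space are semantically similar, can be illustrated with a minimal sketch. This is not the ISTS indicator itself: averaging term vectors and using cosine as the proximity measure are assumptions made here, and the two-dimensional vectors are invented values in a hypothetical shared English-French space.

```python
import math

# Hypothetical shared multilingual space: term -> vector (invented toy values).
TERM_VECTORS = {
    "engine": [1.0, 0.0],
    "moteur": [0.9, 0.1],
    "flower": [0.0, 1.0],
    "fleur": [0.1, 0.9],
}


def text_vector(tokens, vectors):
    """Average the vectors of known tokens; None if no token is in the space."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def text_similarity(tokens_a, tokens_b, vectors):
    """Proximity of two tokenized texts in the shared space."""
    vec_a = text_vector(tokens_a, vectors)
    vec_b = text_vector(tokens_b, vectors)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


related = text_similarity(["engine"], ["moteur"], TERM_VECTORS)
unrelated = text_similarity(["engine"], ["fleur"], TERM_VECTORS)
```

With these toy vectors the English-French pair on the same topic scores higher than the pair on different topics, which is the behavior the hypothesis predicts.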

