Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC

Semantic similarity is a long-standing problem in natural language processing (NLP). It is of great interest because understanding it offers insight into how humans comprehend meaning and make associations between words. When the problem is viewed from the standpoint of machine understanding, however, particularly for under-resourced languages, it poses a different challenge altogether. This paper explores semantic similarity in Bangla, a less-resourced language. To improve the situation for such languages, the most rudimentary method (path-based) and the latest state-of-the-art method (Word2Vec) for semantic similarity calculation were augmented with cross-lingual resources in English, namely the English WordNet and English corpora. The proposed methods were evaluated on a dataset of 162 Bangla word pairs annotated by five expert raters. The correlation scores between the four metrics (the path-based model, the distributional model and their cross-lingual counterparts) and the human evaluation scores show that the cross-lingual approach brings a marked improvement to semantic similarity calculation for Bangla.
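
The path-based idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy taxonomy, the romanized Bangla-to-English lexicon and the function names are all hypothetical, and the score 1 / (1 + shortest path length) is one common path-based formulation assumed here for illustration. The cross-lingual variant simply maps each source word to English before measuring the path.

```python
from collections import deque

# Hypothetical toy "is-a" graph standing in for an English WordNet.
TOY_TAXONOMY = {
    "entity": ["animal", "vehicle"],
    "animal": ["entity", "dog", "cat"],
    "vehicle": ["entity", "car"],
    "dog": ["animal"],
    "cat": ["animal"],
    "car": ["vehicle"],
}

# Hypothetical bilingual lexicon (romanized Bangla -> English), for illustration only.
TOY_LEXICON = {"kukur": "dog", "biral": "cat", "gari": "car"}


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph, a, b):
    """Path-based score: 1 / (1 + shortest path length), 0.0 if no path exists."""
    dist = shortest_path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


def cross_lingual_similarity(graph, lexicon, a, b):
    """Translate both words into the pivot language, then score them there."""
    if a not in lexicon or b not in lexicon:
        return 0.0  # no translation available in the toy lexicon
    return path_similarity(graph, lexicon[a], lexicon[b])


if __name__ == "__main__":
    print(cross_lingual_similarity(TOY_TAXONOMY, TOY_LEXICON, "kukur", "biral"))
    print(cross_lingual_similarity(TOY_TAXONOMY, TOY_LEXICON, "kukur", "gari"))
```

In an evaluation like the one described, scores of this kind would be correlated with the human ratings for each word pair; that step is omitted here.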

Document Similarity for Arabic and Cross-Lingual Web Content

Communications in Computer and Information Science - Arabic Language Processing: From Theory to Practice, 2018, pp. 134-146

DOI: 10.1007/978-3-319-73500-9_10

Author(s): Ali Salhi, Adnan H. Yahya

Keyword(s): Web Content, Document Similarity, Cross Lingual

Document similarity calculation model of CSLN

2014 IEEE 5th International Conference on Software Engineering and Service Science, 2014

DOI: 10.1109/icsess.2014.6933701

Cited By ~ 1

Author(s): Weiling Chen, Gang Wang, Fengxia Yin

Keyword(s): Calculation Model, Document Similarity, Similarity Calculation

Cross-lingual Named Entity Recognition

Lingvisticae Investigationes, 2007, Vol 30 (1), pp. 135-162

DOI: 10.1075/li.30.1.09ste

Cited By ~ 4

Author(s): Ralf Steinberger, Bruno Pouliquen

Keyword(s): State Of The Art, Named Entity Recognition, Online News, Entity Recognition, Minimal Amount, Document Similarity, News Analysis, Named Entity, Analysis Application, Cross Lingual

Named Entity Recognition and Classification (NERC) is a known and well-explored text analysis application that has been applied to various languages. We present an automatic, highly multilingual news analysis system that fully integrates NERC for locations, persons and organisations with document clustering, multi-label categorisation, name attribute extraction, name variant merging and the calculation of social networks. The proposed application goes beyond the state of the art by automatically merging the information found in news written in ten different languages, and by using the aggregated name information to automatically link related news documents across languages for all 45 language pair combinations. While state-of-the-art approaches for cross-lingual name variant merging and document similarity calculation require bilingual resources, the methods proposed here are mostly language-independent and require a minimal amount of monolingual language-specific effort. The development of resources for additional languages is therefore kept to a minimum, and new languages can be plugged into the system effortlessly. The presented online news analysis application is fully functional and had, by the end of 2006, reached average usage statistics of 600,000 hits per day.

Research on document similarity calculation and detection based on deep learning

Journal of Physics Conference Series, 2021, Vol 1757 (1), pp. 012007

DOI: 10.1088/1742-6596/1757/1/012007

Author(s): Cui Xing, Yan Yang, Jian Luo

Keyword(s): Deep Learning, Document Similarity, Similarity Calculation

Building a multi-domain comparable corpus using a learning to rank method

Natural Language Engineering, 2016, Vol 22 (4), pp. 627-653

DOI: 10.1017/s1351324916000164

Cited By ~ 3

Author(s): Razieh Rahimi, Azadeh Shakery, Javid Dadashkarimi, Mozhdeh Ariannezhad, Mostafa Dehghani, ...

Keyword(s): Learning To Rank, Training Data, Target Language, Document Similarity, Comparable Corpora, Linguistic Resources, Source Document, Candidate Target, Target Languages, Cross Lingual

Comparable corpora are key translation resources for both languages and domains with limited linguistic resources. The existing approaches for building comparable corpora are mostly based on ranking candidate documents in the target language for each source document using a cross-lingual retrieval model. These approaches also exploit other evidence of document similarity, such as proper names and publication dates, to build more reliable alignments. However, the importance of each piece of evidence in the scores of candidate target documents is determined heuristically. In this paper, we employ a learning to rank method for ranking candidate target documents with respect to each source document. The ranking model is constructed by defining each piece of evidence for the similarity of bilingual documents as a feature whose weight is learned automatically. Learning feature weights can significantly improve the quality of alignments, because the reliability of features depends on the characteristics of both the source and target languages of a comparable corpus. We also propose a method to generate appropriate training data for the task of building comparable corpora. We employed the proposed learning-based approach to build a multi-domain English–Persian comparable corpus which covers twelve different domains obtained from the Open Directory Project. Experimental results show that the created alignments have high degrees of comparability. Comparison with existing approaches for building comparable corpora shows that our learning-based approach improves both the quality and the coverage of alignments.
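
As a rough illustration of learning feature weights instead of setting them heuristically, the sketch below trains a simple pairwise perceptron ranker. The abstract does not say which learning to rank algorithm the authors used, so the perceptron is a stand-in; the three features (retrieval score, proper-name overlap, publication-date closeness) follow the kinds of evidence the abstract mentions, but every number and name here is hypothetical.

```python
def score(weights, features):
    """Linear score: weighted sum of the evidence features."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(pairs, n_features, epochs=20, lr=0.1):
    """Pairwise perceptron (a stand-in technique, not the paper's method).

    pairs: list of (better_features, worse_features) for the same source document.
    Whenever the better candidate does not outscore the worse one,
    the weights are nudged toward the difference of the two feature vectors.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            if score(weights, better) <= score(weights, worse):
                weights = [w + lr * (b - c) for w, b, c in zip(weights, better, worse)]
    return weights


def rank(weights, candidates):
    """Return candidate ids sorted from highest to lowest score."""
    return sorted(candidates, key=lambda name: score(weights, candidates[name]), reverse=True)


# Hypothetical features: [retrieval score, proper-name overlap, date closeness].
training_pairs = [
    ([0.6, 0.9, 0.8], [0.7, 0.1, 0.2]),
    ([0.5, 0.8, 0.9], [0.8, 0.2, 0.1]),
]
weights = train_pairwise(training_pairs, n_features=3)

# Hypothetical candidate target documents for one source document.
candidates = {"doc_a": [0.9, 0.1, 0.1], "doc_b": [0.5, 0.9, 0.7]}
print(rank(weights, candidates))
```

With this toy data the learned weights favour name overlap and date closeness over the raw retrieval score, which shows how the relative reliability of each feature can be read off the training pairs rather than fixed by hand.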

Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC

Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies

Bilingual Document Similarity Calculation Based on Bilingual Word Embedding

Cross-lingual document similarity estimation and dictionary generation with comparable corpora

An Efficient Semantic Document Similarity Calculation Method Based on Double-Relations in Gene Ontology

Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language

Document Similarity for Arabic and Cross-Lingual Web Content

Document similarity calculation model of CSLN

Cross-lingual Named Entity Recognition

Research on document similarity calculation and detection based on deep learning

Building a multi-domain comparable corpus using a learning to rank method