Optimizing Document Similarity Detection in Persian Information Retrieval

Omid Kashefi ; Nina Mohseni ; Behrouz Minaei

doi:10.4156/jcit.vol5.issue2.11

Journal of Convergence Information Technology ◽

10.4156/jcit.vol5.issue2.11 ◽

2010 ◽

Vol 5 (2) ◽

pp. 101-106 ◽

Cited By ~ 3

Author(s):

Omid Kashefi ◽

Nina Mohseni ◽

Behrouz Minaei

Keyword(s):

Information Retrieval ◽

Document Similarity ◽

Similarity Detection

A Review On Language Specific Multi Document Similarity Detection

2019 IEEE 5th International Conference for Convergence in Technology (I2CT) ◽

10.1109/i2ct45611.2019.9033688 ◽

2019 ◽

Author(s):

Achala Piyarathna ◽

Guhanathan Poravi

Keyword(s):

Document Similarity ◽

Similarity Detection

Document Similarity Detection Using Indonesian Language Word2vec Model

2019 3rd International Conference on Informatics and Computational Sciences (ICICoS) ◽

10.1109/icicos48119.2019.8982432 ◽

2019 ◽

Author(s):

Nahda Rosa Ramadhanti ◽

Siti Mariyah

Keyword(s):

Document Similarity ◽

Similarity Detection

A New Approach for Text Similarity Using Articles

International Journal of Information Technology & Decision Making ◽

10.1142/s021962200800279x ◽

2008 ◽

Vol 07 (01) ◽

pp. 23-34 ◽

Cited By ~ 5

Author(s):

Elsayed Atlam

Keyword(s):

Information Retrieval ◽

Text Analysis ◽

Traditional Method ◽

Text Similarity ◽

Traditional Methods ◽

Document Similarity ◽

New Approach ◽

Text Collections ◽

Subject Areas

Conventional approaches to text analysis and information retrieval, which measure document similarity by considering all information in the texts, are relatively inefficient for processing large text collections in heterogeneous subject areas. Previous research showed that evidence from passages can improve retrieval results, but it also raised questions about how passages are defined, how they can be ranked efficiently, and what their proper role is in long structured documents. Moreover, the frequency of "the" in important sentences can be used to summarize a text efficiently. We previously proposed an approach that extracts sentences containing the article "the" under a set of restrictive rules in order to obtain effective passages. Building on that work, this paper presents a new Passage SIMilarity (P-SIM) measure between documents, based on the effective passages extracted using the article "the". Our results show that this method is more efficient than traditional methods. Recall and Precision reach 92.6% and 97.5% respectively, depending on the extracted passages, and improve significantly, by 38.3% and 44.2%, over the traditional method. The proposed methods are applied to 3,990 articles from a large tagged corpus.
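
The passage-based idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the paper's extraction rules or the P-SIM formula, so everything below is an assumption: `extract_passages` keeps any sentence containing "the", and `passage_similarity` uses plain cosine similarity over term counts as a stand-in for P-SIM. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def extract_passages(text):
    """Return the sentences that contain the article "the".

    Assumption: a bare keyword filter stands in for the paper's
    restrictive extraction rules, which the abstract does not list.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if re.search(r"\bthe\b", s, flags=re.IGNORECASE)]


def term_vector(passages):
    """Count lowercase word occurrences in the passages, ignoring "the" itself."""
    words = re.findall(r"[a-z0-9]+", " ".join(passages).lower())
    return Counter(w for w in words if w != "the")


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def passage_similarity(doc_a, doc_b):
    """Compare two documents using only their extracted passages.

    Assumption: cosine similarity is used here in place of P-SIM.
    """
    return cosine(term_vector(extract_passages(doc_a)),
                  term_vector(extract_passages(doc_b)))
```

Because only the extracted passages are compared, sentences without "the" contribute nothing to the score, which is the source of the efficiency gain the abstract describes.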

Erratum to: Large expert-curated database for benchmarking document similarity detection in biomedical literature search

Database ◽

10.1093/database/baz138 ◽

2020 ◽

Vol 2020 ◽

Author(s):

Peter Brown ◽

Yaoqi Zhou ◽

Keyword(s):

Literature Search ◽

Biomedical Literature ◽

Document Similarity ◽

Similarity Detection

Efficient Document Similarity Detection Using Weighted Phrase Indexing

International Journal of Multimedia and Ubiquitous Engineering ◽

10.14257/ijmue.2016.11.5.21 ◽

2016 ◽

Vol 11 (5) ◽

pp. 231-244

Author(s):

Papias Niyigena ◽

Zhang Zuping ◽

Mansoor Ahmed Khuhro ◽

Damien Hanyurwimfura

Keyword(s):

Document Similarity ◽

Similarity Detection

Document similarity detection using semantic social network analysis on RDF citation graph

2013 IEEE 9th International Conference on Emerging Technologies (ICET) ◽

10.1109/icet.2013.6743548 ◽

2013 ◽

Cited By ~ 6

Author(s):

Qamar Mahmood ◽

Muhammad Abdul Qadir ◽

Muhammad Tanvir Afzal

Keyword(s):

Social Network ◽

Social Network Analysis ◽

Network Analysis ◽

Document Similarity ◽

Similarity Detection ◽

Citation Graph

Document-Document similarity matrix and Multiple-Kernel Fuzzy C-Means Algorithm-based web document clustering for information retrieval

IJARCCE ◽

10.17148/ijarcce.2014.31054 ◽

2014 ◽

pp. 8317-8321

Author(s):

Poonam Yadav

Keyword(s):

Information Retrieval ◽

Document Clustering ◽

Similarity Matrix ◽

Document Similarity ◽

Fuzzy C Means ◽

Web Document ◽

Multiple Kernel ◽

Web Document Clustering ◽

Fuzzy C Means Algorithm

Privacy Preserving Scheme for Document Similarity Detection

Turkish Journal of Electrical Engineering & Computer Sciences ◽

10.3906/elk-2104-32 ◽

2021 ◽

Keyword(s):

Privacy Preserving ◽

Document Similarity ◽

Similarity Detection

Dimensionality Reduction for Efficient Document Similarity Detection in Big Datasets

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2017.6582 ◽

2017 ◽

Vol 14 (6) ◽

pp. 2829-2837

Author(s):

Papias Niyigena ◽

Zhang Zuping ◽

Ammar Oad

Keyword(s):

Dimensionality Reduction ◽

Document Similarity ◽

Similarity Detection

Scalable self-organizing structured P2P information retrieval model based on equivalence classes

The International Arab Journal of Information Technology ◽

10.34028/iajit/11/1/1 ◽

2013 ◽

Vol 11 (1) ◽

pp. 78-86

Author(s):

Yaser Al-Lahham ◽

Mohammad Hassan

Keyword(s):

Information Retrieval ◽

Relevant Information ◽

Network Size ◽

Equivalence Classes ◽

Locality Sensitive Hashing ◽

Document Similarity ◽

Proposed Model ◽

Universal Equivalence ◽

Local Equivalence ◽

Self Organizing

This paper proposes a new autonomous, self-organizing, content-based node-clustering peer-to-peer Information Retrieval (P2PIR) model. The model uses an incremental transitive document-to-document similarity technique to build Local Equivalence Classes (LECes) of documents on a source node. A Locality Sensitive Hashing (LSH) scheme is applied to map a representative of each LEC into a set of keys, which are published to the hosting node(s). Similar LECes on different nodes form Universal Equivalence Classes (UECes), which indicate the connectivity between these nodes. The same LSH scheme is used to submit queries to the subset of nodes most likely to hold relevant information. The proposed model has been implemented. The results indicate that it builds connectivity between similar nodes efficiently, and that it correctly allocates and retrieves relevant answers for a high percentage of queries. The system was tested at different network sizes and proved scalable: efficiency degraded gracefully as the network size grew exponentially.
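
The key-publishing step described in this abstract can be sketched as follows. The abstract does not say which LSH family the model uses, so this sketch assumes MinHash with banding, one common choice. The function names, the parameters `num_hashes` and `bands`, and the modulo-based node assignment are all hypothetical and are not taken from the paper.

```python
import hashlib


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=16):
    """MinHash signature: for each seed, the smallest hash over the token set."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("cannot build a signature from an empty token set")
    return [min(_hash(seed, t) for t in token_set) for seed in range(num_hashes)]


def lsh_keys(tokens, num_hashes=16, bands=4):
    """Split the signature into bands and hash each band into one key.

    Two representatives share a key when they agree on every row of a band,
    which becomes more likely as their token sets overlap more.
    """
    signature = minhash_signature(tokens, num_hashes)
    rows = num_hashes // bands
    keys = []
    for band in range(bands):
        chunk = signature[band * rows:(band + 1) * rows]
        keys.append(hashlib.sha1(f"{band}:{chunk}".encode("utf-8")).hexdigest())
    return keys


def hosting_node(key, num_nodes):
    """Assumption: a key is assigned to a hosting node by simple modulo."""
    return int(key, 16) % num_nodes
```

Under these assumptions, a source node would publish `lsh_keys(representative_tokens)` for each LEC, and a query would be hashed the same way so that it reaches the nodes hosting matching keys.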
