Document similarity computation by combining multiple representation models



Large-scale document similarity computation based on cloud computing platform

2011 6th International Conference on Pervasive Computing and Applications, 2011
DOI: 10.1109/icpca.2011.6106499

Author(s): Chaobo He, Yong Tang, Feiyi Tang, Atiao Yang

Keyword(s): Cloud Computing, Large Scale, Document Similarity, Computing Platform, Cloud Computing Platform, Similarity Computation


Using probabilistic topic models for document similarity computation

Computer Science and Applications, 2015, pp. 317-326
DOI: 10.1201/b18508-55

Keyword(s): Topic Models, Document Similarity, Probabilistic Topic Models, Similarity Computation


Secure and Scalable Document Similarity on Distributed Databases: Differential Privacy to the Rescue

Proceedings on Privacy Enhancing Technologies, 2020, Vol 2020 (2), pp. 209-229
DOI: 10.2478/popets-2020-0024

Author(s): Phillipp Schoppmann, Lennart Vogelsang, Adrià Gascón, Borja Balle

Keyword(s): Data Analysis, Differential Privacy, Distributed Databases, Text Documents, Text Data, Document Similarity, Learning Tasks, Speed Up, Similarity Computation, Collaborative Data Analysis

Privacy-preserving collaborative data analysis enables richer models than what each party can learn with their own data. Secure Multi-Party Computation (MPC) offers a robust cryptographic approach to this problem, and in fact several protocols have been proposed for various data analysis and machine learning tasks. In this work, we focus on secure similarity computation between text documents, and the application to k-nearest neighbors (k-NN) classification. Due to its non-parametric nature, k-NN presents scalability challenges in the MPC setting. Previous work addresses these by introducing non-standard assumptions about the abilities of an attacker, for example by relying on non-colluding servers. In this work, we tackle the scalability challenge from a different angle, and instead introduce a secure preprocessing phase that reveals differentially private (DP) statistics about the data. This allows us to exploit the inherent sparsity of text data and significantly speed up all subsequent classifications.


Hybrid Segmentation Prototype for Arabic Text-Based Documents

International Journal of Service Science Management Engineering and Technology, 2015, Vol 6 (1), pp. 63-74
DOI: 10.4018/ijssmet.2015010104
Cited by: ~3

Author(s): Sonia Alouane-Ksouri, Minyar Sassi Hidri

Keyword(s): Morphological Analysis, Matrix Representation, Document Analysis, Arabic Text, Computation Model, Document Similarity, Similarity Computation, Hybrid Segmentation, Abstraction Levels, Processing Steps

The contribution of this work relates to the field of Arabic text-based document analysis for the detection of plagiarism. This analysis will be carried out according to the triadic computation model of document similarity. The authors propose a hybrid segmentation prototype for Arabic text-based documents that links different processing steps in order to generate the similarity rate between the documents of an Arabic corpus. It involves two segmentation systems and a morphological analysis in order to obtain a matrix representation adapted to the triadic similarity computation according to three abstraction levels: documents, sentences and words.
