Efficient Two-Step Protocol and Its Discriminative Feature Selections in Secure Similar Document Detection
The risk of information disclosure has increased significantly in recent years. Accordingly, privacy-preserving data mining (PPDM) is being actively studied as a way to obtain accurate mining results while preserving data privacy. We focus here on secure similar document detection (SSDD), which identifies similar documents held by two parties without either party disclosing its sensitive documents to the other. In this paper, we propose an efficient two-step protocol that uses feature selection as a lower-dimensional transformation, and we present discriminative feature selections that maximize the performance of the protocol. The proposed protocol consists of two steps: the filtering step and the postprocessing step. For the feature selection, we first consider the simplest method, random projection (RP), and propose its two-step solution, SSDD-RP. We then present two discriminative feature selections and their solutions: SSDD-LF, which selects a few dimensions that are locally frequent in the current query vector, and SSDD-GF, which selects dimensions that are globally frequent in the set of all document vectors. Finally, we propose a hybrid selection, SSDD-HF, which takes advantage of both SSDD-LF and SSDD-GF. We empirically show that the proposed two-step protocol outperforms the previous one-step protocol by three or four orders of magnitude.
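The filter-then-postprocess idea can be illustrated with a minimal plaintext sketch. It makes several assumptions that the abstract does not state: documents are term-frequency vectors, similarity is measured by Euclidean distance against a tolerance, and the feature selection simply picks the k largest entries of the query vector (in the spirit of a locally frequent selection). All function names are hypothetical, and the secure computation between the two parties is omitted entirely; only the two-step structure is shown.

```python
import math


def select_local_frequent(query, k):
    """Return indices of the k largest entries of the query vector.

    Hypothetical stand-in for a locally-frequent feature selection.
    """
    return sorted(range(len(query)), key=lambda i: query[i], reverse=True)[:k]


def partial_distance(u, v, dims):
    """Euclidean distance restricted to the given dimensions."""
    return math.sqrt(sum((u[i] - v[i]) ** 2 for i in dims))


def two_step_search(query, documents, k, tolerance):
    """Return (candidate indices, matching indices) for one query vector."""
    dims = select_local_frequent(query, k)

    # Filtering step: the distance over a subset of dimensions never
    # exceeds the full distance, so discarding documents whose partial
    # distance is above the tolerance cannot drop a true match.
    candidates = [
        idx for idx, doc in enumerate(documents)
        if partial_distance(query, doc, dims) <= tolerance
    ]

    # Postprocessing step: compute the full distance only for candidates.
    all_dims = range(len(query))
    matches = [
        idx for idx in candidates
        if partial_distance(query, documents[idx], all_dims) <= tolerance
    ]
    return candidates, matches


query = [5, 0, 3, 0, 1]
documents = [
    [5, 0, 3, 0, 1],  # identical to the query
    [5, 0, 3, 4, 1],  # agrees on the selected dimensions only
    [0, 2, 0, 0, 1],  # differs on the selected dimensions
]
candidates, matches = two_step_search(query, documents, k=2, tolerance=1.0)
```

In this toy run the third document is discarded using only two dimensions, and the expensive full-dimensional comparison is performed for the two surviving candidates. A more discriminative selection leaves fewer candidates for the postprocessing step.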