Secure and Scalable Document Similarity on Distributed Databases: Differential Privacy to the Rescue

2020 ◽  
Vol 2020 (2) ◽  
pp. 209-229
Author(s):  
Phillipp Schoppmann ◽  
Lennart Vogelsang ◽  
Adrià Gascón ◽  
Borja Balle

Abstract
Privacy-preserving collaborative data analysis enables richer models than what each party can learn with their own data. Secure Multi-Party Computation (MPC) offers a robust cryptographic approach to this problem, and in fact several protocols have been proposed for various data analysis and machine learning tasks. In this work, we focus on secure similarity computation between text documents, and the application to k-nearest neighbors (k-NN) classification. Due to its non-parametric nature, k-NN presents scalability challenges in the MPC setting. Previous work addresses these by introducing non-standard assumptions about the abilities of an attacker, for example by relying on non-colluding servers. In this work, we tackle the scalability challenge from a different angle, and instead introduce a secure preprocessing phase that reveals differentially private (DP) statistics about the data. This allows us to exploit the inherent sparsity of text data and significantly speed up all subsequent classifications.
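The core idea of the preprocessing phase can be illustrated outside of MPC: release noisy per-term statistics under DP, then use them to prune terms before the expensive secure computation. The sketch below uses a per-term Laplace mechanism on document-frequency counts; the function name, the threshold, and the pruning rule are illustrative assumptions, not the paper's actual protocol.

```python
import numpy as np

def dp_document_frequencies(doc_freqs, epsilon, rng=None):
    """Release per-term document-frequency counts with Laplace noise.

    Each document changes a term's count by at most 1, so per-term
    L1 sensitivity is 1 and Laplace(1/epsilon) noise gives epsilon-DP
    per term (composition across terms is not accounted for here).
    """
    rng = np.random.default_rng(rng)
    noise = rng.laplace(scale=1.0 / epsilon, size=len(doc_freqs))
    # Clip to zero: negative counts carry no information for pruning.
    return np.maximum(doc_freqs + noise, 0.0)

# A later secure k-NN phase could skip terms whose noisy count is tiny,
# exploiting text sparsity to shrink the vectors the MPC must touch.
counts = np.array([120.0, 3.0, 0.0, 57.0])
noisy = dp_document_frequencies(counts, epsilon=1.0, rng=0)
keep = noisy > 5.0  # hypothetical pruning threshold
```

The pruning step is where the speedup comes from: the secure protocol only processes the (few) terms that the DP statistics show to be frequent.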

Author(s):  
Grace L. Samson ◽  
Joan Lu

Abstract
We present a new detection method for color-based object detection, which can improve the performance of learning procedures in terms of speed, accuracy, and efficiency, using spatial inference and algorithms. We applied the model to human skin detection from an image; however, the method can also work for other machine learning tasks involving image pixels. We propose (1) an improved RGB/HSL human skin color threshold to tackle the problem of detecting darker human skin, and (2) a new rule-based fast algorithm (packed k-dimensional tree, PKT) that depends on an improved spatial structure for human skin/face detection from colored 2D images. We also implemented a novel packed quad-tree (PQT) to speed up quad-tree performance in terms of indexing. We compared the proposed system to traditional pixel-by-pixel (PBP)/pixel-wise (PW) operations and quadtree-based procedures. The results show that our proposed spatial structure performs better (with a very low false hit rate, very high precision, and high accuracy) than most state-of-the-art models.
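To make the thresholding idea concrete, here is a widely cited classic RGB skin rule (Kovac et al.), not the paper's improved RGB/HSL threshold for darker skin tones, which is not reproduced in the abstract:

```python
def is_skin_rgb(r, g, b):
    """Classic daylight RGB skin-color rule (Kovac et al.).

    A pixel is classified as skin when all conditions hold; the
    paper's improved threshold extends rules like these to darker
    skin tones, but those exact bounds are not given here.
    """
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15
            and r > g and r > b)

light_skin = is_skin_rgb(220, 160, 130)  # typical light skin tone
background = is_skin_rgb(30, 30, 30)     # dark grey background
```

The PKT/PQT structures then avoid evaluating such a rule pixel-by-pixel by indexing spatially coherent regions, which is where the reported speedup over PBP/PW operation comes from.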


2020 ◽  
Vol 2020 ◽  
pp. 1-29 ◽  
Author(s):  
Xingxing Xiong ◽  
Shubo Liu ◽  
Dan Li ◽  
Zhaohui Cai ◽  
Xiaoguang Niu

With the advent of the era of big data, privacy issues have become a hot topic for the public. Local differential privacy (LDP) is a state-of-the-art privacy preservation technique that allows big data analysis (e.g., statistical estimation, statistical learning, and data mining) to be performed while guaranteeing each individual participant's privacy. In this paper, we present a comprehensive survey of LDP. We first give an overview of the fundamental knowledge of LDP and its frameworks. We then introduce the mainstream privatization mechanisms and methods in detail from the perspective of frequency oracles and give insights into recent studies on private basic statistical estimation (e.g., frequency estimation and mean estimation) and complex statistical estimation (e.g., multivariate distribution estimation and private estimation over complex data) under LDP. Furthermore, we present the current research landscape of LDP, including private statistical learning/inference, private statistical data analysis, privacy amplification techniques for LDP, and some application fields of LDP. Finally, we identify future research directions and open challenges for LDP. This survey can serve as a good reference source for research on LDP to deal with the various privacy-related scenarios encountered in practice.
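The frequency-oracle perspective mentioned above is easiest to see in the simplest LDP mechanism, k-ary randomized response: each user perturbs their own value locally, and the aggregator debiases the noisy reports into frequency estimates. A minimal sketch, assuming a small categorical domain:

```python
import math
import random

def rr_perturb(value, epsilon, domain_size, rng=random):
    """k-ary randomized response: keep the true value with probability
    p = e^eps / (e^eps + k - 1), else report a uniform other value.
    This satisfies epsilon-LDP for each individual report."""
    e = math.exp(epsilon)
    p = e / (e + domain_size - 1)
    if rng.random() < p:
        return value
    other = rng.randrange(domain_size - 1)
    return other if other < value else other + 1

def rr_estimate(reports, epsilon, domain_size):
    """Unbiased frequency estimates from the perturbed reports."""
    e = math.exp(epsilon)
    p = e / (e + domain_size - 1)
    q = 1.0 / (e + domain_size - 1)
    n = len(reports)
    return [(sum(1 for r in reports if r == v) / n - q) / (p - q)
            for v in range(domain_size)]
```

More refined oracles (e.g., unary encoding or hashing-based mechanisms surveyed in the paper) trade off variance and communication, but follow this same perturb-then-debias shape.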


2011 ◽  
Vol 27 (18) ◽  
pp. 2612-2613 ◽  
Author(s):  
Florian Battke ◽  
Stephan Symons ◽  
Alexander Herbig ◽  
Kay Nieselt

2020 ◽  
Vol 25 (6) ◽  
pp. 755-769
Author(s):  
Noorullah R. Mohammed ◽  
Moulana Mohammed

Text data clustering is performed to organize a set of text documents into the desired number of coherent and meaningful sub-clusters. Modeling the text documents in terms of topic derivations is a vital task in text data clustering. Each tweet is considered a text document, and various topic models perform modeling of tweets. In existing topic models, the clustering tendency of tweets is assessed initially based on Euclidean dissimilarity features. The cosine metric is more suitable for a more informative assessment, especially in text clustering. Thus, this paper develops a novel cosine-based internal and external validity assessment of cluster tendency for improving the computational efficiency of tweet data clustering. In the experiments, tweet data clustering results are evaluated using cluster validity index measures. The experiments show that cosine-based internal and external validity metrics outperform the others on benchmarked and Twitter-based datasets.
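The preference for cosine over Euclidean dissimilarity in text clustering can be demonstrated directly: cosine dissimilarity ignores document length, so a short and a long document on the same topic are close, whereas Euclidean distance is dominated by length. A small illustration on term-frequency vectors:

```python
import numpy as np

def cosine_dissimilarity(a, b):
    """1 - cosine similarity. Length-invariant, which is why it suits
    sparse term-frequency vectors better than Euclidean distance."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - float(a @ b) / (na * nb)

short = np.array([1.0, 1.0, 0.0])
long_ = 10 * short                  # same topic, 10x longer document
other = np.array([0.0, 0.0, 1.0])   # different topic

cos_same = cosine_dissimilarity(short, long_)        # ~0: same topic
euc_same = np.linalg.norm(short - long_)             # large: length!
euc_diff = np.linalg.norm(short - other)
```

Here Euclidean distance ranks the length-scaled copy as farther away than a genuinely different topic, while cosine dissimilarity correctly treats it as nearly identical.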


KWALON ◽  
2010 ◽  
Vol 15 (3) ◽  
Author(s):  
Curtis Atkisson ◽  
Colin Monaghan ◽  
Edward Brent

The recent mass digitization of text data has led to a need to deal efficiently and effectively with the mountain of textual data being generated. Digitized text increasingly takes the form of digitized data flows (Brent, 2008): non-static streams of generated content, including Twitter, electronic news, etc. An oft-cited statistic is that currently 85% of all business data is in the form of text (cited in Hotho, Nürnberger & Paass, 2005). This mountain of data leads us to the question of whether labor-intensive traditional qualitative data analysis techniques are best suited for such large amounts of data. Other techniques for dealing with large amounts of data may also be found wanting, because they remove the researcher from an immersion in the data. Both handling large amounts of data and allowing immersion in the data are clearly desired features of any text analysis system.


2021 ◽  
pp. 2141001
Author(s):  
Sanqiang Wei ◽  
Hongxia Hou ◽  
Hua Sun ◽  
Wei Li ◽  
Wenxia Song

The plots in certain literary works are very complicated and hinder readers from understanding them. Therefore, tools should be proposed to support readers' comprehension of complex literary works by providing them with the most important information. A human reader must capture multiple levels of abstraction and meaning to formulate an understanding of a document. Hence, in this paper, an Improved k-means clustering algorithm (IKCA) has been proposed for literary word classification. For text data, the words that can express exact semantics in a class are generally better features. The proposed technique captures numerous cluster centroids for every class and then selects the high-frequency words in the centroids as the text features for classification. Furthermore, neural networks have been used to classify text documents and k-means to cluster them; the model thus combines unsupervised and supervised techniques to identify the similarity between documents. The numerical results show that the suggested model improves on the existing algorithm and the standard k-means algorithm: in the accuracy comparison of ALA and IKCA it reaches 95.2%, the time taken for clustering is less than 2 hours, the success rate is 97.4%, and the performance ratio is 98.1%.
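The feature-selection step described above, taking the high-frequency words of each cluster's centroid as classification features, can be sketched as follows. This assumes term-frequency vectors and already-assigned cluster labels; the function name and the toy vocabulary are illustrative, and this is not the IKCA algorithm itself.

```python
import numpy as np

def centroid_top_terms(X, labels, vocab, top_n=2):
    """For each cluster, return the highest-weight vocabulary terms of
    its centroid, mimicking the idea of using centroid high-frequency
    words as features for a downstream (e.g., neural) classifier."""
    features = {}
    for c in np.unique(labels):
        centroid = X[labels == c].mean(axis=0)
        top = np.argsort(centroid)[::-1][:top_n]
        features[int(c)] = [vocab[i] for i in top]
    return features

vocab = ["ball", "goal", "vote", "law"]
X = np.array([[3., 2., 0., 0.],   # sports document
              [2., 3., 0., 0.],   # sports document
              [0., 0., 3., 2.],   # politics document
              [0., 0., 2., 3.]])  # politics document
labels = np.array([0, 0, 1, 1])   # assume k-means already assigned these
feats = centroid_top_terms(X, labels, vocab)
```

The selected per-cluster terms then serve as a compact feature set for the supervised classification stage, which is how the unsupervised and supervised parts of the model connect.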

