Secure and Scalable Document Similarity on Distributed Databases: Differential Privacy to the Rescue

2020 ◽  
Vol 2020 (2) ◽  
pp. 209-229
Author(s):  
Phillipp Schoppmann ◽  
Lennart Vogelsang ◽  
Adrià Gascón ◽  
Borja Balle

Abstract
Privacy-preserving collaborative data analysis enables richer models than what each party can learn with their own data. Secure Multi-Party Computation (MPC) offers a robust cryptographic approach to this problem, and in fact several protocols have been proposed for various data analysis and machine learning tasks. In this work, we focus on secure similarity computation between text documents, and the application to k-nearest neighbors (k-NN) classification. Due to its non-parametric nature, k-NN presents scalability challenges in the MPC setting. Previous work addresses these by introducing non-standard assumptions about the abilities of an attacker, for example by relying on non-colluding servers. In this work, we tackle the scalability challenge from a different angle, and instead introduce a secure preprocessing phase that reveals differentially private (DP) statistics about the data. This allows us to exploit the inherent sparsity of text data and significantly speed up all subsequent classifications.
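The core idea of the preprocessing phase can be illustrated outside of MPC: release noisy per-term statistics under DP, then use them to prune terms before the expensive secure computation. The sketch below uses a per-term Laplace mechanism on document-frequency counts; the function name, the threshold, and the pruning rule are illustrative assumptions, not the paper's actual protocol.

```python
import numpy as np

def dp_document_frequencies(doc_freqs, epsilon, rng=None):
    """Release per-term document-frequency counts with Laplace noise.

    Each document changes a term's count by at most 1, so per-term
    L1 sensitivity is 1 and Laplace(1/epsilon) noise gives epsilon-DP
    per term (composition across terms is not accounted for here).
    """
    rng = np.random.default_rng(rng)
    noise = rng.laplace(scale=1.0 / epsilon, size=len(doc_freqs))
    # Clip to zero: negative counts carry no information for pruning.
    return np.maximum(doc_freqs + noise, 0.0)

# A later secure k-NN phase could skip terms whose noisy count is tiny,
# exploiting text sparsity to shrink the vectors the MPC must touch.
counts = np.array([120.0, 3.0, 0.0, 57.0])
noisy = dp_document_frequencies(counts, epsilon=1.0, rng=0)
keep = noisy > 5.0  # hypothetical pruning threshold
```

The pruning step is where the speedup comes from: the secure protocol only processes the (few) terms that the DP statistics show to be frequent.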

Author(s):  
Grace L. Samson ◽  
Joan Lu

Abstract
We present a new detection method for color-based object detection, which can improve the performance of learning procedures in terms of speed, accuracy, and efficiency, using spatial inference and algorithms. We applied the model to human skin detection from an image; however, the method can also work for other machine learning tasks involving image pixels. We propose (1) an improved RGB/HSL human skin color threshold to tackle the problem of detecting darker human skin, and (2) a new rule-based fast algorithm (packed k-dimensional tree, PKT) that depends on an improved spatial structure for human skin/face detection from colored 2D images. We also implemented a novel packed quad-tree (PQT) to speed up quad-tree performance in terms of indexing. We compared the proposed system to traditional pixel-by-pixel (PBP)/pixel-wise (PW) operations and quadtree-based procedures. The results show that our proposed spatial structure performs better (with a very low false hit rate, very high precision, and high accuracy) than most state-of-the-art models.
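To make the thresholding idea concrete, here is a widely cited classic RGB skin rule (Kovac et al.), not the paper's improved RGB/HSL threshold for darker skin tones, which is not reproduced in the abstract:

```python
def is_skin_rgb(r, g, b):
    """Classic daylight RGB skin-color rule (Kovac et al.).

    A pixel is classified as skin when all conditions hold; the
    paper's improved threshold extends rules like these to darker
    skin tones, but those exact bounds are not given here.
    """
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15
            and r > g and r > b)

light_skin = is_skin_rgb(220, 160, 130)  # typical light skin tone
background = is_skin_rgb(30, 30, 30)     # dark grey background
```

The PKT/PQT structures then avoid evaluating such a rule pixel-by-pixel by indexing spatially coherent regions, which is where the reported speedup over PBP/PW operation comes from.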


2020 ◽  
Vol 2020 ◽  
pp. 1-29 ◽  
Author(s):  
Xingxing Xiong ◽  
Shubo Liu ◽  
Dan Li ◽  
Zhaohui Cai ◽  
Xiaoguang Niu

With the advent of the era of big data, privacy issues have become a hot topic for the public. Local differential privacy (LDP) is a state-of-the-art privacy preservation technique that allows big data analysis (e.g., statistical estimation, statistical learning, and data mining) to be performed while guaranteeing each individual participant's privacy. In this paper, we present a comprehensive survey of LDP. We first give an overview of the fundamental knowledge of LDP and its frameworks. We then introduce the mainstream privatization mechanisms and methods in detail from the perspective of frequency oracles and give insights into recent studies on private basic statistical estimation (e.g., frequency estimation and mean estimation) and complex statistical estimation (e.g., multivariate distribution estimation and private estimation over complex data) under LDP. Furthermore, we present the current research landscape of LDP, including private statistical learning/inference, private statistical data analysis, privacy amplification techniques for LDP, and some application fields of LDP. Finally, we identify future research directions and open challenges for LDP. This survey can serve as a good reference source for research on LDP to deal with the various privacy-related scenarios encountered in practice.
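The frequency-oracle perspective mentioned above is easiest to see in the simplest LDP mechanism, k-ary randomized response: each user perturbs their own value locally, and the aggregator debiases the noisy reports into frequency estimates. A minimal sketch, assuming a small categorical domain:

```python
import math
import random

def rr_perturb(value, epsilon, domain_size, rng=random):
    """k-ary randomized response: keep the true value with probability
    p = e^eps / (e^eps + k - 1), else report a uniform other value.
    This satisfies epsilon-LDP for each individual report."""
    e = math.exp(epsilon)
    p = e / (e + domain_size - 1)
    if rng.random() < p:
        return value
    other = rng.randrange(domain_size - 1)
    return other if other < value else other + 1

def rr_estimate(reports, epsilon, domain_size):
    """Unbiased frequency estimates from the perturbed reports."""
    e = math.exp(epsilon)
    p = e / (e + domain_size - 1)
    q = 1.0 / (e + domain_size - 1)
    n = len(reports)
    return [(sum(1 for r in reports if r == v) / n - q) / (p - q)
            for v in range(domain_size)]
```

More refined oracles (e.g., unary encoding or hashing-based mechanisms surveyed in the paper) trade off variance and communication, but follow this same perturb-then-debias shape.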


2011 ◽  
Vol 27 (18) ◽  
pp. 2612-2613 ◽  
Author(s):  
Florian Battke ◽  
Stephan Symons ◽  
Alexander Herbig ◽  
Kay Nieselt

2020 ◽  
Vol 25 (6) ◽  
pp. 755-769
Author(s):  
Noorullah R. Mohammed ◽  
Moulana Mohammed

Text data clustering is performed to organize a set of text documents into the desired number of coherent and meaningful sub-clusters. Modeling the text documents in terms of topic derivations is a vital task in text data clustering. Each tweet is considered a text document, and various topic models perform modeling of tweets. In existing topic models, the clustering tendency of tweets is assessed initially based on Euclidean dissimilarity features. The cosine metric is more suitable for a more informative assessment, especially in text clustering. Thus, this paper develops a novel cosine-based internal and external validity assessment of cluster tendency for improving the computational efficiency of tweet data clustering. In the experiments, tweet data clustering results are evaluated using cluster validity index measures. The experiments show that cosine-based internal and external validity metrics outperform the others on benchmarked and Twitter-based datasets.
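The preference for cosine over Euclidean dissimilarity in text clustering can be demonstrated directly: cosine dissimilarity ignores document length, so a short and a long document on the same topic are close, whereas Euclidean distance is dominated by length. A small illustration on term-frequency vectors:

```python
import numpy as np

def cosine_dissimilarity(a, b):
    """1 - cosine similarity. Length-invariant, which is why it suits
    sparse term-frequency vectors better than Euclidean distance."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - float(a @ b) / (na * nb)

short = np.array([1.0, 1.0, 0.0])
long_ = 10 * short                  # same topic, 10x longer document
other = np.array([0.0, 0.0, 1.0])   # different topic

cos_same = cosine_dissimilarity(short, long_)        # ~0: same topic
euc_same = np.linalg.norm(short - long_)             # large: length!
euc_diff = np.linalg.norm(short - other)
```

Here Euclidean distance ranks the length-scaled copy as farther away than a genuinely different topic, while cosine dissimilarity correctly treats it as nearly identical.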


KWALON ◽  
2010 ◽  
Vol 15 (3) ◽  
Author(s):  
Curtis Atkisson ◽  
Colin Monaghan ◽  
Edward Brent

The recent mass digitization of text data has led to a need to deal efficiently and effectively with the mountain of textual data being generated. Digitized text increasingly takes the form of digitized data flows (Brent, 2008): non-static streams of generated content, including Twitter, electronic news, etc. An oft-cited statistic is that currently 85% of all business data is in the form of text (cited in Hotho, Nürnberger & Paass, 2005). This mountain of data leads us to the question of whether labor-intensive traditional qualitative data analysis techniques are best suited for such large amounts of data. Other techniques for dealing with large amounts of data may also be found wanting, because they remove the researcher from an immersion in the data. Both handling large amounts of data and allowing immersion in the data are clearly desired features of any text analysis system.


2021 ◽  
pp. 2141001
Author(s):  
Sanqiang Wei ◽  
Hongxia Hou ◽  
Hua Sun ◽  
Wei Li ◽  
Wenxia Song

The plots in certain literary works are very complicated and hinder readers from understanding them. Therefore, tools should be proposed to support readers' comprehension of complex literary works by providing them with the most important information. A human reader must capture multiple levels of abstraction and meaning to formulate an understanding of a document. Hence, in this paper, an Improved k-means clustering algorithm (IKCA) has been proposed for literary word classification. For text data, the words that can express exact semantics in a class are generally better features. The proposed technique captures numerous cluster centroids for every class and then selects the high-frequency words in the centroids as the text features for classification. Furthermore, neural networks have been used to classify text documents and k-means to cluster them; the model thus combines unsupervised and supervised techniques to identify the similarity between documents. The numerical results show that the suggested model improves on the existing algorithm and the standard k-means algorithm: in the accuracy comparison of ALA and IKCA it reaches 95.2%, the time taken for clustering is less than 2 hours, the success rate is 97.4%, and the performance ratio is 98.1%.
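The feature-selection step described above, taking the high-frequency words of each cluster's centroid as classification features, can be sketched as follows. This assumes term-frequency vectors and already-assigned cluster labels; the function name and the toy vocabulary are illustrative, and this is not the IKCA algorithm itself.

```python
import numpy as np

def centroid_top_terms(X, labels, vocab, top_n=2):
    """For each cluster, return the highest-weight vocabulary terms of
    its centroid, mimicking the idea of using centroid high-frequency
    words as features for a downstream (e.g., neural) classifier."""
    features = {}
    for c in np.unique(labels):
        centroid = X[labels == c].mean(axis=0)
        top = np.argsort(centroid)[::-1][:top_n]
        features[int(c)] = [vocab[i] for i in top]
    return features

vocab = ["ball", "goal", "vote", "law"]
X = np.array([[3., 2., 0., 0.],   # sports document
              [2., 3., 0., 0.],   # sports document
              [0., 0., 3., 2.],   # politics document
              [0., 0., 2., 3.]])  # politics document
labels = np.array([0, 0, 1, 1])   # assume k-means already assigned these
feats = centroid_top_terms(X, labels, vocab)
```

The selected per-cluster terms then serve as a compact feature set for the supervised classification stage, which is how the unsupervised and supervised parts of the model connect.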

