A Collaborative Framework for Privacy Preserving Fuzzy Co-Clustering of Vertically Distributed Cooccurrence Matrices

2015 ◽  
Vol 2015 ◽  
pp. 1-8 ◽  
Author(s):  
Katsuhiro Honda ◽  
Toshiya Oda ◽  
Daiji Tanaka ◽  
Akira Notsu

In many real-world data analysis tasks, much more useful knowledge can be obtained by utilizing multiple databases stored in different organizations, such as cooperative groups, state organs, and allied countries. However, many such organizations hesitate to publish their databases because of privacy and security issues, even though they recognize the advantages of collaborative analysis. This paper proposes a novel collaborative framework for utilizing vertically partitioned cooccurrence matrices in fuzzy co-cluster structure estimation, in which cooccurrence information among objects and items is stored separately at several sites. In order to utilize such distributed data sets without fear of information leaks, a privacy-preserving procedure is introduced into fuzzy clustering for categorical multivariate data (FCCM). The individual elements of the cooccurrence matrices are withheld; only object memberships are shared among the sites, and their (implicit) joint co-cluster structures are revealed through an iterative clustering process. Several experimental results demonstrate that collaborative analysis reveals the global intrinsic co-cluster structures of the separate matrices better than individual site-wise analysis does. The novel framework makes it possible for many private and public organizations to share common structural knowledge of their data without fear of information leaks.
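
To make the division of labor concrete, the following is a minimal, non-secure sketch of entropy-regularized FCCM updates over a vertically partitioned cooccurrence matrix. The fuzzifier parameters (lam_u, lam_w), the initialization, and the plain summation of site-local aggregates are illustrative assumptions; the paper's privacy-preserving exchange step is intentionally omitted here.

```python
# Minimal, non-secure sketch of entropy-regularized FCCM over a vertically
# partitioned cooccurrence matrix. Each site holds a column block; item
# memberships stay local, and only per-object aggregates cross site borders.
import numpy as np

def fccm_vertical(site_matrices, n_clusters, lam_u=1.0, lam_w=1.0, iters=50):
    """site_matrices: list of (n_objects, n_items_k) arrays, one per site."""
    rng = np.random.default_rng(0)
    n_objects = site_matrices[0].shape[0]
    # Object memberships u (shared across sites), row-stochastic over clusters.
    u = rng.dirichlet(np.ones(n_clusters), size=n_objects)         # (n, C)
    # Item memberships w, kept privately at each site.
    ws = [rng.dirichlet(np.ones(m.shape[1]), size=n_clusters)      # (C, m_k)
          for m in site_matrices]
    for _ in range(iters):
        # Each site updates its item memberships locally from u and its data.
        for k, R in enumerate(site_matrices):
            logits = lam_w * (u.T @ R)                             # (C, m_k)
            ws[k] = np.exp(logits - logits.max(axis=1, keepdims=True))
            ws[k] /= ws[k].sum(axis=1, keepdims=True)
        # Sites contribute only aggregated statistics for the u-update;
        # raw cooccurrence entries never leave a site.
        agg = sum(R @ ws[k].T for k, R in enumerate(site_matrices))  # (n, C)
        logits = lam_u * agg
        u = np.exp(logits - logits.max(axis=1, keepdims=True))
        u /= u.sum(axis=1, keepdims=True)
    return u, ws
```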

Author(s):  
Sebastian Stammler ◽  
Tobias Kussel ◽  
Phillipp Schoppmann ◽  
Florian Stampe ◽  
Galina Tremper ◽  
...  

Abstract Motivation Record linkage has versatile applications in real-world data analysis contexts, where several data sets need to be linked on the record level in the absence of any exact identifier connecting related records. An example is medical databases of patients, spread across institutions, that have to be linked on personally identifiable entries like name, date of birth, or ZIP code. At the same time, privacy laws may prohibit the exchange of this personally identifiable information (PII) across institutional boundaries, ruling out outsourcing the record linkage task to a trusted third party. We propose to employ privacy-preserving record linkage (PPRL) techniques that prevent, to various degrees, the leakage of PII while still allowing related records to be linked. Results We develop a framework for fault-tolerant PPRL using secure multi-party computation, with the medical record keeping software Mainzelliste as the data source. Our solution does not rely on any trusted third party, and all PII is guaranteed not to leak under common cryptographic security assumptions. Benchmarks show the feasibility of our approach in realistic networking settings: linking a patient record against a database of 10,000 records takes 48 s over a heavily delayed (100 ms) network connection, or 3.9 s with a low-latency connection. Availability and implementation The source code of the sMPC node is freely available on GitHub at https://github.com/medicalinformatics/SecureEpilinker subject to the AGPLv3 license. The source code of the modified Mainzelliste is available at https://github.com/medicalinformatics/MainzellisteSEL.
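
For readers unfamiliar with PPRL, the sketch below illustrates one widely used building block of the field: Bloom-filter encodings of name bigrams compared with the Dice coefficient. It is not the secure multi-party computation protocol of this paper, only a minimal example of linking records without exchanging plaintext PII; all parameters are illustrative.

```python
# A common PPRL building block: encode a name's bigrams into a Bloom filter
# and compare filters with the Dice coefficient. Sites exchange only bit
# sets, never plaintext PII. NOT the sMPC protocol used in the paper.
import hashlib

FILTER_BITS = 256
NUM_HASHES = 4

def bigrams(s):
    s = f"_{s.lower()}_"                      # pad so word ends contribute
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(value):
    bits = set()
    for gram in bigrams(value):
        for seed in range(NUM_HASHES):        # k independent hash functions
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).digest()
            bits.add(int.from_bytes(digest[:4], "big") % FILTER_BITS)
    return bits

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

# Two sites encode locally and exchange only the bit sets.
print(dice(bloom_encode("Johannes"), bloom_encode("Johanes")))  # high: match
print(dice(bloom_encode("Johannes"), bloom_encode("Maria")))    # low: no match
```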


Author(s):  
Samir Abou El-Seoud ◽  
Hosam F. El-Sofany ◽  
Mohamed Ashraf Fouad Abdelfattah ◽  
Reham Mohamed

Big data is currently one of the most critical emerging technologies. Big data is a concept that refers to the inability of traditional data architectures to efficiently handle new data sets. The 4Vs of big data – volume, velocity, variety, and veracity – make data management and analytics challenging for traditional data warehouses. It is important to think of big data and analytics together. Big data is the term used to describe the recent explosion of different types of data from disparate sources. Analytics is about examining data to derive interesting and relevant trends and patterns, which can be used to inform decisions, optimize processes, and even drive new business models. Cloud computing seems to be a perfect vehicle for hosting big data workloads. However, working on big data in the cloud brings its own challenge of reconciling two contradictory design principles. Cloud computing is based on the concepts of consolidation and resource pooling, whereas big data systems (such as Hadoop) are built on the shared-nothing principle, where each node is independent and self-sufficient. By integrating big data with cloud computing technologies, businesses and educational institutes can chart a better direction for the future. The capability to store large amounts of data in different forms and process it all at very high speeds will yield data that can guide businesses and educational institutes as they develop. Nevertheless, there is a large concern regarding privacy and security issues when moving to the cloud, which is the main reason why businesses and educational institutes hesitate to do so. This paper introduces the characteristics, trends, and challenges of big data. In addition, it investigates the benefits and the risks that may arise from the integration of big data and cloud computing.


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Guoming Lu ◽  
Xu Zheng ◽  
Jingyuan Duan ◽  
Ling Tian ◽  
Xia Wang

The publication of data from multiple contributors has long been considered a fundamental task for data processing in various domains, and a prominent prerequisite for enabling AI techniques in wireless networks. With the emergence of diversified smart devices and applications, data held by individuals become more pervasive and nontrivial to publish. First, the data are more private and sensitive, as they cover every aspect of daily life, from income data to fitness data. Second, the publication of such data is also bandwidth-consuming, as they are likely to be stored on mobile devices. Local differential privacy has been considered a novel paradigm for such distributed data publication. However, existing works mostly require contents to be encoded into a vector space for publication, which is still costly in network resources. Therefore, this work proposes a novel framework for highly efficient privacy-preserving data publication. Specifically, two sampling-based algorithms are proposed for histogram publication, the histogram being an important statistic for data analysis. The first algorithm applies a bit-level sampling strategy to both reduce the overall bandwidth and balance the cost among contributors. The second algorithm allows consumers to adjust their focus on different intervals and can properly allocate the sampling ratios to optimize the overall performance. Both the analysis and the validation on real-world data traces demonstrate the advancement of our work.
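
As a rough illustration of the bit-level sampling idea, the following is a minimal one-bit LDP histogram sketch: each contributor samples a single bit position of their one-hot encoding, perturbs it with randomized response, and the aggregator debiases the reports. The parameterization is standard textbook randomized response, not the paper's exact algorithms.

```python
# One-bit, sampling-based LDP histogram: each contributor sends a single
# (index, bit) pair, so bandwidth per contributor is constant in the
# number of buckets. Debiasing follows standard randomized response.
import math
import random
from collections import defaultdict

def contribute(value, n_buckets, eps):
    """Each contributor reports a single sampled, perturbed bit."""
    j = random.randrange(n_buckets)          # sample one bit position
    bit = 1 if j == value else 0
    p = math.exp(eps) / (math.exp(eps) + 1)  # keep probability
    if random.random() > p:
        bit ^= 1                             # flip with probability 1 - p
    return j, bit

def estimate(reports, n_buckets, eps, n_contributors):
    p = math.exp(eps) / (math.exp(eps) + 1)
    q = 1 - p
    sums, counts = defaultdict(int), defaultdict(int)
    for j, bit in reports:
        sums[j] += bit
        counts[j] += 1
    hist = []
    for j in range(n_buckets):
        mean = sums[j] / counts[j] if counts[j] else 0.0
        freq = (mean - q) / (p - q)          # debias randomized response
        hist.append(max(0.0, freq * n_contributors))
    return hist

values = [random.choice([0, 0, 1, 2]) for _ in range(20000)]
reports = [contribute(v, 3, eps=2.0) for v in values]
print([round(c) for c in estimate(reports, 3, eps=2.0, n_contributors=20000)])
```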


Data Science ◽  
2021 ◽  
Vol 4 (2) ◽  
pp. 121-150
Author(s):  
Chang Sun ◽  
Lianne Ippel ◽  
Andre Dekker ◽  
Michel Dumontier ◽  
Johan van Soest

Combining and analysing sensitive data from multiple sources offers considerable potential for knowledge discovery. However, a number of issues pose problems for such analyses, including technical barriers, privacy restrictions, security concerns, and trust issues. Privacy-preserving distributed data mining (PPDDM) techniques aim to overcome these challenges by extracting knowledge from partitioned data while minimizing the release of sensitive information. This paper reports the results and findings of a systematic review of PPDDM techniques from 231 scientific articles published in the past 20 years. We summarize the state of the art, compare the problems they address, and identify the outstanding challenges in the field. The review identifies the consequences of the lack of standard criteria for evaluating new PPDDM methods and proposes comprehensive evaluation criteria covering 10 key factors. We discuss the ambiguous definitions of privacy and the confusion between privacy and security in the field, and provide suggestions on how to write a clear and applicable privacy description for new PPDDM techniques. The findings from our review enhance the understanding of the challenges of applying theoretical PPDDM methods to real-life use cases, and of the importance of involving legal, ethical, and social experts in implementing PPDDM methods. This comprehensive review will serve as a helpful guide to past research and future opportunities in the area of PPDDM.


2018 ◽  
Vol 7 (3.13) ◽  
pp. 157
Author(s):  
Ahmed M. Khedr ◽  
Zaher AL Aghbari ◽  
Ibrahim Kamel

In distributed computing, data sharing is inevitable; however, moving local databases from one site to another should be avoided because of the computational overhead and privacy considerations. Most data mining algorithms are designed on the assumption that the data repository is stored locally. This paper presents a scheme and algorithms for mining association rules in geographically distributed data. The proposed scheme preserves the data privacy of the different geographical sites by passing secure messages between them. The algorithms minimize the communication cost by exchanging statistical summaries of the local databases. We provide a privacy and security analysis that shows the privacy-preserving aspects of the proposed algorithms. Moreover, the paper presents extensive simulation experiments to evaluate the efficiency of the proposed scheme.
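
A classic example of such a statistical-summary exchange is the secure-sum protocol, sketched below: sites arranged in a ring accumulate a masked running total, so a global support count for an itemset is obtained while no site sees another's local count. This is a generic illustration, not necessarily the paper's exact message protocol.

```python
# Classic secure-sum over a ring of sites: the initiator adds a random mask,
# each site adds its local support count modulo a large range, and the
# initiator removes the mask at the end. No site learns another's count.
import random

MASK_RANGE = 10**9

def secure_sum(local_counts):
    """Simulate a ring of sites; each sees only a masked running total."""
    mask = random.randrange(MASK_RANGE)       # chosen by the initiating site
    running = mask
    for count in local_counts:                # each site adds its local count
        running = (running + count) % MASK_RANGE
    return (running - mask) % MASK_RANGE      # initiator removes the mask

# Global support of an itemset across three sites, no raw data exchanged.
local_supports = [120, 45, 310]
print(secure_sum(local_supports))             # 475
```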


Author(s):  
K Sobha Rani

Collaborative filtering suffers from the problems of data sparsity and cold start, which dramatically degrade recommendation performance. To help resolve these issues, we propose TrustSVD, a trust-based matrix factorization technique. By analyzing social trust data from four real-world data sets, we conclude that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. Hence, we build on top of SVD++, a state-of-the-art recommendation algorithm that inherently involves the explicit and implicit influence of rated items, by further incorporating both the explicit and implicit influence of trusted users on the prediction of items for an active user. To our knowledge, this work is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that our approach, TrustSVD, achieves better accuracy than ten other counterparts and better handles the issues in question.
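
The TrustSVD prediction rule extends the SVD++ prediction with an implicit trust term. A minimal sketch follows, with all parameter arrays as stand-ins for learned values; training by gradient descent on the regularized squared error is omitted.

```python
# TrustSVD-style rating prediction: the SVD++ prediction augmented with the
# implicit influence of trusted users' factors. Arrays here are random
# stand-ins for learned parameters; training is omitted.
import numpy as np

def predict(mu, b_u, b_i, p_u, q_i, Y_rated, W_trusted):
    """Y_rated: item factors y_j for items the user rated (|I_u| x d).
    W_trusted: user factors w_v for users the user trusts (|T_u| x d)."""
    feedback = np.zeros_like(p_u)
    if len(Y_rated):                          # implicit feedback from ratings
        feedback += Y_rated.sum(axis=0) / np.sqrt(len(Y_rated))
    if len(W_trusted):                        # implicit influence of trust
        feedback += W_trusted.sum(axis=0) / np.sqrt(len(W_trusted))
    return mu + b_u + b_i + q_i @ (p_u + feedback)

rng = np.random.default_rng(1)
d = 8
print(predict(3.5, 0.1, -0.2, rng.normal(size=d), rng.normal(size=d),
              rng.normal(size=(5, d)), rng.normal(size=(3, d))))
```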


Entropy ◽  
2021 ◽  
Vol 23 (5) ◽  
pp. 507
Author(s):  
Piotr Białczak ◽  
Wojciech Mazurczyk

Malicious software utilizes the HTTP protocol for communication, creating network traffic that is hard to identify as it blends into the traffic generated by benign applications. To this end, fingerprinting tools have been developed to help track and identify such traffic by providing a short representation of malicious HTTP requests. However, currently existing tools either do not analyze all the information included in the HTTP message or analyze it insufficiently. To address these issues, we propose Hfinger, a novel malware HTTP request fingerprinting tool. It extracts information from parts of the request such as the URI, protocol information, headers, and payload, providing a concise request representation that preserves the extracted information in a form interpretable by a human analyst. For the developed solution, we have performed an extensive experimental evaluation using real-world data sets, and we have also compared Hfinger with the most related and popular existing tools, such as FATT, Mercury, and p0f. The effectiveness analysis reveals that, on average, only 1.85% of requests fingerprinted by Hfinger collide between malware families, which is 8–34 times lower than for existing tools. Moreover, unlike these tools, in its default mode Hfinger does not introduce collisions between malware and benign applications, and achieves this while increasing the number of fingerprints by at most a factor of three. As a result, Hfinger can effectively track and hunt malware by providing more unique fingerprints than other standard tools.
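
To illustrate the general approach of request fingerprinting (not Hfinger's actual fingerprint format), here is a simplified, hypothetical fingerprinter that combines URI shape, protocol version, header order, and payload length into a short string:

```python
# A simplified, hypothetical HTTP request fingerprint combining URI shape,
# protocol version, header order, a hashed User-Agent, and payload length.
# The feature set and format are illustrative only, NOT Hfinger's format.
import hashlib

def fingerprint(method, uri, version, headers, body=b""):
    path, _, query = uri.partition("?")
    parts = [
        method,
        version,
        f"plen:{len(path)}",                          # URI shape, not content
        f"qkeys:{len(query.split('&')) if query else 0}",
        "hdr:" + ",".join(name.lower() for name, _ in headers),  # header order
        "ua:" + hashlib.sha1(dict(headers).get("User-Agent", "")
                             .encode()).hexdigest()[:8],
        f"blen:{len(body)}",
    ]
    return "|".join(parts)

req_headers = [("Host", "example.com"), ("User-Agent", "Mozilla/5.0"),
               ("Accept", "*/*")]
print(fingerprint("GET", "/gate.php?id=1&v=2", "HTTP/1.1", req_headers))
```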


2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied to many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis. The method first applies pre-trained word vectors to represent document features, using two different linear weighting methods. The resulting document vectors are then input to a classification model and used to train a neural-network-based text sentiment classifier. In this way, the emotional polarity of the text is propagated into the word vectors. Experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
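
A minimal sketch of this kind of pipeline is shown below, using a plain average of pre-trained vectors as one assumed linear weighting and a one-layer sigmoid classifier; the embedding table is a random stand-in, and the back-propagation of emotional polarity into the word vectors themselves (the core of EWE) is omitted.

```python
# Document vectors as an average of (stand-in) pre-trained word vectors,
# fed to a one-layer sigmoid sentiment classifier trained by gradient
# descent on log loss. Illustrative pipeline only, not the EWE model.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"good": 0, "bad": 1, "great": 2, "awful": 3, "movie": 4}
E = rng.normal(size=(len(vocab), 16))        # stand-in pre-trained vectors

def doc_vector(tokens):
    idx = [vocab[t] for t in tokens if t in vocab]
    return E[idx].mean(axis=0) if idx else np.zeros(E.shape[1])

w, b, lr = np.zeros(16), 0.0, 0.1
data = [("good great movie", 1), ("bad awful movie", 0)] * 50
for _ in range(200):
    for text, y in data:
        x = doc_vector(text.split())
        p = 1 / (1 + np.exp(-(x @ w + b)))   # predicted positive probability
        g = p - y                            # gradient of log loss
        w -= lr * g * x
        b -= lr * g

x_test = doc_vector("great movie".split())
print(round(float(1 / (1 + np.exp(-(x_test @ w + b)))), 3))  # near 1.0
```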

