PRIVACY PRESERVING CLUSTERING BASED ON LINEAR APPROXIMATION OF FUNCTION

2013 ◽  
Vol 12 (5) ◽  
pp. 3443-3451
Author(s):  
Rajesh Pasupuleti ◽  
Narsimha Gugulothu

Clustering analysis opens a new direction in data mining that has major impact in various domains, including machine learning, pattern recognition, image processing, information retrieval, and bioinformatics. Current clustering techniques do not adequately address all of these requirements and have not produced a standardized clustering algorithm that supports all real applications. Many clustering methods depend on user-specified parameters, and the initial seeds of clusters are randomly selected by the user. In this paper, we propose a new clustering method based on linear approximation of a function: using overall knowledge of the clustering function's behavior, we pick the initial seeds of clusters as points on the linear approximation line and then perform the clustering operations, unlike traditional clustering methods that group data objects into clusters by using distance measures, similarity measures, and statistical distributions. Experimental results on an example of business data show that clusters based on linear approximation yield good results in practice. The paper also explains privacy-preserving clustering of sensitive data objects.
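The seed-selection idea can be sketched as follows: fit a least-squares line to the data and place the initial seeds evenly along that line instead of choosing them at random. This is a minimal illustrative sketch for 2-D data, not the paper's exact procedure; all function names are hypothetical.

```python
# Sketch: pick initial cluster seeds as evenly spaced points on the
# least-squares line fitted to 2-D data (illustrative only).

def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    a = sxy / sxx
    b = my - a * mx
    return a, b

def seeds_on_line(points, k):
    """Place k seeds evenly along the fitted line over the data's x-range."""
    a, b = fit_line(points)
    xs = [x for x, _ in points]
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (k - 1) if k > 1 else 0
    return [(lo + i * step, a * (lo + i * step) + b) for i in range(k)]

data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8)]
print(seeds_on_line(data, 3))
```

The seeds then replace the random initialization of an ordinary partitional clustering run.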

Electronics ◽  
2020 ◽  
Vol 9 (2) ◽  
pp. 229
Author(s):  
Ferhat Ozgur Catak ◽  
Ismail Aydin ◽  
Ogerta Elezaj ◽  
Sule Yildirim-Yayilgan

The protection and processing of sensitive data in big data systems are common problems, as the increase in data size increases the need for high processing power. Protecting sensitive data on a system that contains multiple connections with different privacy policies also brings the extra work of using a proper cryptographic key exchange method for each party. Homomorphic encryption methods can perform arithmetic operations on encrypted data in the same way as on the plain form of the data. Thus, these methods provide data privacy, as data are processed in the encrypted domain without the need for a plain form, which allows outsourcing the computations to cloud systems. This also simplifies key exchange sessions for all sides. In this paper, we propose novel privacy-preserving clustering methods, alongside homomorphic encryption schemes that can run on a common high-performance computation platform, such as a cloud system. As a result, the parties of this system do not need to possess high processing power, because the most power-demanding tasks are done on any cloud system provider. Our system offers privacy-preserving distance matrix calculation for several clustering algorithms. Considering both encrypted and plain forms of the same data, for different key and data lengths, our privacy-preserving training method's performance results are obtained for four different data clustering algorithms under six different evaluation metrics.
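The additive property that makes such outsourcing possible can be demonstrated with a toy Paillier cryptosystem: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so a server can aggregate values it cannot read. The parameters below are deliberately tiny and insecure, purely for illustration; a real deployment would use a vetted library with keys of 2048 bits or more.

```python
# Toy Paillier cryptosystem: E(a) * E(b) mod n^2 decrypts to a + b,
# so a server can add encrypted values without seeing them.
# Insecure demo parameters -- for illustration only.
import math
import random

p, q = 61, 53                 # toy primes (never this small in practice)
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)  # decryption exponent
mu = pow(lam, -1, n)          # since g = n + 1, mu = lam^{-1} mod n

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    # with g = n + 1, g^m = 1 + m*n (mod n^2)
    return (1 + m * n) * pow(r, n, n2) % n2

def decrypt(c):
    x = pow(c, lam, n2)
    return (x - 1) // n * mu % n

a, b = encrypt(12), encrypt(30)
print(decrypt(a * b % n2))    # homomorphic addition: prints 42
```

Squared-distance computations for a distance matrix additionally require either a scheme supporting multiplication or a protocol where one side holds plaintext values, which is where the scheme choice in the paper matters.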


Author(s):  
Ch. Raja Ramesh, Et. al.

A group of similar data objects is known as a cluster. Clustering is the process of finding homogeneous data items, such as patterns or documents, and grouping the homogeneous items together, while other groups may contain dissimilar items. Most clustering methods are either crisp or fuzzy, and member allocation to the respective clusters is strictly based on similarity measures and membership functions. Both kinds of methods have limitations in terms of membership: one strictly decides that a sample must belong to a single cluster, while the other is fuzzy, i.e., probabilistic. Finally, measures such as Quality and Purity are applied to understand how well the clusters are created. But there is a grey area in between, namely 'boundary points' and 'moderately far' points from the cluster centre. We considered cluster quality [18], processing time, and relevant feature identification as the basis for our problem statement, and implemented zone-based clustering using the MapReduce concept. We implemented a process that finds far points from the different clusters and generates a new cluster, repeating this until the number of clusters stabilizes. Using this process, we can improve both cluster quality and processing time.
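The zone idea can be sketched by splitting points around a centroid into near, boundary, and moderately-far zones by distance thresholds, and seeding a new cluster from the far zone. The thresholds and helper names below are illustrative assumptions, not the paper's exact criteria.

```python
# Sketch: split points into "near", "boundary", and "far" zones around a
# centroid, then spawn a new cluster from the far zone.
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def zone_split(points, centroid, near=1.0, far=2.0):
    zones = {"near": [], "boundary": [], "far": []}
    for p in points:
        d = dist(p, centroid)
        if d <= near:
            zones["near"].append(p)
        elif d <= far:
            zones["boundary"].append(p)
        else:
            zones["far"].append(p)
    return zones

def centroid_of(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

pts = [(0, 0), (0.5, 0.5), (1.5, 0), (4, 4), (4.5, 4)]
z = zone_split(pts, (0, 0))
if z["far"]:                      # far points seed a new cluster
    print("new centroid:", centroid_of(z["far"]))
```

Repeating this split-and-respawn step until no cluster sheds far points corresponds to the stabilization loop described above; in a MapReduce setting the distance computation per point is the map step and the zone grouping is the reduce step.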


Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Clustering in high-dimensional spaces is a recurrent problem in many domains. It affects the time complexity, space complexity, scalability, and accuracy of clustering methods. High-dimensional non-linear data usually live in different low-dimensional subspaces hidden in the original space. As high-dimensional objects appear almost alike, new approaches for clustering are required. This research has focused on developing mathematical models, techniques, and clustering algorithms specifically for high-dimensional data. With the enormous growth in the fields of communication and technology, there is tremendous growth in high-dimensional data spaces. As the number of dimensions of high-dimensional non-linear data increases, many clustering techniques begin to suffer from the curse of dimensionality, degrading the quality of the results. In high-dimensional non-linear data, the data become very sparse and distance measures become increasingly meaningless. The principal challenge in clustering high-dimensional data is to overcome this 'curse of dimensionality'. This research work concentrates on devising an enhanced algorithm for clustering high-dimensional non-linear data.
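The distance-concentration effect behind the curse of dimensionality is easy to observe empirically: as dimensionality grows, the gap between the nearest and farthest point shrinks relative to the distances themselves, so distance-based cluster separation degrades. The simulation below illustrates this; it is not part of the cited research.

```python
# Simulation: the relative contrast (d_max - d_min) / d_min between a query
# point and random data shrinks as dimensionality grows, which is why
# distance measures become "increasingly meaningless" in high dimensions.
import math
import random

random.seed(0)

def relative_contrast(dim, n_points=200):
    pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    origin = [random.random() for _ in range(dim)]
    ds = [math.dist(origin, p) for p in pts]
    return (max(ds) - min(ds)) / min(ds)

for dim in (2, 20, 200):
    print(dim, round(relative_contrast(dim), 3))
```

The printed contrast falls sharply from 2 to 200 dimensions, which is exactly the sparsity problem subspace-clustering approaches are designed to sidestep.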


2014 ◽  
Vol 2014 ◽  
pp. 1-8 ◽  
Author(s):  
Kang Zhang ◽  
Xingsheng Gu

Clustering has been widely used in different fields of science, technology, social science, and so forth. In the real world, numeric as well as categorical features are usually used to describe the data objects. Accordingly, many clustering methods can process datasets that are either numeric or categorical. Recently, algorithms that can handle mixed data clustering problems have been developed. The affinity propagation (AP) algorithm is an exemplar-based clustering method which has demonstrated good performance on a wide variety of datasets. However, it has limitations in processing mixed datasets. In this paper, we propose a novel similarity measure for mixed-type datasets and an adaptive AP clustering algorithm to cluster the mixed datasets. Several real-world datasets are studied to evaluate the performance of the proposed algorithm. Comparisons with other clustering algorithms demonstrate that the proposed method works well not only on mixed datasets but also on pure numeric and categorical datasets.
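A common way to build a similarity for mixed data is to combine a numeric term with a categorical matching term; standard AP uses negative squared Euclidean distance as its similarity, so a natural sketch extends that with a weighted mismatch count. This generic form is an assumption for illustration, not necessarily the measure proposed in the paper.

```python
# Sketch: similarity for mixed numeric/categorical records, combining the
# AP-style negative squared Euclidean distance on numeric features with a
# negative mismatch count on categorical features.
def mixed_similarity(a, b, num_idx, cat_idx, weight=1.0):
    num = -sum((a[i] - b[i]) ** 2 for i in num_idx)   # numeric term
    cat = -sum(a[i] != b[i] for i in cat_idx)         # categorical mismatches
    return num + weight * cat

x = (1.0, 2.0, "red", "small")
y = (1.5, 2.0, "red", "large")
print(mixed_similarity(x, y, num_idx=(0, 1), cat_idx=(2, 3)))
```

The resulting similarity matrix can be fed to any AP implementation unchanged, since AP only consumes pairwise similarities; the "adaptive" part of the paper concerns tuning such a weight automatically.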


Cryptography ◽  
2021 ◽  
Vol 5 (3) ◽  
pp. 20
Author(s):  
Kabiru Mohammed ◽  
Aladdin Ayesh ◽  
Eerke Boiten

In recent years, data-enabled technologies have intensified the rate and scale at which organisations collect and analyse data. Data mining techniques are applied to realise the full potential of large-scale data analysis. These techniques are highly efficient in sifting through big data to extract hidden knowledge and assist evidence-based decisions, offering significant benefits to their adopters. However, this capability is constrained by important legal, ethical and reputational concerns. These concerns arise because data mining techniques can be exploited to make inferences on sensitive data, thus posing severe threats to individuals’ privacy. Studies have shown that Privacy-Preserving Data Mining (PPDM) can adequately address this privacy risk and permit knowledge extraction in mining processes. Several published works in this area have utilised clustering techniques to enforce anonymisation models on private data, which work by grouping the data into clusters using a quality measure and generalising the data in each group separately to achieve an anonymisation threshold. However, existing approaches do not work well with high-dimensional data, since it is difficult to develop good groupings without incurring excessive information loss. Our work aims to complement this balancing act by optimising utility in PPDM processes. To illustrate this, we propose a hybrid approach that combines self-organising maps with conventional privacy-based clustering algorithms. We demonstrate through experimental evaluation that results from our approach produce more utility for data mining tasks and outperform conventional privacy-based clustering algorithms. This approach can significantly enable large-scale analysis of data in a privacy-preserving and trustworthy manner.
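The clustering-based anonymisation step can be sketched as grouping records into clusters of at least k members and generalising each numeric attribute to its within-cluster range. This is a simplified k-anonymity-style generalisation using a greedy one-attribute ordering; the paper's SOM hybrid is considerably more elaborate.

```python
# Sketch: greedy grouping into clusters of >= k records (ordered by the
# first attribute), then generalising each attribute to its cluster range.
def anonymise(records, k):
    ordered = sorted(records)
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    if len(groups) > 1 and len(groups[-1]) < k:   # merge an undersized tail
        groups[-2].extend(groups.pop())
    out = []
    for g in groups:
        gen = tuple((min(col), max(col)) for col in zip(*g))
        out.extend([gen] * len(g))                # every member gets the range
    return out

data = [(34, 70000), (36, 72000), (29, 51000), (31, 58000), (52, 90000)]
for row in anonymise(data, 2):
    print(row)
```

Every output record is now indistinguishable from at least k-1 others; the information loss is the width of each generalised range, which is the utility cost the hybrid approach tries to minimise.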


Determining the similarity or distance among data objects is an important part of many research fields such as statistics, data mining, and machine learning. Many measures are available in the literature to define the distance between two numerical data objects. It is difficult to define such a metric for the similarity between two categorical data objects, since categorical data objects are not ordered, and only a few distance measures for categorical data are available in the literature. This paper presents a comparative evaluation of various similarity measures for categorical data and also introduces a novel similarity measure for categorical data based on occurrence frequency and correlation. We evaluated the performance of these similarity measures in the context of the outlier detection task in data mining, using real-world data sets. Experimental results show that the proposed similarity measure outperforms the existing similarity measures in detecting outliers in categorical datasets.


Author(s):  
Athman Bouguettaya ◽  
Qi Yu

Clustering analysis has been widely applied in diverse fields such as data mining, access structures, knowledge discovery, software engineering, organization of information systems, and machine learning. The main objective of cluster analysis is to create groups of objects based on the degree of their association (Kaufman & Rousseeuw, 1990; Romesburg, 1990). There are two major categories of clustering algorithms with respect to the output structure: partitional and hierarchical (Romesburg, 1990). K-means is a representative of the partitional algorithms. The output of this algorithm is a flat structure of clusters. K-means is a very attractive algorithm because of its simplicity and efficiency, which make it one of the favorite choices for handling large datasets. On the flip side, it depends on the initial choice of the number of clusters. This choice may not be optimal, as it must be made at the very beginning, when there may not even be an informal expectation of what the number of natural clusters would be. Hierarchical clustering algorithms produce a hierarchical structure often presented graphically as a dendrogram. There are two main types of hierarchical algorithms: agglomerative and divisive. The agglomerative method uses a bottom-up approach, i.e., it starts with the individual objects, each considered to be in its own cluster, and then merges the clusters until the desired number of clusters is achieved. The divisive method uses the opposite approach, i.e., it starts with all objects in one cluster and divides them into separate clusters. The clusters form a tree, with each higher level showing a higher degree of dissimilarity. The height of the merging point in the tree represents the similarity distance at which the objects merge in one cluster. The agglomerative algorithms are usually able to generate high-quality clusters but suffer from a high computational complexity compared with divisive algorithms.
In this paper, we focus on investigating the behavior of agglomerative hierarchical algorithms. We further divide these algorithms into two major categories: group-based and single-object-based clustering methods. Typical examples of the former category include Unweighted Pair-Group using Arithmetic averages (UPGMA), Centroid Linkage, and WARDS. Single LINKage (SLINK) clustering and Complete LINKage clustering (CLINK) fall into the second category. We choose UPGMA and SLINK as the representatives of each category, and the comparison of these two representative techniques can also reflect some of the similarities and differences between these two sets of clustering methods. The study examines three key issues for clustering analysis: (1) the computation of the degree of association between different objects; (2) the designation of an acceptable criterion to evaluate how good and/or successful a clustering method is; and (3) the adaptability of the clustering method used under different statistical distributions of data, including random, skewed, concentrated around certain regions, etc. Two different statistical distributions are used to express how data objects are drawn from a 50-dimensional space. This also differentiates our work from some previous ones, where a limited number of dimensions for data features (typically up to three) is considered (Bouguettaya, 1996; Bouguettaya & LeViet, 1998). In addition, three types of distances are used to compare the resultant clustering trees: Euclidean, Canberra Metric, and Bray-Curtis distances. The results of an exhaustive set of experiments that involve data derived from a 50-dimensional space are presented. These experiments indicate a surprisingly high level of similarity between the two clustering techniques under most combinations of parameter settings.
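The two representatives differ only in how the inter-cluster distance is defined: SLINK uses the minimum pairwise distance, UPGMA the group average. The naive O(n^3) loop below makes that single point of difference explicit; the published SLINK and UPGMA algorithms achieve the same merges far more efficiently.

```python
# Naive agglomerative clustering: single linkage (SLINK-style minimum
# pairwise distance) vs. group-average linkage (UPGMA-style).
import math

def linkage_dist(c1, c2, mode):
    ds = [math.dist(p, q) for p in c1 for q in c2]
    return min(ds) if mode == "single" else sum(ds) / len(ds)

def agglomerate(points, k, mode="single"):
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the closest pair of clusters under the chosen linkage
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage_dist(clusters[ij[0]], clusters[ij[1]], mode),
        )
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerate(pts, 2, mode="single"))
```

Swapping `mode="average"` changes only the merge criterion, which is why the two families can be compared on identical data, criteria, and distance functions as the study does.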


2014 ◽  
Vol 37 (1) ◽  
pp. 125-139 ◽  
Author(s):  
Urszula Kuzelewska

Decisions are taken by humans very often, during professional as well as leisure activities. This is particularly evident when surfing the Internet: selecting web sites to explore, choosing needed information in search engine results, or deciding which product to buy in an on-line store. Recommender systems are electronic applications whose aim is to support humans in this decision-making process. They are widely used in many applications: adaptive WWW servers, e-learning, music and video preferences, internet stores, etc. In on-line solutions, such as e-shops or libraries, the aim of recommendations is to show customers the products they are probably interested in. The following are taken as input data: shopping basket archives, ratings of the products, or server log files. The article presents a recommender system which helps users select an interesting product. The system analyses data from other customers' ratings of the products. It uses clustering methods to find similarities among the users and the proposed techniques to identify users' profiles. The system was implemented in the Apache Mahout environment and tested on a movie database. The selected similarity measures are based on Euclidean distance, cosine, the correlation coefficient, and the log-likelihood function.
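Three of the listed similarity measures can be sketched over two users' rating vectors as follows. These are the textbook formulas over fully co-rated items; Mahout's implementations additionally handle sparse vectors and missing ratings.

```python
# Sketch: three user-user similarity measures over co-rated items.
import math

def euclidean_sim(u, v):
    d = math.dist(u, v)
    return 1.0 / (1.0 + d)          # map distance into (0, 1]

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def pearson_sim(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return num / den

alice = [5, 3, 4, 4]
bob = [3, 1, 2, 3]
print(round(pearson_sim(alice, bob), 3))
```

Pearson correlation is often preferred for ratings because it cancels out per-user rating bias (one user's "3" may mean another's "4"), which raw Euclidean or cosine similarity does not.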


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Qi Dou ◽  
Tiffany Y. So ◽  
Meirui Jiang ◽  
Quande Liu ◽  
Varut Vardhanabhuti ◽  
...  

Data privacy mechanisms are essential for rapidly scaling medical training databases to capture the heterogeneity of patient data distributions toward robust and generalizable machine learning systems. In the current COVID-19 pandemic, a major focus of artificial intelligence (AI) is interpreting chest CT, which can be readily used in the assessment and management of the disease. This paper demonstrates the feasibility of a federated learning method for detecting COVID-19 related CT abnormalities, with external validation on patients from a multinational study. We recruited 132 patients from seven different multinational centers, with three internal hospitals from Hong Kong for training and testing, and four external, independent datasets from Mainland China and Germany for validating model generalizability. We also conducted case studies on longitudinal scans for automated estimation of lesion burden in hospitalized COVID-19 patients. We explore federated learning algorithms to develop a privacy-preserving AI model for COVID-19 medical image diagnosis with good generalization capability on unseen multinational datasets. Federated learning could provide an effective mechanism during pandemics to rapidly develop clinically useful AI across institutions and countries, overcoming the burden of central aggregation of large amounts of sensitive data.
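The aggregation step at the heart of federated learning can be sketched as FedAvg-style weighted averaging: each site trains locally, only parameter vectors leave the site, and the server averages them weighted by local sample count. This is a schematic of the general technique, not the paper's full training pipeline.

```python
# Sketch: FedAvg aggregation -- patient data never leaves a hospital;
# the server only averages locally trained parameters, weighted by the
# number of local training samples.
def fed_avg(site_weights, site_sizes):
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
        for j in range(dim)
    ]

# three hospitals report locally trained parameter vectors
weights = [[0.2, 1.0], [0.4, 0.8], [0.3, 0.9]]
sizes = [100, 50, 50]
print(fed_avg(weights, sizes))
```

In each communication round the server broadcasts the averaged model back to the sites, which continue training locally; sensitive CT scans themselves are never centrally aggregated.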

