Clustering Algorithm for Privacy Preservation on MapReduce

Author(s):  
Zheng Zhao ◽  
Tao Shang ◽  
Jianwei Liu ◽  
Zhengyu Guan
Author(s):  
Shobana G ◽  
S. Shankar

Background: The increasing need for data publishers to release or share healthcare datasets poses a threat to the privacy and confidentiality of Electronic Medical Records. However, the main goal is to share useful information, maximizing utility while ensuring that sensitive information is not disclosed. There is always a utility-privacy tradeoff, which must be handled properly for researchers to learn the statistical properties of the datasets. Objective: The objective of this article is to introduce a novel SK-Clustering algorithm that overcomes identity disclosure, attribute disclosure and similarity attacks. The algorithm is evaluated using metrics such as the discernibility measure and relative error to compare its performance with other clustering algorithms. Methodology: The SK-Clustering algorithm flexibly adjusts the level of protection for high utility. The size of the clusters is also minimized dynamically based on the level of protection required, and extra tuples are added accordingly. This drastically reduces information loss, thereby increasing utility. Result and Conclusion: For a k-value of 50, the discernibility measure of the SK algorithm is 65000, whereas the Mondrian algorithm yields 70000 and the Anatomy algorithm 150000. Similarly, the relative error of our algorithm is less than 10% for a tuple count of 35000 when compared to other k-anonymity algorithms. The proposed algorithm performs better in terms of lower discernibility measure and relative error, thereby demonstrating higher data utility than traditional algorithms.
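The two evaluation metrics named in this abstract can be illustrated with a minimal sketch. It assumes the commonly used definition of the discernibility measure (the sum of squared equivalence-class sizes over the quasi-identifier columns) and a plain relative error for count queries; the abstract does not spell out either formula, and the toy table and column names below are hypothetical.

```python
from collections import Counter

def discernibility(records, quasi_identifiers):
    """Sum of squared equivalence-class sizes (assumed definition)."""
    # Records sharing the same quasi-identifier values form one equivalence class.
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return sum(size * size for size in classes.values())

def relative_error(true_count, estimated_count):
    """Relative error of a count query; assumes true_count > 0."""
    return abs(true_count - estimated_count) / true_count

# Hypothetical, already generalized toy table.
records = [
    {"age": "30-39", "zip": "130**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "130**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "130**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "148**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "148**", "diagnosis": "flu"},
]
dm = discernibility(records, ["age", "zip"])  # classes of size 3 and 2
```

Under this definition, smaller equivalence classes give a lower score, which is why a lower discernibility measure is read as higher data utility.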


Author(s):  
D. Shiny Irene ◽  
V. Surya ◽  
D. Kavitha ◽  
R. Shankar ◽  
S. John Justin Thangaraj

The objective of this research work is to analyze and validate health records; securing the personal information of patients is a challenging issue in health records mining. The risk prediction task was formulated as a multi-class classification problem with the label Cause of Death (COD), which views health-related death as the “biggest risk.” The unlabeled data describe the health conditions of the participants during health examinations, which can differ tremendously, from healthy to highly ill. In addition, the problems of distributed secure data management with privacy preservation are considered. The proposed health record mining proceeds in the following stages. In the initial stage, effective features such as Fisher score, Pearson correlation, and information gain are calculated from the patients’ health records. Then, the average values are calculated for the extracted features. In the second stage, feature selection is performed on the averaged features by applying the Euclidean distance measure. In the third stage, the chosen features are clustered using the distance adaptive fuzzy c-means clustering algorithm (DAFCM). In the fourth stage, an entropy-based graph is constructed for the classification of the data, and it categorizes the patients’ records. In the last stage, for security, privacy preservation is applied to the personal information of the patients. The performance is compared against existing methods, and the proposed approach performs better.
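The clustering stage can be pictured with the membership rule of standard fuzzy c-means. This is only a sketch of the textbook rule, not the distance adaptive variant (DAFCM), whose details the abstract does not give; the function name and the fuzzifier default `m=2.0` are assumptions.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

# Hypothetical two-feature record and two cluster centers.
u = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
```

Unlike hard clustering, each record receives a degree of membership in every cluster, and the degrees sum to one.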


2021 ◽  
Vol 17 (10) ◽  
pp. 155014772110425
Author(s):  
Zekun Zhang ◽  
Tongtong Wu ◽  
Xiaoting Sun ◽  
Jiguo Yu

The tremendous growth of the Internet of Medical Things has led to a surge in medical user data, and medical data publishing can provide users with numerous services. However, carelessly publishing the data may lead to severe leakage of users’ privacy. In this article, we investigate the problem of privacy-preserving data publishing in the Internet of Medical Things. We present a novel system model for Internet of Medical Things user data publishing, which adopts the proposed multiple partition differential privacy k-medoids clustering algorithm for data clustering analysis to ensure the security of user data. In particular, the proposed algorithm is based on differential privacy in data publishing. Building on traditional k-medoids clustering, the multiple partition differential privacy k-medoids clustering algorithm optimizes the randomness of selecting initial center points and adds Laplace noise to the clustering process to improve data availability while protecting users’ private information. Comprehensive analysis and simulations demonstrate that our method not only meets the requirements of differential privacy but also retains better availability of data clustering.
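The idea of adding Laplace noise can be sketched with the standard Laplace mechanism. The abstract does not say where in the clustering process the noise is injected, so perturbing per-cluster counts is an assumption, as are the function names and the default sensitivity of 1.

```python
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale  # expovariate takes a rate; the mean is 1 / rate
    return rng.expovariate(rate) - rng.expovariate(rate)

def noisy_counts(counts, epsilon, sensitivity=1.0, rng=None):
    """Perturb each count with Laplace noise of scale sensitivity / epsilon."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]

# Hypothetical cluster sizes released with privacy budget epsilon = 0.5.
released = noisy_counts([120, 85, 40], epsilon=0.5, rng=random.Random(7))
```

A smaller `epsilon` gives stronger privacy but a larger noise scale, which is the availability tradeoff the abstract refers to.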


2020 ◽  
Vol 12 (4) ◽  
pp. 71
Author(s):  
Nancy Awad ◽  
Jean-Francois Couchot ◽  
Bechara Al Bouna ◽  
Laurent Philippe

Data publishing is a challenging task under privacy preservation constraints. To ensure privacy, many anonymization techniques have been proposed. They differ in terms of the mathematical properties they verify and in terms of the functional objectives expected. Disassociation is one of the techniques that aim at anonymizing set-valued datasets (e.g., discrete locations, search and shopping items) while guaranteeing the confidentiality property known as k^m-anonymity. Disassociation separates the items of an itemset into vertical chunks to create ambiguity in the original associations. In a previous work, we defined a new ant-based clustering algorithm for the disassociation technique that preserves some items associated together, called utility rules, throughout the anonymization process, for accurate analysis. In this paper, we examine the disassociated dataset in terms of knowledge extraction. To make data analysis easy on top of the anonymized dataset, we define neighbor datasets, that is, datasets that result from a probabilistic re-association process. To assess the neighborhood notion, set-valued datasets are formalized as trees, and a tree edit distance (TED) is applied directly between these neighbors. Finally, we show in the experiments that the neighbors are faithful to knowledge extraction for future analysis.
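The k^m-anonymity property can be checked by brute force on a small set-valued dataset. The sketch assumes the usual definition (every combination of at most m items that occurs in the data occurs in at least k records); it only tests the property and does not perform disassociation. The function name and toy baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    """True if every itemset of size <= m present in the data has support >= k."""
    support = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, m + 1):
            # Count each itemset once per record that contains it.
            for combo in combinations(items, size):
                support[combo] += 1
    return all(count >= k for count in support.values())

# Hypothetical shopping baskets.
baskets = [{"milk", "bread"}, {"milk", "bread"}, {"milk", "wine"}]
ok = is_km_anonymous(baskets, k=2, m=1)  # "wine" appears only once
```

Enumerating all combinations grows quickly with m, so this check is meant only for small examples.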


2016 ◽  
Vol 45 (4) ◽  
pp. 1179-1191 ◽  
Author(s):  
Qingying Yu ◽  
Yonglong Luo ◽  
Chuanming Chen ◽  
Xintao Ding

2020 ◽  
Vol 39 (6) ◽  
pp. 8139-8147
Author(s):  
Ranganathan Arun ◽  
Rangaswamy Balamurugan

In Wireless Sensor Networks (WSNs), the energy of sensor nodes is limited. To extend the lifetime of a WSN, it is essential to minimize energy consumption. Using a head of group, or Cluster Head (CH), which aggregates the WSN with higher energy, is a prominent method for extending the lifetime of a WSN. Intra-cluster and inter-cluster communication depends on the CH. Overall, in a WSN, the energy level of the CH extends the life of its cluster. When developing clustering algorithms, the difficult task is to determine the energy consumption of heterogeneous WSNs (HWSNs). The proposed work, based on Chaotic Firefly Algorithm CH (CFACH) selection, is named the “Novel Distributed Entropy Energy-Efficient Clustering Algorithm”, or DEEEC for short, for HWSNs. The proposed DEEEC algorithm for CH selection has two main stages. In the first stage, temporary CHs and their entropy values are identified using a correlative measure of residual and original energy. In addition, each sensor node must automatically predict its rotating epoch and entropy value within the clustering algorithm. In the second stage, if any member of the cluster has larger residual energy, the temporary CHs are modified toward the deciding set. Through these two stages of CH selection, nodes with large energy have a higher probability of becoming CHs. The DEEEC algorithm is simulated in MATLAB. The simulation results show that the proposed DEEEC algorithm performs well in terms of energy and increased lifetime when compared with the traditional clustering protocols currently used in heterogeneous WSNs.


Author(s):  
Mohana Priya K ◽  
Pooja Ragavi S ◽  
Krishna Priya G

Clustering is the process of grouping objects into subsets that have meaning in the context of a particular problem. It does not rely on predefined classes. It is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed and are used in different applications. Sentence clustering is one of the best clustering techniques. A hierarchical clustering algorithm is applied at multiple levels for accuracy. For tagging, a POS tagger and the Porter stemmer are used. The WordNet dictionary is used to determine similarity by invoking the Jiang-Conrath and cosine similarity measures. Grouping is performed with respect to the highest similarity value, with a mean threshold. This paper incorporates many parameters for finding the similarity between words. To identify the disambiguated words, sense identification is performed for the adjectives and a comparison is made. The SemCor and machine learning datasets are employed. Compared with previous results for word sense disambiguation (WSD), our work shows a substantial improvement, reaching 91.2%.
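Of the two similarity measures mentioned, cosine similarity is easy to sketch on bag-of-words vectors. The Jiang-Conrath measure needs WordNet and corpus statistics and is not shown. Whitespace tokenization and lowercasing are simplifying assumptions; the abstract's pipeline uses a POS tagger and the Porter stemmer instead.

```python
import math
from collections import Counter

def cosine_similarity(sentence_a, sentence_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(sentence_a.lower().split())
    b = Counter(sentence_b.lower().split())
    # Only words present in both sentences contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical sentence pair sharing two of three words.
score = cosine_similarity("the cat sat", "the cat ran")
```

Sentences would then be grouped when their score exceeds a chosen threshold, such as the mean threshold the abstract mentions.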


2020 ◽  
Vol 4 (2) ◽  
pp. 780-787
Author(s):  
Ibrahim Hassan Hayatu ◽  
Abdullahi Mohammed ◽  
Barroon Ahmad Isma’eel ◽  
Sahabi Yusuf Ali

Soil fertility determines a plant's development process, which guarantees food sufficiency and the security of lives and properties through bumper harvests. The fertility of soil varies by region, thereby determining the type of crops to be planted. However, there is no repository or other source of information about soil fertility in any region of Nigeria, especially the Northwest of the country. The only available information consists of soil samples and their attributes, which give little or no information to the average farmer. This has affected crop yield in all regions, particularly the Northwest, resulting in lower food production. Therefore, this study aims to classify soil data by fertility in the Northwest region of Nigeria using R programming. Data were obtained from the Department of Soil Science at Ahmadu Bello University, Zaria. The data comprise 400 soil samples with 13 attributes. The relationship between soil attributes was observed based on the data. The K-means clustering algorithm was employed to analyze soil fertility clusters. Four clusters were identified, with cluster 1 having the highest fertility, followed by cluster 2, and fertility decreasing as the cluster number increases. The identification of the most fertile clusters will guide farmers on where best to concentrate when planting their crops in order to improve productivity and crop yield.
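The K-means step can be sketched as plain Lloyd iterations. The study itself used R; this Python version is only an illustration, and the two-attribute samples and starting centers below are hypothetical rather than taken from the 400-sample dataset.

```python
import math

def kmeans(points, centers, iterations=20):
    """Lloyd's algorithm with caller-supplied starting centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers

# Hypothetical samples described by two soil attributes.
samples = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centers = kmeans(samples, [(0.0, 0.0), (10.0, 10.0)])
```

In practice the attributes would be scaled first, since K-means relies on Euclidean distance and is dominated by attributes with large numeric ranges.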

