Privacy Preserving Data Mining on Unstructured Data

Author(s):  
Trupti Vishwambhar Kenekar ◽  
Ajay R. Dani

As Big Data is group of structured, unstructured and semi-structure data collected from various sources, it is important to mine and provide privacy to individual data. Differential Privacy is one the best measure which provides strong privacy guarantee. The chapter proposed differentially private frequent item set mining using map reduce requires less time for privately mining large dataset. The chapter discussed problem of preserving data privacy, different challenges to preserving data privacy in big data environment, Data privacy techniques and their applications to unstructured data. The analyses of experimental results on structured and unstructured data set are also presented.

An advanced Incremental processing technique is planned for data examination in knowledge to have the clustering results inform. Data is continuously arriving by different data generating factors like social network, online shopping, sensors, e-commerce etc. [1]. On account of this Big Data the consequences of data mining applications getting stale and neglected after some time. Cloud knowledge applications regularly perform iterative calculations (e.g., PageRank) on continuously converting datasets. Though going before trainings grow Map-Reduce aimed at productive iterative calculations, it's miles also pricey to carry out a whole new big-ruler Map-Reduce iterative task near well-timed quarter new adjustments to fundamental records sets. Our usage of MapReduce keeps running [4] scheduled a big cluster of product technologies and is incredibly walkable: an ordinary Map-Reduce computation procedure several terabytes of records arranged heaps of technologies. Processor operator locates the machine clean to apply: masses of MapReduce applications, we look at that during many instances, The differences result separate a totally little part of the data set, and the recently iteratively merged nation is very near the recently met state. I2MapReduce clustering adventures this commentary to keep re-calculated by way of beginning after the before affected national [2], and by using acting incremental up-dates on the converging information. The approach facilitates in enhancing the process successively period and decreases the jogging period of stimulating the consequences of big data.


Author(s):  
Shalin Eliabeth S. ◽  
Sarju S.

Big data privacy preservation is one of the most disturbed issues in current industry. Sometimes the data privacy problems never identified when input data is published on cloud environment. Data privacy preservation in hadoop deals in hiding and publishing input dataset to the distributed environment. In this paper investigate the problem of big data anonymization for privacy preservation from the perspectives of scalability and time factor etc. At present, many cloud applications with big data anonymization faces the same kind of problems. For recovering this kind of problems, here introduced a data anonymization algorithm called Two Phase Top-Down Specialization (TPTDS) algorithm that is implemented in hadoop. For the data anonymization-45,222 records of adults information with 15 attribute values was taken as the input big data. With the help of multidimensional anonymization in map reduce framework, here implemented proposed Two-Phase Top-Down Specialization anonymization algorithm in hadoop and it will increases the efficiency on the big data processing system. By conducting experiment in both one dimensional and multidimensional map reduce framework with Two Phase Top-Down Specialization algorithm on hadoop, the better result shown in multidimensional anonymization on input adult dataset. Data sets is generalized in a top-down manner and the better result was shown in multidimensional map reduce framework by the better IGPL values generated by the algorithm. The anonymization was performed with specialization operation on taxonomy tree. The experiment shows that the solutions improves the IGPL values, anonymity parameter and decreases the execution time of big data privacy preservation by compared to the existing algorithm. This experimental result will leads to great application to the distributed environment.


2020 ◽  
Vol 2020 ◽  
pp. 1-29 ◽  
Author(s):  
Xingxing Xiong ◽  
Shubo Liu ◽  
Dan Li ◽  
Zhaohui Cai ◽  
Xiaoguang Niu

With the advent of the era of big data, privacy issues have been becoming a hot topic in public. Local differential privacy (LDP) is a state-of-the-art privacy preservation technique that allows to perform big data analysis (e.g., statistical estimation, statistical learning, and data mining) while guaranteeing each individual participant’s privacy. In this paper, we present a comprehensive survey of LDP. We first give an overview on the fundamental knowledge of LDP and its frameworks. We then introduce the mainstream privatization mechanisms and methods in detail from the perspective of frequency oracle and give insights into recent studied on private basic statistical estimation (e.g., frequency estimation and mean estimation) and complex statistical estimation (e.g., multivariate distribution estimation and private estimation over complex data) under LDP. Furthermore, we present current research circumstances on LDP including the private statistical learning/inferencing, private statistical data analysis, privacy amplification techniques for LDP, and some application fields under LDP. Finally, we identify future research directions and open challenges for LDP. This survey can serve as a good reference source for the research of LDP to deal with various privacy-related scenarios to be encountered in practice.


2014 ◽  
Vol 23 (01) ◽  
pp. 21-26 ◽  
Author(s):  
T. Miron-Shatz ◽  
A. Y. S. Lau ◽  
C. Paton ◽  
M. M. Hansen

Summary Objectives: As technology continues to evolve and rise in various industries, such as healthcare, science, education, and gaming, a sophisticated concept known as Big Data is surfacing. The concept of analytics aims to understand data. We set out to portray and discuss perspectives of the evolving use of Big Data in science and healthcare and, to examine some of the opportunities and challenges. Methods: A literature review was conducted to highlight the implications associated with the use of Big Data in scientific research and healthcare innovations, both on a large and small scale. Results: Scientists and health-care providers may learn from one another when it comes to understanding the value of Big Data and analytics. Small data, derived by patients and consumers, also requires analytics to become actionable. Connectivism provides a framework for the use of Big Data and analytics in the areas of science and healthcare. This theory assists individuals to recognize and synthesize how human connections are driving the increase in data. Despite the volume and velocity of Big Data, it is truly about technology connecting humans and assisting them to construct knowledge in new ways. Concluding Thoughts: The concept of Big Data and associated analytics are to be taken seriously when approaching the use of vast volumes of both structured and unstructured data in science and health-care. Future exploration of issues surrounding data privacy, confidentiality, and education are needed. A greater focus on data from social media, the quantified self-movement, and the application of analytics to “small data” would also be useful.


Author(s):  
Khyati R Nirmal ◽  
K.V.V. Satyanarayana

<p><span>In recent times Big Data Analysis are imminent as essential area in the field of Computer Science. Taking out of significant information from Big Data by separating the data in to distinct group is crucial task and it is beyond the scope of commonly used personal machine. It is necessary to adopt the distributed environment similar to map reduce paradigm and migrate the data mining algorithm using it. In Data Mining the partition based K Means Clustering is one of the broadly used algorithms for grouping data according to the degree of similarities between data. It requires the number of K and initial centroid of cluster as input. By surveying the parameters preferred by algorithm or opted by user influence the functionality of Algorithm. It is the necessity to migrate the K means Clustering on MapReduce and predicts the value of k using machine learning approach. For selecting the initial cluster the efficient method is to be devised and united with it. This paper is comprised the survey of several methods for predicting the value of K in K means Clustering and also contains the survey of different methodologies to find out initial center of the cluster. Along with initial value of k and initial centroid selection the objective of proposed work is to compact with analysis of categorical data.</span></p>


IEEE Access ◽  
2014 ◽  
Vol 2 ◽  
pp. 1149-1176 ◽  
Author(s):  
Lei Xu ◽  
Chunxiao Jiang ◽  
Jian Wang ◽  
Jian Yuan ◽  
Yong Ren

2019 ◽  
Vol 16 (2) ◽  
pp. 445-452
Author(s):  
Kishore S. Verma ◽  
A. Rajesh ◽  
Adeline J. S. Johnsana

K anonymization is one of the worldwide used approaches to protect the individual records from the privacy leakage attack of Privacy Preserving Data Mining (PPDM) arena. Typically anonymized dataset will impact the effectiveness of data mining results. Anyhow, currently researchers of PPDM progress in driving their efforts in finding out the optimum trade-off between privacy and utility. This work tends in bringing out the optimum classifier from a set of best classifiers of data mining approaches that are capable enough in generating value-added classifying results on utility aware k-anonymized data set. We performed the analytical approach on the data set that are anonymized in sense of accompanying the anonymity utility factors like null values count and transformation pattern loss. The experimentation is done with three widely used classifiers HNB, PART and J48 and these classifiers are analysed with Accuracy, F-measure, and ROC-AUC which are literately proved to be the perfect measures of classification. Our experimental analysis reveals the best classifiers on the utility aware anonymized data sets of Cell oriented Anonymization (CoA), Attribute oriented Anonymization (AoA) and Record oriented Anonymization (RoA).


Sign in / Sign up

Export Citation Format

Share Document