Achieving Optimal K-Anonymity Parameters for Big Data

Author(s):  
Mohammed Essa Al-Zobbi ◽  
Seyed Shahrestani ◽  
Chun Ruan

Datasets containing private and sensitive information are valuable for data analytics, so data owners release them cautiously, using privacy-preserving publishing techniques. The possibility of personal re-identification is greater than ever before; social media, for instance, has dramatically increased exposure to privacy violations. The well-known k-anonymity technique offers protection against such exposure: it generalizes records so that each combination of the chosen attributes, known as quasi-identifiers, is shared by at least k records. This reduces the risk of personal re-identification, but it may also lessen the usefulness of the information gained. The value of k should therefore be determined carefully, balancing security against information loss. Unfortunately, there is no standard procedure for choosing k, and the problem of optimal k-anonymization is NP-hard. In this paper, we propose a greedy heuristic that provides an optimal value for k by evaluating the empirical risk under our Sensitivity-Based Anonymization method. The approach derives from our framework of fine-grained access and business-role anonymization for big data.
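The paper's greedy heuristic and Sensitivity-Based Anonymization method are not detailed in the abstract; as a minimal sketch of the underlying notion, the following Python snippet measures the k-anonymity level of a table as the size of its smallest equivalence class over a set of quasi-identifiers (the column names and generalizations are hypothetical).

```python
from collections import Counter

def anonymity_level(records, quasi_identifiers):
    """Return the k for which the table is k-anonymous: the size of the
    smallest equivalence class over the quasi-identifier columns."""
    classes = Counter(
        tuple(record[qi] for qi in quasi_identifiers)
        for record in records
    )
    return min(classes.values())

# Hypothetical example: age and zip have already been generalized.
records = [
    {"age": "30-39", "zip": "751**", "disease": "flu"},
    {"age": "30-39", "zip": "751**", "disease": "cold"},
    {"age": "40-49", "zip": "752**", "disease": "flu"},
    {"age": "40-49", "zip": "752**", "disease": "asthma"},
]
print(anonymity_level(records, ["age", "zip"]))  # -> 2, i.e. 2-anonymous
```

A greedy search of the kind the paper proposes would repeat such measurements across candidate generalizations and candidate values of k, stopping where the estimated risk/utility trade-off is best.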

2018 ◽  
Author(s):  
Jérémie Decouchant ◽  
Maria Fernandes ◽  
Marcus Völp ◽  
Francisco M Couto ◽  
Paulo Esteves-Veríssimo

Sequencing thousands of human genomes has enabled breakthroughs in many areas, among them precision medicine, the study of rare diseases, and forensics. However, mass collection of such sensitive data entails enormous risks if it is not protected to the highest standards. In this article, we take the position that post-alignment privacy is not enough and argue that data should be protected automatically as early as possible in the genomics workflow, ideally immediately after it is produced. We show that a previous approach for filtering short reads cannot be extended to long reads, and we present a novel filtering approach that classifies raw genomic data (i.e., data whose location and content are not yet determined) into privacy-sensitive information (i.e., more affected by a successful privacy attack) and non-privacy-sensitive information. Such a classification allows fine-grained, automated adjustment of protective measures to mitigate the possible consequences of exposure, in particular when relying on public clouds. We present the first filter that can be applied indistinctly to reads of any length, making it usable with any current or future sequencing technology. The filter is accurate, in the sense that it detects all known sensitive nucleotides except those located in highly variable regions (fewer than 10 nucleotides per genome remain undetected, instead of 100,000 in previous work). It produces far fewer false positives than previously known methods (10% instead of 60%) and can detect sensitive nucleotides despite sequencing errors (86% detected instead of 56% at a 2% mutation rate). Finally, practical experiments demonstrate high performance, both in terms of throughput and memory consumption.
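The abstract does not detail the filter itself; as a rough sketch of read classification against a reference of sensitive sites, the snippet below flags a read as privacy-sensitive if any of its k-mers appears in a (hypothetical) dictionary of sensitive k-mers, a check that works identically for short and long reads.

```python
def build_sensitive_kmers(sensitive_sequences, k=8):
    """Collect every k-mer overlapping a known sensitive site.
    The input sequences stand in for a real variant database."""
    kmers = set()
    for seq in sensitive_sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers

def is_privacy_sensitive(read, sensitive_kmers, k=8):
    """A read is flagged if any of its k-mers matches the dictionary,
    regardless of the read's length."""
    return any(read[i:i + k] in sensitive_kmers
               for i in range(len(read) - k + 1))

# Hypothetical sensitive region; reads of different lengths are handled alike.
kmers = build_sensitive_kmers(["ACGTTGCAAT"])
print(is_privacy_sensitive("TTACGTTGCAATGG", kmers))                     # True
print(is_privacy_sensitive("GG" * 50 + "ACGTTGCA" + "TT" * 50, kmers))  # True
```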


2021 ◽  
Author(s):  
Rohit Ravindra Nikam ◽  
Rekha Shahapurkar

Data mining is a technique for extracting useful knowledge from large data sets. Privacy-preserving data mining is about hiding sensitive information, or the identities behind it, without breaching security or losing data usability. Sensitive data contains confidential information about individuals, businesses, and governments, which must not be shared or published without their agreement. Preserving privacy in data mining has therefore become a critical research area. Various evaluation metrics, such as time efficiency, data utility, and degree of complexity or resistance to data mining attacks, are used to assess privacy-preserving data mining techniques. Social media and smartphones produce tons of data every minute, and the voluminous data produced from these different sources can be processed and analyzed to support decision making. But data analytics is vulnerable to privacy breaches. One such analytics framework is the recommendation system, commonly used by e-commerce sites such as Amazon and Flipkart to recommend items to customers based on their purchasing habits, which can lead to profiling. This paper surveys the privacy-preservation techniques used by existing researchers, such as data anonymization, data randomization, generalization, and data permutation. We also analyze the gaps between the various approaches to privacy preservation and illustrate how such issues can be overcome with new, innovative methods. Finally, we summarize the outcomes of the entire literature reviewed.
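Of the techniques the survey lists, data randomization is the simplest to show in code; the sketch below perturbs a numeric attribute with zero-mean Gaussian noise so that aggregate statistics survive approximately while individual values are masked (the noise scale is an illustrative assumption, not a value from the paper).

```python
import random
import statistics

def randomize(values, noise_scale=5.0):
    """Add zero-mean Gaussian noise to each value. Individual records
    are distorted, but the mean is approximately preserved."""
    return [v + random.gauss(0.0, noise_scale) for v in values]

ages = [23, 35, 41, 52, 29, 60, 47, 33, 38, 55]
masked = randomize(ages)
print(statistics.mean(ages), round(statistics.mean(masked), 1))
```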


2020 ◽  
Vol 13 (2) ◽  
pp. 283-295
Author(s):  
Ajmeera Kiran ◽  
Vasumathi Devara

Background: Big data analytics is the process of examining data collections stored on the internet, which can be stored and retrieved anywhere and at any time. Big data is not simply data; it includes the data generated by a wide variety of gadgets, devices, and applications. Objective: When massive volumes of data are stored, searches over data held on a server are open to malevolent attacks if the privacy-preserving approaches in place are weak. Traditional methods suffer many drawbacks because of the various attacks on sensitive information. Hence, to strengthen the privacy preservation of sensitive information stored in databases, the proposed method employs more efficient techniques. Methods: In this manuscript, optimal privacy preservation over big data using the Hadoop MapReduce framework is proposed. Initially, the input data is grouped by a modified fuzzy c-means clustering algorithm, and a MapReduce job is then executed. The clustered data is fed to the mapper, where the privacy of the input data is enforced by a convolution process. To validate the privacy of the input data, the recommended technique uses an optimized artificial neural network, where an oppositional fruit fly algorithm tunes the network. Results: The performance of the suggested system is assessed in terms of clustering accuracy, error value, memory, and time. The experiments are performed on the KDD dataset. Conclusion: The results show that the proposed system achieves maximum accuracy and an effective convolution process that improves privacy preservation.
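The paper's modification to fuzzy c-means is not described in the abstract; for orientation, here is a standard fuzzy c-means sketch in NumPy (the fuzzifier m, iteration count, and cluster count are assumptions), of the kind that would group records before they are handed to the mapper.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=3, m=2.0, n_iter=100, seed=0):
    """Standard fuzzy c-means: alternate between updating cluster
    centers and soft membership degrees for n_iter passes."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)              # memberships sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)                       # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5])
centers, U = fuzzy_c_means(X, n_clusters=2)
print(centers)   # two centers, near (0, 0) and (5, 5)
```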


Author(s):  
Amine Rahmani ◽  
Abdelmalek Amine ◽  
Reda Mohamed Hamou

In recent years, with the emergence of new technologies under the banner of big data, privacy concerns have grown widely. Big data implies the dematerialization of data, and classical security solutions are no longer efficient in this setting. Nowadays, sharing data is as easy as saying hello, and the amount of data shared over the web keeps growing from one day to the next, which creates a wide gap between the purpose of sharing data and the fact that such data contains sensitive information. Researchers have therefore turned their attention to new issues and domains in order to narrow this gap; in other words, they aim to ensure good data utility, preserving the data's meaning while hiding sensitive information to prevent identity disclosure. Many techniques have been used for this: some are mathematical, while others rely on data mining algorithms. This paper deals with the problem of hiding sensitive data in shared structured medical data using a new bio-inspired algorithm modeled on the natural phenomenon of cell apoptosis in the human body.
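The apoptosis-inspired algorithm itself is not described in the abstract; purely to ground the stated goal, here is a generic suppression sketch, not the paper's method, that "kills" (masks) the cells of a structured medical record holding direct identifiers while leaving analytically useful fields intact (the field names are hypothetical).

```python
DIRECT_IDENTIFIERS = {"name", "ssn", "phone"}

def suppress(record):
    """Mask identifier cells while keeping analytically useful fields."""
    return {k: ("*" if k in DIRECT_IDENTIFIERS else v)
            for k, v in record.items()}

row = {"name": "J. Doe", "ssn": "123-45-6789",
       "age": 42, "diagnosis": "asthma", "phone": "555-0101"}
print(suppress(row))
# {'name': '*', 'ssn': '*', 'age': 42, 'diagnosis': 'asthma', 'phone': '*'}
```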


Author(s):  
Rajit Nair ◽  
Amit Bhagat

Data is being captured in all domains of society, and one of the important domains is transportation. Large amounts of data have been collected that are detailed, fine-grained, and of great coverage, allowing traffic and transportation to be tracked to an extent that was not possible in the past. Existing big data analytics for transportation is already yielding useful applications in the areas of traffic routing, congestion management, and scheduling. This is just the beginning of the applications of big data that will ultimately let the transportation network be managed properly and efficiently. It has been observed that many individuals do not follow the traffic rules properly, especially in highly populated areas; to monitor these types of traffic violators, this chapter proposes a work that is mainly based on big data analytics. In this chapter, the authors trace vehicles using the data collected by different devices and analyze it with big data analysis methods.


As with prior technological advancements, big data technology is growing rapidly, and we must identify the threats that could overwhelm present security systems. The recent technical environment, with cloud platforms, network-connected smartphones, and the omnipresent digital conversion of huge volumes of all types of data, exposes sensitive data to more possible threats; this increased vulnerability means big data requires increased responsibility. About 90% of all existing data was created during the last two years. Protecting sensitive data from unauthorized disclosure is the most challenging task in any kind of data processing. Data leakage detection offers a set of methods and techniques that can efficiently address this problem for particularly critical data. Most existing data is unstructured, so superior analytical methods must be developed to retrieve meaningful information from big data. The security algorithms available at present are not easy to implement over huge volumes of data. We have to protect sensitive information, as well as details related to users, with the help of security protocols for big data. The sensitive data of patients, different types of code patterns, and sets of attributes are to be secured using machine learning tools, which offer many library functions for protecting clients' sensitive information. We propose the Secure Pattern-Based Data Sensitivity Framework (PBDSF) to protect such sensitive information in big data using machine learning. In the proposed framework, HDFS is used to analyze the big data, classify the most important information, and convert the sensitive data into a secure form.
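The abstract does not specify PBDSF's classification rules; as a stand-in, this sketch flags records as sensitive using simple regular-expression patterns of the kind such a framework might learn or encode (the patterns and record text are hypothetical).

```python
import re

# Hypothetical patterns for sensitive values; a real framework would
# learn or curate these rather than hard-code them.
SENSITIVE_PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "mrn":   re.compile(r"\bMRN-\d{6}\b"),        # medical record number
}

def classify(record_text):
    """Return the names of every sensitive pattern found in the text."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items()
            if pattern.search(record_text)]

print(classify("Patient MRN-123456, contact jdoe@example.com"))
# ['email', 'mrn']
```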


2019 ◽  
Vol 15 (3) ◽  
pp. 155014771984023
Author(s):  
Gang Liu ◽  
Guofang Zhang ◽  
Quan Wang ◽  
Shaomin Ji ◽  
Lizhi Zhang

In Android systems, sensitive information associated with a system permission is exposed completely to an application once it gains that permission. To solve this problem, this article presents a fine-grained access control framework for sensitive information based on the eXtensible Access Control Markup Language (XACML) data flow model. In this framework, a user can define access policies for each application and resource, and an application's requests to access sensitive information are evaluated against these policies. Therefore, all access requests must comply with the security policy, irrespective of whether the application has been granted the permission associated with the information. This protects sensitive data beyond the Android permission mechanism. To make policies easier to manage, the proposed framework implements automatic policy generation and policy conflict detection. The framework is implemented in TaintDroid, and experiments indicate that the improvement is effective in achieving fine-grained access control to sensitive information without adversely affecting system overhead.
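XACML itself expresses policies in XML; the following Python sketch mimics only the evaluation logic described here, checking each (app, resource) request against user-defined rules with a deny-by-default outcome (the rule format and package names are assumptions).

```python
# Each policy maps an (app, resource) pair to an effect. Requests that
# match no rule are denied by default, so holding the Android permission
# alone is not enough to read the resource.
POLICIES = {
    ("com.example.maps", "location"): "PERMIT",
    ("com.example.game", "contacts"): "DENY",
}

def evaluate(app, resource):
    return POLICIES.get((app, resource), "DENY")

print(evaluate("com.example.maps", "location"))   # PERMIT
print(evaluate("com.example.game", "location"))   # DENY (no matching rule)
```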


Big data refers to large volumes of data and necessitates the use of the cloud for storage and processing. Cloud tenants' data is not only stored in the cloud but also shared among multiple users. Data stored in the cloud must be well protected, as it is prone to malicious attacks and hardware failures; moreover, users' data on the cloud contains sensitive information that must be protected and highly restricted from unauthorized access. Cloud deployment models such as public cloud, private cloud, and hybrid cloud can be used for storing cloud tenants' data. This paper proposes a secure storage approach for protecting data in the cloud by partitioning a big dataset into blocks containing users' sensitive data, insensitive data, and public data. Sensitive data is moved to the private cloud and protected using proxy re-encryption. Insensitive data is stored in the public cloud, with some data blocks randomly encrypted; in addition, the storage index information of the insensitive data blocks is encrypted and shared among authorized users. Public data is also moved to the public cloud, and to protect it, only its storage path information is encrypted and shared. The proposed approach shows better results, with reduced computation overhead and improved security.
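The paper's proxy re-encryption scheme is not given in the abstract; the sketch below shows only the partition-and-selectively-encrypt idea, splitting a record into sensitive, insensitive, and public blocks and encrypting the sensitive block with a symmetric Fernet key as a placeholder for the real scheme (the field classification is hypothetical).

```python
import json
from cryptography.fernet import Fernet  # pip install cryptography

SENSITIVE = {"ssn", "salary"}
PUBLIC = {"city"}

def partition(record):
    """Split one record into sensitive / insensitive / public blocks."""
    blocks = {"sensitive": {}, "insensitive": {}, "public": {}}
    for k, v in record.items():
        dest = ("sensitive" if k in SENSITIVE
                else "public" if k in PUBLIC else "insensitive")
        blocks[dest][k] = v
    return blocks

key = Fernet.generate_key()     # stands in for the proxy re-encryption keys
f = Fernet(key)
blocks = partition({"ssn": "123-45-6789", "salary": 90000,
                    "age": 42, "city": "Sydney"})
cipher = f.encrypt(json.dumps(blocks["sensitive"]).encode())
print(blocks["public"], blocks["insensitive"], cipher[:20], b"...")
```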


We are in the information age, collecting very large volumes of data from diverse sources in structured, unstructured, and semi-structured form, ranging from petabytes to exabytes. Data is an asset, as valuable knowledge and information are hidden in such massive volumes. Data analytics is required to gain deeper insights and identify fine-grained patterns so as to make accurate predictions and thereby improve decision making. Extracting knowledge from data is the job of data analytics, and machine learning forms its core. The increase in the dimensionality of data, both in the number of tuples and in the number of features, poses several challenges to machine learning algorithms. Since data is preprocessed before machine learning, feature selection is done as a preprocessing step to reduce the dimensionality of the data, removing irrelevant features and improving the efficiency and accuracy of a machine learning algorithm. In this paper we study various feature selection mechanisms and analyze whether they can be adopted for sentiment analysis of big data.
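As one concrete mechanism of the kind the paper surveys, the sketch below applies chi-square feature selection to a toy sentiment corpus with scikit-learn (the corpus and the choice of k are illustrative assumptions).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy sentiment corpus; a big-data pipeline would stream this instead.
texts = ["great phone, love the screen", "terrible battery, awful phone",
         "love it, great value", "awful screen, terrible value"]
labels = [1, 0, 1, 0]                       # 1 = positive, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(texts)                # bag-of-words features
selector = SelectKBest(chi2, k=4).fit(X, labels)
names = vec.get_feature_names_out()
print([names[i] for i in selector.get_support(indices=True)])
# e.g. ['awful', 'great', 'love', 'terrible']
```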


2016 ◽  
Vol 58 (4) ◽  
Author(s):  
Wolfram Wingerath ◽  
Felix Gessert ◽  
Steffen Friedrich ◽  
Norbert Ritter

With the rise of Web 2.0 and the Internet of Things, it has become feasible to track all kinds of information over time, in particular fine-grained user activities, sensor data about users' environment, and even their biometrics. However, while efficiency remains mandatory for any application trying to cope with huge amounts of data, only part of the potential of today's big data repositories can be exploited using traditional batch-oriented approaches, as the value of data often decays quickly and high latency is unacceptable in some applications. In the last couple of years, several distributed data processing systems have emerged that deviate from the batch-oriented approach and tackle data items as they arrive, thus acknowledging the growing importance of timeliness and velocity in big data analytics. In this article, we give an overview of the state of the art in stream processors for low-latency big data analytics and conduct a qualitative comparison of the most popular contenders, namely Storm and its abstraction layer Trident, Samza, and Spark Streaming. We describe their respective underlying rationales and the guarantees they provide, and we discuss the trade-offs that come with selecting one of them for a particular task.
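To make the processing model concrete, here is the classic word-count example in Spark Streaming's Python API, one of the systems compared; it processes a socket text stream in one-second micro-batches (the host and port are placeholders).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=1)       # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's counts

ssc.start()
ssc.awaitTermination()
```

Spark Streaming's micro-batch model trades a little latency for the batch engine's fault-tolerance guarantees, which is exactly the kind of trade-off the article's comparison examines.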

