In-Memory Data Anonymization Using Scalable and High Performance RDD Design

Recent studies in data anonymization techniques have primarily focused on MapReduce. However, these existing MapReduce based approaches often suffer from many performance overheads due to their inappropriate use of data allocation, expensive disk I/O access and network transfer, and no support for iterative tasks. We propose “SparkDA” which is a new novel anonymization technique that is designed to take the full advantage of Spark platform to generate privacy-preserving anonymized dataset in the most efficient way possible. Our proposal offers a better partition control, in-memory operation and cache management for iterative operations that are heavily utilised for data anonymization processing. Our proposal is based on Spark’s Resilient Distributed Dataset (RDD) with two critical operations of RDD, such as FlatMapRDD and ReduceByKeyRDD, respectively. The experimental results demonstrate that our proposal outperforms the existing approaches in terms of performance and scalability while maintaining high data privacy and utility levels. This illustrates that our proposal is capable to be used in a wider big data applications that demands privacy.

Download Full-text

A Novel Hybrid Approach for Multi-Dimensional Data Anonymization for Apache Spark

ACM Transactions on Privacy and Security ◽

10.1145/3484945 ◽

2022 ◽

Vol 25 (1) ◽

pp. 1-25

Author(s):

Sibghat Ullah Bazai ◽

Julian Jang-Jaccard ◽

Hooman Alavizadeh

Keyword(s):

Critical Analysis ◽

Data Privacy ◽

High Performance ◽

Distributed Processing ◽

Hybrid Approach ◽

Relative Size ◽

Optimal Number ◽

Data Anonymization ◽

Fine Grained ◽

Message Exchange

Multi-dimensional data anonymization approaches (e.g., Mondrian) ensure more fine-grained data privacy by providing a different anonymization strategy applied for each attribute. Many variations of multi-dimensional anonymization have been implemented on different distributed processing platforms (e.g., MapReduce, Spark) to take advantage of their scalability and parallelism supports. According to our critical analysis on overheads, either existing iteration-based or recursion-based approaches do not provide effective mechanisms for creating the optimal number of and relative size of resilient distributed datasets (RDDs), thus heavily suffer from performance overheads. To solve this issue, we propose a novel hybrid approach for effectively implementing a multi-dimensional data anonymization strategy (e.g., Mondrian) that is scalable and provides high-performance. Our hybrid approach provides a mechanism to create far fewer RDDs and smaller size partitions attached to each RDD than existing approaches. This optimal RDD creation and operations approach is critical for many multi-dimensional data anonymization applications that create tremendous execution complexity. The new mechanism in our proposed hybrid approach can dramatically reduce the critical overheads involved in re-computation cost, shuffle operations, message exchange, and cache management.

Download Full-text

Big Data Privacy for End to End Delivery

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1104.0882s819 ◽

2019 ◽

Vol 8 (2S8) ◽

pp. 1563-1566

Keyword(s):

Big Data ◽

Data Privacy ◽

Data Sets ◽

Management Tools ◽

Data Anonymization ◽

Big Data Applications ◽

Area Of Concern ◽

The Individual ◽

Traditional Processing ◽

Processing Techniques

Data privacy is an area of concern to process massive datasets in Big Data applications. Assortment of Big Data-sets is tough to be handled using, with the use of on-hand management tools or traditional processing techniques, the assortment of Big Data sets is difficult to be handled using Big Data has three characteristics i.e. V’s Volume, Varity, and Velocity . Privacy to such Big Data could be a massive snag which might be achieved by Anonymization technique. Datasets like financial data, Health Records and other confidential information of various organizations needs privacy to protect from the intruders and malicious entities. The aim of Big Data Anonymization is to shield the privacy of the individual and make it legal to share the information while not obtaining permission from people. The research paper discusses the basics of Big Data, technology behind it and various challenges

Download Full-text

Big Data Privacy Preservation Using Two Phase Top-Down Specialization Algorithm with Multidimensional Map Reduce Framework on Hadoop

International Journal of Distributed and Cloud Computing ◽

10.21863/ijdcc/2015.3.2.009 ◽

2015 ◽

Vol 3 (2) ◽

Author(s):

Shalin Eliabeth S. ◽

Sarju S.

Keyword(s):

Big Data ◽

Data Privacy ◽

Privacy Preservation ◽

Experimental Result ◽

Map Reduce ◽

Distributed Environment ◽

Top Down ◽

Two Phase ◽

Data Anonymization ◽

Big Data Privacy

Big data privacy preservation is one of the most disturbed issues in current industry. Sometimes the data privacy problems never identified when input data is published on cloud environment. Data privacy preservation in hadoop deals in hiding and publishing input dataset to the distributed environment. In this paper investigate the problem of big data anonymization for privacy preservation from the perspectives of scalability and time factor etc. At present, many cloud applications with big data anonymization faces the same kind of problems. For recovering this kind of problems, here introduced a data anonymization algorithm called Two Phase Top-Down Specialization (TPTDS) algorithm that is implemented in hadoop. For the data anonymization-45,222 records of adults information with 15 attribute values was taken as the input big data. With the help of multidimensional anonymization in map reduce framework, here implemented proposed Two-Phase Top-Down Specialization anonymization algorithm in hadoop and it will increases the efficiency on the big data processing system. By conducting experiment in both one dimensional and multidimensional map reduce framework with Two Phase Top-Down Specialization algorithm on hadoop, the better result shown in multidimensional anonymization on input adult dataset. Data sets is generalized in a top-down manner and the better result was shown in multidimensional map reduce framework by the better IGPL values generated by the algorithm. The anonymization was performed with specialization operation on taxonomy tree. The experiment shows that the solutions improves the IGPL values, anonymity parameter and decreases the execution time of big data privacy preservation by compared to the existing algorithm. This experimental result will leads to great application to the distributed environment.

Download Full-text

Computational storage: an efficient and scalable platform for big data and HPC applications

Journal Of Big Data ◽

10.1186/s40537-019-0265-5 ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 2

Author(s):

Mahdi Torabzadehkashi ◽

Siavash Rezaei ◽

Ali HeydariGorji ◽

Hosein Bobarshad ◽

Vladimir Alves ◽

...

Keyword(s):

Big Data ◽

High Performance ◽

Distributed Processing ◽

Data Access ◽

Distributed Applications ◽

Process Data ◽

Storage Devices ◽

Hadoop Mapreduce ◽

Big Data Applications ◽

Application Processor

AbstractIn the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data to the system. This cost has a direct relation with the distance of processing engines from the data. This is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move process closer to data. Computational storage devices (CSDs) push the “move process to data” paradigm to its ultimate boundaries by deploying embedded processing engines inside storage devices to process data. In this paper, we introduce Catalina, an efficient and flexible computational storage platform, that provides a seamless environment to process data in-place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for the applications. Thus, a vast spectrum of applications can be ported for running on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina CSD is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in-place without any modifications on the underlying distributed processing framework. For the proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks to investigate the benefits of deploying Catalina CSDs in the distributed processing environments. The experimental results show up to 2.2× improvement in performance and 4.3× reduction in energy consumption, respectively, for running Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved up to 5.4× and 8.9×, respectively.

Download Full-text

A high-performance computing method for data allocation in distributed database systems

The Journal of Supercomputing ◽

10.1007/s11227-006-0001-8 ◽

2007 ◽

Vol 39 (1) ◽

pp. 3-18 ◽

Cited By ~ 20

Author(s):

Ismail Omar Hababeh ◽

Muthu Ramachandran ◽

Nicholas Bowring

Keyword(s):

High Performance Computing ◽

High Performance ◽

Database Systems ◽

Distributed Database ◽

Data Allocation ◽

Distributed Database Systems ◽

Computing Method ◽

Performance Computing

Download Full-text

SparkDA: RDD-Based High-Performance Data Anonymization Technique for Spark Platform

Network and System Security - Lecture Notes in Computer Science ◽

10.1007/978-3-030-36938-5_40 ◽

2019 ◽

pp. 646-662

Author(s):

Sibghat Ullah Bazai ◽

Julian Jang-Jaccard

Keyword(s):

High Performance ◽

Performance Data ◽

Data Anonymization

Download Full-text

Federated Learning in Big Data Application and Sharing

Fuzzy Systems and Data Mining VI - Frontiers in Artificial Intelligence and Applications ◽

10.3233/faia200721 ◽

2020 ◽

Author(s):

Jing Yang ◽

Quan Zhang ◽

Kunpeng Liu ◽

Peng Jin ◽

Guoyi Zhao

Keyword(s):

Big Data ◽

Data Privacy ◽

The State ◽

Ideal Model ◽

Local Data ◽

Data Safety ◽

Big Data Applications ◽

Data Application ◽

Big Data Application ◽

Uniform Model

In recent years, electricity big data has extensive applications in the grid companies across the provinces. However, certain problems are encountered including, the inability to generate an ideal model using the isolated data possessed by each company, and the priority concerns for data privacy and safety during big data application and sharing. In this pursuit, the present research envisaged the application of federated learning to protect the local data, and to build a uniform model for different companies affiliated to the State Grid. Federated learning can serve as an essential means for realizing the grid-wide promotion of the achievements of big data applications, while ensuring the data safety.

Download Full-text

Ship space to database: emerging infrastructures for studies of the deep subseafloor biosphere

PeerJ Computer Science ◽

10.7717/peerj-cs.97 ◽

2016 ◽

Vol 2 ◽

pp. e97 ◽

Cited By ~ 4

Author(s):

Peter T. Darch ◽

Christine L. Borgman

Keyword(s):

Production Management ◽

Public Funding ◽

Epistemic Status ◽

Data Allocation ◽

Ocean Drilling Program ◽

Data Scarcity ◽

Integrated Ocean Drilling Program ◽

Use Of Data ◽

International Ocean Discovery Program ◽

Deep Subseafloor

BackgroundAn increasing array of scientific fields face a “data deluge.” However, in many fields data are scarce, with implications for their epistemic status and ability to command funding. Consequently, they often attempt to develop infrastructure for data production, management, curation, and circulation. A component of a knowledge infrastructure may serve one or more scientific domains. Further, a single domain may rely upon multiple infrastructures simultaneously. Studying how domains negotiate building and accessing scarce infrastructural resources that they share with other domains will shed light on how knowledge infrastructures shape science.MethodsWe conducted an eighteen-month, qualitative study of scientists studying the deep subseafloor biosphere, focusing on the Center for Dark Energy Biosphere Investigations (C-DEBI) and the Integrated Ocean Drilling Program (IODP) and its successor, the International Ocean Discovery Program (IODP2). Our methods comprised ethnographic observation, including eight months embedded in a laboratory, interviews (n = 49), and document analysis.ResultsDeep subseafloor biosphere research is an emergent domain. We identified two reasons for the domain’s concern with data scarcity: limited ability to pursue their research objectives, and the epistemic status of their research. Domain researchers adopted complementary strategies to acquire more data. One was to establish C-DEBI as an infrastructure solely for their domain. The second was to use C-DEBI as a means to gain greater access to, and reconfigure, IODP/IODP2 to their advantage. IODP/IODP2 functions as infrastructure for multiple scientific domains, which creates competition for resources. C-DEBI is building its own data management infrastructure, both to acquire more data from IODP and to make better use of data, once acquired.DiscussionTwo themes emerge. One is data scarcity, which can be understood only in relation to a domain’s objectives. To justify support for public funding, domains must demonstrate their utility to questions of societal concern or existential questions about humanity. The deep subseafloor biosphere domain aspires to address these questions in a more statistically intensive manner than is afforded by the data to which it currently has access. The second theme is the politics of knowledge infrastructures. A single scientific domain may build infrastructure for itself and negotiate access to multi-domain infrastructure simultaneously. C-DEBI infrastructure was designed both as a response to scarce IODP/IODP2 resources, and to configure the data allocation processes of IODP/IODP2 in their favor.

Download Full-text

Task-based programming in COMPSs to converge from HPC to big data

The International Journal of High Performance Computing Applications ◽

10.1177/1094342017701278 ◽

2017 ◽

Vol 32 (1) ◽

pp. 45-60 ◽

Cited By ~ 11

Author(s):

Javier Conejero ◽

Sandra Corella ◽

Rosa M Badia ◽

Jesus Labarta

Keyword(s):

Big Data ◽

High Performance ◽

Programming Model ◽

Good Alternative ◽

Programming Models ◽

Suitable Model ◽

Advantages And Disadvantages ◽

Big Data Applications ◽

And Performance ◽

The Right

Task-based programming has proven to be a suitable model for high-performance computing (HPC) applications. Different implementations have been good demonstrators of this fact and have promoted the acceptance of task-based programming in the OpenMP standard. Furthermore, in recent years, Apache Spark has gained wide popularity in business and research environments as a programming model for addressing emerging big data problems. COMP Superscalar (COMPSs) is a task-based environment that tackles distributed computing (including Clouds) and is a good alternative for a task-based programming model for big data applications. This article describes why we consider that task-based programming models are a good approach for big data applications. The article includes a comparison of Spark and COMPSs in terms of architecture, programming model, and performance. It focuses on the differences that both frameworks have in structural terms, on their programmability interface, and in terms of their efficiency by means of three widely known benchmarking kernels: Wordcount, Kmeans, and Terasort. These kernels enable the evaluation of the more important functionalities of both programming models and analyze different work flows and conditions. The main results achieved from this comparison are (1) COMPSs is able to extract the inherent parallelism from the user code with minimal coding effort as opposed to Spark, which requires the existing algorithms to be adapted and rewritten by explicitly using their predefined functions, (2) it is an improvement in terms of performance when compared with Spark, and (3) COMPSs has shown to scale better than Spark in most cases. Finally, we discuss the advantages and disadvantages of both frameworks, highlighting the differences that make them unique, thereby helping to choose the right framework for each particular objective.

Download Full-text

Architecture for Big Data Storage in Different Cloud Deployment Models

Research Anthology on Architectures, Frameworks, and Integration Strategies for Distributed and Cloud Computing ◽

10.4018/978-1-7998-5339-8.ch009 ◽

2021 ◽

pp. 178-208

Author(s):

Chandu Thota ◽

Gunasekaran Manogaran ◽

Daphne Lopez ◽

Revathi Sundarasekar

Keyword(s):

Cloud Computing ◽

Big Data ◽

Data Storage ◽

High Performance ◽

Data Services ◽

Big Data Applications ◽

Nosql Database ◽

Amazon Web Services ◽

Product Domains ◽

Scalable Database

Cloud Computing is a new computing model that distributes the computation on a resource pool. The need for a scalable database capable of expanding to accommodate growth has increased with the growing data in web world. More familiar Cloud Computing vendors such as Amazon Web Services, Microsoft, Google, IBM and Rackspace offer cloud based Hadoop and NoSQL database platforms to process Big Data applications. Variety of services are available that run on top of cloud platforms freeing users from the need to deploy their own systems. Nowadays, integrating Big Data and various cloud deployment models is major concern for Internet companies especially software and data services vendors that are just getting started themselves. This chapter proposes an efficient architecture for integration with comprehensive capabilities including real time and bulk data movement, bi-directional replication, metadata management, high performance transformation, data services and data quality for customer and product domains.

Download Full-text