A Novel Hybrid Approach for Multi-Dimensional Data Anonymization for Apache Spark

2022 ◽  
Vol 25 (1) ◽  
pp. 1-25
Author(s):  
Sibghat Ullah Bazai ◽  
Julian Jang-Jaccard ◽  
Hooman Alavizadeh

Multi-dimensional data anonymization approaches (e.g., Mondrian) ensure more fine-grained data privacy by applying a different anonymization strategy to each attribute. Many variations of multi-dimensional anonymization have been implemented on distributed processing platforms (e.g., MapReduce, Spark) to take advantage of their scalability and parallelism support. Our critical analysis of overheads shows that neither existing iteration-based nor recursion-based approaches provide an effective mechanism for creating an optimal number of resilient distributed datasets (RDDs) of appropriate relative size, and thus both suffer heavily from performance overheads. To solve this issue, we propose a novel hybrid approach for effectively implementing a multi-dimensional data anonymization strategy (e.g., Mondrian) that is scalable and delivers high performance. Our hybrid approach creates far fewer RDDs, each with smaller partitions, than existing approaches. Such optimal RDD creation and operation is critical for multi-dimensional anonymization applications, which otherwise incur tremendous execution complexity. The new mechanism dramatically reduces the critical overheads of re-computation, shuffle operations, message exchange, and cache management.
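
To make the partitioning strategy concrete, below is a minimal single-machine sketch of the Mondrian median-split recursion that such distributed implementations parallelize. It is illustrative only; the dataset, `k`, and function names are assumptions, not the authors' hybrid Spark implementation.

```python
# Illustrative Mondrian-style multi-dimensional partitioning (sketch):
# records are recursively split at the median of the widest attribute
# until any further split would violate k-anonymity.

def mondrian(records, qid_indices, k):
    """Return a list of partitions, each holding at least k records."""
    # Pick the quasi-identifier with the widest value range.
    widths = {i: max(r[i] for r in records) - min(r[i] for r in records)
              for i in qid_indices}
    dim = max(widths, key=widths.get)

    values = sorted(r[dim] for r in records)
    median = values[len(values) // 2]
    left = [r for r in records if r[dim] < median]
    right = [r for r in records if r[dim] >= median]

    # Stop when the cut would leave a partition smaller than k.
    if widths[dim] == 0 or len(left) < k or len(right) < k:
        return [records]
    return mondrian(left, qid_indices, k) + mondrian(right, qid_indices, k)

if __name__ == "__main__":
    # Toy records: (age, income) pairs used as quasi-identifiers.
    data = [(25, 50000), (30, 60000), (35, 52000), (40, 80000),
            (45, 75000), (50, 90000), (55, 61000), (60, 72000)]
    for part in mondrian(data, qid_indices=[0, 1], k=2):
        print(part)
```

In a distributed setting, each recursion level spawns new partitions, which is precisely why the number and size of the RDDs created matter so much for performance.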

Electronics ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 1732
Author(s):  
Sibghat Ullah Bazai ◽  
Julian Jang-Jaccard

Recent studies in data anonymization techniques have primarily focused on MapReduce. However, these existing MapReduce-based approaches often suffer from performance overheads due to inappropriate data allocation, expensive disk I/O and network transfer, and a lack of support for iterative tasks. We propose “SparkDA”, a novel anonymization technique designed to take full advantage of the Spark platform to generate privacy-preserving anonymized datasets as efficiently as possible. Our proposal offers better partition control, in-memory operation, and cache management for the iterative operations that are heavily utilised in data anonymization processing. It is based on Spark’s Resilient Distributed Dataset (RDD) and two critical RDD operations, FlatMapRDD and ReduceByKeyRDD. The experimental results demonstrate that our proposal outperforms existing approaches in performance and scalability while maintaining high levels of data privacy and utility. This illustrates that our proposal can serve the wide range of big data applications that demand privacy.
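
FlatMapRDD and ReduceByKeyRDD correspond to Spark's flatMap and reduceByKey transformations. The following PySpark fragment is a minimal, illustrative sketch of that pattern for k-anonymity-style grouping; the data, the generalization rule, and the threshold `k` are assumptions, not the SparkDA code.

```python
# Sketch of the flatMap/reduceByKey pattern for anonymization grouping:
# each record is emitted under its generalized quasi-identifier key,
# group sizes are aggregated, and under-populated groups are suppressed.
from pyspark import SparkContext

sc = SparkContext(appName="anonymization-sketch")
k = 3  # assumed anonymity threshold

records = sc.parallelize([
    (23, "engineer"), (27, "engineer"), (29, "teacher"),
    (31, "teacher"), (36, "nurse"), (38, "nurse"), (41, "nurse"),
])

def generalize(record):
    """Map a record to its generalized equivalence class, e.g. age 23 -> 20-29."""
    age, job = record
    low = age // 10 * 10
    yield (("%d-%d" % (low, low + 9), job), 1)

group_sizes = records.flatMap(generalize).reduceByKey(lambda a, b: a + b)

# Keep only equivalence classes with at least k members.
safe_groups = group_sizes.filter(lambda kv: kv[1] >= k)
print(safe_groups.collect())
sc.stop()
```

Because both transformations can stay in memory across iterations, this pattern avoids the repeated disk I/O that burdens the MapReduce-based approaches discussed above.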


Author(s):  
Shalin Eliabeth S. ◽  
Sarju S.

Big data privacy preservation is one of the most pressing issues in industry today. Privacy problems often go unidentified when input data are published in a cloud environment. Privacy preservation in Hadoop deals with hiding sensitive values in an input dataset before publishing it to the distributed environment. This paper investigates the problem of big data anonymization for privacy preservation from the perspectives of scalability and execution time. At present, many cloud applications that anonymize big data face the same kinds of problems. To address them, we introduce a data anonymization algorithm, Two-Phase Top-Down Specialization (TPTDS), implemented on Hadoop. As the input big data, the Adult dataset of 45,222 records of personal information with 15 attributes was used. We implemented the proposed TPTDS algorithm with multidimensional anonymization in the MapReduce framework on Hadoop, which increases the efficiency of the big data processing system. Datasets are generalized in a top-down manner, and anonymization is performed by specialization operations on a taxonomy tree. Experiments running TPTDS in both one-dimensional and multidimensional MapReduce frameworks on Hadoop showed better results for multidimensional anonymization on the Adult dataset, as measured by the IGPL values generated by the algorithm. The experiments show that the solution improves the IGPL values and the anonymity parameter and decreases the execution time of big data privacy preservation compared to the existing algorithm. These results suggest broad applicability in distributed environments.
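
The IGPL (Information Gain per Privacy Loss) score that drives top-down specialization can be stated compactly: a specialization that replaces a general taxonomy value with its children is scored as IG / (PL + 1), where IG is the information gain of the cut and PL is the drop in anonymity it causes. The sketch below is illustrative only; the class counts and names are invented, and this is not the paper's Hadoop implementation.

```python
# Illustrative IGPL scoring for one candidate specialization (sketch):
# IGPL = InfoGain / (PrivacyLoss + 1).
import math

def entropy(class_counts):
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c)

def igpl(parent_counts, child_counts, anonymity_before, anonymity_after):
    """parent_counts: class distribution at the general taxonomy value;
    child_counts: one class distribution per child value."""
    total = sum(parent_counts)
    info_gain = entropy(parent_counts) - sum(
        sum(c) / total * entropy(c) for c in child_counts)
    privacy_loss = anonymity_before - anonymity_after
    return info_gain / (privacy_loss + 1)

# Example: specializing "Education: Any" into {"Higher", "Secondary"}.
score = igpl(parent_counts=[40, 20],
             child_counts=[[30, 5], [10, 15]],
             anonymity_before=10, anonymity_after=6)
print(round(score, 3))
```

At each round, the specialization with the highest IGPL is performed, so utility is bought at the lowest possible privacy cost.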


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Mahdi Torabzadehkashi ◽  
Siavash Rezaei ◽  
Ali HeydariGorji ◽  
Hosein Bobarshad ◽  
Vladimir Alves ◽  
...  

In the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data originally reside in storage systems; to process them, application servers must fetch them from storage devices, which imposes a data-movement cost on the system. This cost is directly related to the distance between the processing engines and the data, which is the key motivation for distributed processing platforms such as Hadoop that move processing closer to the data. Computational storage devices (CSDs) push the “move process to data” paradigm to its ultimate boundary by deploying embedded processing engines inside storage devices. In this paper, we introduce Catalina, an efficient and flexible computational storage platform that provides a seamless environment for processing data in-place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for applications, so a vast spectrum of applications can be ported to run on Catalina CSDs. Thanks to these unique features, to the best of our knowledge, Catalina is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in-place, without any modification to the underlying distributed processing framework. As a proof of concept, we built a fully functional Catalina prototype and a platform equipped with 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks and investigate the benefits of deploying Catalina CSDs in distributed processing environments. The experimental results show up to a 2.2× improvement in performance and a 4.3× reduction in energy consumption for Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved by up to 5.4× and 8.9×, respectively.


2012 ◽  
Vol 15 (08) ◽  
pp. 1150025 ◽  
Author(s):  
N. LEMMENS ◽  
K. TUYLS

In this paper we present three Swarm Intelligence algorithms, which we evaluate on the complex foraging task domain. Each algorithm draws inspiration from biological bee foraging and nest-site selection behavior. The main focus is on the third algorithm, STIGMERGIC LANDMARK FORAGING, a novel hybrid approach that combines the high performance of bee-inspired navigation with ant-inspired recruitment. More precisely, navigation is based on Path Integration, which yields vectors indicating the distance and direction to a destination. Recruitment occurs only at key locations (i.e., landmarks) inside the environment. Each landmark contains a collection of vectors with which visiting agents can find their way to a certain goal or to another landmark in an unknown environment; each vector represents a local segment of a global route. In contrast to ant-inspired recruitment, no attracting or repelling pheromone is used to indicate where to go or how worthwhile a route is compared to other routes. Instead, each vector in a landmark has a strength indicating how worthwhile it is. In analogy to ant-inspired recruitment, vector strength is reinforced by visiting agents and decays over time. In the end, this results in optimal routes to destinations. STIGMERGIC LANDMARK FORAGING proves to be very efficient in terms of building and adapting solutions.
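
The vector-strength dynamics can be sketched in a few lines. The following is illustrative only; the reinforcement reward and decay rate are assumed parameters, not values from the paper.

```python
# Sketch of landmark vector-strength dynamics: visiting agents
# reinforce the segment they used, all strengths decay each step,
# and the best local segment eventually dominates.

class LandmarkVector:
    def __init__(self, dx, dy, strength=1.0):
        self.dx, self.dy = dx, dy   # distance/direction of a route segment
        self.strength = strength

    def reinforce(self, reward=0.5):
        """Called when a visiting agent successfully used this segment."""
        self.strength += reward

    def decay(self, rate=0.05):
        """Applied every time step; unused segments fade away."""
        self.strength *= (1.0 - rate)

class Landmark:
    def __init__(self, vectors):
        self.vectors = vectors      # local segments of global routes

    def best_vector(self):
        return max(self.vectors, key=lambda v: v.strength)

# Two competing segments toward the same goal: the one agents keep
# using is reinforced and wins; the other decays away.
lm = Landmark([LandmarkVector(3, 4), LandmarkVector(6, 8)])
for step in range(50):
    lm.vectors[0].reinforce()       # agents keep choosing segment 0
    for v in lm.vectors:
        v.decay()
best = lm.best_vector()
print(best.dx, best.dy, round(best.strength, 2))
```

Note that, unlike pheromone trails, reinforcement and decay act on discrete vectors stored at landmarks rather than on a continuous field over the environment.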


2017 ◽  
Vol 17 (1) ◽  
pp. 27-34 ◽  
Author(s):  
Jan Valtera ◽  
Petr Žabka ◽  
Jaroslav Beran

The paper deals with the improvement of the central traversing system on rotor spinning machines, where rectilinear motion with a variable stroke is used. A new traversing-rod system with an implemented set of magnetic-mechanical energy accumulators is described. A mathematical model of this system is analysed in MSC Adams/View software and verified by experimental measurement on a full-length test rig. The analysis results prove the enhancement of the devised traversing system, in which the overall dynamic force is reduced considerably while the precision of the traversing movement over the machine length is increased. This makes it possible to increase the machine operating speed while satisfying both the maximal tensile strength of the traversing rod and output bobbin size standards. The developed mathematical model is shown to be usable for determining the optimal number and distribution of accumulators over a traversing rod of arbitrary parameters. The potential of the devised system for high-performance rotor spinning machines with a longer traversing rod is also discussed.


Author(s):  
Bhalchandra S. Pujari ◽  
Snehal Shekatkar

The ongoing pandemic of the 2019-nCoV (COVID-19) coronavirus has made reliable epidemiological modeling an urgent necessity. Unfortunately, most existing models are either too fine-grained to be efficient or too coarse-grained to be reliable. Here we propose a computationally efficient hybrid approach that uses the SIR model for individual cities, which are in turn coupled via empirical transportation networks that facilitate migration among them. The treatment presented here differs from existing models in two crucial ways: first, the coupling parameters are determined self-consistently so as to maintain the populations of individual cities, and second, distance-dependent temporal delays in migration are incorporated. We apply our model to the Indian aviation and railway networks, taking into account the populations of more than 300 cities. Our results project that, through domestic transportation alone, a significant fraction of the population is poised to be exposed within 90 days of the onset of the epidemic. Thus, serious supervision of domestic transport networks is warranted even after restricting international migration.
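
The structure of such a model is easy to sketch. The fragment below couples per-city SIR dynamics through a travel matrix with distance-dependent delays; the rates, populations, and travel volumes are invented for illustration and are not the paper's calibrated values.

```python
# Sketch of a metapopulation SIR model: each city runs local SIR
# dynamics, and infected travellers arrive from other cities after a
# distance-dependent delay, with flows balanced to keep populations fixed.
import numpy as np

n_cities = 3
beta, gamma = 0.3, 0.1                      # assumed infection/recovery rates
pop = np.array([1e6, 5e5, 2e5])
travel = np.array([[0, 500, 100],           # travellers/day from city j to i
                   [500, 0, 50],
                   [100, 50, 0]], float)
delay = np.array([[0, 2, 5],                # travel delay in days
                  [2, 0, 3],
                  [5, 3, 0]], int)

T = 120
S = np.tile(pop, (T + 1, 1))
I = np.zeros((T + 1, n_cities))
R = np.zeros_like(I)
S[0, 0] -= 100; I[0, 0] = 100               # seed the outbreak in city 0

for t in range(T):
    # Local SIR update for every city.
    new_inf = beta * S[t] * I[t] / pop
    new_rec = gamma * I[t]
    S[t + 1] = S[t] - new_inf
    I[t + 1] = I[t] + new_inf - new_rec
    R[t + 1] = R[t] + new_rec
    # Delayed import of infected travellers between city pairs.
    for i in range(n_cities):
        for j in range(n_cities):
            t_dep = t - delay[i, j]
            if i != j and t_dep >= 0:
                imported = travel[i, j] * I[t_dep, j] / pop[j]
                I[t + 1, i] += imported
                I[t + 1, j] -= imported     # keep city populations constant

print("Infected at day %d: %s" % (T, np.round(I[T], 0)))
```

The balanced in/out flows mirror the self-consistency requirement above: migration redistributes infections without changing any city's total population.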


2018 ◽  
Author(s):  
Jérémie Decouchant ◽  
Maria Fernandes ◽  
Marcus Völp ◽  
Francisco M Couto ◽  
Paulo Esteves-Veríssimo

Sequencing thousands of human genomes has enabled breakthroughs in many areas, among them precision medicine, the study of rare diseases, and forensics. However, mass collection of such sensitive data entails enormous risks if it is not protected to the highest standards. In this article, we argue that post-alignment privacy is not enough and that data should be protected automatically, as early as possible in the genomics workflow, ideally immediately after they are produced. We show that a previous approach for filtering short reads cannot extend to long reads, and we present a novel filtering approach that classifies raw genomic data (i.e., data whose location and content are not yet determined) into privacy-sensitive information (i.e., more affected by a successful privacy attack) and non-privacy-sensitive information. Such a classification allows fine-grained, automated adjustment of protective measures to mitigate the possible consequences of exposure, in particular when relying on public clouds. We present the first filter that can be applied indiscriminately to reads of any length, making it usable with any recent or future sequencing technology. The filter is accurate, in the sense that it detects all known sensitive nucleotides except those located in highly variable regions (fewer than 10 nucleotides per genome remain undetected, compared with 100,000 in previous works). It has far fewer false positives than previously known methods (10% instead of 60%) and can detect sensitive nucleotides despite sequencing errors (86% detected instead of 56%, with 2% mutations). Finally, practical experiments demonstrate high performance in terms of both throughput and memory consumption.
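
As a rough illustration of filtering raw reads before alignment, the sketch below scans each read for k-mers that overlap a dictionary of known sensitive variants and routes matching reads to a protected store. The k-mer length, the dictionary, and the routing logic are assumptions for illustration, not the authors' filter.

```python
# Sketch of length-agnostic read filtering: a read is treated as
# privacy-sensitive if any of its k-mers matches a dictionary of
# k-mers spanning known sensitive variants.
K = 8  # assumed k-mer length

# Hypothetical dictionary of sensitive k-mers (in practice, built from
# catalogues of known variants).
sensitive_kmers = {"ACGTACGT", "TTGACCAA", "GGCATTAG"}

def is_privacy_sensitive(read):
    """True if any k-mer of the read matches a sensitive k-mer."""
    return any(read[i:i + K] in sensitive_kmers
               for i in range(len(read) - K + 1))

def route(reads):
    """Split reads into a protected stream and a public stream."""
    protected, public = [], []
    for read in reads:
        (protected if is_privacy_sensitive(read) else public).append(read)
    return protected, public

reads = ["GGGACGTACGTTT", "CCCCCCCCCCCC", "AATTGACCAAGG"]
protected, public = route(reads)
print(len(protected), "reads sent to the protected store")
```

Because the scan slides over the read rather than assuming a fixed read length, the same mechanism applies to short and long reads alike, which is the property the abstract emphasizes.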

