The State of the Art and Open Problems in Data Replication in Grid Environments

Data Grids provide services and infrastructure for distributed data-intensive applications that need to access, transfer and modify massive datasets stored at distributed locations around the world. For example, the next generation of scientific applications, such as many in high-energy physics, molecular modeling, and earth sciences, will involve large collections of data created from simulations or experiments. The size of these data collections is expected to be of multi-terabyte or even petabyte scale in many applications. Ensuring efficient, reliable, secure and fast access to such large datasets is hindered by the high latencies of the Internet. The need to manage and access multiple petabytes of data in Grid environments, as well as to ensure data availability and optimize access, poses challenges that must be addressed. To improve data access efficiency, data can be replicated at multiple locations so that a user can access the data from a site near where it will be processed. In addition to reducing data access time, replication in Data Grids also uses network and storage resources more efficiently. In this chapter, the state of current research on data replication and the challenges arising for the new generation of data-intensive grid environments are reviewed, and open problems are identified. First, fundamental data replication strategies are reviewed which offer high data availability, low bandwidth consumption, increased fault tolerance, and improved scalability of the overall system. Then, specific algorithms for selecting appropriate replicas and maintaining replica consistency are discussed. The impact of data replication on job scheduling performance in Data Grids is also analyzed. A set of appropriate metrics including access latency, bandwidth savings, server load, and storage overhead for use in making critical comparisons of various data replication techniques is also discussed.
Overall, this chapter provides a comprehensive study of replication techniques in Data Grids that not only serves as a tool for understanding this evolving research area but also provides a reference to which future efforts may be mapped.


Efficient Dynamic Replication Algorithm Using Agent for Data Grid

The Scientific World JOURNAL ◽

10.1155/2014/767016 ◽

2014 ◽

Vol 2014 ◽

pp. 1-10 ◽

Cited By ~ 7

Author(s):

Priyanka Vashisht ◽

Rajesh Kumar ◽

Anju Sharma

Keyword(s):

Data Replication ◽

Data Access ◽

Data Grid ◽

Data Availability ◽

Access Time ◽

Test Bed ◽

Data Grids ◽

Dynamic Replication ◽

Data Files ◽

Using Data

In data grids, scientific and business applications produce huge volumes of data that need to be transferred among the distributed and heterogeneous nodes of the grid. Data replication provides a solution for managing data files efficiently in large grids. Replication enhances data availability, which reduces the overall access time of a file. In this paper, an agent-based algorithm for data grids, namely EDRA, is proposed and implemented. EDRA performs dynamic replication over a hierarchical structure, which is taken into account when selecting the best replica. The decision for selecting the best replica is based on scheduling parameters: bandwidth, load gauge, and computing capacity of the node. Scheduling in the data grid helps reduce the data access time, and considering the scheduling parameters distributes the load evenly across the nodes. EDRA is implemented using the data grid simulator OptorSim, with the European Data Grid CMS test bed topology. The simulation compares BHR, LRU, No Replication, and EDRA. The results show the efficiency of the EDRA algorithm in terms of mean job execution time, network usage, and storage usage of the nodes.
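
The replica selection described in this abstract can be illustrated with a minimal sketch. It is not the EDRA algorithm itself: the abstract names only the three scheduling parameters (bandwidth, load gauge, computing capacity), so the weighted-sum scoring, the default weights, and the names `ReplicaSite`, `site_score`, and `select_best_replica` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ReplicaSite:
    """A node holding a replica (hypothetical model, not from the paper)."""
    name: str
    bandwidth: float   # available bandwidth to the requester, e.g. Mbit/s
    load: float        # load gauge in [0, 1]; 1.0 means fully loaded
    capacity: float    # computing capacity, e.g. MIPS


def site_score(site: ReplicaSite, max_bw: float, max_cap: float,
               weights: Tuple[float, float, float]) -> float:
    """Weighted sum of normalized parameters; higher is better (assumed form)."""
    w_bw, w_load, w_cap = weights
    return (w_bw * site.bandwidth / max_bw
            + w_load * (1.0 - site.load)
            + w_cap * site.capacity / max_cap)


def select_best_replica(sites: Sequence[ReplicaSite],
                        weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)
                        ) -> ReplicaSite:
    """Return the site with the highest score; the weights are illustrative."""
    if not sites:
        raise ValueError("no replica sites available")
    # Normalize by the largest observed value; fall back to 1.0 to avoid
    # dividing by zero when every site reports 0.
    max_bw = max(s.bandwidth for s in sites) or 1.0
    max_cap = max(s.capacity for s in sites) or 1.0
    return max(sites, key=lambda s: site_score(s, max_bw, max_cap, weights))


sites = [
    ReplicaSite("A", bandwidth=100.0, load=0.9, capacity=1000.0),
    ReplicaSite("B", bandwidth=80.0, load=0.2, capacity=800.0),
    ReplicaSite("C", bandwidth=10.0, load=0.1, capacity=2000.0),
]
best = select_best_replica(sites)
print(best.name)  # B: good bandwidth and a low load outweigh A's raw bandwidth
```

With the default weights, site B wins because its moderate bandwidth is combined with a low load. Passing `weights=(1.0, 0.0, 0.0)` reduces the sketch to a pure bandwidth-based choice and selects site A.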


Improve the Performance of Data Grids by Cost-Based Job Scheduling Strategy

Computer Engineering and Applications Journal ◽

10.18495/comengapp.v3i2.52 ◽

2014 ◽

Vol 3 (2) ◽

pp. 100-111

Author(s):

Najme Mansouri

Keyword(s):

Job Scheduling ◽

The Other ◽

Data Grids ◽

Geographic Dispersion ◽

Data Intensive ◽

A Value ◽

Grid Environments ◽

Tremendous Importance ◽

Application Requirements ◽

Grid Resources

Grid environments have gained tremendous importance in recent years as application requirements have increased drastically. The heterogeneity and geographic dispersion of grid resources and applications pose complex problems such as job scheduling. Most existing scheduling strategies in Grids focus on only one kind of Grid job, either data-intensive or computation-intensive. However, considering only one kind of job does not yield suitable scheduling from the viewpoint of the whole system, and sometimes wastes resources on the other side. To address the challenge of considering both kinds of jobs simultaneously, a new Cost-Based Job Scheduling (CJS) strategy is proposed in this paper. On the one hand, the CJS algorithm considers both data and computational resource availability of the network; on the other hand, considering the corresponding requirements of each job, it assigns a value called W to the job. Using the W value, the importance of the two aspects (being data-intensive or computation-intensive) is determined for each job, and then the job is assigned to the available resources. The simulation results with OptorSim show that CJS outperforms the existing algorithms mentioned in the literature as the number of jobs increases.
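
The idea of weighting data cost against computation cost per job can be sketched as follows. The abstract does not define how W is computed or how it enters the cost, so this sketch assumes W is the share of a job's estimated time spent on data transfer, and that a site's cost is `W * transfer_time + (1 - W) * compute_time`. The names `Job`, `Site`, `job_weight`, `site_cost`, and `schedule` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Job:
    input_mb: float    # size of the input data the job needs
    work_units: float  # amount of computation, in arbitrary work units


@dataclass(frozen=True)
class Site:
    name: str
    local_mb: float       # part of the job's input already stored at the site
    bandwidth: float      # MB per second for fetching missing data
    speed: float          # work units per second
    queue_seconds: float  # waiting time before the job can start


def job_weight(job: Job, ref_bandwidth: float, ref_speed: float) -> float:
    """Assumed definition of W: fraction of estimated time spent on data."""
    data_time = job.input_mb / ref_bandwidth
    compute_time = job.work_units / ref_speed
    total = data_time + compute_time
    return data_time / total if total > 0 else 0.5


def site_cost(job: Job, site: Site, w: float) -> float:
    """Blend transfer and compute time using W (assumed cost form)."""
    missing_mb = max(job.input_mb - site.local_mb, 0.0)
    transfer_time = missing_mb / site.bandwidth
    compute_time = site.queue_seconds + job.work_units / site.speed
    return w * transfer_time + (1.0 - w) * compute_time


def schedule(job: Job, sites: Sequence[Site]) -> Site:
    """Assign the job to the site with the lowest blended cost."""
    if not sites:
        raise ValueError("no sites available")
    # Reference values for W are the grid-wide averages (an assumption).
    ref_bandwidth = sum(s.bandwidth for s in sites) / len(sites)
    ref_speed = sum(s.speed for s in sites) / len(sites)
    w = job_weight(job, ref_bandwidth, ref_speed)
    return min(sites, key=lambda s: site_cost(job, s, w))


grid = [
    Site("storage-rich", local_mb=10000.0, bandwidth=10.0, speed=10.0,
         queue_seconds=0.0),
    Site("compute-rich", local_mb=0.0, bandwidth=10.0, speed=100.0,
         queue_seconds=0.0),
]
data_job = Job(input_mb=10000.0, work_units=100.0)
compute_job = Job(input_mb=10.0, work_units=100000.0)
print(schedule(data_job, grid).name)     # storage-rich
print(schedule(compute_job, grid).name)  # compute-rich
```

In this example the data-intensive job is sent to the site that already holds its input, while the computation-intensive job is sent to the faster site, which is the behavior the abstract attributes to considering both job kinds at once.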


A Two-Level Fuzzy Value-Based Replica Replacement Algorithm in Data Grids

International Journal of Grid and High Performance Computing ◽

10.4018/ijghpc.2016100105 ◽

2016 ◽

Vol 8 (4) ◽

pp. 78-99 ◽

Cited By ~ 3

Author(s):

Nazanin Saadat ◽

Amir Masoud Rahmani

Keyword(s):

Data Grid ◽

Data Availability ◽

Distributed Data ◽

Similar Data ◽

Data Grids ◽

Replacement Algorithm ◽

Minimum Latency ◽

Network Usage ◽

Effective Network ◽

And Storage

One of the challenges of a data grid is to access widely distributed data quickly and efficiently while providing maximum data availability with minimum latency. Data replication is an efficient way to address this challenge: by creating and storing replicas, it makes similar data accessible at different locations of the data grid and can shorten the time needed to get the files. However, as the number and storage size of grid sites are limited, an optimized and effective replacement algorithm is needed to improve the efficiency of replication. In this paper, the authors propose a novel two-level replacement algorithm which uses a Fuzzy Replica Preserving Value Evaluator System (FRPVES) to evaluate the value of each replica. The algorithm was tested using a grid simulator, OptorSim, developed by the European Data Grid project. Simulation results show that the authors' proposed algorithm performs better than other algorithms in terms of job execution time, total number of replications and effective network usage.
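
Value-based replacement in general can be sketched as below. This is a plain greedy eviction, not the two-level fuzzy algorithm of the paper: the abstract does not describe the inputs or rules of FRPVES, so replica values are simply passed in as numbers, and the function name `choose_victims` and its parameters are hypothetical.

```python
from typing import Dict, List, Optional


def choose_victims(stored: Dict[str, float], values: Dict[str, float],
                   capacity: float, new_size: float,
                   new_value: float) -> Optional[List[str]]:
    """Pick replicas to evict so that a new replica of `new_size` fits.

    `stored` maps replica name to size and `values` maps replica name to a
    preserving value (in the paper this would come from a fuzzy evaluator;
    here it is an opaque number). Returns the names to evict, or None when
    making room would require evicting a replica that is at least as
    valuable as the new one.
    """
    free = capacity - sum(stored.values())
    victims: List[str] = []
    # Consider the least valuable replicas first.
    for name in sorted(stored, key=lambda n: values[n]):
        if free >= new_size:
            break
        if values[name] >= new_value:
            return None  # not worth replacing anything more valuable
        victims.append(name)
        free += stored[name]
    return victims if free >= new_size else None


stored = {"a": 40.0, "b": 30.0, "c": 30.0}
values = {"a": 0.9, "b": 0.2, "c": 0.5}
print(choose_victims(stored, values, capacity=100.0,
                     new_size=50.0, new_value=0.6))  # ['b', 'c']
print(choose_victims(stored, values, capacity=100.0,
                     new_size=50.0, new_value=0.4))  # None
```

The second call returns `None` because freeing enough space would require evicting replica `c`, whose value exceeds that of the incoming file, so the sketch skips replication in that case.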


Combination of data replication and scheduling algorithm for improving data availability in Data Grids

Journal of Network and Computer Applications ◽

10.1016/j.jnca.2012.12.021 ◽

2013 ◽

Vol 36 (2) ◽

pp. 711-722 ◽

Cited By ~ 38

Author(s):

Najme Mansouri ◽

Gholam Hosein Dastghaibyfard ◽

Ehsan Mansouri

Keyword(s):

Scheduling Algorithm ◽

Data Replication ◽

Data Availability ◽

Data Grids


Minimizing data access latency in data grids by neighborhood-based data replication and job scheduling

International Journal of Communication Systems ◽

10.1002/dac.4552 ◽

2020 ◽

pp. e4552

Author(s):

Mahsa Beigrezaei ◽

Abolfazl Toroghi Haghighat ◽

Seyedeh Leili Mirtaheri

Keyword(s):

Job Scheduling ◽

Data Replication ◽

Data Access ◽

Data Grids ◽

Access Latency ◽

Data Access Latency


Energy efficient data access and storage through HW/SW co-design

ACM SIGPLAN Notices ◽

10.1145/2666357.2602569 ◽

2014 ◽

Vol 49 (5) ◽

pp. 83-83 ◽

Cited By ~ 2

Author(s):

Minyi Guo

Keyword(s):

Energy Efficient ◽

Data Access ◽

Efficient Data ◽

And Storage


A Lightweight Blockchain-Based IoT Identity Management Approach

Future Internet ◽

10.3390/fi13020024 ◽

2021 ◽

Vol 13 (2) ◽

pp. 24

Author(s):

Mohammed Amine Bouras ◽

Qinghua Lu ◽

Sahraoui Dhelim ◽

Huansheng Ning

Keyword(s):

Identity Management ◽

Single Point ◽

Data Access ◽

Emerging Technology ◽

Management Approach ◽

Proof Of Concept ◽

Data Access Control ◽

And Storage ◽

Privacy Issues ◽

Centralized System

Identity management is a fundamental feature of the Internet of Things (IoT) ecosystem, particularly for IoT data access control. However, most current works adopt centralized approaches, which could lead to a single point of failure and to privacy issues that are tied to the use of trusted third parties. A consortium blockchain is an emerging technology that provides a neutral and trustworthy computation and storage platform that is suitable for building identity management solutions for IoT. This paper proposes a lightweight architecture and the associated protocols for consortium blockchain-based identity management to address privacy, security, and scalability issues in a centralized system for IoT. In addition, we implement a proof-of-concept prototype and evaluate our approach by measuring the latency and throughput of transactions under different query actions and payload sizes, and we compare it to other similar works. The results show that the approach is suitable for business adoption.


THREE DIMENSIONAL GRID STRUCTURE FOR EFFICIENT ACCESS OF REPLICATED DATA

Journal of Interconnection Networks ◽

10.1142/s0219265901000415 ◽

2001 ◽

Vol 02 (03) ◽

pp. 317-329 ◽

Cited By ~ 5

Author(s):

MUSTAFA MAT DERIS ◽

ALI MAMAT ◽

PUA CHAI SENG ◽

MOHD YAZID SAMAN

Keyword(s):

Distributed System ◽

Three Dimensional ◽

Data Replication ◽

High Availability ◽

Communication Cost ◽

Data Availability ◽

Grid Structure ◽

Communication Costs ◽

Replicated Data ◽

Efficient Access

This article addresses the performance of data replication protocols in terms of data availability and communication costs. Specifically, we present a new protocol, called the Three Dimensional Grid Structure (TDGS) protocol, to manage data replication in distributed systems. The protocol provides high availability for read and write operations with limited fault tolerance at low communication cost. With the TDGS protocol, a read operation is limited to two data copies, while a write operation requires a minimal number of copies. In comparison to other protocols, TDGS requires lower communication cost for an operation while providing higher data availability.
