Analysis of data integrity and storage quality of a distributed storage system

2021, Vol 251, pp. 02035
Author(s):  
Adrian Eduard Negru ◽  
Latchezar Betev ◽  
Mihai Carabaș ◽  
Costin Grigoraș ◽  
Nicolae Țăpuş ◽  
...  

CERN uses the world’s largest scientific computing grid, WLCG, for distributed data storage and processing. Monitoring of CPU and storage resources is an essential element in detecting operational issues in its systems, for example in the storage elements, and in ensuring their proper and efficient function. The processing of experiment data depends strongly on data access quality as well as data integrity, and both of these key parameters must be assured for the data lifetime. Given the substantial amount of data, O(200 PB), already collected by ALICE and kept at various storage elements around the globe, scanning every single data chunk would be a very expensive process, both in terms of computing resource usage and in terms of execution time. In this paper, we describe a distributed file crawler that addresses these natural limits by periodically extracting and analyzing statistically significant samples of files from storage elements, evaluating the results, and integrating with the existing monitoring solution, MonALISA.
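As an illustration of the sampling idea, here is a minimal Python sketch (ours, not the ALICE crawler's code): it draws a statistically significant random sample of files per storage element, recomputes checksums, and reports the observed corruption rate. The callables `list_files`, `fetch_file`, and `expected_md5` are hypothetical stand-ins for catalogue and storage-element access.

```python
import hashlib
import math
import random

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's formula with finite-population correction:
    sample size for a given confidence level and error margin."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

def crawl(storage_element, list_files, fetch_file, expected_md5):
    """Check a random sample of files on one storage element and
    return the observed corruption rate for that sample."""
    files = list_files(storage_element)            # hypothetical catalogue call
    sample = random.sample(files, sample_size(len(files)))
    bad = 0
    for f in sample:
        digest = hashlib.md5(fetch_file(storage_element, f)).hexdigest()
        if digest != expected_md5(f):              # checksum stored in catalogue
            bad += 1
    return bad / len(sample)
```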

2017, Vol 5 (1), pp. 60
Author(s):  
Agus Maman Abadi ◽  
Karyati Karyati ◽  
Musthofa Musthofa ◽  
Emut Emut

The increasing need to store large amounts of data presents a new challenge. One way to address this challenge is to use a distributed data storage system. One of the strategies implemented in distributed data storage systems is the regenerating code technique. The codes used in this technique are based on the algebraic structure of fields. Some studies have also constructed codes based on another algebraic structure, namely modules. In this study, we assess the use of module-based codes in the regenerating code technique. The study shows that codes based on modules have properties that can potentially be used in the regenerating code technique.
Keywords: distributed storage, regenerating code technique, module code
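For background, regenerating codes are usually analyzed against the cut-set bound of Dimakis et al.; the formulas below are standard literature context, not results of this paper. For an (n, k, d) code storing a file of size B with per-node storage α and per-helper repair traffic β:

```latex
% Cut-set bound: an (n,k,d) regenerating code can store a file of
% size B with per-node storage \alpha and per-helper repair traffic
% \beta only if
B \le \sum_{i=0}^{k-1} \min\{\alpha,\, (d-i)\beta\}
% At the minimum-storage regenerating (MSR) point this yields
\alpha_{\mathrm{MSR}} = \frac{B}{k},
\qquad
\gamma_{\mathrm{MSR}} = d\beta = \frac{B\,d}{k\,(d-k+1)}
```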


2011, Vol 19 (1), pp. 27-43
Author(s):  
Tevfik Kosar ◽  
Ismail Akturk ◽  
Mehmet Balman ◽  
Xinqi Wang

Modern collaborative science has placed an increasing burden on data management infrastructure to handle the increasingly large data archives generated. Besides functionality, reliability and availability are key factors in delivering a data management system that can efficiently and effectively meet the challenges posed and compounded by the unbounded increase in the size of data generated by scientific applications. We have developed a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified name space and efficient data movement across geographically distributed storage sites. At the front-end, it provides lightweight clients that enable easy, transparent and scalable access. In PetaShare, we have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability, and an advanced buffering system for improved data transfer performance. In this paper, we present the details of our design and implementation, show performance results, and describe our experience in developing a reliable and efficient distributed data management system for data-intensive science.
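The multi-master idea can be sketched in a few lines of Python. This is our illustration, not PetaShare's implementation: each master applies a metadata write locally, acknowledges the client immediately, and propagates the update to its peers asynchronously, resolving conflicts by last-writer-wins timestamps.

```python
import queue
import threading
import time

class MetadataServer:
    """Sketch of asynchronous multi-master metadata replication
    (illustration only; field names and conflict rule are ours)."""
    def __init__(self, peers):
        self.store = {}                  # key -> (timestamp, value)
        self.peers = peers               # list of other MetadataServer objects
        self.outbox = queue.Queue()
        threading.Thread(target=self._replicate, daemon=True).start()

    def write(self, key, value):
        ts = time.time()
        self._apply(key, ts, value)
        self.outbox.put((key, ts, value))   # client is acked before peers see it

    def _apply(self, key, ts, value):
        # Last-writer-wins: keep the update with the newest timestamp.
        if key not in self.store or self.store[key][0] < ts:
            self.store[key] = (ts, value)

    def _replicate(self):
        while True:
            key, ts, value = self.outbox.get()
            for peer in self.peers:
                peer._apply(key, ts, value)

# Wire up two masters (peer lists are filled in after construction).
peers_a, peers_b = [], []
a, b = MetadataServer(peers_a), MetadataServer(peers_b)
peers_a.append(b); peers_b.append(a)
a.write("/petashare/file1", {"site": "lsu-site"})   # hypothetical entry
```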


Author(s):  
Ismail Akturk ◽  
Xinqi Wang ◽  
Tevfik Kosar

The unbounded increase in the size of data generated by scientific applications necessitates collaboration and sharing among the nation’s education and research institutions. Simply purchasing high-capacity, high-performance storage systems and adding them to the existing infrastructure of the collaborating institutions does not solve the underlying and highly challenging data handling problem. Scientists are compelled to spend a great deal of time and energy on solving basic data-handling issues, such as the physical location of data, how to access it, and how to move it to visualization or compute resources for further analysis. This chapter presents the design and implementation of a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified name space and efficient data movement across geographically distributed storage sites. At the front-end, it provides lightweight clients that enable easy, transparent, and scalable access. In PetaShare, the authors have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability. The authors also present a high-level cross-domain metadata schema to provide a structured, systematic view of the multiple science domains supported by PetaShare.
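A cross-domain metadata schema of the kind described could look like the following sketch: a domain-independent core plus a per-domain extension block. The field names and example values are our assumptions, not the schema published in the chapter.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CoreRecord:
    """Domain-independent core: fields every science domain shares."""
    logical_name: str          # location-independent name in the unified namespace
    size_bytes: int
    checksum: str
    replicas: List[str]        # physical storage sites holding a copy

@dataclass
class CrossDomainRecord:
    """Core record plus a per-domain extension block, so e.g. coastal
    modeling or biology can attach their own attributes."""
    core: CoreRecord
    domain: str
    attributes: Dict[str, str] = field(default_factory=dict)

rec = CrossDomainRecord(
    core=CoreRecord("/petashare/storm/run42.nc", 2_400_000_000,
                    "9f2c...", ["lsu-site", "latech-site"]),
    domain="coastal-modeling",
    attributes={"grid_resolution_m": "50", "simulation_id": "run42"},
)
```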


Computers, 2021, Vol 10 (2), pp. 23
Author(s):  
Laskhmi Siva Rama Krishna Talluri ◽  
Ragunathan Thirumalaisamy ◽  
Ramgopal Kota ◽  
Ram Prasad Reddy Sadi ◽  
Ujjwal KC ◽  
...  

In cloud storage systems, users must be able to shut down an application when it is not in use and restart it from the last consistent state when required. BlobSeer is a data storage application, specially designed for distributed systems, that was built as an alternative to the popular open-source storage system, the Hadoop Distributed File System (HDFS). In a cloud model, all components need to stop and restart from a consistent state when the user requires it. One limitation of the BlobSeer DFS is the possibility of data loss when the system restarts. As such, it is important to provide a consistent start and stop state to BlobSeer components when they are used in a cloud environment, to prevent any data loss. In this paper, we investigate the possibility of BlobSeer providing a consistent-state distributed data storage system through the integration of checkpoint-restart functionality. To demonstrate the availability of a consistent state, we set up a cluster with multiple machines and deploy BlobSeer entities with checkpointing functionality on various machines. We adopt uncoordinated checkpointing algorithms for their benefits over the alternatives while integrating the functionality into various BlobSeer components such as the Version Manager (VM) and the Data Provider. The experimental results show that with the integration of the checkpointing functionality, a consistent state can be ensured for a distributed storage system even when the system restarts, preventing possible data loss after the system has encountered various errors and failures.
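Uncoordinated checkpointing means each component saves and restores its own state on its own schedule, with no global synchronization. Below is a minimal Python sketch of that pattern; it is ours, not BlobSeer's code, and the Version Manager state shown is a hypothetical example.

```python
import os
import pickle
import tempfile

class Checkpointable:
    """Each component checkpoints independently (uncoordinated):
    no barrier or agreement with other components is required."""
    def __init__(self, path):
        self.path = path

    def checkpoint(self, state):
        # Write to a temp file, then atomically rename, so a crash
        # mid-write never leaves a truncated checkpoint behind.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, self.path)

    def restore(self, default):
        if not os.path.exists(self.path):
            return default            # first start: no checkpoint yet
        with open(self.path, "rb") as f:
            return pickle.load(f)

# e.g. a Version Manager persisting its latest published version number
vm_ckpt = Checkpointable("version_manager.ckpt")
state = vm_ckpt.restore(default={"latest_version": 0})
state["latest_version"] += 1
vm_ckpt.checkpoint(state)
```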


Computers, 2021, Vol 10 (11), pp. 142
Author(s):  
Obadah Hammoud ◽  
Ivan Tarkhanov ◽  
Artyom Kosmarski

This paper investigates the problem of distributed storage of electronic documents (both metadata and files) in decentralized blockchain-based b2b systems (DApps). We consider the need to reduce the cost of implementing such systems and the insufficiently explored issue of storing big data in distributed ledger technology (DLT). An approach for building such systems is proposed that optimizes the size of the required storage (by using erasure coding) while providing secure data storage in the geographically distributed systems of a company or a consortium of companies. The novelty of this solution is that we are the first to combine enterprise DLT with distributed file storage in which the availability of files is controlled. The results of our experiment demonstrate that the speed of the described DApp is comparable to known b2c torrent projects, and justify the choice of Hyperledger Fabric and Ethereum Enterprise for its use. The test results show that public blockchain networks are not suitable for creating such a b2b system. The proposed system solves the main challenges of distributed data storage by grouping data into clusters and managing them with a load balancer, while preventing data tampering using a blockchain network. The considered DApp storage methodology scales easily horizontally in terms of distributed file storage and can be deployed on cloud computing technologies, while minimizing the required storage space. We compare this approach with known methods of file storage in distributed systems, including central storage, torrents, IPFS, and Storj. The reliability of this approach is calculated and the result is compared to traditional solutions based on full backup.
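Erasure coding is what lets such a system cut storage cost relative to replication: k data chunks plus m parity chunks cost (k+m)/k of the original size, e.g. 1.4x for a 10+4 code versus 3x for triple replication. The sketch below uses a single XOR parity chunk as a deliberately simplified stand-in for the Reed-Solomon-style coding production systems use.

```python
import functools

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode_with_parity(data: bytes, k: int):
    """Split data into k equal chunks plus one XOR parity chunk
    (single-parity simplification: tolerates one lost chunk)."""
    if len(data) % k:
        data += b"\x00" * (k - len(data) % k)     # pad to a multiple of k
    size = len(data) // k
    chunks = [data[i * size:(i + 1) * size] for i in range(k)]
    return chunks, functools.reduce(xor_bytes, chunks)

def recover(chunks, parity, lost):
    """Rebuild the chunk at index `lost` from the survivors plus parity."""
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    return functools.reduce(xor_bytes, survivors + [parity])

chunks, parity = encode_with_parity(b"distributed document payload", k=4)
assert recover(chunks, parity, lost=2) == chunks[2]
```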


2013, Vol 5 (1), pp. 53-69
Author(s):  
Jacques Jorda ◽  
Aurélien Ortiz ◽  
Abdelaziz M’zoughi ◽  
Salam Traboulsi

Grid computing is commonly used for large-scale applications requiring huge computation capabilities. In such distributed architectures, data storage on the distributed storage resources must be handled by a dedicated storage system to ensure the required quality of service. In order to simplify data placement on nodes and to increase the performance of applications, a storage virtualization layer can be used. This layer can be a single parallel filesystem (like GPFS) or a more complex middleware. The latter is preferred, as it allows data placement on the nodes to be tuned to increase both the reliability and the performance of data access. Thus, in such a middleware, a dedicated monitoring system must be used to ensure optimal performance. In this paper, the authors briefly introduce Visage, a middleware for storage virtualization. They present the most widely used grid monitoring systems and explain why these are not adequate for monitoring virtualized storage. The authors then present the architecture of their monitoring system dedicated to storage virtualization, introduce the workload prediction model used to select the best node for data placement, and demonstrate its accuracy in a simple experiment.
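The abstract does not give the prediction model's exact form, so the following Python sketch substitutes a simple exponentially weighted moving average of per-node load and places new data on the node with the lowest predicted load; it is an assumption-laden stand-in, not Visage's model.

```python
class NodeLoadPredictor:
    """EWMA workload predictor per storage node (our stand-in for
    the middleware's prediction model)."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.predicted = {}          # node -> smoothed load estimate

    def observe(self, node, load):
        # Blend the new measurement into the running estimate.
        prev = self.predicted.get(node, load)
        self.predicted[node] = self.alpha * load + (1 - self.alpha) * prev

    def best_node(self, candidates):
        # Place new data on the node expected to be least loaded.
        return min(candidates, key=lambda n: self.predicted.get(n, 0.0))
```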


2014, Vol 687-691, pp. 2710-2713
Author(s):  
Jing Yang

With the rapid development of computer and network technology, distributed storage and management of mass data have become widely accepted. The drawbacks are equally evident: data storage structures and storage environments differ from site to site, and further problems such as data handling arise. This paper examines how to improve data storage performance in a distributed environment, analyzes current data storage technology and the storage performance achievable in distributed environments, summarizes the requirements of distributed storage database design, and provides a theoretical basis for evaluating and standardizing distributed data storage performance.


2011, Vol 268-270, pp. 595-600
Author(s):  
Yi Liu

Based on an analysis and study of data storage strategies in wireless sensor networks, this paper presents a distributed data storage method based on sleep scheduling to address the problems of network imbalance and storage hot spots. Finally, multi-group analysis of simulation results shows that, compared with other data storage methods, the distributed data storage method based on a composite threshold has clear advantages in overall energy consumption, data storage capacity, the number of failed nodes, and data quality, and thus significantly reduces energy consumption and extends the network life cycle.
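The abstract names a composite threshold but does not spell out the rule, so the sketch below is only our guess at its shape: a node sleeps when its residual energy falls below one threshold and stops accepting data when its storage occupancy exceeds another, which is what keeps any single node from becoming a storage hot spot.

```python
def node_action(residual_energy, storage_used,
                energy_thresh=0.2, storage_thresh=0.8):
    """Composite-threshold decision (hypothetical rule and values):
    both energy and storage occupancy gate what the node does next."""
    if residual_energy < energy_thresh:
        return "sleep"        # preserve battery; node drops out of duty
    if storage_used > storage_thresh:
        return "forward"      # nearly full: pass data on, don't store
    return "store"            # healthy node: accept and store the data
```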


Author(s):  
Igor Boyarshin ◽  
Anna Doroshenko ◽  
Pavlo Rehida

The article describes a new method for improving the efficiency of systems that store, and provide access to, data shared by many users, by utilizing replication. Existing load-balancing methods in data storage systems are described, namely Round Robin (RR) and Weighted Round Robin (WRR). A new method of balancing requests among multiple data storage nodes is proposed that adjusts to the intensity of the input request stream in real time while utilizing disk space efficiently, as sketched below.
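For comparison, here is a Python sketch of classic RR alongside an adaptive weighted scheme whose weights track observed node latency. The article's exact algorithm is not given in the abstract, so the adaptive rule here is our assumption.

```python
import itertools
import random

def round_robin(nodes):
    """Classic RR: requests visit nodes in a fixed rotation."""
    return itertools.cycle(nodes)

class AdaptiveBalancer:
    """Weighted selection whose weights follow measured node latency,
    so the split adapts to the request stream (hypothetical rule)."""
    def __init__(self, nodes, alpha=0.3):
        self.alpha = alpha
        self.latency = {n: 1.0 for n in nodes}   # smoothed seconds/request

    def report(self, node, seconds):
        # Exponential smoothing of each node's measured service time.
        self.latency[node] = (self.alpha * seconds
                              + (1 - self.alpha) * self.latency[node])

    def pick(self):
        # Weight inversely to latency: faster nodes get more requests.
        weights = {n: 1.0 / t for n, t in self.latency.items()}
        r = random.uniform(0, sum(weights.values()))
        for node, w in weights.items():
            r -= w
            if r <= 0:
                return node
        return node   # fallback for floating-point rounding at the edge
```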

