Erasure Coding for production in the EOS Open Storage system

2020 ◽  
Vol 245 ◽  
pp. 04008
Author(s):  
Andreas-Joachim Peters ◽  
Michal Kamil Simon ◽  
Elvin Alin Sindrilaru

The storage group of CERN IT operates more than 20 individual EOS [1] storage services with a raw data storage volume of more than 340 PB. Storage space is a major cost factor in HEP computing, and the planned LHC Runs 3 and 4 will increase storage space demands by at least an order of magnitude. A cost-effective storage model providing durability is Erasure Coding (EC) [2]. The decommissioning of CERN’s remote computer center (Wigner/Budapest) allows a reconsideration of the currently configured dual-replica strategy, where EOS provides one replica in each computer center. EOS allows EC to be configured on a per-file basis and exposes four redundancy levels with single, dual, triple and fourfold parity, offering different qualities of service at different costs. This paper highlights tests which have been performed to migrate files on a production instance from dual-replica to various EC profiles. It discusses the performance and operational impact, and highlights various policy scenarios for selecting the best file layout with respect to IO patterns, file age and file size. We conclude with the current status and future optimizations, an evaluation of cost savings, and a discussion of an erasure-coded EOS setup as a possible tape storage replacement.
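The cost argument above comes down to raw-space overhead per logical byte. The minimal sketch below compares dual-replica storage with generic k data + m parity erasure-coded layouts; the stripe widths (10+1 through 10+4) are illustrative assumptions, not the actual EOS layout parameters.

```python
# Minimal sketch (not the EOS implementation): raw-space overhead of
# dual-replica storage versus generic k+m erasure-coded layouts.
# The (k, m) stripe widths below are illustrative assumptions only.

def overhead(data_stripes: int, parity_stripes: int) -> float:
    """Raw bytes stored per logical byte for a k+m layout."""
    return (data_stripes + parity_stripes) / data_stripes

layouts = {
    "dual replica":              overhead(1, 1),   # two full copies -> 2.0x
    "EC single parity (10+1)":   overhead(10, 1),
    "EC dual parity (10+2)":     overhead(10, 2),
    "EC triple parity (10+3)":   overhead(10, 3),
    "EC fourfold parity (10+4)": overhead(10, 4),
}

for name, factor in layouts.items():
    print(f"{name:28s} {factor:.2f}x raw space per logical byte")
```

Even the fourfold-parity layout in this sketch stores roughly 1.4x the logical data, compared with 2.0x for dual replicas, which is the source of the expected cost savings.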

2011 ◽  
Vol 14 (2) ◽  
Author(s):  
Thomas G Koch

Current estimates of obesity costs ignore the impact of future weight loss and gain, and may either over- or underestimate the economic consequences of weight loss. In light of this, I construct static and dynamic measures of the medical costs associated with body mass index (BMI), to be balanced against the cost of one-time interventions. This study finds that ignoring the implications of weight loss and gain over time overstates the medical-cost savings of such interventions by an order of magnitude. When the relationship between spending and age is allowed to vary, weight-loss attempts appear to be cost-effective starting and ending with middle age. Some interventions recently proven to decrease weight may also be cost-effective.


2018 ◽  
Vol 2018 ◽  
pp. 1-17
Author(s):  
Qing Liao ◽  
Haoyu Tan ◽  
Wuman Luo ◽  
Ye Ding

The value of large amounts of location-based mobile data has received wide attention in many research fields, including human behavior analysis, urban transportation planning, and various location-based services. Nowadays, both scientific and industrial communities are encouraged to collect as much location-based mobile data as possible, which brings two challenges: (1) how to efficiently process queries over big location-based mobile data and (2) how to reduce the cost of storage services, because it is too expensive to store several exact data replicas for fault tolerance. So far, several dedicated storage systems have been proposed to address these issues. However, they do not work well when the ranges of queries vary widely. In this work, we design a storage system based on a diverse replica scheme which can not only improve query processing efficiency but also reduce the cost of storage space. To the best of our knowledge, this is the first work to investigate data storage and processing in the context of big location-based mobile data. Specifically, we conduct an in-depth theoretical and empirical analysis of the trade-offs between different spatial-temporal partitioning and data encoding schemes. Moreover, we propose an effective approach to select an appropriate set of diverse replicas, optimized for the expected query loads while conforming to the given storage space budget. The experimental results show that using diverse replicas can significantly improve the overall query performance and that the proposed algorithms for the replica selection problem are both effective and efficient.
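To make the replica selection problem concrete, the sketch below shows a budget-constrained greedy heuristic that ranks candidate replicas by estimated benefit per byte. It is not the paper's algorithm; the candidate names, sizes and benefit scores are purely illustrative assumptions.

```python
# Minimal sketch of budget-constrained diverse-replica selection, loosely
# inspired by the problem statement above; NOT the paper's algorithm.
# Each candidate is a (partitioning, encoding) combination with an assumed
# size and an assumed benefit for the expected query load.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    size_gb: float    # extra storage this replica would consume
    benefit: float    # estimated speed-up for the expected query load

def select_replicas(candidates: list[Candidate], budget_gb: float) -> list[Candidate]:
    chosen, remaining = [], budget_gb
    # Rank by benefit density (benefit per gigabyte) and take what fits.
    for c in sorted(candidates, key=lambda c: c.benefit / c.size_gb, reverse=True):
        if c.size_gb <= remaining:
            chosen.append(c)
            remaining -= c.size_gb
    return chosen

candidates = [
    Candidate("hour-partitioned, row encoding", 800, 40.0),
    Candidate("day-partitioned, columnar encoding", 500, 35.0),
    Candidate("grid-partitioned, delta encoding", 300, 15.0),
]
for c in select_replicas(candidates, budget_gb=1000):
    print(c.name)
```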


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 01) ◽  
pp. 288-301
Author(s):  
G. Sujatha ◽  
Dr. Jeberson Retna Raj

Data storage is one of the significant cloud services available to cloud users. Since the volume of outsourced information grows extremely large, data deduplication is needed in cloud storage for efficient space utilization. Cloud storage supports all kinds of digital data, such as text, audio, video and images. In a hash-based deduplication system, a cryptographic hash value is calculated for every data item, irrespective of its type, and stored in memory for future reference. Duplicate copies are identified using these hash values alone. The problem with this existing scenario is the size of the hash table: in the worst case, all stored hash values must be checked to find a duplicate, irrespective of the data type, and a single hash-table structure does not suit every kind of digital data equally well. In this study we propose an approach that maintains a separate hash table for each type of digital data. Having a dedicated hash table per data type improves the search time for duplicate data.
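The per-type hash-table idea can be sketched as follows. This is a minimal illustration assuming SHA-256 as the cryptographic hash; detecting the data type from the file extension is a simplification introduced here, not part of the original proposal.

```python
# Minimal sketch of per-type hash tables for deduplication, assuming SHA-256.
# Type detection by file extension is an illustrative simplification.

import hashlib
from pathlib import Path

# One dictionary (hash -> stored path) per data type, instead of a single
# global table: a lookup only scans entries of the matching type.
tables: dict[str, dict[str, str]] = {"text": {}, "audio": {}, "video": {}, "image": {}}

TYPE_BY_EXT = {".txt": "text", ".mp3": "audio", ".mp4": "video", ".jpg": "image"}

def store(path: str) -> bool:
    """Return True if the file content is new, False if it is a duplicate."""
    kind = TYPE_BY_EXT.get(Path(path).suffix.lower(), "text")
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    table = tables[kind]
    if digest in table:
        return False          # duplicate: reference the existing copy
    table[digest] = path      # new content: record it in the type's table
    return True
```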


Author(s):  
Nikola Davidović ◽  
Slobodan Obradović ◽  
Borislav Đorđević ◽  
Valentina Timčenko ◽  
Bojan Škorić

Rapid technological progress has led to a growing need for more data storage space. The appearance of big data requires larger storage space, faster access to and exchange of data, as well as data security. RAID (Redundant Array of Independent Disks) technology is one of the most cost-effective ways to satisfy the needs for larger storage space, data access and protection. However, while combining multiple secondary storage devices in RAID 0 provides greater storage capacity and increases both read and write speeds, it is neither fault-tolerant nor error-free. This paper provides an analysis of a system that stores data on paired arrays of magnetic disks in a RAID 0 formation, with different numbers of queue entries for overlapped I/O, where the queue depth parameter takes the values 1 and 4. The paper presents a range of test results and analysis for the RAID 0 series under defined workload characteristics. The tests were carried out on the Microsoft Windows Server 2008 R2 Standard operating system, using 2, 3, 4 and 6 paired magnetic disks controlled by a Dell PERC 6/i hardware RAID controller. The measurement results were obtained with ATTO Disk Benchmark. The obtained results have been analyzed and compared to the expected behavior.
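A back-of-the-envelope model helps explain the expected behavior: RAID 0 scales capacity and bandwidth roughly linearly with the number of disks, while the chance of losing the whole array grows with every disk added. The disk capacity, per-disk throughput and failure rate below are illustrative assumptions, not figures from the paper.

```python
# Rough RAID 0 scaling model; all input figures are illustrative assumptions.

def raid0_model(n_disks: int, disk_tb: float = 2.0,
                disk_mbps: float = 150.0, annual_failure_rate: float = 0.02):
    capacity_tb = n_disks * disk_tb          # striping adds capacities
    ideal_mbps = n_disks * disk_mbps         # sequential I/O scales ~linearly
    # Any single disk failure loses the whole array (no parity, no mirror):
    p_survive_year = (1.0 - annual_failure_rate) ** n_disks
    return capacity_tb, ideal_mbps, p_survive_year

for n in (2, 3, 4, 6):
    cap, bw, surv = raid0_model(n)
    print(f"{n} disks: {cap:.0f} TB, ~{bw:.0f} MB/s ideal, "
          f"{surv:.1%} chance of surviving one year")
```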


2014 ◽  
Vol 556-562 ◽  
pp. 6179-6183
Author(s):  
Zhi Gang Chai ◽  
Ming Zhao ◽  
Xiao Yu

With the rapid development of information technology, the extensive use of cloud computing is driving technological change in the IT industry. Cloud storage is one solution to the problem of storing data volumes that are traditionally large and highly redundant. Using cloud computing in the storage system connects the user's data with network clients via the Internet; it not only satisfies large data storage space requirements but also greatly reduces the cost of the storage system. However, the application of cloud storage still faces many open problems, which to some extent have hindered its development. Among these issues, the most pressing one is cloud storage security. The following passages discuss this problem and propose a solution.


2020 ◽  
Vol 15 (1) ◽  
pp. 15
Author(s):  
Felix Bach ◽  
Björn Schembera ◽  
Jos Van Wezel

Research data, as a truly valuable good in science, must be saved and subsequently kept findable, accessible and reusable for a time span of several years for reasons of proper scientific conduct. However, managing long-term storage of research data is a burden for institutes and researchers. Because of the sheer size and the required retention time, suitable storage providers are hard to find. Aiming to solve this puzzle, the bwDataArchive project started development of a long-term research data archive that is reliable, cost-effective and able to store multiple petabytes of data. The hardware consists of data storage on magnetic tape, interfaced with disk caches and nodes for data movement and access. On the software side, the High Performance Storage System (HPSS) was chosen for its proven ability to reliably store huge amounts of data; however, the implementation of bwDataArchive is not dependent on HPSS. For authentication, bwDataArchive is integrated into the federated identity management for educational institutions in the State of Baden-Württemberg in Germany. The archive features data protection by means of a dual copy at two distinct locations on different tape technologies, data accessibility via common storage protocols, data retention assurance for more than ten years, data preservation with checksums, and data management capabilities supported by a flexible directory structure allowing sharing and publication. As of September 2019, the bwDataArchive holds over 9 PB and 90 million files and sees a constant increase in usage and users from many communities.
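The combination of dual copies and checksums described above can be illustrated with a minimal sketch. It assumes SHA-256 and plain file paths as stand-ins; the real archive works through HPSS, tape libraries and standard storage protocols, none of which are modelled here.

```python
# Minimal sketch of checksum-verified dual-copy archiving; SHA-256 and local
# paths are assumptions for illustration, not bwDataArchive's actual tooling.

import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def archive(src: Path, primary: Path, secondary: Path) -> str:
    """Copy src to two distinct locations and verify both copies by checksum."""
    digest = sha256(src)
    for target in (primary, secondary):
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)
        if sha256(target) != digest:
            raise IOError(f"checksum mismatch for copy at {target}")
    return digest  # kept with the metadata for later integrity checks
```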


2020 ◽  
Vol 245 ◽  
pp. 04026
Author(s):  
Haykuhi Musheghyan ◽  
Andreas Petzold ◽  
Andreas Heiss ◽  
Doris Ressmann ◽  
Martin Beitzinger

Data growth over several years within the HEP experiments requires a wider use of storage systems at the WLCG Tier centers. It also increases the complexity of storage systems, which includes the expansion of hardware components and thereby further complicates the existing software products. Coping with such systems is a non-trivial task and requires highly qualified specialists. Storing petabytes of data on tape is still the most cost-effective way. Year after year the use of tape storage increases; consequently, a detailed study of its optimal use and verification of its performance is a key aspect of such a system. This includes several factors, such as performing various performance tests, identifying and eliminating bottlenecks, and properly adjusting and improving the current GridKa setup. At present, GridKa uses dCache as the frontend storage system and TSM as the tape storage backend. dCache provides a plugin interface for exchanging data between dCache and tape. TSS is a TSM-based client developed by the GridKa team and has been in production for over 10 years. The interaction between the GridKa dCache instance and TSM is accomplished using additional scripts that can be further optimized to improve the overall performance of the tape storage. This contribution provides detailed information on the results of various performance tests performed on the GridKa tape system and on significant improvements of our tape storage performance.
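As a generic illustration of the kind of optimization such scripts can apply (this is not GridKa's actual code), a common way to improve tape throughput is to batch requests per cartridge so each tape is mounted once and read in on-tape order. The request format and the "position" sort key below are assumptions for this sketch.

```python
# Generic illustration of batching recall requests per tape cartridge to
# minimize mounts and seeks; not GridKa's actual scripts or data format.

from collections import defaultdict

requests = [
    {"file": "/pnfs/exp/a/f1", "tape": "T001", "position": 120},
    {"file": "/pnfs/exp/a/f2", "tape": "T002", "position": 15},
    {"file": "/pnfs/exp/a/f3", "tape": "T001", "position": 40},
]

by_tape = defaultdict(list)
for req in requests:
    by_tape[req["tape"]].append(req)

for tape, reqs in by_tape.items():
    # Reading in on-tape order avoids costly back-and-forth repositioning.
    ordered = sorted(reqs, key=lambda r: r["position"])
    print(tape, [r["file"] for r in ordered])
```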


2019 ◽  
Vol 214 ◽  
pp. 04029
Author(s):  
Doris Ressmann ◽  
Dorin Lobontu ◽  
Martin Beitzinger ◽  
Karin Schaefer ◽  
Andreas Heiss ◽  
...  

Tape storage is still a cost-effective way to keep large amounts of data over a long period of time, and it is expected that this will continue in the future. The GridKa tape environment is a complex system of many hardware components and software layers. Configuring this system for optimal performance across all use cases is a non-trivial task and requires a lot of experience. We present the current status of the GridKa tape environment, report on recent upgrades and improvements, and describe plans to further develop and enhance the system, especially with regard to the future requirements of the HEP experiments and their large data centers. The short-term planning mainly includes the transition from TSM to HPSS as the backend and its effects on the connection of dCache and xrootd. Recent changes in the vendor situation for certain tape technologies require a precise analysis of their impact and a possible adaptation of the mid-term planning, in particular with respect to the scalability challenge that comes with the HL-LHC on the horizon.


2021 ◽  
pp. 1063293X2199201
Author(s):  
Anto Praveena M.D. ◽  
Bharathi B

Duplication of data in an application becomes an expensive factor. Replicated data must be detected and, where necessary, removed from the dataset, since it occupies a large share of the storage space. The cloud is the main destination for data storage, and organizations have already started moving their datasets to the cloud because of its cost effectiveness, storage capacity, data security and data privacy. In the healthcare sector, storing duplicated records leads to wrong predictions, and when many users upload the same files, storage demand grows further. To address these issues, this paper proposes Optimal Removal of Deduplication (ORD) for heart disease data using a hybrid trust-based neural network algorithm. In the ORD scheme, the Chaotic Whale Optimization (CWO) algorithm computes a trust value for each data item from multiple decision metrics. The computed trust values and the characteristics of the data are then fed into the training process of a Mimic Deep Neural Network (MDNN), which classifies each item as a duplicate or not, so that duplicate files are identified and removed from storage. Finally, simulations evaluate the proposed MDNN-based model, and the results show the effectiveness of the ORD scheme for data duplication removal; the model's accuracy, sensitivity and specificity were found to be good.


2021 ◽  
Vol 251 ◽  
pp. 02023
Author(s):  
Maria Arsuaga-Rios ◽  
Vladimír Bahyl ◽  
Manuel Batalha ◽  
Cédric Caffy ◽  
Eric Cano ◽  
...  

The CERN IT Storage Group ensures the symbiotic development and operation of storage and data transfer services for all CERN physics data, in particular the data generated by the four LHC experiments (ALICE, ATLAS, CMS and LHCb). In order to accomplish the objectives of the next run of the LHC (Run-3), the Storage Group has undertaken a thorough analysis of the experiments’ requirements, matched them to the appropriate storage and data transfer solutions, and carried out a rigorous programme of testing to identify and solve any issues before the start of Run-3. In this paper, we present the main challenges posed by each of the four LHC experiments. We describe their workflows, in particular how they communicate with and use the key components provided by the Storage Group: the EOS disk storage system; its archival back-end, the CERN Tape Archive (CTA); and the File Transfer Service (FTS). We also describe the validation and commissioning tests that have been undertaken and the challenges overcome: the ATLAS stress tests to push their DAQ system to its limits; the CMS migration from PhEDEx to Rucio, followed by large-scale tests between EOS and CTA with the new FTS “archive monitoring” feature; the LHCb Tier-0 to Tier-1 staging tests and XRootD Third Party Copy (TPC) validation; and the erasure coding performance in ALICE.

