Design and Evaluation of a Simple Data Interface for Efficient Data Transfer across Diverse Storage

Author(s):  
Zhengchun Liu ◽  
Rajkumar Kettimuthu ◽  
Joaquin Chung ◽  
Rachana Ananthakrishnan ◽  
Michael Link ◽  
...  

Modern science and engineering computing environments often feature storage systems of different types, from parallel file systems in high-performance computing centers to object stores operated by cloud providers. To enable easy, reliable, secure, and performant data exchange among these different systems, we propose Connector, a pluggable data access architecture for diverse, distributed storage. By abstracting low-level storage system details, Connector permits a managed data transfer service (Globus, in our case) to interact with a large and easily extended set of storage systems. Equally important, it supports third-party transfers: that is, direct data transfers from source to destination that are initiated by a third-party client but do not engage that third party in the data path. The abstraction also enables management of transfers for performance optimization, error handling, and end-to-end integrity. We present the Connector design, describe implementations for different storage services, evaluate tradeoffs inherent in managed vs. direct transfers, motivate recommended deployment options, and propose a model-based method that allows for easy characterization of performance in different contexts without exhaustive benchmarking.
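As a rough illustration of the pluggable abstraction described above, the sketch below shows one way such a connector interface could be expressed in Python; the class and method names are illustrative assumptions, not the actual Globus Connector API.

```python
# A minimal sketch of a pluggable storage "connector" interface; all names
# here are hypothetical and do not reflect the implementation in the paper.
import hashlib
from abc import ABC, abstractmethod
from typing import Iterator


class StorageConnector(ABC):
    """Hides low-level storage details behind a uniform data interface."""

    @abstractmethod
    def open_read(self, path: str) -> Iterator[bytes]:
        """Yield data chunks from an object or file on the backing store."""

    @abstractmethod
    def open_write(self, path: str):
        """Return a writable handle for an object or file on the backing store."""

    @abstractmethod
    def checksum(self, path: str, algorithm: str = "md5") -> str:
        """Return a checksum so a transfer service can verify end-to-end integrity."""


class PosixConnector(StorageConnector):
    """Example backend: a POSIX file system."""

    def open_read(self, path: str, chunk_size: int = 4 * 1024 * 1024):
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk

    def open_write(self, path: str):
        return open(path, "wb")

    def checksum(self, path: str, algorithm: str = "md5") -> str:
        h = hashlib.new(algorithm)
        for chunk in self.open_read(path):
            h.update(chunk)
        return h.hexdigest()
```

A managed transfer service could then drive any backend (object store, tape archive, parallel file system) through the same three calls, which is the essence of the pluggability argument above.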

2021 ◽  
Author(s):  
Marco Kulüke ◽  
Fabian Wachsmann ◽  
Georg Leander Siemund ◽  
Hannes Thiemann ◽  
Stephan Kindermann

This study provides guidance to data providers on how to transfer existing netCDF data from a hierarchical storage system to Zarr on an object storage system.

In recent years, object storage systems have become an alternative to traditional hierarchical file systems because they are easily scalable and offer faster data retrieval than hierarchical storage systems.

Earth system sciences, and climate science in particular, handle large amounts of data. These data are usually represented as multi-dimensional arrays and traditionally stored in netCDF format on hierarchical file systems. However, the current netCDF-4 format is not yet optimized for object storage systems: netCDF data transfers from object storage can be conducted only at the file level, which results in heavy download volumes. The Zarr format can mitigate this problem; because chunks and metadata are accessed directly, it reduces data transfer volumes and increases input/output speed in parallel computing environments.

As one of the largest climate data providers worldwide, the German Climate Computing Center (DKRZ) continuously works toward efficient ways to make data accessible to users. This use case shows the conversion and transfer of a subset of the Coupled Model Intercomparison Project Phase 6 (CMIP6) climate data archive from netCDF on the hierarchical file system to Zarr on the OpenStack object store, known as Swift, using the Zarr Python package. Finally, this study evaluates to what extent Zarr-formatted climate data on an object storage system are a meaningful addition to the existing high-performance computing environment of the DKRZ.
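As a rough illustration of the conversion step described above, a minimal sketch using xarray, dask, zarr, and fsspec is shown below; the file name, chunking, and store location are placeholder assumptions rather than the DKRZ/Swift configuration used in the study.

```python
# A minimal sketch of a netCDF-to-Zarr conversion, assuming xarray, dask,
# zarr, and fsspec are installed. Paths and chunk sizes are placeholders.
import xarray as xr
import fsspec

# Open the source netCDF file with explicit chunks so the resulting Zarr
# store gets a chunking that matches typical access patterns.
ds = xr.open_dataset("tas_day_CMIP6_subset.nc", chunks={"time": 365})

# Map a target location to a Zarr store. For Swift this would be an
# fsspec-compatible object-store backend; a local directory works for testing.
store = fsspec.get_mapper("output/tas_day_CMIP6_subset.zarr")

# Write chunks plus consolidated metadata, enabling direct chunk access later.
ds.to_zarr(store, mode="w", consolidated=True)
```

Because each chunk becomes a separate object, a later analysis can fetch only the chunks that intersect its subset instead of downloading whole netCDF files.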


Electronics ◽  
2021 ◽  
Vol 10 (20) ◽  
pp. 2486
Author(s):  
Se-young Yu

Distributing Big Data for science is pushing the capabilities of networks and computing systems. However, the fundamental concept of copying data from one machine to another has not been challenged in collaborative science. As recent storage system developments use modern fabrics to provide faster remote data access with lower overhead, traditional data movement using Data Transfer Nodes (DTNs) must cope with the paradigm shift from a store-and-forward model to streaming data with direct storage access over the network. This study evaluates NVMe-over-TCP (NVMe-TCP) in a long-distance network using different file systems and configurations to characterize remote NVMe file system access performance in MAN and WAN data movement scenarios. We found that NVMe-TCP is more suitable for remote data reads than remote data writes over the network, and that using RAID0 can significantly improve performance in a long-distance network. Additionally, fine-tuning the file system can improve remote write performance in DTNs over a long-distance network.
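As an illustration of the kind of measurement involved, the sketch below times sequential writes and reads against a hypothetical NVMe-over-TCP mount point; a real evaluation such as the one in this study would rely on dedicated benchmarking tools, cache control, and repeated, controlled runs.

```python
# A minimal, hypothetical throughput check for a remotely mounted file system.
# The mount point and sizes are placeholders; results from a one-shot run like
# this are only indicative (e.g., reads may be served from the page cache).
import os
import time

MOUNT = "/mnt/nvme_tcp"          # hypothetical NVMe-TCP mount point
BLOCK = 4 * 1024 * 1024          # 4 MiB blocks
TOTAL = 4 * 1024 * 1024 * 1024   # 4 GiB test file


def write_test(path: str) -> float:
    buf = os.urandom(BLOCK)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(TOTAL // BLOCK):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())     # ensure data actually reaches the remote target
    return TOTAL / (time.perf_counter() - start) / 1e6  # MB/s


def read_test(path: str) -> float:
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(BLOCK):
            pass
    return TOTAL / (time.perf_counter() - start) / 1e6  # MB/s


if __name__ == "__main__":
    target = os.path.join(MOUNT, "throughput_test.bin")
    print(f"write: {write_test(target):.1f} MB/s, read: {read_test(target):.1f} MB/s")
```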


2013 ◽  
Vol 5 (1) ◽  
pp. 53-69
Author(s):  
Jacques Jorda ◽  
Aurélien Ortiz ◽  
Abdelaziz M’zoughi ◽  
Salam Traboulsi

Grid computing is commonly used for large-scale applications requiring huge computation capabilities. In such distributed architectures, data storage on the distributed storage resources must be handled by a dedicated storage system to ensure the required quality of service. In order to simplify data placement on nodes and to increase the performance of applications, a storage virtualization layer can be used. This layer can be a single parallel file system (such as GPFS) or a more complex middleware. The latter is preferred, as it allows data placement on the nodes to be tuned to increase both the reliability and the performance of data access. Thus, in such a middleware, a dedicated monitoring system must be used to ensure optimal performance. In this paper, the authors briefly introduce the Visage middleware, a middleware for storage virtualization. They present the most broadly used grid monitoring systems and explain why they are not adequate for monitoring virtualized storage. The authors then present the architecture of their monitoring system dedicated to storage virtualization. They introduce the workload prediction model used to select the best node for data placement and demonstrate its accuracy in a simple experiment.
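To illustrate the general idea of workload-driven placement (not the actual Visage prediction model), the sketch below predicts each node's load with an exponential moving average of monitored I/O throughput and places new data on the least-loaded node.

```python
# A minimal, hypothetical sketch of workload-driven node selection. This only
# illustrates the concept; it is not the prediction model used in Visage.
from dataclasses import dataclass, field


@dataclass
class NodeLoadPredictor:
    alpha: float = 0.3                          # smoothing factor for the EMA
    predicted: dict = field(default_factory=dict)

    def observe(self, node: str, io_mbps: float) -> None:
        """Update the predicted load for a node from a new monitoring sample."""
        prev = self.predicted.get(node, io_mbps)
        self.predicted[node] = self.alpha * io_mbps + (1 - self.alpha) * prev

    def best_node(self) -> str:
        """Return the node expected to be least loaded for the next placement."""
        return min(self.predicted, key=self.predicted.get)


predictor = NodeLoadPredictor()
predictor.observe("node-a", 120.0)   # MB/s currently served by node-a
predictor.observe("node-b", 45.0)
print(predictor.best_node())         # -> "node-b"
```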


2018 ◽  
Vol 37 (3) ◽  
pp. 29-49
Author(s):  
Kumar Sharma ◽  
Ujjal Marjit ◽  
Utpal Biswas

Resource Description Framework (RDF) is a commonly used data model in the Semantic Web environment. Libraries and various other communities have been using the RDF data model to store valuable data after it is extracted from traditional storage systems. However, because of the large volume of the data, processing and storing it is becoming a major challenge for traditional data-management tools. This challenge demands a scalable and distributed system that can manage data in parallel. In this article, a distributed solution is proposed for efficiently processing and storing the large volume of library linked data held in traditional storage systems. Apache Spark is used for parallel processing of large data sets, and a column-oriented schema is proposed for storing RDF data. The storage system is built on top of the Hadoop Distributed File System (HDFS) and uses the Apache Parquet format to store data in compressed form. The experimental evaluation showed that storage requirements were reduced significantly compared with Jena TDB, Sesame, RDF/XML, and N-Triples file formats. SPARQL queries are processed using Spark SQL to query the compressed data. The experimental evaluation showed good query response times, which decrease significantly as the number of worker nodes increases.
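A minimal PySpark sketch of this general pipeline is shown below; the paths, the naive N-Triples parsing, and the flat (subject, predicate, object) table are simplifying assumptions rather than the column-oriented schema proposed in the article.

```python
# A minimal sketch: parse N-Triples into a triple table, persist it as
# compressed Parquet on HDFS, and query it with Spark SQL. Paths are
# placeholders; real literals would need a proper N-Triples parser.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdf-to-parquet").getOrCreate()

# Each N-Triples line is "<s> <p> <o> ." ; strip the trailing " ." and split
# into at most three fields (simplified parsing).
triples = (
    spark.read.text("hdfs:///data/library.nt").rdd
    .map(lambda row: row.value.rstrip(" ."))
    .map(lambda line: line.split(" ", 2))
    .filter(lambda parts: len(parts) == 3)
)

df = spark.createDataFrame(triples, ["subject", "predicate", "object"])

# Parquet stores the table column-wise and compressed (snappy by default).
df.write.mode("overwrite").parquet("hdfs:///data/library_parquet")

# Query the compressed data with Spark SQL.
spark.read.parquet("hdfs:///data/library_parquet").createOrReplaceTempView("triples")
spark.sql(
    "SELECT subject FROM triples WHERE predicate = '<http://purl.org/dc/terms/creator>'"
).show()
```

Adding worker nodes lets Spark scan more Parquet row groups in parallel, which is consistent with the reported drop in query response time as the cluster grows.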


2020 ◽  
Vol 245 ◽  
pp. 04037
Author(s):  
Xiaowei Aaron Chu ◽  
Jeff LeFevre ◽  
Aldrin Montana ◽  
Dana Robinson ◽  
Quincey Koziol ◽  
...  

Access libraries such as ROOT [1] and HDF5 [2] allow users to interact with datasets using high-level abstractions, like coordinate systems and associated slicing operations. Unfortunately, the implementations of access libraries are based on outdated assumptions about storage system interfaces and are generally unable to fully benefit from modern fast storage devices. For example, access libraries often implement buffering and data layouts that assume that large, single-threaded sequential access patterns cause less overall latency than small parallel random accesses; while this is true for spinning media, it is not true for flash media. The situation is getting worse with rapidly evolving storage devices such as non-volatile memory and ever larger datasets. This project explores distributed dataset mapping infrastructures that can integrate and scale out existing access libraries using Ceph's extensible object model, avoiding re-implementation or even modification of these access libraries as much as possible. These programmable storage extensions, coupled with our distributed dataset mapping techniques, enable: 1) access library operations to be offloaded to storage system servers, 2) the independent evolution of access libraries and storage systems, and 3) full use of the existing load balancing, elasticity, and failure management of distributed storage systems like Ceph. They also create more opportunities for storage server-local optimizations. For example, storage servers might include local key/value stores combined with chunk stores that require different optimizations than a local file system. As storage servers evolve to support new storage devices like non-volatile memory, these server-local optimizations can be implemented while minimizing disruptions to applications. We will report progress on the means by which distributed dataset mapping can be abstracted over particular access libraries, including access libraries for ROOT data, and on how we address some of the challenges around data partitioning and composability of access operations.
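As a rough illustration of the dataset-mapping idea (with hypothetical names and layout, not the project's actual Ceph object model or API), the sketch below translates a slice request over a chunked multi-dimensional dataset into the storage objects that hold the requested data.

```python
# A minimal, hypothetical sketch of dataset mapping: turn a high-level slice
# request into the set of chunk objects that cover it, so the work can be
# routed to the storage servers holding those objects.
from itertools import product


def chunks_for_slice(slices, chunk_shape, dataset_id):
    """Return object names covering the requested hyperslab.

    slices      -- list of (start, stop) index ranges, one per dimension
    chunk_shape -- chunk size in each dimension
    dataset_id  -- prefix used to name the dataset's chunk objects
    """
    per_dim = []
    for (start, stop), chunk in zip(slices, chunk_shape):
        first = start // chunk
        last = (stop - 1) // chunk
        per_dim.append(range(first, last + 1))
    return [
        f"{dataset_id}.chunk." + ".".join(map(str, idx))
        for idx in product(*per_dim)
    ]


# Example: a 2-D dataset chunked 1000x1000; the slice [1500:2500, 0:500]
# touches chunks (1,0) and (2,0), i.e. two storage objects that could be
# processed server-side instead of shipped whole to the client.
print(chunks_for_slice([(1500, 2500), (0, 500)], (1000, 1000), "physics_ds42"))
```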


2016 ◽  
Vol 4 (1) ◽  
Author(s):  
Agus Maman Abadi ◽  
Musthofa Musthofa ◽  
Emut Emut

The increasing need for techniques to store big data presents a new challenge. One way to address this challenge is the use of distributed storage systems. One strategy implemented in distributed data storage systems is the use of erasure codes applied to network coding. The code used in this technique is based on an algebraic structure called a vector space. Some studies have also been carried out to create codes based on other algebraic structures, such as modules. In this study, we attempt to construct a code based on a semimodule, an algebraic structure that generalizes the module, by utilizing the max and addition operations of max-plus algebra. The results of this study indicate that the max operation and the addition operation of max-plus algebra cannot be used to establish a semimodule code, but by modifying the operation "+" to "min", we obtain a code based on a semimodule.

Keywords: code, distributed storage systems, network coding, semimodule, max-plus algebra
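For reference, the two max-plus operations mentioned in the abstract are standard and typically defined as below; the min-based variant corresponds to the modification described in the results. This is a general fact about max-plus algebra, not notation quoted from the article.

```latex
% Standard max-plus semiring operations over the extended reals:
\[
  \mathbb{R}_{\max} = \mathbb{R} \cup \{-\infty\}, \qquad
  a \oplus b = \max(a, b), \qquad
  a \otimes b = a + b,
\]
% with $-\infty$ neutral for $\oplus$ and $0$ neutral for $\otimes$.
% The modification replacing "+" with "min" instead uses
\[
  a \oplus' b = \min(a, b) \quad \text{over } \mathbb{R}_{\min} = \mathbb{R} \cup \{+\infty\}.
\]
```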


Author(s):  
Yu Guo ◽  
Shenling Wang ◽  
Jianhui Huang

The explosive growth of big data is pushing forward the paradigm of cloud-based data stores today. Among others, distributed storage systems are widely adopted due to their superior performance and continuous availability. However, due to the potentially wide attack surface of the public cloud, outsourcing data storage inevitably raises new concerns about user privacy exposure and unauthorized data access. Moreover, directly introducing a centralized third-party authority for query authorization management does not work, because it can still be compromised. In this paper, we propose a blockchain-assisted framework that can support trustworthy data sharing services. In particular, data owners can outsource their sensitive data to distributed storage systems in encrypted form. By leveraging the smart contracts of blockchain, a data owner can distribute secret keys to authorized users without extra rounds of interaction to generate the permitted search tokens. Meanwhile, such a blockchain-assisted framework naturally solves the trust issues of query authorization. In addition, we devise a secure local index framework to support encrypted keyword search with forward privacy and mitigate blockchain overhead. To validate our design, we implement a prototype and deploy it on Amazon Cloud. Extensive experiments demonstrate the security, efficiency, and effectiveness of the blockchain-assisted design.
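As a rough illustration only, the sketch below shows the bookkeeping of a counter-based encrypted keyword index of the kind such local-index schemes build on; it is not the paper's construction, and it omits the extra machinery (e.g., trapdoor permutations or per-update key rotation) that genuinely forward-private schemes require.

```python
# A minimal, hypothetical counter-based encrypted keyword index. Each
# (keyword, counter) pair maps to a pseudorandom label under a per-keyword
# key; a search token releases that key plus the current counter.
import hashlib
import hmac
import os


def prf(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()


class ClientIndex:
    def __init__(self):
        self.keyword_keys = {}   # keyword -> per-keyword secret key
        self.counters = {}       # keyword -> number of entries inserted

    def add(self, keyword: str, doc_id: str):
        key = self.keyword_keys.setdefault(keyword, os.urandom(32))
        count = self.counters.get(keyword, 0)
        label = prf(key, count.to_bytes(4, "big")).hex()   # server-side address
        self.counters[keyword] = count + 1
        return label, doc_id     # doc_id would itself be encrypted in practice

    def search_token(self, keyword: str):
        # Releasing the key and counter lets the server recompute all current
        # labels for this keyword without learning the keyword itself.
        return self.keyword_keys[keyword], self.counters.get(keyword, 0)


class ServerIndex:
    def __init__(self):
        self.store = {}          # label -> stored document id

    def insert(self, label: str, doc_id: str):
        self.store[label] = doc_id

    def search(self, key: bytes, count: int):
        labels = (prf(key, i.to_bytes(4, "big")).hex() for i in range(count))
        return [self.store[l] for l in labels if l in self.store]
```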

