A Cloud-Aware Distributed Object Storage System to Retrieve Large Data via HTML5-Enabled Web Browsers

Big Data ◽  
2016 ◽  
pp. 828-847
Author(s):  
Ahmet Artu Yıldırım ◽  
Dan Watson

Major Internet services are required to process tremendous amounts of data in real time. Putting these services under the magnifying glass, it becomes evident that distributed object storage systems play an important role on the back end in achieving this success. This chapter gives an overview of the current state-of-the-art storage systems used to meet reliable, high-performance, and scalable storage needs in data centers and the cloud. It then introduces an experimental distributed object storage system (CADOS) for efficiently retrieving large data, such as hundreds of megabytes, through HTML5-enabled web browsers over big data (terabytes of data) in a cloud infrastructure. The objective of the system is to minimize latency and to provide a scalable storage system on the cloud using a thin RESTful web service and modern HTML5 capabilities.
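The abstract does not specify CADOS's wire protocol, but the general pattern it describes, a browser-side client pulling a large object in parallel segments from a thin REST interface, can be sketched as follows. The endpoint, segment size, and use of HTTP Range requests are assumptions for illustration, not the system's actual design:

```python
# Illustrative sketch of segmented retrieval from a thin RESTful storage
# service; endpoint, segment size, and header usage are assumptions, not
# CADOS's actual protocol. A browser client would follow the same pattern
# with HTML5 XMLHttpRequest and Blob objects.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "https://storage.example.org/objects/dataset-42"   # hypothetical
SEGMENT = 8 * 1024 * 1024                                # 8 MiB per range

def fetch(span):
    lo, hi = span
    r = requests.get(URL, headers={"Range": f"bytes={lo}-{hi}"})
    r.raise_for_status()
    return r.content

# Discover the object size, then cut it into inclusive byte ranges.
size = int(requests.head(URL).headers["Content-Length"])
spans = [(lo, min(lo + SEGMENT, size) - 1) for lo in range(0, size, SEGMENT)]

# Fetch segments in parallel to hide per-request latency, then reassemble.
with ThreadPoolExecutor(max_workers=8) as pool:
    data = b"".join(pool.map(fetch, spans))
```

Fetching fixed-size segments concurrently is one way to keep latency low for hundreds-of-megabytes objects, since no single slow request stalls the whole download.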


Author(s):  
Jung-Ho Ahn ◽  
Ha-Joo Song ◽  
Hyoung-Joo Kim

An efficient object manager, a middle layer on top of a storage system, is essential to ensure acceptable performance of object-oriented database systems, since a traditional record-based storage system is too simple to provide object abstraction. In this chapter, we design and implement an extensible object storage system, called Soprano, in an object-oriented fashion that has shown great potential for extensibility and code reusability. Soprano provides a uniform object abstraction and offers the convenience of persistent programming through many useful persistent classes. Soprano also supports efficient object management and pointer swizzling for fast object access. This chapter investigates several aspects of the design and implementation of the extensible object storage system. Our experience shows the feasibility of using an object-oriented design and implementation in building an object storage system that must provide both extensibility and high performance.
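Pointer swizzling, which the abstract names as the key to fast object access, means replacing a persistent object identifier (OID) with a direct in-memory reference the first time it is dereferenced. A minimal illustrative sketch follows; all names are hypothetical and do not reflect Soprano's actual interfaces:

```python
# Toy sketch of pointer swizzling: an OID costs a store lookup only once;
# afterwards the reference behaves like a raw in-memory pointer.
# All names are hypothetical, not Soprano's actual API.

class ObjectStore:
    """Toy persistent store mapping OIDs to object state."""
    def __init__(self):
        self._pages = {}              # oid -> object state

    def load(self, oid):
        return self._pages[oid]      # in reality: a disk/page-buffer read

class PersistentRef:
    """A reference that swizzles itself on first dereference."""
    def __init__(self, store, oid):
        self._store = store
        self._oid = oid
        self._target = None           # unswizzled until first use

    def deref(self):
        if self._target is None:      # swizzle: one lookup, then pointer speed
            self._target = self._store.load(self._oid)
        return self._target
```

Subsequent calls to `deref()` skip the store entirely, which is what makes repeated traversals of a persistent object graph cheap.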


2019 ◽  
Vol 135 ◽  
pp. 04076 ◽  
Author(s):  
Marina Bolsunovskaya ◽  
Svetlana Shirokova ◽  
Aleksandra Loginova

This paper is devoted to the development and application of data storage systems (DSS) and of tools for managing such systems in order to predict failures and meet fault-tolerance specifications. DSS are now widely used for collecting data in Smart Home and Smart City management systems; for example, large data warehouses are utilized in traffic management systems. The results of an analysis of the current data storage market are presented, along with a project whose purpose is to develop a hardware and software complex for predicting failures in the storage system.


2018 ◽  
Vol 37 (3) ◽  
pp. 29-49
Author(s):  
Kumar Sharma ◽  
Ujjal Marjit ◽  
Utpal Biswas

Resource Description Framework (RDF) is a commonly used data model in the Semantic Web environment. Libraries and various other communities have been using the RDF data model to store valuable data after extracting it from traditional storage systems. However, because of the large volume of the data, processing and storing it is becoming a nightmare for traditional data-management tools. This challenge demands a scalable, distributed system that can manage data in parallel. In this article, a distributed solution is proposed for efficiently processing and storing the large volume of library linked data held in traditional storage systems. Apache Spark is used for parallel processing of large data sets, and a column-oriented schema is proposed for storing RDF data. The storage system is built on top of the Hadoop Distributed File System (HDFS) and uses the Apache Parquet format to store data in compressed form. The experimental evaluation showed that storage requirements were reduced significantly compared to Jena TDB, Sesame, RDF/XML, and N-Triples file formats. SPARQL queries are processed using Spark SQL against the compressed data. The experimental evaluation showed good query response times, which decrease significantly as the number of worker nodes increases.
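The pipeline the abstract describes (parse triples, store them column-oriented in Parquet on HDFS, query with Spark SQL) can be sketched in PySpark. The single subject/predicate/object table and the paths below are simplifying assumptions; the article's actual schema is more elaborate:

```python
# Sketch: RDF triples stored as compressed Parquet on HDFS, queried via
# Spark SQL. The flat subject/predicate/object layout and the paths are
# assumptions for illustration, not the article's exact schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdf-parquet").getOrCreate()

def parse_ntriple(line):
    # "subject predicate object ." -> (subject, predicate, object)
    s, p, o = line.rstrip(" .\n").split(" ", 2)
    return (s, p, o)

triples = (spark.sparkContext
           .textFile("hdfs:///data/library.nt")      # hypothetical input
           .filter(lambda l: l.strip() and not l.startswith("#"))
           .map(parse_ntriple))

df = spark.createDataFrame(triples, ["subject", "predicate", "object"])

# Parquet compresses column-wise, cutting storage versus N-Triples/RDF-XML.
df.write.mode("overwrite").parquet("hdfs:///data/library.parquet")

# A SPARQL-style pattern expressed in Spark SQL over the compressed data.
spark.read.parquet("hdfs:///data/library.parquet").createOrReplaceTempView("rdf")
spark.sql("""
    SELECT subject, object AS title
    FROM rdf
    WHERE predicate = '<http://purl.org/dc/terms/title>'
""").show()
```

Because Parquet stores each column contiguously and compressed, a query touching only `predicate` and `object` never reads subject bytes, which is where much of the storage and query-time saving comes from.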


2021 ◽  
Vol 17 (4) ◽  
pp. 1-21
Author(s):  
Devarshi Ghoshal ◽  
Lavanya Ramakrishnan

Scientific workflows in High Performance Computing (HPC) environments process large amounts of data. The storage hierarchy on HPC systems is getting deeper, driven by new technologies (NVRAM, SSDs, etc.), creating a need for new programming abstractions that let users seamlessly manage data at the workflow level on multi-tiered storage systems while providing optimal workflow performance and use of storage resources. In previous work, we introduced Managing Data on Tiered Storage for Scientific Workflows (MaDaTS), a software architecture that uses a Virtual Data Space (VDS) abstraction to hide the complexities of the underlying storage system while allowing users to control data-management strategies. In this article, we detail the data-centric programming abstractions that allow users to manage a workflow around its data on the storage layer. These abstractions simplify data management for scientific workflows on multi-tiered storage systems without affecting workflow performance or storage capacity. We measure the overheads introduced by, and the effectiveness of, the programming abstractions of MaDaTS. Our results show that these abstractions can make optimal use of the capacity of lower-capacity storage tiers and simplify data management without adding any performance overhead.
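To convey the flavor of a data-centric, VDS-style abstraction, here is a hypothetical sketch: the workflow is declared around named data objects, and the runtime decides tier placement and staging. Every name here (`VirtualDataSpace`, `add`, `task`, `run`) is illustrative and should not be read as MaDaTS's actual API:

```python
# Hypothetical sketch of a VDS-style, data-centric workflow definition.
# All names are illustrative, not MaDaTS's actual interfaces.

class VirtualDataSpace:
    """Virtual container: the workflow names data; storage tiers stay hidden."""
    def __init__(self):
        self.objects, self.tasks = {}, []

    def add(self, name):
        self.objects[name] = None          # placement resolved by the runtime
        return name

    def task(self, func, inputs, outputs):
        self.tasks.append((func, inputs, outputs))

    def run(self, strategy="workflow-aware"):
        # A real runtime would stage objects across burst buffer, scratch,
        # and archive tiers according to the chosen data-management strategy.
        for func, ins, outs in self.tasks:
            func(ins, outs)

vds = VirtualDataSpace()
raw = vds.add("raw_simulation_output")
stats = vds.add("derived_statistics")
vds.task(lambda i, o: print("analyze", i, "->", o), [raw], [stats])
vds.run()
```

The point of the abstraction is that swapping the data-management strategy changes where and when data is staged without touching the workflow definition itself.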


Author(s):  
Bradley Settlemyer ◽  
George Amvrosiadis ◽  
Philip Carns ◽  
Robert Ross

High-performance computing (HPC) storage systems are a key component of the success of HPC to date. Recently, we have seen major developments in storage-related technologies, as well as changes in how HPC platforms are used, especially in relation to artificial intelligence and experimental data analysis workloads. These developments merit a revisit of HPC storage system architectural designs. In this paper we discuss the drivers, identify the key challenges these developments pose to the status quo, and discuss directions future research might take to unlock the potential of new technologies for the breadth of HPC applications.


A performance benchmarking tool for a distributed streaming storage system should aim to achieve the maximum possible throughput from the streaming storage system by pushing data into it at massive rates. This paper details the design and implementation of a high-performance benchmark tool for the Kafka and Pravega streaming storage systems. The benchmark tool presented in this paper supports multiple writers and readers. Pravega streaming storage is evaluated against Kafka with respect to performance.
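The core of such a tool is a set of concurrent writers pushing fixed-size records and a clock around the whole run. A minimal sketch for the Kafka side, assuming a broker at localhost:9092 and the kafka-python package (the paper's tool also covers Pravega and readers, which this sketch omits):

```python
# Minimal multi-writer throughput benchmark sketch for Kafka.
# Assumes a broker at localhost:9092 and the kafka-python package;
# topic name, payload size, and writer count are arbitrary choices.
import time
from concurrent.futures import ThreadPoolExecutor
from kafka import KafkaProducer

RECORD = b"x" * 1024              # 1 KiB payload
RECORDS_PER_WRITER = 100_000
WRITERS = 4

def writer(_):
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for _ in range(RECORDS_PER_WRITER):
        producer.send("bench-topic", RECORD)
    producer.flush()              # block until all records are acknowledged

start = time.time()
with ThreadPoolExecutor(max_workers=WRITERS) as pool:
    list(pool.map(writer, range(WRITERS)))
elapsed = time.time() - start

total_mb = WRITERS * RECORDS_PER_WRITER * len(RECORD) / 1e6
print(f"throughput: {total_mb / elapsed:.1f} MB/s")
```

Timing only after `flush()` returns ensures the figure reflects acknowledged writes rather than records still buffered in the client.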


2021 ◽  
Author(s):  
Marco Kulüke ◽  
Fabian Wachsmann ◽  
Georg Leander Siemund ◽  
Hannes Thiemann ◽  
Stephan Kindermann

This study provides guidance to data providers on how to transfer existing netCDF data from a hierarchical storage system to Zarr on an object storage system.

In recent years, object storage systems have become an alternative to traditional hierarchical file systems, because they are easily scalable and offer faster data retrieval than hierarchical storage systems.

Earth system sciences, and climate science in particular, handle large amounts of data. These data are usually represented as multi-dimensional arrays and traditionally stored in netCDF format on hierarchical file systems. However, the current netCDF-4 format is not yet optimized for object storage systems: netCDF data transfers from an object store can only be conducted at the file level, which results in heavy download volumes. The Zarr format can mitigate this problem, since direct chunk and metadata access reduces data transfers and hence increases input/output speed in parallel computing environments.

As one of the largest climate data providers worldwide, the German Climate Computing Center (DKRZ) continuously works toward efficient ways to make data accessible to users. This use case shows the conversion and transfer of a subset of the Coupled Model Intercomparison Project Phase 6 (CMIP6) climate data archive from netCDF on the hierarchical file system to Zarr in the OpenStack object store, known as Swift, using the Zarr Python package. Conclusively, this study evaluates to what extent Zarr-formatted climate data on an object storage system are a meaningful addition to the existing high-performance computing environment of the DKRZ.
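The conversion step itself is compact with xarray and zarr (dask is needed for the chunking). The input file and chunk sizes below are illustrative, and a real transfer to Swift would write through an fsspec-compatible object-store backend rather than a local directory:

```python
# Sketch of the netCDF-to-Zarr conversion, assuming xarray, dask, and zarr.
# The path and chunk sizes are illustrative; DKRZ's Swift transfer would use
# an fsspec-compatible store instead of this local directory.
import xarray as xr

ds = xr.open_dataset("cmip6_subset.nc")            # hypothetical CMIP6 file

# Chunking determines the unit of direct access in the object store:
# readers later fetch only the chunks their computation touches.
ds = ds.chunk({"time": 120, "lat": 96, "lon": 192})

ds.to_zarr("cmip6_subset.zarr", mode="w")

# Reading back: metadata and individual chunks are accessed directly,
# without downloading whole files.
print(xr.open_zarr("cmip6_subset.zarr"))
```

Choosing chunk sizes is the main tuning decision: chunks aligned with typical access patterns (for example, time slices for time-series analysis) keep per-request volumes small without multiplying the number of object-store requests.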

