A Benchmark for Suitability of Alluxio over Spark

Big data applications play an important role in real-time data processing. Apache Spark is a data processing framework with an in-memory data engine that quickly processes large data sets. It can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. Spark’s in-memory processing cannot share data between applications, and RAM is insufficient for storing petabytes of data. Alluxio is a virtual distributed storage system that leverages memory for data storage and provides faster access to data in different storage systems. Alluxio helps to speed up data-intensive Spark applications across various storage systems. In this work, the performance of applications on Spark, as well as on Spark running over Alluxio, has been studied with respect to several storage formats (Parquet, ORC, CSV, and JSON) and four types of queries from the Star Schema Benchmark (SSB). A benchmark is developed to suggest the suitability of the Spark-Alluxio combination for big data applications. It is found that Alluxio is suitable for applications that use databases larger than 2.6 GB storing data in JSON and CSV formats. Spark is found suitable for applications that use storage formats such as Parquet and ORC with database sizes less than 2.6 GB.
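The two findings reported at the end of this abstract can be written down as a small rule of thumb. The sketch below is only an illustration of those two stated cases; the function name `suggest_engine`, the return labels, and the "undetermined" fallback for combinations the abstract does not cover are assumptions, not part of the benchmark itself.

```python
# Threshold reported in the abstract (database size in GB).
SIZE_THRESHOLD_GB = 2.6


def suggest_engine(storage_format: str, size_gb: float) -> str:
    """Map a (format, size) pair to the setup the abstract reports as suitable.

    Hypothetical helper: only the two cases stated in the abstract are
    encoded; every other combination is reported as "undetermined".
    """
    fmt = storage_format.strip().lower()
    if fmt in {"json", "csv"} and size_gb > SIZE_THRESHOLD_GB:
        return "spark+alluxio"
    if fmt in {"parquet", "orc"} and size_gb < SIZE_THRESHOLD_GB:
        return "spark"
    return "undetermined"


if __name__ == "__main__":
    for fmt, size in [("CSV", 10.0), ("Parquet", 1.0), ("ORC", 5.0)]:
        print(fmt, size, "->", suggest_engine(fmt, size))
```

Cases such as large Parquet data sets are deliberately left undetermined, because the abstract gives no finding for them.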


Distributed Storage Strategy and Visual Analysis for Economic Big Data

Journal of Mathematics ◽

10.1155/2021/3224190 ◽

2021 ◽

Vol 2021 ◽

pp. 1-13

Author(s):

Xiangli Chang ◽

Hailang Cui

Keyword(s):

Big Data ◽

Data Storage ◽

Visual Analysis ◽

Distributed Storage ◽

Storage System ◽

Service Level Agreement ◽

Wiener Filter ◽

Service Level ◽

Prototype System ◽

Data Types

With the increasing popularity of Internet-based services and the large number of services hosted on cloud platforms, a more powerful back-end storage system is needed to support these services. At present, it is very difficult or impossible to implement a distributed storage system that meets all the above assumptions. Therefore, the focus of research is to restrict different characteristics in order to design different distributed storage solutions for different usage scenarios. Economic big data should meet the basic requirements of high storage efficiency and fast retrieval speed. The large number of small files and the diversity of file types pose severe challenges for the storage and retrieval of economic big data. This paper is oriented to the application requirements of cross-modal analysis of economic big data. According to the source and characteristics of economic big data, the data types are analyzed and the database storage architecture and data storage structure of economic big data are designed. Taking into account the spatial, temporal, and semantic characteristics of economic big data, this paper proposes a unified coding method based on the spatiotemporal data multilevel division strategy combined with Geohash and Hilbert and spatiotemporal semantic constraints. A prototype system was constructed based on MongoDB, and the performance of the multilevel partition algorithm proposed in this paper was verified by the prototype system based on the realization of data storage management functions. The Wiener distributed memory based on the principle of the Wiener filter is used to store the workload of each distributed storage window in a distributed manner. For distributed storage workloads, this article adopts specific types of workloads. According to its periodicity, the workload is divided into distributed storage windows of specific duration. At the beginning of each distributed storage window, distributed storage is distributed to the next distributed storage window. Experiments and tests have verified the distributed storage strategy proposed in this article, which proves that the Wiener distributed storage solution can save platform resources and configuration costs while ensuring the Service Level Agreement (SLA).
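For readers unfamiliar with Geohash, one of the two spatial codes the abstract names, the following is a minimal sketch of the standard Geohash encoding. It is not the paper's unified coding method: the Hilbert component and the spatiotemporal semantic constraints are not reproduced, and the function name `geohash_encode` is an assumption.

```python
# Standard Geohash base-32 alphabet (digits plus letters, without a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Encode a latitude/longitude pair as a Geohash string.

    Bits are produced by repeatedly halving the longitude and latitude
    intervals in alternation (longitude first); every 5 bits become one
    base-32 character.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # even-numbered bits refine longitude, odd ones latitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, precision=8))
```

A shorter code is always a prefix of a longer code for the same point, which is the property that makes Geohash usable as a multilevel partition key: truncating the string moves up one level in the spatial hierarchy.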


State of current data storage market and development of tools for increasing data storage systems reliability

E3S Web of Conferences ◽

10.1051/e3sconf/201913504076 ◽

2019 ◽

Vol 135 ◽

pp. 04076 ◽

Cited By ~ 2

Author(s):

Marina Bolsunovskaya ◽

Svetlana Shirokova ◽

Aleksandra Loginova

Keyword(s):

Data Storage ◽

Traffic Management ◽

Smart Home ◽

Storage Systems ◽

Storage System ◽

Large Data ◽

Current Data ◽

Management Systems ◽

Software Complex ◽

Market State

This paper is devoted to the problem of developing and applying data storage systems (DSS) and tools for managing such systems to predict failures and provide fault tolerance specifications. Nowadays DSS are widely used for collecting data in Smart Home and Smart Cities management systems. For example, large data warehouses are utilized in traffic management systems. The results of an analysis of the current state of the data storage market are shown, and a project to develop a hardware and software complex for predicting failures in the storage system is presented.


Resource Provisioning and Scheduling of Big Data Processing Jobs

Advances in Data Mining and Database Management - Handbook of Research on Big Data Storage and Visualization Techniques ◽

10.4018/978-1-5225-3142-5.ch014 ◽

2018 ◽

pp. 382-401

Author(s):

Rajni Aron ◽

Deepak Kumar Aggarwal

Keyword(s):

Cloud Computing ◽

Big Data ◽

Data Processing ◽

Research Area ◽

Resource Provisioning ◽

Time Data ◽

Big Data Processing ◽

Cloud Resource Management ◽

Big Data Applications ◽

Cloud Resource

Cloud Computing has become a buzzword in the IT industry. Cloud Computing, which provides inexpensive computing resources on a pay-as-you-go basis, is rapidly gaining momentum as a substitute for traditional Information Technology (IT) based organizations. Therefore, the increased utilization of Clouds makes the execution of Big Data processing jobs a vital research area. As more and more users have started to store and process their real-time data in Cloud environments, Resource Provisioning and Scheduling of Big Data processing jobs becomes a key consideration for efficient execution of Big Data applications. This chapter discusses the fundamental concepts behind Cloud Computing and Big Data and the relationship between them. This chapter will help researchers find the important characteristics of Cloud Resource Management Systems for handling Big Data processing jobs and will also help them select the most suitable technique for processing Big Data jobs in a Cloud Computing environment.


PRELIMINARY STUDY ON APPLICATION OF MAX PLUS ALGEBRA IN DISTRIBUTED STORAGE SYSTEM THROUGH NETWORK CODING

Jurnal Sains Dasar ◽

10.21831/jsd.v4i1.8420 ◽

2016 ◽

Vol 4 (1) ◽

Author(s):

Agus Maman Abadi ◽

Musthofa Musthofa ◽

Emut Emut

Keyword(s):

Network Coding ◽

Data Storage ◽

Algebraic Structure ◽

Storage Systems ◽

Distributed Storage ◽

Storage System ◽

Distributed Data Storage ◽

Erasure Code ◽

Distributed Storage Systems ◽

Set Up

The increasing need for techniques for storing big data presents a new challenge. One way to address this challenge is the use of distributed storage systems. One strategy implemented in distributed data storage systems is the use of an Erasure Code applied to network coding. The code used in this technique is based on the algebraic structure called a vector space. Some studies have also been carried out to create codes based on other algebraic structures, such as modules. In this study, we try to set up a code based on the semimodule, an algebraic structure that generalizes the module, by utilizing the max operation and the sum operation of max plus algebra. The results of this study indicate that the max operation and the addition operation of max plus algebra cannot be used to establish a semimodule code, but by modifying the operation "+" as "min", we get a code based on a semimodule. Keywords: code, distributed storage systems, network coding, semimodule, max plus algebra
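As background for the operations this abstract refers to, here is a minimal sketch of max plus arithmetic, in which "addition" is max and "multiplication" is ordinary addition. It only illustrates the standard algebra; it does not reproduce the paper's code construction or its modified "min" operation, and the helper names `oplus`, `otimes`, and `maxplus_matvec` are assumptions.

```python
# Identity element of max-plus "addition" (max): negative infinity.
NEG_INF = float("-inf")


def oplus(a: float, b: float) -> float:
    """Max-plus addition: a (+) b = max(a, b)."""
    return max(a, b)


def otimes(a: float, b: float) -> float:
    """Max-plus multiplication: a (x) b = a + b, with identity 0."""
    return a + b


def maxplus_matvec(matrix, vector):
    """Max-plus matrix-vector product: result[i] = max_j (matrix[i][j] + vector[j])."""
    return [max(otimes(a, x) for a, x in zip(row, vector)) for row in matrix]


if __name__ == "__main__":
    A = [[1, 2], [3, 0]]
    x = [0, 1]
    print(maxplus_matvec(A, x))
    # Distributivity of (x) over (+), one of the semiring laws:
    a, b, c = 2, 5, 7
    print(otimes(a, oplus(b, c)) == oplus(otimes(a, b), otimes(a, c)))
```

Because max has no inverse operation, these structures form semirings and semimodules rather than fields and vector spaces, which is why codes over them need a separate treatment.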


Overview of Big-Data-Intensive Storage and Its Technologies

Advances in Data Mining and Database Management - Handbook of Research on Big Data Storage and Visualization Techniques ◽

10.4018/978-1-5225-3142-5.ch002 ◽

2018 ◽

pp. 33-74

Author(s):

Richard S. Segall ◽

Jeffrey S. Cook

Keyword(s):

Big Data ◽

Data Storage ◽

Storage Systems ◽

Storage System ◽

Management Strategies ◽

Sensor Data ◽

Data Intensive Computing ◽

Data Intensive ◽

Future Challenges ◽

Data Storage System

This chapter deals with a detailed discussion on the storage systems for data-intensive computing using Big Data. The chapter begins with a brief introduction about data-intensive computing and types of parallel processing approaches. It also highlights the points that display how data-intensive computing systems differ from other forms of computing. A discussion on the importance of Big Data computing is put forth. The current and future challenges of storage in genomics are discussed in detail. Also, storage and data management strategies are given. The chapter's focus is then on the software challenges for storage. Storage use cases are provided like DataDirect Networks, SDSC, etc. The list of storage tools and their details are provided. A small section discusses the sensor data storage system. Then a table is provided that shows the top 10 cloud storage systems for data-intensive computing using Big Data in the world. Top 500 Big Data storage servers statistics are also displayed effectively by the images from Top500 website.


Resource Provisioning and Scheduling of Big Data Processing Jobs

Research Anthology on Architectures, Frameworks, and Integration Strategies for Distributed and Cloud Computing ◽

10.4018/978-1-7998-5339-8.ch083 ◽

2021 ◽

pp. 1694-1713

Author(s):

Rajni Aron ◽

Deepak Kumar Aggarwal

Keyword(s):

Cloud Computing ◽

Big Data ◽

Data Processing ◽

Research Area ◽

Resource Provisioning ◽

Time Data ◽

Big Data Processing ◽

Cloud Resource Management ◽

Big Data Applications ◽

Cloud Resource


Classification and Processing of Big Data in Sensor Network Based on Suffix Tree Clustering

International Journal of Online and Biomedical Engineering (iJOE) ◽

10.3991/ijoe.v15i01.9785 ◽

2019 ◽

Vol 15 (01) ◽

pp. 171

Author(s):

Jun Tian ◽

Lirong Huang

Keyword(s):

Big Data ◽

Standard Deviation ◽

Sensor Network ◽

Data Storage ◽

Large Scale ◽

Suffix Tree ◽

Distributed Storage ◽

Storage System ◽

Base Station ◽

Universal System

Aiming at the perception data acquired by the widely used, fast-developing but still imperfect wireless sensor network system, a relatively complete and universal system for the collection, transmission, storage and cluster analysis of perception data is designed. Perception data is spliced and compressed at the node and reconstructed at the base station, the acquisition of perception data and the energy consumption of transmission are optimized, the distributed storage system is established, and the data reading mechanism and data storage architecture are designed accordingly. The data acquisition protocol and the traditional protocol, the storage system itself and the Oracle database system, and the Standard Deviation and Eigensystem Realization Algorithm are respectively compared in tests. Based on the Standard Deviation algorithm, suffix tree clustering is carried out: the general steps of suffix tree clustering are studied and adapted to the structure of perception data and the characteristics of storage, and the data classification operation based on suffix tree clustering is completed. The results show that the proposed Standard Deviation algorithm not only inherits the efficiency of the classical algorithm for processing big data, but also has an obvious effect on large-scale discrete data processing, and the efficiency is obviously improved compared with the traditional method.


A SECURE DATA FORWARDING SCHEMA FOR CLOUD STORAGE SYSTEMS

International Journal of Smart Sensor and Adhoc Network. ◽

10.47893/ijssan.2013.1187 ◽

2013 ◽

pp. 10-14

Author(s):

G.CHINNA PULLAIAH ◽

DILIP VENKATA KUMAR VENGALA

Keyword(s):

Cloud Computing ◽

Data Storage ◽

Economies Of Scale ◽

Distributed Storage ◽

Storage System ◽

Large Data ◽

Code Word ◽

Seamless Integration ◽

Dynamic Data ◽

Erasure Code

Cloud Computing has been envisioned as the next-generation architecture of IT Enterprise. It moves the application software and databases to centralized large data centers, where the management of the data and services may not be fully trustworthy. This unique paradigm brings about many new security challenges, which have not been well understood. This work studies the problem of ensuring the integrity of data storage in Cloud Computing. In particular, we consider the task of allowing a threshold proxy re-encryption, on behalf of the cloud client, to verify the integrity of the dynamic data stored in the cloud. The introduction of TPA eliminates the involvement of the client in auditing whether his data stored in the cloud are indeed intact, which can be important in achieving economies of scale for Cloud Computing. The distributed storage system not only supports secure and robust data storage and retrieval, but also lets a user forward his data in the storage servers to another user without retrieving the data back, since services in Cloud Computing are not limited to archiving or backing up data. While prior works on ensuring remote data integrity often lack the support of either public auditability or dynamic data operations, this paper achieves both. We first identify the difficulties and potential security problems of direct extensions with fully dynamic data updates from prior works and then show how to construct an elegant verification scheme for the seamless integration of these two salient features in our protocol design. A decentralized erasure code is an erasure code that independently computes each code word symbol for a message, where TPA can perform multiple auditing tasks simultaneously. Extensive security and performance analyses show that the proposed schemes are highly efficient and provably secure.
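To make the term "erasure code" concrete, the sketch below shows the simplest possible instance: a single XOR parity block that lets any one lost block be rebuilt from the survivors. This is a swapped-in textbook example, not the decentralized erasure code or the proxy re-encryption scheme of the paper; the function names `xor_blocks`, `encode`, and `recover` are assumptions.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stored_blocks, lost_index):
    """Rebuild the block at lost_index from all the other stored blocks."""
    survivors = [b for i, b in enumerate(stored_blocks) if i != lost_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    stored = encode([b"abcd", b"efgh", b"ijkl"])
    # Pretend the server holding block 1 failed and rebuild its content.
    print(recover(stored, 1))
```

Practical storage systems use codes that tolerate several simultaneous losses, but the principle is the same: each stored symbol is a function of the message, and enough surviving symbols determine the rest.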


A New Hybrid Storage System Base on Openstack

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.556-562.5371 ◽

2014 ◽

Vol 556-562 ◽

pp. 5371-5376

Author(s):

Ding Wei Wu ◽

Qiang Wu ◽

Xi Cheng Fu ◽

Zhi Zhong Ye ◽

Jia Lun Lin

Keyword(s):

Big Data ◽

Data Storage ◽

High Performance ◽

File System ◽

Storage Systems ◽

Low Cost ◽

Storage System ◽

Small Data ◽

Hybrid Storage ◽

Hybrid Storage System

In recent years, hybrid storage has gradually become a hotspot in data storage research owing to its high performance and low cost. An OpenStack-based hybrid storage system is presented in this paper. In this hybrid storage system, data is divided according to its characteristics into small data, big data and temporary data; a storage strategy combining a database storage system, a virtual file system and the servers' file system is designed accordingly. In the iCampus project, the proposed hybrid storage system shows better performance and higher efficiency than traditional single storage systems.
