scholarly journals A Benchmark for Suitability of Alluxio over Spark

Big data applications play an important role in real time data processing. Apache Spark is a data processing framework with in-memory data engine that quickly processes large data sets. It can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. Spark’s in-memory processing cannot share data between the applications and hence, the RAM memory will be insufficient for storing petabytes of data. Alluxio is a virtual distributed storage system that leverages memory for data storage and provides faster access to data in different storage systems. Alluxio helps to speed up data intensive Spark applications, with various storage systems. In this work, the performance of applications on Spark as well as Spark running over Alluxio have been studied with respect to several storage formats such as Parquet, ORC, CSV, and JSON; and four types of queries from Star Schema Benchmark (SSB). A benchmark is evolved to suggest the suitability of Spark Alluxio combination for big data applications. It is found that Alluxio is suitable for applications that use databases of size more than 2.6 GB storing data in JSON and CSV formats. Spark is found suitable for applications that use storage formats such as parquet and ORC with database sizes less than 2.6GB.

2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Xiangli Chang ◽  
Hailang Cui

With the increasing popularity of a large number of Internet-based services and a large number of services hosted on cloud platforms, a more powerful back-end storage system is needed to support these services. At present, it is very difficult or impossible to implement a distributed storage to meet all the above assumptions. Therefore, the focus of research is to limit different characteristics to design different distributed storage solutions to meet different usage scenarios. Economic big data should have the basic requirements of high storage efficiency and fast retrieval speed. The large number of small files and the diversity of file types make the storage and retrieval of economic big data face severe challenges. This paper is oriented to the application requirements of cross-modal analysis of economic big data. According to the source and characteristics of economic big data, the data types are analyzed and the database storage architecture and data storage structure of economic big data are designed. Taking into account the spatial, temporal, and semantic characteristics of economic big data, this paper proposes a unified coding method based on the spatiotemporal data multilevel division strategy combined with Geohash and Hilbert and spatiotemporal semantic constraints. A prototype system was constructed based on Mongo DB, and the performance of the multilevel partition algorithm proposed in this paper was verified by the prototype system based on the realization of data storage management functions. The Wiener distributed memory based on the principle of Wiener filter is used to store the workload of each workload distributed storage window in a distributed manner. For distributed storage workloads, this article adopts specific types of workloads. According to its periodicity, the workload is divided into distributed storage windows of specific duration. At the beginning of each distributed storage window, distributed storage is distributed to the next distributed storage window. Experiments and tests have verified the distributed storage strategy proposed in this article, which proves that the Wiener distributed storage solution can save platform resources and configuration costs while ensuring Service Level Agreement (SLA).


2019 ◽  
Vol 135 ◽  
pp. 04076 ◽  
Author(s):  
Marina Bolsunovskaya ◽  
Svetlana Shirokova ◽  
Aleksandra Loginova

This paper is devoted to the problem of developing and application of data storage systems (DSS) and tools for managing such systems to predict failures and provide fault tolerance specifications. Nowadays DSS are widely used for collecting data in Smart Home and Smart Cites management systems. For example, large data warehouses are utilized in traffic management systems. The results of the current data storage market state analysis are shown, and the project the purpose of which is to develop a hardware and software complex to predict failures in the storage system is presented.


Author(s):  
Rajni Aron ◽  
Deepak Kumar Aggarwal

Cloud Computing has become a buzzword in the IT industry. Cloud Computing which provides inexpensive computing resources on the pay-as-you-go basis is promptly gaining momentum as a substitute for traditional Information Technology (IT) based organizations. Therefore, the increased utilization of Clouds makes an execution of Big Data processing jobs a vital research area. As more and more users have started to store/process their real-time data in Cloud environments, Resource Provisioning and Scheduling of Big Data processing jobs becomes a key element of consideration for efficient execution of Big Data applications. This chapter discusses the fundamental concepts supporting Cloud Computing & Big Data terms and the relationship between them. This chapter will help researchers find the important characteristics of Cloud Resource Management Systems to handle Big Data processing jobs and will also help to select the most suitable technique for processing Big Data jobs in Cloud Computing environment.


2016 ◽  
Vol 4 (1) ◽  
Author(s):  
Agus Maman Abadi ◽  
Musthofa Musthofa ◽  
Emut Emut

The increasing need in techniques of storing big data presents a new challenge. One way to address this challenge is the use of distributed storage systems. One strategy that implemented in distributed data storage systems is the use of Erasure Code which applied to network coding. The code used in this technique is based on the algebraic structure which is called as vector space. Some studies have also been carried out to create code that is based on other algebraic structures such as module.  In this study, we are going to try to set up a code based on the algebraic structure which is a generalization of the module that is semimodule by utilizing the max operations and sum operations at max plus algebra. The results of this study indicate that the max operation and the addition operation on max plus algebra cannot be used to establish a semimodule code, but by modifying the operation "+" as "min", we get a code based on semimodule. Keywords:   code, distributed storage systems, network coding, semimodule, max plus algebra


Author(s):  
Richard S. Segall ◽  
Jeffrey S. Cook

This chapter deals with a detailed discussion on the storage systems for data-intensive computing using Big Data. The chapter begins with a brief introduction about data-intensive computing and types of parallel processing approaches. It also highlights the points that display how data-intensive computing systems differ from other forms of computing. A discussion on the importance of Big Data computing is put forth. The current and future challenges of storage in genomics are discussed in detail. Also, storage and data management strategies are given. The chapter's focus is then on the software challenges for storage. Storage use cases are provided like DataDirect Networks, SDSC, etc. The list of storage tools and their details are provided. A small section discusses the sensor data storage system. Then a table is provided that shows the top 10 cloud storage systems for data-intensive computing using Big Data in the world. Top 500 Big Data storage servers statistics are also displayed effectively by the images from Top500 website.


Author(s):  
Rajni Aron ◽  
Deepak Kumar Aggarwal

Cloud Computing has become a buzzword in the IT industry. Cloud Computing which provides inexpensive computing resources on the pay-as-you-go basis is promptly gaining momentum as a substitute for traditional Information Technology (IT) based organizations. Therefore, the increased utilization of Clouds makes an execution of Big Data processing jobs a vital research area. As more and more users have started to store/process their real-time data in Cloud environments, Resource Provisioning and Scheduling of Big Data processing jobs becomes a key element of consideration for efficient execution of Big Data applications. This chapter discusses the fundamental concepts supporting Cloud Computing & Big Data terms and the relationship between them. This chapter will help researchers find the important characteristics of Cloud Resource Management Systems to handle Big Data processing jobs and will also help to select the most suitable technique for processing Big Data jobs in Cloud Computing environment.


Author(s):  
Jun Tian ◽  
Lirong Huang

<span lang="EN-US">Aiming at the perception data acquired by the widely used, fast-developing but still not perfect wireless sensor network system, a relatively complete and universal system for the collection, transmission, storage and cluster analysis of perception data is designed. P</span><span lang="EN-US">erception data is spliced and compressed at the node and reconstructed at the base station, the problem of the acquisition of </span><span lang="EN-US">perception data</span><span lang="EN-US"> and energy consumption of transmission is optimized, the distributed storage system is established, and the data reading mechanism and data storage architecture are designed accordingly.</span><span lang="EN-US">The data acquisition protocol and the traditional protocol, the storage system itself and the Oracle database system, and <a name="_Hlk527548018"></a>Standard Deviation and Eigensystem Realization Algorithm are respectively adopted for comparison test.</span><span lang="EN-US">Based on Standard Deviation algorithm, the operation of suffix tree clustering is carried out, and the general steps of suffix tree clustering are studied and the structure of perception data and the characteristics of storage are adapted, and the data classification operation based on suffix tree clustering is completed.</span><span lang="EN-US"> The results show that </span><span lang="EN-US">proposed Standard Deviationalgorithm algorithm not only inherits the efficiency of the classical algorithm for processing big data, but also has obvious effect on large-scale discrete data processing, and the efficiency is obviously improved compared with the traditional method.</span>


Author(s):  
G.CHINNA PULLAIAH ◽  
DILIP VENKATA KUMAR VENGALA

Cloud Computing has been envisioned as the next-generation architecture of IT Enterprise. It moves the application software and databases to the centralized large data centers, where the management of the data and services may not be fully trustworthy. This unique paradigm brings about many new security challenges, which have not been well understood. This work studies the problem of ensuring the integrity of data storage in Cloud Computing. In particular, we consider the task of allowing a threshold proxy re-encryption, on behalf of the cloud client, to verify the integrity of the dynamic data stored in the cloud. The introduction of TPA eliminates the involvement of the client through the auditing of whether his data stored in the cloud are indeed intact, which can be important in achieving economies of scale for Cloud Computing. The distributed storage system not only supports secure and robust data storage and retrieval, but also lets a user forward his data in the storage servers to another user without retrieving the data back, since services in Cloud Computing are not limited to archive or backup data only. While prior works on ensuring remote data integrity often lacks the support of either public Audit ability or dynamic data operations, this paper achieves both. We first identify the difficulties and potential security problems of direct extensions with fully dynamic data updates from prior works and then show how to construct an elegant verification scheme for the seamless integration of these two salient features in our protocol design A decentralized erasure code is an erasure code that independently computes each code word symbol for a message, where TPA can perform multiple auditing tasks simultaneously. Extensive security and performance analysis show that the proposed schemes are highly efficient and provably secure.


2014 ◽  
Vol 556-562 ◽  
pp. 5371-5376
Author(s):  
Ding Wei Wu ◽  
Qiang Wu ◽  
Xi Cheng Fu ◽  
Zhi Zhong Ye ◽  
Jia Lun Lin

In recent years, hybrid storage has gradually become a hotspot in the research of data storage owing to its high-performance and low cost. An OpenStack-based hybrid storage system is presented in this paper. According to the characteristics, the data is divided into small data, big data and temporary data in this hybrid storage system; meanwhile a storage strategy, combining database storage system, the virtual file system and servers file system, is designed. In the application of iCampus project, this proposed hybrid storage system shows better performance and higher efficiency than the traditional single storage systems.


Sign in / Sign up

Export Citation Format

Share Document