Overview of Big Data-Intensive Storage and its Technologies for Cloud and Fog Computing

Computing systems are becoming increasingly data-intensive because of the explosion of data and the need to process it, so storage management is critical to application performance in such systems. However, when the resource management frameworks in these systems lack support for storage management, applications suffer unpredictable performance degradation under input/output (I/O) contention. Storage management in data-intensive systems is therefore a challenge, and Big Data plays a major role in storage systems for data-intensive computing. This article addresses these difficulties and discusses High Performance Computing (HPC) systems, background on storage systems for data-intensive applications, storage patterns and storage mechanisms for Big Data, the Top 10 Cloud Storage Systems for data-intensive computing in use today, and the interface between Big Data Intensive Storage and Cloud/Fog Computing. Big Data storage and server statistics and usage distributions for the Top 500 Supercomputers in the world are also presented graphically and discussed as data-intensive storage components that can be interfaced with fog-to-cloud interactions and enabling protocols.

Overview of Big-Data-Intensive Storage and Its Technologies

Advances in Data Mining and Database Management - Handbook of Research on Big Data Storage and Visualization Techniques, 2018, pp. 33-74

DOI: 10.4018/978-1-5225-3142-5.ch002

Author(s): Richard S. Segall, Jeffrey S. Cook

Keyword(s): Big Data, Data Storage, Storage Systems, Storage System, Management Strategies, Sensor Data, Data Intensive Computing, Data Intensive, Future Challenges, Data Storage System

This chapter discusses in detail the storage systems for data-intensive computing using Big Data. It begins with a brief introduction to data-intensive computing and the types of parallel processing approaches, and highlights how data-intensive computing systems differ from other forms of computing. The importance of Big Data computing is then discussed. The current and future challenges of storage in genomics are covered in detail, and storage and data management strategies are given. The chapter then focuses on the software challenges for storage. Storage use cases such as DataDirect Networks and SDSC are provided, along with a list of storage tools and their details. A short section discusses the sensor data storage system. A table then lists the top 10 cloud storage systems in the world for data-intensive computing using Big Data. Statistics for the Top 500 Big Data storage servers are also presented using images from the Top500 website.

Distributed Storage Systems for Data Intensive Computing

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Data Intensive Distributed Computing, 2012, pp. 95-117

DOI: 10.4018/978-1-61520-971-2.ch004

Author(s): Sudharshan S. Vazhkudai, Ali R. Butt, Xiaosong Ma

Keyword(s): Storage Systems, Distributed Storage, Data Availability, Data Intensive Computing, Application Performance, Data Staging, Data Intensive, Distributed Storage Systems, User Access, Day To Day Operations

In this chapter, the authors present an overview of the utility of distributed storage systems in supporting modern applications that are increasingly becoming data intensive. Their coverage of distributed storage systems is based on the requirements imposed by data intensive computing and not a mere summary of storage systems. To this end, they delve into several aspects of supporting data-intensive analysis, such as data staging, offloading, checkpointing, and end-user access to terabytes of data, and illustrate the use of novel techniques and methodologies for realizing distributed storage systems therein. The data deluge from scientific experiments, observations, and simulations is affecting all of the aforementioned day-to-day operations in data-intensive computing. Modern distributed storage systems employ techniques that can help improve application performance, alleviate I/O bandwidth bottleneck, mask failures, and improve data availability. They present key guiding principles involved in the construction of such storage systems, associated tradeoffs, design, and architecture, all with an eye toward addressing challenges of data-intensive scientific applications. They highlight the concepts involved using several case studies of state-of-the-art storage systems that are currently available in the data-intensive computing landscape.

Marmot: A Hadoop-based High Performance Data Storage Management System for Processing Geospatial or Geo-Spatial Big Data

Journal of Korean Society for Geospatial Information System, 2018, Vol 26 (1), pp. 3-10

DOI: 10.7319/kogsis.2018.26.1.003

Author(s): Jung Hee Jo, Kang-Woo Lee

Keyword(s): Big Data, Data Storage, Management System, High Performance, Storage Management, Performance Data, Spatial Big Data

Storage Management of Data-intensive Computing Systems

2016

DOI: 10.25148/etd.fidc000251

Author(s): Yiqi Xu

Keyword(s): Storage Management, Data Intensive Computing, Computing Systems, Data Intensive

Louisiana: a model for advancing regional e-Research through cyberinfrastructure

Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences, 2009, Vol 367 (1897), pp. 2459-2469

DOI: 10.1098/rsta.2009.0037

Cited By ~ 3

Author(s): Daniel S. Katz, Gabrielle Allen, Ricardo Cortez, Carolina Cruz-Neira, Raju Gottumukkala, ...

Keyword(s): Data Storage, High Performance, Storage Systems, The State, Collaborative Effort, Data Repositories, Computing Systems, Significant Difference, Software Programs

Louisiana researchers and universities are leading a concentrated, collaborative effort to advance statewide e-Research through a new cyberinfrastructure: computing systems, data storage systems, advanced instruments and data repositories, visualization environments and people, all linked together by software programs and high-performance networks. This effort has led to a set of interlinked projects that have started making a significant difference in the state, and has created an environment that encourages increased collaboration, leading to new e-Research. This paper describes the overall effort, the new projects and environment and the results to date.

A New Hybrid Storage System Base on Openstack

Applied Mechanics and Materials, 2014, Vol 556-562, pp. 5371-5376

DOI: 10.4028/www.scientific.net/amm.556-562.5371

Author(s): Ding Wei Wu, Qiang Wu, Xi Cheng Fu, Zhi Zhong Ye, Jia Lun Lin

Keyword(s): Big Data, Data Storage, High Performance, File System, Storage Systems, Low Cost, Storage System, Small Data, Hybrid Storage, Hybrid Storage System

In recent years, hybrid storage has gradually become a research hotspot in data storage owing to its high performance and low cost. This paper presents an OpenStack-based hybrid storage system. In this system, data is divided according to its characteristics into small data, big data, and temporary data, and a storage strategy is designed that combines a database storage system, a virtual file system, and a server file system. When applied in the iCampus project, the proposed hybrid storage system shows better performance and higher efficiency than traditional single storage systems.

The bounds of the distributed data-intensive computing systems

Pollack Periodica, 2007, Vol 2 (Supplement 1), pp. 85-96

DOI: 10.1556/pollack.2.2007.s.8

Cited By ~ 1

Author(s): Antal Buza

Keyword(s): Distributed Data, Data Intensive Computing, Computing Systems, Data Intensive

Performance Evaluations of Distributed File Systems for Scientific Big Data in FUSE Environment

Electronics, 2021, Vol 10 (12), pp. 1471

DOI: 10.3390/electronics10121471

Author(s): Jun-Yeong Lee, Moon-Hyun Kim, Syed Asif Raza Shah, Sang-Un Ahn, Heejun Yoon, ...

Keyword(s): Data Storage, Scale Up, File Systems, Performance Evaluations, Distributed File Systems, Data Intensive Computing, Data Intensive, Tremendous Amount, Computing Environments, And Performance

Data are important and ever growing in data-intensive scientific environments. Such research data growth requires data storage systems that play pivotal roles in data management and analysis for scientific discoveries. Redundant Array of Independent Disks (RAID), a well-known storage technology combining multiple disks into a single large logical volume, has been widely used for data redundancy and performance improvement. However, this requires RAID-capable hardware or software to build a RAID-enabled disk array. In addition, it is difficult to scale up RAID-based storage. To mitigate this problem, many distributed file systems have been developed and are being actively used in various environments, especially in data-intensive computing facilities, where a tremendous amount of data has to be handled. In this study, we investigated and benchmarked various distributed file systems, such as Ceph, GlusterFS, Lustre and EOS, for data-intensive environments. In our experiment, we configured the distributed file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect read and write performance depending on the features of the data, which have to be considered in data-intensive computing environments.

Custom templates based heterogeneous resource allocation for data-intensive applications

2020

DOI: 10.32469/10355/86482

Author(s): Ronny Bazan Antequera

Keyword(s): High Performance, Real Data, University Of Missouri, Application Performance, Data Intensive, Edge Based, The Right, Heterogeneous Cloud, Data Intensive Applications, Cloud Resources

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI-COLUMBIA AT REQUEST OF AUTHOR.] The increase of data-intensive applications in science and engineering fields (e.g., bioinformatics, cybermanufacturing) demands the use of high-performance computing resources. However, the local resources of data-intensive applications usually have limited capacity and availability due to sizable upfront costs. Moreover, using remote public resources presents constraints at the private edge network domain. Specifically, misconfigured network policies cause bottlenecks due to cross-traffic from other applications attempting to use shared networking resources. Additionally, selecting the right remote resources can be cumbersome, especially for users who care about nonfunctional requirements of application execution such as performance, security and cost. Data-intensive applications have recurrent deployments and similar infrastructure requirements that can be addressed by creating templates. In this thesis, we handle application requirements through intelligent resource 'abstractions' coupled with 'reusable' approaches that save time and effort in deploying new cloud architectures. Specifically, we design a novel custom template middleware that can retrieve blueprints of resource configuration, technical/policy information, and benchmarks of workflow performance to facilitate repeatable/reusable resource composition. The middleware considers a hybrid recommendation methodology (online and offline recommendation) to leverage a catalog to rapidly check custom template solution correctness before/during resource consumption. Further, it prescribes application adaptations by fostering effective social interactions during the application's scaling stages.
Based on the above approach, we organize the thesis contributions under two main thrusts: (i) Custom Templates for Cloud Networking for Data-intensive Applications: This involves scheduling transit selection, engineering at the campus-edge based upon real-time policy control. Our solution ensures prioritized application performance delivery for multi-tenant traffic profiles from a diverse set of actual data-intensive applications in bioinformatics. (ii) Custom Templates for Cloud Computing for Data-intensive Applications: This involves recommending cloud resources for data-intensive applications based on a custom template catalog. We develop a novel expert system approach that is implemented as a middleware to abstract data-intensive application requirements for custom template composition. We uniquely consider heterogeneous cloud resource selection for the deployment of cloud architectures for real data-intensive applications in cybermanufacturing.
