Computational storage: an efficient and scalable platform for big data and HPC applications

AbstractIn the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data to the system. This cost has a direct relation with the distance of processing engines from the data. This is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move process closer to data. Computational storage devices (CSDs) push the “move process to data” paradigm to its ultimate boundaries by deploying embedded processing engines inside storage devices to process data. In this paper, we introduce Catalina, an efficient and flexible computational storage platform, that provides a seamless environment to process data in-place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for the applications. Thus, a vast spectrum of applications can be ported for running on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina CSD is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in-place without any modifications on the underlying distributed processing framework. For the proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks to investigate the benefits of deploying Catalina CSDs in the distributed processing environments. The experimental results show up to 2.2× improvement in performance and 4.3× reduction in energy consumption, respectively, for running Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved up to 5.4× and 8.9×, respectively.

Download Full-text

Task-based programming in COMPSs to converge from HPC to big data

The International Journal of High Performance Computing Applications ◽

10.1177/1094342017701278 ◽

2017 ◽

Vol 32 (1) ◽

pp. 45-60 ◽

Cited By ~ 11

Author(s):

Javier Conejero ◽

Sandra Corella ◽

Rosa M Badia ◽

Jesus Labarta

Keyword(s):

Big Data ◽

High Performance ◽

Programming Model ◽

Good Alternative ◽

Programming Models ◽

Suitable Model ◽

Advantages And Disadvantages ◽

Big Data Applications ◽

And Performance ◽

The Right

Task-based programming has proven to be a suitable model for high-performance computing (HPC) applications. Different implementations have been good demonstrators of this fact and have promoted the acceptance of task-based programming in the OpenMP standard. Furthermore, in recent years, Apache Spark has gained wide popularity in business and research environments as a programming model for addressing emerging big data problems. COMP Superscalar (COMPSs) is a task-based environment that tackles distributed computing (including Clouds) and is a good alternative for a task-based programming model for big data applications. This article describes why we consider that task-based programming models are a good approach for big data applications. The article includes a comparison of Spark and COMPSs in terms of architecture, programming model, and performance. It focuses on the differences that both frameworks have in structural terms, on their programmability interface, and in terms of their efficiency by means of three widely known benchmarking kernels: Wordcount, Kmeans, and Terasort. These kernels enable the evaluation of the more important functionalities of both programming models and analyze different work flows and conditions. The main results achieved from this comparison are (1) COMPSs is able to extract the inherent parallelism from the user code with minimal coding effort as opposed to Spark, which requires the existing algorithms to be adapted and rewritten by explicitly using their predefined functions, (2) it is an improvement in terms of performance when compared with Spark, and (3) COMPSs has shown to scale better than Spark in most cases. Finally, we discuss the advantages and disadvantages of both frameworks, highlighting the differences that make them unique, thereby helping to choose the right framework for each particular objective.

Download Full-text

High Performance Storage for Big Data Analytics and Visualization

Advances in Data Mining and Database Management - Handbook of Research on Big Data Storage and Visualization Techniques ◽

10.4018/978-1-5225-3142-5.ch010 ◽

2018 ◽

pp. 254-275

Author(s):

Armando Fandango ◽

William Rivera

Keyword(s):

Big Data ◽

High Speed ◽

High Performance ◽

File System ◽

Predictive Analytics ◽

Big Data Analytics ◽

File Systems ◽

Distributed Applications ◽

System Level ◽

File Formats

Scientific Big Data being gathered at exascale needs to be stored, retrieved and manipulated. The storage stack for scientific Big Data includes a file system at the system level for physical organization of the data, and a file format and input/output (I/O) system at the application level for logical organization of the data; both of them of high-performance variety for exascale. The high-performance file system is designed with concurrent access, high-speed transmission and fault tolerance characteristics. High-performance file formats and I/O are designed to allow parallel and distributed applications with easy and fast access to Big Data. These specialized file formats make it easier to store and access Big Data for scientific visualization and predictive analytics. This chapter provides a brief review of the characteristics of high-performance file systems such as Lustre and GPFS, and high-performance file formats such as HDF5, NetCDF, MPI-IO, and HDFS.

Download Full-text

Architecture for Big Data Storage in Different Cloud Deployment Models

Research Anthology on Architectures, Frameworks, and Integration Strategies for Distributed and Cloud Computing ◽

10.4018/978-1-7998-5339-8.ch009 ◽

2021 ◽

pp. 178-208

Author(s):

Chandu Thota ◽

Gunasekaran Manogaran ◽

Daphne Lopez ◽

Revathi Sundarasekar

Keyword(s):

Cloud Computing ◽

Big Data ◽

Data Storage ◽

High Performance ◽

Data Services ◽

Big Data Applications ◽

Nosql Database ◽

Amazon Web Services ◽

Product Domains ◽

Scalable Database

Cloud Computing is a new computing model that distributes the computation on a resource pool. The need for a scalable database capable of expanding to accommodate growth has increased with the growing data in web world. More familiar Cloud Computing vendors such as Amazon Web Services, Microsoft, Google, IBM and Rackspace offer cloud based Hadoop and NoSQL database platforms to process Big Data applications. Variety of services are available that run on top of cloud platforms freeing users from the need to deploy their own systems. Nowadays, integrating Big Data and various cloud deployment models is major concern for Internet companies especially software and data services vendors that are just getting started themselves. This chapter proposes an efficient architecture for integration with comprehensive capabilities including real time and bulk data movement, bi-directional replication, metadata management, high performance transformation, data services and data quality for customer and product domains.

Download Full-text

Rethinking High Performance Computing System Architecture for Scientific Big Data Applications

2016 IEEE Trustcom/BigDataSE/ISPA ◽

10.1109/trustcom.2016.0248 ◽

2016 ◽

Author(s):

Yong Chen ◽

Chao Chen ◽

Yanlong Yin ◽

Xian-He Sun ◽

Rajeev Thakur ◽

...

Keyword(s):

Big Data ◽

High Performance Computing ◽

System Architecture ◽

High Performance ◽

Computing System ◽

Big Data Applications ◽

High Performance Computing System ◽

Performance Computing

Download Full-text

A Big Data Analytics Approach for the Development of Advanced Cardiology Applications

Information ◽

10.3390/info11020060 ◽

2020 ◽

Vol 11 (2) ◽

pp. 60 ◽

Cited By ~ 2

Author(s):

Lorenzo Carnevale ◽

Antonio Celesti ◽

Maria Fazio ◽

Massimo Villari

Keyword(s):

Big Data ◽

Data Analytics ◽

Response Times ◽

Distributed Processing ◽

Big Data Analytics ◽

Reliable System ◽

Huge Amount ◽

Ecg Signals ◽

Big Data Applications ◽

Electrocardiogram Ecg

Nowadays, we are observing a growing interest about Big Data applications in different healthcare sectors. One of this is definitely cardiology. In fact, electrocardiogram produces a huge amount of data about the heart health status that need to be stored and analysed in order to detect a possible issues. In this paper, we focus on the arrhythmia detection problem. Specifically, our objective is to address the problem of distributed processing considering big data generated by electrocardiogram (ECG) signals in order to carry out pre-processing analysis. Specifically, an algorithm for the identification of heartbeats and arrhythmias is proposed. Such an algorithm is designed in order to carry out distributed processing over the Cloud since big data could represent the bottleneck for cardiology applications. In particular, we implemented the Menard algorithm in Apache Spark in order to process big data coming form ECG signals in order to identify arrhythmias. Experiments conducted using a dataset provided by the Physionet.org European ST-T Database show an improvement in terms of response times. As highlighted by our outcomes, our solution provides a scalable and reliable system, which may address the challenges raised by big data in healthcare.

Download Full-text

Multilevel Active Storage for big data applications in high performance computing

2013 IEEE International Conference on Big Data ◽

10.1109/bigdata.2013.6691570 ◽

2013 ◽

Cited By ~ 3

Author(s):

Chao Chen ◽

Michael Lang ◽

Yong Chen

Keyword(s):

Big Data ◽

High Performance Computing ◽

High Performance ◽

Active Storage ◽

Big Data Applications ◽

Performance Computing

Download Full-text

CedCom: A high-performance architecture for Big Data applications

2014 IEEE/ACS 11th International Conference on Computer Systems and Applications (AICCSA) ◽

10.1109/aiccsa.2014.7073257 ◽

2014 ◽

Cited By ~ 4

Author(s):

Tanguy Raynaud ◽

Rafiqul Haque ◽

Hassan Ait-kaci

Keyword(s):

Big Data ◽

High Performance ◽

Big Data Applications

Download Full-text

HAMR: A dataflow-based real-time in-memory cluster computing engine

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016672080 ◽

2016 ◽

Vol 31 (5) ◽

pp. 361-374 ◽

Cited By ~ 3

Author(s):

Yao Wu ◽

Long Zheng ◽

Brian Heilig ◽

Guang R Gao

Keyword(s):

Big Data ◽

Memory Management ◽

High Performance ◽

Cluster Computing ◽

Programming Model ◽

Distributed Processing ◽

Large Data ◽

Computing System ◽

Fine Grain ◽

Execution Model

As the attention given to big data grows, cluster computing systems for distributed processing of large data sets become the mainstream and critical requirement in high performance distributed system research. One of the most successful systems is Hadoop, which uses MapReduce as a programming/execution model and takes disks as intermedia to process huge volumes of data. Spark, as an in-memory computing engine, can solve the iterative and interactive problems more efficiently. However, currently it is a consensus that they are not the final solutions to big data due to a MapReduce-like programming model, synchronous execution model and the constraint that only supports batch processing, and so on. A new solution, especially, a fundamental evolution is needed to bring big data solutions into a new era. In this paper, we introduce a new cluster computing system called HAMR which supports both batch and streaming processing. To achieve better performance, HAMR integrates high performance computing approaches, i.e. dataflow fundamental into a big data solution. With more specifications, HAMR is fully designed based on in-memory computing to reduce the unnecessary disk access overhead; task scheduling and memory management are in fine-grain manner to explore more parallelism; asynchronous execution improves efficiency of computation resource usage, and also makes workload balance across the whole cluster better. The experimental results show that HAMR can outperform Hadoop MapReduce and Spark by up to 19x and 7x respectively, in the same cluster environment. Furthermore, HAMR can handle scaling data size well beyond the capabilities of Spark.

Download Full-text