ExSeisDat: A set of parallel I/O and workflow libraries for petroleum seismology

Author(s): Meghan A. Fisher, Pádraig Ó. Conbhuí, Cathal Ó. Brion, Jean-Thomas Acquaviva, Seán Delaney, ...

Seismic datasets are extremely large and are broken into data files ranging in size from hundreds of GiBs to tens of TiBs and larger. Parallel I/O for these files is complex due to the volume of data along with the varied and multiple access patterns within individual files. Properties of legacy file formats, such as the de-facto standard SEG-Y, also reduce developer productivity when working with these files. SEG-Y files embed their own internal layout, which can conflict with traditional, file-system-level layout optimization schemes. Additionally, as seismic files continue to grow in size, memory bottlenecks will be exacerbated, so smart I/O optimization is needed not only to increase the efficiency of reads and writes but also to manage memory usage. The ExSeisDat (Extreme-Scale Seismic Data) set of libraries addresses these problems through the development and implementation of easy-to-use, object-oriented libraries that are portable and open source, with bindings available in multiple languages. The lower-level parallel I/O library, ExSeisPIOL (Extreme-Scale Seismic Parallel I/O Library), targets SEG-Y and other proprietary formats, simplifying I/O by internally interfacing with MPI-I/O and other I/O interfaces. The I/O is handled explicitly; end users only need to define the memory limits, the decomposition of I/O across processes, and the data access patterns when reading and writing data. ExSeisPIOL bridges the layout gap between the SEG-Y file structure and the file system organization. The higher-level parallel seismic workflow library, ExSeisFlow (Extreme-Scale Seismic workFlow), leverages ExSeisPIOL and further simplifies I/O by implicitly handling all I/O parameters, allowing geophysicists to focus on domain-specific development. Operations in ExSeisFlow focus on prestack processing and can be performed on single traces, individual gathers, and entire surveys, including out-of-core sorting, binning, filtering, and transforming. To optimize memory management, the workflow only reads the data pertinent to the operations being performed instead of an entire file, and a smart caching system manages the read data, discarding it when it is no longer needed in the workflow. As the libraries are optimized to handle spatial and temporal locality, they are a natural fit for burst buffer technologies, particularly DDN's Infinite Memory Engine (IME) system. With appropriate access semantics, or through direct exploitation of the low-level interfaces, the ExSeisDat stack on IME delivers a significant improvement in I/O performance over standalone parallel file systems such as Lustre.
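The core access pattern ExSeisPIOL automates, splitting the traces of a SEG-Y file across processes and reading each share collectively, can be sketched directly with MPI-I/O. The sketch below is illustrative only and does not use the ExSeisPIOL API; the file name, trace count, and sample count are assumed values that a real tool would derive from the SEG-Y binary header and file size.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, nprocs = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // Classic fixed-layout SEG-Y: 3600-byte file header, then traces of
  // (240-byte trace header + ns samples), 4 bytes per sample here.
  const MPI_Offset file_header = 3600;
  const MPI_Offset ns          = 1000;     // assumed; normally read from the binary header
  const MPI_Offset trace_bytes = 240 + 4 * ns;
  const MPI_Offset ntraces     = 200000;   // assumed; normally derived from the file size

  // Block decomposition of the trace range across ranks.
  const MPI_Offset per_rank = (ntraces + nprocs - 1) / nprocs;
  const MPI_Offset first    = std::min<MPI_Offset>(rank * per_rank, ntraces);
  const MPI_Offset count    = std::min<MPI_Offset>(per_rank, ntraces - first);

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "survey.segy", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

  // Collective read of this rank's contiguous block of traces.
  std::vector<char> buf(static_cast<std::size_t>(count * trace_bytes));
  MPI_File_read_at_all(fh, file_header + first * trace_bytes,
                       buf.data(), static_cast<int>(buf.size()),
                       MPI_BYTE, MPI_STATUS_IGNORE);
  // Production code would also guard against per-rank reads larger than 2 GiB.

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}
```

A library such as ExSeisPIOL hides this offset arithmetic and decomposition behind its own interface; the point of the sketch is only the shape of the collective access pattern.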

2013, Vol. 23 (02), pp. 1340005
Author(s): Andrew Grimshaw, Mark Morgan, Avinash Kalyanaraman

A federated, secure, standardized, scalable, and transparent mechanism to access and share resources, particularly data resources, across organizational boundaries, one that does not require application modification and does not disrupt existing data access patterns, has been needed for some time in the computational science community. The Global Federated File System (GFFS) addresses this need and is a foundational component of the NSF-funded eXtreme Science and Engineering Discovery Environment (XSEDE) program. The GFFS allows user applications to access (create, read, update, delete) remote resources in a location-transparent fashion. Existing applications, whether they are statically linked binaries, dynamically linked binaries, or scripts (shell, Perl, Python), can access resources anywhere in the GFFS without modification (subject to access control). In this paper we present an overview of the GFFS and its most common use cases: accessing data at an NSF center from a home or campus machine, accessing data on a campus machine from an NSF center, directly sharing data with a collaborator at another institution, accessing remote computing resources, and interacting with remote running jobs. We present these use cases and how they are realized using the GFFS.
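Because the GFFS exposes remote resources through the file system namespace, an unmodified program sees them as ordinary paths. The snippet below is a minimal sketch of that idea; the mount point and file path are hypothetical and not part of any documented GFFS deployment.

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
  // Hypothetical GFFS-mounted path; to the application it is just a file.
  std::ifstream in("/gffs/home/alice/results/run42.csv");
  std::string line;
  while (std::getline(in, line)) {
    std::cout << line << '\n';  // read a collaborator's remote data as if it were local
  }
  return 0;
}
```

The same location transparency is what lets statically linked binaries and shell or Python scripts work against GFFS resources without recompilation.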


2021, Vol. 14 (10), pp. 1900-1912
Author(s): Yingqiang Zhang, Chaoyi Ruan, Cheng Li, Xinjun Yang, Wei Cao, ...

It is challenging for cloud-native relational databases to meet the ever-increasing needs of scaling compute and memory resources independently and elastically. The recent emergence of memory disaggregation architecture, relying on high-speed RDMA networks, offers opportunities to build cost-effective and elastic cloud-native databases. There exist proposals to let unmodified applications run transparently on disaggregated systems. However, running a relational database kernel atop such proposals incurs notable performance degradation and time-consuming failure recovery, offsetting the benefits of disaggregation. To address these challenges, in this paper we propose a novel database architecture called LegoBase, which explores the co-design of the database kernel and memory disaggregation. It pushes memory management back to the database layer, bypassing the Linux I/O stack and reusing or designing (remote) memory access optimizations with an understanding of data access patterns. LegoBase further splits the conventional ARIES fault tolerance protocol to independently handle local and remote memory failures for fast recovery of compute instances. We implemented LegoBase atop MySQL. We compare LegoBase against MySQL running on a standalone machine and the state-of-the-art disaggregation proposal Infiniswap. Our evaluation shows that even with a large fraction of data placed on remote memory, LegoBase's system performance in terms of throughput (up to 9.41% drop) and P99 latency (up to 11.58% increase) is comparable to the monolithic MySQL setup, and significantly outperforms (1.99x-2.33x, respectively) the deployment of MySQL over Infiniswap. Meanwhile, LegoBase achieves up to 3.87x and 5.48x speedups in recovery and warm-up time, respectively, over monolithic MySQL and MySQL over Infiniswap when handling failures or planned re-configurations.
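A purely conceptual sketch (not LegoBase's actual code) of the database-layer memory management the paper argues for: pages are served from a small local cache and, on a miss, pulled from remote disaggregated memory. The page type, the fetch_remote_page() stand-in for a one-sided RDMA read, and the eviction policy are all hypothetical placeholders.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using PageId = std::uint64_t;
using Page   = std::vector<std::byte>;

// Hypothetical stand-in for an RDMA read; returns a zero-filled 16 KiB page here.
Page fetch_remote_page(PageId /*id*/) { return Page(16 * 1024); }

class BufferPool {
 public:
  explicit BufferPool(std::size_t capacity) : capacity_(capacity) {}

  // Prefer the local cache; on a miss, fetch from remote memory and cache it
  // (a real system would use a proper eviction policy instead of erasing arbitrarily).
  const Page& get(PageId id) {
    auto it = local_.find(id);
    if (it != local_.end()) return it->second;                      // local hit
    if (local_.size() >= capacity_) local_.erase(local_.begin());   // naive eviction
    return local_.emplace(id, fetch_remote_page(id)).first->second; // remote miss
  }

 private:
  std::size_t capacity_;
  std::unordered_map<PageId, Page> local_;
};

int main() {
  BufferPool pool(1024);         // keep at most 1024 pages locally
  const Page& p = pool.get(42);  // first access: remote fetch; later accesses: local
  return static_cast<int>(p.size() == 0);
}
```

The benefit claimed by the paper comes from making this layer aware of the database's own access patterns rather than routing page faults through a generic kernel swapping path.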


2012, Vol. 20 (2), pp. 89-114
Author(s): H. Carter Edwards, Daniel Sunderland, Vicki Porter, Chris Amsler, Sam Mish

Large, complex scientific and engineering application codes have a significant investment in the computational kernels that implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge because these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices, each with its own memory space, (2) data-parallel kernels, and (3) multidimensional arrays. Kernel execution performance is, especially for NVIDIA® devices, extremely dependent on data access patterns. The optimal data access pattern can differ between manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introducing device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].
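The separation of access pattern from kernel described above survives in the present-day Kokkos library; the sketch below uses the current Kokkos::View and parallel_for API rather than the 2012 Kokkos Array release, so the exact names are an assumption relative to this paper. The memory layout of the View (row- versus column-major) is chosen per device by the library, while the kernel body stays unchanged.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // 2-D array: the device-appropriate layout is selected by the library,
    // not hard-coded into the kernel below.
    Kokkos::View<double**> x("x", 1000000, 8);
    Kokkos::View<double*>  norm("norm", 1000000);

    // Data-parallel kernel: one work item per row, portable across backends.
    Kokkos::parallel_for("row_norm", x.extent(0), KOKKOS_LAMBDA(const int i) {
      double s = 0.0;
      for (int j = 0; j < static_cast<int>(x.extent(1)); ++j)
        s += x(i, j) * x(i, j);
      norm(i) = s;
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```

Compiled for a CUDA backend the same source indexes the array in a coalescing-friendly layout; compiled for a CPU backend it uses a cache-friendly layout, which is the performance-portability claim in practice.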


Author(s): Armando Fandango, William Rivera

Scientific Big Data being gathered at exascale needs to be stored, retrieved, and manipulated. The storage stack for scientific Big Data includes a file system at the system level for the physical organization of the data, and a file format and input/output (I/O) system at the application level for the logical organization of the data; both must be of a high-performance variety for exascale. High-performance file systems are designed for concurrent access, high-speed transmission, and fault tolerance. High-performance file formats and I/O systems are designed to allow parallel and distributed applications easy and fast access to Big Data. These specialized file formats make it easier to store and access Big Data for scientific visualization and predictive analytics. This chapter provides a brief review of the characteristics of high-performance file systems such as Lustre and GPFS, and high-performance file formats and I/O systems such as HDF5, NetCDF, MPI-IO, and HDFS.
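As a concrete taste of one of the application-level formats mentioned, the sketch below writes a one-dimensional dataset with the HDF5 C API from C++. It is a minimal serial example; parallel HDF5 would additionally pass an MPI-enabled file-access property list, and the file and dataset names here are arbitrary.

```cpp
#include <hdf5.h>
#include <vector>

int main() {
  // A small 1-D array to persist in a self-describing HDF5 file.
  std::vector<double> data(1024, 1.0);
  hsize_t dims[1] = {data.size()};

  hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  hid_t space = H5Screate_simple(1, dims, nullptr);
  hid_t dset  = H5Dcreate2(file, "/values", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

  // Write the whole dataset in one call; hyperslab selections would let
  // different processes write disjoint regions in a parallel setting.
  H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());

  H5Dclose(dset);
  H5Sclose(space);
  H5Fclose(file);
  return 0;
}
```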


2020, Vol. 13 (12), pp. 1656-1671
Author(s): Jizhe Xia, Sicheng Huang, Shaobiao Zhang, Xiaoming Li, Jianrong Lyu, ...

2013, Vol. 10 (4), pp. 1-19
Author(s): Andrei Hagiescu, Bing Liu, R. Ramanathan, Sucheendra K. Palaniappan, Zheng Cui, ...
