ExSeisDat: A set of parallel I/O and workflow libraries for petroleum seismology

Author(s): Meghan A. Fisher, Pádraig Ó. Conbhuí, Cathal Ó. Brion, Jean-Thomas Acquaviva, Seán Delaney, ...

Seismic datasets are extremely large and are broken into data files ranging in size from hundreds of GiBs to tens of TiBs and larger. Parallel I/O for these files is complex due to the volume of data along with the varied and multiple access patterns within individual files. Properties of legacy file formats, such as the de-facto standard SEG-Y, also reduce developer productivity when working with these files. SEG-Y files embed their own internal layout, which can conflict with traditional, file-system-level layout optimization schemes. Additionally, as seismic files continue to grow in size, memory bottlenecks will be exacerbated, so smart I/O optimization is needed not only to increase the efficiency of reads and writes but also to manage memory usage. The ExSeisDat (Extreme-Scale Seismic Data) set of libraries addresses these problems through the development and implementation of easy-to-use, object-oriented libraries that are portable and open source, with bindings available in multiple languages. The lower-level parallel I/O library, ExSeisPIOL (Extreme-Scale Seismic Parallel I/O Library), targets SEG-Y and other proprietary formats, simplifying I/O by internally interfacing with MPI-I/O and other I/O interfaces. The I/O is handled explicitly; end users only need to define the memory limits, the decomposition of I/O across processes, and the data access patterns when reading and writing data. ExSeisPIOL bridges the layout gap between the SEG-Y file structure and the file system organization. The higher-level parallel seismic workflow library, ExSeisFlow (Extreme-Scale Seismic workFlow), leverages ExSeisPIOL and further simplifies I/O by implicitly handling all I/O parameters, allowing geophysicists to focus on domain-specific development. Operations in ExSeisFlow focus on prestack processing and can be performed on single traces, individual gathers, and entire surveys, including out-of-core sorting, binning, filtering, and transforming. To optimize memory management, the workflow only reads the data pertinent to the operations being performed instead of an entire file, and a smart caching system manages the read data, discarding it when it is no longer needed in the workflow. As the libraries are optimized to handle spatial and temporal locality, they are a natural fit for burst buffer technologies, particularly DDN's Infinite Memory Engine (IME) system. With appropriate access semantics, or through direct exploitation of the low-level interfaces, the ExSeisDat stack on IME delivers a significant improvement in I/O performance over standalone parallel file systems such as Lustre.
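The core access pattern ExSeisPIOL automates, splitting the traces of a SEG-Y file across processes and reading each share collectively, can be sketched directly with MPI-I/O. The sketch below is illustrative only and does not use the ExSeisPIOL API; the file name, trace count, and sample count are assumed values that a real tool would derive from the SEG-Y binary header and file size.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, nprocs = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // Classic fixed-layout SEG-Y: 3600-byte file header, then traces of
  // (240-byte trace header + ns samples), 4 bytes per sample here.
  const MPI_Offset file_header = 3600;
  const MPI_Offset ns          = 1000;     // assumed; normally read from the binary header
  const MPI_Offset trace_bytes = 240 + 4 * ns;
  const MPI_Offset ntraces     = 200000;   // assumed; normally derived from the file size

  // Block decomposition of the trace range across ranks.
  const MPI_Offset per_rank = (ntraces + nprocs - 1) / nprocs;
  const MPI_Offset first    = std::min<MPI_Offset>(rank * per_rank, ntraces);
  const MPI_Offset count    = std::min<MPI_Offset>(per_rank, ntraces - first);

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "survey.segy", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

  // Collective read of this rank's contiguous block of traces.
  std::vector<char> buf(static_cast<std::size_t>(count * trace_bytes));
  MPI_File_read_at_all(fh, file_header + first * trace_bytes,
                       buf.data(), static_cast<int>(buf.size()),
                       MPI_BYTE, MPI_STATUS_IGNORE);
  // Production code would also guard against per-rank reads larger than 2 GiB.

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}
```

A library such as ExSeisPIOL hides this offset arithmetic and decomposition behind its own interface; the point of the sketch is only the shape of the collective access pattern.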

2013, Vol. 23 (02), pp. 1340005
Author(s): Andrew Grimshaw, Mark Morgan, Avinash Kalyanaraman

A federated, secure, standardized, scalable, and transparent mechanism to access and share resources, particularly data resources, across organizational boundaries, one that does not require application modification and does not disrupt existing data access patterns, has been needed for some time in the computational science community. The Global Federated File System (GFFS) addresses this need and is a foundational component of the NSF-funded eXtreme Science and Engineering Discovery Environment (XSEDE) program. The GFFS allows user applications to access (create, read, update, delete) remote resources in a location-transparent fashion. Existing applications, whether they are statically linked binaries, dynamically linked binaries, or scripts (shell, Perl, Python), can access resources anywhere in the GFFS without modification (subject to access control). In this paper we present an overview of the GFFS and its most common use cases: accessing data at an NSF center from a home or campus machine, accessing data on a campus machine from an NSF center, directly sharing data with a collaborator at another institution, accessing remote computing resources, and interacting with remote running jobs. We present these use cases and how they are realized using the GFFS.
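Because the GFFS exposes remote resources through the file system namespace, an unmodified program sees them as ordinary paths. The snippet below is a minimal sketch of that idea; the mount point and file path are hypothetical and not part of any documented GFFS deployment.

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
  // Hypothetical GFFS-mounted path; to the application it is just a file.
  std::ifstream in("/gffs/home/alice/results/run42.csv");
  std::string line;
  while (std::getline(in, line)) {
    std::cout << line << '\n';  // read a collaborator's remote data as if it were local
  }
  return 0;
}
```

The same location transparency is what lets statically linked binaries and shell or Python scripts work against GFFS resources without recompilation.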


2021, Vol. 14 (10), pp. 1900-1912
Author(s): Yingqiang Zhang, Chaoyi Ruan, Cheng Li, Xinjun Yang, Wei Cao, ...

It is challenging for cloud-native relational databases to meet the ever-increasing needs of scaling compute and memory resources independently and elastically. The recent emergence of memory disaggregation architecture, relying on high-speed RDMA networks, offers opportunities to build cost-effective and elastic cloud-native databases. There exist proposals to let unmodified applications run transparently on disaggregated systems. However, running a relational database kernel atop such proposals incurs notable performance degradation and time-consuming failure recovery, offsetting the benefits of disaggregation. To address these challenges, in this paper we propose a novel database architecture called LegoBase, which explores the co-design of the database kernel and memory disaggregation. It pushes memory management back to the database layer, bypassing the Linux I/O stack and reusing or designing (remote) memory access optimizations with an understanding of data access patterns. LegoBase further splits the conventional ARIES fault tolerance protocol to independently handle local and remote memory failures for fast recovery of compute instances. We implemented LegoBase atop MySQL. We compare LegoBase against MySQL running on a standalone machine and the state-of-the-art disaggregation proposal Infiniswap. Our evaluation shows that even with a large fraction of data placed on remote memory, LegoBase's system performance in terms of throughput (up to 9.41% drop) and P99 latency (up to 11.58% increase) is comparable to the monolithic MySQL setup, and significantly outperforms (1.99x-2.33x, respectively) the deployment of MySQL over Infiniswap. Meanwhile, LegoBase achieves up to 3.87x and 5.48x speedups in recovery and warm-up time, respectively, over monolithic MySQL and MySQL over Infiniswap when handling failures or planned re-configurations.
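A purely conceptual sketch (not LegoBase's actual code) of the database-layer memory management the paper argues for: pages are served from a small local cache and, on a miss, pulled from remote disaggregated memory. The page type, the fetch_remote_page() stand-in for a one-sided RDMA read, and the eviction policy are all hypothetical placeholders.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using PageId = std::uint64_t;
using Page   = std::vector<std::byte>;

// Hypothetical stand-in for an RDMA read; returns a zero-filled 16 KiB page here.
Page fetch_remote_page(PageId /*id*/) { return Page(16 * 1024); }

class BufferPool {
 public:
  explicit BufferPool(std::size_t capacity) : capacity_(capacity) {}

  // Prefer the local cache; on a miss, fetch from remote memory and cache it
  // (a real system would use a proper eviction policy instead of erasing arbitrarily).
  const Page& get(PageId id) {
    auto it = local_.find(id);
    if (it != local_.end()) return it->second;                      // local hit
    if (local_.size() >= capacity_) local_.erase(local_.begin());   // naive eviction
    return local_.emplace(id, fetch_remote_page(id)).first->second; // remote miss
  }

 private:
  std::size_t capacity_;
  std::unordered_map<PageId, Page> local_;
};

int main() {
  BufferPool pool(1024);         // keep at most 1024 pages locally
  const Page& p = pool.get(42);  // first access: remote fetch; later accesses: local
  return static_cast<int>(p.size() == 0);
}
```

The benefit claimed by the paper comes from making this layer aware of the database's own access patterns rather than routing page faults through a generic kernel swapping path.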


2012, Vol. 20 (2), pp. 89-114
Author(s): H. Carter Edwards, Daniel Sunderland, Vicki Porter, Chris Amsler, Sam Mish

Large, complex scientific and engineering application codes have a significant investment in the computational kernels that implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge because these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices, each with its own memory space, (2) data-parallel kernels, and (3) multidimensional arrays. Kernel execution performance is, especially for NVIDIA® devices, extremely dependent on data access patterns. The optimal data access pattern can differ between manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introducing device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].
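The separation of access pattern from kernel described above survives in the present-day Kokkos library; the sketch below uses the current Kokkos::View and parallel_for API rather than the 2012 Kokkos Array release, so the exact names are an assumption relative to this paper. The memory layout of the View (row- versus column-major) is chosen per device by the library, while the kernel body stays unchanged.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // 2-D array: the device-appropriate layout is selected by the library,
    // not hard-coded into the kernel below.
    Kokkos::View<double**> x("x", 1000000, 8);
    Kokkos::View<double*>  norm("norm", 1000000);

    // Data-parallel kernel: one work item per row, portable across backends.
    Kokkos::parallel_for("row_norm", x.extent(0), KOKKOS_LAMBDA(const int i) {
      double s = 0.0;
      for (int j = 0; j < static_cast<int>(x.extent(1)); ++j)
        s += x(i, j) * x(i, j);
      norm(i) = s;
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```

Compiled for a CUDA backend the same source indexes the array in a coalescing-friendly layout; compiled for a CPU backend it uses a cache-friendly layout, which is the performance-portability claim in practice.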


Author(s): Armando Fandango, William Rivera

Scientific Big Data being gathered at exascale needs to be stored, retrieved, and manipulated. The storage stack for scientific Big Data includes a file system at the system level for the physical organization of the data, and a file format and input/output (I/O) system at the application level for the logical organization of the data; both must be of a high-performance variety for exascale. High-performance file systems are designed for concurrent access, high-speed transmission, and fault tolerance. High-performance file formats and I/O systems are designed to allow parallel and distributed applications easy and fast access to Big Data. These specialized file formats make it easier to store and access Big Data for scientific visualization and predictive analytics. This chapter provides a brief review of the characteristics of high-performance file systems such as Lustre and GPFS, and high-performance file formats and I/O systems such as HDF5, NetCDF, MPI-IO, and HDFS.
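As a concrete taste of one of the application-level formats mentioned, the sketch below writes a one-dimensional dataset with the HDF5 C API from C++. It is a minimal serial example; parallel HDF5 would additionally pass an MPI-enabled file-access property list, and the file and dataset names here are arbitrary.

```cpp
#include <hdf5.h>
#include <vector>

int main() {
  // A small 1-D array to persist in a self-describing HDF5 file.
  std::vector<double> data(1024, 1.0);
  hsize_t dims[1] = {data.size()};

  hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  hid_t space = H5Screate_simple(1, dims, nullptr);
  hid_t dset  = H5Dcreate2(file, "/values", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

  // Write the whole dataset in one call; hyperslab selections would let
  // different processes write disjoint regions in a parallel setting.
  H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());

  H5Dclose(dset);
  H5Sclose(space);
  H5Fclose(file);
  return 0;
}
```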


2020, Vol. 13 (12), pp. 1656-1671
Author(s): Jizhe Xia, Sicheng Huang, Shaobiao Zhang, Xiaoming Li, Jianrong Lyu, ...

2013, Vol. 10 (4), pp. 1-19
Author(s): Andrei Hagiescu, Bing Liu, R. Ramanathan, Sucheendra K. Palaniappan, Zheng Cui, ...
