Exploring Efficient Architectures on Remote In-Memory NVM over RDMA

2021 ◽  
Vol 20 (5s) ◽  
pp. 1-20
Author(s):  
Qingfeng Zhuge ◽  
Hao Zhang ◽  
Edwin Hsing-Mean Sha ◽  
Rui Xu ◽  
Jun Liu ◽  
...  

Efficiently accessing remote file data remains a challenging problem for data processing systems. Developments in non-volatile dual in-line memory modules (NVDIMMs), in-memory file systems, and RDMA networks provide new opportunities for solving the problem of remote data access. The common view of NVDIMMs such as Intel Optane DC Persistent Memory (DCPM) is that they expand main memory capacity at the cost of performance several times lower than DRAM. Through the in-depth exploration presented in this paper, however, we show that the potential of NVDIMMs for high-performance, remote in-memory access can be unlocked through careful design. We explore multiple architectural structures for accessing remote NVDIMMs in a real system using Optane DCPM and compare their performance. Experiments show significant performance gaps among different ways of exposing NVDIMMs as a memory address space accessible through the RDMA interface. Furthermore, we design and implement RIMFS, a prototype user-level, in-memory file system, in device DAX mode on Optane DCPM. Compared with Ext4-DAX, the DAX-supported Linux file system, remote reads on RIMFS over RDMA are on average 11.44 times faster than on remote Ext4-DAX. The experimental results also show that the performance of remote accesses on RIMFS is maintained on a heavily loaded data server with CPU utilization as high as 90%, while the performance of remote reads on Ext4-DAX drops by 49.3% and that of local reads on Ext4-DAX drops even more severely, by 90.1%. The write comparisons exhibit the same trends.
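For a sense of the building blocks involved, the following is a minimal sketch of how an in-memory region on a device-DAX namespace can be exposed for one-sided RDMA access with libibverbs; the device path, region size, and omitted error handling are assumptions for illustration, not RIMFS code.

```c
/* Minimal sketch (not RIMFS itself): map a device-DAX namespace and
 * register it for remote RDMA reads. Assumes libibverbs and a /dev/dax
 * device; path and size are illustrative, error handling is elided. */
#include <fcntl.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

int main(void) {
    size_t len = 1UL << 30;                      /* 1 GiB region (assumed)   */
    int fd = open("/dev/dax0.0", O_RDWR);        /* device-DAX namespace     */
    void *pmem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);        /* direct load/store access */

    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Expose the persistent-memory range to the RNIC; remote clients can
     * then issue one-sided RDMA READs against (base address, rkey). */
    struct ibv_mr *mr = ibv_reg_mr(pd, pmem, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    /* ... exchange mr->rkey and the base address with clients ... */
    return mr ? 0 : 1;
}
```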

2021 ◽  
Vol 17 (3) ◽  
pp. 1-25
Author(s):  
Bohong Zhu ◽  
Youmin Chen ◽  
Qing Wang ◽  
Youyou Lu ◽  
Jiwu Shu

Non-volatile memory and remote direct memory access (RDMA) provide extremely high performance in storage and network hardware. However, existing distributed file systems strictly isolate the file system and network layers, and their heavy, layered software designs leave the high-speed hardware under-exploited. In this article, we propose Octopus+, an RDMA-enabled distributed persistent memory file system that redesigns file system internals by closely coupling non-volatile memory and RDMA features. For data operations, Octopus+ directly accesses a shared persistent memory pool to reduce memory-copy overhead, and actively fetches and pushes data entirely on the clients to rebalance load between the server and the network. For metadata operations, Octopus+ introduces self-identified remote procedure calls for immediate notification between the file system and the network, and an efficient distributed transaction mechanism for consistency. Octopus+ also provides replication for better availability. Evaluations on Intel Optane DC Persistent Memory Modules show that Octopus+ achieves nearly the raw bandwidth for large I/Os and orders-of-magnitude better performance than existing distributed file systems.
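The client-active data path maps naturally onto one-sided RDMA READs against the shared persistent-memory pool. Below is a generic libibverbs sketch of posting such a read on an already-connected queue pair; the queue pair, local buffer registration, and exchanged (remote_addr, rkey) pair are assumed to exist, and the fragment is not Octopus+ source code.

```c
/* Generic sketch: fetch a block from a remote persistent-memory pool with
 * a one-sided RDMA READ. QP setup, local registration, and the exchange of
 * remote_addr/rkey are assumed to have happened already. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int fetch_block(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                uint64_t remote_addr, uint32_t rkey, uint32_t len) {
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* where the data lands locally */
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;   /* server CPU not involved */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* offset in the PM pool */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad);         /* poll the CQ afterwards */
}
```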


2021 ◽  
Vol 17 (1) ◽  
pp. 1-22
Author(s):  
Wen Cheng ◽  
Chunyan Li ◽  
Lingfang Zeng ◽  
Yingjin Qian ◽  
Xi Li ◽  
...  

In high-performance computing (HPC), data and metadata are stored on special server nodes, and client applications access them over a network, which introduces network latency and resource contention. These server nodes are typically equipped with (slow) magnetic disks, while the client nodes store temporary data on fast SSDs or even on non-volatile main memory (NVMM). Therefore, the full potential of parallel file systems can only be reached if fast client-side storage devices are included in the overall storage architecture. In this article, we propose an NVMM-based hierarchical persistent client cache for the Lustre file system (NVMM-LPCC for short). NVMM-LPCC implements two caching modes: a read-and-write mode (RW-NVMM-LPCC for short) and a read-only mode (RO-NVMM-LPCC for short). NVMM-LPCC integrates with the Lustre Hierarchical Storage Management (HSM) solution and the Lustre layout lock mechanism to provide consistent persistent caching services for I/O applications running on client nodes, while maintaining a global unified namespace for the entire Lustre file system. The evaluation results presented in this article show that NVMM-LPCC increases the average read throughput by up to 35.80 times and the average write throughput by up to 9.83 times compared with the native Lustre system, while providing excellent scalability.
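Conceptually, a persistent client cache serves reads from a local NVMM-backed directory and falls back to the parallel file system on a miss. The sketch below shows only that read-through pattern; the mount points are hypothetical, and the real NVMM-LPCC integrates with Lustre HSM and layout locks rather than copying files by hand.

```c
/* Conceptual read-through cache: serve a file from local NVMM if present,
 * otherwise copy it from Lustre into the cache and serve it locally.
 * Paths are hypothetical; consistency (layout locks, HSM state) is ignored. */
#include <stdio.h>

static FILE *open_cached(const char *name) {
    char cache_path[4096], lustre_path[4096];
    snprintf(cache_path, sizeof(cache_path), "/mnt/nvmm-cache/%s", name);
    snprintf(lustre_path, sizeof(lustre_path), "/mnt/lustre/%s", name);

    FILE *f = fopen(cache_path, "rb");
    if (f)
        return f;                       /* cache hit: local NVMM, no network */

    /* Cache miss: populate the NVMM cache, then serve the local copy. */
    FILE *src = fopen(lustre_path, "rb");
    FILE *dst = fopen(cache_path, "wb");
    if (!src || !dst)
        return src;                     /* fall back to the Lustre copy */

    char buf[1 << 16];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), src)) > 0)
        fwrite(buf, 1, n, dst);
    fclose(dst);
    fclose(src);
    return fopen(cache_path, "rb");
}
```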


Author(s):  
Armando Fandango ◽  
William Rivera

Scientific Big Data gathered at exascale needs to be stored, retrieved, and manipulated. The storage stack for scientific Big Data includes a file system at the system level for the physical organization of the data, and a file format and input/output (I/O) system at the application level for its logical organization; both must be of the high-performance variety for exascale. High-performance file systems are designed for concurrent access, high-speed transmission, and fault tolerance. High-performance file formats and I/O systems are designed to give parallel and distributed applications easy and fast access to Big Data. These specialized file formats make it easier to store and access Big Data for scientific visualization and predictive analytics. This chapter provides a brief review of the characteristics of high-performance file systems such as Lustre and GPFS, and of high-performance file formats and I/O systems such as HDF5, NetCDF, MPI-IO, and HDFS.
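As a concrete taste of such a format, the snippet below writes a small two-dimensional dataset with the HDF5 C API (serial for brevity; a parallel variant would open the file with an MPI-IO file access property list). The file and dataset names are illustrative.

```c
/* Write a small 2-D dataset to an HDF5 file using the serial HDF5 C API. */
#include "hdf5.h"

int main(void) {
    hsize_t dims[2] = {4, 6};
    int data[4][6];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 6; j++)
            data[i][j] = i * 6 + j;              /* dummy values */

    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                            H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);   /* 4 x 6 dataspace */
    hid_t dset  = H5Dcreate2(file, "/temperature", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```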


2020 ◽  
Author(s):  
Stefan Versick ◽  
Ole Kirner ◽  
Jörg Meyer ◽  
Holger Obermaier ◽  
Mehmet Soysal

Earth System Models (ESMs) have become much more demanding over recent years. The modelled processes have grown more complex, more and more processes are represented in the models, and model resolutions have increased to improve weather and climate forecasts. This requires faster high-performance computers (HPC) and better I/O performance.

Within our Pilot Lab Exascale Earth System Modelling (PL-EESM) we analyse the I/O performance of the ESM EMAC using a standard Lustre file system for output and compare it with the performance obtained using a parallel ad-hoc overlay file system. We show the impact for two scenarios: one with today's standard amount of output and one with artificially heavy output simulating future ESMs.

An ad-hoc file system is a private parallel file system created on demand for an HPC job from the node-local storage devices, in our case solid-state disks (SSDs). It exists only for the runtime of the job, so output data have to be moved to a permanent file system before the job finishes. Quasi in-situ data analysis and post-processing can improve performance because they may reduce the amount of data that has to be stored, saving disk space and transfer time to permanent storage. We also show first tests of quasi in-situ post-processing.


2018 ◽  
Vol 210 ◽  
pp. 04042
Author(s):  
Ammar Alhaj Ali ◽  
Pavel Varacha ◽  
Said Krayem ◽  
Roman Jasek ◽  
Petr Zacek ◽  
...  

Nowadays, a wide range of systems and applications, especially in high-performance computing, depend on distributed environments to process and analyse huge amounts of data. As the amount of data grows enormously, providing and developing efficient, scalable, and reliable storage solutions has become one of the major issues for scientific computing. The storage solution used by big data systems is the distributed file system (DFS), which builds a hierarchical and unified view over multiple file servers and shares on the network. In this paper we present the Hadoop Distributed File System (HDFS) as the DFS for big data systems, and Event-B as a formal method that can be used to model it. Event-B is a mature formal method that has been widely used in industry projects across domains such as automotive, transportation, space, business information, and medical devices. We propose the Rodin platform as the modelling tool for Event-B; Rodin integrates modelling and proving, is open source, and supports a large number of plug-in tools.
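For illustration only, the libhdfs C API gives a feel for how a client writes a file into HDFS; a running HDFS deployment and a configured libhdfs/JVM environment are assumed, and the path is hypothetical. This sketch is independent of the Event-B model itself.

```c
/* Minimal libhdfs sketch: connect to the configured namenode and write a
 * file. Assumes the libhdfs headers/library and an HDFS client environment. */
#include <fcntl.h>
#include <string.h>
#include "hdfs.h"

int main(void) {
    hdfsFS fs = hdfsConnect("default", 0);           /* namenode from config */
    if (!fs) return 1;

    hdfsFile out = hdfsOpenFile(fs, "/user/demo/hello.txt",
                                O_WRONLY | O_CREAT, 0, 0, 0);
    if (!out) return 1;

    const char *msg = "hello from libhdfs\n";
    hdfsWrite(fs, out, msg, (tSize)strlen(msg));     /* HDFS handles replication */
    hdfsFlush(fs, out);
    hdfsCloseFile(fs, out);
    hdfsDisconnect(fs);
    return 0;
}
```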


2021 ◽  
Author(s):  
Stefan Versick ◽  
Thomas Fischer ◽  
Ole Kirner ◽  
Tobias Meisel ◽  
Jörg Meyer

Earth System Models (ESMs) have become much more demanding over recent years. The modelled processes have grown more complex, more and more processes are represented in the models, and model resolutions have increased to improve the accuracy of predictions. This requires faster high-performance computers (HPC) and better I/O performance. One way to improve I/O performance is to use faster file systems. Last year we showed the impact of an ad-hoc file system on the performance of the ESM EMAC. An ad-hoc file system is a private parallel file system created on demand for an HPC job from the node-local storage devices, in our case solid-state disks (SSDs). It exists only for the runtime of the job, so output data have to be moved to a permanent file system before the job finishes. Performance improvements come from the use of SSDs in the case of small I/O chunks or a high number of I/O operations per second, and from the fact that the running job has exclusive access to the file system. To get a better overview of the cases in which ESMs benefit from ad-hoc file systems, we repeated our performance tests with further ESMs that use different I/O strategies. In total we have now analysed EMAC (parallel NetCDF), ICON2.5 (NetCDF with asynchronous I/O), ICON2.6 (NetCDF via the Climate Data Interface (CDI) library), and OpenGeoSys (parallel VTU).
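A common building block of such parallel output strategies is each rank writing its own slice of a shared file with collective I/O. The MPI-IO sketch below shows that pattern; the file name and per-rank block size are illustrative assumptions.

```c
/* Each MPI rank writes its own contiguous block of a shared file using a
 * collective MPI-IO call. File name and block size are illustrative. */
#include <mpi.h>

#define N 1024   /* doubles per rank (assumed block size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[N];
    for (int i = 0; i < N; i++)
        buf[i] = rank + i * 1e-6;                  /* dummy model output */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, N, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);      /* collective write */

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```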


Author(s):  
Charalampos Chalios ◽  
Giorgis Georgakoudis ◽  
Konstantinos Tovletoglou ◽  
George Karakonstantis ◽  
Hans Vandierendonck ◽  
...  

Power consumption and reliability of memory components are two of the most important hurdles in realizing exascale systems. Dynamic random access memory (DRAM) scaling projections predict significant performance and power penalties due to the conventional use of pessimistic refresh periods catering for worst-case cell retention times. Recent approaches relax those pessimistic refresh rates only on "strong" cells, or build on application-specific error resilience for data placement. However, these approaches cannot reveal the full potential of a relaxed-refresh paradigm shift, since they neglect additional application resilience properties related to the inherent functioning of DRAM. In this article, we elevate refresh-by-access to a first-class property of application resilience. We develop a complete, non-intrusive system stack, armed with low-cost Data-Access Aware Refresh (DARE) methods, to facilitate aggressive refresh relaxation and ensure non-disruptive operation on commodity servers. Essentially, our access-aware scheduling of application tasks amplifies the impact of the implicit refresh performed by memory accesses, extending the period during which hardware refresh remains disabled while limiting the number of potential errors and hence their impact on an application's output quality. The stack, implemented on an off-the-shelf server running a full-fledged Linux OS, captures for the first time the intricate time-dependent system and data interactions in the presence of hardware errors, in contrast to previous architectural simulation approaches of limited detail. Results demonstrate that with DARE it is possible to completely disable hardware refresh, with a minor quality loss ranging from 2% to 18%.
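To make the "implicit refresh of memory accesses" concrete: reading any location in a DRAM row activates, and thereby refreshes, that row, so rows the application keeps touching need no hardware refresh. The fragment below is only a conceptual sketch of that effect with an assumed row granularity; it is not the DARE scheduler, and in practice the accesses must actually reach DRAM rather than being served from the CPU caches.

```c
/* Conceptual sketch of access-based ("implicit") refresh: reading one byte
 * per DRAM row activates the row, which refreshes all cells in it. The row
 * granularity is an assumption; real geometry depends on the DIMM, and the
 * reads must miss the CPU caches to have this effect. */
#include <stddef.h>

#define ROW_BYTES (8u * 1024u)   /* assumed DRAM row size */

void touch_rows(volatile const char *region, size_t len) {
    for (size_t off = 0; off < len; off += ROW_BYTES)
        (void)region[off];       /* the read itself refreshes this row */
}
```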


Author(s):  
Jan Stender ◽  
Michael Berlin ◽  
Alexander Reinefeld

Cloud computing poses new challenges to data storage. While cloud providers use shared distributed hardware, which is inherently unreliable and insecure, cloud users expect their data to be safely and securely stored, available at any time, and accessible in the same way as their locally stored data. In this chapter, the authors present XtreemFS, a file system for the cloud. XtreemFS reconciles the need of cloud providers for cheap scale-out storage solutions with that of cloud users for reliable, secure, and easy data access. The main contributions of the chapter are: a description of the internal architecture of XtreemFS, which presents an approach to building large-scale, distributed, POSIX-compliant file systems on top of cheap, off-the-shelf hardware; a description of the XtreemFS security infrastructure, which guarantees isolation of individual users despite shared and insecure storage and network resources; a comprehensive overview of replication mechanisms in XtreemFS, which guarantee consistency, availability, and durability of data in the face of component failures; and an overview of the snapshot infrastructure of XtreemFS, which allows capturing and freezing momentary states of the file system in a scalable and fault-tolerant fashion. The authors also compare XtreemFS with existing solutions and argue for its practicability and potential in the cloud storage market.


Electronics ◽  
2021 ◽  
Vol 10 (20) ◽  
pp. 2486
Author(s):  
Se-young Yu

Distributing Big Data for science is pushing the capabilities of networks and computing systems. However, the fundamental concept of copying data from one machine to another has not been challenged in collaborative science. As recent storage system developments use modern fabrics to provide faster remote data access with lower overhead, traditional data movement using Data Transfer Nodes (DTNs) must cope with a paradigm shift from a store-and-forward model to streaming data with direct storage access over the network. This study evaluates NVMe-over-TCP (NVMe-TCP) in a long-distance network using different file systems and configurations to characterize remote NVMe file system access performance in MAN and WAN data-moving scenarios. We found that NVMe-TCP is better suited to remote reads than to remote writes over the network, and that using RAID0 can significantly improve performance in a long-distance network. Additionally, fine-tuning the file system can improve remote write performance on DTNs in a long-distance network.

