ACM Transactions on Storage

Lightweight Dynamic Redundancy Control with Adaptive Encoding for Server-based Storage

ACM Transactions on Storage ◽

10.1145/3456292 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-38

Author(s):

Takayuki Fukatani ◽

Hieu Hanh Le ◽

Haruo Yokota

Keyword(s):

Low Cost ◽

Storage System ◽

Erasure Coding ◽

Performance Impact ◽

Data Redundancy ◽

Performance Improvements ◽

Redundant Data ◽

Redundancy Control ◽

User Data ◽

Capacity Efficiency

With the recent performance improvements in commodity hardware, low-cost commodity server-based storage has become a practical alternative to dedicated storage appliances. Because commodity servers have a high failure rate, a server-based storage system requires data redundancy across multiple servers. However, the extra storage capacity for this redundancy significantly increases the system cost. Although erasure coding (EC) is a promising method to reduce the amount of redundant data, it requires distributing and encoding data among servers. These processes involve considerable network traffic and processing overhead, and their performance impact still needs to be reduced. The impact is especially significant for random-intensive applications. In this article, we propose a new lightweight redundancy control for server-based storage. Our proposed method uses a new local filesystem-based approach that avoids distributing data by adding data redundancy to locally stored user data. Our method switches the redundancy method of user data between replication and EC according to workloads to improve capacity efficiency while achieving higher performance. Our experiments show up to 230% better online-transaction-processing performance for our method compared with CephFS, a widely used alternative system. We also confirmed that our proposed method prevents unexpected performance degradation while achieving better capacity efficiency.
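The capacity trade-off behind switching between replication and EC can be illustrated with a short sketch. The function names, the 3-way replication factor, the (4, 2) code, and the write-rate threshold are illustrative assumptions, not parameters taken from the article.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data under n-way replication."""
    return float(copies)


def ec_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data for a (k data, m parity) erasure code."""
    return (k + m) / k


def choose_redundancy(writes_per_sec: float, hot_threshold: float = 100.0) -> str:
    """Hypothetical policy: keep write-heavy data replicated (cheap updates)
    and move rarely written data to EC (cheap capacity).
    The threshold is an assumption made for this sketch."""
    return "replication" if writes_per_sec >= hot_threshold else "erasure_coding"


if __name__ == "__main__":
    print(replication_overhead(3))   # 3.0 raw bytes per user byte
    print(ec_overhead(4, 2))         # 1.5 raw bytes per user byte
    print(choose_redundancy(500.0))  # replication
    print(choose_redundancy(1.0))    # erasure_coding
```

Under these assumed settings, 3-way replication stores 3.0 raw bytes per user byte while a (4, 2) code stores 1.5, which is why moving cold data to EC improves capacity efficiency.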

Introduction to the Special Section on USENIX FAST 2021

ACM Transactions on Storage ◽

10.1145/3485449 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-1

Author(s):

Marcos K. Aguilera ◽

Gala Yadgar

Keyword(s):

Special Section

FluidSMR: Adaptive Management for Hybrid SMR Drives

ACM Transactions on Storage ◽

10.1145/3465404 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-30

Author(s):

Fenggang Wu ◽

Bingzhe Li ◽

David H. C. Du

Keyword(s):

Adaptive Management ◽

Magnetic Recording ◽

Disk Drive ◽

Allocation Scheme ◽

Performance Impact ◽

Two Phase ◽

Dynamic Data ◽

Management Scheme ◽

Shingled Magnetic Recording ◽

Single Disk

Hybrid Shingled Magnetic Recording (H-SMR) drives are the most recently developed SMR drives, which allow dynamic conversion of the recording format between Conventional Magnetic Recording (CMR) and SMR on a single disk drive. We identify the unique opportunities of H-SMR drives to manage the tradeoffs between performance and capacity, including the possibility of adjusting the SMR area capacity based on storage usage and the flexibility of dynamic data swapping between the CMR area and SMR area. We design and implement FluidSMR, an adaptive management scheme for hybrid SMR drives, to fully utilize H-SMR drives under different workloads and capacity usages. FluidSMR has a two-phase allocation scheme to support growing usage of the H-SMR drive. The scheme can intelligently determine the sizes of the CMR and the SMR space in an H-SMR drive based on dynamically changing workloads. Moreover, FluidSMR uses a cache in the CMR region, managed by a proposed loop-back log policy, to reduce the overhead of updates to the SMR region. Evaluations using enterprise traces demonstrate that FluidSMR outperforms baseline schemes in various workloads by decreasing the average I/O latency and effectively reducing and controlling the performance impact of the format conversion between CMR and SMR.

Leveraging NVMe SSDs for Building a Fast, Cost-effective, LSM-tree-based KV Store

ACM Transactions on Storage ◽

10.1145/3480963 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-29

Author(s):

Cheng Li ◽

Hao Chen ◽

Chaoyi Ruan ◽

Xiaosong Ma ◽

Yinlong Xu

Keyword(s):

High Speed ◽

Scale Up ◽

Cost Effective ◽

Data Migration ◽

Solid State Drives ◽

Content Type ◽

Memory Processing ◽

Recent Emergence ◽

Non Volatile Memory ◽

High Bandwidth

Key-value (KV) stores support many crucial applications and services. They perform fast in-memory processing but are still often limited by I/O performance. The recent emergence of high-speed commodity non-volatile memory express solid-state drives (NVMe SSDs) has propelled new KV system designs that take advantage of their ultra-low latency and high bandwidth. Meanwhile, switching to entirely new data layouts and scaling up entire databases to high-end SSDs requires considerable investment. As a compromise, we propose SpanDB, an LSM-tree-based KV store that adapts the popular RocksDB system to utilize selective deployment of high-speed SSDs. SpanDB allows users to host the bulk of their data on cheaper and larger SSDs (and even hard disk drives with certain workloads), while relocating write-ahead logs (WAL) and the top levels of the LSM-tree to a much smaller and faster NVMe SSD. To better utilize this fast disk, SpanDB provides high-speed, parallel WAL writes via SPDK, and enables asynchronous request processing to mitigate inter-thread synchronization overhead and work efficiently with polling-based I/O. To ease the live data migration between fast and slow disks, we introduce TopFS, a stripped-down file system providing familiar file interface wrappers on top of SPDK I/O. Our evaluation shows that SpanDB simultaneously improves RocksDB's throughput by up to 8.8× and reduces its latency by 9.5–58.3%. Compared with KVell, a system designed for high-end SSDs, SpanDB achieves 96–140% of its throughput, with a 2.3–21.6× lower latency, at a cheaper storage configuration.
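The placement idea in this abstract, with the WAL and the top LSM-tree levels on the fast device and the rest on the slow one, can be sketched as follows. The function name, the tier labels, and the choice of how many levels count as "top" are hypothetical; this is not SpanDB's or RocksDB's actual API.

```python
def place_files(num_levels: int, fast_levels: int) -> dict:
    """Map the WAL and each LSM-tree level to a storage tier.

    fast_levels is an assumed tuning knob: levels 0..fast_levels-1 go to the
    small, fast NVMe device; deeper (larger) levels go to the cheaper device.
    """
    placement = {"WAL": "fast"}  # the write-ahead log always sits on the fast device
    for level in range(num_levels):
        placement[f"L{level}"] = "fast" if level < fast_levels else "slow"
    return placement


if __name__ == "__main__":
    for name, tier in place_files(num_levels=4, fast_levels=2).items():
        print(name, tier)
```

With four levels and two of them treated as top levels, the sketch places WAL, L0, and L1 on the fast tier and L2 and L3 on the slow tier.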

Programming Abstractions for Managing Workflows on Tiered Storage Systems

ACM Transactions on Storage ◽

10.1145/3457119 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-21

Author(s):

Devarshi Ghoshal ◽

Lavanya Ramakrishnan

Keyword(s):

Data Management ◽

High Performance ◽

New Technologies ◽

Storage Capacity ◽

Storage Systems ◽

Storage System ◽

Management Strategies ◽

Scientific Workflows ◽

Programming Abstractions ◽

Storage Hierarchy

Scientific workflows in High Performance Computing (HPC) environments are processing large amounts of data. The storage hierarchy on HPC systems is getting deeper, driven by new technologies (NVRAMs, SSDs, etc.). There is a need for new programming abstractions that allow users to seamlessly manage data at the workflow level on multi-tiered storage systems, and provide optimal workflow performance and use of storage resources. In previous work, we introduced a software architecture, Managing Data on Tiered Storage for Scientific Workflows (MaDaTS), that used a Virtual Data Space (VDS) abstraction to hide the complexities of the underlying storage system while allowing users to control data management strategies. In this article, we detail the data-centric programming abstractions that allow users to manage a workflow around its data on the storage layer. The programming abstractions simplify data management for scientific workflows on multi-tiered storage systems, without affecting workflow performance or storage capacity. We measure the overheads and effectiveness introduced by the programming abstractions of MaDaTS. Our results show that these abstractions can optimally use the storage capacity in lower-capacity storage tiers, and simplify data management without adding any performance overheads.

Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling

ACM Transactions on Storage ◽

10.1145/3459626 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-23

Author(s):

Datong Zhang ◽

Yuhui Deng ◽

Yi Zhou ◽

Yifeng Zhu ◽

Xiao Qin

Keyword(s):

High Efficiency ◽

False Positive Rate ◽

Bloom Filter ◽

Data Locality ◽

Data Deduplication ◽

Backup System ◽

Memory Overhead ◽

Data Fragmentation ◽

Positive Rate ◽

Salient Features

Data deduplication techniques construct an index consisting of fingerprint entries to identify and eliminate duplicated copies of repeating data. The bottleneck of disk-based index lookup and data fragmentation caused by eliminating duplicated chunks are two challenging issues in data deduplication. Deduplication-based backup systems generally employ containers storing contiguous chunks together with their fingerprints to preserve data locality and alleviate the two issues, but this remains inadequate. To address these two issues, we propose a container utilization based hot fingerprint entry distilling strategy to improve the performance of deduplication-based backup systems. We divide the index into three parts: hot fingerprint entries, fragmented fingerprint entries, and useless fingerprint entries. A container with utilization smaller than a given threshold is called a sparse container. Fingerprint entries that point to non-sparse containers are hot fingerprint entries. For the remaining fingerprint entries, if a fingerprint entry matches any fingerprint of forthcoming backup chunks, it is classified as a fragmented fingerprint entry. Otherwise, it is classified as a useless fingerprint entry. We observe that hot fingerprint entries account for a small part of the index, whereas the remaining fingerprint entries account for the majority of the index. This intriguing observation inspires us to develop a hot fingerprint entry distilling approach named HID. HID segregates useless fingerprint entries from the index to improve memory utilization and bypass disk accesses. In addition, HID separates fragmented fingerprint entries to make a deduplication-based backup system directly rewrite fragmented chunks, thereby alleviating adverse fragmentation. Moreover, HID introduces a feature to treat fragmented chunks as unique chunks. This feature compensates for the shortcoming that a Bloom filter cannot directly identify certain duplicated chunks (i.e., the fragmented chunks). To take full advantage of the preceding feature, we propose an evolved HID strategy called EHID. EHID incorporates a Bloom filter, to which only hot fingerprints are mapped. In doing so, EHID exhibits two salient features: (i) EHID avoids disk accesses to identify unique chunks and the fragmented chunks; (ii) EHID slashes the false positive rate of the integrated Bloom filter. These salient features push EHID into the high-efficiency mode. Our experimental results show our approach reduces the average memory overhead of the index by 34.11% and 25.13% when using the Linux dataset and the FSL dataset, respectively. Furthermore, compared with the state-of-the-art method HAR, EHID boosts the average backup throughput by up to a factor of 2.25 with the Linux dataset, and EHID reduces the average disk I/O traffic by up to 66.21% with the FSL dataset. EHID also marginally improves the system's restore performance.
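The three-way classification of fingerprint entries described in this abstract can be sketched as follows. The `Entry` type, the function name, and the dictionary-based utilization lookup are hypothetical choices made for illustration; only the classification rules themselves come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    fingerprint: str   # chunk fingerprint (e.g. a hash digest)
    container_id: int  # container that stores the chunk


def classify_entries(entries, container_utilization, upcoming_fingerprints, threshold):
    """Split index entries into hot, fragmented, and useless lists.

    container_utilization: container_id -> utilization in [0, 1] (assumed layout).
    upcoming_fingerprints: set of fingerprints of forthcoming backup chunks.
    A container whose utilization is below `threshold` is sparse.
    """
    hot, fragmented, useless = [], [], []
    for entry in entries:
        if container_utilization[entry.container_id] >= threshold:
            hot.append(entry)         # points to a non-sparse container
        elif entry.fingerprint in upcoming_fingerprints:
            fragmented.append(entry)  # sparse container, but still referenced
        else:
            useless.append(entry)     # sparse container and no upcoming match
    return hot, fragmented, useless


if __name__ == "__main__":
    entries = [Entry("a", 1), Entry("b", 2), Entry("c", 2)]
    hot, fragmented, useless = classify_entries(
        entries, {1: 0.9, 2: 0.1}, {"b"}, threshold=0.5
    )
    print(hot, fragmented, useless)
```

In the usage example, entry "a" is hot because container 1 is non-sparse, "b" is fragmented because it sits in a sparse container but matches an upcoming chunk, and "c" is useless.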

Two Reconfigurable NDP Servers: Understanding the Impact of Near-Data Processing on Data Center Applications

ACM Transactions on Storage ◽

10.1145/3460201 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-27

Author(s):

Xiaojia Song ◽

Tao Xie ◽

Stephen Fischer

Keyword(s):

Energy Efficiency ◽

Data Processing ◽

Data Center ◽

Experimental Results ◽

Full Size ◽

Data Intensive ◽

Processing Rate ◽

New Findings ◽

Data Intensive Applications ◽

The Impact

Existing near-data processing (NDP)-powered architectures have demonstrated their strength for some data-intensive applications. Data center servers, however, have to serve not only data-intensive but also compute-intensive applications. An in-depth understanding of the impact of NDP on various data center applications is still needed. For example, can a compute-intensive application also benefit from NDP? In addition, current NDP techniques focus on maximizing the data processing rate by always utilizing all computing resources at all times. Is this “always running in full gear” strategy consistently beneficial for an application? To answer these questions, we first propose two reconfigurable NDP-powered servers called RANS (Reconfigurable ARM-based NDP Server) and RFNS (Reconfigurable FPGA-based NDP Server). Next, we implement a single-engine prototype for each of them based on a conventional data center and then evaluate their effectiveness. Experimental results measured from the two prototypes are then extrapolated to estimate the properties of the two full-size reconfigurable NDP servers. Finally, several new findings are presented. For example, we find that while RANS can only benefit data-intensive applications, RFNS can offer benefits for both data-intensive and compute-intensive applications. Moreover, we find that for certain applications the reconfigurability of RANS/RFNS can deliver noticeable energy efficiency without any performance degradation.

RocksDB: Evolution of Development Priorities in a Key-value Store Serving Large-scale Applications

ACM Transactions on Storage ◽

10.1145/3483840 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-32

Author(s):

Siying Dong ◽

Andrew Kryczka ◽

Yanqin Jin ◽

Michael Stumm

Keyword(s):

Large Scale ◽

Future Research ◽

Solid State Drives ◽

Open Problems ◽

Failure Handling ◽

Data Formats ◽

Data Corruption ◽

Write Amplification ◽

Protection Mechanisms ◽

Integrity Protection

This article is an eight-year retrospective on development priorities for RocksDB, a key-value store developed at Facebook that targets large-scale distributed systems and that is optimized for Solid State Drives (SSDs). We describe how the priorities evolved over time as a result of hardware trends and extensive experiences running RocksDB at scale in production at a number of organizations: from optimizing write amplification, to space amplification, to CPU utilization. We describe lessons from running large-scale applications, including that resource allocation needs to be managed across different RocksDB instances, that data formats need to remain backward- and forward-compatible to allow incremental software rollouts, and that appropriate support for database replication and backups is needed. Lessons from failure handling taught us that data corruption errors need to be detected earlier and that data integrity protection mechanisms are needed at every layer of the system. We describe improvements to the key-value interface. We describe a number of efforts that in retrospect proved to be misguided. Finally, we describe a number of open problems that could benefit from future research.

A Large-scale Analysis of Hundreds of In-memory Key-value Cache Clusters at Twitter

ACM Transactions on Storage ◽

10.1145/3468521 ◽

2021 ◽

Vol 17 (3) ◽

pp. 1-35

Author(s):

Juncheng Yang ◽

Yao Yue ◽

K. V. Rashmi

Keyword(s):

Large Scale ◽

Production Systems ◽

Wide Spectrum ◽

Use Cases ◽

Scale Analysis ◽

Business Logic ◽

Traffic Pattern ◽

Fine Grained ◽

Memory Cache ◽

Large Scale Analysis

Modern web services use in-memory caching extensively to increase throughput and reduce latency. There have been several workload analyses of production systems that have fueled research in improving the effectiveness of in-memory caching systems. However, the coverage is still sparse considering the wide spectrum of industrial cache use cases. In this work, we significantly further the understanding of real-world cache workloads by collecting production traces from 153 in-memory cache clusters at Twitter, sifting through over 80 TB of data, and sometimes interpreting the workloads in the context of the business logic behind them. We perform a comprehensive analysis to characterize cache workloads based on traffic pattern, time-to-live (TTL), popularity distribution, and size distribution. A fine-grained view of different workloads uncovers the diversity of use cases: many are far more write-heavy or more skewed than previously shown, and some display unique temporal patterns. We also observe that TTL is an important and sometimes defining parameter of cache working sets. Our simulations show that the ideal replacement strategy in production caches can be surprising; for example, FIFO works best for a large number of workloads.
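A replacement-policy comparison of the kind mentioned in this abstract can be sketched with a minimal trace-driven simulator. The function names and the toy trace are assumptions for illustration; the sketch ignores TTLs and object sizes and is not the authors' simulator.

```python
from collections import OrderedDict, deque


def fifo_hit_ratio(trace, capacity):
    """Hit ratio of a FIFO cache: evict the oldest inserted key, ignore recency."""
    cache, order, hits = set(), deque(), 0
    for key in trace:
        if key in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            cache.discard(order.popleft())  # evict the earliest insertion
        cache.add(key)
        order.append(key)
    return hits / len(trace) if trace else 0.0


def lru_hit_ratio(trace, capacity):
    """Hit ratio of an LRU cache: evict the least recently used key."""
    cache, hits = OrderedDict(), 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
            continue
        if len(cache) >= capacity:
            cache.popitem(last=False)       # evict the least recently used
        cache[key] = True
    return hits / len(trace) if trace else 0.0


if __name__ == "__main__":
    trace = ["a", "b", "a", "c", "a", "b"]  # toy trace, not production data
    print(fifo_hit_ratio(trace, 2), lru_hit_ratio(trace, 2))
```

On this toy trace LRU gets two hits and FIFO gets one; which policy wins depends entirely on the workload, which is the point the abstract makes about production traces.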

Octopus+: An RDMA-Enabled Distributed Persistent Memory File System

ACM Transactions on Storage ◽

10.1145/3448418 ◽

2021 ◽

Vol 17 (3) ◽

pp. 1-25

Author(s):

Bohong Zhu ◽

Youmin Chen ◽

Qing Wang ◽

Youyou Lu ◽

Jiwu Shu

Keyword(s):

High Speed ◽

High Performance ◽

File System ◽

Direct Memory Access ◽

File Systems ◽

Distributed File Systems ◽

Persistent Memory ◽

Memory Modules ◽

Non Volatile Memory ◽

Volatile Memory

Non-volatile memory and remote direct memory access (RDMA) provide extremely high performance in storage and network hardware. However, existing distributed file systems strictly isolate file system and network layers, and the heavy layered software designs leave high-speed hardware under-exploited. In this article, we propose an RDMA-enabled distributed persistent memory file system, Octopus+, to redesign file system internal mechanisms by closely coupling non-volatile memory and RDMA features. For data operations, Octopus+ directly accesses a shared persistent memory pool to reduce memory copying overhead, and actively fetches and pushes data all in clients to rebalance the load between the server and network. For metadata operations, Octopus+ introduces self-identified remote procedure calls for immediate notification between file systems and networking, and an efficient distributed transaction mechanism for consistency. Octopus+ includes a replication feature to provide better availability. Evaluations on Intel Optane DC Persistent Memory Modules show that Octopus+ achieves nearly the raw bandwidth for large I/Os and orders of magnitude better performance than existing distributed file systems.

Published By Association For Computing Machinery
