Measuring data locality in internal sorting algorithms

Author(s): Seth Bergmann

1989 ◽ Vol 11 (1) ◽ pp. 117-119
Author(s): J. Moscinski ◽ Z.A. Rycerz ◽ P.W.M. Jacobs

2021 ◽ Vol 3 (1)
Author(s): Yue Weng ◽ Xi Zhang ◽ Xiaohu Guo ◽ Xianwei Zhang ◽ Yutong Lu ◽ ...

In the unstructured finite volume method, loops over different mesh components such as cells, faces, and nodes are widely used to traverse data. The mesh loop produces direct or indirect data accesses that significantly affect data locality, and when looping over the mesh, many threads accessing the same data create data dependences. Both data locality and data dependence play an important part in the performance of GPU simulations. To optimize a GPU-accelerated unstructured finite volume Computational Fluid Dynamics (CFD) program, the performance of hot spots under different loops over cells, faces, and nodes is evaluated on Nvidia Tesla V100 and K80 GPUs. Numerical tests at different mesh scales show that the mesh loop mode affects data locality and data dependence in different ways. Specifically, the face loop yields the best data locality whenever face data are accessed in kernels. The cell loop incurs the smallest overheads from non-coalesced data accesses when both cell and node data are used in computation without face data, and it performs best when kernels contain only indirect accesses to cell data. Atomic operations greatly reduce kernel performance on the K80, an effect that is not obvious on the V100. With a suitable mesh loop mode in all kernels, the overall performance of the GPU simulation can be increased by 15%-20%. Finally, the program on a single V100 GPU achieves a maximum speedup of 21.7 and an average speedup of 14.1 compared with 28 MPI tasks on two Intel Xeon Gold 6132 CPUs.
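To make the two traversal modes concrete, here is a minimal Python sketch (illustrative only, not the paper's CUDA code; the mesh arrays and function names are assumed) contrasting a face loop, which scatters each flux into the two adjacent cells and on a GPU would require atomic adds, with a cell loop, which gathers face data through indirect indices and writes only its own cell.

```python
# Illustrative sketch (not the paper's code): two loop modes for accumulating
# face fluxes into cell residuals on an unstructured mesh. On a GPU, each loop
# iteration would map to a thread; the face loop needs atomic adds because two
# faces can update the same cell, while the cell loop only writes its own cell
# but gathers face data through indirect indices.

def face_loop_residual(num_cells, face_cells, face_flux):
    """Scatter: one iteration per face, writes to both adjacent cells."""
    residual = [0.0] * num_cells
    for f, (owner, neighbour) in enumerate(face_cells):
        flux = face_flux[f]
        residual[owner] += flux          # on a GPU: atomicAdd
        if neighbour >= 0:               # -1 marks a boundary face
            residual[neighbour] -= flux  # on a GPU: atomicAdd
    return residual

def cell_loop_residual(num_cells, cell_faces, cell_face_sign, face_flux):
    """Gather: one iteration per cell, reads face data indirectly, no write conflicts."""
    residual = [0.0] * num_cells
    for c in range(num_cells):
        acc = 0.0
        for f, sign in zip(cell_faces[c], cell_face_sign[c]):
            acc += sign * face_flux[f]   # indirect, possibly non-coalesced reads
        residual[c] = acc
    return residual

# Tiny example mesh: 3 cells in a row, 2 interior faces.
face_cells = [(0, 1), (1, 2)]            # (owner, neighbour) per face
face_flux = [1.5, -0.5]
cell_faces = [[0], [0, 1], [1]]
cell_face_sign = [[+1], [-1, +1], [-1]]
assert face_loop_residual(3, face_cells, face_flux) == \
       cell_loop_residual(3, cell_faces, cell_face_sign, face_flux)
```

The scatter/gather trade-off is exactly what the paper measures: the face loop touches face data with unit stride but creates write conflicts that need atomics, while the cell loop avoids conflicts at the cost of indirect, possibly non-coalesced reads.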


1987 ◽ Vol 15 (1) ◽ pp. 226-233
Author(s): Mohamed Salehmohamed ◽ W. S. Luk ◽ Joseph G. Peters

Electronics ◽ 2021 ◽ Vol 10 (15) ◽ pp. 1774
Author(s): Ming-Chin Chuang ◽ Chia-Cheng Yen ◽ Chia-Jui Hung

Recently, with the increase in network bandwidth, various cloud computing applications have become popular, and such networks generate large numbers of data packets. However, most existing network architectures cannot handle big data effectively, so an efficient mechanism is needed to reduce task completion time when large amounts of data are processed in data center networks. Unfortunately, achieving the minimum task completion time in the Hadoop system is an NP-complete problem. Although many studies have proposed schemes for improving network performance, they have shortcomings that degrade their performance. For this reason, this study proposes a centralized solution, the bandwidth-aware rescheduling (BARE) mechanism, for software-defined network (SDN)-based data center networks. BARE improves network performance by employing a prefetching mechanism and a centralized network monitor to collect global information, sorting tasks by data locality, splitting tasks, and executing a rescheduling mechanism with a scheduler to reduce task completion time. Finally, simulations demonstrate the scheme's effectiveness: the results show that it outperforms existing schemes in terms of task completion time and data locality ratio.
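The data-locality-first idea behind such schedulers can be sketched as follows (a hedged illustration, not the BARE algorithm itself; the task, block-location, and bandwidth structures are assumed): a task is placed on a node that already holds its input block when possible, and otherwise on the node with the most free bandwidth, since the block must then cross the network.

```python
# Hedged sketch of a data-locality-first placement policy (not the paper's
# actual BARE algorithm): prefer a node that already holds the task's input
# block; otherwise pick the remote node with the most available bandwidth.

def schedule_tasks(tasks, block_locations, node_bandwidth):
    """tasks: {task_id: block_id}; block_locations: {block_id: set of nodes};
    node_bandwidth: {node: free bandwidth} -- all names are illustrative."""
    assignment = {}
    for task, block in tasks.items():
        local_nodes = block_locations.get(block, set())
        if local_nodes:
            # Data-local placement: no network transfer needed.
            node = max(local_nodes, key=lambda n: node_bandwidth.get(n, 0.0))
        else:
            # Remote placement: choose the node with the most free bandwidth.
            node = max(node_bandwidth, key=node_bandwidth.get)
        assignment[task] = node
    return assignment

print(schedule_tasks(
    tasks={"t1": "b1", "t2": "b2"},
    block_locations={"b1": {"nodeA"}},
    node_bandwidth={"nodeA": 0.4, "nodeB": 0.9},
))  # {'t1': 'nodeA', 't2': 'nodeB'}
```

A real system would additionally consider prefetching, task splitting, and rescheduling of stragglers, as the abstract describes; this sketch only shows the locality-versus-bandwidth placement decision.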


Computing ◽ 1981 ◽ Vol 26 (1) ◽ pp. 1-7
Author(s): L. Devroye ◽ T. Klincsek

2021 ◽ Vol 64 (6) ◽ pp. 107-116
Author(s): Yakun Sophia Shao ◽ Jason Clemons ◽ Rangharajan Venkatesan ◽ Brian Zimmer ◽ Matthew Fojtik ◽ ...

Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
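As an illustration of how tiling can improve inter-chiplet locality, the sketch below (simplified, assumed Python, not Simba's actual mapping or one of its three optimizations) partitions a layer's output channels across chiplets so each chiplet keeps its own weight slice resident in local storage and only activations travel over the package-level interconnect.

```python
# Hedged sketch of the tiling idea (not Simba's actual mapping): split a
# layer's output channels across chiplets so each chiplet keeps its own weight
# slice resident locally, and only the input activations are broadcast over
# the package-level interconnect.

def tile_output_channels(num_out_channels, num_chiplets):
    """Return one (start, end) channel range per chiplet; a plain block partition."""
    base, rem = divmod(num_out_channels, num_chiplets)
    tiles, start = [], 0
    for c in range(num_chiplets):
        size = base + (1 if c < rem else 0)
        tiles.append((start, start + size))
        start += size
    return tiles

# 64 output channels over a 36-chiplet package: each chiplet owns 1-2 channels,
# so its weight slice never moves between chiplets during inference.
print(tile_output_channels(64, 36)[:4])  # [(0, 2), (2, 4), (4, 6), (6, 8)]
```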


2021 ◽ Vol 17 (2) ◽ pp. 1-45
Author(s): Cheng Pan ◽ Xiaolin Wang ◽ Yingwei Luo ◽ Zhenlin Wang

Due to large data volume and low latency requirements of modern web services, the use of an in-memory key-value (KV) cache often becomes an inevitable choice (e.g., Redis and Memcached). The in-memory cache holds hot data, reduces request latency, and alleviates the load on background databases. Inheriting from the traditional hardware cache design, many existing KV cache systems still use recency-based cache replacement algorithms, e.g., least recently used or its approximations. However, the diversity of miss penalty distinguishes a KV cache from a hardware cache. Inadequate consideration of penalty can substantially compromise space utilization and request service time. KV accesses also demonstrate locality, which needs to be coordinated with miss penalty to guide cache management. In this article, we first discuss how to enhance the existing cache model, the Average Eviction Time model, so that it can adapt to modeling a KV cache. After that, we apply the model to Redis and propose pRedis, Penalty- and Locality-aware Memory Allocation in Redis, which synthesizes data locality and miss penalty, in a quantitative manner, to guide memory allocation and replacement in Redis. At the same time, we also explore the diurnal behavior of a KV store and exploit long-term reuse. We replace the original passive eviction mechanism with an automatic dump/load mechanism, to smooth the transition between access peaks and valleys. Our evaluation shows that pRedis effectively reduces the average and tail access latency with minimal time and space overhead. For both real-world and synthetic workloads, our approach delivers an average of 14.0%∼52.3% latency reduction over a state-of-the-art penalty-aware cache management scheme, Hyperbolic Caching (HC), and shows more quantitative predictability of performance. Moreover, we can obtain even lower average latency (1.1%∼5.5%) when dynamically switching policies between pRedis and HC.
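To illustrate how miss penalty can be folded into replacement decisions, here is a small Python sketch (an assumption-laden illustration in the spirit of penalty-aware caching, not pRedis's actual AET-based mechanism): each key's eviction score combines how recently it was used with the measured cost of refetching it, so cold and cheap-to-refetch keys are evicted first.

```python
# Hedged sketch of penalty-aware eviction (not pRedis's actual algorithm):
# instead of evicting purely by recency, score each key by penalty / age, so
# the victim is the key whose future miss would cost the least.

import time

class PenaltyAwareCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}          # key -> value
        self.last_access = {}    # key -> timestamp of last access
        self.miss_penalty = {}   # key -> measured cost (s) to refetch on a miss

    def put(self, key, value, penalty):
        if key not in self.store and len(self.store) >= self.capacity:
            self._evict()
        self.store[key] = value
        self.miss_penalty[key] = penalty
        self.last_access[key] = time.monotonic()

    def get(self, key):
        if key in self.store:
            self.last_access[key] = time.monotonic()
            return self.store[key]
        return None  # caller fetches from the backend and calls put()

    def _evict(self):
        now = time.monotonic()
        # Lower score = colder and cheaper to refetch = better eviction victim.
        victim = min(self.store,
                     key=lambda k: self.miss_penalty[k] / (now - self.last_access[k] + 1e-9))
        for d in (self.store, self.last_access, self.miss_penalty):
            d.pop(victim, None)
```

The abstract's point is that recency alone ignores the penalty dimension; a scheme along these lines keeps expensive-to-recompute keys longer even when they are slightly colder, which is the intuition behind combining locality with miss penalty.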

