Sparbit: a new logarithmic-cost and data locality-aware MPI Allgather algorithm

AbstractIn unstructured finite volume method, loop on different mesh components such as cells, faces, nodes, etc is used widely for the traversal of data. Mesh loop results in direct or indirect data access that affects data locality significantly. By loop on mesh, many threads accessing the same data lead to data dependence. Both data locality and data dependence play an important part in the performance of GPU simulations. For optimizing a GPU-accelerated unstructured finite volume Computational Fluid Dynamics (CFD) program, the performance of hot spots under different loops on cells, faces, and nodes is evaluated on Nvidia Tesla V100 and K80. Numerical tests under different mesh scales show that the effects of mesh loop modes are different on data locality and data dependence. Specifically, face loop makes the best data locality, so long as access to face data exists in kernels. Cell loop brings the smallest overheads due to non-coalescing data access, when both cell and node data are used in computing without face data. Cell loop owns the best performance in the condition that only indirect access of cell data exists in kernels. Atomic operations reduced the performance of kernels largely in K80, which is not obvious on V100. With the suitable mesh loop mode in all kernels, the overall performance of GPU simulations can be increased by 15%-20%. Finally, the program on a single GPU V100 can achieve maximum 21.7 and average 14.1 speed up compared with 28 MPI tasks on two Intel CPUs Xeon Gold 6132.

Download Full-text

Bandwidth-Aware Rescheduling Mechanism in SDN-Based Data Center Networks

Electronics ◽

10.3390/electronics10151774 ◽

2021 ◽

Vol 10 (15) ◽

pp. 1774

Author(s):

Ming-Chin Chuang ◽

Chia-Cheng Yen ◽

Chia-Jui Hung

Keyword(s):

Data Center ◽

Completion Time ◽

Network Performance ◽

Data Locality ◽

Task Completion ◽

Data Center Networks ◽

Task Completion Time ◽

Data Packets ◽

Network Bandwidth ◽

Data Process

Recently, with the increase in network bandwidth, various cloud computing applications have become popular. A large number of network data packets will be generated in such a network. However, most existing network architectures cannot effectively handle big data, thereby necessitating an efficient mechanism to reduce task completion time when large amounts of data are processed in data center networks. Unfortunately, achieving the minimum task completion time in the Hadoop system is an NP-complete problem. Although many studies have proposed schemes for improving network performance, they have shortcomings that degrade their performance. For this reason, in this study, we propose a centralized solution, called the bandwidth-aware rescheduling (BARE) mechanism for software-defined network (SDN)-based data center networks. BARE improves network performance by employing a prefetching mechanism and a centralized network monitor to collect global information, sorting out the locality data process, splitting tasks, and executing a rescheduling mechanism with a scheduler to reduce task completion time. Finally, we used simulations to demonstrate our scheme’s effectiveness. Simulation results show that our scheme outperforms other existing schemes in terms of task completion time and the ratio of data locality.

Download Full-text

Simba

Communications of the ACM ◽

10.1145/3460227 ◽

2021 ◽

Vol 64 (6) ◽

pp. 107-116

Author(s):

Yakun Sophia Shao ◽

Jason Cemons ◽

Rangharajan Venkatesan ◽

Brian Zimmer ◽

Matthew Fojtik ◽

...

Keyword(s):

Deep Learning ◽

Large Scale ◽

Data Locality ◽

Coarse Grained ◽

Batch Size ◽

Peak Performance ◽

Large Scale Systems ◽

High Area ◽

On Chip ◽

And Storage

Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with finegrained chiplets for deep learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.

Download Full-text

Penalty- and Locality-aware Memory Allocation in Redis Using Enhanced AET

ACM Transactions on Storage ◽

10.1145/3447573 ◽

2021 ◽

Vol 17 (2) ◽

pp. 1-45

Author(s):

Cheng Pan ◽

Xiaolin Wang ◽

Yingwei Luo ◽

Zhenlin Wang

Keyword(s):

Large Data ◽

Data Locality ◽

Memory Allocation ◽

Cache Management ◽

Cache Replacement ◽

Time Model ◽

Memory Cache ◽

Management Scheme ◽

Data Volume ◽

Replacement Algorithms

Due to large data volume and low latency requirements of modern web services, the use of an in-memory key-value (KV) cache often becomes an inevitable choice (e.g., Redis and Memcached). The in-memory cache holds hot data, reduces request latency, and alleviates the load on background databases. Inheriting from the traditional hardware cache design, many existing KV cache systems still use recency-based cache replacement algorithms, e.g., least recently used or its approximations. However, the diversity of miss penalty distinguishes a KV cache from a hardware cache. Inadequate consideration of penalty can substantially compromise space utilization and request service time. KV accesses also demonstrate locality, which needs to be coordinated with miss penalty to guide cache management. In this article, we first discuss how to enhance the existing cache model, the Average Eviction Time model, so that it can adapt to modeling a KV cache. After that, we apply the model to Redis and propose pRedis, Penalty- and Locality-aware Memory Allocation in Redis, which synthesizes data locality and miss penalty, in a quantitative manner, to guide memory allocation and replacement in Redis. At the same time, we also explore the diurnal behavior of a KV store and exploit long-term reuse. We replace the original passive eviction mechanism with an automatic dump/load mechanism, to smooth the transition between access peaks and valleys. Our evaluation shows that pRedis effectively reduces the average and tail access latency with minimal time and space overhead. For both real-world and synthetic workloads, our approach delivers an average of 14.0%∼52.3% latency reduction over a state-of-the-art penalty-aware cache management scheme, Hyperbolic Caching (HC), and shows more quantitative predictability of performance. Moreover, we can obtain even lower average latency (1.1%∼5.5%) when dynamically switching policies between pRedis and HC.

Download Full-text

Exposing data locality in HPC-based systems by using the HDFS backend

2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC) ◽

10.1109/hipc50609.2020.00038 ◽

2020 ◽

Author(s):

Jose Rivadeneira ◽

Felix Garcia-Carballeira ◽

Jesus Carretero ◽

Javier Garcia-Blas

Keyword(s):

Data Locality

Download Full-text

Linear Weighted Regression and Energy-Aware Greedy Scheduling for Heterogeneous Big Data

Electronics ◽

10.3390/electronics10050554 ◽

2021 ◽

Vol 10 (5) ◽

pp. 554

Author(s):

Suresh Kallam ◽

Rizwan Patan ◽

Tathapudi V. Ramana ◽

Amir H. Gandomi

Keyword(s):

Big Data ◽

Energy Consumption ◽

Total Energy ◽

Optimal Solution ◽

Data Locality ◽

Weighted Regression ◽

Heterogeneous Environments ◽

Energy Aware ◽

Total Energy Consumption ◽

Greedy Scheduling

Data are presently being produced at an increased speed in different formats, which complicates the design, processing, and evaluation of the data. The MapReduce algorithm is a distributed file system that is used for big data parallel processing. Current implementations of MapReduce assist in data locality along with robustness. In this study, a linear weighted regression and energy-aware greedy scheduling (LWR-EGS) method were combined to handle big data. The LWR-EGS method initially selects tasks for an assignment and then selects the best available machine to identify an optimal solution. With this objective, first, the problem was modeled as an integer linear weighted regression program to choose tasks for the assignment. Then, the best available machines were selected to find the optimal solution. In this manner, the optimization of resources is said to have taken place. Then, an energy efficiency-aware greedy scheduling algorithm was presented to select a position for each task to minimize the total energy consumption of the MapReduce job for big data applications in heterogeneous environments without a significant performance loss. To evaluate the performance, the LWR-EGS method was compared with two related approaches via MapReduce. The experimental results showed that the LWR-EGS method effectively reduced the total energy consumption without producing large scheduling overheads. Moreover, the method also reduced the execution time when compared to state-of-the-art methods. The LWR-EGS method reduced the energy consumption, average processing time, and scheduling overhead by 16%, 20%, and 22%, respectively, compared to existing methods.

Download Full-text

Virtual reuse distance analysis of SPECjvm2008 data locality

Proceedings of the 7th International Conference on Principles and Practice of Programming in Java - PPPJ '09 ◽

10.1145/1596655.1596684 ◽

2009 ◽

Author(s):

Xiaoming Gu ◽

Xiao-Feng Li ◽

Buqi Cheng ◽

Eric Huang

Keyword(s):

Data Locality ◽

Reuse Distance ◽

Distance Analysis

Download Full-text

Data locality enhancement by memory reduction

Proceedings of the 15th international conference on Supercomputing - ICS '01 ◽

10.1145/377792.377806 ◽

2001 ◽

Cited By ~ 30

Author(s):

Yonghong Song ◽

Rong Xu ◽

Cheng Wang ◽

Zhiyuan Li

Keyword(s):

Data Locality ◽

Locality Enhancement ◽

Memory Reduction

Download Full-text

PARALLEL FFT ALGORITHMS ON NETWORK-ON-CHIPS

Journal of Circuits System and Computers ◽

10.1142/s0218126609005046 ◽

2009 ◽

Vol 18 (02) ◽

pp. 255-269 ◽

Cited By ~ 1

Author(s):

JUN HO BAHN ◽

JUNG SOOK YANG ◽

WEN-HSIANG HU ◽

NADER BAGHERZADEH

Keyword(s):

Data Communication ◽

Digital Signal ◽

Variable Number ◽

Data Locality ◽

Communication Traffic ◽

On Chip ◽

Fft Algorithms ◽

Signal Processors ◽

Parallel Fft ◽

Parallel Fft Algorithm

This paper presents parallel FFT algorithms with different degree of computation and communication overheads for multiprocessors in a Network-on-Chip (NoC) environment. Of the three parallel FFT algorithms presented in this paper, we propose two parallel FFT algorithms for a 2D NoC that can contain a variable number of processing elements (PEs) and one is a reference parallel FFT algorithm for comparison. A parallel FFT algorithm we propose increases performance by assigning well-balanced computation tasks to PEs. The execution times are reduced because the algorithm uses data locality well to avoid unnecessary data exchanges among PEs and removes the overall idle periods by2 a balanced task scheduling. An enhanced version of this algorithm is suggested in which communication traffic is reduced. In this algorithm, returning transformed data to an original PE after one computation stage before sending them to a next PE for the following stage is removed. Instead, we propose a method that enables to keep regularity of the data communication and computations with twiddle factors. According to the simulation result from our cycle-accurate SystemC NoC model with a parametrizable 2-D mesh architecture, and the analysis of the algorithms in time and complexity, our proposed algorithms are shown to outperform the reference parallel FFT algorithm and FFT implementations on TI Digital Signal Processors (DSPs) that have similar specifications to our simulation environment.

Download Full-text

LolliRAM: A Cross-Layer Design to Exploit Data Locality in Oblivious RAM

10.1109/dac18074.2021.9586126 ◽

2021 ◽

Author(s):

Yi Wang ◽

Weixuan Chen ◽

Xianhua Wang ◽

Rui Mao

Keyword(s):

Data Locality ◽

Cross Layer ◽

Cross Layer Design ◽

Oblivious Ram

Download Full-text