synchronization overhead
Recently Published Documents


TOTAL DOCUMENTS: 36 (FIVE YEARS: 3)

H-INDEX: 6 (FIVE YEARS: 1)

2021 ◽  
Vol 12 (1) ◽  
pp. 292
Author(s):  
Yunyong Ko ◽  
Sang-Wook Kim

The recent unprecedented success of deep learning (DL) in various fields is underpinned by its use of large-scale data and models. Training a large-scale deep neural network (DNN) model with large-scale data, however, is time-consuming. To speed up the training of massive DNN models, data-parallel distributed training based on the parameter server (PS) has been widely applied. In general, synchronous PS-based training suffers from synchronization overhead, especially in heterogeneous environments. To reduce this overhead, asynchronous PS-based training employs asynchronous communication between the PS and workers, so that the PS processes each worker's request independently, without waiting. Despite its performance improvement, however, asynchronous training inevitably incurs differences among the local models of workers, and such differences may slow model convergence. To address this problem, in this work we propose a novel asynchronous PS-based training algorithm, SHAT, that considers (1) the scale of distributed training and (2) the heterogeneity among workers to successfully reduce the differences among the local models of workers. An extensive empirical evaluation demonstrates that (1) the model trained by SHAT converges to an accuracy up to 5.22% higher than that of state-of-the-art algorithms, and (2) the model convergence of SHAT is robust under various heterogeneous environments.
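The core contrast in the abstract, a PS that applies each worker's gradient immediately instead of waiting at a barrier, can be sketched in a few lines. This is a toy illustration of asynchronous PS updates in general, not of SHAT itself; the class, dimensions, and learning rate are all illustrative.

```python
import threading

# Minimal sketch of an asynchronous parameter server (PS): each worker's
# gradient is applied as soon as it arrives, with no barrier waiting for
# slower workers. All names here are hypothetical, not from the paper.
class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.params = [0.0] * dim
        self.lr = lr
        self.lock = threading.Lock()  # protects only the shared parameters

    def push(self, grad):
        # Applied immediately -- this independence is what removes the
        # synchronization overhead of the synchronous scheme.
        with self.lock:
            self.params = [p - self.lr * g for p, g in zip(self.params, grad)]

    def pull(self):
        with self.lock:
            return list(self.params)

ps = ParameterServer(dim=2)

def worker(grads):
    for g in grads:
        ps.push(g)      # send gradient without waiting for other workers
        _ = ps.pull()   # fetch the (possibly stale) global model

threads = [threading.Thread(target=worker, args=([[1.0, 0.0]] * 5,)),
           threading.Thread(target=worker, args=([[0.0, 1.0]] * 5,))]
for t in threads: t.start()
for t in threads: t.join()
print(ps.params)  # each coordinate decreased by 5 * lr = 0.5
```

Because each worker pulls whatever model is current at that moment, workers generally see different (stale) parameters, which is exactly the local-model divergence the abstract says SHAT is designed to reduce.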


Author(s):  
Shunsuke A. Sato

Abstract We develop a numerical Brillouin-zone integration scheme for real-time propagation of electronic systems with time-dependent density functional theory. The scheme is based on decomposing a large simulation into a set of small, independent simulations. We examine the performance of the decomposition scheme in both the linear and nonlinear regimes by computing the linear optical properties of bulk silicon and high-order harmonic generation. Decomposing a large simulation into a set of independent simulations can improve the efficiency of parallel computation by reducing communication and synchronization overhead and by enhancing the portability of simulations to relatively small cluster machines.
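The decomposition idea, a Brillouin-zone integral split into per-k-point runs that never communicate until a final sum, can be sketched with a toy integrand standing in for the real TDDFT propagation. The function names and the observable are illustrative assumptions, not the paper's method.

```python
import math

# Sketch of decomposing a Brillouin-zone integral into independent per-k
# simulations whose results are combined only at the end.
def propagate_single_k(k):
    # Each k-point evolves independently -- no communication or
    # synchronization with other k-points during propagation.
    return math.cos(k) ** 2   # toy per-k contribution to an observable

def brillouin_zone_average(nk):
    ks = [2 * math.pi * i / nk for i in range(nk)]
    # Trivially parallel: every term could run on a separate small machine.
    return sum(propagate_single_k(k) for k in ks) / nk

print(brillouin_zone_average(64))  # → 0.5 for this toy integrand
```

Since each per-k run is self-contained, the set of runs can be spread across several small clusters rather than requiring one large tightly coupled machine, which is the portability benefit the abstract mentions.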


Sensors ◽  
2020 ◽  
Vol 20 (11) ◽  
pp. 3231 ◽  
Author(s):  
Shirin Tahmasebi ◽  
Mohadeseh Safi ◽  
Somayeh Zolfi ◽  
Mohammad Reza Maghsoudi ◽  
Hamid Reza Faragardi ◽  
...  

Due to reliability and performance considerations, employing multiple software-defined networking (SDN) controllers is known to be a promising technique in Wireless Sensor Networks (WSNs). Nevertheless, employing multiple controllers increases the inter-controller synchronization overhead. Optimally placing SDN controllers to maximize the performance of a WSN, subject to a maximum number of controllers determined by the synchronization overhead, is therefore a challenging research problem. In this paper, we first formulate this problem as an optimization problem; to solve it, we then propose the Cuckoo Placement of Controllers (Cuckoo-PC) algorithm. Cuckoo-PC is based on the Cuckoo optimization algorithm, a nature-inspired meta-heuristic that seeks the global optimum by imitating the brood parasitism of some cuckoo species. To evaluate the performance of Cuckoo-PC, we compare it against two state-of-the-art methods, namely Simulated Annealing (SA) and Quantum Annealing (QA). The experiments demonstrate that Cuckoo-PC outperforms both SA and QA in terms of network performance, lowering the average distance between sensors and controllers by up to 13% and 9%, respectively. Comparing our method against Integer Linear Programming (ILP) reveals that Cuckoo-PC achieves approximately the same results (less than 1% deviation) in a noticeably shorter time.
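The quantity the placement algorithm optimizes, the average distance from each sensor to its nearest controller, is easy to state in code. This is only the objective function on toy 2-D coordinates; the cuckoo search itself and the paper's actual distance model are omitted.

```python
# Sketch of the placement objective: average distance from each sensor to
# its nearest controller. Coordinates and placements are illustrative.
def avg_sensor_controller_distance(sensors, controllers):
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    # Each sensor is served by whichever controller is closest to it.
    return sum(min(dist(s, c) for c in controllers) for s in sensors) / len(sensors)

sensors = [(0, 0), (0, 2), (4, 0), (4, 2)]
placement_a = [(0, 1), (4, 1)]   # one controller near each sensor pair
placement_b = [(2, 1), (2, 1)]   # both controllers clustered in the middle

print(avg_sensor_controller_distance(sensors, placement_a))  # → 1.0
print(avg_sensor_controller_distance(sensors, placement_b))  # larger: ~2.24
```

A meta-heuristic such as Cuckoo-PC explores candidate controller placements and keeps those that lower this average, subject to the cap on the number of controllers.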


Electronics ◽  
2018 ◽  
Vol 7 (12) ◽  
pp. 359 ◽  
Author(s):  
Xing Su ◽  
Fei Lei

The Basic Linear Algebra Subprograms (BLAS) is a fundamental numerical software library, and GEneral Matrix Multiply (GEMM) is the most important computational kernel routine in the BLAS library. On multi-core and many-core processors, the whole workload of GEMM is partitioned and scheduled across multiple threads to exploit the parallel hardware. Generally, the workload is partitioned equally among threads, and all threads are expected to finish their work in roughly the same time. However, this is not the case on Non-Uniform Memory Access (NUMA) architectures: the NUMA effect may cause threads to run at different speeds, and the overall execution time of GEMM is determined by the slowest thread. In this paper, we propose a hybrid-grained dynamic load-balancing method that mitigates the NUMA effect by allowing fast threads to steal work from slow ones. We evaluate the proposed method on Phytium 2000+, an emerging 64-core high-performance processor based on Arm's AArch64 architecture. Results show that our method reduces synchronization overhead by 51.5% and improves GEMM performance by 1.9%.
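The work-stealing idea described above, an even static partition of output blocks plus stealing from slower threads' queues, can be sketched as follows. The queue layout and block granularity are hypothetical; a real GEMM would use lock-free per-thread deques and compute actual matrix tiles.

```python
from collections import deque
import threading

# Sketch of dynamic load balancing for a tiled GEMM: blocks are first
# partitioned evenly, then a thread that drains its own queue steals
# from the tail of another thread's queue instead of idling.
def make_queues(n_blocks, n_threads):
    queues = [deque() for _ in range(n_threads)]
    for b in range(n_blocks):
        queues[b % n_threads].append(b)  # even static partition to start
    return queues

def run(queues, tid, done, lock):
    while True:
        with lock:
            if queues[tid]:
                block = queues[tid].popleft()    # take own work first
            else:
                victims = [q for q in queues if q]
                if not victims:
                    return                       # nothing left anywhere
                block = victims[0].pop()         # steal from a slower thread
        done.append(block)  # stand-in for computing one output tile

queues = make_queues(n_blocks=16, n_threads=4)
done, lock = [], threading.Lock()
threads = [threading.Thread(target=run, args=(queues, t, done, lock))
           for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(done))  # all 16 blocks processed exactly once
```

Because every dequeue happens under the lock, each block is computed exactly once even when a fast thread finishes early and starts stealing, which is how the approach bounds the time spent waiting on the slowest thread.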

