Memory Access Behavior Analysis of NUMA-Based Shared Memory Programs

2002 ◽  
Vol 10 (1) ◽  
pp. 45-53 ◽  
Author(s):  
Jie Tao ◽  
Wolfgang Karl ◽  
Martin Schulz

Shared memory applications running transparently on top of NUMA architectures often face severe performance problems due to bad data locality and excessive remote memory accesses. Optimizations with respect to data locality are therefore necessary, but require a fundamental understanding of an application's memory access behavior. The information necessary for this cannot be obtained using simple code instrumentation due to the implicit nature of the communication handled by the NUMA hardware, the large amount of traffic produced at runtime, and the fine access granularity in shared memory codes. In this paper an approach to overcome these problems and thereby to enable an easy and efficient optimization process is presented. Based on a low-level hardware monitoring facility in coordination with a comprehensive visualization tool, it enables the generation of memory access histograms capable of showing all memory accesses across the complete address space of an application's working set. This information can be used to identify access hot spots, to understand the dynamic behavior of shared memory applications, and to optimize applications using an application specific data layout resulting in significant performance improvements.
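
As a rough illustration of the kind of memory access histogram the abstract describes, the sketch below bins a trace of accesses by page and by issuing node, then flags pages with many remote accesses. The trace format, the page size, the `page_home` mapping, and both function names are assumptions made for this example; the paper gathers its data with a hardware monitor, not with software traces.

```python
from collections import Counter, defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes


def access_histogram(trace, page_size=PAGE_SIZE):
    """Count accesses per page and per node from (node, address) records."""
    hist = defaultdict(Counter)
    for node, address in trace:
        hist[address // page_size][node] += 1
    return hist


def remote_hot_spots(hist, page_home, threshold):
    """Return {page: remote_count} for pages with at least `threshold` remote accesses."""
    hot = {}
    for page, per_node in hist.items():
        home = page_home[page]
        remote = sum(count for node, count in per_node.items() if node != home)
        if remote >= threshold:
            hot[page] = remote
    return hot


# Hypothetical trace of (issuing node, byte address) pairs.
trace = [(0, 0), (1, 8), (1, 16), (1, 4096), (0, 4100), (1, 24)]
hist = access_histogram(trace)
page_home = {0: 0, 1: 1}  # assumed home node of each page
hot = remote_hot_spots(hist, page_home, threshold=2)
```

In this toy trace, page 0 lives on node 0 but is accessed mostly from node 1, so it is the one page reported as a hot spot and would be a candidate for relocation.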

2000 ◽  
Vol 8 (3) ◽  
pp. 143-162 ◽  
Author(s):  
Dimitrios S. Nikolopoulos ◽  
Theodore S. Papatheodorou ◽  
Constantine D. Polychronopoulos ◽  
Jesús Labarta ◽  
Eduard Ayguadé

This paper makes two important contributions. First, the paper investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur modest performance losses. Second, the paper presents a transparent, user-level page migration engine with an ability to gain back any performance loss that stems from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration for implementing implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that it is not necessary to introduce data distribution directives in OpenMP, thereby preserving the simplicity and portability of the programming model.
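
To make the placement schemes mentioned above concrete, the following sketch compares a round-robin page placement with a blocked placement that matches the access pattern, measuring the fraction of accesses that go to a remote node. It is an assumption-laden toy model, not the paper's runtime: the function names, the (node, page) access format, and the example workload are all hypothetical, and page migration itself is not modeled.

```python
def round_robin_placement(num_pages, num_nodes):
    """Assign page i to node i mod num_nodes."""
    return [page % num_nodes for page in range(num_pages)]


def remote_fraction(accesses, placement):
    """Fraction of (node, page) accesses whose page is homed on another node."""
    if not accesses:
        return 0.0
    remote = sum(1 for node, page in accesses if placement[page] != node)
    return remote / len(accesses)


# Hypothetical workload: node 0 works on pages 0-1, node 1 on pages 2-3.
accesses = [(0, 0), (0, 1), (1, 2), (1, 3)]

rr = round_robin_placement(num_pages=4, num_nodes=2)
blocked = [0, 0, 1, 1]  # placement that matches the access pattern

rr_remote = remote_fraction(accesses, rr)
blocked_remote = remote_fraction(accesses, blocked)
```

Here round-robin leaves half of the accesses remote while the matching placement leaves none; how much that costs in practice depends on the remote-to-local latency ratio, which is the point the abstract makes.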


1997 ◽  
Vol 25 (2) ◽  
pp. 157-169 ◽  
Author(s):  
Leonidas Kontothanassis ◽  
Galen Hunt ◽  
Robert Stets ◽  
Nikolaos Hardavellas ◽  
Michał Cierniak ◽  
...  

Author(s):  
Mingjie Lin ◽  
Juan Escobedo

High-level synthesis (HLS) with FPGA can achieve significant performance improvements through effective memory partitioning and meticulous data reuse. In this chapter, the authors will first explore techniques that have been adopted directly from systems that possess a fixed memory subsystem such as CPUs and GPUs (Section 2). Section 3 will focus on techniques that have been developed specifically for reconfigurable architectures which generate custom memory subsystems to take advantage of the peculiarities of a family of affine code called stencil code. The authors will focus on techniques that exploit memory banking to allow for parallel, conflict-free memory accesses in Section 3.1 and techniques that generate an optimal memory micro-architecture for data reuse in Section 3.2. Finally, Section 4 will explore the technique handling code still belonging to the affine family but the relative distance between the addresses.


2014 ◽  
Vol 22 (2) ◽  
pp. 75-91 ◽  
Author(s):  
Robert Gerstenberger ◽  
Maciej Besta ◽  
Torsten Hoefler

Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communications despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.


2000 ◽  
Vol 10 (01n02) ◽  
pp. 1-22
Author(s):  
BAHMAN S. MOTLAGH ◽  
RONALD F. DeMARA

Analytical models were developed and simulations of memory latency were performed for Uniform Memory Access (UMA), Non-Uniform Memory Access (NUMA), Local-Remote-Global (LRG), and RCR architectures for hit rates from 0.1 to 0.9 in steps of 0.1, memory access times of 10 to 100 ns, proportions of read/write access from 0.01 to 0.1, and block sizes of 8 to 64 words. The RCR architecture provides favorable performance over UMA and NUMA architectures for all ranges of application and system parameters. RCR outperforms LRG architectures when the hit rates of the processor cache exceed 80% and replicated memory exceeds 25%. Thus, inclusion of a small replicated memory at each processor significantly reduces expected access time since all replicated memory hits become independent of global traffic. For configurations of up to 32 processors, results show that latency is further reduced by distinguishing burst-mode transfers between isolated memory accesses and those which are incrementally outside the working set.
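
The intuition behind the replicated-memory result can be sketched with a textbook-style expected access time calculation. This is not the analytical model from the paper: the formula, the function name, and all parameter values below are assumptions chosen only to show why serving some cache misses from a local replica lowers the average latency.

```python
def expected_access_time(hit_rate, t_cache, t_local, t_global, replicated_fraction):
    """Average access time when a share of cache misses hits local replicated memory.

    Assumed model: T = h * t_cache + (1 - h) * (r * t_local + (1 - r) * t_global)
    """
    miss_rate = 1.0 - hit_rate
    t_miss = replicated_fraction * t_local + (1.0 - replicated_fraction) * t_global
    return hit_rate * t_cache + miss_rate * t_miss


# Hypothetical timings in nanoseconds.
without_replica = expected_access_time(0.8, 1.0, 10.0, 100.0, replicated_fraction=0.0)
with_replica = expected_access_time(0.8, 1.0, 10.0, 100.0, replicated_fraction=0.25)
```

Under these assumed numbers, a replicated fraction of 0.25 reduces the average access time because a quarter of the misses no longer pay the global memory latency.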


Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1639
Author(s):  
Seungmin Jung ◽  
Jihoon Moon ◽  
Sungwoo Park ◽  
Eenjun Hwang

Recently, multistep-ahead prediction has attracted much attention in electric load forecasting because it can deal with sudden changes in power consumption, caused by events such as fires and heat waves, over the day following the present time. On the other hand, recurrent neural networks (RNNs), including long short-term memory and gated recurrent unit (GRU) networks, can reflect the previous point well to predict the current point. Due to this property, they have been widely used for multistep-ahead prediction. The GRU model is simple and easy to implement; however, its prediction performance is limited because it considers all input variables equally. In this paper, we propose a short-term load forecasting model using an attention-based GRU to focus more on the crucial variables and demonstrate that this can achieve significant performance improvements, especially when the input sequence of the RNN is long. Through extensive experiments, we show that the proposed model outperforms other recent multistep-ahead prediction models in building-level power consumption forecasting.


Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5748
Author(s):  
Zhibo Zhang ◽  
Qing Chang ◽  
Na Zhao ◽  
Chen Li ◽  
Tianrun Li

The future development of communication systems will create a great demand for the internet of things (IOT), where the overall control of all IOT nodes will become an important problem. Considering the essential issues of miniaturization and energy conservation, in this study, a new data downlink system is designed in which all IOT nodes harvest energy first and then receive data. To avoid the unsolvable problem of pre-locating all positions of vast IOT nodes, a device called the power and data beacon (PDB) is proposed. This acts as a relay station for energy and data. In addition, we model future scenes in which a communication system is assisted by unmanned aerial vehicles (UAVs), large intelligent surfaces (LISs), and PDBs. In this paper, we propose and solve the problem of determining the optimal flight trajectory to reach the minimum energy consumption or minimum time consumption. Four future feasible scenes are analyzed and then the optimization problems are solved based on numerical algorithms. Simulation results show that there are significant performance improvements in energy/time with the deployment of LISs and reasonable UAV trajectory planning.

