Memory Access Behavior Analysis of NUMA-Based Shared Memory Programs

Shared memory applications running transparently on top of NUMA architectures often face severe performance problems due to bad data locality and excessive remote memory accesses. Optimizations with respect to data locality are therefore necessary, but require a fundamental understanding of an application's memory access behavior. The information necessary for this cannot be obtained using simple code instrumentation due to the implicit nature of the communication handled by the NUMA hardware, the large amount of traffic produced at runtime, and the fine access granularity in shared memory codes. In this paper an approach to overcome these problems and thereby to enable an easy and efficient optimization process is presented. Based on a low-level hardware monitoring facility in coordination with a comprehensive visualization tool, it enables the generation of memory access histograms capable of showing all memory accesses across the complete address space of an application's working set. This information can be used to identify access hot spots, to understand the dynamic behavior of shared memory applications, and to optimize applications using an application specific data layout resulting in significant performance improvements.
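
As an illustration of the memory access histograms described in this abstract, the following is a minimal sketch and not the authors' tool. It assumes a software trace of `(node_id, address)` records in place of the hardware monitor's output; the 4 KiB page size and the names `access_histogram`, `remote_hot_spots`, and `home_of` are hypothetical.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)


def access_histogram(trace, page_size=PAGE_SIZE):
    """Count accesses per (page, node) from (node_id, address) records."""
    hist = Counter()
    for node, addr in trace:
        hist[(addr // page_size, node)] += 1
    return hist


def remote_hot_spots(hist, home_of, top=3):
    """Rank pages by the number of accesses issued from non-home nodes."""
    remote = Counter()
    for (page, node), count in hist.items():
        if home_of(page) != node:
            remote[page] += count
    return remote.most_common(top)


# Toy trace: every page is assumed to live on node 0.
trace = [(0, 0x0000), (1, 0x0010), (1, 0x0020),
         (0, 0x1000), (1, 0x1008), (1, 0x2000)]
hist = access_histogram(trace)
hot = remote_hot_spots(hist, home_of=lambda page: 0)
print(hot)
```

Pages at the top of the ranking are candidates for relocation to the node that accesses them most.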

A Transparent Runtime Data Distribution Engine for OpenMP

Scientific Programming ◽

10.1155/2000/417570 ◽

2000 ◽

Vol 8 (3) ◽

pp. 143-162 ◽

Cited By ~ 4

Author(s):

Dimitrios S. Nikolopoulos ◽

Theodore S. Papatheodorou ◽

Constantine D. Polychronopoulos ◽

Jesús Labarta ◽

Eduard Ayguadé

Keyword(s):

High Performance ◽

Programming Model ◽

Data Distribution ◽

Data Locality ◽

Remote Memory ◽

Main Body ◽

Performance Loss ◽

Page Migration ◽

Runtime Environment ◽

Memory Accesses

This paper makes two important contributions. First, the paper investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur modest performance losses. Second, the paper presents a transparent, user-level page migration engine with an ability to gain back any performance loss that stems from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration for implementing implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that data distribution directives need not be introduced in OpenMP, so the simplicity and portability of the programming model can be preserved.
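
The interplay of round-robin page placement and page migration mentioned in this abstract can be sketched with a toy model. This is not the authors' runtime engine: the access trace, the "move each page to its dominant accessor" policy, and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def round_robin_placement(num_pages, num_nodes):
    """Assign page p to node p mod num_nodes."""
    return {page: page % num_nodes for page in range(num_pages)}


def remote_fraction(placement, accesses):
    """Fraction of (node, page) accesses that hit a page on another node."""
    remote = sum(1 for node, page in accesses if placement[page] != node)
    return remote / len(accesses)


def migrate_to_dominant_node(placement, accesses):
    """Return a new placement with each page moved to its most frequent accessor."""
    counts = defaultdict(Counter)
    for node, page in accesses:
        counts[page][node] += 1
    new_placement = dict(placement)
    for page, per_node in counts.items():
        new_placement[page] = per_node.most_common(1)[0][0]
    return new_placement


# Hypothetical trace from one iteration of an iterative program: (node, page).
accesses = [(0, 0), (0, 0), (1, 0), (0, 1), (0, 1),
            (1, 2), (1, 2), (1, 3), (0, 3), (1, 3)]
placement = round_robin_placement(num_pages=4, num_nodes=2)
before = remote_fraction(placement, accesses)
after = remote_fraction(migrate_to_dominant_node(placement, accesses), accesses)
print(before, after)
```

In an iterative program the access pattern observed in one iteration tends to repeat, which is why a placement derived from one iteration can reduce remote accesses in later ones.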

VM-based shared memory on low-latency, remote-memory-access networks

ACM SIGARCH Computer Architecture News ◽

10.1145/384286.264163 ◽

1997 ◽

Vol 25 (2) ◽

pp. 157-169 ◽

Cited By ~ 6

Author(s):

Leonidas Kontothanassis ◽

Galen Hunt ◽

Robert Stets ◽

Nikolaos Hardavellas ◽

Michał Cierniak ◽

...

Keyword(s):

Shared Memory ◽

Access Networks ◽

Memory Access ◽

Remote Memory ◽

Low Latency ◽

Remote Memory Access

FPGA Memory Optimization in High-Level Synthesis

Advances in Systems Analysis, Software Engineering, and High Performance Computing - FPGA Algorithms and Applications for the Internet of Things ◽

10.4018/978-1-5225-9806-0.ch003 ◽

2020 ◽

pp. 51-81

Author(s):

Mingjie Lin ◽

Juan Escobedo

Keyword(s):

Data Reuse ◽

High Level Synthesis ◽

Reconfigurable Architectures ◽

Relative Distance ◽

Memory Optimization ◽

Performance Improvements ◽

Memory Subsystem ◽

Significant Performance ◽

Memory Accesses ◽

High Level

High-level synthesis (HLS) with FPGA can achieve significant performance improvements through effective memory partitioning and meticulous data reuse. In this chapter, the authors will first explore techniques that have been adopted directly from systems that possess a fixed memory subsystem such as CPUs and GPUs (Section 2). Section 3 will focus on techniques that have been developed specifically for reconfigurable architectures which generate custom memory subsystems to take advantage of the peculiarities of a family of affine code called stencil code. The authors will focus on techniques that exploit memory banking to allow for parallel, conflict-free memory accesses in Section 3.1 and techniques that generate an optimal memory micro-architecture for data reuse in Section 3.2. Finally, Section 4 will explore the technique handling code still belonging to the affine family but the relative distance between the addresses.
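
The idea of memory banking for conflict-free parallel accesses can be illustrated with a small check. The sketch below assumes a simple cyclic partitioning (`address mod num_banks`) and a one-dimensional three-point stencil; both are illustrative choices and not necessarily the schemes covered in the chapter.

```python
def bank_of(address, num_banks):
    """Cyclic partitioning: consecutive addresses map to consecutive banks."""
    return address % num_banks


def conflict_free(offsets, num_banks, iterations=range(1, 100)):
    """True if, in every iteration i, all accesses i + offset hit distinct banks."""
    for i in iterations:
        banks = [bank_of(i + off, num_banks) for off in offsets]
        if len(set(banks)) != len(banks):
            return False
    return True


stencil = (-1, 0, 1)  # a[i-1], a[i], a[i+1] read in the same cycle
print(conflict_free(stencil, num_banks=2))
print(conflict_free(stencil, num_banks=3))
```

With two banks, `a[i-1]` and `a[i+1]` always share a bank, so the three reads cannot proceed in parallel; with three banks the reads always fall in distinct banks.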

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

Scientific Programming ◽

10.1155/2014/571902 ◽

2014 ◽

Vol 22 (2) ◽

pp. 75-91 ◽

Cited By ~ 12

Author(s):

Robert Gerstenberger ◽

Maciej Besta ◽

Torsten Hoefler

Keyword(s):

Message Passing ◽

Direct Memory Access ◽

Memory Access ◽

Remote Memory ◽

Memory Consumption ◽

Performance Models ◽

Application Performance ◽

Performance Improvements ◽

Programming Interface ◽

Better Than

Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communication despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have yet to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.

PERFORMANCE OF SCALABLE SHARED-MEMORY ARCHITECTURES

Journal of Circuits, Systems and Computers ◽

10.1142/s0218126600000068 ◽

2000 ◽

Vol 10 (01n02) ◽

pp. 1-22

Author(s):

BAHMAN S. MOTLAGH ◽

RONALD F. DeMARA

Keyword(s):

Memory Access ◽

Analytical Models ◽

Access Time ◽

System Parameters ◽

Hit Rates ◽

Working Set ◽

Memory Accesses ◽

Memory Architectures ◽

Block Sizes ◽

Shared Memory Architectures

Analytical models were developed and simulations of memory latency were performed for Uniform Memory Access (UMA), Non-Uniform Memory Access (NUMA), Local-Remote-Global (LRG), and RCR architectures for hit rates from 0.1 to 0.9 in steps of 0.1, memory access times of 10 to 100 ns, proportions of read/write access from 0.01 to 0.1, and block sizes of 8 to 64 words. The RCR architecture provides favorable performance over UMA and NUMA architectures for all ranges of application and system parameters. RCR outperforms LRG architectures when the hit rate of the processor cache exceeds 80% and replicated memory exceeds 25%. Thus, inclusion of a small replicated memory at each processor significantly reduces expected access time since all replicated memory hits become independent of global traffic. For configurations of up to 32 processors, results show that latency is further reduced by distinguishing burst-mode transfers between isolated memory accesses and those which are incrementally outside the working set.
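
The abstract does not give the analytical models themselves. As a generic illustration of how hit rate and access time combine into an expected access time, the sketch below uses the textbook two-level weighted average; it is an assumption, not the paper's UMA, NUMA, LRG, or RCR model, and the parameter names `t_hit_ns` and `t_miss_ns` are hypothetical.

```python
def expected_access_time(hit_rate, t_hit_ns, t_miss_ns):
    """Two-level average: hits cost t_hit_ns, misses cost t_miss_ns."""
    return hit_rate * t_hit_ns + (1.0 - hit_rate) * t_miss_ns


def sweep(t_hit_ns, t_miss_ns):
    """Evaluate the model for hit rates 0.1 to 0.9 in steps of 0.1."""
    return [(k / 10, expected_access_time(k / 10, t_hit_ns, t_miss_ns))
            for k in range(1, 10)]


# Illustrative latencies chosen from the 10 to 100 ns range quoted above.
table = sweep(t_hit_ns=10.0, t_miss_ns=100.0)
for hit_rate, latency in table:
    print(f"hit rate {hit_rate:.1f}: {latency:.1f} ns")
```

Under this simple model the expected latency falls linearly as the hit rate rises, which is the intuition behind serving more accesses from memory close to the processor.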

VM-based shared memory on low-latency, remote-memory-access networks

Proceedings of the 24th annual international symposium on Computer architecture - ISCA '97 ◽

10.1145/264107.264163 ◽

1997 ◽

Cited By ~ 20

Author(s):

Leonidas Kontothanassis ◽

Galen Hunt ◽

Robert Stets ◽

Nikolaos Hardavellas ◽

Michał Cierniak ◽

...

Keyword(s):

Shared Memory ◽

Access Networks ◽

Memory Access ◽

Remote Memory ◽

Low Latency ◽

Remote Memory Access

Visualizing the Memory Access Behavior of Shared Memory Applications on NUMA Architectures

Computational Science - ICCS 2001 - Lecture Notes in Computer Science ◽

10.1007/3-540-45718-6_91 ◽

2001 ◽

pp. 861-870 ◽

Cited By ~ 7

Author(s):

Jie Tao ◽

Wolfgang Karl ◽

Martin Schulz

Keyword(s):

Shared Memory ◽

Memory Access ◽

Memory Applications

An Attention-Based Multilayer GRU Model for Multistep-Ahead Short-Term Load Forecasting

Sensors ◽

10.3390/s21051639 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1639

Author(s):

Seungmin Jung ◽

Jihoon Moon ◽

Sungwoo Park ◽

Eenjun Hwang

Keyword(s):

Power Consumption ◽

Prediction Models ◽

Short Term Memory ◽

Load Forecasting ◽

Input Sequence ◽

Short Term ◽

Performance Improvements ◽

Short Term Load Forecasting ◽

Significant Performance ◽

Input Variables

Recently, multistep-ahead prediction has attracted much attention in electric load forecasting because it can deal with sudden changes in power consumption caused by various events, such as fires and heat waves, over the day following the present time. On the other hand, recurrent neural networks (RNNs), including long short-term memory and gated recurrent unit (GRU) networks, can reflect the previous point well to predict the current point. Due to this property, they have been widely used for multistep-ahead prediction. The GRU model is simple and easy to implement; however, its prediction performance is limited because it considers all input variables equally. In this paper, we propose a short-term load forecasting model using an attention-based GRU to focus more on the crucial variables and demonstrate that this can achieve significant performance improvements, especially when the input sequence of the RNN is long. Through extensive experiments, we show that the proposed model outperforms other recent multistep-ahead prediction models in building-level power consumption forecasting.

Data Downlink System in the Vast IOT Node Condition Assisted by UAV, Large Intelligent Surface, and Power and Data Beacon

Sensors ◽

10.3390/s20205748 ◽

2020 ◽

Vol 20 (20) ◽

pp. 5748

Author(s):

Zhibo Zhang ◽

Qing Chang ◽

Na Zhao ◽

Chen Li ◽

Tianrun Li

Keyword(s):

Communication Systems ◽

Optimization Problems ◽

Minimum Energy ◽

Numerical Algorithms ◽

Flight Trajectory ◽

Performance Improvements ◽

Minimum Energy Consumption ◽

Significant Performance ◽

Downlink System ◽

The Internet Of Things

The future development of communication systems will create a great demand for the internet of things (IOT), where the overall control of all IOT nodes will become an important problem. Considering the essential issues of miniaturization and energy conservation, in this study, a new data downlink system is designed in which all IOT nodes harvest energy first and then receive data. To avoid the unsolvable problem of pre-locating the positions of a vast number of IOT nodes, a device called the power and data beacon (PDB) is proposed. This acts as a relay station for energy and data. In addition, we model future scenes in which a communication system is assisted by unmanned aerial vehicles (UAVs), large intelligent surfaces (LISs), and PDBs. In this paper, we propose and solve the problem of determining the optimal flight trajectory to reach the minimum energy consumption or minimum time consumption. Four feasible future scenes are analyzed, and the optimization problems are then solved using numerical algorithms. Simulation results show that there are significant performance improvements in energy/time with the deployment of LISs and reasonable UAV trajectory planning.
