EFFECTIVENESS OF COMPILER-DIRECTED PREFETCHING ON DATA MINING BENCHMARKS

2012 ◽  
Vol 21 (02) ◽  
pp. 1240006 ◽  
Author(s):  
RAGAVENDRA NATARAJAN ◽  
VINEETH MEKKAT ◽  
WEI-CHUNG HSU ◽  
ANTONIA ZHAI

For today's increasingly power-constrained multicore systems, integrating simpler and more energy-efficient in-order cores becomes attractive. However, since in-order processors lack complex hardware support for tolerating long-latency memory accesses, developing compiler technologies to hide such latencies becomes critical. Compiler-directed prefetching has been demonstrated to be effective on some applications. On the application side, a large class of data-centric applications has emerged to explore the underlying properties of explosively growing data. These applications, in contrast to traditional benchmarks, are characterized by substantial thread-level parallelism, complex and unpredictable control flow, and intensive, irregular memory access patterns. They are expected to be the dominant workloads on future microprocessors. Thus, in this paper, we investigate the effectiveness of compiler-directed prefetching on data mining applications running on in-order multicore systems. Our study reveals that although properly inserted prefetch instructions can often effectively reduce memory access latencies for data mining applications, the compiler is not always able to exploit this potential. Compiler-directed prefetching can become inefficient in the presence of complex control flow, complex memory access patterns, and architecture-dependent behaviors. The integration of multithreaded execution onto a single die makes it even more difficult for the compiler to insert prefetch instructions, since optimizations that are effective for single-threaded execution may or may not be effective in multithreaded execution. Thus, compiler-directed prefetching must be judiciously deployed to avoid creating performance bottlenecks that would not otherwise exist. Our experience suggests that dynamic performance tuning techniques that adjust to the behavior of a program can facilitate the deployment of aggressive optimizations in data mining applications.
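To make the mechanism concrete, the sketch below shows what compiler-directed prefetching amounts to at the source level for an indexed gather of the kind found in data mining kernels, using the GCC/Clang __builtin_prefetch intrinsic. The kernel, array names, and prefetch distance are illustrative assumptions, not taken from the paper's benchmarks.

```cpp
#include <cstddef>

// Illustrative sketch (not the paper's benchmarks): prefetch instructions are
// inserted ahead of an indexed gather so the cache miss overlaps with useful
// work. DIST is a tuning parameter; as the abstract notes, a distance that
// helps single-threaded execution may not help multithreaded execution.
constexpr std::size_t DIST = 16;  // assumed prefetch distance in iterations

double gather_sum(const double* values, const int* index, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + DIST < n) {
            // Hint the hardware to start loading values[index[i + DIST]] now.
            __builtin_prefetch(&values[index[i + DIST]], /*rw=*/0, /*locality=*/1);
        }
        sum += values[index[i]];
    }
    return sum;
}
```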

2021 ◽  
pp. 054-062
Author(s):  
D.V. Rahozin ◽  
A.Yu. Doroshenko

Modern workloads, whether parallel or sequential, usually suffer from insufficient memory and computing performance. Common ways to improve workload performance include the use of complex functional units or coprocessors, which not only provide accelerated computation but can also fetch data from memory independently, generating complex address patterns with or without support for control flow operations. Such coprocessors are usually not targeted by optimizing compilers and must be driven by hand through special application interfaces. On the other hand, memory bottlenecks can be avoided by proper use of processor prefetch capabilities, which load data ahead of its actual use; prefetching, too, is automated only for simple cases, so programmers usually have to apply it by hand. As workloads migrate rapidly to embedded applications, the question arises of how to exploit all hardware capabilities to speed up a workload with moderate effort. This requires a precise analysis of memory access patterns at run time and the identification of hot spots where the bulk of memory accesses is issued. A precise memory access model can be obtained with simulators such as Valgrind, which can run genuinely large workloads, for example neural network inference, in reasonable time. However, simulators and hardware performance analyzers cannot attribute the full set of memory references and cache misses to particular modules, since that requires analyzing the program call graph. We extend the Valgrind cache simulator to account for memory accesses per software module and to produce a realistic distribution of hot spots in a program. In addition, analysis of address sequences in the simulator makes it possible to recover array access patterns and to propose effective prefetching schemes. Motivating examples illustrate the use of the tool.
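As a rough illustration of the kind of pattern recovery mentioned above, the sketch below scans a recorded address sequence for a dominant constant stride, which could then be turned into a prefetching suggestion. The function name and the threshold are assumptions of this sketch, not part of the actual Valgrind extension.

```cpp
#include <cstdint>
#include <vector>

// Sketch of stride recovery over a recorded address sequence (not the actual
// Valgrind extension). Returns the dominant stride if a single constant-stride
// pattern covers most of the trace, or 0 if no regular pattern is found.
std::int64_t dominant_stride(const std::vector<std::uint64_t>& addrs) {
    if (addrs.size() < 2) return 0;
    std::int64_t best_stride = 0, prev_delta = 0;
    std::size_t best_count = 0, run = 1;
    for (std::size_t i = 1; i < addrs.size(); ++i) {
        std::int64_t delta = static_cast<std::int64_t>(addrs[i]) -
                             static_cast<std::int64_t>(addrs[i - 1]);
        run = (delta == prev_delta) ? run + 1 : 1;
        if (run > best_count) { best_count = run; best_stride = delta; }
        prev_delta = delta;
    }
    // Assumed threshold: the pattern counts as regular only if it covers at
    // least half of the recorded accesses.
    return (best_count * 2 >= addrs.size()) ? best_stride : 0;
}
```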


Author(s):  
Eduardo H. M. Cruz ◽  
Matthias Diener ◽  
Laércio L. Pilla ◽  
Philippe O. A. Navaux

Current and future architectures rely on thread-level parallelism to sustain performance growth. These architectures introduce a complex memory hierarchy, with cores organized hierarchically across multiple cache levels and NUMA nodes. Such memory hierarchies affect the performance and energy efficiency of parallel applications because they increase the importance of memory access locality. In order to improve locality, analyzing the memory access behavior of parallel applications is critical for mapping threads and data. Nevertheless, most previous work relies on indirect information about the memory accesses, or does not combine thread and data mapping, resulting in less accurate mappings. In this paper, we propose the Sharing-Aware Memory Management Unit (SAMMU), an extension to the memory management unit that allows it to detect memory access behavior in hardware. With this information, the operating system can perform online mapping without any prior knowledge about the behavior of the application. In an evaluation with a wide range of parallel applications (NAS Parallel Benchmarks and the PARSEC Benchmark Suite), performance improved by up to 35.7% (10.0% on average) and energy efficiency improved by up to 11.9% (4.1% on average). These improvements stem from a substantial reduction in cache misses and interconnection traffic.
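The following sketch is a simplified software model of the policy such hardware enables: per-page, per-node access counters that an operating system could consult to place a page on the node that touches it most. The node count, the sampling of accesses, and the data structures are assumptions of this sketch and do not reflect SAMMU's actual hardware design.

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

// Simplified model of sharing-aware data mapping: count, per page, how often
// each NUMA node accesses it, then report the node with the most accesses as
// the preferred placement. Purely illustrative, not the SAMMU hardware.
constexpr int kNodes = 4;  // assumed number of NUMA nodes

struct PageStats {
    std::array<std::uint64_t, kNodes> accesses{};  // per-node access counters
};

class SharingTracker {
public:
    // Called on a (sampled) memory access issued from a thread on `node`.
    void record(std::uint64_t page, int node) { stats_[page].accesses[node]++; }

    // Preferred NUMA node for a page: the one that accessed it most often.
    int preferred_node(std::uint64_t page) const {
        auto it = stats_.find(page);
        if (it == stats_.end()) return 0;
        int best = 0;
        for (int n = 1; n < kNodes; ++n)
            if (it->second.accesses[n] > it->second.accesses[best]) best = n;
        return best;
    }

private:
    std::unordered_map<std::uint64_t, PageStats> stats_;
};
```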


PLoS ONE ◽  
2021 ◽  
Vol 16 (9) ◽  
pp. e0257047
Author(s):  
Adrián Lamela ◽  
Óscar G. Ossorio ◽  
Guillermo Vinuesa ◽  
Benjamín Sahelices

Non-volatile memory technology is now available in commodity hardware. This technology can be used as backup memory for an external DRAM cache without modifying the software. However, the higher read and write latencies of non-volatile memory may exacerbate the memory wall problem. In this work we present a novel off-chip prefetch technique based on a Hidden Markov Model that specifically addresses the latency problem caused by the complexity of off-chip memory access patterns. First, we present a thorough analysis of off-chip memory access patterns in multicore processors to characterize their complexity. Based on this study, we propose a prefetching module located at the LLC which uses two small tables and whose computational complexity is linear in the number of computing threads. Our Markov-based technique is able to track and cluster several simultaneous groups of memory accesses coming from multiple concurrent threads in a multicore processor. It can quickly identify complex address groups and trigger prefetches with very high accuracy. Our simulations show an improvement of up to 76% in the hit ratio of an off-chip DRAM cache for a multicore architecture over the conventional prefetch technique (G/DC). The overhead of prefetch requests (failed prefetches) is also reduced by 48% in single-core simulations and by 83% in multicore simulations.
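For context, the sketch below shows a minimal first-order Markov prefetch table that predicts the next off-chip address from the current one. The proposed module instead uses a Hidden Markov Model and two small per-thread tables with clustering, so this is only an illustration of the underlying Markov-prediction idea; the confidence thresholds are arbitrary assumptions.

```cpp
#include <cstdint>
#include <unordered_map>

// Minimal first-order Markov prefetch table: learns "address A is usually
// followed by address B" transitions and issues a prefetch candidate once a
// transition has been seen repeatedly. Illustrative only.
class MarkovPrefetchTable {
public:
    // Observe an off-chip access; returns a prefetch candidate for the
    // address that usually follows `addr`, or 0 if there is no confident
    // prediction.
    std::uint64_t observe(std::uint64_t addr) {
        if (has_last_) {
            Entry& e = table_[last_];
            if (e.next == addr) {
                if (e.confidence < 3) ++e.confidence;  // saturating counter
            } else if (e.confidence > 0) {
                --e.confidence;
            } else {
                e.next = addr;                         // replace prediction
            }
        }
        last_ = addr;
        has_last_ = true;
        auto it = table_.find(addr);
        return (it != table_.end() && it->second.confidence >= 2) ? it->second.next : 0;
    }

private:
    struct Entry { std::uint64_t next = 0; int confidence = 0; };
    std::unordered_map<std::uint64_t, Entry> table_;
    std::uint64_t last_ = 0;
    bool has_last_ = false;
};
```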


2020 ◽  
Vol 76 (4) ◽  
pp. 3129-3154
Author(s):  
Juan Fang ◽  
Mengxuan Wang ◽  
Zelin Wei

Multiple CPUs and GPUs are integrated on the same chip and share memory, so access requests from different cores interfere with each other. Memory requests from the GPU seriously interfere with CPU memory access performance. Requests from multiple CPUs also become intertwined when accessing memory, and their performance is greatly affected. In addition, differences in access latency between GPU cores increase the average memory access latency. To address these problems in the shared memory of heterogeneous multi-core systems, we propose a step-by-step memory scheduling strategy that improves system performance. The strategy first creates separate memory request queues based on the request source when the memory controller receives a request, isolating CPU requests from GPU requests and thereby preventing GPU requests from interfering with CPU requests. Then, for the CPU request queue, a dynamic bank partitioning strategy is applied, which maps applications to different bank sets according to their memory characteristics and eliminates memory request interference among CPU applications without sacrificing bank-level parallelism. Finally, for the GPU request queue, criticality is introduced to measure the difference in memory access latency between cores. On top of the first-ready, first-come-first-served (FR-FCFS) policy, we implement criticality-aware memory scheduling to balance the locality and criticality of application accesses.
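A toy model of the first and last scheduling steps is sketched below: on arrival a request is routed to a CPU or a GPU queue, and GPU requests are then ordered by a per-core criticality score. The class structure, field names, the CPU-first service order, and the use of accumulated latency as the criticality metric are assumptions of this sketch rather than the paper's exact formulation (which also layers FR-FCFS and dynamic bank partitioning).

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Toy model of source-based queue separation plus criticality-aware picking.
struct Request {
    std::uint64_t addr;
    int core_id;    // for GPU requests, index into the per-core latency table
    bool from_gpu;
};

class StepwiseScheduler {
public:
    explicit StepwiseScheduler(int gpu_cores) : gpu_latency_(gpu_cores, 0) {}

    // Step 1: separate queues by request source so GPU traffic cannot sit in
    // front of CPU requests.
    void enqueue(const Request& r) {
        if (r.from_gpu) gpu_queue_.push_back(r);
        else            cpu_queue_.push_back(r);
    }

    // In this simplified model CPU requests are served first; among GPU
    // requests, the core with the highest accumulated latency (the most
    // "critical" one) is served first.
    bool pick(Request& out) {
        if (!cpu_queue_.empty()) { out = cpu_queue_.front(); cpu_queue_.pop_front(); return true; }
        if (gpu_queue_.empty()) return false;
        std::size_t best = 0;
        for (std::size_t i = 1; i < gpu_queue_.size(); ++i)
            if (gpu_latency_[gpu_queue_[i].core_id] > gpu_latency_[gpu_queue_[best].core_id])
                best = i;
        out = gpu_queue_[best];
        gpu_queue_.erase(gpu_queue_.begin() + best);
        return true;
    }

    // Feedback used as the criticality score for a GPU core.
    void report_latency(int gpu_core, std::uint64_t cycles) { gpu_latency_[gpu_core] += cycles; }

private:
    std::deque<Request>        cpu_queue_;
    std::vector<Request>       gpu_queue_;
    std::vector<std::uint64_t> gpu_latency_;
};
```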


2020 ◽  
Vol 23 (3) ◽  
pp. 473-493
Author(s):  
Nikita Andreevich Kataev ◽  
Alexander Andreevich Smirnov ◽  
Andrey Dmitrievich Zhukov

The use of pointers and indirect memory accesses in a program, as well as complex control flow, are among the main weaknesses of static program analysis. The program properties this analysis derives are too conservative to describe program behavior accurately, and hence they prevent parallel execution of the program. Applying dynamic analysis allows us to expand the capabilities of semi-automatic parallelization. In the SAPFOR system (System FOR Automated Parallelization), a dynamic analysis tool has been implemented based on instrumentation of the LLVM representation of the analyzed program, which allows the system to examine programs in both the C and Fortran programming languages. The static analysis implemented in SAPFOR is used to reduce the run-time overhead of the instrumented program while maintaining the completeness of the analysis: it reduces the number of instrumented memory accesses and allows scalar variables that can be examined statically to be ignored. The developed tool was tested on benchmarks from the NAS Parallel Benchmarks suite for the C and Fortran languages. In addition to the traditional kinds of data dependences (flow, anti, output), the dynamic analysis determines privatizable variables and the possibility of pipelined execution of loops. Together with the capabilities of DVM and OpenMP, this greatly facilitates program parallelization and simplifies the insertion of the appropriate compiler directives.
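As an illustration of the kind of information such instrumentation collects, the sketch below tracks loop-carried flow, anti, and output dependences through per-address shadow state updated on every instrumented load and store. It is a conceptual model under assumed callback names, not SAPFOR's actual implementation.

```cpp
#include <cstdint>
#include <unordered_map>

// Conceptual sketch of loop-carried dependence detection via shadow state.
// on_read/on_write stand in for the callbacks an instrumented program would
// invoke with the accessed address and the current loop iteration number.
struct Shadow {
    std::int64_t last_write_iter = -1;
    std::int64_t last_read_iter  = -1;
};

struct DepFlags {
    bool flow   = false;  // read after a write from an earlier iteration
    bool anti   = false;  // write after a read from an earlier iteration
    bool output = false;  // write after a write from an earlier iteration
};

class DependenceTracker {
public:
    void on_read(std::uint64_t addr, std::int64_t iter) {
        Shadow& s = shadow_[addr];
        if (s.last_write_iter >= 0 && s.last_write_iter < iter) deps_.flow = true;
        s.last_read_iter = iter;
    }
    void on_write(std::uint64_t addr, std::int64_t iter) {
        Shadow& s = shadow_[addr];
        if (s.last_read_iter  >= 0 && s.last_read_iter  < iter) deps_.anti   = true;
        if (s.last_write_iter >= 0 && s.last_write_iter < iter) deps_.output = true;
        s.last_write_iter = iter;
    }
    const DepFlags& result() const { return deps_; }

private:
    std::unordered_map<std::uint64_t, Shadow> shadow_;
    DepFlags deps_;
};
```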

