TiDA: High-Level Programming Abstractions for Data Locality Management

Author(s):  
Didem Unat ◽  
Tan Nguyen ◽  
Weiqun Zhang ◽  
Muhammed Nufail Farooqi ◽  
Burak Bastem ◽  
...  

1999 ◽  
Vol 7 (1) ◽  
pp. 21-37 ◽  
Author(s):  
Balaram Sinharoy

Over the last decade processor speed has increased dramatically, whereas the speed of the memory subsystem has improved at only a modest rate. Due to the increase in cache miss latency (in terms of processor cycles), processors stall on cache misses for a significant portion of their execution time. Multithreaded processors have been proposed in the literature to reduce processor stall time due to cache misses. Although multithreading improves processor utilization, it may also increase cache miss rates, because in a multithreaded processor multiple threads share the same cache, which effectively reduces the cache size available to each individual thread. Increased processor utilization and the increase in the cache miss rate demand higher memory bandwidth. This paper presents a novel compiler optimization method that improves data locality for each thread and enhances data sharing among the threads. The method is based on loop transformation theory and optimizes both spatial and temporal data locality. The created threads exhibit a high level of intra‐thread and inter‐thread data locality, which effectively reduces both the data cache miss rates and the total execution time of numerically intensive computations running on a multithreaded processor.
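
The abstract does not spell out the loop transformations used; loop tiling (blocking) is the classic example of a transformation that improves both spatial and temporal locality, and a minimal sketch of the idea looks like this (names and the tile size are illustrative, not from the paper):

```python
# Hypothetical sketch of loop tiling (blocking): the loop nest is split into
# small tiles so that each tile of B is reused while it is still resident in
# cache, improving temporal locality; the innermost j-loop walks B and C
# contiguously, improving spatial locality.

def matmul_tiled(A, B, n, tile=32):
    """Multiply two n x n matrices (lists of lists) with a tiled loop nest."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]  # scalar kept in a register across the j-loop
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Choosing the tile size so that the working set of one tile fits in cache is what reduces the miss rate; the same tiling can then be distributed across threads so that threads share tiles, which is the inter-thread locality the paper targets.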


2018 ◽  
Vol 30 (2) ◽  
pp. 305-324 ◽  
Author(s):  
Vasiliki Kalavri ◽  
Vladimir Vlassov ◽  
Seif Haridi

1999 ◽  
Vol 7 (1) ◽  
pp. 67-81 ◽  
Author(s):  
Siegfried Benkner

High Performance Fortran (HPF) offers an attractive high‐level language interface for programming scalable parallel architectures, providing the user with directives for the specification of data distribution and delegating to the compiler the task of generating an explicitly parallel program. Available HPF compilers can handle regular codes quite efficiently, but dramatic performance losses may be encountered for applications which are based on highly irregular, dynamically changing data structures and access patterns. In this paper we introduce the Vienna Fortran Compiler (VFC), a new source‐to‐source parallelization system for HPF+, an optimized version of HPF, which addresses the requirements of irregular applications. In addition to extended data distribution and work distribution mechanisms, HPF+ provides the user with language features for specifying certain information that decisively influences a program’s performance. This comprises data locality assertions, non‐local access specifications, and the possibility of reusing runtime‐generated communication schedules of irregular loops. Performance measurements of kernels from advanced applications demonstrate that with a high‐level data parallel language such as HPF+ a performance close to hand‐written message‐passing programs can be achieved even for highly irregular codes.
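
The block data distribution underlying HPF-style DISTRIBUTE directives, and the "owner computes" rule that assigns loop iterations to the processor owning the written element, can be sketched as follows (HPF expresses this declaratively in Fortran; the function names here are illustrative only):

```python
# Hedged sketch of a 1-D BLOCK distribution: each of `nprocs` processors
# owns one contiguous block of an n-element array.

def block_owner(i, n, nprocs):
    """Return the rank that owns element i of an n-element BLOCK-distributed array."""
    block = (n + nprocs - 1) // nprocs  # ceiling division gives the block size
    return i // block

def local_indices(rank, n, nprocs):
    """Indices owned by `rank` -- under owner-computes, also its iteration set."""
    block = (n + nprocs - 1) // nprocs
    lo = rank * block
    hi = min(lo + block, n)
    return range(lo, hi)
```

For irregular accesses this static mapping is not enough, which is why HPF+ adds runtime-generated (and reusable) communication schedules: the set of non-local elements each rank must fetch is only known once the index arrays are inspected at run time.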


2014 ◽  
Author(s):  
Adrian Tate ◽  
Amir Kamil ◽  
Anshu Dubey ◽  
Armin Groblinger ◽  
Brad Chamberlain ◽  
...  

2021 ◽  
Vol 14 (4) ◽  
pp. 1-39
Author(s):  
Yi-Hsiang Lai ◽  
Ecenur Ustun ◽  
Shaojie Xiang ◽  
Zhenman Fang ◽  
Hongbo Rong ◽  
...  

FPGA-based accelerators are increasingly popular across a broad range of applications, because they offer massive parallelism, high energy efficiency, and great flexibility for customization. However, difficulties in programming and integrating FPGAs have hindered their widespread adoption. Since the mid-2000s, there has been extensive research and development toward making FPGAs accessible to software-inclined developers in addition to hardware specialists. Many programming models and automated synthesis tools, such as high-level synthesis, have been proposed to tackle this grand challenge. In this survey, we describe the progression and future prospects of the ongoing journey in significantly improving the software programmability of FPGAs. We first provide a taxonomy of the essential techniques for building a high-performance FPGA accelerator, which requires customization of the compute engines, memory hierarchy, and data representations. We then summarize a rich spectrum of work on programming abstractions and optimizing compilers that provide different trade-offs between performance and productivity. Finally, we highlight several additional challenges and opportunities that deserve extra attention by the community to bring FPGA-based computing to the masses.


2011 ◽  
Vol 14 (1) ◽  
Author(s):  
Gustavo Guevara ◽  
Travis Desell ◽  
Jason LaPorte ◽  
Carlos A. Varela

Effective visualization is critical to developing, analyzing, and optimizing distributed systems. We have developed OverView, a tool for online/offline distributed systems visualization that enables modular layout mechanisms, so that different distributed system high-level programming abstractions such as actors or processes can be visualized in intuitive ways. By default, OverView uses a hierarchical concentric layout that distinguishes entities from containers, allowing migration patterns triggered by adaptive middleware to be visualized. In this paper, we develop a force-directed layout strategy that connects entities according to their communication patterns in order to directly exhibit the application communication topologies. In force-directed visualization, entities' locations are encoded with different colors to illustrate load balancing. We compare these layouts using quantitative metrics, including the communication-to-entity ratio, applied to common distributed application topologies. We conclude that modular visualization is necessary to effectively visualize distributed systems, since no one layout is best for all applications.
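
A force-directed layout works by iterating a simple physical simulation: all entities repel each other, while communicating entities are pulled together by spring-like forces. A minimal sketch of one such iteration (constants and names are illustrative, not OverView's actual implementation) looks like this:

```python
# Hedged sketch of one force-directed layout step: pairwise repulsion plus
# spring attraction along communication edges, in the spirit of
# Fruchterman-Reingold. Entities communicating heavily end up near each other.
import math

def force_step(pos, edges, k=1.0, step=0.01):
    """Return new positions after one iteration.
    pos: {node: (x, y)}, edges: iterable of (u, v) pairs."""
    disp = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)
    for i, u in enumerate(nodes):            # all pairs repel
        for v in nodes[i + 1:]:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9   # avoid division by zero
            f = k * k / d                    # repulsive force magnitude
            disp[u][0] += f * dx / d; disp[u][1] += f * dy / d
            disp[v][0] -= f * dx / d; disp[v][1] -= f * dy / d
    for u, v in edges:                       # connected entities attract
        dx = pos[u][0] - pos[v][0]
        dy = pos[u][1] - pos[v][1]
        d = math.hypot(dx, dy) or 1e-9
        f = d * d / k                        # attractive force magnitude
        disp[u][0] -= f * dx / d; disp[u][1] -= f * dy / d
        disp[v][0] += f * dx / d; disp[v][1] += f * dy / d
    return {v: (pos[v][0] + step * disp[v][0], pos[v][1] + step * disp[v][1])
            for v in pos}
```

Running this step repeatedly until the displacements become small yields the layout; the communication topology then becomes directly visible as spatial proximity.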


2014 ◽  
Vol 24 (03) ◽  
pp. 1441005 ◽  
Author(s):  
Michel Steuwer ◽  
Michael Haidl ◽  
Stefan Breuer ◽  
Sergei Gorlatch

The implementation of stencil computations on modern, massively parallel systems with GPUs and other accelerators currently relies on manually tuned coding using low-level approaches like OpenCL and CUDA. This makes development of stencil applications a complex, time-consuming, and error-prone task. We describe how stencil computations can be programmed in our SkelCL approach that combines high-level programming abstractions with competitive performance on multi-GPU systems. SkelCL extends the OpenCL standard by three high-level features: 1) pre-implemented parallel patterns (a.k.a. skeletons); 2) container data types for vectors and matrices; 3) an automatic data (re)distribution mechanism. We introduce two new SkelCL skeletons which specifically target stencil computations — MapOverlap and Stencil — and we describe their use for particular application examples, discuss their efficient parallel implementation, and report experimental results on systems with multiple GPUs. Our evaluation of three real-world applications shows that stencil code written with SkelCL is considerably shorter and offers competitive performance to hand-tuned OpenCL code.
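
Conceptually, a MapOverlap-style skeleton applies a user function to each element together with a fixed number of neighbors on each side, with the skeleton taking care of boundary handling. A hedged sketch of that programming model (sequential, illustrative only; SkelCL executes such skeletons on GPUs via OpenCL) is:

```python
# Sketch of a MapOverlap-style skeleton on 1-D data: apply f to a window of
# 2*overlap+1 elements centered on each index. Out-of-range accesses are
# clamped to the boundary value, one common boundary-handling policy.

def map_overlap(f, xs, overlap):
    """Apply f to the neighborhood of every element of xs."""
    n = len(xs)
    def at(i):                      # clamped boundary handling
        return xs[min(max(i, 0), n - 1)]
    return [f([at(i + d) for d in range(-overlap, overlap + 1)])
            for i in range(n)]

# A 3-point averaging stencil expressed with the skeleton:
blur = lambda w: sum(w) / len(w)
```

The appeal of the skeleton is exactly what the abstract claims: the user writes only the per-element function (`blur` here), while indexing, boundaries, and, in SkelCL's case, GPU data distribution are handled by the framework.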


2001 ◽  
Vol 11 (04) ◽  
pp. 471-486 ◽  
Author(s):  
Hans-Wolfgang Loidl ◽  
Philip W. Trinder ◽  
Carsten Butz

The performance of data parallel programs often hinges on two key coordination aspects: the computational costs of the parallel tasks relative to their management overhead — task granularity; and the communication costs induced by the distance between tasks and their data — data locality. In data parallel programs both granularity and locality can be improved by clustering, i.e. arranging for parallel tasks to operate on related sub-collections of data. The GPH parallel functional language automatically manages most coordination aspects, but also allows some high-level control of coordination using evaluation strategies. We study the coordination behavior of two typical data parallel programs, and find that while they can be improved by introducing clustering evaluation strategies, further performance improvements can be achieved by restructuring the program. We introduce a new generic Cluster class that allows clustering to be systematically introduced, and improved by program transformation. In contrast to many other parallel program transformation approaches, we transform realistic programs and report performance results on a 32-processor Beowulf cluster. The Cluster class is highly generic and extensible, amenable to reasoning, and avoids conflating computation and coordination aspects of the program.

