TiDA: High-Level Programming Abstractions for Data Locality Management

Author(s):  
Didem Unat ◽  
Tan Nguyen ◽  
Weiqun Zhang ◽  
Muhammed Nufail Farooqi ◽  
Burak Bastem ◽  
...  

1999 ◽  
Vol 7 (1) ◽  
pp. 21-37 ◽  
Author(s):  
Balaram Sinharoy

Over the last decade processor speed has increased dramatically, whereas the speed of the memory subsystem has improved at only a modest rate. Due to the increase in cache miss latency (in terms of processor cycles), processors stall on cache misses for a significant portion of their execution time. Multithreaded processors have been proposed in the literature to reduce processor stall time due to cache misses. Although multithreading improves processor utilization, it may also increase cache miss rates, because in a multithreaded processor multiple threads share the same cache, which effectively reduces the cache size available to each individual thread. Increased processor utilization and the increase in the cache miss rate demand higher memory bandwidth. This paper presents a novel compiler optimization method that improves data locality for each thread and enhances data sharing among the threads. The method is based on loop transformation theory and optimizes both spatial and temporal data locality. The created threads exhibit a high level of intra‐thread and inter‐thread data locality, which effectively reduces both the data cache miss rates and the total execution time of numerically intensive computations running on a multithreaded processor.
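
The abstract does not spell out the loop transformations used; loop tiling (blocking) is the classic example of a transformation that improves both spatial and temporal locality, and a minimal sketch of the idea looks like this (names and the tile size are illustrative, not from the paper):

```python
# Hypothetical sketch of loop tiling (blocking): the loop nest is split into
# small tiles so that each tile of B is reused while it is still resident in
# cache, improving temporal locality; the innermost j-loop walks B and C
# contiguously, improving spatial locality.

def matmul_tiled(A, B, n, tile=32):
    """Multiply two n x n matrices (lists of lists) with a tiled loop nest."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]  # scalar kept in a register across the j-loop
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Choosing the tile size so that the working set of one tile fits in cache is what reduces the miss rate; the same tiling can then be distributed across threads so that threads share tiles, which is the inter-thread locality the paper targets.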


2018 ◽  
Vol 30 (2) ◽  
pp. 305-324 ◽  
Author(s):  
Vasiliki Kalavri ◽  
Vladimir Vlassov ◽  
Seif Haridi

1999 ◽  
Vol 7 (1) ◽  
pp. 67-81 ◽  
Author(s):  
Siegfried Benkner

High Performance Fortran (HPF) offers an attractive high‐level language interface for programming scalable parallel architectures, providing the user with directives for the specification of data distribution and delegating to the compiler the task of generating an explicitly parallel program. Available HPF compilers can handle regular codes quite efficiently, but dramatic performance losses may be encountered for applications which are based on highly irregular, dynamically changing data structures and access patterns. In this paper we introduce the Vienna Fortran Compiler (VFC), a new source‐to‐source parallelization system for HPF+, an optimized version of HPF, which addresses the requirements of irregular applications. In addition to extended data distribution and work distribution mechanisms, HPF+ provides the user with language features for specifying certain information that decisively influences a program’s performance. This comprises data locality assertions, non‐local access specifications, and the possibility of reusing runtime‐generated communication schedules of irregular loops. Performance measurements of kernels from advanced applications demonstrate that with a high‐level data parallel language such as HPF+ a performance close to hand‐written message‐passing programs can be achieved even for highly irregular codes.
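
The block data distribution underlying HPF-style DISTRIBUTE directives, and the "owner computes" rule that assigns loop iterations to the processor owning the written element, can be sketched as follows (HPF expresses this declaratively in Fortran; the function names here are illustrative only):

```python
# Hedged sketch of a 1-D BLOCK distribution: each of `nprocs` processors
# owns one contiguous block of an n-element array.

def block_owner(i, n, nprocs):
    """Return the rank that owns element i of an n-element BLOCK-distributed array."""
    block = (n + nprocs - 1) // nprocs  # ceiling division gives the block size
    return i // block

def local_indices(rank, n, nprocs):
    """Indices owned by `rank` -- under owner-computes, also its iteration set."""
    block = (n + nprocs - 1) // nprocs
    lo = rank * block
    hi = min(lo + block, n)
    return range(lo, hi)
```

For irregular accesses this static mapping is not enough, which is why HPF+ adds runtime-generated (and reusable) communication schedules: the set of non-local elements each rank must fetch is only known once the index arrays are inspected at run time.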


2014 ◽  
Author(s):  
Adrian Tate ◽  
Amir Kamil ◽  
Anshu Dubey ◽  
Armin Groblinger ◽  
Brad Chamberlain ◽  
...  

2021 ◽  
Vol 14 (4) ◽  
pp. 1-39
Author(s):  
Yi-Hsiang Lai ◽  
Ecenur Ustun ◽  
Shaojie Xiang ◽  
Zhenman Fang ◽  
Hongbo Rong ◽  
...  

FPGA-based accelerators are increasingly popular across a broad range of applications, because they offer massive parallelism, high energy efficiency, and great flexibility for customization. However, difficulties in programming and integrating FPGAs have hindered their widespread adoption. Since the mid-2000s, there has been extensive research and development toward making FPGAs accessible to software-inclined developers in addition to hardware specialists. Many programming models and automated synthesis tools, such as high-level synthesis, have been proposed to tackle this grand challenge. In this survey, we describe the progression and future prospects of the ongoing journey in significantly improving the software programmability of FPGAs. We first provide a taxonomy of the essential techniques for building a high-performance FPGA accelerator, which requires customization of the compute engines, memory hierarchy, and data representations. We then summarize a rich spectrum of work on programming abstractions and optimizing compilers that provide different trade-offs between performance and productivity. Finally, we highlight several additional challenges and opportunities that deserve extra attention by the community to bring FPGA-based computing to the masses.


2011 ◽  
Vol 14 (1) ◽  
Author(s):  
Gustavo Guevara ◽  
Travis Desell ◽  
Jason LaPorte ◽  
Carlos A. Varela

Effective visualization is critical to developing, analyzing, and optimizing distributed systems. We have developed OverView, a tool for online/offline distributed systems visualization that enables modular layout mechanisms, so that different distributed system high-level programming abstractions such as actors or processes can be visualized in intuitive ways. By default, OverView uses a hierarchical concentric layout that distinguishes entities from containers, allowing migration patterns triggered by adaptive middleware to be visualized. In this paper, we develop a force-directed layout strategy that connects entities according to their communication patterns in order to directly exhibit the application communication topologies. In force-directed visualization, entities' locations are encoded with different colors to illustrate load balancing. We compare these layouts using quantitative metrics, including the communication-to-entity ratio, applied to common distributed application topologies. We conclude that modular visualization is necessary to effectively visualize distributed systems, since no one layout is best for all applications.
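
A force-directed layout works by iterating a simple physical simulation: all entities repel each other, while communicating entities are pulled together by spring-like forces. A minimal sketch of one such iteration (constants and names are illustrative, not OverView's actual implementation) looks like this:

```python
# Hedged sketch of one force-directed layout step: pairwise repulsion plus
# spring attraction along communication edges, in the spirit of
# Fruchterman-Reingold. Entities communicating heavily end up near each other.
import math

def force_step(pos, edges, k=1.0, step=0.01):
    """Return new positions after one iteration.
    pos: {node: (x, y)}, edges: iterable of (u, v) pairs."""
    disp = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)
    for i, u in enumerate(nodes):            # all pairs repel
        for v in nodes[i + 1:]:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9   # avoid division by zero
            f = k * k / d                    # repulsive force magnitude
            disp[u][0] += f * dx / d; disp[u][1] += f * dy / d
            disp[v][0] -= f * dx / d; disp[v][1] -= f * dy / d
    for u, v in edges:                       # connected entities attract
        dx = pos[u][0] - pos[v][0]
        dy = pos[u][1] - pos[v][1]
        d = math.hypot(dx, dy) or 1e-9
        f = d * d / k                        # attractive force magnitude
        disp[u][0] -= f * dx / d; disp[u][1] -= f * dy / d
        disp[v][0] += f * dx / d; disp[v][1] += f * dy / d
    return {v: (pos[v][0] + step * disp[v][0], pos[v][1] + step * disp[v][1])
            for v in pos}
```

Running this step repeatedly until the displacements become small yields the layout; the communication topology then becomes directly visible as spatial proximity.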


2014 ◽  
Vol 24 (03) ◽  
pp. 1441005 ◽  
Author(s):  
Michel Steuwer ◽  
Michael Haidl ◽  
Stefan Breuer ◽  
Sergei Gorlatch

The implementation of stencil computations on modern, massively parallel systems with GPUs and other accelerators currently relies on manually tuned coding using low-level approaches like OpenCL and CUDA. This makes development of stencil applications a complex, time-consuming, and error-prone task. We describe how stencil computations can be programmed in our SkelCL approach that combines high-level programming abstractions with competitive performance on multi-GPU systems. SkelCL extends the OpenCL standard by three high-level features: 1) pre-implemented parallel patterns (a.k.a. skeletons); 2) container data types for vectors and matrices; 3) an automatic data (re)distribution mechanism. We introduce two new SkelCL skeletons which specifically target stencil computations — MapOverlap and Stencil — and we describe their use for particular application examples, discuss their efficient parallel implementation, and report experimental results on systems with multiple GPUs. Our evaluation of three real-world applications shows that stencil code written with SkelCL is considerably shorter and offers competitive performance to hand-tuned OpenCL code.
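
Conceptually, a MapOverlap-style skeleton applies a user function to each element together with a fixed number of neighbors on each side, with the skeleton taking care of boundary handling. A hedged sketch of that programming model (sequential, illustrative only; SkelCL executes such skeletons on GPUs via OpenCL) is:

```python
# Sketch of a MapOverlap-style skeleton on 1-D data: apply f to a window of
# 2*overlap+1 elements centered on each index. Out-of-range accesses are
# clamped to the boundary value, one common boundary-handling policy.

def map_overlap(f, xs, overlap):
    """Apply f to the neighborhood of every element of xs."""
    n = len(xs)
    def at(i):                      # clamped boundary handling
        return xs[min(max(i, 0), n - 1)]
    return [f([at(i + d) for d in range(-overlap, overlap + 1)])
            for i in range(n)]

# A 3-point averaging stencil expressed with the skeleton:
blur = lambda w: sum(w) / len(w)
```

The appeal of the skeleton is exactly what the abstract claims: the user writes only the per-element function (`blur` here), while indexing, boundaries, and, in SkelCL's case, GPU data distribution are handled by the framework.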


2001 ◽  
Vol 11 (04) ◽  
pp. 471-486 ◽  
Author(s):  
Hans-Wolfgang Loidl ◽  
Philip W. Trinder ◽  
Carsten Butz

The performance of data parallel programs often hinges on two key coordination aspects: the computational costs of the parallel tasks relative to their management overhead — task granularity; and the communication costs induced by the distance between tasks and their data — data locality. In data parallel programs both granularity and locality can be improved by clustering, i.e. arranging for parallel tasks to operate on related sub-collections of data. The GPH parallel functional language automatically manages most coordination aspects, but also allows some high-level control of coordination using evaluation strategies. We study the coordination behavior of two typical data parallel programs, and find that while they can be improved by introducing clustering evaluation strategies, further performance improvements can be achieved by restructuring the program. We introduce a new generic Cluster class that allows clustering to be systematically introduced, and improved by program transformation. In contrast to many other parallel program transformation approaches, we transform realistic programs and report performance results on a 32-processor Beowulf cluster. The Cluster class is highly generic and extensible, amenable to reasoning, and avoids conflating computation and coordination aspects of the program.

