First Time Miss : Low Overhead Mitigation for Shared Memory Cache Side Channels

Given the computing industry trend of increasing processing capacity by adding more cores to a chip, the focus of this work is tuning the performance of a staple visualization algorithm, raycasting volume rendering, for shared-memory parallelism on multi-core CPUs and many-core GPUs. Our approach is to vary tunable algorithmic settings, along with known algorithmic optimizations and two different memory layouts, and measure performance in terms of absolute runtime and L2 memory cache misses. Our results indicate there is a wide variation in runtime performance on all platforms, as much as 254% for the tunable parameters we test on multi-core CPUs and 265% on many-core GPUs, and the optimal configurations vary across platforms, often in a non-obvious way. For example, our results indicate the optimal configurations on the GPU occur at a crossover point between those that maintain good cache utilization and those that saturate computational throughput. This result is likely to be extremely difficult to predict with an empirical performance model for this particular algorithm because it has an unstructured memory access pattern that varies locally for individual rays and globally for the selected viewpoint. Our results also show that optimal parameters on modern architectures are markedly different from those in previous studies run on older architectures. In addition, given the dramatic performance variation across platforms for both optimal algorithm settings and performance results, there is a clear benefit for production visualization and analysis codes to adopt a strategy for performance optimization through auto-tuning. These benefits will likely become more pronounced in the future as the number of cores per chip and the cost of moving data through the memory hierarchy both increase.

Download Full-text

WHERE DOES THE SPEEDUP GO: QUANTITATIVE MODELING OF PERFORMANCE LOSSES IN SHARED-MEMORY PROGRAMS

Parallel Processing Letters ◽

10.1142/s0129626400000226 ◽

2000 ◽

Vol 10 (02n03) ◽

pp. 227-238 ◽

Cited By ~ 4

Author(s):

SEON WOOK KIM ◽

RUDOLF EIGENMANN

Keyword(s):

Shared Memory ◽

Code Generation ◽

Quantitative Information ◽

Component Model ◽

Optimal Program ◽

Performance Loss ◽

Software And Hardware ◽

Model Components ◽

First Time ◽

Hardware Counters

Even fully parallel shared-memory program sections may perform significantly below the ideal speedup of P on P processors. Relatively little quantitative information is available about the sources of such inefficiencies. In this paper we present a speedup component model that is able to fully account for sources of performance loss in parallel program sections. The model categorizes the gap between measured and ideal speedup into the four components memory stalls, processor stalls, code overhead, and thread management overhead. These model components are measured based on hardware counters and timers, with which programs are instrumented automatically by our compiler. The speedup component model allows us, for the first time, to quantitatively state the reasons for less-than-optimal program performance, on a program section basis. The overhead components are chosen such that they can be associated directly with software and hardware techniques that may improve performance. Although general, our model is especially suited for the analysis of loop-oriented programs, such as those written in the OpenMP API. We have applied this model to compare three parallel code generation schemes for the Polaris parallelizing compiler. It helps us answer questions such as, what sources of inefficiencies are present in compiler-parallelized programs. To discuss the question we have also implemented an alternative, thread-based code generation method.

Download Full-text

Concurrent Multifrontal Methods: Shared Memory, Cache, and Frontwidth Issues

The International Journal of Supercomputing Applications ◽

10.1177/109434208700100304 ◽

1987 ◽

Vol 1 (3) ◽

pp. 26-44 ◽

Cited By ~ 21

Author(s):

R.E. Benner ◽

G.R. Montry ◽

G.G. Weigand ◽

Iain Duff

Keyword(s):

Shared Memory ◽

Memory Cache ◽

Multifrontal Methods

Download Full-text

Renumbering strategies for unstructured-grid solvers operating on shared-memory, cache-based parallel machines

10.2514/6.1997-2045 ◽

1997 ◽

Cited By ~ 3

Author(s):

Rainald Loehner ◽

Rainald Loehner

Keyword(s):

Shared Memory ◽

Parallel Machines ◽

Unstructured Grid ◽

Memory Cache

Download Full-text

Renumbering strategies for unstructured-grid solvers operating on shared-memory, cache-based parallel machines

Computer Methods in Applied Mechanics and Engineering ◽

10.1016/s0045-7825(98)00005-x ◽

1998 ◽

Vol 163 (1-4) ◽

pp. 95-109 ◽

Cited By ~ 71

Author(s):

Rainald Löhner

Keyword(s):

Shared Memory ◽

Parallel Machines ◽

Unstructured Grid ◽

Memory Cache

Download Full-text

Shared Memory Cache Organizations for Reconfigurable Computing Systems

2009 17th IEEE Symposium on Field Programmable Custom Computing Machines ◽

10.1109/fccm.2009.28 ◽

2009 ◽

Cited By ~ 3

Author(s):

Philip Garcia ◽

Katherine Compton

Keyword(s):

Shared Memory ◽

Reconfigurable Computing ◽

Computing Systems ◽

Memory Cache

Download Full-text

Rely-guarantee bound analysis of parameterized concurrent shared-memory programs

Formal Methods in System Design ◽

10.1007/s10703-021-00370-8 ◽

2021 ◽

Author(s):

Thomas Pani ◽

Georg Weissenbacher ◽

Florian Zuleger

Keyword(s):

Shared Memory ◽

Time Complexity ◽

Concurrent Programs ◽

Automate Reasoning ◽

Fine Grained ◽

Free Data ◽

Concurrent Algorithms ◽

Resource Bound ◽

First Time ◽

Bound Analysis

AbstractWe present a thread-modular proof method for complexity and resource bound analysis of concurrent, shared-memory programs. To this end, we lift Jones’ rely-guarantee reasoning to assumptions and commitments capable of expressing bounds. The compositionality (thread-modularity) of this framework allows us to reason about parameterized programs, i.e., programs that execute arbitrarily many concurrent threads. We automate reasoning in our logic by reducing bound analysis of concurrent programs to the sequential case. As an application, we automatically infer time complexity for a family of fine-grained concurrent algorithms, lock-free data structures, to our knowledge for the first time.

Download Full-text