A double-precision multiplier with fine-grained clock-gating support for a first-generation CELL processor

Author(s):  
J.B. Kuang ◽  
T.C. Buchholtz ◽  
S.M. Dance ◽  
J.D. Warnock ◽  
S.N. Storino ◽  
...  
2006 ◽  
Vol 41 (1) ◽  
pp. 179-196

Author(s):  
D.C. Pham ◽  
T. Aipperspach ◽  
D. Boerstler ◽  
M. Bolliger ◽  
R. Chaudhry ◽  
...  

1995 ◽  
Vol 05 (04) ◽  
pp. 671-683
Author(s):  
Frederic T. Chong ◽  
Shamik D. Sharma ◽  
Eric A. Brewer ◽  
Joel Saltz

We examine multiprocessor runtime support for fine-grained, irregular directed acyclic graphs (DAGs) such as those that arise from sparse-matrix triangular solves. We conduct our experiments on the CM-5, whose low latencies and active-message support allow us to achieve unprecedented speedups for a general multiprocessor. Whereas previous implementations have maximum speedups of less than 4 on even simple banded matrices, we obtain scalable performance on extremely small and irregular problems. On a matrix with only 5300 rows, we achieve scalable performance with a speedup of 34 on 128 processors, for an absolute performance of over 33 million double-precision floating-point operations per second. We achieve these speedups with methods that are not matrix-specific and are applicable to any DAG. We compare a range of runtime-preprocessed and dynamic approaches on matrices from the Harwell-Boeing benchmark set. Although precomputed data distributions and execution schedules produce the best performance, we find it challenging to keep their cost low enough to make them worthwhile on small, fine-grained problems. Additionally, we find that a policy of frequent network polling can reduce communication overhead by a factor of three relative to the standard CM-5 policies. We present a detailed study of runtime overheads and show that send and receive processor overheads still dominate these applications on the CM-5. We conclude that these applications would benefit greatly from architectural support for low-overhead communication.
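The core idea of treating a sparse triangular solve as a DAG can be illustrated with level scheduling: each row depends on the previously solved rows in which it has nonzeros, and rows within the same level are independent and could run in parallel. The sketch below is a minimal NumPy illustration of that idea, not the paper's CM-5 implementation; the function names are illustrative only.

```python
import numpy as np

def level_schedule(L):
    """Group rows of a lower-triangular matrix into DAG levels.

    Row i depends on every earlier row j with L[i, j] != 0, so its level
    is one more than its deepest dependency. Rows sharing a level are
    mutually independent and form one parallel step.
    """
    n = L.shape[0]
    level = [0] * n
    for i in range(n):
        deps = [level[j] for j in range(i) if L[i, j] != 0]
        level[i] = 1 + max(deps) if deps else 0
    groups = {}
    for i, lv in enumerate(level):
        groups.setdefault(lv, []).append(i)
    return [groups[lv] for lv in sorted(groups)]

def triangular_solve_by_levels(L, b):
    """Forward substitution, one DAG level at a time.

    On a multiprocessor, the rows inside each level would be solved
    concurrently; here they are simply iterated sequentially.
    """
    x = np.zeros_like(b, dtype=float)
    for rows in level_schedule(L):
        for i in rows:
            x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x
```

For a banded matrix every level contains a single row, which is why banded problems expose so little parallelism; irregular sparsity patterns yield wider levels and more concurrency.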


Author(s):  
Michael O Lam ◽  
Jeffrey K Hollingsworth

Floating-point computation is ubiquitous in high-performance scientific computing, but rounding error can compromise the results of extended calculations, especially at large scales. In this paper, we present new techniques that use binary instrumentation and modification to perform fine-grained floating-point precision analysis, simulating any level of precision less than or equal to that of the original program. These techniques incur 40–70% lower overhead on average and provide finer-grained insight into a program’s sensitivity than previous mixed-precision analyses. We also present a novel histogram-based visualization of a program’s floating-point precision sensitivity, as well as a search technique that lets developers incrementally trade off analysis time for detail, including the ability to restart analyses from where they left off. We present results from several case studies and experiments that demonstrate the efficacy of these techniques. Using our tool and its novel visualization, application developers can quickly determine, for specific data sets, whether their application could run with fewer double-precision variables, saving both time and memory.
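The underlying simulation idea, running a double-precision computation while rounding each intermediate result to a lower precision, can be sketched in a few lines. This is a simplified illustration of the concept, not the authors' binary-instrumentation tool; the function names are hypothetical.

```python
import numpy as np

def truncate_to_single(x):
    """Round a double to the nearest single-precision value, then widen
    back to double. This emulates carrying only float32 significand bits
    through an otherwise double-precision computation."""
    return float(np.float32(x))

def sum_with_truncation(values, truncate=False):
    """Accumulate a sum, optionally rounding after every operation.

    With truncate=True, each intermediate total is squeezed through
    single precision, mimicking an instrumented run of the program at
    reduced precision; with truncate=False, the sum runs at full
    double precision."""
    total = 0.0
    for v in values:
        total = total + v
        if truncate:
            total = truncate_to_single(total)
    return total
```

Comparing the two results for the same input exposes how sensitive that accumulation is to reduced precision, which is the kind of per-operation insight a fine-grained precision analysis aims to provide.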

