A double-precision multiplier with fine-grained clock-gating support for a first-generation CELL processor

Author(s):  
J.B. Kuang ◽  
T.C. Buchholtz ◽  
S.M. Dance ◽  
J.D. Warnock ◽  
S.N. Storino ◽  
...  
2006 ◽  
Vol 41 (1) ◽  
pp. 179-196

Author(s):  
D.C. Pham ◽  
T. Aipperspach ◽  
D. Boerstler ◽  
M. Bolliger ◽  
R. Chaudhry ◽  
...  

1995 ◽  
Vol 05 (04) ◽  
pp. 671-683
Author(s):  
Frederic T. Chong ◽  
Shamik D. Sharma ◽  
Eric A. Brewer ◽  
Joel Saltz

We examine multiprocessor runtime support for fine-grained, irregular directed acyclic graphs (DAGs) such as those that arise from sparse-matrix triangular solves. We conduct our experiments on the CM-5, whose low latencies and active-message support allow us to achieve unprecedented speedups for a general multiprocessor. Whereas previous implementations have maximum speedups of less than 4 on even simple banded matrices, we obtain scalable performance on extremely small and irregular problems. On a matrix with only 5300 rows, we achieve scalable performance with a speedup of 34 on 128 processors, for an absolute performance of over 33 million double-precision floating-point operations per second. We achieve these speedups with methods that are not matrix-specific and are applicable to any DAG. We compare a range of runtime-preprocessed and dynamic approaches on matrices from the Harwell-Boeing benchmark set. Although precomputed data distributions and execution schedules produce the best performance, we find it challenging to keep their cost low enough to make them worthwhile on small, fine-grained problems. Additionally, we find that a policy of frequent network polling can reduce communication overhead by a factor of three relative to the standard CM-5 policies. We present a detailed study of runtime overheads and show that send and receive processor overheads still dominate these applications on the CM-5. We conclude that these applications would benefit greatly from architectural support for low-overhead communication.
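The core idea of treating a sparse triangular solve as a DAG can be illustrated with level scheduling: each row depends on the previously solved rows in which it has nonzeros, and rows within the same level are independent and could run in parallel. The sketch below is a minimal NumPy illustration of that idea, not the paper's CM-5 implementation; the function names are illustrative only.

```python
import numpy as np

def level_schedule(L):
    """Group rows of a lower-triangular matrix into DAG levels.

    Row i depends on every earlier row j with L[i, j] != 0, so its level
    is one more than its deepest dependency. Rows sharing a level are
    mutually independent and form one parallel step.
    """
    n = L.shape[0]
    level = [0] * n
    for i in range(n):
        deps = [level[j] for j in range(i) if L[i, j] != 0]
        level[i] = 1 + max(deps) if deps else 0
    groups = {}
    for i, lv in enumerate(level):
        groups.setdefault(lv, []).append(i)
    return [groups[lv] for lv in sorted(groups)]

def triangular_solve_by_levels(L, b):
    """Forward substitution, one DAG level at a time.

    On a multiprocessor, the rows inside each level would be solved
    concurrently; here they are simply iterated sequentially.
    """
    x = np.zeros_like(b, dtype=float)
    for rows in level_schedule(L):
        for i in rows:
            x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x
```

For a banded matrix every level contains a single row, which is why banded problems expose so little parallelism; irregular sparsity patterns yield wider levels and more concurrency.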


Author(s):  
Michael O Lam ◽  
Jeffrey K Hollingsworth

Floating-point computation is ubiquitous in high-performance scientific computing, but rounding error can compromise the results of extended calculations, especially at large scales. In this paper, we present new techniques that use binary instrumentation and modification to perform fine-grained floating-point precision analysis, simulating any level of precision less than or equal to that of the original program. These techniques incur 40–70% lower overhead on average and provide finer-grained insight into a program’s sensitivity than previous mixed-precision analyses. We also present a novel histogram-based visualization of a program’s floating-point precision sensitivity, as well as a search technique that lets developers incrementally trade off analysis time for detail, including the ability to restart analyses from where they left off. We present results from several case studies and experiments that demonstrate the efficacy of these techniques. Using our tool and its novel visualization, application developers can quickly determine, for specific data sets, whether their application could run with fewer double-precision variables, saving both time and memory.
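The underlying simulation idea, running a double-precision computation while rounding each intermediate result to a lower precision, can be sketched in a few lines. This is a simplified illustration of the concept, not the authors' binary-instrumentation tool; the function names are hypothetical.

```python
import numpy as np

def truncate_to_single(x):
    """Round a double to the nearest single-precision value, then widen
    back to double. This emulates carrying only float32 significand bits
    through an otherwise double-precision computation."""
    return float(np.float32(x))

def sum_with_truncation(values, truncate=False):
    """Accumulate a sum, optionally rounding after every operation.

    With truncate=True, each intermediate total is squeezed through
    single precision, mimicking an instrumented run of the program at
    reduced precision; with truncate=False, the sum runs at full
    double precision."""
    total = 0.0
    for v in values:
        total = total + v
        if truncate:
            total = truncate_to_single(total)
    return total
```

Comparing the two results for the same input exposes how sensitive that accumulation is to reduced precision, which is the kind of per-operation insight a fine-grained precision analysis aims to provide.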

