HIGH PRECISION INTEGER MULTIPLICATION WITH A GPU USING STRASSEN'S ALGORITHM WITH MULTIPLE FFT SIZES

2011 ◽  
Vol 21 (03) ◽  
pp. 359-375 ◽  
Author(s):  
NIALL EMMART ◽  
CHARLES C. WEEMS

We have improved our prior implementation of Strassen's algorithm for high-performance multiplication of very large integers on a general-purpose graphics processor (GPU). A combination of algorithmic and implementation optimizations results in a speed improvement of up to a factor of 13.9 over our previous work, running on an NVIDIA 295. We have also reoptimized the implementation for an NVIDIA 480, obtaining a speedup of up to a factor of 19 in comparison with a Core i7 processor core of the same technology generation. To provide a fairer chip-to-chip comparison, we also measured total GPU throughput on a set of multiplications relative to all of the cores of a multicore chip running in parallel. We find that the GTX 480 provides six times the throughput of all four cores (eight threads) of the Core i7. This paper discusses how we adapted the algorithm to operate within the limitations of the GPU and how we dealt with other issues encountered in the implementation process, including details of the memory layout of our FFTs. Compared with our earlier work, which used Karatsuba's algorithm to guide multiplication of different operand sizes built on top of Strassen's algorithm applied to fixed-size segments of the operands, we are now able to apply Strassen's algorithm directly to operands ranging in size from 255K bits to 16,320K bits.
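
As a rough illustration of the FFT-based multiplication that underlies Strassen's large-integer algorithm, the following minimal Python sketch splits the operands into limbs, convolves them with floating-point FFTs, and reassembles the product. The limb size and the use of double-precision transforms are illustrative assumptions and do not reflect the paper's GPU implementation.

```python
# Minimal sketch of FFT-based integer multiplication (the principle behind
# Strassen's large-integer algorithm). Double-precision FFTs limit the operand
# sizes this toy handles correctly; a real implementation chooses limb size
# and rounding far more carefully.
import numpy as np

def fft_multiply(a: int, b: int, base: int = 1 << 16) -> int:
    def limbs(x):
        out = []
        while x:
            out.append(x % base)
            x //= base
        return out or [0]

    la, lb = limbs(a), limbs(b)
    n = 1
    while n < len(la) + len(lb):
        n *= 2  # pad to a power-of-two transform length

    # Forward transforms, pointwise product, inverse transform = convolution.
    coeffs = np.rint(np.fft.ifft(np.fft.fft(la, n) * np.fft.fft(lb, n)).real)

    # A production implementation would now run carry propagation over the
    # limbs; here we simply evaluate the convolution as a polynomial in `base`.
    return sum(int(c) * base**i for i, c in enumerate(coeffs))

assert fft_multiply(123456789, 987654321) == 123456789 * 987654321
```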

2014 ◽  
Vol 2014 ◽  
pp. 1-13 ◽  
Author(s):  
Mouna Baklouti ◽  
Mohamed Abid

To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, most are still built with custom-logic design methodologies. Designing parallel multicore systems from available standard intellectual property (IP) cores while maintaining high performance is also a challenging issue. Softcore processors and field programmable gate arrays (FPGAs) are a cheap and fast option for developing and testing such systems. This paper describes an FPGA-based design methodology to implement a rapid prototype of parametric multicore systems. A study of the viability of building the SoC around Altera's NIOS II soft processor core is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and several parallel applications are used to measure the speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication).
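
For reference, speedup and parallel efficiency as used in evaluations like this one are typically computed from the single-core and p-core execution times. The small helper below is a generic illustration, not code from the paper.

```python
def speedup_and_efficiency(t_serial: float, t_parallel: float, cores: int):
    """Classic parallel metrics: speedup = T1 / Tp, efficiency = speedup / p."""
    speedup = t_serial / t_parallel
    return speedup, speedup / cores

# e.g. an 8-core run that cuts a 4.0 s job to 0.8 s:
# speedup = 5.0x, efficiency = 62.5%
print(speedup_and_efficiency(4.0, 0.8, 8))
```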


1996 ◽  
Vol 06 (01) ◽  
pp. 3-12 ◽  
Author(s):  
BRIAN GRAYSON ◽  
ROBERT VAN DE GEIJN

In this paper, we give a practical high-performance parallel implementation of Strassen's algorithm for matrix multiplication. We show how, under restricted conditions, this algorithm can be implemented to be plug-compatible with standard parallel matrix multiplication algorithms. Results obtained on a large Intel Paragon system show a 10–20% reduction in execution time compared to what we believe to be the fastest standard parallel matrix multiplication implementation available at this time.
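
For readers unfamiliar with the recursion being parallelized, the sketch below shows the classical sequential form of Strassen's algorithm (seven recursive products per level) for square matrices whose dimension is a power of two. It is a generic illustration, not the Paragon implementation; the cutoff value is an arbitrary choice.

```python
import numpy as np

def strassen(A: np.ndarray, B: np.ndarray, cutoff: int = 64) -> np.ndarray:
    """Multiply square matrices whose dimension is a power of two."""
    n = A.shape[0]
    if n <= cutoff:                      # fall back to the standard product
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Seven half-size products instead of eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```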


1995 ◽  
Vol 4 (4) ◽  
pp. 275-289 ◽  
Author(s):  
B. Kumar ◽  
C.-H. Huang ◽  
P. Sadayappan ◽  
R.W. Johnson

In this article, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated into high-performance parallel/vector code for various architectures. In this article, we present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.
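
The tensor product formalism rests on identities such as (A ⊗ B) vec(X) = vec(A X Bᵀ) (with row-major vectorization), which is what allows a block recursive formula to be rewritten as loops over ordinary matrix operations. The short NumPy check below illustrates the identity; it is not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 4))
B = rng.random((2, 5))
X = rng.random((4, 5))   # conformable with A on the left and B.T on the right

# (A kron B) vec(X) = vec(A X B^T), with vec() taken row-major as in ravel().
lhs = np.kron(A, B) @ X.ravel()
rhs = (A @ X @ B.T).ravel()
assert np.allclose(lhs, rhs)
```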


2014 ◽  
Vol 36 (4) ◽  
pp. 790-798
Author(s):  
Kai ZHANG ◽  
Shu-Ming CHEN ◽  
Yao-Hua WANG ◽  
Xi NING

2011 ◽  
Vol 28 (1) ◽  
pp. 1-14 ◽  
Author(s):  
W. van Straten ◽  
M. Bailes

dspsr is a high-performance, open-source, object-oriented, digital signal processing software library and application suite for use in radio pulsar astronomy. Written primarily in C++, the library implements an extensive range of modular algorithms that can optionally exploit both multiple-core processors and general-purpose graphics processing units. After over a decade of research and development, dspsr is now stable and in widespread use in the community. This paper presents a detailed description of its functionality, justification of major design decisions, analysis of phase-coherent dispersion removal algorithms, and demonstration of performance on some contemporary microprocessor architectures.
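
As a rough illustration of what phase-coherent dispersion removal involves, the sketch below applies the textbook approach: Fourier transform the complex baseband voltages, multiply by the inverse of the interstellar dispersion transfer function, and transform back. The dispersion constant and sign convention here are standard values from the pulsar literature, not details taken from dspsr itself.

```python
import numpy as np

K_DM = 4.148808e15  # dispersion constant in Hz^2 s per (pc cm^-3), textbook value

def coherent_dedisperse(voltage: np.ndarray, dm: float, f0_hz: float, bw_hz: float):
    """Remove interstellar dispersion from complex baseband samples.

    voltage : complex samples of a band of width bw_hz centred on f0_hz
    dm      : dispersion measure in pc cm^-3
    """
    n = voltage.size
    f = np.fft.fftfreq(n, d=1.0 / bw_hz)   # frequency offset from band centre, Hz
    # Phase of the interstellar transfer function H(f0 + f); dedispersion applies
    # its inverse (conjugate). The sign depends on the baseband conventions.
    phase = 2.0 * np.pi * K_DM * dm * f**2 / (f0_hz**2 * (f0_hz + f))
    return np.fft.ifft(np.fft.fft(voltage) * np.exp(-1j * phase))
```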


Author(s):  
Alexandra Hain ◽  
Arash E. Zaghi

Corrosion at steel beam ends is one of the most pressing challenges in the maintenance of aging bridges. To tackle this challenge, the Connecticut Department of Transportation (DOT) has partnered with the University of Connecticut to develop a repair method that benefits from the superior mechanical and durability characteristics of ultra-high performance concrete (UHPC) material. The repair involves welding shear studs to the intact portions of the web and encasing the beam end with UHPC. This provides an alternate load path for bearing forces that bypasses the corroded regions of the beam. The structural viability of the repair has been extensively proven through small- and full-scale experiments and comprehensive finite element simulations. Connecticut DOT implemented the repair for the first time in the field on a heavily trafficked four-span bridge in 2019. The UHPC beam end repair was chosen because of the access constraints and geometric complexities of the bridge that limited the viable repair options. Four of the repaired beam ends were fully instrumented to collect data on the performance of the repaired locations before casting, during curing, and for approximately 6 months following the application of the repair. This paper provides an overview of the successful repair implementation and presents the lessons learned during construction. Select data from the monitored beam ends are presented. It is expected that this information will provide engineers with a better understanding of the repair implementation process, and thus provide an additional repair option for states to enhance the safety of aging steel bridges.


Author(s):  
Sheng Kang ◽  
Guofeng Chen ◽  
Chun Wang ◽  
Ruiquan Ding ◽  
Jiajun Zhang ◽  
...  

With the advent of big data and cloud computing solutions, enterprise demand for servers is increasing, with especially high growth for Intel-based x86 server platforms. Today's datacenters are in constant pursuit of high-performance, high-availability computing coupled with low power consumption, low heat generation, and the ability to manage all of this through advanced telemetry data gathering. This paper showcases one such solution: an updated rack and server architecture that promises these improvements. The ability to manage server and data center power consumption and cooling more completely is critical to controlling datacenter costs and reducing PUE. Traditional Intel-based 1U and 2U form factor servers have existed in the data center for decades, and the general-purpose x86 server designs from the major OEMs are, for all practical purposes, very similar in their power consumption and thermal output. Server power supplies and thermal designs have historically not been optimized for high efficiency. In addition, IT managers need more information about servers in order to optimize data center cooling and power use, so an improved server/rack design needs to take advantage of more efficient power supplies or PDUs and more efficient means of cooling server compute resources than traditional internal server fans. This is the constant pursuit of corporations looking for new ways to improve efficiency and gain a competitive advantage. One way to optimize power consumption and improve cooling is a complete redesign of the traditional server rack: extracting the internal server power supplies and fans and centralizing them within the rack. This design reaches an entirely new low-power target by using centralized, high-efficiency PDUs to power all servers within the rack. Cooling is improved by using large, efficient rack-level fans to supply airflow to all servers, and opening up the server design allows greater airflow across server components. The centralized power supply breaks through traditional server power limits: rack-based PDUs can be operated closer to their optimum efficiency point, and combining online and offline modes within a single power supply, with cold backup, lets the data center achieve optimal power efficiency. In addition, unifying the mechanical structure and the thermal definitions for server cooling and PSU information within the rack solution allows IT to collect all server power and thermal information centrally, making it easier to analyze and process.
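
Since the abstract frames the gains partly in terms of PUE, the standard definition is worth stating: total facility power divided by IT equipment power. The tiny helper below illustrates it with made-up numbers; it is not data from the paper.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    return total_facility_kw / it_equipment_kw

# Example: 500 kW of total facility draw for 400 kW of IT load gives PUE = 1.25;
# centralizing PSUs and fans at the rack level aims to push this ratio down.
print(pue(500.0, 400.0))
```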


2014 ◽  
Vol 596 ◽  
pp. 276-279
Author(s):  
Xiao Hui Pan

Graph component labeling, which is a subset of the general graph coloring problem, is a computationally expensive operation in many important applications and simulations. A number of data-parallel algorithmic variations on the component labeling problem are possible, and we explore their use with general-purpose graphics processing units (GPGPUs) and the CUDA GPU programming language. We discuss implementation issues and performance results on CPUs and GPUs using CUDA, and we evaluate our system with real-world graphs. We show how taking the different architectural features of the GPU and the host CPUs into account allows us to achieve high performance.
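
One common data-parallel formulation of component labeling is iterative label propagation: every vertex starts with its own index as a label and repeatedly adopts the minimum label among its neighbors until nothing changes. The NumPy sketch below captures that idea on the CPU as a stand-in for the GPU kernels discussed in the paper.

```python
import numpy as np

def label_components(num_vertices: int, edges: np.ndarray) -> np.ndarray:
    """Connected component labels via iterative min-label propagation.

    edges: (m, 2) array of undirected edges (u, v).
    """
    labels = np.arange(num_vertices)
    changed = True
    while changed:
        # Each edge proposes the smaller endpoint label to both endpoints;
        # on a GPU, each edge (or vertex) would be handled by one thread.
        u, v = edges[:, 0], edges[:, 1]
        proposal = np.minimum(labels[u], labels[v])
        new_labels = labels.copy()
        np.minimum.at(new_labels, u, proposal)
        np.minimum.at(new_labels, v, proposal)
        changed = not np.array_equal(new_labels, labels)
        labels = new_labels
    return labels

# Two components: {0, 1, 2} and {3, 4}
edges = np.array([[0, 1], [1, 2], [3, 4]])
print(label_components(5, edges))   # -> [0 0 0 3 3]
```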


Author(s):  
Mário Pereira Vestias

High-performance reconfigurable computing systems integrate reconfigurable technology into the computing architecture to improve performance. Besides performance, reconfigurable hardware devices also achieve lower power consumption than general-purpose processors. Better performance and lower power consumption could be achieved with application-specific integrated circuit (ASIC) technology; however, ASICs are not reconfigurable, which makes them application-specific. Reconfigurable logic becomes a major advantage when hardware flexibility permits the same hardware module to speed up whatever application is run. The first and most common devices utilized for reconfigurable computing are fine-grained FPGAs, which offer large hardware flexibility. To reduce the performance and area overhead associated with reconfigurability, coarse-grained reconfigurable solutions have been proposed as a way to achieve better performance and lower power consumption. In this chapter, the authors provide a description of reconfigurable hardware for high-performance computing.

