A HIGH PERFORMANCE PARALLEL STRASSEN IMPLEMENTATION

In this paper, we give a practical high-performance parallel implementation of Strassen’s algorithm for matrix multiplication. We show how, under restricted conditions, this algorithm can be implemented plug-compatible with standard parallel matrix multiplication algorithms. Results obtained on a large Intel Paragon system show a 10–20% reduction in execution time compared to what we believe to be the fastest standard parallel matrix multiplication implementation available at this time.
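
For readers unfamiliar with the algorithm this abstract refers to, the following is a minimal sequential sketch of Strassen's recursion in Python. It is not the parallel implementation described in the paper: the function names, the `cutoff` parameter, and the restriction to square matrices whose size is a power of two are assumptions made for illustration only.

```python
def add(X, Y):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def naive_matmul(A, B):
    """Standard triple-loop product, used as the recursion base case."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def split(M):
    """Split a square matrix of even size into its four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=2):
    """Multiply two n x n matrices, n a power of two (assumed, not checked)."""
    if len(A) <= cutoff:
        return naive_matmul(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven recursive products instead of the eight of the standard method.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    # Reassemble the quadrants into one matrix.
    return ([l + r for l, r in zip(C11, C12)] +
            [l + r for l, r in zip(C21, C22)])
```

In practice the recursion is usually stopped at a fairly large block size and handed to a tuned standard multiplication routine; the small default `cutoff` here only keeps the example easy to check by hand.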

A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction

Scientific Programming ◽

10.1155/1995/636457 ◽

1995 ◽

Vol 4 (4) ◽

pp. 275-289 ◽

Cited By ~ 10

Author(s):

B. Kumar ◽

C.-H. Huang ◽

P. Sadayappan ◽

R.W. Johnson

Keyword(s):

Tensor Product ◽

Shared Memory ◽

High Performance ◽

Fourier Transforms ◽

Matrix Multiplication ◽

Matrix Multiplication Algorithm ◽

Multiplication Algorithm ◽

Strassen's Algorithm ◽

Product Formulas

In this article, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. In this article, we present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.
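
The idea of writing Strassen's algorithm as a tensor product formula can be illustrated with a small Python sketch. It is not the authors' formulation or their memory-reduced variant. It only shows the generic bilinear form vec(C) = W^(⊗n) ((U^(⊗n) vec(A)) ∘ (V^(⊗n) vec(B))), where the coefficient matrices `U`, `V`, `W`, the helper names, and the block-recursive ordering of matrix entries are assumptions chosen for this example. The intermediate vectors have length 7^n, which is where a working-storage requirement of order 7^n comes from.

```python
# Coefficients of Strassen's seven products for 2 x 2 blocks,
# with entries ordered as (x11, x12, x21, x22).
U = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
     [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]]
V = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
     [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
W = [[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]]


def apply_tensor_power(T, x, levels):
    """Apply T (x) T (x) ... (x) T, with `levels` factors, to the vector x."""
    if levels == 0:
        return list(x)
    rows, cols = len(T), len(T[0])
    chunk = len(x) // cols
    parts = [apply_tensor_power(T, x[c * chunk:(c + 1) * chunk], levels - 1)
             for c in range(cols)]
    out = []
    for r in range(rows):
        out.extend(sum(T[r][c] * parts[c][k] for c in range(cols))
                   for k in range(len(parts[0])))
    return out


def block_vec(M):
    """Flatten a 2^n x 2^n matrix in block-recursive order (11, 12, 21, 22)."""
    n = len(M)
    if n == 1:
        return [M[0][0]]
    h = n // 2
    out = []
    for r0, c0 in ((0, 0), (0, h), (h, 0), (h, h)):
        out.extend(block_vec([row[c0:c0 + h] for row in M[r0:r0 + h]]))
    return out


def block_unvec(v):
    """Inverse of block_vec."""
    if len(v) == 1:
        return [[v[0]]]
    q = len(v) // 4
    C11, C12, C21, C22 = (block_unvec(v[i * q:(i + 1) * q]) for i in range(4))
    return ([a + b for a, b in zip(C11, C12)] +
            [a + b for a, b in zip(C21, C22)])


def strassen_tensor(A, B):
    """Multiply 2^n x 2^n matrices via the tensor-power form (size assumed)."""
    levels = len(A).bit_length() - 1
    a = apply_tensor_power(U, block_vec(A), levels)   # length 7^n
    b = apply_tensor_power(V, block_vec(B), levels)   # length 7^n
    m = [x * y for x, y in zip(a, b)]                 # 7^n scalar products
    return block_unvec(apply_tensor_power(W, m, levels))
```

For 4 × 4 inputs (n = 2), the vectors `a`, `b` and `m` each hold 49 entries, against 16 entries in each input matrix.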

Scientific Programming with High Performance Fortran: A Case Study Using the xHPF Compiler

Scientific Programming ◽

10.1155/1997/528513 ◽

1997 ◽

Vol 6 (1) ◽

pp. 127-152

Author(s):

Eric De Sturler ◽

Volker Strumpen

Keyword(s):

High Performance ◽

Parallel Implementation ◽

Gaussian Elimination ◽

Primary Objective ◽

Matrix Product ◽

Dense Matrix ◽

High Performance Fortran ◽

Partial Pivoting ◽

Intel Paragon

Recently, the first commercial High Performance Fortran (HPF) subset compilers have appeared. This article reports on our experiences with the xHPF compiler of Applied Parallel Research, version 1.2, for the Intel Paragon. At this stage, we do not expect very high performance from our HPF programs, even though performance will eventually be of paramount importance for the acceptance of HPF. Instead, our primary objective is to study how to convert large Fortran 77 (F77) programs to HPF such that the compiler generates reasonably efficient parallel code. We report on a case study that identifies several problems when parallelizing code with HPF; most of these problems affect current HPF compiler technology in general, although some are specific to the xHPF compiler. We discuss our solutions from the perspective of the scientific programmer, and present timing results on the Intel Paragon. The case study comprises three programs of different complexity with respect to parallelization. We use the dense matrix-matrix product to show that the distribution of arrays and the order of nested loops significantly influence the performance of the parallel program. We use Gaussian elimination with partial pivoting to study the parallelization strategy of the compiler. There are various ways to structure this algorithm for a particular data distribution. This example shows how much effort may be demanded from the programmer to support the compiler in generating an efficient parallel implementation. Finally, we use a small application to show that the more complicated structure of a larger program may introduce problems for the parallelization, even though all subroutines of the application are easy to parallelize by themselves. The application consists of a finite volume discretization on a structured grid and a nested iterative solver.
Our case study shows that it is possible to obtain reasonably efficient parallel programs with xHPF, although the compiler needs substantial support from the programmer.
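
As a reminder of the second test program named in this abstract, here is a minimal sequential sketch of Gaussian elimination with partial pivoting in Python. It is not the HPF or Fortran 77 code studied in the article, and it says nothing about data distribution. The function name and the dense list-of-rows representation are assumptions for illustration.

```python
def solve_partial_pivoting(A, b):
    """Solve A x = b for a square matrix A given as a list of rows."""
    n = len(A)
    # Work on an augmented copy so the caller's data is left untouched.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        # Partial pivoting: choose the row with the largest |entry| in column k.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        # Eliminate column k from all rows below the pivot row.
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x
```

The pivot search and the row swap are the steps that need global information about a column, which is one plausible reason the algorithm is a useful probe of how a compiler handles a given data distribution.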

Six Pass MapReduce Implementation of Strassen's Algorithm for Matrix Multiplication

Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond - BeyondMR'18 ◽

10.1145/3206333.3206336 ◽

2018 ◽

Author(s):

Prakash Ramanan

Keyword(s):

Matrix Multiplication ◽

Strassen's Algorithm

MapReduce Implementation of Strassen's Algorithm for Matrix Multiplication

Proceedings of the 4th Algorithms and Systems on MapReduce and Beyond - BeyondMR'17 ◽

10.1145/3070607.3070614 ◽

2017 ◽

Cited By ~ 2

Author(s):

Minhao Deng ◽

Prakash Ramanan

Keyword(s):

Matrix Multiplication ◽

Strassen's Algorithm

HIGH PRECISION INTEGER MULTIPLICATION WITH A GPU USING STRASSEN'S ALGORITHM WITH MULTIPLE FFT SIZES

Parallel Processing Letters ◽

10.1142/s0129626411000266 ◽

2011 ◽

Vol 21 (03) ◽

pp. 359-375 ◽

Cited By ~ 28

Author(s):

NIALL EMMART ◽

CHARLES C. WEEMS

Keyword(s):

High Performance ◽

Implementation Process ◽

General Purpose ◽

Fixed Size ◽

Processor Core ◽

Technology Generation ◽

Integer Multiplication ◽

Strassen's Algorithm ◽

Memory Layout

We have improved our prior implementation of Strassen's algorithm for high performance multiplication of very large integers on a general purpose graphics processor (GPU). A combination of algorithmic and implementation optimizations results in a factor of up to 13.9 speed improvement over our previous work, running on an NVIDIA 295. We have also reoptimized the implementation for an NVIDIA 480, from which we obtain a factor of up to 19 speedup in comparison with a Core i7 processor core of the same technology generation. To provide a fairer chip-to-chip comparison, we also determined total GPU throughput on a set of multiplications relative to all of the cores on a multicore chip running in parallel. We find that the GTX 480 provides a factor of six higher throughput than all four cores/eight threads of the Core i7. This paper discusses how we adapted the algorithm to operate within the limitations of the GPU and how we dealt with other issues encountered in the implementation process, including details of the memory layout of our FFTs. Compared with our earlier work, which used Karatsuba's algorithm to guide multiplication of different operand sizes built on top of Strassen's algorithm being applied to fixed-size segments of the operands, we are now able to apply Strassen's algorithm directly to operands ranging in size from 255K bits to 16,320K bits.
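
The general principle behind FFT-based integer multiplication can be sketched in a few lines of Python. This is not the authors' GPU implementation, and the abstract does not state which transform, field or memory layout they use. The sketch below swaps in a plain number-theoretic transform over the prime 998244353 with primitive root 3. The names `MOD`, `ROOT`, `ntt` and `multiply` and the base-256 digit split are assumptions, and operands are limited to roughly 15,000 bytes each so that no convolution coefficient reaches the modulus.

```python
MOD = 998244353   # prime equal to 119 * 2**23 + 1 (assumed for this sketch)
ROOT = 3          # a primitive root modulo MOD


def ntt(values, invert=False):
    """Iterative number-theoretic transform; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes of doubling length.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a


def multiply(x, y):
    """Multiply two non-negative integers via a transform-based convolution."""
    xs = list(x.to_bytes((x.bit_length() + 7) // 8 or 1, "little"))
    ys = list(y.to_bytes((y.bit_length() + 7) // 8 or 1, "little"))
    size = 1
    while size < len(xs) + len(ys):
        size <<= 1
    fx = ntt(xs + [0] * (size - len(xs)))
    fy = ntt(ys + [0] * (size - len(ys)))
    coeffs = ntt([p * q % MOD for p, q in zip(fx, fy)], invert=True)
    # Python's big integers absorb the carries when the digits are recombined.
    return sum(c << (8 * i) for i, c in enumerate(coeffs))
```

A production implementation would handle carries explicitly and choose the transform size and modulus to match the hardware; here Python's built-in integers are used for the final recombination purely to keep the sketch short.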

Parallel PPI Prediction Performance Study on HPC Platforms

Journal of Circuits, Systems and Computers ◽

10.1142/s0218126615500747 ◽

2015 ◽

Vol 24 (05) ◽

pp. 1550074 ◽

Cited By ~ 1

Author(s):

Ali A. El-Moursy ◽

Wael S. Afifi ◽

Fadi N. Sibai ◽

Salwa M. Nassar

Keyword(s):

Protein Interactions ◽

Execution Time ◽

High Performance ◽

Large Scale ◽

Parallel Implementation ◽

Prediction Method ◽

Protein Protein Interactions ◽

Performance Study ◽

Ppi Prediction ◽

Performance Computing

STRIKE is an algorithm that predicts protein–protein interactions (PPIs) by determining that proteins interact if they contain similar substrings of amino acids. STRIKE achieves a reasonable improvement over existing PPI prediction methods. Despite its high accuracy as a PPI prediction method, STRIKE requires a long execution time and is therefore considered a compute-intensive application. In this paper, we develop and implement a parallel STRIKE algorithm for high-performance computing (HPC) systems. Using a large-scale cluster, the execution time of the parallel implementation of this bioinformatics algorithm was reduced from about a week on a serial uniprocessor machine to about 16.5 h on 16 computing nodes, and down to about 2 h on 128 parallel nodes. Communication overheads between nodes are thoroughly studied.

Implementation of Strassen's algorithm for matrix multiplication

Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '96 ◽

10.1145/369028.369096 ◽

1996 ◽

Cited By ~ 31

Author(s):

Steven Huss-Lederman ◽

Elaine M. Jacobson ◽

Anna Tsao ◽

Thomas Turnbull ◽

Jeremy R. Johnson

Keyword(s):

Matrix Multiplication ◽

Strassen's Algorithm

High-Level Parallel Ant Colony Optimization with Algorithmic Skeletons

International Journal of Parallel Programming ◽

10.1007/s10766-021-00714-1 ◽

2021 ◽

Author(s):

Breno A. de Melo Menezes ◽

Nina Herrmann ◽

Herbert Kuchen ◽

Fernando Buarque de Lima Neto

Keyword(s):

Ant Colony Optimization ◽

High Performance ◽

Optimization Problems ◽

Programming Model ◽

Parallel Implementation ◽

Ant Colony ◽

Algorithmic Skeletons ◽

Low Level ◽

Programming Patterns ◽

High Level

Parallel implementations of swarm intelligence algorithms such as the ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of Algorithmic Skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold and zip) that later on will be converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain specific language Musket can cope with the development of a parallel implementation of ACO and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high performance code with similar execution times when compared to low-level implementations.
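
The map, fold and zip patterns mentioned in this abstract can be pictured with a short sequential Python sketch. This is not Musket code and does not reflect its syntax or its generated output. The names `skel_map`, `skel_zip`, `skel_fold`, `dot` and `evaporate`, and the evaporation rate `rho`, are hypothetical and serve only to show how a computation is phrased as a composition of patterns rather than explicit loops.

```python
from functools import reduce
from operator import add, mul


def skel_map(f, xs):
    """map: apply f independently to every element."""
    return [f(x) for x in xs]


def skel_zip(f, xs, ys):
    """zip: combine two sequences element by element with f."""
    return [f(x, y) for x, y in zip(xs, ys)]


def skel_fold(f, init, xs):
    """fold: reduce a sequence to one value with the binary operator f."""
    return reduce(f, xs, init)


def dot(xs, ys):
    """Dot product expressed as zip followed by fold."""
    return skel_fold(add, 0, skel_zip(mul, xs, ys))


def evaporate(pheromone, rho):
    """ACO-style pheromone evaporation expressed as a map (illustrative)."""
    return skel_map(lambda tau: (1 - rho) * tau, pheromone)
```

Because each pattern fixes how elements may depend on one another, a skeleton framework can in principle replace these sequential bodies with parallel ones without changing the user's code.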

TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs

Journal of Parallel and Distributed Computing ◽

10.1016/j.jpdc.2021.02.013 ◽

2021 ◽

Vol 151 ◽

pp. 70-85

Author(s):

Cody Rivera ◽

Jieyang Chen ◽

Nan Xiong ◽

Jing Zhang ◽

Shuaiwen Leon Song ◽

...

Keyword(s):

High Performance ◽

Matrix Multiplication

A Modified KNN Algorithm for High-Performance Computing on FPGA of Real-Time m-QAM Demodulators

Electronics ◽

10.3390/electronics10050627 ◽

2021 ◽

Vol 10 (5) ◽

pp. 627

Author(s):

David Marquez-Viloria ◽

Luis Castano-Londono ◽

Neil Guerrero-Gonzalez

Keyword(s):

Real Time ◽

High Performance ◽

Interference Mitigation ◽

Parallel Implementation ◽

Computational Time ◽

Successful Implementation ◽

Interchannel Interference ◽

The Difference ◽

High Level ◽

Performance Computing

A methodology for scalable and concurrent real-time implementation of highly recurrent algorithms is presented and experimentally validated using the AWS-FPGA. This paper presents a parallel implementation of a KNN algorithm focused on the m-QAM demodulators using high-level synthesis for fast prototyping, parameterization, and scalability of the design. The proposed design shows the successful implementation of the KNN algorithm for interchannel interference mitigation in a 3 × 16 Gbaud 16-QAM Nyquist WDM system. Additionally, we present a modified version of the KNN algorithm in which comparisons among data symbols are reduced by identifying the closest neighbor using the rule of the 8-connected clusters used for image processing. Real-time implementation of the modified KNN on a Xilinx Virtex UltraScale+ VU9P AWS-FPGA board was compared with the results obtained in previous work using the same data from the same experimental setup but with offline DSP in Matlab. The results show that the difference is negligible below the FEC limit. Additionally, the modified KNN shows a reduction in operations of between 43 percent and 75 percent, depending on the symbol’s position in the constellation, achieving a 47.25% reduction in total computational time for 100 K input symbols processed on 20 parallel cores compared to the KNN algorithm.
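
To make the baseline technique concrete, here is a minimal Python sketch of plain KNN classification of received complex symbols. It is not the modified 8-connected-cluster variant or the FPGA design described in the abstract. The function name `knn_classify`, the `(symbol, label)` training format and the default `k=3` are assumptions for illustration.

```python
from collections import Counter


def knn_classify(sample, training, k=3):
    """Label a received complex symbol by majority vote of its k nearest
    labelled training symbols, using Euclidean distance in the I/Q plane."""
    nearest = sorted(training, key=lambda item: abs(sample - item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set: noisy observations around two constellation points.
training = [
    (1.0 + 1.0j, "A"), (1.1 + 0.9j, "A"), (0.9 + 1.2j, "A"),
    (-1.0 + 1.0j, "B"), (-1.1 + 0.8j, "B"), (-0.9 + 1.1j, "B"),
]
decided = knn_classify(0.8 + 1.0j, training)
```

This version compares each sample with every training symbol, which is the cost that a neighbor-restricted variant would aim to reduce.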
