Implementation with comparison of system performance in different parallel processing configuration systems using matrix multiplication

This paper introduces performance metrics for DSP parallel processing systems and presents a coarse-grained speedup model for DSP parallel processing structures. A quantitative analysis is carried out based on the system performance indices and the features of the target program. The study simulates and analyzes how different communication protocols and different degrees of parallelism affect the performance of the parallel processing structure, and it proposes directions for optimizing parallel processing systems.
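The abstract does not give the speedup model itself. As a rough illustration only, the sketch below uses the textbook definitions of speedup and efficiency together with an assumed cost model in which compute time divides evenly across processors and communication cost grows linearly with their number. The function names and the `t_comm_per_link` parameter are hypothetical and are not taken from the paper.

```python
def parallel_time(t_serial, p, t_comm_per_link=0.0):
    """Modeled run time on p processors.

    Assumption (not from the paper): the work splits evenly, and each
    additional processor adds a fixed communication cost.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    return t_serial / p + t_comm_per_link * (p - 1)


def speedup(t_serial, p, t_comm_per_link=0.0):
    """Speedup S = T_1 / T_p under the model above."""
    return t_serial / parallel_time(t_serial, p, t_comm_per_link)


def efficiency(t_serial, p, t_comm_per_link=0.0):
    """Parallel efficiency E = S / p."""
    return speedup(t_serial, p, t_comm_per_link) / p


if __name__ == "__main__":
    # Compare an ideal system with one that pays 5 time units per extra processor.
    for p in (1, 2, 4, 8):
        print(p, speedup(100.0, p), speedup(100.0, p, 5.0))
```

Under this assumed model, raising the degree of parallelism eventually lowers speedup once the communication term dominates, which is the kind of trade-off the abstract describes studying.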

Study on parallel processing method of matrix multiplication--A method to calculate the N-th power of massive matrix

2010 International Conference on Computer Application and System Modeling (ICCASM 2010) ◽

10.1109/iccasm.2010.5620197 ◽

2010 ◽

Author(s):

Sun Xu ◽

Li Dengdao ◽

Li Tao

Keyword(s):

Parallel Processing ◽

Matrix Multiplication ◽

Processing Method

Dense Matrix Multiplication Algorithms and Performance Evaluation of HPCC in 81 Nodes IBM Power 8 Architecture

Computation ◽

10.3390/computation9080086 ◽

2021 ◽

Vol 9 (8) ◽

pp. 86

Author(s):

Eduardo Patricio Estévez Ruiz ◽

Giovanny Eduardo Caluña Chicaiza ◽

Fabian Rodolfo Jiménez Patiño ◽

Joaquín Cayetano López Lago ◽

Saravana Prakash Thirumuruganandham

Keyword(s):

Performance Evaluation ◽

System Performance ◽

High Performance ◽

Matrix Multiplication ◽

Dense Matrix ◽

Current Configuration ◽

Performance Factors ◽

Reasonable Cost ◽

And Performance ◽

Performance Computing

Optimizing HPC systems based on performance factors and bottlenecks is essential for designing an HPC infrastructure with the best characteristics at a reasonable cost. Such insight can only be achieved through a detailed analysis of existing HPC systems and the execution of their workloads. “Quinde I” is the only, and the most powerful, supercomputer in Ecuador and is currently ranked third in South America. It was built with IBM Power 8 servers. In this work, we measured its performance using different High-Performance Computing (HPC) parameters and compared the results with theoretical values and with values obtained from tests on similar models. To measure its performance, we compiled and ran different benchmarks with the optimization flags specific to Power 8 in order to obtain the maximum performance from the hardware configuration installed by the vendor. The benchmark inputs were varied to analyze their impact on system performance. In addition, we compiled and compared the performance of two algorithms for dense matrix multiplication, SRUMMA and DGEMM.
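To make the idea of comparing measured against theoretical performance concrete, here is a minimal sketch that times a naive dense matrix product and converts the result to GFLOP/s using the conventional count of 2n³ floating-point operations. It is not the benchmark setup used in the paper: the pure-Python `matmul` stands in for an optimized routine such as DGEMM, and the arguments passed to `peak_gflops` are placeholder values, not Quinde I specifications.

```python
import time


def matmul(a, b):
    """Naive dense product of two matrices given as lists of rows."""
    b_cols = list(zip(*b))  # transpose once so columns are contiguous
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def gemm_flops(n):
    """Conventional operation count for an n x n product: 2 * n^3."""
    return 2 * n ** 3


def measured_gflops(n, repeats=3):
    """Best-of-`repeats` achieved GFLOP/s for an n x n product."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(a, b)
        best = min(best, time.perf_counter() - start)
    return gemm_flops(n) / best / 1e9


def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    achieved = measured_gflops(64)
    peak = peak_gflops(cores=1, clock_ghz=3.0, flops_per_cycle=8)  # placeholders
    print(f"achieved {achieved:.4f} GFLOP/s, {100 * achieved / peak:.2f}% of peak")
```

Varying `n` in such a script mirrors, on a very small scale, the abstract's point that benchmark inputs are varied to see how they affect the achieved performance.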

A method for synthesis and optimization for linear nearest neighbor quantum circuits by parallel processing

Quantum Information and Computation ◽

10.26421/qic18.13-14-2 ◽

2018 ◽

Vol 18 (13&14) ◽

pp. 1095-1114

Author(s):

Zongyuan Zhang ◽

Zhijin Guan ◽

Hong Zhang ◽

Haiying Ma ◽

Weiping Ding

Keyword(s):

Parallel Processing ◽

Large Scale ◽

Nearest Neighbor ◽

Matrix Multiplication ◽

Quantum Circuit ◽

Quantum Circuits ◽

Quantum Cost ◽

The Matrix ◽

Speed Up ◽

Serial Algorithm

To realize the linear nearest neighbor (LNN) architecture for quantum circuits and to reduce the quantum cost of linear reversible quantum circuits, a method for synthesizing and optimizing linear reversible quantum circuits based on matrix multiplication over the circuit structure is proposed. The method obtains the matrix representation of a linear quantum circuit by multiplying the matrices of the different parts of the whole circuit. An LNN realization that adds SWAP gates is proposed, and the equivalence of two ways of adding the SWAP gates is proved. Rules for eliminating the SWAP gates between two overlapping adjacent quantum gates are proposed for different cases; these reduce the quantum cost of a circuit after it has been mapped to the LNN architecture. We also propose an algorithm based on parallel processing to effectively reduce the time consumption for large-scale quantum circuits. Experiments show that the quantum cost is improved by 34.31% on average and that the GPU-based algorithm reaches a speed-up of 4 times over the CPU-based algorithm. For the large-scale benchmark circuits in RevLib, the average time optimization ratio of the parallel algorithm is 95.57% compared with the serial algorithm.
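The matrix view of linear reversible circuits can be sketched as follows. The code relies on the standard fact that a circuit of CNOT gates acts as an invertible matrix over GF(2), and it shows one simple way to make a non-adjacent CNOT nearest-neighbor by moving the control next to the target with SWAP gates and then moving it back. This is an illustrative assumption, not the synthesis or SWAP-elimination procedure of the paper; the gate tuples and function names are hypothetical.

```python
def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def matmul_gf2(a, b):
    """Matrix product with arithmetic modulo 2."""
    n = len(a)
    return [[sum(a[i][k] & b[k][j] for k in range(n)) % 2 for j in range(n)]
            for i in range(n)]


def cnot(n, control, target):
    """CNOT maps x[target] to x[target] XOR x[control]."""
    m = identity(n)
    m[target][control] = 1
    return m


def swap(n, i, j):
    """SWAP is the permutation matrix exchanging wires i and j."""
    m = identity(n)
    m[i][i] = m[j][j] = 0
    m[i][j] = m[j][i] = 1
    return m


def circuit_matrix(n, gates):
    """Multiply gate matrices; later gates multiply on the left."""
    m = identity(n)
    for kind, a, b in gates:
        g = cnot(n, a, b) if kind == "CNOT" else swap(n, a, b)
        m = matmul_gf2(g, m)
    return m


def to_lnn(gates):
    """Rewrite each CNOT so that it only touches adjacent wires.

    The input is assumed to contain CNOT gates only.
    """
    out = []
    for kind, a, b in gates:
        if kind != "CNOT" or abs(a - b) <= 1:
            out.append((kind, a, b))
            continue
        step = 1 if b > a else -1
        # Move the control wire by wire until it sits next to the target.
        moves = [("SWAP", q, q + step) for q in range(a, b - step, step)]
        out += moves + [("CNOT", b - step, b)] + moves[::-1]
    return out


if __name__ == "__main__":
    original = [("CNOT", 0, 3)]
    lnn = to_lnn(original)
    print(lnn)
    print(circuit_matrix(4, original) == circuit_matrix(4, lnn))
```

In this sketch a CNOT whose wires are d apart costs 2(d − 1) SWAP gates, which is the overhead that SWAP-elimination rules such as those described in the abstract aim to reduce.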
