matrix vector Latest Research Papers

Performance tuning of the Helmholtz matrix-vector product kernel in the computational fluid dynamics solver Nek5000/RS for the A64FX processor

10.1145/3503470.3503476 ◽

2022 ◽

Author(s):

Miwako Tsuji ◽

Misun Min ◽

Stefan Kerkemeier ◽

Paul Fischer ◽

Elia Merzari ◽

...

Keyword(s):

Fluid Dynamics ◽

Computational Fluid Dynamics ◽

Performance Tuning ◽

Vector Product ◽

Product Kernel ◽

Matrix Vector

DENSITY OPERATOR REPRESENTATION IN MULTI-MATRIX VECTOR COHERENT STATES: LANDAU PROBLEM IN A HARMONIC POTENTIAL BACKGROUND

Reports on Mathematical Physics ◽

10.1016/s0034-4877(21)00084-7 ◽

2021 ◽

Vol 88 (3) ◽

pp. 327-350

Author(s):

ISIAKA AREMUA ◽

MAHOUTON NORBERT HOUNKONNOU ◽

KOMI SODOGA

Keyword(s):

Coherent States ◽

Density Operator ◽

Harmonic Potential ◽

Operator Representation ◽

Matrix Vector ◽

Vector Coherent States

Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1

IACR Transactions on Cryptographic Hardware and Embedded Systems ◽

10.46586/tches.v2022.i1.221-244 ◽

2021 ◽

pp. 221-244

Author(s):

Hanno Becker ◽

Vincent Hwang ◽

Matthias J. Kannwischer ◽

Bo-Yin Yang ◽

Shang-Yi Yang

Keyword(s):

State Of The Art ◽

Polynomial Multiplication ◽

Montgomery Multiplication ◽

The Core ◽

Multi Stage ◽

Unknown Factor ◽

The Matrix ◽

Improved Technique ◽

Matrix Vector ◽

Vector Polynomial

We present new speed records on the Armv8-A architecture for the lattice-based schemes Dilithium, Kyber, and Saber. The core novelty in this paper is the combination of Montgomery multiplication and Barrett reduction resulting in “Barrett multiplication” which allows particularly efficient modular one-known-factor multiplication using the Armv8-A Neon vector instructions. These novel techniques combined with fast two-unknown-factor Montgomery multiplication, Barrett reduction sequences, and interleaved multi-stage butterflies result in significantly faster code. We also introduce “asymmetric multiplication” which is an improved technique for caching the results of the incomplete NTT, used e.g. for matrix-to-vector polynomial multiplication. Our implementations target the Arm Cortex-A72 CPU, on which our speed is 1.7× that of the state-of-the-art matrix-to-vector polynomial multiplication in kyber768 [Nguyen–Gaj 2021]. For Saber, NTTs are far superior to Toom–Cook multiplication on the Armv8-A architecture, outrunning the matrix-to-vector polynomial multiplication by 2.0×. On the Apple M1, our matrix-vector products run 2.1× and 1.9× faster for Kyber and Saber respectively.
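As a rough illustration of multiplication by one known factor, the sketch below performs a Barrett-style modular multiplication on plain Python integers. It is not the paper's Neon implementation, which is vectorized and may differ in rounding and in its use of signed values. The names `precompute_twist` and `barrett_mul` and the word size `K = 16` are assumptions made for this example; `Q = 3329` is the Kyber modulus.

```python
# Illustrative scalar sketch only; names and word size are assumptions.
Q = 3329        # Kyber modulus
K = 16          # assumed word size for this sketch
R = 1 << K      # R = 2**K


def precompute_twist(b: int, q: int = Q) -> int:
    """Precompute floor(b * R / q) for a constant b that is known in advance."""
    return (b * R) // q


def barrett_mul(a: int, b: int, b_twist: int, q: int = Q) -> int:
    """Return (a * b) mod q in [0, q), assuming 0 <= a < R and b >= 0."""
    # t approximates floor(a * b / q) using only a multiply and a shift.
    t = (a * b_twist) >> K
    # r is congruent to a * b modulo q and lies in [0, 2q) under the bounds above.
    r = a * b - t * q
    # One conditional subtraction brings r into the canonical range.
    return r - q if r >= q else r
```

Because the twist depends only on the known factor, it can be computed once (for example, per NTT twiddle factor) and reused for every unknown input `a`.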

Algorithmic Structures for Realizing Short-Length Circular Convolutions with Reduced Complexity

Electronics ◽

10.3390/electronics10222800 ◽

2021 ◽

Vol 10 (22) ◽

pp. 2800

Author(s):

Aleksandr Cariow ◽

Janusz P. Paplinski

Keyword(s):

Energy Efficient ◽

Hardware Implementation ◽

Circulant Matrix ◽

Short Length ◽

Convolution Operation ◽

Naive Method ◽

The Matrix ◽

Reduced Complexity ◽

Matrix Vector ◽

Naive Approach

A set of efficient algorithmic solutions suitable to the fully parallel hardware implementation of the short-length circular convolution cores is proposed. The advantage of the presented algorithms is that they require significantly fewer multiplications as compared to the naive method of implementing this operation. During the synthesis of the presented algorithms, the matrix notation of the cyclic convolution operation was used, which made it possible to represent this operation using the matrix–vector product. The fact that the matrix multiplicand is a circulant matrix allows its successful factorization, which leads to a decrease in the number of multiplications when calculating such a product. The proposed algorithms are oriented towards a completely parallel hardware implementation, but in comparison with a naive approach to a completely parallel hardware implementation, they require a significantly smaller number of hardwired multipliers. Since the wired multiplier occupies a much larger area on the VLSI and consumes more power than the wired adder, the proposed solutions are resource efficient and energy efficient in terms of their hardware implementation. We considered circular convolutions for sequences of lengths N = 2, 3, 4, 5, 6, 7, 8, and 9.
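The idea of writing a circular convolution as a circulant matrix–vector product can be sketched as follows. The naive product uses N² multiplications; for N = 2, a well-known factorization needs only two. This is a textbook illustration and not necessarily the structure derived in the paper; the function names are hypothetical.

```python
from fractions import Fraction


def circulant(h):
    """Build the circulant matrix whose (i, j) entry is h[(i - j) mod N]."""
    n = len(h)
    return [[h[(i - j) % n] for j in range(n)] for i in range(n)]


def circular_convolve_naive(h, x):
    """Circular convolution as a dense matrix-vector product (N*N multiplications)."""
    H = circulant(h)
    n = len(x)
    return [sum(H[i][j] * x[j] for j in range(n)) for i in range(n)]


def circular_convolve_n2(h, x):
    """Length-2 circular convolution with two multiplications (integer inputs assumed)."""
    # These two coefficients depend only on h, so they can be precomputed
    # when the kernel is fixed.
    a = Fraction(h[0] + h[1], 2)
    b = Fraction(h[0] - h[1], 2)
    s = a * (x[0] + x[1])   # first multiplication
    d = b * (x[0] - x[1])   # second multiplication
    return [s + d, s - d]
```

For `h = [3, 5]` and `x = [2, 7]`, both functions return values equal to `[41, 31]`; the second trades two of the four multiplications for extra additions, which is the kind of trade-off the abstract motivates for hardware.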

Sparse Matrix-Vector Multiplication Cache Performance Evaluation and Design Exploration

10.1109/mascots53633.2021.9614301 ◽

2021 ◽

Author(s):

Jianfeng Cui ◽

Kai Lu ◽

Sheng Liu

Keyword(s):

Performance Evaluation ◽

Sparse Matrix ◽

Cache Performance ◽

Design Exploration ◽

Matrix Vector Multiplication ◽

Matrix Vector

Optimized Data Reuse via Reordering for Sparse Matrix-Vector Multiplication on FPGAs

10.1109/iccad51958.2021.9643453 ◽

2021 ◽

Author(s):

Shiqing Li ◽

Di Liu ◽

Weichen Liu

Keyword(s):

Sparse Matrix ◽

Data Reuse ◽

Matrix Vector Multiplication ◽

Matrix Vector

Querying a Matrix through Matrix-Vector Products

ACM Transactions on Algorithms ◽

10.1145/3470566 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-19

Author(s):

Xiaoming Sun ◽

David P. Woodruff ◽

Guang Yang ◽

Jialin Zhang

Keyword(s):

Linear Algebra ◽

Incidence Matrix ◽

Communication Complexity ◽

Property Testing ◽

Distributed Computation ◽

Maximum Eigenvalue ◽

Graph Problems ◽

The Matrix ◽

The Right ◽

Matrix Vector

We consider algorithms with access to an unknown matrix $M \in \mathbb{F}^{n \times d}$ via matrix-vector products, namely, the algorithm chooses vectors $v_1, \ldots, v_q$, and observes $Mv_1, \ldots, Mv_q$. Here the $v_i$ can be randomized as well as chosen adaptively as a function of $Mv_1, \ldots, Mv_{i-1}$. Motivated by applications of sketching in distributed computation, linear algebra, and streaming models, as well as connections to areas such as communication complexity and property testing, we initiate the study of the number $q$ of queries needed to solve various fundamental problems. We study problems in three broad categories, including linear algebra, statistics problems, and graph problems. For example, we consider the number of queries required to approximate the rank, trace, maximum eigenvalue, and norms of a matrix M; to compute the AND/OR/Parity of each column or row of M, to decide whether there are identical columns or rows in M or whether M is symmetric, diagonal, or unitary; or to compute whether a graph defined by M is connected or triangle-free. We also show separations for algorithms that are allowed to obtain matrix-vector products only by querying vectors on the right, versus algorithms that can query vectors on both the left and the right. We also show separations depending on the underlying field the matrix-vector product occurs in. For graph problems, we show separations depending on the form of the matrix (bipartite adjacency versus signed edge-vertex incidence matrix) to represent the graph. Surprisingly, very few works discuss this fundamental model, and we believe a thorough investigation of problems in this model would be beneficial to a number of different application areas.
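The query model can be made concrete with a small sketch: an oracle hides the matrix and counts right-hand matrix-vector queries. The example shows only the trivial baseline of recovering M exactly with d standard-basis queries; it is not one of the paper's algorithms, which aim to use far fewer queries for specific problems. The class and function names are hypothetical.

```python
class MatVecOracle:
    """Hides an n-by-d matrix and exposes it only through products M v."""

    def __init__(self, M):
        self._M = [list(row) for row in M]
        self.n = len(self._M)
        self.d = len(self._M[0])
        self.queries = 0  # number of matrix-vector products requested so far

    def query(self, v):
        """Return M v for a length-d vector v and count the query."""
        self.queries += 1
        return [sum(row[j] * v[j] for j in range(self.d)) for row in self._M]


def recover_matrix(oracle):
    """Recover M column by column using d standard-basis queries."""
    columns = []
    for j in range(oracle.d):
        e_j = [0] * oracle.d
        e_j[j] = 1
        columns.append(oracle.query(e_j))  # M e_j is the j-th column of M
    # Transpose the list of columns back into row-major form.
    return [[columns[j][i] for j in range(oracle.d)] for i in range(oracle.n)]
```

Any property of M can be decided after these d queries, so d is an upper bound on q in this model; the interesting question studied in the abstract is when fewer queries suffice.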

Barycentric Lagrange Interpolation Matrix–Vector Form Polynomial for Solving Volterra Integral Equations of the Second Kind

10.1007/978-981-16-2102-4_14 ◽

2021 ◽

pp. 151-161

Author(s):

E. S. Shoukralla ◽

B. M. Ahmed

Keyword(s):

Integral Equations ◽

Lagrange Interpolation ◽

Volterra Integral Equations ◽

Vector Form ◽

Interpolation Matrix ◽

Matrix Vector

A fast and oblivious matrix compression algorithm for Volterra integral operators

Advances in Computational Mathematics ◽

10.1007/s10444-021-09902-6 ◽

2021 ◽

Vol 47 (6) ◽

Author(s):

J. Dölz ◽

H. Egger ◽

V. Shashkov

Keyword(s):

Dynamical Systems ◽

Large Scale ◽

Integral Operators ◽

Test Problems ◽

Vector Product ◽

Active Memory ◽

Matrix Compression ◽

History Of ◽

Matrix Vector ◽

Complete History

The numerical solution of dynamical systems with memory requires the efficient evaluation of Volterra integral operators in an evolutionary manner. After appropriate discretization, the basic problem can be represented as a matrix-vector product with a lower diagonal but densely populated matrix. For typical applications, like fractional diffusion or large-scale dynamical systems with delay, the memory cost for storing the matrix approximations and complete history of the data then becomes prohibitive for an accurate numerical approximation. For Volterra integral operators of convolution type, the fast and oblivious convolution quadrature method of Schädle, Lopez-Fernandez, and Lubich resolves this issue and allows to compute the discretized evaluation with N time steps in $O(N \log N)$ complexity and only requires $O(\log N)$ active memory to store a compressed version of the complete history of the data. We will show that this algorithm can be interpreted as an ${\mathscr{H}}$-matrix approximation of the underlying integral operator. A further improvement can thus be achieved, in principle, by resorting to ${\mathscr{H}}^{2}$-matrix compression techniques. Following this idea, we formulate a variant of the ${\mathscr{H}}^{2}$-matrix-vector product for discretized Volterra integral operators that can be performed in an evolutionary and oblivious manner and requires only $O(N)$ operations and $O(\log N)$ active memory. In addition to the acceleration, more general asymptotically smooth kernels can be treated and the algorithm does not require a priori knowledge of the number of time steps. The efficiency of the proposed method is demonstrated by application to some typical test problems.
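For orientation, the uncompressed baseline that the abstract improves upon can be sketched as below: a convolution-type discretized Volterra operator evaluated step by step, where output n depends on all inputs up to step n. This is not the paper's compressed algorithm; it costs O(N²) operations and stores the full history, which is exactly the cost the abstract describes as prohibitive. The function name and the integer-indexed `kernel` are assumptions for this example.

```python
def evolve_dense(kernel, samples):
    """Evaluate y_n = sum_{j <= n} kernel(n - j) * x_j one time step at a time.

    `kernel(m)` is an assumed callable giving the discretized kernel weight
    for a lag of m steps; `samples` yields x_0, x_1, ... in time order.
    """
    history = []   # complete history of inputs: O(N) memory
    outputs = []
    for n, x_n in enumerate(samples):
        history.append(x_n)
        # Row n of the dense lower-left part of the matrix times the history.
        y_n = sum(kernel(n - j) * history[j] for j in range(n + 1))
        outputs.append(y_n)
    return outputs
```

With `kernel = lambda m: m + 1` and inputs `[1, 2, 3]`, the result is `[1, 4, 10]`. The loop never needs to know the total number of steps in advance, which mirrors the evolutionary setting; the compression techniques in the abstract target the growing `history` list and the inner sum.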

Efficient computation of matrix–vector products with full observation weighting matrices in data assimilation

Quarterly Journal of the Royal Meteorological Society ◽

10.1002/qj.4170 ◽

2021 ◽

Author(s):

Guannan Hu ◽

Sarah L. Dance

Keyword(s):

Data Assimilation ◽

Efficient Computation ◽

Matrix Vector
