matrix vector
Recently Published Documents


TOTAL DOCUMENTS: 880 (FIVE YEARS 176)

H-INDEX: 41 (FIVE YEARS 5)

Author(s):  
Hanno Becker ◽  
Vincent Hwang ◽  
Matthias J. Kannwischer ◽  
Bo-Yin Yang ◽  
Shang-Yi Yang

We present new speed records on the Armv8-A architecture for the lattice-based schemes Dilithium, Kyber, and Saber. The core novelty in this paper is the combination of Montgomery multiplication and Barrett reduction, resulting in “Barrett multiplication”, which allows particularly efficient modular one-known-factor multiplication using the Armv8-A Neon vector instructions. These novel techniques, combined with fast two-unknown-factor Montgomery multiplication, Barrett reduction sequences, and interleaved multi-stage butterflies, result in significantly faster code. We also introduce “asymmetric multiplication”, which is an improved technique for caching the results of the incomplete NTT, used e.g. for matrix-to-vector polynomial multiplication. Our implementations target the Arm Cortex-A72 CPU, on which our speed is 1.7× that of the state-of-the-art matrix-to-vector polynomial multiplication in kyber768 [Nguyen–Gaj 2021]. For Saber, NTTs are far superior to Toom–Cook multiplication on the Armv8-A architecture, outrunning the matrix-to-vector polynomial multiplication by 2.0×. On the Apple M1, our matrix-vector products run 2.1× and 1.9× faster for Kyber and Saber respectively.
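The idea of a one-known-factor modular multiplication can be illustrated with a minimal arithmetic sketch in Python. It is not the authors' Neon implementation: the function names, the word size `W = 16`, and the round-to-nearest convention are assumptions made for illustration. Only the modulus 3329 is Kyber's. For a constant factor `b`, the value `round(b * 2**W / q)` is precomputed once and then used to approximate the quotient of `a * b` by `q`.

```python
Q = 3329  # Kyber's prime modulus
W = 16    # assumed word size for this sketch; R = 2**W


def round_shift(x: int, w: int) -> int:
    """Return floor(x / 2**w + 1/2), i.e. x / 2**w rounded to nearest."""
    return (x + (1 << (w - 1))) >> w


def barrett_precompute(b: int, q: int = Q, w: int = W) -> int:
    """Precompute round(b * 2**w / q) for a known constant factor b."""
    return (2 * (b << w) + q) // (2 * q)


def barrett_multiply(a: int, b: int, b_prime: int, q: int = Q, w: int = W) -> int:
    """Return a small representative of a * b modulo q.

    The quotient round(a * b / q) is approximated by round(a * b_prime / 2**w),
    so no division by q happens at run time.
    """
    t = round_shift(a * b_prime, w)  # approximate quotient
    return a * b - t * q             # congruent to a * b modulo q


if __name__ == "__main__":
    b = 17  # hypothetical constant factor, e.g. a twiddle factor
    b_prime = barrett_precompute(b)
    r = barrett_multiply(1234, b, b_prime)
    print(r, r % Q == (1234 * b) % Q)
```

The result is always congruent to `a * b` modulo `q` because only a multiple of `q` is subtracted. In this sketch, inputs with `|a| <= 2**15` give a result with absolute value below `q`, since each of the two roundings contributes an error of at most one half.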


Electronics ◽  
2021 ◽  
Vol 10 (22) ◽  
pp. 2800
Author(s):  
Aleksandr Cariow ◽  
Janusz P. Paplinski

A set of efficient algorithmic solutions suitable for the fully parallel hardware implementation of short-length circular convolution cores is proposed. The advantage of the presented algorithms is that they require significantly fewer multiplications than the naive method of implementing this operation. During the synthesis of the presented algorithms, the matrix notation of the cyclic convolution operation was used, which made it possible to represent this operation using the matrix–vector product. The fact that the matrix multiplicand is a circulant matrix allows its successful factorization, which leads to a decrease in the number of multiplications when calculating such a product. The proposed algorithms are oriented towards a completely parallel hardware implementation, but in comparison with a naive approach to a completely parallel hardware implementation, they require a significantly smaller number of hardwired multipliers. Since a hardwired multiplier occupies a much larger area on the VLSI chip and consumes more power than a hardwired adder, the proposed solutions are resource efficient and energy efficient in terms of their hardware implementation. We considered circular convolutions for sequences of lengths N = 2, 3, 4, 5, 6, 7, 8, and 9.
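The link between circular convolution and a circulant matrix–vector product can be sketched as follows. The helper names are hypothetical, and the two-multiplication routine is the textbook N = 2 factorization, shown only to illustrate the kind of saving described; it is not claimed to be the exact algorithm from the paper.

```python
def circulant(h):
    """Build the circulant matrix whose first column is h."""
    n = len(h)
    return [[h[(i - j) % n] for j in range(n)] for i in range(n)]


def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def circular_convolution(h, x):
    """Naive circular convolution, using n * n multiplications."""
    n = len(h)
    return [sum(h[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def circular_convolution_2(h, x):
    """Length-2 circular convolution with two multiplications instead of four.

    When h is fixed, (h0 + h1) / 2 and (h0 - h1) / 2 can be precomputed,
    so only the two products below remain at run time.
    """
    s = (h[0] + h[1]) / 2 * (x[0] + x[1])
    d = (h[0] - h[1]) / 2 * (x[0] - x[1])
    return [s + d, s - d]


if __name__ == "__main__":
    h, x = [3, 5], [2, 7]
    print(matvec(circulant(h), x))       # circulant matrix-vector product
    print(circular_convolution(h, x))    # direct definition
    print(circular_convolution_2(h, x))  # two-multiplication variant
```

All three calls compute the same vector; the third returns floats because of the division by two.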


2021 ◽  
Vol 17 (4) ◽  
pp. 1-19
Author(s):  
Xiaoming Sun ◽  
David P. Woodruff ◽  
Guang Yang ◽  
Jialin Zhang

We consider algorithms with access to an unknown matrix $M \in \mathbb{F}^{n \times d}$ via matrix-vector products, namely, the algorithm chooses vectors $v_1, \ldots, v_q$ and observes $Mv_1, \ldots, Mv_q$. Here the $v_i$ can be randomized as well as chosen adaptively as a function of $Mv_1, \ldots, Mv_{i-1}$. Motivated by applications of sketching in distributed computation, linear algebra, and streaming models, as well as connections to areas such as communication complexity and property testing, we initiate the study of the number $q$ of queries needed to solve various fundamental problems. We study problems in three broad categories, including linear algebra, statistics problems, and graph problems. For example, we consider the number of queries required to approximate the rank, trace, maximum eigenvalue, and norms of a matrix $M$; to compute the AND/OR/Parity of each column or row of $M$, to decide whether there are identical columns or rows in $M$ or whether $M$ is symmetric, diagonal, or unitary; or to compute whether a graph defined by $M$ is connected or triangle-free. We also show separations for algorithms that are allowed to obtain matrix-vector products only by querying vectors on the right, versus algorithms that can query vectors on both the left and the right. We also show separations depending on the underlying field the matrix-vector product occurs in. For graph problems, we show separations depending on the form of the matrix (bipartite adjacency versus signed edge-vertex incidence matrix) to represent the graph. Surprisingly, very few works discuss this fundamental model, and we believe a thorough investigation of problems in this model would be beneficial to a number of different application areas.
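The query model can be made concrete with a small sketch. The class `MatVecOracle` and the function `looks_symmetric` are hypothetical names, and the randomized symmetry check below is a standard folklore test over the integers, not an algorithm or query bound taken from the paper. It uses only right queries and relies on the identity that, for a symmetric matrix, u·(Mv) equals v·(Mu) for all vectors u and v.

```python
import random


class MatVecOracle:
    """Hides a square matrix and exposes only right matrix-vector queries."""

    def __init__(self, rows):
        self._rows = rows
        self.queries = 0  # number of queries issued so far

    def query(self, v):
        self.queries += 1
        return [sum(a * b for a, b in zip(row, v)) for row in self._rows]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def looks_symmetric(oracle, n, trials=5, rng=None):
    """One-sided randomized test: a symmetric matrix is never rejected.

    Each trial issues two queries and compares u.(Mv) with v.(Mu).
    """
    rng = rng or random.Random(0)
    for _ in range(trials):
        u = [rng.randrange(1, 10**9) for _ in range(n)]
        v = [rng.randrange(1, 10**9) for _ in range(n)]
        if dot(u, oracle.query(v)) != dot(v, oracle.query(u)):
            return False  # witness found: M is certainly not symmetric
    return True


if __name__ == "__main__":
    oracle = MatVecOracle([[2, 1], [1, 3]])
    print(looks_symmetric(oracle, 2), oracle.queries)
```

The `queries` counter makes the cost measure of the model explicit: the algorithm is charged per product it observes, not per arithmetic operation.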


2021 ◽  
Vol 47 (6) ◽  
Author(s):  
J. Dölz ◽  
H. Egger ◽  
V. Shashkov

The numerical solution of dynamical systems with memory requires the efficient evaluation of Volterra integral operators in an evolutionary manner. After appropriate discretization, the basic problem can be represented as a matrix-vector product with a lower diagonal but densely populated matrix. For typical applications, like fractional diffusion or large-scale dynamical systems with delay, the memory cost for storing the matrix approximations and complete history of the data then becomes prohibitive for an accurate numerical approximation. For Volterra integral operators of convolution type, the fast and oblivious convolution quadrature method of Schädle, Lopez-Fernandez, and Lubich resolves this issue: it computes the discretized evaluation with $N$ time steps in $O(N \log N)$ complexity and requires only $O(\log N)$ active memory to store a compressed version of the complete history of the data. We will show that this algorithm can be interpreted as an $\mathscr{H}$-matrix approximation of the underlying integral operator. A further improvement can thus be achieved, in principle, by resorting to $\mathscr{H}^{2}$-matrix compression techniques. Following this idea, we formulate a variant of the $\mathscr{H}^{2}$-matrix-vector product for discretized Volterra integral operators that can be performed in an evolutionary and oblivious manner and requires only $O(N)$ operations and $O(\log N)$ active memory. In addition to the acceleration, more general asymptotically smooth kernels can be treated and the algorithm does not require a priori knowledge of the number of time steps. The efficiency of the proposed method is demonstrated by application to some typical test problems.
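For orientation, the baseline problem can be sketched as the naive evolutionary evaluation of a discretized convolution-type Volterra operator. This is not the compressed algorithm of the paper: it is the uncompressed reference that stores the full history and costs $O(N^2)$ operations and $O(N)$ memory. The function name and the assumption that quadrature weights are already folded into `kernel` are illustrative choices.

```python
def volterra_evolutionary(kernel, data_stream):
    """Yield y_n = sum_{j=0}^{n} kernel(n - j) * f_j as each f_n arrives.

    This equals row n of a lower triangular Toeplitz matrix-vector product.
    The number of time steps need not be known in advance, but the full
    history is kept, which is the cost the compressed methods avoid.
    """
    history = []
    for f_n in data_stream:
        history.append(f_n)
        n = len(history) - 1
        yield sum(kernel(n - j) * history[j] for j in range(n + 1))


if __name__ == "__main__":
    # Hypothetical kernel values k_m = m + 1 and a short data sequence.
    for y in volterra_evolutionary(lambda m: m + 1, [1, 2, 3]):
        print(y)
```

Because the function is a generator, each output is produced as soon as its input is available, which mirrors the evolutionary setting described above.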

