Time-memory trade-off in Toom-Cook multiplication: an application to module-lattice based cryptography

Since the introduction of the ring-learning with errors problem, the number theoretic transform (NTT) based polynomial multiplication algorithm has been studied extensively. Due to its faster quasilinear time complexity, it has been the preferred choice of cryptographers to realize ring-learning with errors cryptographic schemes. Compared to NTT, Toom-Cook or Karatsuba based polynomial multiplication algorithms, though known for a long time, still have a fledgling presence in the context of post-quantum cryptography. In this work, we observe that the pre- and post-processing steps in Toom-Cook based multiplications can be expressed as linear transformations. Based on this observation, we propose two novel techniques that can increase the efficiency of Toom-Cook based polynomial multiplications. The first, which we call precomputation, reduces evaluation by a factor of 2; the second, which we call lazy interpolation, reduces interpolation from quadratic to linear. As a practical application, we applied our algorithms to the Saber post-quantum key-encapsulation mechanism. We discuss in detail the various implementation aspects of applying our algorithms to Saber. We show that our algorithm can improve the efficiency of the computationally costly matrix-vector multiplication by 12–37% compared to previous methods on their respective platforms. We also propose different methods to reduce the memory footprint of Saber for Cortex-M4 microcontrollers. Our implementation shows between 2.6 and 5.7 KB reduction in memory usage with respect to the smallest implementation in the literature.
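
The observation that Toom-Cook evaluation and interpolation are linear maps can be illustrated with a minimal Python sketch. It is not the paper's implementation: the evaluation points (0, 1, -1, 2, infinity), the exact rational arithmetic, and all function names are assumptions chosen for readability, whereas Saber works modulo a power of two and needs a different treatment of the divisions. Because both steps are linear, the evaluation of a reused operand can be computed once, and several products can be summed in the evaluated domain and interpolated a single time; whether this matches the paper's precomputation and lazy interpolation in detail is likewise an assumption.

```python
from fractions import Fraction

# Toom-3 evaluation at the points 0, 1, -1, 2, infinity (assumed choice).
EVAL = [[1, 0, 0], [1, 1, 1], [1, -1, 1], [1, 2, 4], [0, 0, 1]]
# The same points applied to the degree-4 product in the limb variable.
EVAL_PROD = [[1, 0, 0, 0, 0], [1, 1, 1, 1, 1], [1, -1, 1, -1, 1],
             [1, 2, 4, 8, 16], [0, 0, 0, 0, 1]]

def invert(m):
    """Gauss-Jordan inverse of a square matrix over the rationals."""
    n = len(m)
    a = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(m)]
    for c in range(n):
        p = next(r for r in range(c, n) if a[r][c] != 0)
        a[c], a[p] = a[p], a[c]
        a[c] = [x / a[c][c] for x in a[c]]
        for r in range(n):
            if r != c:
                f = a[r][c]
                a[r] = [x - f * y for x, y in zip(a[r], a[c])]
    return [row[n:] for row in a]

INTERP = invert(EVAL_PROD)  # interpolation is the inverse linear map

def apply(matrix, limbs):
    """Apply a matrix to a list of equal-length coefficient vectors."""
    return [[sum(c * limb[k] for c, limb in zip(row, limbs))
             for k in range(len(limbs[0]))] for row in matrix]

def schoolbook(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def evaluate(p):
    """Split p (length divisible by 3) into 3 limbs and evaluate them."""
    k = len(p) // 3
    return apply(EVAL, [p[i * k:(i + 1) * k] for i in range(3)])

def pointwise(ea, eb):
    return [schoolbook(x, y) for x, y in zip(ea, eb)]

def interpolate(prods):
    """Map 5 evaluated products back to coefficients and overlap-add them."""
    limbs = apply(INTERP, prods)
    k = (len(prods[0]) + 1) // 2
    out = [Fraction(0)] * (6 * k - 1)
    for i, limb in enumerate(limbs):
        for j, c in enumerate(limb):
            out[i * k + j] += c
    return [int(c) for c in out]

def toom3(a, b):
    return interpolate(pointwise(evaluate(a), evaluate(b)))

def inner_product(avec, bvec):
    """Sum of products with one interpolation for the whole sum."""
    eb = [evaluate(b) for b in bvec]  # could be cached and reused across rows
    acc = None
    for a, e in zip(avec, eb):
        prods = pointwise(evaluate(a), e)
        acc = prods if acc is None else [
            [x + y for x, y in zip(u, v)] for u, v in zip(acc, prods)]
    return interpolate(acc)
```

In this sketch `inner_product` performs one interpolation regardless of how many polynomial pairs are accumulated, which is the saving that linearity makes possible in a matrix-vector product.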

Post Quantum Learning With Errors Problem Based Key Encapsulation Protocols and Matrix Vector Product

2019 4th International Conference on Computer Science and Engineering (UBMK) ◽

10.1109/ubmk.2019.8907201 ◽

2019 ◽

Author(s):

Erdem Alkim ◽

Bilge Kagan Yazar

Keyword(s):

Vector Product ◽

Quantum Learning ◽

Learning With Errors ◽

Learning With Errors Problem ◽

Matrix Vector

Polynomial multiplication on embedded vector architectures

IACR Transactions on Cryptographic Hardware and Embedded Systems ◽

10.46586/tches.v2022.i1.482-505 ◽

2021 ◽

pp. 482-505

Author(s):

Hanno Becker ◽

Jose Maria Bermudo Mera ◽

Angshuman Karmakar ◽

Joseph Yiu ◽

Ingrid Verbauwhede

Keyword(s):

Instruction Scheduling ◽

Polynomial Multiplication ◽

Performance Improvements ◽

Low Area ◽

Memory Efficiency ◽

Key Encapsulation Mechanism ◽

Profile Vector ◽

And Performance ◽

Lattice Based Cryptography ◽

High Degree

High-degree, low-precision polynomial arithmetic is a fundamental computational primitive underlying structured lattice based cryptography. Its algorithmic properties and suitability for implementation on different compute platforms are an active area of research, and this article contributes to this line of work: Firstly, we present memory-efficiency and performance improvements for the Toom-Cook/Karatsuba polynomial multiplication strategy. Secondly, we provide implementations of those improvements on the Arm® Cortex®-M4 CPU, as well as the newer Cortex-M55 processor, the first M-profile core implementing the M-profile Vector Extension (MVE), also known as Arm® Helium™ technology. We also implement the Number Theoretic Transform (NTT) on the Cortex-M55 processor. We show that despite being single-issue, in-order and offering only 8 vector registers compared to 32 on A-profile SIMD architectures like Arm® Neon™ technology and the Scalable Vector Extension (SVE), by careful register management and instruction scheduling, we can obtain a 3× to 5× performance improvement over already highly optimized implementations on Cortex-M4, while maintaining the low area and energy profile necessary for use in the embedded market. Finally, as a real-world application, we integrate our multiplication techniques into the post-quantum key-encapsulation mechanism Saber.

Revisiting Multivariate Ring Learning with Errors and Its Applications on Lattice-Based Cryptography

Mathematics ◽

10.3390/math9080858 ◽

2021 ◽

Vol 9 (8) ◽

pp. 858

Author(s):

Alberto Pedrouzo-Ulloa ◽

Juan Ramón Troncoso-Pastoriza ◽

Nicolas Gama ◽

Mariya Georgieva ◽

Fernando Pérez-González

Keyword(s):

Cyclotomic Polynomials ◽

Time Efficiency ◽

Space And Time ◽

Low Degree ◽

Learning With Errors ◽

Lattice Based Cryptography ◽

Low Degree Polynomials ◽

Learning With Errors Problem ◽

Recent Attack

The “Multivariate Ring Learning with Errors” problem was presented as a generalization of Ring Learning with Errors (RLWE), introducing efficiency improvements with respect to the RLWE counterpart thanks to its multivariate structure. Nevertheless, the recent attack presented by Bootland, Castryck and Vercauteren has some important consequences on the security of the multivariate RLWE problem with “non-coprime” cyclotomics; this attack transforms instances of m-RLWE with power-of-two cyclotomic polynomials of degree $n=\prod_i n_i$ into a set of RLWE samples with dimension $\max_i\{n_i\}$. This is especially devastating for low-degree cyclotomics (e.g., $\Phi_4(x)=1+x^2$). In this work, we revisit the security of multivariate RLWE and propose new alternative instantiations of the problem that avoid the attack while still preserving the advantages of the multivariate structure, especially when using low-degree polynomials. Additionally, we show how to parameterize these instances in a secure and practical way, therefore enabling constructions and strategies based on m-RLWE that bring notable space and time efficiency improvements over current RLWE-based constructions.

Implementing RLWE-based Schemes Using an RSA Co-Processor

IACR Transactions on Cryptographic Hardware and Embedded Systems ◽

10.46586/tches.v2019.i1.169-208 ◽

2018 ◽

pp. 169-208

Author(s):

Martin R. Albrecht ◽

Christian Hanser ◽

Andrea Hoeller ◽

Thomas Pöppelmann ◽

Fernando Virdia ◽

...

Keyword(s):

High Performance ◽

Key Generation ◽

Ideal Lattice ◽

Polynomial Multiplication ◽

Key Encapsulation Mechanism ◽

Speed Up ◽

Identity Cards ◽

Integer Multiplication ◽

Lattice Based Cryptography ◽

Hardware Security Modules

We repurpose existing RSA/ECC co-processors for (ideal) lattice-based cryptography by exploiting the availability of fast long integer multiplication. Such co-processors are deployed in smart cards in passports and identity cards, secured microcontrollers and hardware security modules (HSM). In particular, we demonstrate an implementation of a variant of the Module-LWE-based Kyber Key Encapsulation Mechanism (KEM) that is tailored for high performance on a commercially available smart card chip (SLE 78). To benefit from the RSA/ECC co-processor, we use Kronecker substitution in combination with schoolbook and Karatsuba polynomial multiplication. Moreover, we speed up symmetric operations in our Kyber variant using the AES co-processor to implement a PRNG and a SHA-256 co-processor to realise hash functions. This allows us to execute CCA-secure Kyber768 key generation in 79.6 ms, encapsulation in 102.4 ms and decapsulation in 132.7 ms.
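
Kronecker substitution turns a polynomial product into a single long integer product, which is the operation an RSA co-processor accelerates. The following is a minimal sketch of the basic idea only, assuming non-negative coefficients and using Python's built-in big integers in place of the co-processor; the function name `kronecker_mul` is hypothetical, and the signed coefficients and modular reductions that a Kyber variant needs are not shown.

```python
def kronecker_mul(a, b):
    """Multiply polynomials with non-negative integer coefficients
    by packing them into integers and doing one big multiplication."""
    # Every product coefficient is at most this bound.
    bound = min(len(a), len(b)) * max(a) * max(b)
    w = bound.bit_length()  # slot width in bits, so that 2**w > bound

    def pack(p):
        # Evaluate the polynomial at x = 2**w.
        return sum(c << (i * w) for i, c in enumerate(p))

    prod = pack(a) * pack(b)  # the single long integer multiplication
    mask = (1 << w) - 1
    # Each w-bit slot of the product holds one coefficient.
    return [(prod >> (i * w)) & mask for i in range(len(a) + len(b) - 1)]
```

The slot width is chosen so that no coefficient of the product can overflow into the neighbouring slot, which is what makes the unpacking step a plain shift and mask.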

NTT Multiplication for NTT-unfriendly Rings

IACR Transactions on Cryptographic Hardware and Embedded Systems ◽

10.46586/tches.v2021.i2.159-188 ◽

2021 ◽

pp. 159-188

Author(s):

Chi-Ming Marvin Chung ◽

Vincent Hwang ◽

Matthias J. Kannwischer ◽

Gregor Seiler ◽

Cheng-Jhih Shih ◽

...

Keyword(s):

Chinese Remainder Theorem ◽

Polynomial Rings ◽

Superior Performance ◽

Polynomial Multiplication ◽

Software Optimization ◽

Matrix Vector Multiplication ◽

Speed Up ◽

Previous State ◽

Chinese Association ◽

Matrix Vector

In this paper, we show how multiplication for polynomial rings used in the NIST PQC finalists Saber and NTRU can be efficiently implemented using the Number-theoretic transform (NTT). We obtain superior performance compared to the previous state of the art implementations using Toom–Cook multiplication on both NIST’s primary software optimization targets AVX2 and Cortex-M4. Interestingly, these two platforms require different approaches: On the Cortex-M4, we use 32-bit NTT-based polynomial multiplication, while on Intel we use two 16-bit NTT-based polynomial multiplications and combine the products using the Chinese Remainder Theorem (CRT). For Saber, the performance gain is particularly pronounced. On Cortex-M4, the Saber NTT-based matrix-vector multiplication is 61% faster than the Toom–Cook multiplication resulting in 22% fewer cycles for Saber encapsulation. For NTRU, the speed-up is less impressive, but still NTT-based multiplication performs better than Toom–Cook for all parameter sets on Cortex-M4. The NTT-based polynomial multiplication for NTRU-HRSS is 10% faster than Toom–Cook which results in a 6% cost reduction for encapsulation. On AVX2, we obtain speed-ups for three out of four NTRU parameter sets. As a further illustration, we also include code for AVX2 and Cortex-M4 for the Chinese Association for Cryptologic Research competition award winner LAC (also a NIST round 2 candidate) which outperforms existing code.
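
The CRT step mentioned for the Intel target can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the primes 7681 and 12289 are illustrative NTT-friendly 16-bit values, a schoolbook negacyclic product stands in for each NTT-based multiplication, and the function names are hypothetical. The point is only that two products modulo small primes determine the integer product, provided its coefficients are smaller in absolute value than half the product of the primes, after which it can be reduced modulo an NTT-unfriendly modulus such as a power of two.

```python
P1, P2 = 7681, 12289  # illustrative NTT-friendly primes (assumption)
M = P1 * P2

def negacyclic_mul_mod(a, b, p):
    """Product in Z_p[x]/(x^n + 1); schoolbook stand-in for an NTT."""
    n = len(a)
    out = [0] * n
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < n:
                out[i + j] = (out[i + j] + x * y) % p
            else:
                # x^n = -1, so wrapped terms are subtracted.
                out[i + j - n] = (out[i + j - n] - x * y) % p
    return out

def crt_pair(r1, r2):
    """Centred value x mod M with x = r1 (mod P1) and x = r2 (mod P2)."""
    inv = pow(P1, -1, P2)  # modular inverse, Python 3.8+
    x = r1 + P1 * (((r2 - r1) * inv) % P2)
    return x - M if x > M // 2 else x

def mul_via_crt(a, b, q):
    """Negacyclic product modulo q, recovered from two prime moduli."""
    c1 = negacyclic_mul_mod(a, b, P1)
    c2 = negacyclic_mul_mod(a, b, P2)
    return [crt_pair(u, v) % q for u, v in zip(c1, c2)]
```

The centred lift in `crt_pair` matters because the true product coefficients can be negative; reducing modulo `q` is only correct once the signed integer value has been recovered.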

Efficient Three-Way Split Formulas for Binary Polynomial Multiplication and Toeplitz Matrix Vector Product

IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences ◽

10.1587/transfun.e101.a.239 ◽

2018 ◽

Vol E101.A (1) ◽

pp. 239-248

Author(s):

Sun-Mi PARK ◽

Ku-Young CHANG ◽

Dowon HONG ◽

Changho SEO

Keyword(s):

Toeplitz Matrix ◽

Vector Product ◽

Polynomial Multiplication ◽

Matrix Vector

SpaceA: Sparse Matrix Vector Multiplication on Processing-in-Memory Accelerator

2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA) ◽

10.1109/hpca51647.2021.00055 ◽

2021 ◽

Author(s):

Xinfeng Xie ◽

Zheng Liang ◽

Peng Gu ◽

Abanti Basak ◽

Lei Deng ◽

...

Keyword(s):

Sparse Matrix ◽

Matrix Vector Multiplication ◽

Matrix Vector

Conflict-free symmetric sparse matrix-vector multiplication on multicore architectures

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis ◽

10.1145/3295500.3356148 ◽

2019 ◽

Cited By ~ 1

Author(s):

Athena Elafrou ◽

Georgios Goumas ◽

Nectarios Koziris

Keyword(s):

Sparse Matrix ◽

Multicore Architectures ◽

Matrix Vector Multiplication ◽

Matrix Vector

Sparse Matrix-Vector Multiplication on GPGPUs

ACM Transactions on Mathematical Software ◽

10.1145/3017994 ◽

2017 ◽

Vol 43 (4) ◽

pp. 1-49 ◽

Cited By ~ 34

Author(s):

Salvatore Filippone ◽

Valeria Cardellini ◽

Davide Barbieri ◽

Alessandro Fanfarillo

Keyword(s):

Sparse Matrix ◽

Matrix Vector Multiplication ◽

Matrix Vector

Private and rateless adaptive coded matrix-vector multiplication

EURASIP Journal on Wireless Communications and Networking ◽

10.1186/s13638-020-01887-y ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Rawad Bitar ◽

Yuxuan Xing ◽

Yasaman Keshtkarjahromi ◽

Venkat Dasari ◽

Salim El Rouayheb ◽

...

Keyword(s):

Computing Methods ◽

Erasure Codes ◽

Time Varying ◽

New Paradigm ◽

Matrix Vector Multiplication ◽

Processing Data ◽

Computationally Intensive ◽

Matrix Vector ◽

Monitoring Devices ◽

Theoretical Results

Edge computing is emerging as a new paradigm to allow processing data near the edge of the network, where the data is typically generated and collected. This enables critical computations at the edge in applications such as Internet of Things (IoT), in which an increasing number of devices (sensors, cameras, health monitoring devices, etc.) collect data that needs to be processed through computationally intensive algorithms with stringent reliability, security and latency constraints. Our key tool is the theory of coded computation, which advocates mixing data in computationally intensive tasks by employing erasure codes and offloading these tasks to other devices for computation. Coded computation has recently been gaining interest, thanks to its higher reliability, smaller delay, and lower communication costs. In this paper, we develop a private and rateless adaptive coded computation (PRAC) algorithm for distributed matrix-vector multiplication by taking into account (1) the privacy requirements of IoT applications and devices, and (2) the heterogeneous and time-varying resources of edge devices. We show that PRAC outperforms known secure coded computing methods when resources are heterogeneous. We provide theoretical guarantees on the performance of PRAC and its comparison to baselines. Moreover, we confirm our theoretical results through simulations and implementations on Android-based smartphones.
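
The general idea of coded computation, as opposed to PRAC itself, can be shown with a toy example. The sketch below assumes a fixed code with two data blocks and one parity block, so that the result survives the loss of any one of three workers; it is neither rateless nor private, and the function and key names are hypothetical.

```python
def matvec(a, x):
    """Plain matrix-vector product on lists of rows."""
    return [sum(r * v for r, v in zip(row, x)) for row in a]

def encode(a):
    """Split A (even number of rows) into two row blocks and add
    one parity block A1 + A2; each block goes to one worker."""
    h = len(a) // 2
    a1, a2 = a[:h], a[h:]
    parity = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    return {"a1": a1, "a2": a2, "parity": parity}

def decode(results):
    """Recover A @ x from any two of the three worker results."""
    if "a1" in results and "a2" in results:
        y1, y2 = results["a1"], results["a2"]
    elif "a1" in results:
        y1 = results["a1"]
        # parity result equals y1 + y2 because the product is linear
        y2 = [p - u for p, u in zip(results["parity"], y1)]
    else:
        y2 = results["a2"]
        y1 = [p - v for p, v in zip(results["parity"], y2)]
    return y1 + y2
```

Each worker computes `matvec` on its own block, and the collector calls `decode` as soon as any two results arrive, so one slow or failed worker does not delay the answer.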
