Racing BIKE: Improved Polynomial Multiplication and Inversion in Hardware

BIKE is a Key Encapsulation Mechanism selected as an alternate candidate in NIST’s PQC standardization process, in which performance plays a significant role in the third round. This paper presents FPGA implementations of BIKE with the best area-time performance reported in the literature. We optimize two key arithmetic operations, which are the sparse polynomial multiplication and the polynomial inversion. Our sparse multiplier achieves time-constancy for sparse polynomials of indefinite Hamming weight used in BIKE’s encapsulation. The polynomial inversion is based on the extended Euclidean algorithm, which is unprecedented in current BIKE implementations. Our optimized design results in a 5.5 times faster key generation compared to previous implementations based on Fermat’s little theorem. Besides the arithmetic optimizations, we present a united hardware design of BIKE with shared resources and shared sub-modules among KEM functionalities. On Xilinx Artix-7 FPGAs, our light-weight implementation consumes only 3 777 slices and performs a key generation, encapsulation, and decapsulation in 3 797 μs, 443 μs, and 6 896 μs, respectively. Our high-speed design requires 7 332 slices and performs the three KEM operations in 1 672 μs, 132 μs, and 1 892 μs, respectively.
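
The two arithmetic operations named in this abstract can be illustrated in software. The following is a minimal sketch, assuming polynomials over GF(2) are stored as Python integers (bit i holds the coefficient of x^i) and working in the ring GF(2)[x]/(x^r - 1) that BIKE uses. The function names and the toy block size r = 11 are hypothetical, the loops are not constant-time, and nothing here reflects the paper's hardware architecture.

```python
def clmul(a, b):
    """Carry-less product of two GF(2)[x] polynomials stored as int bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_divmod(a, b):
    """Quotient and remainder of a divided by b in GF(2)[x] (b must be nonzero)."""
    quotient = 0
    deg_b = b.bit_length()
    while a.bit_length() >= deg_b:
        shift = a.bit_length() - deg_b
        quotient ^= 1 << shift
        a ^= b << shift
    return quotient, a


def inverse_mod(h, r):
    """Invert h modulo x^r - 1 over GF(2) with the extended Euclidean algorithm."""
    modulus = (1 << r) | 1  # x^r + 1 equals x^r - 1 over GF(2)
    r0, r1 = modulus, h
    t0, t1 = 0, 1           # invariant: t_i * h == r_i (mod modulus)
    while r1:
        q, rem = poly_divmod(r0, r1)
        r0, r1 = r1, rem
        t0, t1 = t1, t0 ^ clmul(q, t1)
    if r0 != 1:
        raise ValueError("polynomial is not invertible modulo x^r - 1")
    return t0


def sparse_mul(dense, support, r):
    """Multiply a dense polynomial by a sparse one given as a list of exponents.

    Each exponent e contributes a cyclic rotation of `dense` by e positions,
    which is multiplication by x^e modulo x^r - 1.
    """
    mask = (1 << r) - 1
    acc = 0
    for e in support:
        acc ^= ((dense << e) | (dense >> (r - e))) & mask
    return acc


# Toy usage: h = 1 + x + x^3 in GF(2)[x]/(x^11 - 1)
r = 11
support = [0, 1, 3]
h = sum(1 << e for e in support)
h_inv = inverse_mod(h, r)
check = sparse_mul(h_inv, support, r)  # equals 1 when h_inv is correct
```

The running time of `sparse_mul` depends on the number of exponents in `support`, which is exactly the kind of weight-dependent behaviour a time-constant multiplier has to avoid.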

Implementing RLWE-based Schemes Using an RSA Co-Processor

IACR Transactions on Cryptographic Hardware and Embedded Systems ◽

10.46586/tches.v2019.i1.169-208 ◽

2018 ◽

pp. 169-208

Author(s):

Martin R. Albrecht ◽

Christian Hanser ◽

Andrea Hoeller ◽

Thomas Pöppelmann ◽

Fernando Virdia ◽

...

Keyword(s):

High Performance ◽

Key Generation ◽

Ideal Lattice ◽

Polynomial Multiplication ◽

Key Encapsulation Mechanism ◽

Speed Up ◽

Identity Cards ◽

Integer Multiplication ◽

Lattice Based Cryptography ◽

Hardware Security Modules

We repurpose existing RSA/ECC co-processors for (ideal) lattice-based cryptography by exploiting the availability of fast long integer multiplication. Such co-processors are deployed in smart cards in passports and identity cards, secured microcontrollers and hardware security modules (HSM). In particular, we demonstrate an implementation of a variant of the Module-LWE-based Kyber Key Encapsulation Mechanism (KEM) that is tailored for high performance on a commercially available smart card chip (SLE 78). To benefit from the RSA/ECC co-processor we use Kronecker substitution in combination with schoolbook and Karatsuba polynomial multiplication. Moreover, we speed up symmetric operations in our Kyber variant using the AES co-processor to implement a PRNG and a SHA-256 co-processor to realise hash functions. This allows us to execute CCA-secure Kyber768 key generation in 79.6 ms, encapsulation in 102.4 ms and decapsulation in 132.7 ms.
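
Kronecker substitution, mentioned in this abstract, turns a polynomial product into a single long integer product, which is what makes an RSA-style big-integer multiplier useful here. The sketch below is only an illustration under stated assumptions: Python's built-in integers stand in for the co-processor, the ring is taken to be Z_q[x]/(x^n + 1) as in Kyber, and the function names and toy parameters (n = 4, q = 3329) are hypothetical rather than taken from the paper.

```python
def pack(coeffs, bits):
    """Evaluate the polynomial at x = 2^bits, giving one large integer."""
    return sum(c << (bits * i) for i, c in enumerate(coeffs))


def kronecker_mul_negacyclic(f, g, q):
    """Multiply f and g in Z_q[x]/(x^n + 1) using one big-integer multiplication.

    Coefficients are assumed to lie in [0, q). The slot width is chosen so that
    no coefficient of the full product can overflow into the next slot.
    """
    n = len(f)
    assert len(g) == n
    bits = (n * (q - 1) ** 2).bit_length()
    product = pack(f, bits) * pack(g, bits)  # the single long multiplication
    mask = (1 << bits) - 1
    full = [(product >> (bits * i)) & mask for i in range(2 * n - 1)]
    full.append(0)                           # pad to length 2n
    # Reduce modulo x^n + 1: x^(n + i) is congruent to -x^i
    return [(full[i] - full[i + n]) % q for i in range(n)]


def schoolbook_negacyclic(f, g, q):
    """Reference multiplication in the same ring, for comparison."""
    n = len(f)
    out = [0] * n
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            k = i + j
            if k < n:
                out[k] = (out[k] + a * b) % q
            else:
                out[k - n] = (out[k - n] - a * b) % q
    return out


f = [1, 2, 3, 4]
g = [5, 6, 7, 8]
result = kronecker_mul_negacyclic(f, g, 3329)  # [3273, 3293, 2, 60]
```

Signed or centered coefficients need an extra offset or a sign-aware unpacking step, which this sketch leaves out.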

High-speed Instruction-set Coprocessor for Lattice-based Key Encapsulation Mechanism: Saber in Hardware

IACR Transactions on Cryptographic Hardware and Embedded Systems ◽

10.46586/tches.v2020.i4.443-466 ◽

2020 ◽

pp. 443-466

Author(s):

Sujoy Sinha Roy ◽

Andrea Basso

Keyword(s):

High Speed ◽

Critical Role ◽

Computation Time ◽

Public Key Cryptography ◽

Instruction Set ◽

Clock Frequency ◽

Polynomial Multiplication ◽

Trade Offs ◽

Key Encapsulation Mechanism ◽

Scale Design

In this paper, we present an instruction set coprocessor architecture for lattice-based cryptography and implement the module lattice-based post-quantum key encapsulation mechanism (KEM) Saber as a case study. To achieve fast computation time, the architecture is fully implemented in hardware, including CCA transformations. Since polynomial multiplication plays a performance-critical role in the module and ideal lattice-based public-key cryptography, a parallel polynomial multiplier architecture is proposed that overcomes memory access bottlenecks and results in a highly parallel yet simple and easy-to-scale design. Such multipliers can compute a full multiplication in 256 cycles, but are designed to target any area/performance trade-offs. Besides optimizing polynomial multiplication, we make important design decisions and perform architectural optimizations to reduce the overall cycle counts as well as improve resource utilization. For the module dimension 3 (security comparable to AES-192), the coprocessor computes CCA key generation, encapsulation, and decapsulation in only 5,453, 6,618 and 8,034 cycles respectively, making it the fastest hardware implementation of Saber to our knowledge. On a Xilinx UltraScale+ XCZU9EG-2FFVB1156 FPGA, the entire instruction set coprocessor architecture runs at 250 MHz clock frequency and consumes 23,686 LUTs, 9,805 FFs, and 2 BRAM tiles (including 5,113 LUTs and 3,068 FFs for the Keccak core).
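
The claim that a full multiplication takes 256 cycles can be pictured with a software model of a schoolbook multiplier that consumes one coefficient of the second operand per cycle. This is a rough sketch under assumptions not stated in the abstract: it uses Saber's published ring Z_q[x]/(x^256 + 1) with q = 2^13, the function name is hypothetical, and the inner list comprehension merely stands in for multiply-accumulate units that would operate in parallel in hardware.

```python
def cycle_model_mul(a, b, n=256, q=1 << 13):
    """Multiply a and b in Z_q[x]/(x^n + 1), one coefficient of b per loop pass.

    Each pass of the outer loop models one clock cycle of a fully parallel
    schoolbook multiplier, so a complete product takes n passes.
    """
    acc = [0] * n
    shifted = list(a)  # holds a * x^i reduced modulo x^n + 1
    for i in range(n):
        bi = b[i]
        # n multiply-accumulate operations, parallel in the hardware picture
        acc = [(acc[k] + bi * shifted[k]) % q for k in range(n)]
        # Multiply by x: rotate by one place and negate the wrapped coefficient
        shifted = [(-shifted[-1]) % q] + shifted[:-1]
    return acc


# Toy usage with n = 4 so the result can be checked by hand
toy = cycle_model_mul([1, 2, 3, 4], [5, 6, 7, 8], n=4)  # [8136, 8156, 2, 60]
```

Halving the number of multiply-accumulate units would roughly double the number of passes, which is one way to read the area/performance trade-off mentioned in the abstract.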

Polynomial Multiplication in NTRU Prime

IACR Transactions on Cryptographic Hardware and Embedded Systems ◽

10.46586/tches.v2021.i1.217-238 ◽

2020 ◽

pp. 217-238

Author(s):

Erdem Alkim ◽

Dean Yun-Li Cheng ◽

Chi-Ming Marvin Chung ◽

Hülya Evkan ◽

Leo Wei-Lun Huang ◽

...

Keyword(s):

Polynomial Ring ◽

State Of The Art ◽

The Other ◽

Polynomial Rings ◽

Key Generation ◽

Polynomial Multiplication ◽

Current State ◽

Key Encapsulation Mechanism

This paper proposes two different methods to perform NTT-based polynomial multiplication in polynomial rings that do not naturally support such a multiplication. We demonstrate these methods on the NTRU Prime key-encapsulation mechanism (KEM) proposed by Bernstein, Chuengsatiansup, Lange, and Vredendaal, which uses a polynomial ring that is, by design, not amenable to use with NTT. One of our approaches is using Good’s trick and focuses on speed and supporting more than one parameter set with a single implementation. The other approach is using a mixed radix NTT and focuses on the use of smaller multipliers and less memory. On an ARM Cortex-M4 microcontroller, we show that our three NTT-based implementations, one based on Good’s trick and two mixed radix NTTs, provide between 32% and 17% faster polynomial multiplication. For the parameter-set ntrulpr761, this results in between 16% and 9% faster total operations (sum of key generation, encapsulation, and decapsulation) and requires between 15% and 39% less memory than the current state-of-the-art NTRU Prime implementation on this platform, which is using Toom-Cook-based polynomial multiplication.
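
The general idea of using an NTT in a ring that does not support one can be sketched as follows: lift the operands to integers, multiply them with a transform in a larger NTT-friendly ring where no wrap-around occurs, then reduce the exact product modulo x^p - x - 1 and q. The code below is a toy illustration only. It uses a quadratic-time transform rather than Good's trick or a mixed-radix NTT, and the parameters (p = 7, q = 17, auxiliary prime 7681, transform length 16) and all function names are assumptions chosen for readability, not values from the paper.

```python
def find_root_of_unity(n, m):
    """Return an element of order exactly n modulo the prime m (n a power of two dividing m - 1)."""
    for g in range(2, m):
        w = pow(g, (m - 1) // n, m)
        if pow(w, n // 2, m) == m - 1:  # w^(n/2) = -1 means the order is exactly n
            return w
    raise ValueError("no suitable root found")


def naive_ntt(vec, w, m):
    """Quadratic-time number-theoretic transform, adequate for a toy size."""
    n = len(vec)
    return [sum(vec[j] * pow(w, i * j, m) for j in range(n)) % m for i in range(n)]


def centered(c, mod):
    """Map c to its representative of smallest absolute value modulo mod."""
    c %= mod
    return c - mod if c > mod // 2 else c


def mul_mod_xp_x_1(f, g, p, q, m=7681, n=16):
    """Multiply f and g in Z_q[x]/(x^p - x - 1) via a length-n transform modulo m."""
    # The transform must hold the full product, and m must exceed its coefficients
    assert n >= 2 * p - 1 and p * (q // 2) ** 2 < m // 2
    w = find_root_of_unity(n, m)
    fa = [centered(c, q) % m for c in f] + [0] * (n - p)
    ga = [centered(c, q) % m for c in g] + [0] * (n - p)
    prod_hat = [x * y % m for x, y in zip(naive_ntt(fa, w, m), naive_ntt(ga, w, m))]
    n_inv = pow(n, -1, m)  # modular inverse via pow needs Python 3.8 or later
    h = [centered(c * n_inv, m) for c in naive_ntt(prod_hat, pow(w, -1, m), m)]
    # Reduce modulo x^p - x - 1 using x^k = x^(k - p + 1) + x^(k - p) for k >= p
    for k in range(n - 1, p - 1, -1):
        h[k - p] += h[k]
        h[k - p + 1] += h[k]
        h[k] = 0
    return [c % q for c in h[:p]]


def schoolbook_reference(f, g, p, q):
    """Plain schoolbook product in the same ring, for comparison."""
    full = [0] * (2 * p - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            full[i + j] += a * b
    for k in range(2 * p - 2, p - 1, -1):
        full[k - p] += full[k]
        full[k - p + 1] += full[k]
    return [c % q for c in full[:p]]


# Toy usage: x * x^6 = x^7, which reduces to x + 1
x_times_x6 = mul_mod_xp_x_1([0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1], 7, 17)
```

A real implementation would replace `naive_ntt` with a fast transform; the lifting and final reduction steps are where the structure of the target ring enters.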

High-Speed and Scalable FPGA Implementation of the Key Generation for the Leighton-Micali Signature Protocol

2021 IEEE International Symposium on Circuits and Systems (ISCAS) ◽

10.1109/iscas51556.2021.9401177 ◽

2021 ◽

Author(s):

Yifeng Song ◽

Xiao Hu ◽

Wenhao Wang ◽

Jing Tian ◽

Zhongfeng Wang

Keyword(s):

High Speed ◽

Fpga Implementation ◽

Key Generation

Sailing Yacht Rig Improvements Through Viscous Computational Fluid Dynamics

10.5957/csys-2005-004 ◽

2005 ◽

Author(s):

Vincent G. Chapin ◽

Romaric Neyhousser ◽

Stephane Jamme ◽

Guillaume Dulliand ◽

Patrick Chassaing

Keyword(s):

Fluid Dynamics ◽

Computational Fluid Dynamics ◽

Wind Tunnel ◽

High Speed ◽

Stokes Equations ◽

Wind Tunnel Test ◽

Physical Modelling ◽

Optimized Design ◽

Linear Interaction ◽

Sailing Yacht

In this paper we propose a rational viscous Computational Fluid Dynamics (CFD) methodology applied to sailing yacht rig aerodynamic design and analysis. After an outline of present challenges in high-speed sailing, we emphasize the necessity of innovation and CFD to conceive, validate and optimize new aero-hydrodynamic concepts. Then, we present our CFD methodology through CAD, mesh generation, numerical and physical modelling choices, and their validation on typical rig configurations through wind-tunnel test comparisons. With the methodology defined, we illustrate the relevance and wide potential of advanced numerical tools to investigate sailing yacht rig design questions such as the relation between sail camber, propulsive force and aerodynamic finesse, and the mast-mainsail nonlinear interaction. Through these examples, it is shown how sailing yacht rig improvements may be obtained by using viscous CFD based on the Reynolds-Averaged Navier-Stokes (RANS) equations. We then discuss the extensive use of viscous CFD, rather than wind-tunnel tests on scale models, for the evaluation or ranking of improved designs with increased time savings. The viscous CFD methodology is used in a preliminary study of the complex and largely unknown Yves Parlier Hydraplaneur double rig. We show how it is possible to increase our understanding of its flow physics with strong sail interactions, and we hope this methodology will open new roads toward optimized design. Throughout the paper, the necessary comparison between CFD and wind-tunnel tests is presented to focus on limitations and drawbacks of viscous CFD tools, and to address future improvements.

High-Speed NTT-based Polynomial Multiplication Accelerator for Post-Quantum Cryptography

10.1109/arith51176.2021.00028 ◽

2021 ◽

Author(s):

Mojtaba Bisheh-Niasar ◽

Reza Azarderakhsh ◽

Mehran Mozaffari-Kermani

Keyword(s):

Quantum Cryptography ◽

High Speed ◽

Polynomial Multiplication ◽

Post Quantum Cryptography

Optimized design of signal crosstalk in high speed PCB

2012 13th International Conference on Electronic Packaging Technology & High Density Packaging ◽

10.1109/icept-hdp.2012.6474678 ◽

2012 ◽

Author(s):

Wenchao Tian ◽

Lei Shan ◽

Wenlong Wang ◽

Yadi Zhu

Keyword(s):

High Speed ◽

Optimized Design ◽

Signal Crosstalk

High-Speed Hybrid-Logic Full Adder Using High-Performance 10-T XOR–XNOR Cell

International Journal of Advanced Research in Science, Communication and Technology ◽

10.48175/ijarsct-1845 ◽

2021 ◽

pp. 263-269

Author(s):

Tejaswini M. L ◽

Aishwarya H ◽

Akhila M ◽

B. G. Manasa

Keyword(s):

High Speed ◽

High Performance ◽

Full Adder ◽

Cmos Technology ◽

Power Performance ◽

High Speed Design ◽

Power Delay Product ◽

Full Swing ◽

Output Swing ◽

High Output

The main aim of our work is to achieve low-power, high-speed design goals. The proposed hybrid adder is designed to meet the requirements of high output swing and minimum power. The performance of a hybrid full adder (FA) in terms of delay, power, and driving capability depends largely on the performance of its XOR-XNOR circuit, which also consumes most of the power in hybrid FAs. In this paper a 10T XOR-XNOR cell is proposed, which provides good driving capability and full-swing outputs simultaneously without using any external inverter. The performance of the proposed circuit is measured by simulating it in the Cadence Virtuoso environment using 90-nm CMOS technology. The circuit outperforms its counterparts, showing a lower power-delay product (PDP) than available XOR-XNOR modules. Four different full adder designs are proposed utilizing the 10T XOR-XNOR, sum and carry modules. The proposed FAs improve on the PDP of other architectures. To evaluate the performance of the proposed full adder circuits, we embedded them in 4-bit and 8-bit cascaded full adders. Among all FAs, two of the proposed FAs provide the best performance for a higher number of bits.
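
At the logic level, a hybrid full adder of the kind described here splits into an XOR-XNOR stage that feeds separate sum and carry modules. The following is a behavioural sketch only, assuming the common arrangement in which both output modules act as multiplexers controlled by the XOR-XNOR signals. It says nothing about the transistor-level 10T cell, its power or its delay, and the function names are hypothetical.

```python
def xor_xnor(a, b):
    """Produce the XOR and XNOR of two bits together, as the shared first stage does."""
    x = a ^ b
    return x, x ^ 1


def full_adder(a, b, cin):
    """One-bit full adder built from the XOR-XNOR signals plus sum and carry selection."""
    x, xn = xor_xnor(a, b)
    # Sum module: when a == b the sum is cin, otherwise it is the complement of cin
    s = cin if xn else cin ^ 1
    # Carry module: when a != b the carry is cin, otherwise it is a (which equals b)
    cout = cin if x else a
    return s, cout


def ripple_add(a_bits, b_bits, cin=0):
    """Cascade full adders over little-endian bit lists, as in a 4-bit or 8-bit chain."""
    out = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry


def to_bits(value, width):
    """Little-endian bit list of a non-negative integer."""
    return [(value >> i) & 1 for i in range(width)]


# Toy usage: 11 + 6 = 17, which is 0001 with a carry out in 4 bits
sum_bits, carry_out = ripple_add(to_bits(11, 4), to_bits(6, 4))
```

In the cascaded chain the carry path passes through one carry module per bit, which is why the driving capability of each stage matters more as the number of bits grows.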
