The SPEEDY Family of Block Ciphers

We introduce SPEEDY, a family of ultra low-latency block ciphers. We mix engineering expertise into each step of the cipher’s design process in order to create a secure encryption primitive with an extremely low latency in CMOS hardware. The centerpiece of our constructions is a high-speed 6-bit substitution box whose coordinate functions are realized as two-level NAND trees. In contrast to other low-latency block ciphers such as PRINCE, PRINCEv2, MANTIS and QARMA, we neither constrain ourselves by demanding decryption at low overhead, nor by requiring a super low area or energy. This freedom together with our gate- and transistor-level considerations allows us to create an ultra low-latency cipher which outperforms all known solutions in single-cycle encryption speed. Our main result, SPEEDY-6-192, is a 6-round 192-bit block and 192-bit key cipher which can be executed faster in hardware than any other known encryption primitive (including Gimli in Even-Mansour scheme and the Orthros pseudorandom function) and offers 128-bit security. One round more, i.e., SPEEDY-7-192, provides full 192-bit security. SPEEDY primarily targets hardware security solutions embedded in high-end CPUs, where area and energy restrictions are secondary while high performance is the number one priority.
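The two-level NAND-tree idea mentioned in the abstract can be illustrated with a minimal sketch. By De Morgan's law, any sum-of-products expression (an OR of ANDs) equals a NAND of NANDs over the same literals. The product terms in `TERMS` below are hypothetical and chosen only for illustration; they are not a coordinate function of the actual SPEEDY S-box, and the sketch assumes inverted inputs are freely available.

```python
from itertools import product

def nand(*bits):
    """n-input NAND gate on 0/1 values."""
    return 0 if all(bits) else 1

# Hypothetical 6-input function in sum-of-products form.
# Each term is a list of (input index, required value) literals.
TERMS = [
    [(0, 1), (1, 1)],
    [(2, 0), (3, 1), (4, 1)],
    [(5, 1)],
]

def literal(x, index, polarity):
    """Return the input bit, inverted when the literal is negated."""
    return x[index] if polarity else 1 - x[index]

def sum_of_products(terms, x):
    """Reference evaluation: OR over AND terms."""
    return int(any(all(literal(x, i, p) for i, p in term) for term in terms))

def nand_tree(terms, x):
    """Two-level evaluation: one NAND per term, then one NAND over all of them."""
    first_level = [nand(*(literal(x, i, p) for i, p in term)) for term in terms]
    return nand(*first_level)

# Exhaustively compare both forms over all 64 input combinations.
equivalent = all(
    sum_of_products(TERMS, x) == nand_tree(TERMS, x)
    for x in product((0, 1), repeat=6)
)
print("two-level NAND form matches sum-of-products:", equivalent)
```

The appeal for latency is that the logic depth is fixed at two gate levels regardless of how many product terms the function has, at the cost of wide gates.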

Low Latency Network-on-Chip Router Microarchitecture Using Request Masking Technique

International Journal of Reconfigurable Computing ◽

10.1155/2015/570836 ◽

2015 ◽

Vol 2015 ◽

pp. 1-13 ◽

Cited By ~ 14

Author(s):

Alireza Monemi ◽

Chia Yee Ooi ◽

Muhammad Nadzir Marsono

Keyword(s):

High Performance ◽

Clock Cycle ◽

Network On Chip ◽

Operating Frequency ◽

Low Latency ◽

Core System ◽

Low Area ◽

Area Overhead ◽

Logic Cells ◽

On Chip

Network-on-Chip (NoC) is fast emerging as an on-chip communication alternative for many-core System-on-Chips (SoCs). However, designing a high performance low latency NoC with low area overhead has remained a challenge. In this paper, we present a two-clock-cycle latency NoC microarchitecture. An efficient request masking technique is proposed to combine virtual channel (VC) allocation with switch allocation nonspeculatively. Our proposed NoC architecture is optimized in terms of area overhead, operating frequency, and quality-of-service (QoS). We evaluate our NoC against CONNECT, an open source low latency NoC design targeted for field-programmable gate array (FPGA). The experimental results on several FPGA devices show that our NoC router outperforms CONNECT with 50% reduction of logic cells (LCs) utilization, while it works with 100% and 35%~20% higher operating frequency compared to the one- and two-clock-cycle latency CONNECT NoC routers, respectively. Moreover, the proposed NoC router achieves 2.3 times better performance compared to CONNECT.

An Optimized CB-UT Multiplier for Efficient Design of the AM Operator

10.31219/osf.io/dhx9j ◽

2020 ◽

Author(s):

Hari Krishna Modalavalasa

Keyword(s):

Signal Processing ◽

Performance Optimization ◽

High Speed ◽

High Performance ◽

New Technology ◽

Digital Signal ◽

Propagation Delay ◽

Efficient Design ◽

Silicon Area ◽

Low Area

Multiplication and accumulation are the vital operations involved in almost all Digital Signal Processing applications. With the advent of new technology in the domains of VLSI, communication and signal processing, there is an ever-growing demand for high-speed processing and low-area design. In today's technology, Add-Multiply (AM) operators or Multiply-Accumulate (MAC) units are generally employed in all high-performance digital signal processors (DSPs) and controllers. The performance of the AM operator mainly depends on the speed of the multiplier. Considerable research has been devoted to this area, and conventional multipliers have been modified to provide good speed, but they need further improvement along with area optimization. The Urdhwa-Tiryakbhyam Multiplier (UTM) architecture is adopted from the ancient Indian mathematics of the "Vedas" and can generate the partial products and sums in one step, which reduces the carry propagation from LSB to MSB. UTM can be used to implement high-performance AM operators but results in larger silicon area. This increased area can be minimized by using the modified compressor-based design of UTM. In this work, the carry-look-ahead (CLA) adder is adopted instead of parallel adders for high-speed accumulation. The Compressor-Based Urdhwa-Tiryakbhyam (CB-UT) multiplier with CLA thus results in both area and performance optimization of the Add-Multiply operator. The functionality of this architecture is evaluated by comparing it with the Modified Booth (MB) multiplier based AM operator in terms of performance parameters such as propagation delay, power consumption and silicon area. The design is implemented and verified using a Xilinx Spartan-3E FPGA and the ISE Simulator.
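The "vertical and crosswise" principle behind the Urdhwa-Tiryakbhyam scheme can be sketched in software as follows. This is a behavioural illustration only, assuming unsigned operands of a fixed bit width; the function name `urdhva_multiply` is hypothetical, and the sketch models neither the compressor-based hardware nor the CLA adder of the paper. Each output column collects all crosswise bit products of equal weight at once, and carries are resolved in a single pass afterwards.

```python
def urdhva_multiply(a, b, width=8):
    """Multiply two unsigned `width`-bit integers column by column."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]

    # Column k gathers every partial product a_i * b_j with i + j == k.
    # In hardware all columns can be formed in parallel.
    columns = [
        sum(a_bits[i] & b_bits[k - i] for i in range(width) if 0 <= k - i < width)
        for k in range(2 * width - 1)
    ]

    # Resolve carries from the least significant column upward.
    result, carry = 0, 0
    for k, column_sum in enumerate(columns):
        total = column_sum + carry
        result |= (total & 1) << k
        carry = total >> 1
    result |= carry << len(columns)
    return result

print(urdhva_multiply(13, 11, width=4))
print(urdhva_multiply(200, 155))
```

The column sums are what a compressor tree would reduce in a hardware design; here ordinary integer addition stands in for that step.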

Automatic Feature Recognition for Data Interoperability Issues in High-Speed Electronics System Design

Volume 6: Electronics and Photonics ◽

10.1115/imece2008-68389 ◽

2008 ◽

Author(s):

Xiaolin Chen ◽

Hui Zhang ◽

Will Miller

Keyword(s):

System Design ◽

Design Process ◽

High Speed ◽

High Performance ◽

Feature Recognition ◽

Circuit Simulation ◽

Signal Integrity ◽

Electronic Systems ◽

Design Data ◽

Automatic Feature Recognition

Technology trends toward higher speed and density devices have pushed high performance electronic system design to its limits. With fine miniaturization of very-large-scale integrated (VLSI) circuits and rapid increase in the working frequency of system-on-a-chip (SoC), the signal integrity has become a major concern. As the operating frequencies enter the gigahertz range, signal integrity issues such as cross talk, power-ground-plane voltage bounce, and substrate losses can no longer be neglected. In order to design high-performance electronic systems with fast time-to-market, it is often needed to analyze whole or part of the system at one fundamentally deeper level of physics. It has begun to be recognized that electromagnetic (EM) field analysis needs to be rigorously included as an addition to traditional circuit simulation. A common problem in this practice is the lack of efficient tools that enable engineers to easily transfer circuit board design data into EM solvers. To partially solve this problem, ACIS SAT has been introduced as a standard data exchange format and been adopted by many software vendors for data import and export. However, efficient data transfer remains a problem as the geometry created in the design package becomes static and no longer feature-based once imported into the simulation package. In this paper, automatic feature recognition algorithms are implemented to help extract features and parameters from the imported static model in SAT format. Case studies will be provided for some representative high speed electronics designs. This work is supported by Research & Technology Development Grant Program of Washington Technology Center with a goal to achieve improved design process for high-speed electronic systems. 
The developed tool has the potential to speed up the current design process by eliminating laborious manual preparation of design data for EM simulation and by allowing what-if analysis to be automated to highlight likely signal integrity issues.

Comparative Analysis Domino Logic Based Techniques For VLSI Circuit

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v12i8.2998 ◽

2014 ◽

Vol 12 (8) ◽

pp. 3803-3808 ◽

Cited By ~ 1

Author(s):

Shilpa Kamde ◽

Jitesh Shinde ◽

Sanjay Badjate ◽

Pratik Hajare

Keyword(s):

High Speed ◽

High Performance ◽

Circuit Simulation ◽

Dynamic Logic ◽

Digital Circuit ◽

Logic Circuit ◽

Vlsi Circuit ◽

Short Channel ◽

Domino Logic ◽

Low Area

Domino logic is a CMOS-based evolution of dynamic logic techniques based on either PMOS or NMOS transistors. The domino logic technique is widely used in modern digital VLSI circuits. Dynamic logic is twice as fast as static CMOS logic because it uses only fast N-type transistors. Dynamic (domino) logic circuits are often favored in high-performance designs because of their high speed and low area. Four dynamic circuit techniques, including the basic domino logic circuit, are compared in this paper in terms of the power consumption and speed of domino logic circuits. The BSIM (Berkeley Short-channel IGFET Model) is used for digital circuit simulation because it accounts for leakage current.

Design and Implementation of High-Performance ECC Processor with Unified Point Addition on Twisted Edwards Curve

Sensors ◽

10.3390/s20185148 ◽

2020 ◽

Vol 20 (18) ◽

pp. 5148

Author(s):

Md. Mainul Islam ◽

Md. Selim Hossain ◽

Moh. Khalid Hasan ◽

Md. Shahjalal ◽

Yeong Min Jang

Keyword(s):

Elliptic Curve ◽

High Speed ◽

High Performance ◽

Security Level ◽

Prime Field ◽

Iot Security ◽

Point Multiplication ◽

Low Area ◽

Edwards Curve ◽

Twisted Edwards Curve

With the swift evolution of wireless technologies, the demand for the Internet of Things (IoT) security is rising immensely. Elliptic curve cryptography (ECC) provides an attractive solution to fulfill this demand. In recent years, Edwards curves have gained widespread acceptance in digital signatures and ECC due to their faster group operations and higher resistance against side-channel attacks (SCAs) than that of the Weierstrass form of elliptic curves. In this paper, we propose a high-speed, low-area, simple power analysis (SPA)-resistant field-programmable gate array (FPGA) implementation of ECC processor with unified point addition on a twisted Edwards curve, namely Edwards25519. Efficient hardware architectures for modular multiplication, modular inversion, unified point addition, and elliptic curve point multiplication (ECPM) are proposed. To reduce the computational complexity of ECPM, the ECPM scheme is designed in projective coordinates instead of affine coordinates. The proposed ECC processor performs 256-bit point multiplication over a prime field in 198,715 clock cycles and takes 1.9 ms with a throughput of 134.5 kbps, occupying only 6543 slices on Xilinx Virtex-7 FPGA platform. It supports high-speed public-key generation using fewer hardware resources without compromising the security level, which is a challenging requirement for IoT security.
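As a rough illustration of unified point addition in projective coordinates, the sketch below uses the standard projective addition formulas for twisted Edwards curves with the well-known Edwards25519 parameters (p = 2^255 - 19, a = -1, d = -121665/121666). It is a software sketch under these assumptions, not the hardware architecture of the paper; all function names are hypothetical. "Unified" means the same formula serves both addition of distinct points and doubling, which is what removes the operation-dependent pattern that simple power analysis exploits.

```python
P = 2**255 - 19                               # field prime
A = P - 1                                     # curve parameter a = -1 (mod P)
D = (-121665 * pow(121666, P - 2, P)) % P     # curve parameter d

def inv(x):
    """Modular inverse via Fermat's little theorem."""
    return pow(x, P - 2, P)

def unified_add(p1, p2):
    """Add two projective points (X, Y, Z); also valid when p1 == p2."""
    x1, y1, z1 = p1
    x2, y2, z2 = p2
    zz = z1 * z2 % P
    b = zz * zz % P
    c = x1 * x2 % P
    dd = y1 * y2 % P
    e = D * c * dd % P
    f = (b - e) % P
    g = (b + e) % P
    x3 = zz * f * ((x1 + y1) * (x2 + y2) - c - dd) % P
    y3 = zz * g * (dd - A * c) % P
    z3 = f * g % P
    return (x3, y3, z3)

def to_affine(point):
    """Convert (X, Y, Z) to affine (x, y) with a single inversion."""
    x, y, z = point
    z_inv = inv(z)
    return (x * z_inv % P, y * z_inv % P)

def on_curve(x, y):
    """Check a*x^2 + y^2 == 1 + d*x^2*y^2 (mod P)."""
    return (A * x * x + y * y - 1 - D * x * x * y * y) % P == 0

def recover_x(y):
    """Recover an x-coordinate for a given y (P is congruent to 5 mod 8)."""
    u = (y * y - 1) * inv(D * y * y + 1) % P
    x = pow(u, (P + 3) // 8, P)
    if (x * x - u) % P != 0:
        x = x * pow(2, (P - 1) // 4, P) % P
    return x

# Base point with y = 4/5, as in the usual Ed25519 definition.
BASE_Y = 4 * inv(5) % P
BASE = (recover_x(BASE_Y), BASE_Y, 1)
IDENTITY = (0, 1, 1)

doubled = to_affine(unified_add(BASE, BASE))
print("2*B on curve:", on_curve(*doubled))
```

Working in projective coordinates defers the costly modular inversion to one final conversion, which is the motivation the abstract gives for avoiding affine coordinates during point multiplication.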

An Automatic Design Framework for Real-Time Power System Simulators Supporting Smart Grid Applications

Electronics ◽

10.3390/electronics9020299 ◽

2020 ◽

Vol 9 (2) ◽

pp. 299 ◽

Cited By ~ 1

Author(s):

Eleftherios Mylonas ◽

Nikolaos Tzanis ◽

Michael Birbas ◽

Alexios Birbas

Keyword(s):

Smart Grid ◽

Power System ◽

Real Time ◽

High Speed ◽

High Performance ◽

Power Grids ◽

Low Latency ◽

Successful Implementation ◽

Automatic Design ◽

Design Framework

Smart grid technology is the next step to the evolution of classical power grids, providing robustness, reliability, and security throughout the network, enabling real-time management and control. To achieve these goals, distributed computing (microgrid concept) and intelligent control algorithms, tailored to the nature and needs of the network under study, are necessary. To deal with the vast diversity of power grids, being able to capture the dynamics of any given network, and create tools for network analysis, apparatus testing, and power grid management, an automatic design framework for real-time power system simulators is needed. In this article, a prototype of this approach is presented, employing Field Programmable Gate Array (FPGA) platforms due to their reconfigurability that enables low-power, low-latency, and high-performance designs, as a first attempt towards an open source platform, compatible with the majority of hardware design suites. It comprises two major parts: (i) a user-oriented section, built in Matlab/Simulink; and (ii) a hardware-oriented section, written in Matlab and Very High Speed Integrated Circuit (VHSIC)-Hardware Description Language (VHDL) code. To verify its functionality, two test power networks were given in a schematic format, analyzed through Matlab code and turned into dedicated hardware simulators with the aid of the VHDL template. Then, simulation results from Simulink and the prototype were compared for error estimation. The results show the prototype’s successful implementation with minimal resources utilization, high performance and low latency in the order of nanoseconds in Xilinx 6- and 7-series FPGAs, therefore proving its modularity and efficient use in many different scenarios, meeting low-latency/real-time requirements while enabling further smart grid research.

Heuristic method for bitsliced representation of randomly generated 8×8 cryptographic S-Box

Ukrainian Journal of Information Technology ◽

10.23939/ujit2021.02.058 ◽

2021 ◽

Vol 3 (2) ◽

pp. 58-65

Author(s):

Ya. R. Sovyn ◽

V. V. Khoma

Keyword(s):

High Speed ◽

High Performance ◽

Heuristic Method ◽

Block Ciphers ◽

Truth Table ◽

Software Implementation ◽

Logical Operations ◽

Resistance To Power ◽

Symmetric Block ◽

Cache Attacks

The article is devoted to the issues of increasing the security and efficiency of software implementation for the symmetric block ciphers. For the implementation of cryptoalgorithms on low-end CPUs (8/16/32-bit microcontrollers), it is important to provide increased resistance to power consumption analysis attacks. With regard to the implementation of ciphers on high-end CPUs (x86, ARM Cortex-A), it is important to eliminate the vulnerability primarily to timing and cache attacks. The authors used a bitslice approach to securely implement block ciphers, which has potential advantages such as high speed and low computing resources. However, the known bitsliced methods have a significant limitation, since they work with deterministic S-Boxes or arbitrary S-Boxes of smaller sizes. The paper proposes a new heuristic method for bitsliced representation of cryptographic 8×8 S-Boxes containing randomly generated values. These values defy description using algebraic expressions. The method is based on the decomposition of the truth table, which describes the S-Box, into two parts. One part of the table forms logical masks, and the other is split into bit vectors. To find a logical description of these vectors an exhaustive search is used. After finding the description of all vectors, these two parts of the table are combined into one using logical operations. The use of this method oriented on software implementation in the logical basis {AND, OR, XOR, NOT} ensures the minimization of arbitrary 8×8 S-Boxes. The proposed method can be implemented using standard logical instructions on any 8/16/32/64-bit processors. It is also possible to use logical SIMD instructions from the SSE, AVX, AVX-512 extensions for x86-64 processors, which provides high performance due to the use of long registers. 
Software has been developed that implements the method of searching for bitsliced representations of a given S-Box and automatically generates C++ code for it based on SSE, AVX and AVX-512 instructions. The effectiveness of the method has been investigated on the S-Boxes of known block ciphers, in particular the Ukrainian encryption standard "Kalyna". It was found that the developed algorithm requires almost half as many gates for the bitsliced description of an arbitrary S-Box as the best known algorithm (370 gates versus 680, respectively). For ciphers that use two or four S-Box tables, joint minimization can yield up to 330 or 300 gates per table, respectively. Keywords: bitslicing; S-Box; logical minimization; SIMD; x86-64 CPU; software implementation; block ciphers.
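The bitslice approach itself can be illustrated with a toy example. The sketch below does not implement the heuristic search described in the article; it only shows, under simplifying assumptions, how an S-Box expressed with AND, XOR and NOT is evaluated on many inputs at once. The 3-bit mapping used here (the chi-style rule y_i = x_i XOR (NOT x_{i+1} AND x_{i+2})) and all function names are chosen for illustration and are far smaller than the 8×8 S-Boxes the article targets.

```python
NBITS = 3

def sbox_scalar(value):
    """Toy 3-bit S-box: y_i = x_i ^ (~x_{i+1} & x_{i+2}), indices mod 3."""
    x = [(value >> i) & 1 for i in range(NBITS)]
    y = [x[i] ^ ((1 - x[(i + 1) % NBITS]) & x[(i + 2) % NBITS]) for i in range(NBITS)]
    return sum(bit << i for i, bit in enumerate(y))

# Conventional table-lookup form of the same S-box.
TABLE = [sbox_scalar(v) for v in range(1 << NBITS)]

def pack(values):
    """Transpose: word i holds bit i of every input, one input per bit lane."""
    return [
        sum(((v >> i) & 1) << lane for lane, v in enumerate(values))
        for i in range(NBITS)
    ]

def unpack(words, count):
    """Inverse transpose back to a list of `count` small integers."""
    return [
        sum(((words[i] >> lane) & 1) << i for i in range(NBITS))
        for lane in range(count)
    ]

def sbox_bitsliced(words, mask):
    """Evaluate the S-box on all lanes using only word-wide logic operations."""
    return [
        words[i] ^ (~words[(i + 1) % NBITS] & words[(i + 2) % NBITS] & mask)
        for i in range(NBITS)
    ]

inputs = list(range(8)) * 4            # 32 lanes, e.g. one 32-bit register
lane_mask = (1 << len(inputs)) - 1
outputs = unpack(sbox_bitsliced(pack(inputs), lane_mask), len(inputs))
print(outputs == [TABLE[v] for v in inputs])
```

Because the bitsliced form performs no data-dependent memory access, it avoids the table lookups that timing and cache attacks exploit; wider registers simply mean more lanes processed per instruction.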

High-performance Low-area Iterative BCH Decoder Architecture for Ultra High Speed Optical Communications

Journal of the Institute of Electronics and Information Engineers ◽

10.5573/ieie.2019.56.1.29 ◽

2019 ◽

Vol 56 (1) ◽

pp. 29-39

Author(s):

Dae-Hyun Choi ◽

Hanho Lee

Keyword(s):

Optical Communications ◽

High Speed ◽

High Performance ◽

Low Area ◽

Ultra High Speed ◽

Decoder Architecture

Vacuum System to Minimize the Specimen Contamination of High-Performance EM

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100077967 ◽

1977 ◽

Vol 35 ◽

pp. 68-69

Author(s):

N. Yoshimura ◽

K. Shirota ◽

T. Etoh

Keyword(s):

Electron Microscope ◽

High Speed ◽

High Performance ◽

High Vacuum ◽

Vacuum System ◽

Pump System ◽

Pumping System ◽

Diffusion Pump ◽

Almost All ◽

Cascade Type

One of the most important requirements for a high-performance EM, especially an analytical EM using a fine beam probe, is to prevent specimen contamination by providing a clean high vacuum in the vicinity of the specimen. However, in almost all commercial EMs, the pressure in the vicinity of the specimen under observation is usually more than ten times higher than the pressure measured at the pumping line. The EM column inevitably requires the use of greased Viton O-rings for fine movement, specimens and films need to be exchanged frequently, and several attachments may also be exchanged. For these reasons, a high-speed pumping system, as well as a clean vacuum system, is now required. A newly developed electron microscope, the JEM-100CX, features a clean high vacuum in the vicinity of the specimen, realized by the use of a CASCADE-type diffusion pump system which has been substantially improved over its predecessor employed on the JEM-100C.

PHAX-SCAN: Functional integration of a Scanning Electron Microscope and an energy-dispersive x-ray analyser

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100152252 ◽

1989 ◽

Vol 47 ◽

pp. 56-57

Author(s):

Marc H. Peeters ◽

Max T. Otten

Keyword(s):

Electron Microscope ◽

Scanning Electron Microscope ◽

High Speed ◽

High Performance ◽

Functional Integration ◽

Energy Dispersive ◽

X Rays ◽

X Ray ◽

High Speed Analysis ◽

Scanning Electron

Over the past decades, the combination of energy-dispersive analysis of X-rays and scanning electron microscopy has proved to be a powerful tool for fast and reliable elemental characterization of a large variety of specimens. The technique has evolved rapidly from a purely qualitative characterization method to a reliable quantitative way of analysis. In the last 5 years, an increasing need for automation is observed, whereby energy-dispersive analysers control the beam and stage movement of the scanning electron microscope in order to collect digital X-ray images and perform unattended point analysis over multiple locations. The Philips High-speed Analysis of X-rays system (PHAX-Scan) makes use of the high performance dual-processor structure of the EDAX PV9900 analyser and the databus structure of the Philips series 500 scanning electron microscope to provide a highly automated, user-friendly and extremely fast microanalysis system. The software that runs on the hardware described above was specifically designed to provide the ultimate attainable speed on the system.
