New First-Order Secure AES Performance Records

Being based on a sound theoretical basis, masking schemes are commonly applied to protect cryptographic implementations against Side-Channel Analysis (SCA) attacks. Constructing SCA-protected AES, as the most widely deployed block cipher, has been naturally the focus of several research projects, with a direct application in industry. The majority of SCA-secure AES implementations introduced to the community opted for low area and latency overheads considering Application-Specific Integrated Circuit (ASIC) platforms. Albeit a few, those which particularly targeted Field Programmable Gate Arrays (FPGAs) as the implementation platform yield either a low throughput or a not-highly secure design.In this work, we fill this gap by introducing first-order glitch-extended probing secure masked AES implementations highly optimized for FPGAs, which support both encryption and decryption. Compared to the state of the art, our designs efficiently map the critical non-linear parts of the masked S-box into the built-in Block RAMs (BRAMs).The most performant variant of our constructions accomplishes five first-order secure AES encryptions/decryptions simultaneously in 50 clock cycles. Compared to the equivalent state-of-the-art designs, this leads to at least 70% reduction in utilization of FPGA resources (slices) at the cost of occupying BRAMs. Last but not least, we provide a wide range of such secure and efficient implementations supporting a large set of applications, ranging from low-area to high-throughput.

Download Full-text

A P4-Enabled RINA Interior Router for Software-Defined Data Centers

Computers ◽

10.3390/computers9030070 ◽

2020 ◽

Vol 9 (3) ◽

pp. 70

Author(s):

Carolina Fernández ◽

Sergio Giménez ◽

Eduard Grasa ◽

Steve Bunch

Keyword(s):

Integrated Circuit ◽

High Performance ◽

Data Transfer ◽

Great Promise ◽

Gate Arrays ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Application Specific Integrated Circuit ◽

Networking Technologies ◽

Application Specific

The lack of high-performance RINA (Recursive InterNetwork Architecture) implementations to date makes it hard to experiment with RINA as an underlay networking fabric solution for different types of networks, and to assess RINA’s benefits in practice on scenarios with high traffic loads. High-performance router implementations typically require dedicated hardware support, such as FPGAs (Field Programmable Gate Arrays) or specialized ASICs (Application Specific Integrated Circuit). With the advance of hardware programmability in recent years, new possibilities unfold to prototype novel networking technologies. In particular, the use of the P4 programming language for programmable ASICs holds great promise for developing a RINA router. This paper details the design and part of the implementation of the first P4-based RINA interior router, which reuses the layer management components of the IRATI Linux-based RINA implementation and implements the data-transfer components using a P4 program. We also describe the configuration and testing of our initial deployment scenarios, using ancillary open-source tools such as the P4 reference test software switch (BMv2) or the P4Runtime API.

Download Full-text

Sparse Cholesky Factorization on FPGA Using Parameterized Model

Mathematical Problems in Engineering ◽

10.1155/2017/3021591 ◽

2017 ◽

Vol 2017 ◽

pp. 1-11

Author(s):

Yichun Sun ◽

Hengzhu Liu ◽

Tong Zhou

Keyword(s):

Integrated Circuit ◽

Sparse Matrix ◽

Fundamental Problem ◽

Performance Model ◽

Cholesky Factorization ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Application Specific Integrated Circuit ◽

Parameterized Model ◽

And Performance

Cholesky factorization is a fundamental problem in most engineering and science computation applications. When dealing with a large sparse matrix, numerical decomposition consumes the most time. We present a vector architecture to parallelize numerical decomposition of Cholesky factorization. We construct an integrated analytical parameterized performance model to accurately predict the execution times of typical matrices under varying parameters. Our proposed approach is general for accelerator and limited by neither field-programmable gate arrays (FPGAs) nor application-specific integrated circuit. We implement a simplified module in FPGAs to prove the accuracy of the model. The experiments show that, for most cases, the performance differences between the predicted and measured execution are less than 10%. Based on the performance model, we optimize parameters and obtain a balance of resources and performance after analyzing the performance of varied parameter settings. Comparing with the state-of-the-art implementation in CPU and GPU, we find that the performance of the optimal parameters is 2x that of CPU. Our model offers several advantages, particularly in power consumption. It provides guidance for the design of future acceleration components.

Download Full-text

AREEBA: An Area Efficient Binary Huff-Curve Architecture

Electronics ◽

10.3390/electronics10121490 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1490

Author(s):

Asher Sajid ◽

Muhammad Rashid ◽

Sajjad Shaukat Jamal ◽

Malik Imran ◽

Saud S. Alotaibi ◽

...

Keyword(s):

Elliptic Curve Cryptography ◽

Power Analysis ◽

State Of The Art ◽

Power Analysis Attacks ◽

Low Area ◽

Gate Arrays ◽

Asymmetric Cryptography ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Computational Resources

Elliptic curve cryptography is the most widely employed class of asymmetric cryptography algorithm. However, it is exposed to simple power analysis attacks due to the lack of unifiedness over point doubling and addition operations. The unified crypto systems such as Binary Edward, Hessian and Huff curves provide resistance against power analysis attacks. Furthermore, Huff curves are more secure than Edward and Hessian curves but require more computational resources. Therefore, this article has provided a low area hardware architecture for point multiplication computation of Binary Huff curves over GF(2163) and GF(2233). To achieve this, a segmented least significant digit multiplier for polynomial multiplications is proposed. In order to provide a realistic and reasonable comparison with state of the art solutions, the proposed architecture is modeled in Verilog and synthesized for different field programmable gate arrays. For Virtex-4, Virtex-5, Virtex-6, and Virtex-7 devices, the utilized hardware resources in terms of hardware slices over GF(2163) are 5302, 2412, 2982 and 3508, respectively. The corresponding achieved values over GF(2233) are 11,557, 10,065, 4370 and 4261, respectively. The reported low area values provide the acceptability of this work in area-constrained applications.

Download Full-text

Soft Core Processor Generated Based on the Machine Code of the Application

Journal of Circuits System and Computers ◽

10.1142/s0218126616500298 ◽

2016 ◽

Vol 25 (04) ◽

pp. 1650029 ◽

Cited By ~ 11

Author(s):

Adam Ziebinski ◽

Stanwlaw Swierc

Keyword(s):

Embedded System ◽

Soft Core ◽

Machine Code ◽

Correct Operation ◽

Gate Arrays ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Instruction Set Extensions ◽

The Cost ◽

Application Specific

Currently embedded system designs aim to improve areas such as speed, energy efficiency and the cost of an application. Application-specific instruction set extensions on reconfigurable hardware provide such opportunities. The article presents a new approach for generating soft core processors that are optimized for specific tasks. In this work, we describe an automatic method for selecting custom instructions for generating software core processors that are based on the machine code of the application program. As the result, a soft core processor will contain the logic that is absolutely necessary. This solution requires fewer gates to be synthesized in the field programmable gate arrays (FPGA) and has a potential to increase the speed of the information processing that is performed by the system in the target FPGA. Experiments have confirmed the correct operation of the method that was used. After the reduction mechanism was enabled, the total number of slices blocks that were occupied decreased to 47% of its initial value in the best case for the Xilinx Spartan3 (xc3s200) and the maximum frequency increased approximately 44% in the best case for Xilinx Spartan6 (xc6slx4).

Download Full-text

BPR-TCAM—Block and Partial Reconfiguration based TCAM on Xilinx FPGAs

Electronics ◽

10.3390/electronics9020353 ◽

2020 ◽

Vol 9 (2) ◽

pp. 353 ◽

Cited By ~ 1

Author(s):

Anees Ullah ◽

Ali Zahir ◽

Noaman A. Khan ◽

Waleed Ahmad ◽

Alexis Ramos ◽

...

Keyword(s):

Resource Utilization ◽

High Speed ◽

State Of The Art ◽

Field Programmable Gate Arrays ◽

Partial Reconfiguration ◽

Gate Arrays ◽

Content Addressable Memories ◽

Field Programmable ◽

Programmable Gate Arrays

Field Programmable Gate Arrays (FPGAs) based Ternary Content Addressable Memories (TCAMs) are widely used in high-speed networking applications.However, TCAMs are not present on state-of-the-art FPGAs and need to be emulated on SRAM-based memories (i.e., LUTRAMs and Block RAMs) which requires a large amount of FPGA resources. In this paper, we present an efficient methodology to implement FPGA-based TCAMs with significant resource savings compared to existing schemes. The proposed methodology exploits the fracturable nature of Look Up Tables (LUTs) and the built-in slice carry-chains for simultaneous mapping of two rules and its matching logic to a single FPGA slice. Multiple slices can be stacked together to build deeper and wider TCAMs in a modular way. The combination of all these techniques results in significant savings in resource utilization compared to existing approaches.

Download Full-text

State-of-the-art in Heterogeneous Computing

Scientific Programming ◽

10.1155/2010/540159 ◽

2010 ◽

Vol 18 (1) ◽

pp. 1-33 ◽

Cited By ~ 96

Author(s):

Andre R. Brodtkorb ◽

Christopher Dyken ◽

Trond R. Hagen ◽

Jon M. Hjelmervik ◽

Olaf O. Storaasli

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Heterogeneous Computing ◽

State Of The Art ◽

Peak Performance ◽

Fine Grained ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Cost Efficient ◽

Graphics Processing

Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.

Download Full-text

SS-OCT and FD-OCT System Prototyping Using LabVIEW FPGA

10.1115/biomed2011-66007 ◽

2011 ◽

Author(s):

Zach Olson

Keyword(s):

Real Time ◽

Graphics Processing Units ◽

Image Data ◽

Light Sources ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Labview Fpga ◽

And Performance ◽

Control Light ◽

The Cost

Optical coherence tomography (OCT) techniques have opened up a number of new medical imaging applications in research and clinical applications. Key application areas include cancer research, vascular applications such as imaging arterial plaque, and ophthalmology applications such as pre and post-operative cataract surgery imaging. Emerging Technologies in galvo control, light sources, detector technologies, and parallel hardware-based processing are increasing the quality and performance of images, as well as reducing the cost and footprint of OCT systems. The parallel computing capabilities of field programmable gate arrays (FPGAs), multi-core processors, and graphics processing units (GPUs) have enabled real-time OCT image processing, which provides real-time image data to support surgical procedures.

Download Full-text

Comparison of Interpolators Used for Time-Interval Measurement Systems Based on Multiple-Tapped Delay Line

Metrology and Measurement Systems ◽

10.1515/mms-2017-0033 ◽

2017 ◽

Vol 24 (2) ◽

pp. 401-412 ◽

Cited By ~ 1

Author(s):

Dariusz Chaberski ◽

Robert Frankowski ◽

Maciej Gurski ◽

Marek Zieliński

Keyword(s):

Measurement Range ◽

Time Interval ◽

Test Results ◽

Measurement Systems ◽

Gate Arrays ◽

Wide Range ◽

Time Stamps ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Construction Operation

AbstractThe paper describes the construction, operation and test results of three most popular interpolators from a viewpoint of time-interval (TI) measurement systems consisting of many tapped-delay lines (TDLs) and registering pulses of a wide-range changeable intensity. The comparison criteria include the maximum intensity of registered time stamps (TSs), the dependency of interpolator characteristic on the registered TSs’ intensity, the need of using either two counters or a mutually-complementing pair counter-register for extending a measurement range, the need of calculating offsets between TDL inputs and the dependency of a resolution increase on the number of used TDL segments. This work also contains conclusions about a range of applications, usefulness and methods of employing each described TI interpolator. The presented experimental results bring new facts that can be used by the designers who implement precise time delays in the field-programmable gate arrays (FPGA).

Download Full-text

Analysis of the Quantization Noise in Discrete Wavelet Transform Filters for Image Processing

Electronics ◽

10.3390/electronics7080135 ◽

2018 ◽

Vol 7 (8) ◽

pp. 135 ◽

Cited By ~ 9

Author(s):

Nikolay Chervyakov ◽

Pavel Lyakhov ◽

Dmitry Kaplun ◽

Denis Butusov ◽

Nikolay Nagornov

Keyword(s):

Image Processing ◽

Wavelet Transform ◽

Discrete Wavelet Transform ◽

Integrated Circuit ◽

Filter Banks ◽

Signal To Noise Ratio ◽

Quantization Noise ◽

Discrete Wavelet ◽

Field Programmable ◽

Application Specific Integrated Circuit

In this paper, we analyze the noise quantization effects in coefficients of discrete wavelet transform (DWT) filter banks for image processing. We propose the implementation of the DWT method, making it possible to determine the effective bit-width of the filter banks coefficients at which the quantization noise does not significantly affect the image processing results according to the peak signal-to-noise ratio (PSNR). The dependence between the PSNR of the DWT image quality on the wavelet and the bit-width of the wavelet filter coefficients is analyzed. The formulas for determining the minimal bit-width of the filter coefficients at which the processed image achieves high quality (PSNR ≥ 40 dB) are given. The obtained theoretical results were confirmed through the simulation of DWT for a test image using the calculated bit-width values. All considered algorithms operate with fixed-point numbers, which simplifies their hardware implementation on modern devices: field-programmable gate array (FPGA), application-specific integrated circuit (ASIC), etc.

Download Full-text

Up Close: The Interuniversity Microelectronics Center (IMEC), Leuven, Belgium

MRS Bulletin ◽

10.1557/s0883769400062692 ◽

1989 ◽

Vol 14 (6) ◽

pp. 35-38 ◽

Cited By ~ 1

Author(s):

Dirk Denoyelle

Keyword(s):

Integrated Circuit ◽

Computer Aided Design ◽

Systems Design ◽

System Level ◽

Clean Room ◽

Integrated Circuit Design ◽

Processing Technologies ◽

Wide Range ◽

Application Specific Integrated Circuit ◽

Aided Design

The Interuniversity Microelectronics Center, Leuven, Belgium (IMEC) is one of the world's largest independent research centers for microelectronics. It was established in 1984 by the Flemish government as a part of a comprehensive program to promote high technology in Flanders, Belgium. Benefiting from existing experience available mainly at the University of Leuven, IMEC moved into its present facilities in 1986 (Figure 1).The Center covers a wide range of research topics in the microelectronics domain—VLSI systems design methodologies, advanced semiconductor processing, materials, packaging, and more.About 50 people work on computer-aided design, developing a series of “true” silicon compilers: CATHEDRAL. With this software, ASIC (application specific integrated circuit) design becomes extremely attractive, since CATHEDRAL covers design from the high system level down to layout.INVOMEC, the training division of IMEC, supports universities in ASIC design. It trains people for both educational institutes and industry in chip design, makes available the necessary software, and has a well-established Multi Project Chip—Multi Project Wafer service.The Processing Technologies and Materials Divisions involve about 200 people and have a 3,600 m2 clean room at their disposal. The clean room consists of a 20% class 10 area with a fast-turnaround prototyping line and an 80% class 1000 area.IMEC's objectives are: to perform research in the microelectronics field, supporting both industry and universities, and to stimulate the microelectronics industry in Flanders.IMEC performs research on both silicon and III-V technologies.

Download Full-text