Method for improving ripple reduction during phase shedding in multiphase buck converters for SCADA systems

Author(s):  
Mini P. Varghese
A. Manjunatha
T. V. Snehaprabha

In the current digital environment, central processing units (CPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and their peripherals are growing progressively more complex. Because of the resulting higher power requirements, multiphase buck regulators are now common on motherboards across many areas of computing, from laptops and tablets to servers and Ethernet switches. This study describes a four-stage buck converter with a phase-shedding scheme that can be used to power processors in programmable logic controllers (PLCs). The proposed power supply is designed to generate a regulated voltage with minimal ripple. Thanks to the suggested phase-shedding method, it also offers better light-load efficiency. To this end, a multiphase system with phase shedding is modeled in MATLAB Simulink, and the findings are validated.
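
For illustration, the sketch below shows one simple way a phase-shedding decision could be made for a four-phase buck regulator: the number of enabled phases follows the load current, with a hysteresis band to avoid chattering. The thresholds and hysteresis values are assumptions for the example, not figures from the paper, and the paper's actual controller is modeled in MATLAB Simulink rather than in code like this.

```python
# A minimal phase-shedding decision rule for a four-phase buck regulator:
# the number of enabled phases follows the load current, with a hysteresis
# band to avoid chattering.  Thresholds and hysteresis are assumed values
# for illustration, not figures from the paper.

def active_phases(load_current_a, current_phases=4,
                  thresholds=(5.0, 10.0, 15.0), hysteresis=0.5):
    """Return how many phases to keep enabled for the given load current.

    thresholds[i] is the load current above which (i + 2) phases are used;
    below thresholds[0] a single phase carries the load.
    """
    # Nominal phase count from the thresholds.
    target = 1
    for i, threshold in enumerate(thresholds):
        if load_current_a > threshold:
            target = i + 2
    # Hysteresis: only change the phase count once the load has moved
    # clearly past the boundary in the direction of the change.
    if target > current_phases and load_current_a < thresholds[target - 2] + hysteresis:
        return current_phases
    if target < current_phases and load_current_a > thresholds[target - 1] - hysteresis:
        return current_phases
    return target

print(active_phases(3.0))    # light load  -> shed down to 1 phase
print(active_phases(12.0))   # medium load -> 3 phases active
```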

Technologies
2020
Vol 8 (1)
pp. 6
Author(s):
Vasileios Leon
Spyridon Mouselinos
Konstantina Koliogeorgi
Sotirios Xydis
Dimitrios Soudris
...  

The workloads of Convolutional Neural Networks (CNNs) exhibit a streaming nature that makes them attractive for reconfigurable architectures such as Field-Programmable Gate Arrays (FPGAs), while their increased need for low power and speed has established Application-Specific Integrated Circuit (ASIC)-based accelerators as alternative efficient solutions. During the last five years, the development of Hardware Description Language (HDL)-based CNN accelerators, either for FPGA or ASIC, has attracted huge academic interest due to their high performance and room for optimization. Towards this direction, we propose a library-based framework, which extends TensorFlow, the well-established machine learning framework, and automatically generates high-throughput CNN inference engines for FPGAs and ASICs. The framework allows software developers to exploit the benefits of FPGA/ASIC acceleration without requiring any expertise in HDL development and low-level design. Moreover, it provides a set of optimization knobs concerning the model architecture and the inference engine generation, allowing the developer to tune the accelerator according to the requirements of the respective use case. Our framework is evaluated by optimizing the LeNet CNN model on the MNIST dataset, and implementing FPGA- and ASIC-based accelerators using the generated inference engine. The optimal FPGA-based accelerator on Zynq-7000 delivers a 93% smaller memory footprint, 54% lower Look-Up Table (LUT) utilization, and up to 10× speedup on the inference execution vs. different Graphics Processing Unit (GPU) and Central Processing Unit (CPU) implementations of the same model, in exchange for a negligible accuracy loss, i.e., 0.89%. For the same accuracy drop, the 45 nm standard-cell-based ASIC accelerator provides an implementation which operates at 520 MHz and occupies an area of 0.059 mm², while the power consumption is ∼7.5 mW.
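
As a point of reference, the snippet below defines a LeNet-style CNN for MNIST with standard tf.keras layers, the kind of TensorFlow model description a framework like this would take as input. The authors' library and its HDL/inference-engine generation step are not reproduced here; only the plain TensorFlow side is shown.

```python
# A LeNet-style CNN defined with standard tf.keras layers; the library's
# own HDL/inference-engine generation step is not reproduced here.
import tensorflow as tf

def build_lenet(num_classes=10):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(6, 5, activation="tanh", padding="same",
                               input_shape=(28, 28, 1)),
        tf.keras.layers.AveragePooling2D(),
        tf.keras.layers.Conv2D(16, 5, activation="tanh"),
        tf.keras.layers.AveragePooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation="tanh"),
        tf.keras.layers.Dense(84, activation="tanh"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

# Train briefly on MNIST to obtain a model that could then be handed to
# an FPGA/ASIC generation flow.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

model = build_lenet()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128)
```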


2020
Vol 8
Author(s):
Daniel Enériz Orta
Nicolás Medrano Marqués
Belén Calvo López

The ability to approximate nonlinear functions makes neural networks one of the most widely used tools for sensor fusion, allowing the outputs of different sensors to be combined to obtain information that is not directly available. In addition, the parallel processing capability of FPGAs (Field-Programmable Gate Arrays) makes them well suited for implementing ubiquitous neural networks, enabling results to be inferred faster than on a CPU (Central Processing Unit) without requiring an active internet connection. This article therefore proposes a workflow to design, train, and implement a neural network on a Xilinx PYNQ Z2 FPGA that uses fixed-point data types to perform sensor fusion. The workflow is tested by developing a neural network that combines the outputs of a 16-sensor artificial nose to estimate CH4 and C2H4 concentrations.
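
To illustrate the fixed-point aspect of such a workflow, the following sketch evaluates a small multilayer perceptron with weights rounded to an assumed Q-format before inference, the kind of check one would perform before mapping the network onto the FPGA. The layer sizes (16 inputs, 2 outputs) mirror the 16-sensor nose and the two estimated gases; the Q-format, the hidden-layer width, and the random weights are illustrative assumptions only.

```python
# Minimal fixed-point inference check (not the authors' toolflow): the
# network is evaluated with weights quantized to an assumed Q-format.
import numpy as np

FRAC_BITS = 12  # fractional bits of the assumed Q-format

def to_fixed(x, frac_bits=FRAC_BITS):
    """Quantize floats to signed fixed point and return them as floats."""
    scale = 2 ** frac_bits
    return np.round(x * scale) / scale

def mlp_fixed(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, all operands quantized to fixed point."""
    h = np.maximum(0.0, to_fixed(x) @ to_fixed(w1) + to_fixed(b1))
    return to_fixed(h) @ to_fixed(w2) + to_fixed(b2)

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(16, 8)), rng.normal(size=8)
w2, b2 = rng.normal(size=(8, 2)), rng.normal(size=2)
x = rng.normal(size=16)          # one 16-sensor e-nose reading (synthetic)

print("float :", np.maximum(0.0, x @ w1 + b1) @ w2 + b2)
print("fixed :", mlp_fixed(x, w1, b1, w2, b2))
```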


Electronics
2019
Vol 8 (8)
pp. 866
Author(s):
Heoncheol Lee
Kipyo Kim

This paper addresses the real-time optimization problem of finding the most efficient and reliable message chain structure in data communications based on half-duplex command–response protocols such as MIL-STD-1553B communication systems. It proposes a real-time Monte Carlo optimization method implemented on field-programmable gate arrays (FPGAs), which not only executes very quickly but also avoids conflicts with other tasks running on a central processing unit (CPU). Evaluation results showed that the proposed method consistently finds the optimal message chain structure within a small, deterministic time, much faster than the conventional Monte Carlo optimization method on a CPU.
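
The following sketch illustrates the kind of CPU-side Monte Carlo optimization that serves as the baseline here: random candidate chain structures are sampled and the cheapest is kept. The cost model (messages assigned to minor frames, minimizing the worst-case frame load) and the message durations are illustrative assumptions, not the paper's objective function or its FPGA implementation.

```python
# CPU-side Monte Carlo optimization sketch: sample random message-to-frame
# assignments and keep the one with the smallest worst-case frame load.
# The cost model and message times are assumptions for illustration.
import random

def worst_frame_load(assignment, msg_times, n_frames):
    """Return the load of the most heavily loaded minor frame."""
    loads = [0.0] * n_frames
    for msg, frame in enumerate(assignment):
        loads[frame] += msg_times[msg]
    return max(loads)

def monte_carlo_optimize(msg_times, n_frames=4, n_samples=20_000, seed=1):
    """Sample random candidate structures and keep the cheapest one found."""
    rng = random.Random(seed)
    best_assign, best_cost = None, float("inf")
    for _ in range(n_samples):
        assignment = [rng.randrange(n_frames) for _ in msg_times]
        cost = worst_frame_load(assignment, msg_times, n_frames)
        if cost < best_cost:
            best_assign, best_cost = assignment, cost
    return best_assign, best_cost

msg_times = [20.0, 36.0, 28.0, 52.0, 44.0, 16.0, 60.0, 24.0]  # microseconds (assumed)
assignment, cost = monte_carlo_optimize(msg_times)
print(assignment, cost)
```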


2019
Vol 20 (10)
pp. 1037-1046
Author(s):
Paul Mentink
Daniel Escobar-Valdivieso
Alexandru Forrai
Xander Seykens
Frank Willems

Motivated by automotive emission legislation, a Virtual NOx sensor is developed. This virtual sensor consists of a real-time, phenomenological model that computes engine-out NOx using the measured in-cylinder pressure signal from a single cylinder as its main input. The implementation is made on a Field Programmable Gate Array–Central Processing Unit architecture to ensure the NOx computation is ready at the end of the combustion cycle. The Virtual NOx sensor is tested and validated on a EURO-VI Heavy-Duty Diesel engine platform. It is shown to match the accuracy of a production NOx sensor under steady-state conditions and to have a better frequency response than the production sensor.
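
A typical first step of pressure-based phenomenological emission models is to compute the apparent heat release rate from the measured cylinder pressure. The sketch below shows that standard single-zone calculation with an assumed engine geometry and a synthetic polytropic pressure trace; it is generic background, not the authors' sensor model.

```python
# Generic single-zone apparent heat release calculation from in-cylinder
# pressure, a typical first step of pressure-based phenomenological
# emission models.  Engine geometry and the synthetic pressure trace are
# assumptions for illustration, not the authors' sensor model.
import numpy as np

def cylinder_volume(theta, bore=0.13, stroke=0.16, conrod=0.26, comp_ratio=17.0):
    """Cylinder volume [m^3] vs. crank angle [rad], slider-crank geometry."""
    crank = stroke / 2.0
    piston_area = np.pi * bore ** 2 / 4.0
    v_disp = piston_area * stroke
    v_clear = v_disp / (comp_ratio - 1.0)
    lift = crank * (1.0 - np.cos(theta)) + conrod * (
        1.0 - np.sqrt(1.0 - (crank / conrod) ** 2 * np.sin(theta) ** 2))
    return v_clear + piston_area * lift

def apparent_heat_release(theta, pressure, gamma=1.35):
    """Single-zone apparent heat release rate dQ/dtheta [J/rad]."""
    vol = cylinder_volume(theta)
    dv = np.gradient(vol, theta)
    dp = np.gradient(pressure, theta)
    return gamma / (gamma - 1.0) * pressure * dv + 1.0 / (gamma - 1.0) * vol * dp

# Synthetic polytropic (motored-like) trace over one compression/expansion stroke.
theta = np.linspace(-np.pi, np.pi, 720)
pressure = 1.0e5 * (cylinder_volume(np.pi) / cylinder_volume(theta)) ** 1.35
print(apparent_heat_release(theta, pressure).max())   # near zero for a motored trace
```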


Computers
2020
Vol 9 (3)
pp. 70
Author(s):
Carolina Fernández
Sergio Giménez
Eduard Grasa
Steve Bunch

The lack of high-performance RINA (Recursive InterNetwork Architecture) implementations to date makes it hard to experiment with RINA as an underlay networking fabric solution for different types of networks, and to assess RINA’s benefits in practice on scenarios with high traffic loads. High-performance router implementations typically require dedicated hardware support, such as FPGAs (Field Programmable Gate Arrays) or specialized ASICs (Application Specific Integrated Circuits). With the advance of hardware programmability in recent years, new possibilities unfold to prototype novel networking technologies. In particular, the use of the P4 programming language for programmable ASICs holds great promise for developing a RINA router. This paper details the design and part of the implementation of the first P4-based RINA interior router, which reuses the layer management components of the IRATI Linux-based RINA implementation and implements the data-transfer components using a P4 program. We also describe the configuration and testing of our initial deployment scenarios, using ancillary open-source tools such as the P4 reference test software switch (BMv2) and the P4Runtime API.


2016
Vol 2016
pp. 1-23
Author(s):
Rostam Affendi Hamzah
Haidi Ibrahim

This paper presents a literature survey of existing disparity map algorithms. It focuses on the four main stages of processing proposed by Scharstein and Szeliski in their 2002 taxonomy and evaluation of dense two-frame stereo correspondence algorithms. To assist future researchers in developing their own stereo matching algorithms, a summary of the existing algorithms developed for every stage of processing is also provided. The survey also notes the implementation of previous software-based and hardware-based algorithms. Generally, the main processing module of a software-based implementation uses only a central processing unit. By contrast, a hardware-based implementation requires one or more additional processors for its processing module, such as a graphics processing unit or a field-programmable gate array. This literature survey also presents a method of qualitative measurement that is widely used by researchers in the area of stereo vision disparity mapping.
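
As a concrete example of the first three taxonomy stages, the sketch below implements a plain block-matching method: absolute-difference matching cost, box-filter cost aggregation, and winner-take-all disparity selection; refinement, the fourth stage, is omitted. The window size, disparity range, and synthetic test pair are arbitrary choices for illustration.

```python
# A plain block-matching stereo sketch covering the first three taxonomy
# stages: absolute-difference matching cost, box-filter cost aggregation,
# and winner-take-all disparity selection (refinement is omitted).
import numpy as np
from scipy.ndimage import uniform_filter

def sad_disparity(left, right, max_disp=16, window=5):
    """Integer disparity map for a pair of rectified grayscale images."""
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    h, w = left.shape
    cost = np.empty((max_disp, h, w), dtype=np.float32)
    for d in range(max_disp):
        # Stage 1: matching cost, absolute difference at disparity d.
        diff = np.abs(left - np.roll(right, d, axis=1))
        diff[:, :d] = 1e6                       # invalid near the left border
        # Stage 2: cost aggregation, mean over a window x window block.
        cost[d] = uniform_filter(diff, size=window, mode="nearest")
    # Stage 3: disparity computation, winner-take-all over the cost volume.
    return np.argmin(cost, axis=0)

rng = np.random.default_rng(0)
right = rng.random((60, 80)).astype(np.float32)
left = np.roll(right, 4, axis=1)                # synthetic pair, uniform 4-px shift
print(np.median(sad_disparity(left, right)))    # ~4 away from the borders
```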


Electronics
2018
Vol 7 (8)
pp. 135
Author(s):
Nikolay Chervyakov
Pavel Lyakhov
Dmitry Kaplun
Denis Butusov
Nikolay Nagornov

In this paper, we analyze the effects of quantization noise in the coefficients of discrete wavelet transform (DWT) filter banks for image processing. We propose an implementation of the DWT method that makes it possible to determine the effective bit-width of the filter bank coefficients at which the quantization noise does not significantly affect the image processing results, as measured by the peak signal-to-noise ratio (PSNR). The dependence of the PSNR of the processed image on the wavelet and on the bit-width of the wavelet filter coefficients is analyzed. Formulas are given for determining the minimal bit-width of the filter coefficients at which the processed image achieves high quality (PSNR ≥ 40 dB). The obtained theoretical results were confirmed through simulation of the DWT for a test image using the calculated bit-width values. All considered algorithms operate with fixed-point numbers, which simplifies their hardware implementation on modern devices: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), etc.
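
The experiment can be reproduced in outline as follows: round the wavelet filter coefficients to a given number of fractional bits, run a forward and inverse 2-D DWT, and measure the PSNR against the original image. The sketch below does this with PyWavelets for db2 on random test data; the wavelets, bit-widths, and images used in the paper differ.

```python
# Quantize the db2 filter coefficients to a given fractional bit-width,
# apply a 2-D DWT and its inverse, and report the PSNR against the
# original image.  Random test data here; the paper uses standard images.
import numpy as np
import pywt

def quantized_wavelet(name="db2", frac_bits=8):
    """Return a pywt.Wavelet whose filters are fixed-point with frac_bits."""
    w = pywt.Wavelet(name)
    scale = 2 ** frac_bits
    q = lambda coeffs: [round(c * scale) / scale for c in coeffs]
    return pywt.Wavelet(f"{name}_q{frac_bits}",
                        filter_bank=[q(w.dec_lo), q(w.dec_hi),
                                     q(w.rec_lo), q(w.rec_hi)])

def psnr(ref, test, peak=255.0):
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

image = np.random.default_rng(0).integers(0, 256, (256, 256)).astype(np.float64)
for bits in (4, 6, 8, 10, 12):
    wq = quantized_wavelet(frac_bits=bits)
    coeffs = pywt.wavedec2(image, wq, level=3)
    restored = pywt.waverec2(coeffs, wq)[:256, :256]
    print(bits, "bits ->", round(psnr(image, restored), 1), "dB")
```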


2017
Vol 2017
pp. 1-11
Author(s):
Yichun Sun
Hengzhu Liu
Tong Zhou

Cholesky factorization is a fundamental problem in most engineering and science computation applications. When dealing with a large sparse matrix, the numerical decomposition consumes most of the time. We present a vector architecture to parallelize the numerical decomposition of Cholesky factorization. We construct an integrated analytical parameterized performance model to accurately predict the execution times of typical matrices under varying parameters. Our proposed approach is general for accelerators and is tied to neither field-programmable gate arrays (FPGAs) nor application-specific integrated circuits (ASICs). We implement a simplified module in an FPGA to verify the accuracy of the model. The experiments show that, for most cases, the differences between the predicted and measured execution times are less than 10%. Based on the performance model, we optimize the parameters and obtain a balance of resources and performance after analyzing the performance of varied parameter settings. Compared with state-of-the-art CPU and GPU implementations, the performance with the optimal parameters is 2× that of the CPU. Our model offers several advantages, particularly in power consumption, and provides guidance for the design of future acceleration components.
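
For reference, the numerical kernel being parallelized is column-oriented Cholesky factorization. The dense left-looking variant below (NumPy) shows the column update whose inner products map naturally onto vector units; the paper applies the same update restricted to the nonzero pattern of a sparse matrix.

```python
# Dense column-oriented (left-looking) Cholesky factorization, shown as a
# reference for the numerical kernel that a vector architecture would
# parallelize; the sparse case restricts the update to the nonzero pattern.
import numpy as np

def cholesky_left_looking(a):
    """Return lower-triangular L with A = L @ L.T (A symmetric positive definite)."""
    n = a.shape[0]
    L = np.zeros_like(a, dtype=np.float64)
    for j in range(n):
        # Update column j with contributions from all previous columns
        # (this inner product is the part that maps well to vector units).
        col = a[j:, j] - L[j:, :j] @ L[j, :j]
        L[j, j] = np.sqrt(col[0])
        L[j + 1:, j] = col[1:] / L[j, j]
    return L

rng = np.random.default_rng(0)
m = rng.normal(size=(6, 6))
a = m @ m.T + 6 * np.eye(6)          # symmetric positive definite test matrix
L = cholesky_left_looking(a)
print(np.allclose(L @ L.T, a))       # True
print(np.allclose(L, np.linalg.cholesky(a)))
```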


Electronics
2019
Vol 8 (3)
pp. 295
Author(s):
Min Zhang
Linpeng Li
Hai Wang
Yan Liu
Hongbo Qin
...  

The field-programmable gate array (FPGA) is widely considered a promising platform for convolutional neural network (CNN) acceleration. However, the large number of parameters in CNNs causes heavy computing and memory burdens for FPGA-based CNN implementations. To solve this problem, this paper proposes an optimized compression strategy and realizes an FPGA-based accelerator for CNNs. Firstly, a reversed-pruning strategy is proposed which reduces the number of parameters of AlexNet by a factor of 13× without accuracy loss on the ImageNet dataset. Peak-pruning is further introduced to achieve better compressibility, and quantization gives another 4× reduction with negligible loss of accuracy. Secondly, an efficient storage technique is presented, which aims to reduce the overall cache overhead of the convolutional layers and the fully connected layers, respectively. Finally, the effectiveness of the proposed strategy is verified by an accelerator implemented on a Xilinx ZCU104 evaluation board. By improving existing pruning techniques and the storage format of sparse data, we significantly reduce the size of AlexNet by 28×, from 243 MB to 8.7 MB. In addition, the overall performance of our accelerator achieves 9.73 fps for the compressed AlexNet. Compared with central processing unit (CPU) and graphics processing unit (GPU) platforms, our implementation achieves 182.3× and 1.1× improvements in latency and throughput, respectively, on the convolutional (CONV) layers of AlexNet, with 822.0× and 15.8× improvements in energy efficiency, respectively. This compression strategy provides a reference for other neural network applications, including CNNs, long short-term memory (LSTM), and recurrent neural networks (RNNs).
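
The sketch below shows the generic ingredients of such a compression pipeline: magnitude pruning, low-bit quantization of the surviving weights, and a CSR-style estimate of the resulting storage cost. It is a simplified illustration; the paper's reversed-pruning, peak-pruning, and storage format are its own refinements of these ideas, and the matrix here is random rather than an AlexNet layer.

```python
# Generic magnitude pruning + 8-bit quantization of a weight matrix, with
# a CSR-style estimate of the compressed storage cost.  The paper's
# reversed-pruning / peak-pruning and storage format are refinements of
# this idea; the matrix here is random, not an AlexNet layer.
import numpy as np

def prune_and_quantize(w, sparsity=0.9, bits=8):
    """Zero the smallest weights, then quantize the survivors to 'bits' bits."""
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold
    pruned = w * mask
    scale = np.abs(pruned).max() / (2 ** (bits - 1) - 1)
    q = np.round(pruned / scale).astype(np.int8)
    return q, scale, mask

def csr_size_bytes(q, mask):
    """Rough CSR storage cost: int8 values + int32 column indices + row pointers."""
    nnz = int(mask.sum())
    return nnz * 1 + nnz * 4 + (q.shape[0] + 1) * 4

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)   # an FC-layer-sized matrix
q, scale, mask = prune_and_quantize(w)
print("dense  :", w.size * 4 / 1e6, "MB")
print("sparse :", csr_size_bytes(q, mask) / 1e6, "MB")
print("max reconstruction error:", float(np.abs(w * mask - q * scale).max()))
```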


Electronics
2021
Vol 10 (23)
pp. 2960
Author(s):
Youngbin Son
Seokwon Kang
Hongjun Um
Seokho Lee
Jonghyun Ham
...  

Most modern processors contain a vector accelerator or internal vector units for fast computation of large target workloads. However, accelerating applications using vector units is difficult because the underlying data parallelism must be uncovered explicitly using vector-specific instructions. Therefore, vector units are often underutilized or remain idle because of the challenges of vector code generation. To solve this underutilization problem of existing vector units, we propose the Vector Offloader for executing scalar programs, which treats the vector unit as a scalar operation unit. By using vector masking, an appropriate partition of the vector unit can be utilized to support scalar instructions. To efficiently utilize all execution units, including the vector unit, the Vector Offloader runs the target applications concurrently on both the central processing unit (CPU) and the decoupled vector unit, by offloading parts of the program to the vector unit. Furthermore, a profile-guided optimization technique is employed to determine the optimal offloading ratio for balancing the load between the CPU and the vector unit. We implemented the Vector Offloader on a RISC-V infrastructure with a Hwacha vector unit and evaluated its performance using the Polybench benchmark set. Experimental results showed that the proposed technique achieved performance improvements of up to 1.31× over simple, CPU-only execution in a field programmable gate array (FPGA)-level evaluation.
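
The load-balancing idea behind the profile-guided offloading ratio can be shown with a toy model: profile the throughput of each execution unit, then offload the fraction of work that makes the CPU and the vector unit finish at the same time. The throughput numbers below are placeholders, not measurements from the RISC-V/Hwacha platform.

```python
# Toy model of profile-guided load balancing: pick the offloading ratio
# that equalizes the finish times of the CPU and the vector unit.
# Throughput values are placeholders, not measurements from the paper.
def optimal_offload_ratio(cpu_throughput, vector_throughput):
    """Fraction of the work to offload so both units finish together."""
    return vector_throughput / (cpu_throughput + vector_throughput)

def runtime(work, ratio, cpu_throughput, vector_throughput):
    """Completion time when 'ratio' of the work runs on the vector unit."""
    cpu_time = (1.0 - ratio) * work / cpu_throughput
    vec_time = ratio * work / vector_throughput
    return max(cpu_time, vec_time)

work = 1.0e9                      # abstract work items
cpu_tp, vec_tp = 2.0e8, 6.0e8     # profiled items/second (placeholder values)

r = optimal_offload_ratio(cpu_tp, vec_tp)
print("offload ratio:", r)                                  # 0.75
print("CPU only     :", runtime(work, 0.0, cpu_tp, vec_tp)) # 5.0 s
print("balanced     :", runtime(work, r, cpu_tp, vec_tp))   # 1.25 s
```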

