Comparative Analysis of FPGA-Based Pair-HMM Accelerator Structures

Electronics ◽  
2019 ◽  
Vol 8 (9) ◽  
pp. 965
Author(s):  
Pengfei Wang ◽  
Yuanwu Lei ◽  
Yong Dou

As one of the most important and computationally intensive parts of bioinformatics analysis, the Pair Hidden Markov Model (Pair-HMM) forward algorithm is widely used and has great potential, so it is important to accelerate it. There are various approaches to accelerating the Pair-HMM, especially accelerators built on Field-Programmable Gate Arrays (FPGAs), thanks to the highly customizable on-chip resources and deep pipelining available to the designer. In this paper, we focus on FPGA-based accelerators for the Pair-HMM forward algorithm proposed in recent years. The non-cooperative structure, proposed in our previous work, is compared with the Systolic Array (SA) structure and the PE-ring structure in terms of structural characteristics, calculation mode, computational efficiency, and storage requirements. We build an analysis model to evaluate the performance of the ring structure and our non-cooperative structure, and, based on this model, provide a detailed analysis of the structural characteristics of the different accelerators and of how to select a suitable structure for different scenarios. Building on the non-cooperative PE structure, we design a new chain topology for the accelerator. Experimental results show that our non-cooperative structure is superior to the other structures in performance and execution efficiency, and that the new topology further improves the accelerator's performance. Finally, we propose some ideas for improving the non-cooperative-structure accelerator in future work.
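For readers unfamiliar with the computation being accelerated, the following is a minimal software sketch of the Pair-HMM forward recurrences; the transition and emission parameters are illustrative placeholders, not the values used by any accelerator discussed here.

```python
def pair_hmm_forward(read, hap, p_match=0.9, gap_open=0.05, gap_ext=0.1):
    """Toy Pair-HMM forward algorithm over match (M), insert (I), delete (D) states.
    Parameter values are illustrative placeholders."""
    n, m = len(read), len(hap)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    I = [[0.0] * (m + 1) for _ in range(n + 1)]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        D[0][j] = 1.0 / m  # free start anywhere along the haplotype
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prior = p_match if read[i - 1] == hap[j - 1] else (1 - p_match) / 3
            M[i][j] = prior * ((1 - 2 * gap_open) * M[i - 1][j - 1]
                               + (1 - gap_ext) * (I[i - 1][j - 1] + D[i - 1][j - 1]))
            I[i][j] = gap_open * M[i - 1][j] + gap_ext * I[i - 1][j]
            D[i][j] = gap_open * M[i][j - 1] + gap_ext * D[i][j - 1]
    return sum(M[n][j] + I[n][j] for j in range(1, m + 1))

print(pair_hmm_forward("ACGT", "ACGT"))
```

Each cell of M, I, and D depends only on its upper, left, and upper-left neighbors; this anti-diagonal dependency pattern is precisely what the systolic-array, PE-ring, and non-cooperative structures exploit in different ways.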


2019 ◽  
Vol 30 (1) ◽  
pp. 107-116 ◽  
Author(s):  
Edwin González ◽  
Walter D. Villamizar Luna ◽  
Carlos Augusto Fajardo Ariza

Convolutional Neural Networks (CNNs) are becoming increasingly popular in deep learning applications, e.g., image classification, speech recognition, and medicine, to name a few. However, CNN inference is computationally intensive and demands a large amount of memory resources. In this work, we propose a CNN inference hardware accelerator implemented in a co-processing scheme. The aim is to reduce hardware resources and achieve the best possible throughput. The design was implemented on the Digilent Arty Z7-20 development board, which is based on the Xilinx Zynq-7000 System on Chip (SoC). Our implementation achieved an accuracy of … for the MNIST database using only a 12-bit fixed-point format. The results show that the co-processing scheme operating at a conservative speed of 100 MHz can identify around 441 images per second, which is about 17 times faster than a 650 MHz software implementation. It is difficult to compare our results against other implementations based on Field-Programmable Gate Arrays (FPGAs), because those implementations are not exactly like ours. However, some comparisons regarding logical resource usage and accuracy suggest that our work could improve on previous works.
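As a hedged illustration of the 12-bit fixed-point arithmetic the abstract mentions, the sketch below converts values to and from a 12-bit format; the split between integer and fraction bits chosen here is an assumption, not the paper's design choice.

```python
TOTAL_BITS = 12  # total word width from the abstract
FRAC_BITS = 8    # fraction width: an assumed split, not the paper's

def to_fixed(x: float) -> int:
    """Quantize a float to 12-bit signed fixed point, saturating on overflow."""
    q = round(x * (1 << FRAC_BITS))
    lo, hi = -(1 << (TOTAL_BITS - 1)), (1 << (TOTAL_BITS - 1)) - 1
    return max(lo, min(hi, q))

def to_float(q: int) -> float:
    """Recover the approximate real value from the fixed-point code."""
    return q / (1 << FRAC_BITS)

print(to_fixed(1.5), to_float(to_fixed(1.5)))  # 384 1.5
```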



Author(s):  
Kritika Malhotra ◽  
Amit Prakash Singh

<span id="docs-internal-guid-01e673b1-7fff-8dc3-6b99-14ed17cd6b49"><span>Machine learning techniques are rapidly emerging in large number of fields from robotics to computer vision to finance and biology. One important step of machine learning is classification which is the process of finding out to which category a new encountered observation belongs based on predefined categories. There are various existing solutions to classification and one of them is decision tree classification (DTC) which can achieve high accuracy while handling the large datasets. But DTC is computationally intensive algorithm and as the size of the dataset increases its running time also increases which could be from some hours to days even. But thanks to field programmable gate arrays (FPGA) which could be used for large datasets to achieve high performance implementation with low energy consumption. Along with FPGA’s, python is used for accelerating the application development and python is leveraged by using python productivity for zynq (PYNQ), a python development environment for application development. This paper provides the literature review of an implementation of DTC for FPGA devices along with future work that can be done.</span></span>



Sensors ◽  
2021 ◽  
Vol 21 (8) ◽  
pp. 2637
Author(s):  
Ignacio Pérez ◽  
Miguel Figueroa

Convolutional neural networks (CNNs) have been extensively employed for image classification due to their high accuracy. However, inference is a computationally intensive process that often requires hardware acceleration to operate in real time. For mobile devices, the power consumption of graphics processors (GPUs) is frequently prohibitive, and field-programmable gate arrays (FPGAs) have become a solution for performing inference at high speed. Although previous works have implemented CNN inference on FPGAs, their high utilization of on-chip memory and arithmetic resources complicates their application on resource-constrained edge devices. In this paper, we present a scalable, low-power, low-resource-utilization accelerator architecture for inference on the MobileNet V2 CNN. The architecture uses a heterogeneous system with an embedded processor as the main controller, external memory to store network data, and dedicated hardware implemented on reconfigurable logic with a scalable number of processing elements (PEs). Implemented on an XCZU7EV FPGA running at 200 MHz and using four PEs, the accelerator infers with 87% top-5 accuracy and processes a 224×224-pixel image in 220 ms. It consumes 7.35 W of power and uses less than 30% of the logic and arithmetic resources used by other MobileNet FPGA accelerators.
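As a quick sanity check, the reported latency and power imply the following throughput and energy per inference (all input values taken directly from the abstract):

```python
latency_s = 0.220                 # 220 ms per 224x224 image
power_w = 7.35                    # reported power draw

throughput_fps = 1 / latency_s            # ~4.5 images/s
energy_per_image_j = power_w * latency_s  # ~1.6 J per inference

print(round(throughput_fps, 1), round(energy_per_image_j, 2))
```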



2021 ◽  
Vol 64 (6) ◽  
pp. 107-116
Author(s):  
Yakun Sophia Shao ◽  
Jason Clemons ◽  
Rangharajan Venkatesan ◽  
Brian Zimmer ◽  
Matthew Fojtik ◽  
...  

Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
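As a hedged illustration of the mapping problem, the sketch below splits one layer's output channels evenly across chiplets; Simba's actual mappings and its three tiling optimizations are considerably more sophisticated than this even partition.

```python
def partition_channels(num_channels: int, num_chiplets: int):
    """Yield (chiplet, channel range) pairs that split channels as evenly as possible."""
    base, extra = divmod(num_channels, num_chiplets)
    start = 0
    for c in range(num_chiplets):
        count = base + (1 if c < extra else 0)
        yield c, range(start, start + count)  # chiplet c owns these output channels
        start += count

# Example: 256 output channels spread over the 36 Simba chiplets.
for chiplet, channels in partition_channels(256, 36):
    print(f"chiplet {chiplet:2d} -> channels {channels.start}..{channels.stop - 1}")
```

A locality-aware tiling would go further, co-locating channels that share input activations so that inter-chiplet traffic shrinks; that is the kind of optimization the 16% speedup refers to.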



2021 ◽  
Vol 20 (3) ◽  
pp. 1-25
Author(s):  
James Marshall ◽  
Robert Gifford ◽  
Gedare Bloom ◽  
Gabriel Parmer ◽  
Rahul Simha

Increased access to space has led to an increase in the usage of commodity processors in radiation environments. These processors are vulnerable to transient faults such as single event upsets that may cause bit-flips in processor components. Caches in particular are vulnerable due to their relatively large area, yet are often omitted from fault injection testing because many processors do not provide direct access to cache contents and they are often not fully modeled by simulators. The performance benefits of caches make disabling them undesirable, and the presence of error correcting codes is insufficient to correct for increasingly common multiple bit upsets. This work explores building a program’s cache profile by collecting cache usage information at an instruction granularity via commonly available on-chip debugging interfaces. The profile provides a tighter bound than cache utilization for cache vulnerability estimates (50% for several benchmarks). This can be applied to reduce the number of fault injections required to characterize behavior by at least two-thirds for the benchmarks we examine. The profile enables future work in hardware fault injection for caches that avoids the biases of existing techniques.
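The following is a minimal sketch of how a per-instruction cache-usage profile might yield a vulnerability estimate; the profile's data layout here is an assumption made for illustration, since the paper collects its profile through on-chip debug interfaces.

```python
def vulnerability_bound(profile: dict, num_lines: int) -> float:
    """Fraction of (instruction, cache line) slots holding live data.

    `profile` maps instruction index -> set of live cache lines at that point;
    only a bit-flip in a live line can corrupt the program, so this fraction
    bounds the useful fault-injection space more tightly than raw utilization.
    """
    live = sum(len(lines) for lines in profile.values())
    return live / (num_lines * len(profile))

profile = {0: {1, 2}, 1: {2}, 2: set()}  # toy 3-instruction trace, 4-line cache
print(vulnerability_bound(profile, num_lines=4))  # 3 / 12 = 0.25
```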





2016 ◽  
Vol 78 (7-5) ◽  
Author(s):  
Muhammad Amin Hashim ◽  
Yuan Wen Hau ◽  
Rabia Baktheri

This paper studies two different electrocardiography (ECG) preprocessing algorithms, namely the Pan and Tompkins (PT) and the Derivative-Based (DB) algorithms, which are crucial for QRS-complex detection in cardiovascular disease detection. Both algorithms are compared in terms of QRS detection accuracy and computation time, implemented as embedded software on a System-on-Chip (SoC) based embedded system prototyped on an Altera DE2-115 Field-Programmable Gate Array (FPGA) platform. Both algorithms are tested with 30 minutes of ECG data from each of 48 different patient records obtained from the MIT-BIH arrhythmia database. Results show that the PT algorithm achieves 98.15% accuracy with 56.33 seconds of computation, while the DB algorithm achieves 96.74% with only 22.14 seconds of processing time. Based on this study, an optimized PT algorithm with an improved Moving Window Integrator (MWI) is proposed to accelerate its computation. Results show that the proposed optimized Moving Window Integrator achieves a 9.5× speedup over the original MWI while retaining its QRS detection accuracy.
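The moving-window integration step admits a classic running-sum optimization, sketched below; this illustrates why the MWI is a good acceleration target, though the paper's own 9.5× speedup comes from its FPGA-level optimization rather than this exact rewrite.

```python
def mwi_naive(x, N):
    """Naive MWI: re-sum the last N samples at every step (O(N) per sample)."""
    return [sum(x[max(0, n - N + 1):n + 1]) / N for n in range(len(x))]

def mwi_running(x, N):
    """Recursive MWI: y[n] = y[n-1] + x[n] - x[n-N], two operations per sample."""
    y, acc = [], 0.0
    for n, sample in enumerate(x):
        acc += sample
        if n >= N:
            acc -= x[n - N]  # drop the sample leaving the window
        y.append(acc / N)
    return y

sig = [0, 1, 4, 9, 16, 25]
assert mwi_naive(sig, 3) == mwi_running(sig, 3)  # identical outputs
```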



F1000Research ◽  
2017 ◽  
Vol 6 ◽  
pp. 2060
Author(s):  
Aleksandr Agafonov ◽  
Kimmo Mattila ◽  
Cuong Duong Tuan ◽  
Lars Tiede ◽  
Inge Alexander Raknes ◽  
...  

META-pipe is a complete service for the analysis of marine metagenomic data. It provides assembly of high-throughput sequence data, functional annotation of predicted genes, and taxonomic profiling. The functional annotation is computationally demanding and is therefore currently run on a high-performance computing cluster in Norway. However, additional compute resources are necessary to open the service to all ELIXIR users. We describe our approach to setting up and executing the functional analysis of META-pipe on additional academic and commercial clouds. Our goal is to provide a powerful analysis service that is easy to use and to maintain. Our design therefore uses a distributed architecture in which central servers are combined with multiple distributed backends that execute the computationally intensive jobs. We believe our experiences developing and operating META-pipe provide a useful model for others planning to offer a portal-based data analysis service in ELIXIR and in other organizations with geographically distributed compute and storage resources.
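Below is a minimal sketch of the central-queue/distributed-backend pattern described above, using an in-process queue as a stand-in; META-pipe's real backends are separate cloud and HPC resources, so every name here is illustrative.

```python
import queue
import threading

jobs = queue.Queue()  # stands in for the central server's job queue

def backend(backend_id: int) -> None:
    """A distributed backend: pull the next heavy annotation job and run it."""
    while True:
        sample = jobs.get()
        print(f"backend {backend_id}: annotating {sample}")  # the heavy step
        jobs.task_done()

for i in range(2):  # two illustrative distributed backends
    threading.Thread(target=backend, args=(i,), daemon=True).start()

for s in ["sample-A", "sample-B", "sample-C"]:  # central server enqueues work
    jobs.put(s)
jobs.join()  # wait until all jobs are processed
```

The pull-based design keeps the central servers simple: backends on any cloud can join or leave, and the queue naturally balances load across them.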


