Embedded GPU Implementation for High-Performance Ultrasound Imaging

Stefano Rossi; Enrico Boni

doi:10.3390/electronics10080884

Embedded GPU Implementation for High-Performance Ultrasound Imaging

Electronics ◽

10.3390/electronics10080884 ◽

2021 ◽

Vol 10 (8) ◽

pp. 884

Author(s):

Stefano Rossi ◽

Enrico Boni

Keyword(s):

High Performance ◽

Graphics Processing Unit ◽

Digital Signal ◽

Processing Unit ◽

Embedded Computing ◽

Field Programmable ◽

Peripheral Component Interconnect ◽

Programmable Gate Arrays ◽

Graphics Processing ◽

Signal Processors

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources allowing massive exploitation of parallel computing are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the implementation of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system independent, freeing the user from the need to use an external controlling PC.

Download Full-text

A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016682071 ◽

2016 ◽

Vol 32 (2) ◽

pp. 288-301

Author(s):

Alan Gray ◽

Kevin Stratford

Keyword(s):

Particle Physics ◽

Message Passing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Processing Unit ◽

Performance Portability ◽

Graphics Processing

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.

Download Full-text

High-Performance, Graphics Processing Unit-Accelerated Fock Build Algorithm

Journal of Chemical Theory and Computation ◽

10.1021/acs.jctc.0c00768 ◽

2020 ◽

Vol 16 (12) ◽

pp. 7232-7238

Author(s):

Giuseppe M. J. Barca ◽

Jorge L. Galvez-Vallejo ◽

David L. Poole ◽

Alistair P. Rendell ◽

Mark S. Gordon

Keyword(s):

High Performance ◽

Graphics Processing Unit ◽

Processing Unit ◽

Graphics Processing

Download Full-text

Ballooning Graphics Memory Space in Full GPU Virtualization Environments

Scientific Programming ◽

10.1155/2019/5240956 ◽

2019 ◽

Vol 2019 ◽

pp. 1-11

Author(s):

Younghun Park ◽

Minwoo Gu ◽

Sungyong Park

Keyword(s):

High Performance ◽

Virtual Machines ◽

Graphics Processing Unit ◽

Performance Degradation ◽

Processing Unit ◽

Memory Space ◽

Memory Size ◽

Memory Sharing ◽

Gpu Virtualization ◽

Graphics Processing

Advances in virtualization technology have enabled multiple virtual machines (VMs) to share resources in a physical machine (PM). With the widespread use of graphics-intensive applications, such as two-dimensional (2D) or 3D rendering, many graphics processing unit (GPU) virtualization solutions have been proposed to provide high-performance GPU services in a virtualized environment. Although elasticity is one of the major benefits in this environment, the allocation of GPU memory is still static in the sense that after the GPU memory is allocated to a VM, it is not possible to change the memory size at runtime. This causes underutilization of GPU memory or performance degradation of a GPU application due to the lack of GPU memory when an application requires a large amount of GPU memory. In this paper, we propose a GPU memory ballooning solution called gBalloon that dynamically adjusts the GPU memory size at runtime according to the GPU memory requirement of each VM and the GPU memory sharing overhead. The gBalloon extends the GPU memory size of a VM by detecting performance degradation due to the lack of GPU memory. The gBalloon also reduces the GPU memory size when the overcommitted or underutilized GPU memory of a VM creates additional overhead for the GPU context switch or the CPU load due to GPU memory sharing among the VMs. We implemented the gBalloon by modifying the gVirt, a full GPU virtualization solution for Intel’s integrated GPUs. Benchmarking results show that the gBalloon dynamically adjusts the GPU memory size at runtime, which improves the performance by up to 8% against the gVirt with 384 MB of high global graphics memory and 32% against the gVirt with 1024 MB of high global graphics memory.

Download Full-text

Exploring Graphics Processing Unit (GPU) Resource Sharing Efficiency for High Performance Computing

Computers ◽

10.3390/computers2040176 ◽

2013 ◽

Vol 2 (4) ◽

pp. 176-214 ◽

Cited By ~ 5

Author(s):

Teng Li ◽

Vikram Narayana ◽

Tarek El-Ghazawi

Keyword(s):

High Performance Computing ◽

Resource Sharing ◽

High Performance ◽

Graphics Processing Unit ◽

Processing Unit ◽

Graphics Processing ◽

Performance Computing

Download Full-text

Accelerating Event Detection with DGCNN and FPGAs

Electronics ◽

10.3390/electronics9101666 ◽

2020 ◽

Vol 9 (10) ◽

pp. 1666

Author(s):

Zhe Han ◽

Jingfei Jiang ◽

Linbo Qiao ◽

Yong Dou ◽

Jinwei Xu ◽

...

Keyword(s):

Language Processing ◽

Event Detection ◽

Graphics Processing Unit ◽

Network Size ◽

Processing Unit ◽

Sigmoid Function ◽

Pipelined Architecture ◽

Proposed Model ◽

Field Programmable ◽

Graphics Processing

Recently, Deep Neural Networks (DNNs) have been widely used in natural language processing. However, DNNs are often computation-intensive and memory-expensive. Therefore, deploying DNNs in the real world is very difficult. In order to solve this problem, we proposed a network model based on the dilate gated convolutional neural network, which is very hardware-friendly. We further expanded the word representations and depth of the network to improve the performance of the model. We replaced the Sigmoid function to make it more friendly for hardware computation without loss, and we quantized the network weights and activations to compress the network size. We then proposed the first FPGA (Field Programmable Gate Array)-based event detection accelerator based on the proposed model. The accelerator significantly reduced the latency with the fully pipelined architecture. We implemented the accelerator on the Xilinx XCKU115 FPGA. The experimental results show that our model obtains the highest F1-score of 84.6% in the ACE 2005 corpus. Meanwhile, the accelerator achieved 95.2 giga operations (GOP)/s and 13.4 GOPS/W in performance and energy efficiency, which is 17/158 times higher than the Graphics Processing Unit (GPU).

Download Full-text

State-of-the-art in Heterogeneous Computing

Scientific Programming ◽

10.1155/2010/540159 ◽

2010 ◽

Vol 18 (1) ◽

pp. 1-33 ◽

Cited By ~ 96

Author(s):

Andre R. Brodtkorb ◽

Christopher Dyken ◽

Trond R. Hagen ◽

Jon M. Hjelmervik ◽

Olaf O. Storaasli

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Heterogeneous Computing ◽

State Of The Art ◽

Peak Performance ◽

Fine Grained ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Cost Efficient ◽

Graphics Processing

Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.

Download Full-text

Low-power and high-performance design of OpenGL ES 2.0 graphics processing unit for mobile applications

2015 IEEE International Conference on Digital Signal Processing (DSP) ◽

10.1109/icdsp.2015.7251840 ◽

2015 ◽

Cited By ~ 3

Author(s):

Shen-Fu Hsiao ◽

Shang-Yu Li ◽

Kai-Hsiang Tsao

Keyword(s):

Low Power ◽

Mobile Applications ◽

High Performance ◽

Graphics Processing Unit ◽

Processing Unit ◽

Opengl Es ◽

Graphics Processing ◽

High Performance Design

Download Full-text

Implementation of Embedded Floating Point Arithmetic Units on FPGA

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.550.126 ◽

2014 ◽

Vol 550 ◽

pp. 126-136

Author(s):

N. Ramya Rani

Keyword(s):

High Speed ◽

High Performance ◽

Floating Point ◽

Double Precision ◽

Embedded Computing ◽

Floating Point Arithmetic ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Arithmetic Units ◽

Point Arithmetic

:Floating point arithmetic plays a major role in scientific and embedded computing applications. But the performance of field programmable gate arrays (FPGAs) used for floating point applications is poor due to the complexity of floating point arithmetic. The implementation of floating point units on FPGAs consumes a large amount of resources and that leads to the development of embedded floating point units in FPGAs. Embedded applications like multimedia, communication and DSP algorithms use floating point arithmetic in processing graphics, Fourier transformation, coding, etc. In this paper, methodologies are presented for the implementation of embedded floating point units on FPGA. The work is focused with the aim of achieving high speed of computations and to reduce the power for evaluating expressions. An application that demands high performance floating point computation can achieve better speed and density by incorporating embedded floating point units. Additionally this paper describes a comparative study of the design of single precision and double precision pipelined floating point arithmetic units for evaluating expressions. The modules are designed using VHDL simulation in Xilinx software and implemented on VIRTEX and SPARTAN FPGAs.

Download Full-text

Splotch

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016652713 ◽

2016 ◽

Vol 31 (6) ◽

pp. 550-563

Author(s):

Timothy Dykes ◽

Claudio Gheller ◽

Marzia Rivi ◽

Mel Krokos

Keyword(s):

High Performance ◽

Large Scale ◽

Graphics Processing Unit ◽

Processing Unit ◽

Xeon Phi ◽

The Many ◽

Many Core ◽

Performance Results ◽

Graphics Processing ◽

Performance Computing

With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogenous high-performance computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize the Xeon Phi, Intel’s coprocessor based upon the new many integrated core architecture. We discuss steps taken to offload data to the coprocessor and algorithmic modifications to aid faster processing on the many-core architecture and make use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi. Finally we compare performance against results achieved with the Graphics Processing Unit (GPU) based implementation of Splotch.

Download Full-text

BIST Architecture using Area Efficient Low Current LFSR for Embedded Memory Testing Applications Applications

International Journal of Reconfigurable and Embedded Systems (IJRES) ◽

10.11591/ijres.v7.i1.pp1-11 ◽

2018 ◽

Vol 7 (1) ◽

pp. 1

Author(s):

M. Parvathi ◽

N. Vasantha ◽

K. Satya Prasad

Keyword(s):

Systems Design ◽

Digital Signal ◽

Low Area ◽

Gate Arrays ◽

Current Limit ◽

Maximum Current ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Layout Area ◽

Signal Processors

One of the important block of BIST controller is LFSR and the speed with which BIST operates depends on LFSR systems design. There are methods in implementing LFSR using field programmable gate arrays (FPGAs) or digital signal processors (DSPs). BIST controller system speed is then limited to FPGAs and DSPs, which may influence other parameters such as overall area, maximum current, limit and power dissipation. This paper proposes a technique to achieve an efficient BIST controller by redesigning LFSR using GDI based D flip-flops that resulted with low area and low current capabilities. This paper presents three different techniques for implementing flip-flops for an efficient LFSR so that the layout area will be minimized as well as the maximum current drawn will be lower.

Download Full-text