Accelerating BLAS and LAPACK via Efficient Floating Point Architecture Design

2017 ◽  
Vol 27 (03n04) ◽  
pp. 1750006 ◽  
Author(s):  
Farhad Merchant ◽  
Anupam Chattopadhyay ◽  
Soumyendu Raha ◽  
S. K. Nandy ◽  
Ranjani Narayan

Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) form basic building blocks for several High Performance Computing (HPC) applications and hence dictate the performance of those applications. Performance in such tuned packages is attained by tuning several algorithmic and architectural parameters, such as the number of parallel operations in the Directed Acyclic Graph of the BLAS/LAPACK routines, the sizes of the memories in the memory hierarchy of the underlying platform, the bandwidth of the memory, and the structure of the compute resources in the underlying platform. In this paper, we closely investigate the impact of the Floating Point Unit (FPU) micro-architecture on performance tuning of BLAS and LAPACK. We present a theoretical analysis of the pipeline depth of different floating point units, such as the multiplier, adder, square root, and divider, followed by a characterization of BLAS and LAPACK to determine several parameters required in the theoretical framework for deciding the optimum pipeline depth of the floating point operations. A simple design of a Processing Element (PE) is presented, and it is shown that the PE outperforms the most recent custom realizations of BLAS and LAPACK by 1.1X to 1.5X in GFlops/W and 1.9X to 2.1X in GFlops/mm². Compared to multicore, General Purpose Graphics Processing Unit (GPGPU), Field Programmable Gate Array (FPGA), and ClearSpeed CSX700 platforms, a performance improvement of 1.8X to 80X is reported for the PE.

Electronics ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 884
Author(s):  
Stefano Rossi ◽  
Enrico Boni

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources allowing massive exploitation of parallel computing are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the implementation of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system independent, freeing the user from the need to use an external controlling PC.


2013 ◽  
Vol 753-755 ◽  
pp. 2731-2735
Author(s):  
Wei Cao ◽  
Zheng Hua Wang ◽  
Chuan Fu Xu

The graphics processing unit (GPU) has evolved from a configurable graphics processor into a powerful engine for high-performance computing. In this paper, we describe the graphics pipeline of the GPU and introduce the history and evolution of GPU architecture. We also provide a summary of the software environments used on GPUs, from graphics APIs to non-graphics APIs. Finally, we present GPU computing in computational fluid dynamics applications, including GPGPU computing for Navier-Stokes equation methods and GPGPU computing for the Lattice Boltzmann method.


Sensors ◽  
2020 ◽  
Vol 20 (14) ◽  
pp. 3969
Author(s):  
Hongzhi Huang ◽  
Yakun Wu ◽  
Mengqi Yu ◽  
Xuesong Shi ◽  
Fei Qiao ◽  
...  

Visual semantic segmentation, which is represented by the semantic segmentation network, has been widely used in many fields, such as intelligent robots, security, and autonomous driving. However, these Convolutional Neural Network (CNN)-based networks place high demands on the computing resources and programmability of hardware platforms. For embedded platforms and terminal devices in particular, Graphics Processing Unit (GPU)-based computing platforms cannot meet these requirements in terms of size and power consumption. In contrast, a Field Programmable Gate Array (FPGA)-based hardware system not only has flexible programmability and high embeddability, but can also meet lower power consumption requirements, which makes it an appropriate solution for semantic segmentation on terminal devices. In this paper, we demonstrate EDSSA, an Encoder-Decoder semantic segmentation network accelerator architecture that can be implemented with flexible parameter configurations and hardware resources on FPGA platforms that support Open Computing Language (OpenCL) development. We introduce the related technologies, architecture design, algorithm optimization, and hardware implementation, using the Encoder-Decoder semantic segmentation network SegNet as an example, and undertake a performance evaluation. Using an Intel Arria-10 GX1150 platform for evaluation, our work achieves a throughput higher than 432.8 GOP/s with a power consumption of about 20 W, which is a 1.2× improvement in the energy-efficiency ratio compared to a high-performance GPU.
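
As a rough illustration of the energy-efficiency figure quoted in the abstract above, the following minimal sketch divides the reported throughput by the reported power. The function name and the back-calculated GPU figure are assumptions for illustration only; the abstract gives the throughput as a lower bound and the power as approximate, so the results are indicative rather than exact.

```python
def energy_efficiency(throughput_gops: float, power_w: float) -> float:
    """Return energy efficiency in GOP/s per watt (hypothetical helper)."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return throughput_gops / power_w


# Figures taken from the abstract: > 432.8 GOP/s at about 20 W.
fpga_eff = energy_efficiency(432.8, 20.0)

# Assumption: the 1.2x ratio is FPGA efficiency divided by GPU efficiency,
# so the GPU efficiency implied by the abstract is fpga_eff / 1.2.
implied_gpu_eff = fpga_eff / 1.2

print(f"FPGA: {fpga_eff:.2f} GOP/s/W, implied GPU: {implied_gpu_eff:.2f} GOP/s/W")
```

Under these assumptions the FPGA figure works out to roughly 21.6 GOP/s/W, and the comparison GPU to roughly 18 GOP/s/W.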


2020 ◽  
Vol 16 (1) ◽  
pp. 19-29
Author(s):  
Caterina Travan ◽  
Francesca Vatta ◽  
Fulvio Babich

The behaviour of a transmission channel may be simulated using the performance capabilities of current-generation multiprocessing hardware, namely, a multicore Central Processing Unit (CPU), a general purpose Graphics Processing Unit (GPU), or a Field Programmable Gate Array (FPGA). These were investigated by Cullinan et al. in a paper published in 2012, where the capabilities of these three devices were compared to determine which device is best suited to which specific task. In particular, it was shown that, for the application that is the objective of our work (i.e., a transmission channel simulation), the FPGA is 26.67 times faster than the GPU and 10.76 times faster than the CPU. Motivated by these results, in this paper we propose and present a direct hardware emulation. In particular, a Cyclone II FPGA architecture is implemented to simulate a burst error channel behaviour, in which errors are clustered together, and a burst erasure channel behaviour, in which the erasures are clustered together. The results presented in the paper are valid for any FPGA architecture that may be considered for this purpose.
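
The clustered-error behaviour described in the abstract above can be sketched in software with a two-state Markov chain, in the spirit of the Gilbert-Elliott model. This is an assumption made for illustration: the abstract does not state which channel model or parameters the authors implement on the FPGA, and the function name and all probabilities below are hypothetical.

```python
import random


def burst_channel(bits, p_good_to_bad=0.01, p_bad_to_good=0.2,
                  p_hit_in_bad=0.5, erasure=False, seed=0):
    """Pass bits through a two-state (good/bad) Markov burst channel.

    In the bad state each bit is corrupted with probability p_hit_in_bad:
    flipped for a burst error channel, or replaced by None (an erasure
    marker) when erasure=True. All parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    bad = False
    out = []
    for b in bits:
        # Update the channel state before transmitting each bit.
        if bad:
            if rng.random() < p_bad_to_good:
                bad = False
        elif rng.random() < p_good_to_bad:
            bad = True
        # Corruption can only occur while the channel is in the bad state,
        # which is what makes errors cluster into bursts.
        if bad and rng.random() < p_hit_in_bad:
            out.append(None if erasure else b ^ 1)
        else:
            out.append(b)
    return out


sent = [0] * 200
received = burst_channel(sent)
print("error positions:", [i for i, b in enumerate(received) if b != 0])
```

Because corruption happens only in the bad state, and the bad state persists for several bits on average, the printed error positions tend to appear in runs rather than uniformly spread.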


Technologies ◽  
2020 ◽  
Vol 8 (1) ◽  
pp. 6 ◽  
Author(s):  
Vasileios Leon ◽  
Spyridon Mouselinos ◽  
Konstantina Koliogeorgi ◽  
Sotirios Xydis ◽  
Dimitrios Soudris ◽  
...  

The workloads of Convolutional Neural Networks (CNNs) exhibit a streaming nature that makes them attractive for reconfigurable architectures such as Field-Programmable Gate Arrays (FPGAs), while their increased need for low power and speed has established Application-Specific Integrated Circuit (ASIC)-based accelerators as efficient alternative solutions. During the last five years, the development of Hardware Description Language (HDL)-based CNN accelerators, either for FPGA or ASIC, has seen huge academic interest due to their high performance and room for optimization. In this direction, we propose a library-based framework, which extends TensorFlow, the well-established machine learning framework, and automatically generates high-throughput CNN inference engines for FPGAs and ASICs. The framework allows software developers to exploit the benefits of FPGA/ASIC acceleration without requiring any expertise in HDL development and low-level design. Moreover, it provides a set of optimization knobs concerning the model architecture and the inference engine generation, allowing the developer to tune the accelerator according to the requirements of the respective use case. Our framework is evaluated by optimizing the LeNet CNN model on the MNIST dataset, and implementing FPGA- and ASIC-based accelerators using the generated inference engine. The optimal FPGA-based accelerator on Zynq-7000 delivers a 93% smaller memory footprint and 54% lower Look-Up Table (LUT) utilization, and up to 10× speedup in inference execution vs. different Graphics Processing Unit (GPU) and Central Processing Unit (CPU) implementations of the same model, in exchange for a negligible accuracy loss, i.e., 0.89%. For the same accuracy drop, the 45 nm standard-cell-based ASIC accelerator provides an implementation that operates at 520 MHz and occupies an area of 0.059 mm², while the power consumption is ∼7.5 mW.


2019 ◽  
Vol 16 (2) ◽  
pp. 304-308
Author(s):  
Chao Peng

Purpose: The purpose of this paper is to investigate possibilities for adopting state-of-the-art computer graphics technologies for big data visualization in engineering applications. Toward this purpose, a conceptual heterogeneous system is proposed for graphical rendering, which is built from multiple central processing unit (CPU) cores and multiple graphics processing units (GPUs). Design/methodology/approach: The design of the system supports both general-purpose computation and graphics-related computation. Three processing components are discussed to fulfill the execution requirements in load balancing, data streaming and display. This design fully uses computational and memory resources and enhances performance with the support of GPU-based parallelization. Findings: The advantages and disadvantages of particular technical methods for each processing component are discussed. The possible ways to integrate them are analyzed. Originality/value: This work contributes to the use of computer graphics technologies in engineering applications.


Computation ◽  
2020 ◽  
Vol 8 (1) ◽  
pp. 4 ◽  
Author(s):  
Håvard H. Holm ◽  
André R. Brodtkorb ◽  
Martin L. Sætra

In this work, we examine the performance, energy efficiency, and usability of Python for developing high-performance computing codes running on the graphics processing unit (GPU). We investigate the portability of performance and energy efficiency between Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL); between GPU generations; and between low-end, mid-range, and high-end GPUs. Our findings show that the impact of using Python is negligible for our applications, and furthermore, that CUDA and OpenCL applications tuned to an equivalent level can in many cases obtain the same computational performance. Our experiments show that performance in general varies more between different GPUs than between CUDA and OpenCL. We also show that tuning for performance is a good way of tuning for energy efficiency, but that specific tuning is needed to obtain optimal energy efficiency.


Author(s):  
K. Bhargavi ◽  
Sathish Babu B.

GPUs (Graphics Processing Units) have mainly been used to speed up computation-intensive high performance computing applications. Several tools and technologies are available for running general purpose, computationally intensive applications. This chapter primarily discusses GPU parallelism, applications, and probable challenges, and also highlights some of the GPU computing platforms, which include CUDA, OpenCL (Open Computing Language), OpenMPC (OpenMP extended for CUDA), MPI (Message Passing Interface), OpenACC (Open Accelerator), DirectCompute, and C++ AMP (C++ Accelerated Massive Parallelism). Each of these platforms is discussed briefly along with its advantages and disadvantages.


Water ◽  
2020 ◽  
Vol 12 (5) ◽  
pp. 1288
Author(s):  
Yueling Wang ◽  
Xiaoliu Yang

To protect ecologies and the environment by preventing floods, analysis of the impact of climate change on water requires a tool capable of considering the rainfall-runoff processes on a small scale, for example, 10 m. As has been shown previously, hydrologic models are good at simulating rainfall-runoff processes on a large scale, e.g., over several hundred km², while hydraulic models are more advantageous for applications on smaller scales. In order to take advantage of these two types of models, this paper couples a hydrologic model, the Xinanjiang model (XAJ), with a hydraulic model, the Graphics Processing Unit (GPU)-accelerated high-performance integrated hydraulic modelling system (HiPIMS). The study was completed in the Misai basin (797 km²), located in Zhejiang Province, China. The coupled XAJ–HiPIMS model was validated against observed flood events. The simulated results agree well with the data observed at the basin outlet. The study shows that a coupled hydrologic and hydraulic model is capable of providing flood information on a small scale for a large basin and shows the potential of the research.

