IA Algorithm Acceleration Using GPUs

Graphics Processing Units (GPUs) have evolved rapidly into high-performance programmable processors. Although GPUs were designed to compute graphics algorithms, their power and flexibility make them a very attractive platform for general-purpose computing. In recent years they have been used to accelerate calculations in physics, computer vision, artificial intelligence, database operations, etc. (Owens, 2007). This paper outlines an approach to general-purpose computing with GPUs, followed by a description of artificial intelligence algorithms based on Artificial Neural Networks (ANN) and Evolutionary Computation (EC) accelerated using GPUs.

DSPSR: Digital Signal Processing Software for Pulsar Astronomy

Publications of the Astronomical Society of Australia ◽ 10.1071/as10021 ◽ 2011 ◽ Vol 28 (1) ◽ pp. 1-14 ◽ Cited By ~ 172

Author(s): W. van Straten ◽ M. Bailes

Keyword(s): Signal Processing ◽ Digital Signal Processing ◽ Graphics Processing Units ◽ High Performance ◽ Digital Signal ◽ General Purpose ◽ Design Decisions ◽ Extensive Range ◽ Processing Software ◽ Graphics Processing

Abstract: dspsr is a high-performance, open-source, object-oriented, digital signal processing software library and application suite for use in radio pulsar astronomy. Written primarily in C++, the library implements an extensive range of modular algorithms that can optionally exploit both multiple-core processors and general-purpose graphics processing units. After over a decade of research and development, dspsr is now stable and in widespread use in the community. This paper presents a detailed description of its functionality, justification of major design decisions, analysis of phase-coherent dispersion removal algorithms, and demonstration of performance on some contemporary microprocessor architectures.

Accelerated FDPS: Algorithms to use accelerators with FDPS

Publications of the Astronomical Society of Japan ◽ 10.1093/pasj/psz133 ◽ 2020 ◽ Vol 72 (1) ◽ Cited By ~ 2

Author(s): Masaki Iwasawa ◽ Daisuke Namekata ◽ Keigo Nitadori ◽ Kentaro Nomura ◽ Long Wang ◽ ...

Keyword(s): Graphics Processing Units ◽ High Performance ◽ General Purpose ◽ Performance Model ◽ Performance Tuning ◽ Data Types ◽ Interaction Function ◽ Current Implementation ◽ And Performance ◽ Graphics Processing

Abstract: We describe algorithms implemented in FDPS (Framework for Developing Particle Simulators) to make efficient use of accelerator hardware such as GPGPUs (general-purpose computing on graphics processing units). We have developed FDPS to make it possible for researchers to develop their own high-performance parallel particle-based simulation programs without spending large amounts of time on parallelization and performance tuning. FDPS provides a high-performance implementation of parallel algorithms for particle-based simulations in a “generic” form, so that researchers can define their own particle data structure and interparticle interaction functions. FDPS compiled with user-supplied data types and interaction functions provides all the necessary functions for parallelization, and researchers can thus write their programs as though they are writing simple non-parallel code. It has previously been possible to use accelerators with FDPS by writing an interaction function that uses the accelerator. However, the efficiency was limited by the latency and bandwidth of communication between the CPU and the accelerator, and also by the mismatch between the degree of parallelism available in the interaction function and that of the hardware. We have modified the interface of the user-provided interaction functions so that accelerators are used more efficiently. We have also implemented new techniques that reduce the amount of work on the CPU side and the amount of communication between the CPU and accelerators. We have measured the performance of N-body simulations on a system with an NVIDIA Volta GPGPU using FDPS, and the achieved performance is around 27% of the theoretical peak limit. We have constructed a detailed performance model and found that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth. Thus, our implementation will be applicable to future generations of accelerator systems.
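
The latency problem described in this abstract can be illustrated with a small sketch. This is not the FDPS interface (FDPS is a C++ framework); `gravity_kernel`, `ToyAccelerator`, and the sample particles are hypothetical names and data invented for illustration. The sketch only shows the general idea of sending several groups of target particles, each with its own interaction list, to the accelerator in one launch instead of one launch per group.

```python
import math


def gravity_kernel(targets, sources, eps2=1e-6):
    """Softened gravitational acceleration (G = 1) on each target from all sources.

    targets: list of (x, y, z); sources: list of (x, y, z, mass).
    """
    accelerations = []
    for xi, yi, zi in targets:
        ax = ay = az = 0.0
        for xj, yj, zj, mj in sources:
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            w = mj / (r2 * math.sqrt(r2))
            ax += w * dx
            ay += w * dy
            az += w * dz
        accelerations.append((ax, ay, az))
    return accelerations


class ToyAccelerator:
    """Hypothetical stand-in for a GPU. It only counts launches; on real
    hardware each launch would pay a fixed host-device latency."""

    def __init__(self):
        self.launches = 0

    def run(self, batch):
        # One launch evaluates every (targets, sources) pair in the batch.
        self.launches += 1
        return [gravity_kernel(t, s) for t, s in batch]


# Three target groups, each with its own interaction list (made-up data).
groups = [
    ([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0, 1.0), (0.0, 2.0, 0.0, 0.5)]),
    ([(1.0, 1.0, 0.0), (2.0, 0.0, 0.0)], [(0.0, 0.0, 0.0, 2.0)]),
    ([(0.0, 0.0, 3.0)], [(0.0, 0.0, 0.0, 1.0), (0.0, 0.0, 1.0, 1.0)]),
]

# One launch per group: three launches.
per_group = ToyAccelerator()
results_one_by_one = [per_group.run([g])[0] for g in groups]

# All groups packed into a single launch.
batched = ToyAccelerator()
results_batched = batched.run(groups)
```

Both strategies compute the same accelerations; the batched version simply pays the per-launch overhead once, which is the kind of saving the abstract attributes to the modified interaction-function interface.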

Superhuman Intelligence

Poems That Solve Puzzles ◽ 10.1093/oso/9780198853732.003.0012 ◽ 2020 ◽ pp. 203-214

Author(s): Chris Bleakley

Keyword(s): Artificial Intelligence ◽ Neural Networks ◽ Artificial Neural Networks ◽ Computer Program ◽ General Purpose ◽ Board Game ◽ High Profile ◽ Artificial Neural ◽ World Champion ◽ Human Player

Chapter 12 is the story of AlphaGo, the first computer program to defeat a top human player at the board game Go. On March 9, 2016, grandmaster Lee Sedol took on AlphaGo for a US$1 million prize in a best-of-five match. Experts expected that it would be easy money for Sedol. To most observers' surprise, AlphaGo swept the first three games to win the match. AlphaGo was based on deep artificial neural networks (ANNs). The networks were trained with 30 million example moves followed by 1.2 million games played against itself. AlphaGo was the creation of a London-based company named DeepMind Technologies. Founded in 2010 and acquired by Google in 2014, DeepMind has made a succession of high-profile breakthroughs in artificial intelligence. Recently, its AlphaZero ANN displayed signs of general-purpose intelligence. It learned to play chess, shogi, and Go to world-champion level in a few days.

A SURVEY OF TECHNIQUES FOR MANAGING AND LEVERAGING CACHES IN GPUs

Journal of Circuits, Systems and Computers ◽ 10.1142/s0218126614300025 ◽ 2014 ◽ Vol 23 (08) ◽ pp. 1430002 ◽ Cited By ~ 11

Author(s): SPARSH MITTAL

Keyword(s): Graphics Processing Units ◽ High Performance ◽ Heterogeneous Computing ◽ General Purpose ◽ System Level ◽ Cache Management ◽ Full Potential ◽ Wide Range ◽ Computing Platforms ◽ Graphics Processing

Initially introduced as special-purpose accelerators for graphics applications, graphics processing units (GPUs) have now emerged as general-purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several factors, such as the unique architecture of GPUs and the rise of CPU–GPU heterogeneous computing, demand effective management of caches to achieve high performance and energy efficiency. Recently, several techniques have been proposed for this purpose. In this paper, we survey several architectural and system-level techniques proposed for managing and leveraging GPU caches. We also discuss the importance and challenges of cache management in GPUs. The aim of this paper is to give readers insight into cache management techniques for GPUs and to motivate them to propose even better techniques for leveraging the full potential of caches in the GPUs of tomorrow.

Brian2GeNN: a system for accelerating a large variety of spiking neural networks with graphics hardware

10.1101/448050 ◽ 2018 ◽ Cited By ~ 4

Author(s): Marcel Stimberg ◽ Dan F. M. Goodman ◽ Thomas Nowotny

Keyword(s): Neural Networks ◽ Code Generation ◽ Graphics Processing Units ◽ High Performance ◽ Spiking Neural Networks ◽ Graphics Hardware ◽ Performance Grade ◽ Network Simulations ◽ Nvidia Gpu ◽ Graphics Processing

“Brian” is a popular Python-based simulator for spiking neural networks, commonly used in computational neuroscience. GeNN is a C++-based meta-compiler for accelerating spiking neural network simulations using consumer or high-performance-grade graphics processing units (GPUs). Here we introduce a new software package, Brian2GeNN, that connects the two systems so that users can make use of GeNN GPU acceleration when developing their models in Brian, without requiring any technical knowledge about GPUs, C++ or GeNN. The new Brian2GeNN software uses a pipeline of code generation to translate Brian scripts into C++ code that can be used as input to GeNN and subsequently run on suitable NVIDIA GPU accelerators. From the user’s perspective, the entire pipeline is invoked by adding two simple lines to their Brian scripts. We have shown that, using Brian2GeNN, typical models can run tens to hundreds of times faster than on a CPU.

Improving Performance of Particle Tracking Velocimetry Analysis with Artificial Neural Networks and Graphics Processing Units

Research in Computing Science ◽ 10.13053/rcs-104-1-6 ◽ 2015 ◽ Vol 104 (1) ◽ pp. 71-79

Author(s): Rubén Hernández Pérez ◽ Ruslan Gabbasov ◽ Joel Suárez Cansino

Keyword(s): Neural Networks ◽ Artificial Neural Networks ◽ Particle Tracking ◽ Graphics Processing Units ◽ Particle Tracking Velocimetry ◽ Artificial Neural ◽ Graphics Processing

SPOC: GPGPU PROGRAMMING THROUGH STREAM PROCESSING WITH OCAML

Parallel Processing Letters ◽ 10.1142/s0129626412400075 ◽ 2012 ◽ Vol 22 (02) ◽ pp. 1240007 ◽ Cited By ~ 8

Author(s): MATHIAS BOURGOIN ◽ EMMANUEL CHAILLOUX ◽ JEAN-LUC LAMOTTE

Keyword(s): Graphics Processing Units ◽ High Performance ◽ Stream Processing ◽ General Purpose ◽ Data Sets ◽ Specific Data ◽ Garbage Collector ◽ Great Performance ◽ Graphics Processing

General-purpose computing on graphics processing units (GPGPU) consists of using GPUs to handle computations commonly handled by CPUs. GPGPU programming involves developing specific programs that run on GPUs and are managed by a host program running on the CPU. Achieving high performance requires explicitly organizing memory transfers between devices. In addition, several incompatible frameworks exist, making productivity and portability difficult to achieve. In this paper, we describe SPOC, an OCaml library that defines specific data sets in order to manage transfers between GPU and CPU automatically. SPOC also offers a runtime library that looks for multiple frameworks and makes them usable transparently. We also describe the link between SPOC and the OCaml garbage collector, which is used to optimize transfers dynamically. SPOC benchmarks show that SPOC can offer great performance while simplifying GPGPU programming.

Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration

Security and Communication Networks ◽ 10.1155/2017/3508786 ◽ 2017 ◽ Vol 2017 ◽ pp. 1-15 ◽ Cited By ~ 1

Author(s): Jiankuo Dong ◽ Fangyu Zheng ◽ Wuqiong Pan ◽ Jingqiang Lin ◽ Jiwu Jing ◽ ...

Keyword(s): Graphics Processing Units ◽ High Performance ◽ Chinese Remainder Theorem ◽ General Purpose ◽ Floating Point ◽ Double Precision ◽ Computing Power ◽ Cryptographic Algorithm ◽ Graphics Processing ◽ Gpu Architecture

Implementations of asymmetric cryptographic algorithms (e.g., RSA and Elliptic Curve Cryptography) on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is to exploit the highly parallel GPU architecture and port the integer-based algorithms from general-purpose CPUs to GPUs to offer high performance. However, the great potential cryptographic computing power of GPUs, especially that of the more powerful floating-point instructions, has not been comprehensively investigated. In this paper, we fully exploit the floating-point computing power of GPUs through various designs, including a floating-point-based Montgomery multiplication/exponentiation algorithm and a Chinese Remainder Theorem (CRT) implementation on the GPU. For practical usage of the proposed algorithm, a new method is used to convert the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further improving the overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on an NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which is 13 times the performance of the previous fastest floating-point-based implementation (published in Eurocrypt 2009). The RSA-4096 decryption outperforms the existing fastest integer-based result by 23%.
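
For readers unfamiliar with the CRT step mentioned in this abstract, the following is a minimal sketch of textbook RSA decryption with the Chinese Remainder Theorem, using ordinary Python integers. It is not the paper's floating-point GPU design; the function name `rsa_crt_decrypt` and the toy key (tiny primes, no padding) are assumptions chosen purely for illustration and are not secure.

```python
def rsa_crt_decrypt(c, p, q, d):
    """Textbook RSA-CRT decryption: two half-size exponentiations plus a
    recombination step (Garner's formula). Requires Python 3.8+ for pow(x, -1, m)."""
    dp = d % (p - 1)           # private exponent reduced modulo p - 1
    dq = d % (q - 1)           # private exponent reduced modulo q - 1
    q_inv = pow(q, -1, p)      # inverse of q modulo p
    m1 = pow(c, dp, p)         # message modulo p
    m2 = pow(c, dq, q)         # message modulo q
    h = (q_inv * (m1 - m2)) % p
    return m2 + h * q          # unique value modulo p * q


# Toy parameters (illustrative only).
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent

message = 65
cipher = pow(message, e, n)
recovered = rsa_crt_decrypt(cipher, p, q, d)
direct = pow(cipher, d, n)           # decryption without CRT, for comparison
```

The two exponentiations modulo `p` and `q` are independent of each other, which is one reason CRT-based decryption maps naturally onto parallel hardware.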

CUDA or OpenCL

Research Advances in the Integration of Big Data and Smart Computing - Advances in Computational Intelligence and Robotics ◽ 10.4018/978-1-4666-8737-0.ch015 ◽ 2016 ◽ pp. 267-279

Author(s): Mayank Bhura ◽ Pranav H. Deshpande ◽ K. Chandrasekaran

Keyword(s): High Performance Computing ◽ Graphics Processing Units ◽ High Performance ◽ Heterogeneous Systems ◽ General Purpose ◽ Programming Environment ◽ Pros And Cons ◽ Nvidia Gpu ◽ Graphics Processing ◽ Performance Computing

Usage of General Purpose Graphics Processing Units (GPGPUs) in high-performance computing is increasing as heterogeneous systems continue to become dominant. CUDA has been the programming environment for nearly all such NVIDIA GPU-based GPGPU applications. Still, the framework runs only on NVIDIA GPUs; using other available computing devices requires reimplementation in another framework. OpenCL provides a vendor-neutral and open programming environment, with many implementations available on CPUs, GPUs, and other types of accelerators, and can thus be regarded as a write-once, run-anywhere framework. Despite this, both frameworks have their own pros and cons. This chapter presents a comparison of the performance of the CUDA and OpenCL frameworks, using an algorithm that finds the sum of all possible triple products on a list of integers, implemented on GPUs.
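
The benchmark problem named in this abstract can be sketched on the CPU as follows. The abstract does not define "all possible triple products" precisely; this sketch assumes it means the sum of `a[i] * a[j] * a[k]` over all index triples `i < j < k`. The function names are hypothetical, and the closed form is a standard identity added here as a cross-check, not something taken from the chapter.

```python
from itertools import combinations


def triple_product_sum_bruteforce(values):
    """Sum a*b*c over every unordered triple of distinct positions.

    Every triple is independent of the others, which is what makes this
    O(n^3) loop a natural candidate for a GPU benchmark."""
    return sum(a * b * c for a, b, c in combinations(values, 3))


def triple_product_sum_closed_form(values):
    """Same quantity via power sums: e3 = (p1^3 - 3*p1*p2 + 2*p3) / 6."""
    p1 = sum(values)
    p2 = sum(v * v for v in values)
    p3 = sum(v * v * v for v in values)
    return (p1 ** 3 - 3 * p1 * p2 + 2 * p3) // 6  # division is exact for integers


sample = [3, -1, 4, 1, -5, 9, 2, 6]
brute = triple_product_sum_bruteforce(sample)
closed = triple_product_sum_closed_form(sample)
```

The brute-force version mirrors the kind of workload a GPU kernel would split across threads, while the closed form gives an O(n) reference value for checking results.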

High Performance Matrix Multiplication on General Purpose Graphics Processing Units

2010 International Conference on Computational Intelligence and Software Engineering ◽ 10.1109/cise.2010.5677044 ◽ 2010

Author(s): Fan Wu ◽ Miguel Cabral ◽ Jessica Brazelton

Keyword(s): Graphics Processing Units ◽ High Performance ◽ Matrix Multiplication ◽ General Purpose ◽ Performance Matrix ◽ Graphics Processing
