State-of-the-Art GPGPU Applications in Bioinformatics

Author(s):  
Nikitas Papangelopoulos ◽  
Dimitrios Vlachakis ◽  
Arianna Filntisi ◽  
Paraskevas Fakourelis ◽  
Louis Papageorgiou ◽  
...  

The exponential growth of available biological data in recent years, coupled with their increasing complexity, has made their analysis a computationally challenging process. Traditional central processing units (CPUs) are reaching their limit in processing power and are not designed primarily for multithreaded applications. Graphics processing units (GPUs), on the other hand, are affordable, scalable computing powerhouses that, thanks to the ever-increasing demand for higher-quality graphics, have yet to reach their limit. Typically, high-end CPUs have 8-16 cores, whereas GPUs can have more than 2,500 cores. GPUs are also, by design, highly parallel, multicore, and multithreaded, capable of handling thousands of threads performing the same calculation on different subsets of a large data set. This ability is what makes them perfectly suited for biological analysis tasks. Lately, this potential has been realized by many bioinformatics researchers, and a huge variety of tools and algorithms have been ported to GPUs or designed from the ground up to maximize the usage of available cores. Here, we present a comprehensive review of available bioinformatics tools, ranging from sequence and image analysis to protein structure prediction and systems biology, that use NVIDIA's Compute Unified Device Architecture (CUDA) platform for general-purpose computing on graphics processing units (GPGPU).
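The data-parallel pattern the abstract describes (many threads running the same calculation on different subsets of a large data set) can be sketched as follows. This is purely illustrative and not taken from any of the reviewed tools; it uses CPU threads via std::thread to stand in for GPU threads, and the per-chunk task (GC counting over a DNA string) is a hypothetical example.

```cpp
// Illustrative only: the data-parallel pattern described above, sketched with
// std::thread. On a GPU, each chunk would instead be handled by one of
// thousands of lightweight hardware threads.
#include <algorithm>
#include <numeric>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-chunk task: count G/C bases in a slice of a DNA string.
static std::size_t gc_count(const std::string& s, std::size_t lo, std::size_t hi) {
    std::size_t n = 0;
    for (std::size_t i = lo; i < hi; ++i)
        if (s[i] == 'G' || s[i] == 'C') ++n;
    return n;
}

// Split the sequence into chunks, run the same calculation on each chunk in
// its own thread, then reduce the partial results.
std::size_t parallel_gc(const std::string& seq, unsigned nthreads) {
    std::vector<std::size_t> partial(nthreads, 0);
    std::vector<std::thread> pool;
    std::size_t chunk = (seq.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = std::min<std::size_t>(std::size_t{t} * chunk, seq.size());
        std::size_t hi = std::min<std::size_t>(lo + chunk, seq.size());
        pool.emplace_back([&partial, &seq, t, lo, hi] { partial[t] = gc_count(seq, lo, hi); });
    }
    for (auto& th : pool) th.join();
    return std::accumulate(partial.begin(), partial.end(), std::size_t{0});
}
```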

Author(s):  
Athanasios Iliopoulos ◽  
John G. Michopoulos

The need for more efficient, more abstract, and easier-to-use parallel programming interfaces has recently intensified with the introduction and remarkable evolution of technologies such as general-purpose graphics processing units (GPGPUs) and multi-core central processing units (CPUs). In the present paper we introduce the uBlasCL system, a domain-specific embedded language within C++ that implements a basic linear algebra interface for OpenCL. The system is architecture agnostic, in the sense that it can be programmed independently of the targeted architecture; it is massively parallel and achieves efficiency that tracks hardware performance advances well. Our effort is based on template metaprogramming and the fundamentals of domain-specific languages, yielding a system that has the syntactic flexibility of a symbolic term-processing system for expressing mathematics, and the semantic and executional power to exploit the parallelism offered by the hardware in a manner that is automated, transparent to the user, and efficiently mapped onto the hardware. We also describe its relation to C++, template programming, domain-specific languages, and OpenCL. In the course of developing uBlasCL we also developed a middleware library named CL++, a convenient C++ interface to OpenCL. After the architectural and implementation descriptions of the system, we present performance testing results demonstrating its potential power.
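The template-metaprogramming technique such embedded languages build on can be sketched with a minimal expression template: operator+ returns a lightweight symbolic node instead of computing, and evaluation is deferred until assignment, where the whole expression collapses into one fused loop. The names here are illustrative, not the uBlasCL API.

```cpp
// Minimal expression-template sketch (illustrative; not the uBlasCL API):
// a + b builds a symbolic AddExpr node; the work happens only when the
// expression is assigned to a Vec, in a single loop with no temporaries.
#include <cstddef>
#include <vector>

template <class L, class R>
struct AddExpr {                        // symbolic node representing l[i] + r[i]
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n, 0.0) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    template <class E>
    Vec& operator=(const E& e) {        // evaluation point: one fused loop
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

template <class L, class R>
AddExpr<L, R> operator+(const L& l, const R& r) { return {l, r}; }
```

In a system like the one described, the assignment step would emit or launch an OpenCL kernel over the whole expression tree rather than a CPU loop; the deferred-evaluation structure is the same.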


2011 ◽  
Vol 28 (1) ◽  
pp. 1-14 ◽  
Author(s):  
W. van Straten ◽  
M. Bailes

Abstract. dspsr is a high-performance, open-source, object-oriented, digital signal processing software library and application suite for use in radio pulsar astronomy. Written primarily in C++, the library implements an extensive range of modular algorithms that can optionally exploit both multiple-core processors and general-purpose graphics processing units. After over a decade of research and development, dspsr is now stable and in widespread use in the community. This paper presents a detailed description of its functionality, justification of major design decisions, analysis of phase-coherent dispersion removal algorithms, and demonstration of performance on some contemporary microprocessor architectures.


2010 ◽  
Vol 27 (2) ◽  
pp. 288-290 ◽  
Author(s):  
Matthias Vigelius ◽  
Aidan Lane ◽  
Bernd Meyer

Processes ◽  
2020 ◽  
Vol 8 (9) ◽  
pp. 1199
Author(s):  
Ravie Chandren Muniyandi ◽  
Ali Maroosi

Long-timescale simulations of biological processes such as photosynthesis, or attempts to solve NP-hard problems such as traveling salesman, knapsack, Hamiltonian path, and satisfiability using membrane systems, can take hours or days without appropriate parallelization. Graphics processing units (GPUs) deliver an immensely parallel mechanism for general-purpose computations. Previous studies mapped one membrane to one thread block on the GPU. This is disadvantageous because when the number of objects per membrane is small, the number of active threads is also small, decreasing performance. Moreover, when each membrane is assigned to one thread block, communication between membranes must be carried out as communication between thread blocks, which is a time-consuming process. Previous approaches have also not addressed the issue of GPU occupancy. This study presents a classification algorithm that manages dependent objects and membranes based on the communication rate associated with a defined weighted network and assigns them to sub-matrices. Dependent objects and membranes are thus allocated to the same threads and thread blocks, decreasing communication between threads and thread blocks and allowing the GPU to maintain the highest possible occupancy. The experimental results indicate that for 48 objects per membrane, the algorithm facilitates a 93-fold increase in processing speed, compared to a 1.6-fold increase with previous algorithms.
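The classification idea can be sketched as clustering over a weighted communication graph: membranes are nodes, edge weight is communication rate, and heavily communicating membranes are greedily merged into the same group (thread block) up to a capacity limit, so that most communication stays inside a block. This is a simplified sketch of the general idea under those assumptions, not the authors' exact algorithm.

```cpp
// Simplified sketch (not the paper's exact algorithm): greedily merge the
// most heavily communicating membranes into the same group, capped at a
// per-block capacity, using a union-find structure.
#include <algorithm>
#include <numeric>
#include <vector>

struct Edge { int a, b; double rate; };   // rate = communication rate a <-> b

struct GroupBuilder {
    std::vector<int> parent, size;
    explicit GroupBuilder(int n) : parent(n), size(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    // Merge the groups of a and b only if the result still fits in one block.
    bool merge(int a, int b, int capacity) {
        a = find(a); b = find(b);
        if (a == b || size[a] + size[b] > capacity) return false;
        parent[b] = a;
        size[a] += size[b];
        return true;
    }
};

// Returns a group id per membrane; heaviest communication edges are
// considered first, so they are the most likely to land in one block.
std::vector<int> group_membranes(int n, std::vector<Edge> edges, int capacity) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& x, const Edge& y) { return x.rate > y.rate; });
    GroupBuilder gb(n);
    for (const Edge& e : edges) gb.merge(e.a, e.b, capacity);
    std::vector<int> gid(n);
    for (int i = 0; i < n; ++i) gid[i] = gb.find(i);
    return gid;
}
```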


2016 ◽  
Vol 850 ◽  
pp. 129-135
Author(s):  
Buğra Şimşek ◽  
Nursel Akçam

This study presents the parallelization, with OpenCL, of the Hamming distance algorithm used for iris comparison in iris recognition systems, targeting heterogeneous systems that can include central processing units (CPUs), graphics processing units (GPUs), digital signal processing (DSP) boards, field-programmable gate arrays (FPGAs), and some mobile platforms. OpenCL allows the same code to run on CPUs, GPUs, FPGAs, and DSP boards. Heterogeneous computing refers to systems that include different kinds of devices (CPUs, GPUs, FPGAs, and other accelerators); for suitable algorithms it gains performance or reduces power consumption on these OpenCL-supported devices. In this study, the Hamming distance algorithm has been coded in C++ as sequential code and parallelized with OpenCL using a method we designed. Our OpenCL code has been executed on an Nvidia GT430 GPU and an Intel Xeon 5650 processor. The OpenCL implementation demonstrates a speed-up of as much as 87 times through parallelization. Our study also differs from other studies that accelerate iris matching in that it ensures heterogeneous computing by using OpenCL.
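The sequential baseline described above can be sketched as follows, assuming the iris codes are packed into 64-bit words so that each word compares 64 bits with one XOR and a population count. The bit packing is an assumption for illustration; the paper defines its own data layout and OpenCL kernel design.

```cpp
// Sketch of a sequential Hamming distance between two packed iris codes
// (bit packing assumed for illustration; not the paper's exact layout).
#include <bitset>
#include <cstdint>
#include <vector>

// Number of differing bits between two equal-length packed bit strings.
std::size_t hamming_distance(const std::vector<std::uint64_t>& a,
                             const std::vector<std::uint64_t>& b) {
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += std::bitset<64>(a[i] ^ b[i]).count();  // popcount of the XOR
    return d;
}
```

An OpenCL parallelization of this loop would typically give each work-item one word (or one candidate code from the database) and reduce the partial counts, which is the kind of mapping the heterogeneous version exploits.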


2019 ◽  
Vol 97 ◽  
pp. 836-848
Author(s):  
Lan Gao ◽  
Yunlong Xu ◽  
Rui Wang ◽  
Hailong Yang ◽  
Zhongzhi Luan ◽  
...  

2018 ◽  
Vol 11 (11) ◽  
pp. 4621-4635 ◽  
Author(s):  
Istvan Z. Reguly ◽  
Daniel Giles ◽  
Devaraj Gopinathan ◽  
Laure Quivy ◽  
Joakim H. Beck ◽  
...  

Abstract. In this paper, we present the VOLNA-OP2 tsunami model and implementation; a finite-volume non-linear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved in a way that the scientific code is kept separate from various parallel implementations, enabling easy maintainability. It has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 shows an ability to deliver productivity as well as performance and portability to its users across a number of platforms.


Author(s):  
Masaki Iwasawa ◽  
Daisuke Namekata ◽  
Keigo Nitadori ◽  
Kentaro Nomura ◽  
Long Wang ◽  
...  

Abstract We describe algorithms implemented in FDPS (Framework for Developing Particle Simulators) to make efficient use of accelerator hardware such as GPGPUs (general-purpose computing on graphics processing units). We have developed FDPS to make it possible for researchers to develop their own high-performance parallel particle-based simulation programs without spending large amounts of time on parallelization and performance tuning. FDPS provides a high-performance implementation of parallel algorithms for particle-based simulations in a “generic” form, so that researchers can define their own particle data structure and interparticle interaction functions. FDPS compiled with user-supplied data types and interaction functions provides all the necessary functions for parallelization, and researchers can thus write their programs as though they are writing simple non-parallel code. It has previously been possible to use accelerators with FDPS by writing an interaction function that uses the accelerator. However, the efficiency was limited by the latency and bandwidth of communication between the CPU and the accelerator, and also by the mismatch between the available degree of parallelism of the interaction function and that of the hardware parallelism. We have modified the interface of the user-provided interaction functions so that accelerators are more efficiently used. We also implemented new techniques which reduce the amount of work on the CPU side and the amount of communication between CPU and accelerators. We have measured the performance of N-body simulations on a system with an NVIDIA Volta GPGPU using FDPS and the achieved performance is around 27% of the theoretical peak limit. We have constructed a detailed performance model, and found that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth. Thus, our implementation will be applicable to future generations of accelerator systems.
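The "generic" interface style the abstract describes (a framework-supplied parallel loop combined with a user-supplied particle type and interaction function) can be sketched as below. The names and signatures here are hypothetical illustrations, not the FDPS API, and the serial pair loop stands in for the framework's parallel tree-based machinery.

```cpp
// Illustrative sketch of a generic particle-simulation interface (names and
// signatures are hypothetical, not the FDPS API): the framework supplies a
// generic loop over particle pairs; the user supplies the particle type and
// the interaction function.
#include <cmath>
#include <vector>

struct Particle {              // user-defined particle data structure
    double x, y, z, mass;
    double pot = 0.0;          // accumulated gravitational potential
};

// "Framework" side: a generic kernel applying a user interaction to all
// pairs (serial here; the real framework parallelizes and distributes this).
template <class P, class Interaction>
void for_each_pair(std::vector<P>& ps, Interaction f) {
    for (std::size_t i = 0; i < ps.size(); ++i)
        for (std::size_t j = i + 1; j < ps.size(); ++j)
            f(ps[i], ps[j]);
}

// "User" side: a softened gravitational potential interaction.
void gravity(Particle& a, Particle& b) {
    const double eps2 = 1e-6;  // softening length squared, illustrative value
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    double inv_r = 1.0 / std::sqrt(dx * dx + dy * dy + dz * dz + eps2);
    a.pot -= b.mass * inv_r;
    b.pot -= a.mass * inv_r;
}
```

The interface change the paper describes amounts to letting the accelerator receive batches of such interactions at once, instead of invoking the user function pairwise from the CPU, which reduces CPU-accelerator communication.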

