dadi.CUDA: Accelerating population genetic inference with Graphics Processing Units

Population Genetic ◽

GPU Computing ◽

Demographic History ◽

Processing Unit ◽

Speed Increase ◽

Population Genetic Inference ◽

Computationally Intensive ◽

Extracting insight from population genetic data often demands computationally intensive modeling. dadi is a popular program for fitting models of demographic history and natural selection to such data. Here, I show that running dadi on a Graphics Processing Unit (GPU) can speed computation by orders of magnitude compared to the CPU implementation, with minimal user burden. This speed increase enables the analysis of more complex models, which motivated the extension of dadi to four- and five-population models. Remarkably, dadi performs almost as well on inexpensive consumer-grade GPUs as on expensive server-grade GPUs. GPU computing thus offers large and accessible benefits to the community of dadi users. This functionality is available in dadi version 2.1.0.

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Emerging Research Surrounding Power Consumption and Performance Issues in Utility Computing ◽

GPU Computation and Platforms

10.4018/978-1-4666-8853-7.ch007 ◽

2016 ◽

pp. 136-174

Author(s):

K. Bhargavi ◽

Sathish Babu B.

Keyword(s):

Message Passing ◽

High Performance ◽

Message Passing Interface ◽

GPU Computing ◽

General Purpose ◽

Processing Unit ◽

Computing Platforms ◽

Computationally Intensive ◽

Graphics Processing Units (GPUs) were mainly used to speed up computation-intensive high performance computing applications. Several tools and technologies are available for running general-purpose, computationally intensive applications. This chapter primarily discusses GPU parallelism, applications, and probable challenges, and highlights some GPU computing platforms, including CUDA, OpenCL (Open Computing Language), OpenMPC (OpenMP extended for CUDA), MPI (Message Passing Interface), OpenACC (Open Accelerators), DirectCompute, and C++ AMP (C++ Accelerated Massive Parallelism). Each of these platforms is discussed briefly along with its advantages and disadvantages.

A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016682071 ◽

2016 ◽

Vol 32 (2) ◽

pp. 288-301

Author(s):

Alan Gray ◽

Kevin Stratford

Keyword(s):

Particle Physics ◽

Message Passing ◽

High Performance ◽

Large Scale ◽

Message Passing Interface ◽

Processing Unit ◽

Performance Portability ◽

Leading high performance computing systems achieve their status through the use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform-agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus a separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.
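The Roofline model mentioned in this abstract bounds attainable performance by the smaller of peak compute throughput and memory bandwidth multiplied by arithmetic intensity. The following is a minimal sketch of that bound only; the function name and the hardware numbers in the usage note are hypothetical and are not taken from the targetDP paper.

```python
def roofline_gflops(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte):
    """Attainable performance (GFLOP/s) under the Roofline model.

    peak_gflops: peak compute throughput of the device (assumed value).
    bandwidth_gb_s: peak memory bandwidth in GB/s (assumed value).
    intensity_flop_per_byte: floating-point operations per byte moved.
    """
    # A kernel is either memory-bound (bandwidth * intensity) or
    # compute-bound (peak), whichever limit is lower.
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)
```

For a hypothetical device with a 1000 GFLOP/s peak and 100 GB/s bandwidth, a kernel performing 2 FLOP per byte would be memory-bound at 200 GFLOP/s, while one performing 20 FLOP per byte would reach the 1000 GFLOP/s compute ceiling.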

Graphics processing unit implementation of the F-statistic for continuous gravitational wave searches

Classical and Quantum Gravity ◽

10.1088/1361-6382/ac4616 ◽

2021 ◽

Author(s):

Liam Dunn ◽

Patrick Clearwater ◽

Andrew Melatos ◽

Karl Wette

Keyword(s):

Gravitational Wave ◽

Computational Cost ◽

Processing Unit ◽

Central Processing ◽

Long Baseline ◽

Using Data ◽

Graphics Processing ◽

GPU Implementation

The F-statistic is a detection statistic used widely in searches for continuous gravitational waves with terrestrial, long-baseline interferometers. A new implementation of the F-statistic is presented which accelerates the existing "resampling" algorithm using graphics processing units (GPUs). The new implementation runs between 10 and 100 times faster than the existing implementation on central processing units without sacrificing numerical accuracy. The utility of the GPU implementation is demonstrated on a pilot narrowband search for four newly discovered millisecond pulsars in the globular cluster Omega Centauri using data from the second Laser Interferometer Gravitational-Wave Observatory observing run. The computational cost is 17.2 GPU-hours using the new implementation, compared to 1092 core-hours with the existing implementation.

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Emerging Research Surrounding Power Consumption and Performance Issues in Utility Computing ◽

Advanced Topics GPU Programming and CUDA Architecture

10.4018/978-1-4666-8853-7.ch008 ◽

2016 ◽

pp. 175-203

Author(s):

Mainak Adhikari ◽

Sukhendu Kar

Keyword(s):

High Performance ◽

Programming Model ◽

Direct Access ◽

GPU Programming ◽

Processing Unit ◽

Computing Platform ◽

CUDA Architecture ◽

A graphics processing unit (GPU) typically handles computation only for computer graphics. Any GPU providing a functionally complete set of operations performed on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented on its graphics processing units (GPUs). CUDA gives program developers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. This chapter first discusses some features and challenges of GPU programming and the effort to address some of the challenges of building and running GPU programs in a high performance computing (HPC) environment. Finally, this chapter points out the importance and standards of the CUDA architecture.

Encyclopedia of Information Science and Technology, Fifth Edition - Advances in Information Quality and Management ◽

Data Streaming Processing Window Joined With Graphics Processing Units (GPUs)

10.4018/978-1-7998-3479-3.ch043 ◽

2021 ◽

pp. 602-623

Author(s):

Shen Lu ◽

Richard S. Segall

Keyword(s):

Big Data ◽

Data Streams ◽

Data Stream ◽

Large Scale ◽

Processing Unit ◽

Data Streaming ◽

Large Scale Data ◽

Big data is large-scale data and can be either discrete or continuous. This article discusses the continuous case of big data, often called “data streaming.” More and more businesses will depend on being able to process and make decisions on streams of data. The article focuses on the algorithmic side of data stream processing, often called “stream analytics” or “stream mining.” A data stream Window Join can be accelerated by using a graphics processing unit (GPU) for higher-performance computing. Data streams are generated by two independent threads: one thread generates Data Stream A, and the other generates Data Stream B. A Window Join thread then merges the two data streams; this is the process of “Data Stream Window Join.” The Window Join process can be implemented in parallel, which can improve computing speed. Experiments are provided for Data Stream Window Joins using both static and dynamic data.
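As an illustration of the window join described in this abstract, the following is a minimal sequential sketch in Python. It is not the authors' GPU implementation: the tuple layout `(timestamp, key, value)`, the function name `window_join`, and the time-based window are all assumptions made for the example, and the two streams are plain lists rather than threads.

```python
from collections import deque

def window_join(stream_a, stream_b, window):
    """Join (timestamp, key, value) tuples from two streams whose keys
    match and whose timestamps differ by at most `window`."""
    # Merge both streams into one time-ordered sequence, tagging the source.
    events = sorted(
        [(t, k, v, "A") for t, k, v in stream_a]
        + [(t, k, v, "B") for t, k, v in stream_b],
        key=lambda e: e[0],
    )
    buffers = {"A": deque(), "B": deque()}
    results = []
    for t, k, v, side in events:
        other = "B" if side == "A" else "A"
        # Evict tuples from the opposite buffer that have left the window.
        while buffers[other] and t - buffers[other][0][0] > window:
            buffers[other].popleft()
        # Probe the opposite buffer for tuples with the same key.
        for _, k2, v2 in buffers[other]:
            if k2 == k:
                a_val, b_val = (v, v2) if side == "A" else (v2, v)
                results.append((k, a_val, b_val))
        # Keep the new tuple so later arrivals on the other side can match it.
        buffers[side].append((t, k, v))
    return results
```

In this sketch each arriving tuple probes only the tuples still inside the window on the opposite stream. The probe loop over the buffer is the part that a parallel implementation would typically distribute across threads.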

Construction of an optoacoustic image of biological tissues based on an algorithm for a graphics processor

Applied Physics ◽

10.51368/1996-0948-2021-5-106-109 ◽

2021 ◽

pp. 106-109

Author(s):

Denis Kravchuk

Keyword(s):

GPU Computing ◽

Biological Tissues ◽

Ultrasonic Field ◽

Processing Unit ◽

Optoacoustic Imaging ◽

Optoacoustic Interaction ◽

Speed Up ◽

Migration Method ◽

Optical contrast between different blood particles allows optoacoustic imaging to visualize the distribution of blood particles (erythrocytes, taking oxygen saturation into account) and the delivery of drugs to organs through blood vessels. An algorithm for calculating the ultrasonic field produced by optoacoustic interaction has been developed to speed up calculations on the GPU. An architecture for fast reconstruction of an optoacoustic signal based on graphics processing unit (GPU) programming is proposed. Combined with the pre-migration method, the algorithm improves the resolution and sharpness of the optoacoustic image of the simulated biological tissues. The GPU computing architecture efficiently accelerates computations that are time-consuming on the central processing unit (CPU).

Accelerating the RTTOV-7 IASI and AMSU-A radiative transfer models on graphics processing units: evaluating central processing unit/graphics processing unit-hybrid and pure-graphics processing unit approaches

Journal of Applied Remote Sensing ◽

10.1117/1.3658028 ◽

2011 ◽

Vol 5 (1) ◽

pp. 051503 ◽

Cited By ~ 4

Author(s):

Jarno Mielikainen

Keyword(s):

Radiative Transfer ◽

Central Processing Unit ◽

Processing Unit ◽

Central Processing ◽

Radiative Transfer Models ◽

Graphics Processing ◽

Transfer Models

Test and Analysis GPU-Accelerated in Molecular Dynamics Simulation

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.380-384.1652 ◽

2013 ◽

Vol 380-384 ◽

pp. 1652-1655

Author(s):

Zhang Yang ◽

Chen Wen Bo ◽

Bai Qi Feng ◽

Lian Li

Keyword(s):

Large Scale ◽

GPU Computing ◽

System Simulation ◽

Dynamics Simulation ◽

Utilization Rate ◽

Processing Unit ◽

The Difference ◽

Graphics Processing ◽

Memory Utilization

GPU computing is the use of a graphics processing unit together with a CPU to accelerate large-scale scientific and engineering applications, such as molecular simulation. The paper uses an NVIDIA Tesla C2050, an NVIDIA GTX 580, and NAMD 2.9 to simulate three different molecular systems: Beta2, SET9, and Ubiquitin. We compared and analyzed the simulation results and conclude that different molecular systems obtain different speedups. The computing time with four GPUs is nearly half that with one GPU, especially for macromolecular systems. Furthermore, the larger the protein system, the higher the GPU memory utilization. The performance of the NVIDIA GTX 580 is only half that of the NVIDIA Tesla C2050. The NVIDIA Tesla C2050 can accommodate even larger system simulations.

Proceedings of the International Conference on Emerging Trends in Engineering & Technology (IConETech-2020) ◽

Developing Parallel Computing Algorithms Using GPUs to Determine Oil and Gas Reserves Presented in the Upstream (Exploration) Sector

10.47412/mruu5197 ◽

2020 ◽

Author(s):

Stefan Boodoo ◽

Ajay Joshi

Keyword(s):

High Performance ◽

Oil And Gas ◽

GPU Computing ◽

Reservoir Rock ◽

Processing Unit ◽

Potential Wells ◽

Central Processing ◽

Rock Formations ◽

Oil and gas companies keep exploring every new possible method to increase the likelihood of finding a commercial hydrocarbon-bearing prospect. Well logging generates gigabytes of data from various probes and sensors. After processing, a prospective reservoir will indicate areas of oil, gas, water and reservoir rock. Incorporating High Performance Computing (HPC) methodologies allows thousands of potential wells to be assessed for their hydrocarbon-bearing potential. This study presents Graphics Processing Unit (GPU) computing as another method of analyzing probable reserves. Raw well log data from the Kansas Geological Society (1999-2018) forms the basis of the data analysis. Parallel algorithms are developed that make use of Nvidia’s Compute Unified Device Architecture (CUDA). The results show a fivefold speedup using an Nvidia GeForce GT 330M GPU compared to an Intel Core i7 740QM Central Processing Unit (CPU). The processed results display depth-wise areas of shale and rock formations as well as water, oil and/or gas reserves.

Parallel SVD Algorithm for a Three-Diagonal Matrix on a Video Card Using the Nvidia CUDA Architecture

NaUKMA Research Papers Computer Science ◽

10.18523/2617-3808.2021.4.16-22 ◽

2021 ◽

Vol 4 ◽

pp. 16-22

Author(s):

Mykola Semylitko ◽

Gennadii Malaschonok

Keyword(s):

Computation Time ◽

Application Programming Interface ◽

General Purpose ◽

Diagonal Matrix ◽

Free Access ◽

Processing Unit ◽

CUDA Architecture ◽

The SVD (Singular Value Decomposition) algorithm is used in recommendation systems, machine learning, image processing, and various algorithms that work with very large matrices and Big Data. Given its structure, it can be executed on a large number of computing threads, which only video cards provide. CUDA is a parallel computing platform and application programming interface model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit for general-purpose processing, an approach termed GPGPU (general-purpose computing on graphics processing units). The GPU provides much higher instruction throughput and memory bandwidth than the CPU within a similar price and power envelope. Many applications leverage these higher capabilities to run faster on the GPU than on the CPU. Other computing devices, like FPGAs, are also very energy efficient, but they offer much less programming flexibility than GPUs. The developed modification uses the CUDA architecture, which is intended for a large number of simultaneous calculations and makes it possible to process very large matrices quickly. The parallel SVD algorithm for a three-diagonal matrix, based on Givens rotations, provides high accuracy of calculations. The algorithm also includes a number of optimizations for memory access and multiplication that significantly reduce the computation time by discarding empty iterations. This article proposes an approach that will reduce the computation time and, consequently, resources and costs. The developed algorithm can be used through a simple and convenient API in C++ and Java, and can be improved further by using dynamic parallelism or parallelization of multiplication operations. The obtained results can also be used by other developers for comparison, as all conditions of the research are described in detail and the code is freely available.
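The abstract states that the algorithm is based on Givens rotations. As a minimal illustration of that building block only, the sketch below computes a single rotation that zeroes one entry of a two-element vector. It runs sequentially on the CPU, the function names `givens` and `apply_givens` are hypothetical, and it is not the authors' CUDA implementation.

```python
import math

def givens(a, b):
    """Return (c, s, r) such that the rotation [[c, s], [-s, c]]
    maps the vector (a, b) to (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        # Nothing to rotate: use the identity rotation.
        return 1.0, 0.0, 0.0
    return a / r, b / r, r

def apply_givens(c, s, x, y):
    """Apply the rotation [[c, s], [-s, c]] to the pair (x, y)."""
    return c * x + s * y, -s * x + c * y
```

Applying the rotation computed for `(3.0, 4.0)` to that same pair gives a first component of 5 and a second component of 0 up to rounding. In a tridiagonal or bidiagonal reduction, rotations like this one are applied repeatedly to pairs of rows or columns to eliminate off-diagonal entries.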