CUDA Programming
Recently Published Documents


TOTAL DOCUMENTS: 44 (FIVE YEARS: 10)

H-INDEX: 10 (FIVE YEARS: 1)

Author(s):  
М.Л. Цымблер ◽  
А.И. Гоглачев

Discovery of typical subsequences in a time series is one of the topical problems of time series mining. In this problem, the task is to find a set of subsequences that adequately represents the process or phenomenon specified by the time series. Solving this problem makes it possible to summarize and visualize large time series in a wide range of applications: monitoring the technical condition of complex machines and mechanisms, intelligent management of life support systems, monitoring indicators of functional diagnostics of the human body, and others. The recently proposed snippet concept formalizes a typical time series subsequence as follows. A snippet of a time series is a subsequence that many other subsequences of the given series are similar to, with respect to a specialized similarity measure based on the Euclidean distance. Although snippet-based discovery of typical subsequences shows adequate results for time series from a wide range of subject domains, the corresponding algorithm has a high computational complexity. In this article, we propose a novel parallel algorithm for snippet discovery on a GPU. Parallelization is performed with the CUDA programming technology. We developed data structures that allow the computations to be parallelized efficiently on the graphics processor. The experimental results confirm the high performance of the proposed algorithm.
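The core of the snippet similarity measure is the Euclidean distance between fixed-length subsequences, which maps naturally onto one GPU thread per subsequence. The following minimal CUDA sketch is illustrative only and is not the authors' algorithm or data structures; it computes the distance profile between a hypothetical candidate subsequence and every other subsequence of the series.

// Illustrative sketch, not the authors' implementation: one thread per
// subsequence start position computes the Euclidean distance between that
// subsequence and a candidate subsequence of length m.
// d_series: time series of length n (device memory),
// cand: start index of the candidate, d_dist: output profile of length n-m+1.
__global__ void distance_profile(const float *d_series, int n, int m,
                                 int cand, float *d_dist)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int profileLen = n - m + 1;
    if (i >= profileLen) return;

    float sum = 0.0f;
    for (int j = 0; j < m; ++j) {
        float diff = d_series[i + j] - d_series[cand + j];
        sum += diff * diff;
    }
    d_dist[i] = sqrtf(sum);
}

// Typical launch: distance_profile<<<(profileLen + 255) / 256, 256>>>(...);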


2021 ◽  
Author(s):  
Randa Khemiri ◽  
Soulef Bouaafia ◽  
Asma Bahba ◽  
Maha Nasr ◽  
Fatma Ezahra Sayadi

In motion estimation (ME), block matching algorithms offer great potential for parallelism. The search for the best match is performed by computing the similarity of each block position inside the search area using a similarity metric such as the Sum of Absolute Differences (SAD), which is used in the various steps of motion estimation algorithms. Moreover, since the computation is the same for every block of pixels, it can be parallelized on a Graphics Processing Unit (GPU), offering better results. In this work, a single OpenCL code was first run on several architectures, namely CPU and GPU; then a parallel GPU implementation of the SAD process was proposed with both CUDA and OpenCL for block sizes from 4x4 to 64x64. A comparative study of the GPU execution times was carried out on the same video sequence. The experimental results indicate that the OpenCL execution time on the GPU was better than the CUDA time, with a performance ratio reaching a factor of two.
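As a concrete illustration of how the SAD step parallelizes, the sketch below is a hypothetical CUDA kernel (not the authors' code): each thread evaluates one candidate displacement inside the search window; frame-border handling and the final minimum search over d_sad are omitted.

// Hypothetical sketch: one thread per candidate displacement (dx, dy).
// d_cur / d_ref: current and reference frames (row-major, stride 'width'),
// (bx, by): top-left corner of the current block, blockSize: 4..64,
// range: search range in pixels, d_sad: one SAD value per displacement.
__global__ void block_sad(const unsigned char *d_cur, const unsigned char *d_ref,
                          int width, int bx, int by, int blockSize,
                          int range, int *d_sad)
{
    int dx = (int)(blockIdx.x * blockDim.x + threadIdx.x) - range;
    int dy = (int)(blockIdx.y * blockDim.y + threadIdx.y) - range;
    if (dx > range || dy > range) return;

    int sum = 0;
    for (int r = 0; r < blockSize; ++r)
        for (int c = 0; c < blockSize; ++c) {
            int cur = d_cur[(by + r) * width + (bx + c)];
            int ref = d_ref[(by + dy + r) * width + (bx + dx + c)];
            sum += abs(cur - ref);
        }
    // The host (or a reduction kernel) then takes the minimum over d_sad
    // to obtain the best matching motion vector.
    d_sad[(dy + range) * (2 * range + 1) + (dx + range)] = sum;
}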


2021 ◽  
Vol 54 (4) ◽  
Author(s):  
Pranay Reddy Kommera ◽  
Vinay Ramakrishnaiah ◽  
Christine Sweeney ◽  
Jeffrey Donatelli ◽  
Petrus H. Zwart

The multitiered iterative phasing (MTIP) algorithm is used to determine the biological structures of macromolecules from fluctuation scattering data. It is an iterative algorithm that reconstructs the electron density of the sample by matching the computed fluctuation X-ray scattering data to the external observations, and by simultaneously enforcing constraints in real and Fourier space. This paper presents the first efforts to accelerate the MTIP algorithm on contemporary graphics processing units (GPUs). The Compute Unified Device Architecture (CUDA) programming model is used to accelerate the MTIP algorithm on NVIDIA GPUs. The CUDA-based MTIP implementation outperforms the CPU-based version by an order of magnitude. Furthermore, the Heterogeneous-Compute Interface for Portability (HIP) runtime APIs are used to demonstrate portability by accelerating the MTIP algorithm across NVIDIA and AMD GPUs.
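The portability argument rests on the fact that the HIP runtime API mirrors the CUDA runtime call-for-call. The sketch below is purely illustrative; the support-constraint step and all names are assumptions, not the MTIP code. It shows a generic real-space constraint kernel with its CUDA host wrapper; replacing the cuda* runtime calls with their hip* counterparts (hipMalloc, hipMemcpy, hipFree) and including hip/hip_runtime.h gives a version that builds for both NVIDIA and AMD GPUs.

#include <cuda_runtime.h>

// Illustrative only, not the authors' MTIP code: zero the electron density
// outside a binary support mask, a common real-space constraint in
// iterative phasing algorithms.
__global__ void apply_support(float *rho, const unsigned char *support, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && support[i] == 0)
        rho[i] = 0.0f;
}

void apply_support_on_gpu(float *h_rho, const unsigned char *h_support, int n)
{
    float *d_rho;
    unsigned char *d_support;
    cudaMalloc((void **)&d_rho, n * sizeof(float));
    cudaMalloc((void **)&d_support, n * sizeof(unsigned char));
    cudaMemcpy(d_rho, h_rho, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_support, h_support, n, cudaMemcpyHostToDevice);

    // The HIP equivalents (hipMalloc, hipMemcpy, kernel<<<...>>>, hipFree)
    // have the same signatures, which is what makes the port mechanical.
    int threads = 256;
    apply_support<<<(n + threads - 1) / threads, threads>>>(d_rho, d_support, n);

    cudaMemcpy(h_rho, d_rho, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_rho);
    cudaFree(d_support);
}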


2021 ◽  
Author(s):  
Zixiong Zhao ◽  
Peng Hu ◽  
Wei Li ◽  
Zhixian Cao ◽  
Zhiguo He

In recent decades, computational hydraulics and sediment modelling have developed greatly thanks to advances in computing technology. Applying a finite-volume Godunov-type hydrodynamic shallow water model with hydro-sediment-morphodynamic processes, this work demonstrates and analyses the capability of single-host parallel computing combined with algorithmic acceleration. The model is implemented for high-performance computing in three ways: on the GPU, using NVIDIA's Compute Unified Device Architecture (CUDA) programming framework; across multiple CPU cores, using a domain decomposition technique and an efficient Open Multi-Processing (OpenMP) implementation; and with an algorithmic acceleration technique, the local time stepping (LTS) scheme, which gains much efficiency by using different time step sizes for different grid sizes. The model is applied to three cases, through which we compare the effectiveness of CPU, OpenMP, OpenMP+LTS, CUDA, and CUDA+LTS, demonstrating both the high computational performance of CUDA+LTS, which leads to speedups of up to 40 times with respect to the CPU, and its high-precision results.

KEY WORDS: hydro-sediment-morphological modeling; local time step; OpenMP; CUDA.
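To illustrate the LTS idea in GPU form, the following is a hypothetical sketch under assumed names and data layout, not the authors' solver: each cell carries an integer refinement level; within one global cycle, a cell is updated only at the substeps matching its level, so cells with restrictive stability limits take many small steps while the rest take few large ones.

// Hypothetical LTS sketch, not the authors' scheme. level[i] = 0 means the
// cell advances with the global maximum step dtMax; level[i] = L means it
// advances with dtMax / 2^L. A cycle consists of 2^maxLevel substeps.
__global__ void lts_update(float *u, const float *flux, const int *level,
                           int nCells, int substep, int maxLevel, float dtMax)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nCells) return;

    int stride = 1 << (maxLevel - level[i]);   // substeps between updates
    if (substep % stride != 0) return;         // not this cell's turn

    float dtLocal = dtMax / (float)(1 << level[i]);
    u[i] += dtLocal * flux[i];                 // explicit finite-volume update
}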


Author(s):  
Teamsar Muliadi Panggabean ◽  
Mario Elyezer Simaremare ◽  
Rusmina Siahaan ◽  
Chandro Pardede ◽  
Wiwin Putri Gurning

Computation ◽  
2020 ◽  
Vol 8 (2) ◽  
pp. 48 ◽  
Author(s):  
Stefano Quer ◽  
Andrea Marcelli ◽  
Giovanni Squillero

The maximum common subgraph of two graphs is the largest possible common subgraph, i.e., the common subgraph with as many vertices as possible. Although this problem is very challenging, having long been proven NP-hard, its countless practical applications still motivate the search for exact solutions. This work discusses the possibility of extending an existing, very effective branch-and-bound procedure to parallel multi-core and many-core architectures. We analyze a parallel multi-core implementation that exploits a divide-and-conquer approach based on a thread pool, which does not deteriorate the original algorithmic efficiency and minimizes data structure repetition. We also extend the original algorithm to many-core GPU architectures using the CUDA programming framework, and we show how to handle the heavy workload unbalance and the massive data dependencies. Then, we suggest new heuristics to reorder the adjacency matrix, to deal with "dead-ends", and to randomize the search with automatic restarts. These heuristics can achieve significant speed-ups on specific instances, even if they may not be competitive with the original strategy on average. Finally, we propose a portfolio approach, which integrates all the different local search algorithms as component tools; such a portfolio, rather than choosing the best tool for a given instance up front, takes the decision online. The proposed approach drastically limits memory bandwidth constraints and avoids other typical portfolio fragilities, as the CPU and GPU versions often show complementary efficiency and run on separate platforms. Experimental results support the claims and motivate further research to better exploit GPUs in embedded task-intensive and multi-engine parallel applications.
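For context, the pruning test at the heart of such branch-and-bound solvers can itself be evaluated for many search-tree nodes in parallel. The sketch below is a hypothetical CUDA kernel; the flat per-node label-class layout and all names are assumptions, not the authors' data structures.

// Hypothetical sketch: one thread per branch-and-bound node computes the
// standard upper bound (matched vertices plus, for each label class, the
// smaller of the remaining vertex counts in the two graphs) and marks the
// node as prunable if the bound cannot beat the incumbent solution.
__global__ void prune_nodes(const int *mappedSize, const int *leftCount,
                            const int *rightCount, const int *numClasses,
                            int maxClasses, int nNodes, int bestSoFar,
                            unsigned char *pruned)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= nNodes) return;

    int bound = mappedSize[n];
    for (int c = 0; c < numClasses[n]; ++c) {
        int l = leftCount[n * maxClasses + c];
        int r = rightCount[n * maxClasses + c];
        bound += (l < r) ? l : r;
    }
    pruned[n] = (bound <= bestSoFar) ? 1 : 0;
}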


Author(s):  
Can Yang ◽  
Yin Li ◽  
Fenhua Cheng

2019 ◽  
Vol 184 ◽  
pp. 99-106 ◽  
Author(s):  
Rex Kuan-Shuo Liu ◽  
Cheng-Tao Wu ◽  
Neo Shih-Chao Kao ◽  
Tony Wen-Hann Sheu

Author(s):  
Naajil Aamir Khan ◽  
Muhammad Bilal Latif ◽  
Nida Pervaiz ◽  
Mubashir Baig ◽  
Hasina Khatoon ◽  
...  
