coalesced memory Latest Research Papers

The logistic map is a class of chaotic maps. It is still in use in image cryptography. The logistic map cryptosystem has two stages, namely permutation, and diffusion. These two stages being computationally intensive, the permutation relocates the pixels, whereas the diffusion rescales them. The research on refining the logistic map is progressing to make the encryption more secure. Now there is a need to improve its efficiency to enable such models to fit for high-speed applications. The new invention of accelerators offers efficiency. But the inherent data dependencies hinder the use of accelerators. This paper discusses the novelty of identifying independent data-parallel tasks in a logistic map, handing them over to the accelerators, and improving their efficiency. Among the two accelerator models proposed, the first one achieves peak efficiency using coalesced memory access. The other cryptosystem further improves performance at the cost of more execution resources. In this investigation, it is noteworthy that the parallelly accelerated logistic map achieved a significant speedup to the larger grayscale image used. The objective security estimates proved that the two stages of the proposed systems progressively ensure security.

Download Full-text

GPrimer: a fast GPU-based pipeline for primer design for qPCR experiments

BMC Bioinformatics ◽

10.1186/s12859-021-04133-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jeongmin Bae ◽

Hajin Jeon ◽

Min-Soo Kim

Keyword(s):

Data Structures ◽

Primer Design ◽

Main Memory ◽

Design Tools ◽

Workload Balancing ◽

Qpcr Analysis ◽

Computational Speed ◽

Entire Sequence ◽

Target Sequences ◽

Coalesced Memory

Abstract Background Design of valid high-quality primers is essential for qPCR experiments. MRPrimer is a powerful pipeline based on MapReduce that combines both primer design for target sequences and homology tests on off-target sequences. It takes an entire sequence DB as input and returns all feasible and valid primer pairs existing in the DB. Due to the effectiveness of primers designed by MRPrimer in qPCR analysis, it has been widely used for developing many online design tools and building primer databases. However, the computational speed of MRPrimer is too slow to deal with the sizes of sequence DBs growing exponentially and thus must be improved. Results We develop a fast GPU-based pipeline for primer design (GPrimer) that takes the same input and returns the same output with MRPrimer. MRPrimer consists of a total of seven MapReduce steps, among which two steps are very time-consuming. GPrimer significantly improves the speed of those two steps by exploiting the computational power of GPUs. In particular, it designs data structures for coalesced memory access in GPU and workload balancing among GPU threads and copies the data structures between main memory and GPU memory in a streaming fashion. For human RefSeq DB, GPrimer achieves a speedup of 57 times for the entire steps and a speedup of 557 times for the most time-consuming step using a single machine of 4 GPUs, compared with MRPrimer running on a cluster of six machines. Conclusions We propose a GPU-based pipeline for primer design that takes an entire sequence DB as input and returns all feasible and valid primer pairs existing in the DB at once without an additional step using BLAST-like tools. The software is available at https://github.com/qhtjrmin/GPrimer.git.

Download Full-text

Optimizing non-coalesced memory access for irregular applications with GPU computing

Frontiers of Information Technology & Electronic Engineering ◽

10.1631/fitee.1900262 ◽

2020 ◽

Vol 21 (9) ◽

pp. 1285-1301 ◽

Cited By ~ 1

Author(s):

Ran Zheng ◽

Yuan-dong Liu ◽

Hai Jin

Keyword(s):

Gpu Computing ◽

Memory Access ◽

Irregular Applications ◽

Coalesced Memory

Download Full-text

SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs

Applied Sciences ◽

10.3390/app9050947 ◽

2019 ◽

Vol 9 (5) ◽

pp. 947 ◽

Cited By ~ 9

Author(s):

Thaha Muhammed ◽

Rashid Mehmood ◽

Aiiad Albeshri ◽

Iyad Katib

Keyword(s):

Load Balancing ◽

Graphics Processing Units ◽

Sparse Matrix ◽

Memory Access ◽

Group Matrix ◽

The Matrix ◽

Novel Method ◽

Coalesced Memory ◽

Graphics Processing ◽

Matrix Vector

Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (translates to speed in arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments, and adaptively schedule various segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix on the basis of the nonzero elements per row ( n p r) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean n p r of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads to execute each segment is adaptively chosen. Dynamic Parallelism available in Nvidia GPUs is utilized to execute the group containing segments with the largest mean n p r, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. Therefore, SURAA minimizes the adverse effects of the n p r variance by uniformly distributing the load using equal sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high n p r v a r i a n c e matrices from 13 diverse domains. SURAA outperforms the other tools by delivering 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV related challenges on GPUs including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.

Download Full-text

Parallel Computation of EM Backscattering from Large Three-Dimensional Sea Surface with CUDA

Sensors ◽

10.3390/s18113656 ◽

2018 ◽

Vol 18 (11) ◽

pp. 3656 ◽

Cited By ~ 5

Author(s):

Longxiang Linghu ◽

Jiaji Wu ◽

Zhensen Wu ◽

Xiaobing Wang

Keyword(s):

Parallel Computation ◽

Graphics Processing Units ◽

Data Transfer ◽

Three Dimensional ◽

Block Size ◽

Sea Surface ◽

Scale Model ◽

Specular Scattering ◽

Coalesced Memory ◽

The Impact

An efficient parallel computation using graphics processing units (GPUs) is developed for studying the electromagnetic (EM) backscattering characteristics from a large three-dimensional sea surface. A slope-deterministic composite scattering model (SDCSM), which combines the quasi-specular scattering of Kirchhoff Approximation (KA) and Bragg scattering of the two-scale model (TSM), is utilized to calculate the normalized radar cross section (NRCS in dB) of the sea surface. However, with the improvement of the radar resolution, there will be millions of triangular facets on the large sea surface which make the computation of NRCS time-consuming and inefficient. In this paper, the feasibility of using NVIDIA Tesla K80 GPU with four compute unified device architecture (CUDA) optimization strategies to improve the calculation efficiency of EM backscattering from a large sea surface is verified. The whole GPU-accelerated SDCSM calculation takes full advantage of coalesced memory access, constant memory, fast math compiler options, and asynchronous data transfer. The impact of block size and the number of registers per thread is analyzed to further improve the computation speed. A significant speedup of 748.26x can be obtained utilizing a single GPU for the GPU-based SDCSM implemented compared with the CPU-based counterpart performing on the Intel(R) Core(TM) i5-3450.

Download Full-text