GPU Preconditioning for Block Linear Systems Using Block Incomplete Sparse Approximate Inverses

Solving sparse triangular systems is the building block for incomplete LU (ILU) based preconditioning, but parallel algorithms, such as the level-scheduling scheme, are sometimes limited by the parallelism that can be extracted from the sparsity pattern. In this study, the block version of the incomplete sparse approximate inverses (ISAI) algorithm is studied, and block-ISAI is considered for preconditioning by proposing an efficient algorithm and implementation on graphical processing unit (GPU) accelerators. Performance comparisons are carried out between the proposed algorithm and serial and parallel block triangular solvers from the PETSc and cuSPARSE libraries. The experimental results show that GMRES(30) with the proposed block-ISAI preconditioning achieves speedups of 1.4× to 6.9× over GMRES(30) using the cuSPARSE library on an NVIDIA Tesla V100 GPU.
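
The parallelism limit of level scheduling mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `level_schedule` and the adjacency-list input format are assumptions made for illustration. Rows of a lower-triangular matrix are grouped into levels so that every row depends only on rows in earlier levels. Rows within one level can be solved concurrently, so the number of levels sets the number of sequential steps.

```python
def level_schedule(rows):
    """Group the rows of a sparse lower-triangular matrix into levels.

    rows[i] lists the column indices j < i where L[i][j] is nonzero
    (strictly lower part only). This input format is an assumption
    made for this sketch.
    """
    level = [0] * len(rows)
    for i, deps in enumerate(rows):
        # Row i can be solved one step after the latest row it depends on.
        level[i] = 1 + max((level[j] for j in deps), default=-1)
    groups = {}
    for i, lv in enumerate(level):
        groups.setdefault(lv, []).append(i)
    return [groups[lv] for lv in sorted(groups)]


# Bidiagonal pattern: each row depends on the previous one.
chain_levels = level_schedule([[], [0], [1], [2]])
# Diagonal plus first column: rows 1..3 depend only on row 0.
arrow_levels = level_schedule([[], [0], [0], [0]])
print(chain_levels)  # [[0], [1], [2], [3]]
print(arrow_levels)  # [[0], [1, 2, 3]]
```

The bidiagonal pattern yields four levels of one row each, so the solve is fully sequential, while the second pattern needs only two steps. Which case applies is determined entirely by the sparsity pattern.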


Mixed-mode database miner classifier: Parallel computation of graphical processing unit mining

International Journal of Electrical Engineering Education ◽

10.1177/0020720920988494 ◽

2021 ◽

pp. 002072092098849

Author(s):

Soumya Ranjan Nayak ◽

S Sivakumar ◽

Akash Kumar Bhoi ◽

Gyoo-Soo Chae ◽

Pradeep Kumar Mallick

Keyword(s):

Credit Card ◽

Mixed Mode ◽

Processing Time ◽

Gpu Computing ◽

Graphical Processing Unit ◽

Computational Time ◽

Processing Unit ◽

Large Set ◽

Minimal Processing ◽

Graphical Processing

Graphical processing units (GPUs) have gained popularity among researchers in the field of decision making and knowledge discovery systems. However, most earlier studies are limited in GPU memory utilization, computational time, and accuracy. The main contribution of this paper is a novel algorithm called the Mixed Mode Database Miner (MMDBM) classifier, which applies multithreading concepts to a large number of attributes. The proposed method uses the quick sort algorithm in GPU parallel computing to overcome the state-of-the-art limitations. It applies a dynamic rule generation approach for constructing the decision tree based on the predicted rules. The implementation results are compared for both SLIQ and MMDBM using Java and GPU, with the acceleration ratio computed on the BP dataset. The primary objective of this work is to improve performance with less processing time. The results are also analyzed using various threads in GPU mining on eight different datasets from the UCI Machine Learning Repository. The proposed MMDBM algorithm has been validated on these eight datasets with accuracies of 91.3% on diabetes, 89.1% on breast cancer, 96.6% on iris, 89.9% on labor, 95.4% on vote, 89.5% on credit card, 78.7% on supermarket, and 78.7% on BP, and it also takes less computational time on the given datasets. The outcome of this work will help the research community develop more effective multithread-based GPU solutions for GPU mining that handle large sets of data in minimal processing time. Therefore, this can be considered a more reliable and precise method for GPU computing.


A Comparative Study of Block Incomplete Sparse Approximate Inverses Preconditioning on Tesla K20 and V100 GPUs

Algorithms ◽

10.3390/a14070204 ◽

2021 ◽

Vol 14 (7) ◽

pp. 204

Author(s):

Wenpeng Ma ◽

Wu Yuan ◽

Xiazhen Liu

Keyword(s):

Comparative Study ◽

Open Source ◽

Multiple Gpus ◽

Approximate Inverses ◽

Level Scheduling

Incomplete Sparse Approximate Inverses (ISAI) has shown some advantages over sparse triangular solves on GPUs when it is used for incomplete LU based preconditioning. In this paper, we extend the single-GPU method for Block–ISAI to a multi-GPU algorithm by coupling it with a Block–Jacobi preconditioner, and describe the detailed implementation in the open source numerical package PETSc. In the experiments, two representative cases are run, and a comparative study of Block–ISAI on up to four GPUs is conducted on two major generations of NVIDIA GPUs (Tesla K20 and Tesla V100). Block–Jacobi preconditioning with Block–ISAI (BJPB-ISAI) shows an advantage over the level-scheduling based triangular solves from the cuSPARSE library for these cases, and the overhead of setting up Block–ISAI and the total wall clock times of GMRES are greatly reduced on Tesla V100 GPUs compared to Tesla K20 GPUs.


A graphical processing unit‐based parallel hybrid genetic algorithm for resource‐constrained multi‐project scheduling problem

Concurrency and Computation Practice and Experience ◽

10.1002/cpe.6266 ◽

2021 ◽

Author(s):

Furkan Uysal ◽

Rifat Sonmez ◽

Selcuk Kursat Isleyen

Keyword(s):

Genetic Algorithm ◽

Project Scheduling ◽

Hybrid Genetic Algorithm ◽

Graphical Processing Unit ◽

Processing Unit ◽

Scheduling Problem ◽

Resource Constrained ◽

Parallel Hybrid ◽

Project Scheduling Problem ◽

Graphical Processing


ACCELERATING 3D NON-RIGID REGISTRATION USING GRAPHICS HARDWARE

International Journal of Image and Graphics ◽

10.1142/s0219467808002988 ◽

2008 ◽

Vol 08 (01) ◽

pp. 81-98 ◽

Cited By ~ 11

Author(s):

NICOLAS COURTY ◽

PIERRE HELLIER

Keyword(s):

Image Analysis ◽

Classical Method ◽

Graphics Hardware ◽

Processing Unit ◽

Recursive Filtering ◽

Rigid Registration ◽

Guided Surgery ◽

Graphical Processing Unit Implementation ◽

Demons Algorithm ◽

Graphical Processing

There is an increasing need for real-time implementation of 3D image analysis processes, especially in the context of image-guided surgery. Among the various image analysis tasks, non-rigid image registration is particularly needed and is also computationally prohibitive. This paper presents a GPU (Graphical Processing Unit) implementation of the popular Demons algorithm using Gaussian recursive filtering. Acceleration of the classical method is mainly achieved by a new filtering scheme on the GPU, which could be reused in or extended to other applications and represents a significant contribution to the GPU-based image processing domain. This implementation was able to perform a non-rigid registration of 3D MR volumes in less than one minute, which corresponds to an acceleration factor of 10 compared to the corresponding CPU implementation. This demonstrates the usefulness of such a method in an intra-operative context.


A computationally efficient approach to segmentation of the aorta and coronary arteries using deep learning

10.1101/2021.02.18.21252005 ◽

2021 ◽

Author(s):

Wing Keung Cheung ◽

Robert Bell ◽

Arjun Nair ◽

Leon Menezies ◽

Riyaz Patel ◽

...

Keyword(s):

Deep Learning ◽

Coronary Arteries ◽

Automatic Segmentation ◽

Three Dimensional ◽

Regions Of Interest ◽

Dice Similarity Coefficient ◽

Processing Unit ◽

Two Dimensional ◽

Computed Tomography Images ◽

Graphical Processing

A fully automatic two-dimensional Unet model is proposed to segment the aorta and coronary arteries in computed tomography images. Two models are trained to segment two regions of interest: (1) the aorta and the coronary arteries, or (2) the coronary arteries alone. Our method achieves 91.20% and 88.80% dice similarity coefficient accuracy on regions of interest 1 and 2, respectively. Compared with a semi-automatic segmentation method, our model performs better when segmenting the coronary arteries alone. The performance of the proposed method is comparable to existing published two-dimensional or three-dimensional deep learning models. Furthermore, the algorithmic and graphical processing unit memory efficiencies are maintained such that the model can be deployed within hospital computer networks where graphical processing units are typically not available.
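
For readers unfamiliar with the metric quoted above, the Dice similarity coefficient of a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it for flattened binary masks. The function name `dice` and the convention of returning 1.0 when both masks are empty are assumptions for illustration, and this is not the authors' evaluation code.

```python
def dice(pred, truth):
    """Dice similarity coefficient of two equal-length binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    # Count pixels labelled foreground in both masks.
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * overlap / total


# One shared foreground pixel, three foreground pixels in total: 2 * 1 / 3.
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
print(round(score, 4))  # 0.6667
```

A score of 1.0 means the masks coincide exactly, and 0.0 means they share no foreground pixels.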


Graphical processing unit (GPU) acceleration for numerical solution of population balance models using high resolution finite volume algorithm

Computers & Chemical Engineering ◽

10.1016/j.compchemeng.2016.03.023 ◽

2016 ◽

Vol 91 ◽

pp. 167-181 ◽

Cited By ~ 25

Author(s):

Botond Szilágyi ◽

Zoltán K. Nagy

Keyword(s):

High Resolution ◽

Numerical Solution ◽

Finite Volume ◽

Population Balance ◽

Graphical Processing Unit ◽

Gpu Acceleration ◽

Processing Unit ◽

Graphical Processing ◽

Volume Algorithm


Low latency iterative reconstruction of first pass stress cardiac perfusion with physiological stress using graphical processing unit

Journal of Cardiovascular Magnetic Resonance ◽

10.1186/1532-429x-15-s1-e10 ◽

2013 ◽

Vol 15 (S1) ◽

Author(s):

Sébastien Roujol ◽

Tamer A Basha ◽

Christophe Schülke ◽

Martin Buehrer ◽

Warren J Manning ◽

...

Keyword(s):

Iterative Reconstruction ◽

Physiological Stress ◽

Graphical Processing Unit ◽

Low Latency ◽

Processing Unit ◽

Cardiac Perfusion ◽

First Pass ◽

Graphical Processing


A Representation of Membrane Computing with a Clustering Algorithm on the Graphical Processing Unit

Processes ◽

10.3390/pr8091199 ◽

2020 ◽

Vol 8 (9) ◽

pp. 1199

Author(s):

Ravie Chandren Muniyandi ◽

Ali Maroosi

Keyword(s):

Graphics Processing Units ◽

Clustering Algorithm ◽

Hamiltonian Path ◽

Fold Increase ◽

General Purpose ◽

Processing Unit ◽

Thread Block ◽

Hard Problems ◽

Graphical Processing ◽

Graphics Processing

Long-timescale simulations of biological processes such as photosynthesis, or attempts to solve NP-hard problems such as traveling salesman, knapsack, Hamiltonian path, and satisfiability using membrane systems, can take hours or days without appropriate parallelization. Graphics processing units (GPUs) provide a massively parallel mechanism for general-purpose computation. Previous studies mapped one membrane to one thread block on the GPU. This is disadvantageous because when the number of objects in each membrane is small, the number of active threads is also small, which decreases performance. Moreover, when each membrane is assigned to one thread block, communication between membranes requires communication between thread blocks, which is time-consuming. Previous approaches have also not addressed the issue of GPU occupancy. This study presents a classification algorithm to manage dependent objects and membranes based on the communication rate associated with the defined weighted network and assign them to sub-matrices. Thus, dependent objects and membranes are allocated to the same threads and thread blocks, thereby decreasing communication between threads and thread blocks and allowing GPUs to maintain the highest occupancy possible. The experimental results indicate that for 48 objects per membrane, the algorithm yields a 93-fold increase in processing speed, compared to a 1.6-fold increase with previous algorithms.
