A distributed computing approach to improve the performance of the Parallel Ocean Program (v2.1)

Geoscientific Model Development ◽ 10.5194/gmd-7-267-2014 ◽ 2014 ◽ Vol 7 (1) ◽ pp. 267-281 ◽ Cited By ~ 12

Author(s): B. van Werkhoven ◽ J. Maassen ◽ M. Kliphuis ◽ H. A. Dijkstra ◽ S. E. Brunnabend ◽ ...

Keyword(s): Distributed Computing ◽ Load Balancing ◽ Ocean Circulation ◽ Graphics Processing Units ◽ Block Partitioning ◽ Model Code ◽ Graphics Processing ◽ Computing Approach ◽ Parallel Ocean Program

Abstract. The Parallel Ocean Program (POP) is used in many strongly eddying ocean circulation simulations. Ideally, one would run thousand-year-long simulations, but the current performance of POP prohibits simulations of this length. In this work, using a new distributed computing approach, two methods to improve the performance of POP are presented. The first is a block-partitioning scheme that optimizes the load balancing of POP so that it can be run efficiently in a multi-platform setting. The second is the implementation of part of the POP model code on graphics processing units (GPUs). We show that the combination of both methods also leads to a substantial performance increase when running POP simultaneously over multiple computational platforms.

Compression and load balancing for efficient sparse matrix‐vector product on multicore processors and graphics processing units

Concurrency and Computation Practice and Experience ◽ 10.1002/cpe.6515 ◽ 2021

Author(s): José I. Aliaga ◽ Hartwig Anzt ◽ Thomas Grützmacher ◽ Enrique S. Quintana‐Ortí ◽ Andrés E. Tomás

Keyword(s): Load Balancing ◽ Graphics Processing Units ◽ Sparse Matrix ◽ Multicore Processors ◽ Vector Product ◽ Graphics Processing ◽ Matrix Vector

A parallel computing approach to viewshed analysis of large terrain data using graphics processing units

International Journal of Geographical Information Science ◽ 10.1080/13658816.2012.692372 ◽ 2013 ◽ Vol 27 (2) ◽ pp. 363-384 ◽ Cited By ~ 60

Author(s): Yanli Zhao ◽ Anand Padmanabhan ◽ Shaowen Wang

Keyword(s): Parallel Computing ◽ Graphics Processing Units ◽ Viewshed Analysis ◽ Graphics Processing ◽ Computing Approach

Load Balancing versus Occupancy Maximization on Graphics Processing Units: The Generalized Hough Transform as a Case Study

The International Journal of High Performance Computing Applications ◽ 10.1177/1094342010383998 ◽ 2010 ◽ Vol 25 (2) ◽ pp. 205-222 ◽ Cited By ~ 5

Author(s): Juan Gómez-Luna ◽ José María González-Linares ◽ José Ignacio Benavides ◽ Emilio L. Zapata ◽ Nicolás Guil

Keyword(s): Load Balancing ◽ Hough Transform ◽ Graphics Processing Units ◽ Generalized Hough Transform ◽ Graphics Processing

SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs

Applied Sciences ◽ 10.3390/app9050947 ◽ 2019 ◽ Vol 9 (5) ◽ pp. 947 ◽ Cited By ~ 9

Author(s): Thaha Muhammed ◽ Rashid Mehmood ◽ Aiiad Albeshri ◽ Iyad Katib

Keyword(s): Load Balancing ◽ Graphics Processing Units ◽ Sparse Matrix ◽ Memory Access ◽ Group Matrix ◽ The Matrix ◽ Novel Method ◽ Coalesced Memory ◽ Graphics Processing ◽ Matrix Vector

Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (which translates to "speed" in Arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments and adaptively schedule various segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix on the basis of the nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads to execute each segment is adaptively chosen. Dynamic Parallelism available in Nvidia GPUs is utilized to execute the group containing segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. Therefore, SURAA minimizes the adverse effects of the npr variance by uniformly distributing the load using equal-sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open-source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high-npr-variance matrices from 13 diverse domains. SURAA outperforms the other tools by delivering a 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs, including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.
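The row-grouping step described in this abstract can be illustrated with a small host-side sketch. It is not the authors' implementation: it assumes a CSR row-pointer array as input, uses the Freedman–Diaconis bin width (2 · IQR / n^(1/3)) to cut the npr-sorted rows into segments, and assigns segments to three groups using thresholds (`low`, `high`) that are purely hypothetical, since the abstract does not state how the group boundaries are chosen. All function names are illustrative, and the GPU kernels, streams, and Dynamic Parallelism are not modelled.

```python
from statistics import quantiles


def nonzeros_per_row(row_ptr):
    """Return npr for each row of a CSR matrix, given its row-pointer array."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


def freedman_diaconis_width(values):
    """Freedman-Diaconis bin width: 2 * IQR / n^(1/3).

    Falls back to 1.0 when the interquartile range is zero (assumption).
    """
    q1, _, q3 = quantiles(values, n=4)
    width = 2 * (q3 - q1) / len(values) ** (1 / 3)
    return width if width > 0 else 1.0


def build_segments(row_ptr):
    """Sort rows by npr and cut them into segments of similar npr."""
    npr = nonzeros_per_row(row_ptr)
    order = sorted(range(len(npr)), key=lambda r: npr[r])
    width = freedman_diaconis_width(npr)
    lowest = npr[order[0]]
    bins = {}
    for row in order:
        bin_index = int((npr[row] - lowest) // width)
        bins.setdefault(bin_index, []).append(row)
    return [bins[k] for k in sorted(bins)], npr


def group_segments(segments, npr, low=4.0, high=32.0):
    """Place each segment in one of three groups by its mean npr.

    The thresholds `low` and `high` are hypothetical placeholders.
    """
    groups = {"small": [], "medium": [], "large": []}
    for seg in segments:
        mean_npr = sum(npr[r] for r in seg) / len(seg)
        if mean_npr < low:
            groups["small"].append(seg)
        elif mean_npr < high:
            groups["medium"].append(seg)
        else:
            groups["large"].append(seg)
    return groups


# Toy CSR row pointer for 8 rows with npr = [1, 2, 1, 50, 3, 2, 40, 1].
row_ptr = [0, 1, 3, 4, 54, 57, 59, 99, 100]
segments, npr = build_segments(row_ptr)
groups = group_segments(segments, npr)
for name, segs in groups.items():
    print(name, segs)
```

In a GPU setting, each group would then be dispatched to a kernel suited to its row length; here the grouping only shows how sorting by npr separates the few very long rows from the many short ones.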
