Compressed Sparse Row
Recently Published Documents

TOTAL DOCUMENTS: 19 (FIVE YEARS: 2)
H-INDEX: 3 (FIVE YEARS: 0)

Electronics ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 1675
Author(s):  
Sarah AlAhmadi ◽  
Thaha Mohammed ◽  
Aiiad Albeshri ◽  
Iyad Katib ◽  
Rashid Mehmood

Graphics processing units (GPUs) have delivered remarkable performance for a variety of high-performance computing (HPC) applications through massive parallelism. One such application is sparse matrix–vector multiplication (SpMV), which is central to many scientific, engineering, and other applications, including machine learning. No single SpMV storage or computation scheme provides consistently high performance for all matrices, owing to their varying sparsity patterns. An extensive literature review reveals that the performance of SpMV techniques on GPUs has not been studied in sufficient detail. In this paper, we provide a detailed analysis of SpMV performance on GPUs using four notable sparse matrix storage schemes (compressed sparse row (CSR), ELLPACK (ELL), hybrid ELL/COO (HYB), and compressed sparse row 5 (CSR5)), five performance metrics (execution time, giga floating point operations per second (GFLOPS), achieved occupancy, instructions per warp, and warp execution efficiency), five matrix sparsity features (nnz, anpr, nprvariance, maxnpr, and distavg), and 17 sparse matrices from 10 application domains (chemical simulations, computational fluid dynamics (CFD), electromagnetics, linear programming, economics, etc.). Based on the deeper insights gained through this detailed performance analysis, we propose a technique called the heterogeneous CPU–GPU Hybrid (HCGHYB) scheme. It utilizes both the CPU and GPU in parallel and outperforms the HYB format by an average speedup of 1.7x. Heterogeneous computing is an important direction for SpMV and other application areas. Moreover, to the best of our knowledge, this is the first work to discuss SpMV performance on GPUs in such depth. We believe that this performance analysis and the heterogeneous scheme will open up many new directions and improvements for the SpMV computing field.
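To make the CSR scheme discussed above concrete, here is a minimal, illustrative sketch (not the paper's code) of the format's three arrays and the basic SpMV loop that all of the compared GPU kernels ultimately implement:

```python
# CSR stores a sparse matrix in three arrays: the nonzero values, their
# column indices, and per-row pointers into those arrays.

def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy values[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
values  = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_spmv(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [5.0, 3.0, 7.0]
```

The varying length of each row's slice is exactly the sparsity-pattern irregularity that makes no single scheme win on every matrix.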


Electronics ◽  
2018 ◽  
Vol 7 (11) ◽  
pp. 307 ◽  
Author(s):  
Cheng Qian ◽  
Bruce Childers ◽  
Libo Huang ◽  
Hui Guo ◽  
Zhiying Wang

Graph traversal is widely used in map routing, social network analysis, causal discovery, and many other applications. Because it is a memory-bound process, graph traversal puts significant pressure on the memory subsystem. Due to poor spatial locality and the increasing size of today's datasets, graph traversal consumes an ever-larger share of application execution time. One way to mitigate this cost is memory prefetching, which issues requests from the processor to memory in anticipation of needing certain data. However, traditional prefetching does not work well for graph traversal because of data dependencies, the parallel nature of graphs, and the need to move vast amounts of data from memory to the caches. In this paper, we propose CGAcc, a compressed-sparse-row-based graph accelerator on the Hybrid Memory Cube (HMC). CGAcc combines the Compressed Sparse Row (CSR) graph representation with in-memory prefetching and processing to improve the performance of graph traversal. Our approach integrates prefetching and processing into the logic layer of a 3D-stacked Dynamic Random-Access Memory (DRAM) architecture based on Micron's HMC. We selected the HMC to implement CGAcc because it provides high bandwidth and low access latency, and because its multiple DRAM layers are connected to internal logic that controls memory access and performs rudimentary computation. Using the CSR representation, CGAcc deploys prefetchers in the HMC to exploit the short transaction latency between the logic and DRAM layers, which also avoids large data-movement costs. At runtime, CGAcc pipelines prefetching to fetch data from the DRAM arrays, improving memory-level parallelism. To further reduce access latency, several optimized internal caches are introduced to hold the prefetched data to be Processed In-Memory (PIM). A comprehensive evaluation shows the effectiveness of CGAcc: compared to a conventional HMC main memory equipped with a stream prefetcher, CGAcc achieved an average 3.51× speedup with moderate hardware cost.
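The access pattern that defeats conventional prefetchers can be seen in a minimal software sketch of CSR-based breadth-first traversal (illustrative only; CGAcc itself is a hardware pipeline inside the HMC):

```python
from collections import deque

def bfs_csr(row_ptr, col_idx, src):
    """Breadth-first traversal of a graph stored as a CSR adjacency
    structure; returns each vertex's distance from src (-1 if unreachable)."""
    n = len(row_ptr) - 1
    dist = [-1] * n
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        # Neighbors of u are col_idx[row_ptr[u]:row_ptr[u+1]].  This
        # dependent, pointer-chasing access (row_ptr -> col_idx -> next
        # frontier) is what a stride prefetcher cannot anticipate.
        for k in range(row_ptr[u], row_ptr[u + 1]):
            v = col_idx[k]
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# Undirected graph: edges 0-1, 0-2, 1-2, 2-3
row_ptr = [0, 2, 4, 7, 8]
col_idx = [1, 2, 0, 2, 0, 1, 3, 2]
print(bfs_csr(row_ptr, col_idx, 0))  # [0, 1, 1, 2]
```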


2018 ◽  
Vol 10 (2) ◽  
pp. 64-79
Author(s):  
Rafael Machado De Salles ◽  
Leonardo Figueira Werneck ◽  
Grazione De Souza ◽  
Helio Pedro Amaral Souto

This work focuses primarily on the numerical simulation of single-phase oil flow in anticline-type petroleum reservoirs. To this end, a specific technique for representing inactive cells was developed. In addition, to improve computational efficiency, the OpenMP programming interface was used together with the Compressed Sparse Row technique to parallelize the Conjugate Gradient method, which is employed to solve the algebraic system of equations arising from the discretization of the Hydraulic Diffusivity Equation (EDH) that governs the flow. Sensitivity, convergence, and performance tests were carried out for different anticline-type reservoirs.
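As an illustrative sketch (not the article's code), the Conjugate Gradient method with the matrix kept in CSR form looks as follows; the row loop inside the SpMV is the part that naturally maps to an OpenMP parallel-for, since each row is independent:

```python
def csr_spmv(values, col_idx, row_ptr, x):
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):  # independent rows: the natural parallelization target
        y[i] = sum(values[k] * x[col_idx[k]]
                   for k in range(row_ptr[i], row_ptr[i + 1]))
    return y

def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-12, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A stored in CSR."""
    n = len(b)
    x = [0.0] * n
    r = list(b)              # residual b - A x  (x starts at zero)
    p = list(r)              # search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        Ap = csr_spmv(values, col_idx, row_ptr, p)
        alpha = rs_old / sum(pi * ai for pi, ai in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# SPD example: A = [[4, 1], [1, 3]], b = [1, 2]
x = conjugate_gradient([4.0, 1.0, 1.0, 3.0], [0, 1, 0, 1], [0, 2, 4],
                       [1.0, 2.0])
```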


Nova Scientia ◽  
2018 ◽  
Vol 10 (20) ◽  
pp. 263-279
Author(s):  
Gerardo Mario Ortigoza Capetillo ◽  
Alberto Pedro Lorandi Medina ◽  
Alfonso Cuauhtemoc García Reynoso

Reverse Cuthill-McKee (RCM) reordering can be applied to either the edges or the elements of unstructured (triangular/tetrahedral) meshes, according to the respective finite element formulation, to reduce the bandwidth of stiffness matrices. Grid generators are mainly designed for nodal-based finite elements: their output is a list of nodes (2D or 3D) and an array describing element connectivity, be it triangles or tetrahedra. For edge-defined finite element formulations, however, a numbering of the edges is required. Observations are reported for the Triangle/Tetgen Delaunay grid generators and for the sparse structure of the assembled matrices in both edge- and element-defined formulations. RCM is a renumbering algorithm traditionally applied to the nodal graph of the mesh; thus, in order to apply this renumbering to either the edges or the elements of the respective finite element formulation, graphs of the mesh were generated. Significant bandwidth reduction was obtained, which translates into a reduction in the execution effort of the sparse matrix-times-vector product. The Compressed Sparse Row format was adopted, and the matrix-times-vector product was implemented in an OpenMP parallel routine.
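The classical RCM algorithm applied here can be sketched in a few lines (a minimal illustration on an adjacency list, not the paper's implementation): breadth-first search from a minimum-degree vertex, visiting neighbors in order of increasing degree, then reversing the visit order.

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """RCM ordering of an undirected graph given as an adjacency list."""
    n = len(adj)
    degree = [len(adj[v]) for v in range(n)]
    visited = [False] * n
    order = []
    # Start each component from a vertex of minimum degree.
    for start in sorted(range(n), key=lambda v: degree[v]):
        if visited[start]:
            continue
        visited[start] = True
        q = deque([start])
        while q:
            u = q.popleft()
            order.append(u)
            # Visit neighbors in order of increasing degree.
            for v in sorted(adj[u], key=lambda w: degree[w]):
                if not visited[v]:
                    visited[v] = True
                    q.append(v)
    return order[::-1]  # the "reverse" in Reverse Cuthill-McKee

# Small mesh-like graph; the permutation clusters each vertex's neighbors
# near it, shrinking the bandwidth of the assembled stiffness matrix.
adj = [[3], [2, 4], [1, 4], [0, 4], [1, 2, 3]]
perm = reverse_cuthill_mckee(adj)
print(perm)
```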


2018 ◽  
Vol 10 (1) ◽  
pp. 54-70
Author(s):  
Saira Banu Jamal Mohammed ◽  
M. Rajasekhara Babu ◽  
Sumithra Sriram

With the growth of data-parallel computing, the role of GPU computing in non-graphics applications such as image processing has become a focus of research. Convolution is an integral operation in filtering, smoothing, and edge detection. In this article, the process of convolution is realized as a sparse linear system and solved using Sparse Matrix Vector Multiplication (SpMV). The Compressed Sparse Row (CSR) format of SpMV shows better CPU performance than normal convolution. To overcome the stalling of threads on short rows in the GPU implementation of CSR SpMV, a more efficient model is proposed that uses the Adaptive-Compressed Row Storage (A-CSR) format. Using CSR in the convolution process achieves speedups of 1.45x and 1.159x over normal convolution for image smoothing and edge detection operations, respectively. On the GPU, using the adaptive CSR format, an average speedup of 2.05x is achieved for image smoothing and 1.58x for edge detection.
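The idea of casting convolution as a sparse linear system can be sketched in one dimension (an illustrative toy, not the article's code): each valid filter position becomes one row of a banded matrix, and convolving is then a single CSR SpMV.

```python
def conv_as_csr(kernel, n):
    """Build CSR arrays of the (valid-mode) convolution matrix for a
    1-D signal of length n, so that convolution becomes an SpMV."""
    k = len(kernel)
    values, col_idx, row_ptr = [], [], [0]
    for i in range(n - k + 1):          # one output row per valid position
        for j, w in enumerate(kernel):  # the row holds the kernel taps
            values.append(w)
            col_idx.append(i + j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_spmv(values, col_idx, row_ptr, x):
    n_rows = len(row_ptr) - 1
    return [sum(values[k] * x[col_idx[k]]
                for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(n_rows)]

# 3-tap binomial smoothing kernel applied to a 5-sample signal
vals, cols, ptr = conv_as_csr([0.25, 0.5, 0.25], 5)
signal = [4.0, 8.0, 4.0, 8.0, 4.0]
print(csr_spmv(vals, cols, ptr, signal))  # [6.0, 6.0, 6.0]
```

Note that every row of this matrix has exactly `len(kernel)` nonzeros; the short, uniform rows are precisely the case where plain CSR-vector GPU kernels stall, motivating the adaptive A-CSR layout.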


Engevista ◽  
2017 ◽  
Vol 19 (4) ◽  
pp. 1095
Author(s):  
Gylles Ricardo Ströher ◽  
Thays Rolim Mendes ◽  
Neyva Maria Lopes Romeiro

Matrix compression schemes make it possible to store sparse matrices in vectors so that only the nonzero elements are kept, providing a significant reduction in the computer memory required for storing sparse matrices. Among the existing schemes, the one implemented in this work was Compressed Sparse Row (CSR), which stores only the nonzero elements of the matrix in three vectors. The CSR scheme was implemented in combination with three iterative methods for solving linear systems: Jacobi, Gauss-Seidel, and Conjugate Gradient. The results indicate the minimum order and degree of sparsity at which the CSR scheme becomes advantageous in terms of reduced memory consumption, and they also show that, because operations on zero elements are suppressed, the processing time for solving sparse linear systems can be significantly reduced with the compression scheme explored.
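The point about suppressed zero operations is visible in a minimal Jacobi sweep over a CSR matrix (an illustrative sketch, not this work's code): each sweep costs one pass over the nonzeros only, rather than O(n²) dense operations.

```python
def jacobi_csr(values, col_idx, row_ptr, b, iters=50):
    """Jacobi iteration for A x = b with A in CSR form; zero entries are
    simply absent, so each sweep touches only the stored nonzeros."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x_new = [0.0] * n
        for i in range(n):
            sigma, diag = 0.0, 0.0
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = values[k]          # diagonal entry a_ii
                else:
                    sigma += values[k] * x[j]  # off-diagonal contributions
            x_new[i] = (b[i] - sigma) / diag
        x = x_new
    return x

# Diagonally dominant system A = [[4, 1], [1, 3]], b = [9, 5] -> x = [2, 1]
x = jacobi_csr([4.0, 1.0, 1.0, 3.0], [0, 1, 0, 1], [0, 2, 4], [9.0, 5.0])
print([round(v, 6) for v in x])  # [2.0, 1.0]
```

Gauss-Seidel differs only in updating `x` in place within the sweep, which reuses the same CSR traversal.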


2016 ◽  
Vol 2016 ◽  
pp. 1-12 ◽  
Author(s):  
Guixia He ◽  
Jiaquan Gao

Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV kernels on graphics processing units (GPUs), such as CSR-scalar and CSR-vector, usually perform poorly due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU, called PCSR. PCSR involves two kernels and accesses the CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR outperforms CSR-scalar, CSR-vector, and the CSRMV and HYBMV routines in the vendor-tuned CUSPARSE library, and is comparable with a recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR from a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that, whether or not inter-GPU communication is taken into account, PCSR on multiple GPUs achieves good performance and high parallel efficiency.


2016 ◽  
Vol 2016 ◽  
pp. 1-14 ◽  
Author(s):  
Jiaquan Gao ◽  
Panpan Qi ◽  
Guixia He

Sparse matrix-vector multiplication (SpMV) is an important operation in computational science and needs to be accelerated because it often represents the dominant cost in many widely used iterative methods and eigenvalue problems. We achieve this objective by proposing a novel SpMV algorithm based on the compressed sparse row (CSR) format on the GPU. Our method dynamically assigns different numbers of rows to each thread block and executes different optimized implementations depending on the number of rows each block involves. Accesses to the CSR arrays are fully coalesced, and the GPU's DRAM bandwidth is efficiently utilized by loading data into shared memory, which alleviates the bottleneck of many existing CSR-based algorithms (i.e., CSR-scalar and CSR-vector). Test results on C2050 and K20c GPUs show that our method outperforms the perfect-CSR algorithm that inspired this work, the vendor-tuned CUSPARSE V6.5 and CUSP V0.5.1 libraries, and three popular algorithms: clSpMV, CSR5, and CSR-Adaptive.

