General-purpose join algorithms for large graph triangle listing on heterogeneous systems

Author(s):  
Daniel Zinn ◽  
Haicheng Wu ◽  
Jin Wang ◽  
Molham Aref ◽  
Sudhakar Yalamanchili
1998 ◽  
Vol 1 (06) ◽  
pp. 567-574 ◽  
Author(s):  
S.H. Lee ◽  
L.J. Durlofsky ◽  
M.F. Lough ◽  
W.H. Chen

This paper (SPE 52637) was revised for publication from paper SPE 38002, first presented at the 1997 SPE Reservoir Simulation Symposium, Dallas, 8-11 June. Original manuscript received for review 1 July 1997. Revised manuscript received 5 August 1998. Paper peer approved 3 September 1998.

Summary. The gridblock permeabilities used in reservoir simulation are commonly determined through the upscaling of a fine scale geostatistical reservoir description. Though it is well established that permeabilities computed in this manner are, in general, full tensor quantities, most finite difference reservoir simulators still treat permeability as a diagonal tensor. In this paper, we implement a capability to handle full tensor permeabilities in a general purpose finite difference simulator and apply this capability to the modeling of several complex geological systems. We formulate a flux continuous approach for the pressure equation by use of a method analogous to that of previous researchers (Edwards and Rogers; Aavatsmark et al.), consider methods for upwinding in multiphase flow problems, and additionally discuss some relevant implementation and reservoir characterization issues. The accuracy of the finite difference formulation, assessed through comparisons to an accurate finite element approach, is shown to be generally good, particularly for immiscible displacements in heterogeneous systems. The formulation is then applied to the simulation of upscaled descriptions of several geologically complex reservoirs involving crossbedding and extensive fracturing. The method performs quite well for these systems and is shown to capture the effects of the underlying geology accurately. Finally, the significant errors that can be incurred through inaccurate representation of the full permeability tensor are demonstrated for several cases.
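To make the diagonal-versus-full-tensor distinction concrete, a minimal statement of the underlying Darcy flux in standard reservoir simulation notation (K for the permeability tensor, p for pressure, μ for viscosity; the symbols are conventional, not taken from the paper) is:

```latex
% Darcy flux with a full permeability tensor (2D, single phase):
\mathbf{u} \;=\; -\frac{1}{\mu}\,\mathbf{K}\,\nabla p,
\qquad
\mathbf{K} \;=\;
\begin{pmatrix}
k_{xx} & k_{xy}\\
k_{yx} & k_{yy}
\end{pmatrix}.
% A diagonal-tensor simulator sets k_{xy} = k_{yx} = 0, so the x-flux
% reduces to u_x = -(k_{xx}/\mu)\,\partial p/\partial x, and any
% cross-term coupling produced by upscaling is silently discarded.
```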


2018 ◽  
Vol 4 ◽  
pp. e160 ◽  
Author(s):  
Dragan D. Nikolić

Numerical solutions of equation-based simulations require computationally intensive tasks such as evaluation of model equations, linear algebra operations and solution of systems of linear equations. The focus in this work is on parallel evaluation of model equations on shared memory systems such as general purpose processors (multi-core CPUs and manycore devices), streaming processors (Graphics Processing Units and Field Programmable Gate Arrays) and heterogeneous systems. The current approaches for evaluation of model equations are reviewed and their capabilities and shortcomings analysed. Since stream computing differs from traditional computing in that the system processes a sequential stream of elements, equations must be transformed into a data structure suitable for both types of processing. Postfix notation expression stacks are recognised as a platform- and programming-language-independent method to describe, store in computer memory and evaluate general systems of differential and algebraic equations of any size. Each mathematical operation and its operands are described by a specially designed data structure, and every equation is transformed into an array of these structures (a Compute Stack). Compute Stacks are evaluated by a stack machine using a Last In, First Out (LIFO) stack. The stack machine is implemented in the DAE Tools modelling software in the C99 language using two Application Programming Interfaces (APIs)/frameworks for parallelism. The Open Multi-Processing (OpenMP) API is used for parallelisation on general purpose processors, and the Open Computing Language (OpenCL) framework is used for parallelisation on streaming processors and heterogeneous systems. The performance of the sequential Compute Stack approach is compared to a direct C++ implementation and to the previous approach that uses evaluation trees. The new approach is 45% slower than the C++ implementation and more than five times faster than the previous one. The OpenMP and OpenCL implementations are tested on three medium-scale models using a multi-core CPU, a discrete GPU, an integrated GPU and heterogeneous computing setups. Execution times are compared and analysed, and the advantages of the OpenCL implementation running on a discrete GPU and on heterogeneous systems are discussed. It is found that the evaluation of model equations using the parallel OpenCL implementation running on a discrete GPU is up to twelve times faster than the sequential version, while the overall simulation speed-up gained is more than three times.
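As an illustration of the idea (not of the actual DAE Tools data layout, which the abstract does not specify), a minimal postfix "compute stack" evaluator might look like the following C++ sketch: each equation is stored as a flat array of operation records and evaluated left to right with a LIFO value stack.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One element of a "Compute Stack": either pushes a value or applies an
// operation to the top of the value stack. Field names are illustrative.
enum class Op { Constant, Variable, Add, Mul, Sin };

struct StackItem {
    Op     op;
    double value;  // used by Constant
    int    index;  // used by Variable
};

// Evaluate one equation, stored in postfix order, against a vector of
// variable values using a Last In, First Out value stack.
double evaluate(const std::vector<StackItem>& eq,
                const std::vector<double>& vars) {
    std::vector<double> s;  // LIFO value stack
    for (const StackItem& it : eq) {
        switch (it.op) {
            case Op::Constant: s.push_back(it.value); break;
            case Op::Variable: s.push_back(vars[it.index]); break;
            case Op::Sin:      s.back() = std::sin(s.back()); break;
            case Op::Add: { double b = s.back(); s.pop_back(); s.back() += b; break; }
            case Op::Mul: { double b = s.back(); s.pop_back(); s.back() *= b; break; }
        }
    }
    assert(s.size() == 1);
    return s.back();
}

int main() {
    // x0 * 2.0 + sin(x1) in postfix:  x0  2.0  *  x1  sin  +
    std::vector<StackItem> eq = {
        {Op::Variable, 0.0, 0}, {Op::Constant, 2.0, 0}, {Op::Mul, 0.0, 0},
        {Op::Variable, 0.0, 1}, {Op::Sin, 0.0, 0},      {Op::Add, 0.0, 0},
    };
    std::vector<double> vars = {3.0, 0.0};
    return evaluate(eq, vars) == 6.0 ? 0 : 1;  // 3*2 + sin(0) = 6
}
```

Because each equation is a flat array of identical records, the same representation can be walked by an OpenMP loop over equations or copied unchanged into an OpenCL device buffer, which is the portability argument the abstract makes.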


2020 ◽  
Vol 14 (4) ◽  
pp. 708-720
Author(s):  
Ran Rui ◽  
Hao Li ◽  
Yi-Cheng Tu

Relational join processing is one of the core functionalities in database management systems. It has been demonstrated that GPUs, as a general-purpose parallel computing platform, are very promising for processing relational joins. However, join algorithms often need to handle very large input data, an issue that was not sufficiently addressed in existing work. Besides, as more and more desktop and workstation platforms support multi-GPU environments, the combined computing capability of multiple GPUs can easily match that of a computing cluster. It is therefore worth exploring how join processing can benefit from the adoption of multiple GPUs. We identify the low rate and complex patterns of data transfer between the CPU and GPUs as the main challenges in designing efficient algorithms for large table joins. To overcome these challenges, we propose three distinctive designs of multi-GPU join algorithms, namely nested loop, global sort-merge and hybrid joins, for large table joins with different join conditions. Extensive experiments running on multiple databases and two different hardware configurations demonstrate the high scalability of our algorithms over data size and the significant performance boost brought by the use of multiple GPUs. Furthermore, our algorithms achieve much better performance than existing join algorithms, with speedups of up to 25X and 2.8X over the best known code developed for multi-core CPUs and GPUs, respectively.
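The abstract gives no code, but the chunking idea behind a nested-loop variant for tables larger than device memory can be sketched in plain C++: split both tables into chunks that fit on one device, then deal chunk pairs round-robin to the available GPUs. The sketch below is CPU-only and illustrative; it is not the paper's algorithm, and the comments mark where real code would stage transfers and launch kernels per device.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct Row { int key; int payload; };

// Join one pair of chunks; on a GPU this would be the kernel body, with
// both chunks staged in device memory beforehand.
static void joinChunks(const std::vector<Row>& r, const std::vector<Row>& s,
                       std::vector<std::pair<Row, Row>>& out) {
    for (const Row& a : r)
        for (const Row& b : s)
            if (a.key == b.key) out.push_back({a, b});
}

// Chunked nested-loop join: chunkRows is chosen so two chunks plus the
// result buffer fit in one GPU's memory; chunk pairs are assigned
// round-robin across numGpus devices.
std::vector<std::pair<Row, Row>>
nestedLoopJoin(const std::vector<Row>& R, const std::vector<Row>& S,
               size_t chunkRows, int numGpus) {
    std::vector<std::pair<Row, Row>> out;
    size_t pairIdx = 0;
    for (size_t i = 0; i < R.size(); i += chunkRows) {
        std::vector<Row> rc(R.begin() + i,
                            R.begin() + std::min(i + chunkRows, R.size()));
        for (size_t j = 0; j < S.size(); j += chunkRows, ++pairIdx) {
            std::vector<Row> sc(S.begin() + j,
                                S.begin() + std::min(j + chunkRows, S.size()));
            int device = pairIdx % numGpus;  // round-robin device choice
            (void)device;  // real code: select device, copy chunks, launch kernel
            joinChunks(rc, sc, out);
        }
    }
    return out;
}
```

The chunk pairs are mutually independent, which is what makes the scheme scale across devices: each GPU works on its own pair while the CPU's job reduces to scheduling transfers, exactly the bottleneck the paper identifies.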


2003 ◽  
Vol 1 ◽  
pp. 171-175
Author(s):  
T. von Sydow ◽  
H. Blume ◽  
T. G. Noll

Abstract. Technology progress, flexibility demands, and shortened product cycle times and time to market have made it both possible and necessary to integrate different architecture blocks on one heterogeneous System-on-Chip (SoC). Architecture blocks such as programmable processor cores (DSP and GPP kernels), embedded FPGAs, and dedicated macros will be integral parts of such a SoC. This contribution discusses programmable architecture blocks in particular, together with their associated optimization techniques. Design space exploration, and thus the choice of which architecture blocks should be integrated in a SoC, is a challenging task. Crucial to this exploration is the evaluation of the application domain characteristics and of the costs incurred by the individual architecture blocks integrated on a SoC. An ATE-cost function has been applied to examine the performance of the aforementioned programmable architecture blocks; to this end, representative discrete devices have been analyzed. Furthermore, several architecture-dependent optimization steps and their effects on the cost ratios are presented.
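The abstract does not define the ATE-cost function. Assuming the common reading of ATE as an area-time-energy product (a stated assumption here, not a claim about the paper), a cost comparison between candidate architecture blocks could be sketched as follows, with purely illustrative numbers:

```cpp
#include <cstdio>

// Hypothetical figures of merit for one architecture block running one
// benchmark kernel. Treating ATE as area x time x energy is an
// assumption; the paper's exact cost function is not given in the abstract.
struct BlockCost {
    const char* name;
    double area_mm2;   // silicon area
    double time_s;     // execution time of the kernel
    double energy_j;   // energy consumed by the kernel
};

double ateCost(const BlockCost& b) {
    return b.area_mm2 * b.time_s * b.energy_j;  // lower is better
}

int main() {
    // Illustrative numbers only, not measurements from the paper.
    BlockCost blocks[] = {
        {"GPP core",        10.0, 1.0e-3, 2.0e-3},
        {"DSP core",         6.0, 4.0e-4, 8.0e-4},
        {"embedded FPGA",   20.0, 1.0e-4, 5.0e-4},
        {"dedicated macro",  1.5, 2.0e-5, 4.0e-5},
    };
    for (const BlockCost& b : blocks)
        std::printf("%-16s ATE = %.3e\n", b.name, ateCost(b));
    return 0;
}
```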


Author(s):  
Mayank Bhura ◽  
Pranav H. Deshpande ◽  
K. Chandrasekaran

Usage of General Purpose Graphics Processing Units (GPGPUs) in high-performance computing is increasing as heterogeneous systems continue to become dominant. CUDA has been the programming environment for nearly all such NVIDIA GPU-based GPGPU applications, but the framework runs only on NVIDIA GPUs; utilizing other available computing devices requires reimplementation in a different framework. OpenCL provides a vendor-neutral and open programming environment, with implementations available for CPUs, GPUs, and other types of accelerators; OpenCL can thus be regarded as a write-once, run-anywhere framework. Nevertheless, both frameworks have their own pros and cons. This chapter presents a comparison of the performance of the CUDA and OpenCL frameworks, using an algorithm that finds the sum of all possible triple products over a list of integers, implemented on GPUs.
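The benchmark itself is easy to state precisely. One natural reading of "all possible triple products" is the sum over unordered triples i < j < k (the chapter may define it differently); under that assumption, a sequential C++ reference of the workload looks like this, and the CUDA and OpenCL versions would distribute the outer loop over threads:

```cpp
#include <cstdint>
#include <vector>

// Sum of all triple products a[i]*a[j]*a[k] with i < j < k.
// O(n^3) reference; each unordered triple is counted exactly once.
int64_t tripleProductSum(const std::vector<int>& a) {
    int64_t sum = 0;
    for (size_t i = 0; i + 2 < a.size(); ++i)
        for (size_t j = i + 1; j + 1 < a.size(); ++j)
            for (size_t k = j + 1; k < a.size(); ++k)
                sum += int64_t(a[i]) * a[j] * a[k];
    return sum;
}

int main() {
    // {1,2,3,4}: 1*2*3 + 1*2*4 + 1*3*4 + 2*3*4 = 6 + 8 + 12 + 24 = 50
    std::vector<int> a = {1, 2, 3, 4};
    return tripleProductSum(a) == 50 ? 0 : 1;
}
```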


2017 ◽  
Vol 2017 ◽  
pp. 1-19 ◽  
Author(s):  
Gabriel A. León-Paredes ◽  
Liliana I. Barbosa-Santillán ◽  
Juan J. Sánchez-Escobar

Latent Semantic Analysis (LSA) is a method that allows us to automatically index and retrieve information from a set of objects by reducing the term-by-document matrix using the Singular Value Decomposition (SVD) technique. However, LSA has a high computational cost when analyzing large amounts of information. The goals of this work are (i) to improve the execution time of the semantic space construction, dimensionality reduction, and information retrieval stages of LSA by means of heterogeneous systems and (ii) to evaluate the accuracy and recall of the information retrieval stage. We present a heterogeneous Latent Semantic Analysis (hLSA) system that combines a General-Purpose computing on Graphics Processing Units (GPGPU) architecture, which can solve large numeric problems faster through thousands of concurrent threads on the multiple CUDA cores of GPUs, with a multi-CPU architecture, which can solve large text problems faster through a multiprocessing environment. We execute the hLSA system with documents from the PubMed Central (PMC) database. The results of the experiments show that, for large matrices with one hundred and fifty thousand million (150 billion) values, the hLSA system is around eight times faster than the standard LSA version, with an accuracy of 88% and a recall of 100%.
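The LSA pipeline the abstract accelerates can be written compactly. With A the m×n term-by-document matrix, the rank-k truncated SVD and the usual fold-in of a query vector q are (standard LSA formulas, not notation taken from the paper):

```latex
A \;\approx\; A_k \;=\; U_k \,\Sigma_k\, V_k^{\mathsf T},
\qquad
U_k \in \mathbb{R}^{m\times k},\;
\Sigma_k \in \mathbb{R}^{k\times k},\;
V_k \in \mathbb{R}^{n\times k}.
% A query q (a term vector) is projected into the k-dimensional
% semantic space and compared to the document vectors (rows of V_k)
% by cosine similarity:
\hat{q} \;=\; \Sigma_k^{-1} U_k^{\mathsf T} q,
\qquad
\operatorname{sim}(\hat{q}, d_i) \;=\;
\frac{\hat{q} \cdot d_i}{\lVert \hat{q}\rVert\,\lVert d_i\rVert}.
```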


1998 ◽  
Vol 37 (04/05) ◽  
pp. 518-526 ◽  
Author(s):  
D. Sauquet ◽  
M.-C. Jaulent ◽  
E. Zapletal ◽  
M. Lavril ◽  
P. Degoulet

Abstract. The rapid development of community health information networks raises the issue of semantic interoperability between distributed and heterogeneous systems. Indeed, operational health information systems originate from heterogeneous teams of independent developers and have to cooperate in order to exchange data and services. Good cooperation is based on a good understanding of the messages exchanged between the systems. The main issue of semantic interoperability is to ensure that the exchange is not only possible but also meaningful. The main objective of this paper is to analyze semantic interoperability from a software engineering point of view. It describes the principles for the design of a semantic mediator (SM) in the framework of a distributed object manager (DOM). The mediator is itself a component that should allow the exchange of messages independently of languages and platforms. The functional architecture of such an SM is detailed. These principles have been partly applied in the context of the HELIOS object-oriented software engineering environment. The resulting service components are presented with their current state of achievement.
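None of the HELIOS interfaces are given in the abstract, so the following is a purely hypothetical C++ illustration of what a language-neutral mediator boundary might look like: messages are opaque string-valued records, and the mediator's only job is to map one system's codes into another's, making "possible but not meaningful" exchanges detectable.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical message: field name -> coded value, both as strings, so
// the exchange stays independent of sender/receiver languages and platforms.
using Message = std::map<std::string, std::string>;

// A minimal semantic mediator: rewrites codes from the sender's
// vocabulary into the receiver's before the message is delivered.
class SemanticMediator {
public:
    void addMapping(const std::string& from, const std::string& to) {
        table_[from] = to;
    }
    // Returns the translated message, or nothing if a code is unknown,
    // signalling an exchange that would be possible but not meaningful.
    std::optional<Message> translate(const Message& in) const {
        Message out;
        for (const auto& [field, code] : in) {
            auto it = table_.find(code);
            if (it == table_.end()) return std::nullopt;
            out[field] = it->second;
        }
        return out;
    }
private:
    std::map<std::string, std::string> table_;
};
```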

