Compiling a High-Level Directive-Based Programming Model for GPGPUs

Author(s):  
Xiaonan Tian ◽  
Rengan Xu ◽  
Yonghong Yan ◽  
Zhifeng Yun ◽  
Sunita Chandrasekaran ◽  
...  
Author(s):  
Breno A. de Melo Menezes ◽  
Nina Herrmann ◽  
Herbert Kuchen ◽  
Fernando Buarque de Lima Neto

Parallel implementations of swarm intelligence algorithms such as ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When targeting a GPU environment, developing efficient parallel versions of such algorithms with CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of algorithmic skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold and zip) that are later converted into efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain-specific language Musket can support the development of a parallel implementation of ACO and how the result compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high-performance code with execution times similar to those of low-level implementations.
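To make the skeleton idea concrete, the following sketch shows a plain C++ map and fold skeleton applied to an ACO-style per-edge computation. It is not Musket syntax (Musket would generate the parallel GPU code from similar high-level patterns), and all names and values are illustrative assumptions.

```cpp
// Hedged sketch (not Musket syntax): a plain C++ "map" and "fold" skeleton,
// illustrating the kind of pattern Musket lets the programmer write at a high
// level and then compiles to parallel code. All names and values are illustrative.
#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// map: apply f independently to every element (the part a skeleton library can offload).
template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& in, F f) {
    std::vector<T> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), f);
    return out;
}

// fold: combine all elements with a binary operation.
template <typename T, typename F>
T fold_skeleton(const std::vector<T>& in, T init, F f) {
    return std::accumulate(in.begin(), in.end(), init, f);
}

int main() {
    // ACO-flavoured toy use: map computes a per-edge desirability (here simply the
    // squared pheromone value as a stand-in), fold sums it so an ant could normalise
    // its selection probabilities.
    std::vector<double> pheromone{0.5, 1.2, 0.9};
    auto desirability = map_skeleton(pheromone, [](double tau) { return tau * tau; });
    double total = fold_skeleton(desirability, 0.0, std::plus<double>{});
    return total > 0.0 ? 0 : 1;
}
```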


Author(s):  
Rong Gu ◽  
Zhixiang Zhang ◽  
Zhihao Xu ◽  
Zhaokang Wang ◽  
Kai Zhang ◽  
...  

2021 ◽  
Vol 251 ◽  
pp. 03032
Author(s):  
Haiwang Yu ◽  
Zhihua Dong ◽  
Kyle Knoepfel ◽  
Meifeng Lin ◽  
Brett Viren ◽  
...  

The Liquid Argon Time Projection Chamber (LArTPC) technology plays an essential role in many current and future neutrino experiments. Accurate and fast simulation is critical to developing efficient analysis algorithms and precise physics model projections. The speed of simulation becomes even more important as deep learning algorithms are more widely used in LArTPC analysis, since their training requires large simulated datasets. Heterogeneous computing is an efficient way to delegate computationally intensive tasks to specialized hardware. However, as the landscape of compute accelerators evolves quickly, it becomes increasingly difficult to manually adapt the code to the latest hardware or software environments. A solution that is portable to multiple hardware architectures without substantially compromising performance would thus be very beneficial, especially for long-term projects such as the LArTPC simulations. In search of a portable, scalable and maintainable software solution for LArTPC simulations, we have started to explore high-level portable programming frameworks that support several hardware backends. In this paper, we present our experience porting the LArTPC simulation code in the Wire-Cell Toolkit to NVIDIA GPUs, first with the CUDA programming model and then with a portable library called Kokkos. Preliminary performance results on NVIDIA V100 GPUs and multi-core CPUs are presented, followed by a discussion of the factors affecting the performance and plans for future improvements.
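The snippet below is not taken from the Wire-Cell Toolkit; it is a minimal, hedged sketch of the Kokkos pattern (a View plus a parallel_for) that lets one source file run on a CUDA device or a multi-core CPU depending on the chosen backend. The array name `charge` and the kernel body are illustrative assumptions.

```cpp
// Hedged sketch only, not Wire-Cell code: a minimal Kokkos kernel showing the
// portable pattern that targets either GPU or CPU backends from one source file.
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // Allocated in the default execution space's memory (device memory for a CUDA build).
        Kokkos::View<double*> charge("charge", n);

        // The loop body is compiled for whichever backend Kokkos was built with.
        Kokkos::parallel_for("toy_kernel", n, KOKKOS_LAMBDA(const int i) {
            charge(i) = 0.5 * static_cast<double>(i);
        });
        Kokkos::fence();  // wait for the device before the View goes out of scope
    }
    Kokkos::finalize();
    return 0;
}
```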


2021 ◽  
Vol 24 (1) ◽  
pp. 157-183
Author(s):  
Никита Андреевич Катаев

Automation of parallel programming is important at any stage of parallel program development. These stages include profiling of the original program; program transformation, which allows higher performance to be achieved after parallelization; and, finally, construction and optimization of the parallel program. It is also important to choose a suitable parallel programming model to express the parallelism available in a program. On the one hand, the parallel programming model should be capable of mapping the parallel program to a variety of existing hardware resources. On the other hand, it should simplify the development of assistant tools and allow the user to explore, in a semi-automatic way, the parallel program that the assistant tools generate. The SAPFOR (System FOR Automated Parallelization) system combines various approaches to the automation of parallel programming and allows the user to guide the parallelization if necessary. SAPFOR produces parallel programs according to the high-level DVMH parallel programming model, which simplifies the development of efficient parallel programs for heterogeneous computing clusters. This paper focuses on the approach to semi-automatic parallel programming that SAPFOR implements. We discuss the architecture of the system and present the interactive subsystem used to guide SAPFOR through program parallelization. We used the interactive subsystem to parallelize programs from the NAS Parallel Benchmarks in a semi-automatic way. Finally, we compare the performance of manually written parallel programs with that of the programs built by SAPFOR.


2019 ◽  
Vol 67 ◽  
pp. 01010
Author(s):  
Igor Posokhov ◽  
Victoriia Cherepanova ◽  
Olha Podrez

The real sector of the Ukrainian economy faces a set of problems that hold back its rapid development: a high level of wear of productive assets, a lack of modern equipment, outdated technologies, inadequate environmental measures, a high rate of occupational injury, etc. All this requires the design of new tools to manage the development of such important sectors of the economy as industry and rail transport. Therefore, the urgent issues at the current stage of development of these industries include defining the conditions for fixed assets capitalization and the sources of its financing. The scientific novelty of the results lies in the identification and justification of the main components of capitalization, and in determining the sources of funding and the mechanism for attracting them. A tool for managing the capitalization of productive and environmental protection assets has been designed and is optimized using a two-dimensional dynamic programming model. The results obtained form the basis for a practical solution of the problem and for further scientific research. This approach allows the problem of capitalization of rail transport and industrial enterprises to be solved in a comprehensive manner, which contributes to their sustainable development.


Energies ◽  
2020 ◽  
Vol 13 (23) ◽  
pp. 6214
Author(s):  
Sara Ceschia ◽  
Luca Di Gaspero ◽  
Antonella Meneghetti

In recent years, cold food chains have shown impressive growth, mainly due to changes in customers' lifestyles. Consequently, the transportation of refrigerated food is becoming a crucial aspect of the chain, aiming at ensuring the efficiency and sustainability of the process while keeping a high level of product quality. The recently defined Refrigerated Routing Problem (RRP) consists of finding the optimal delivery tour that minimises the fuel consumption of both the traction and the refrigeration components. The total fuel consumption is related, in a complex way, to the distance travelled, the vehicle load and speed, and the outdoor temperature. All these factors depend, in turn, on the traffic and climate conditions of the region where deliveries take place, and they change during the day and over the year. The original RRP has been extended to also take into account the total driving cost and to allow deliveries to be slowed down through arbitrarily long waiting times when this is beneficial for the objective function. The new RRP is formulated and solved as both a Mixed Integer Programming and a novel Constraint Programming model. Moreover, a Local Search metaheuristic (namely Late Acceptance Hill Climbing), based on a combination of different neighborhood structures, is also proposed. The results obtained by the different solution methods on a set of benchmark scenarios are compared and discussed.
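The Late Acceptance Hill Climbing loop itself is simple. The sketch below is a generic, hedged illustration of that loop in C++, not the paper's RRP implementation: the fuel-consumption cost model and the RRP neighbourhood structures are not reproduced, and `cost`, `random_neighbour` and the toy usage in `main()` are hypothetical placeholders.

```cpp
// Hedged sketch of the Late Acceptance Hill Climbing (LAHC) loop in generic C++.
#include <cmath>
#include <functional>
#include <random>
#include <vector>

template <typename Solution>
Solution lahc(Solution current,
              std::function<double(const Solution&)> cost,
              std::function<Solution(const Solution&, std::mt19937&)> random_neighbour,
              std::size_t history_length,
              std::size_t iterations,
              unsigned seed = 42) {
    std::mt19937 rng(seed);
    double current_cost = cost(current);
    Solution best = current;
    double best_cost = current_cost;
    // The "late acceptance" list: costs seen history_length iterations ago.
    std::vector<double> history(history_length, current_cost);

    for (std::size_t k = 0; k < iterations; ++k) {
        Solution candidate = random_neighbour(current, rng);
        double candidate_cost = cost(candidate);
        std::size_t v = k % history_length;
        // Accept if not worse than the current cost OR not worse than the cost
        // stored history_length steps earlier (the defining LAHC rule).
        if (candidate_cost <= current_cost || candidate_cost <= history[v]) {
            current = candidate;
            current_cost = candidate_cost;
            if (current_cost < best_cost) { best = current; best_cost = current_cost; }
        }
        // One common update rule: keep the better of the stored and current cost.
        if (current_cost < history[v]) history[v] = current_cost;
    }
    return best;
}

int main() {
    // Toy usage: minimise (x - 3)^2 over the reals by Gaussian perturbation.
    auto cost = [](const double& x) { return (x - 3.0) * (x - 3.0); };
    auto neighbour = [](const double& x, std::mt19937& rng) {
        std::normal_distribution<double> step(0.0, 0.5);
        return x + step(rng);
    };
    double result = lahc<double>(0.0, cost, neighbour, 50, 10000);
    return std::fabs(result - 3.0) < 0.5 ? 0 : 1;
}
```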


Constraints ◽  
2020 ◽  
Vol 25 (3-4) ◽  
pp. 319-337 ◽  
Author(s):  
Mark Wallace ◽  
Neil Yorke-Smith

The cyclic hoist scheduling problem (CHSP) is a well-studied optimisation problem due to its importance in industry. Despite the wide range of solving techniques applied to the CHSP and its variants, the models have remained complicated and inflexible, or have failed to scale up with larger problem instances. This article re-examines modelling of the CHSP and proposes a new simple, flexible constraint programming formulation. We compare current state-of-the-art solving technologies on this formulation, and show that modelling in a high-level constraint language, MiniZinc, leads to both a simple, generic model and to computational results that outperform the state of the art. We further demonstrate that combining integer programming and lazy clause generation, using the multiple cores of modern processors, has potential to improve over either solving approach alone.


2013 ◽  
Vol 380-384 ◽  
pp. 1338-1341
Author(s):  
Yu Liu ◽  
Yi Xiao

In order to improve the efficiency of solving the maze optimal routing problem, the GPU-accelerated programming model OpenACC is used in this paper. By analyzing an ant-colony-based algorithm that solves the maze problem, we map its tasks onto this model. Through GPU acceleration, the ant colony search process is transformed into parallel matrix operations. To decrease memory access overhead and increase speed, the data are organized and stored to suit the GPU. Experiments on maze matrices of different sizes show that the parallel algorithm greatly reduces the running time, and the speedup increases with the matrix size; in our experiments, the maximum speedup is about 6.1. By adding efficient OpenACC directives to the serial code and organizing the data structures for parallel access, the algorithm can handle larger matrices with a high level of processing performance.
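As an illustration of this approach (annotating a serial loop nest with OpenACC directives and laying the data out for parallel access), the following hedged sketch offloads a doubly nested matrix update of the kind used for ant-colony choice values; the arrays and the update formula are illustrative assumptions, not the paper's code.

```cpp
// Hedged sketch, not the paper's code: OpenACC directives offloading a matrix update.
void update_choice_matrix(const double *pheromone, const double *heuristic,
                          double *choice, int n) {
    // Stage the inputs on the GPU, run the loop nest in parallel, copy the result back.
    #pragma acc data copyin(pheromone[0:n*n], heuristic[0:n*n]) copyout(choice[0:n*n])
    {
        #pragma acc parallel loop collapse(2)
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                // Row-major layout so consecutive threads touch consecutive elements.
                choice[i * n + j] = pheromone[i * n + j] * heuristic[i * n + j];
            }
        }
    }
}
```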


2012 ◽  
Vol 21 (07) ◽  
pp. 1250058
Author(s):  
BINGBING XIA ◽  
FEI QIAO ◽  
ZIDONG DU ◽  
DI ZHU ◽  
HUAZHONG YANG

The H.264 video decoder is a good choice for embedded video processing applications because of its higher compression ratio than MPEG2, although it has higher run-time computational requirements. Multi-core systems are the future of embedded processor design thanks to their power efficiency and multi-thread parallelization capability, and they fit well with the requirements of such video processing algorithms. To simulate and evaluate the performance of these multi-core systems effectively, a system-level design flow is developed. At the higher level, the combination of a transaction-level modeling (TLM) language (SystemC) and a shared-memory parallel programming model (OpenMP) is used for transaction-level simulation; at the lower level, a multi-core simulator based on an extension of the SimpleScalar 3.0 tool set is developed for cycle-accurate simulation. Compared with other high-level simulation methods, ours is able to perform truly parallel simulation. Moreover, experiments show that this simulation methodology can effectively simulate complex multi-core applications in a short time (much less than RTL-level simulation) to determine an appropriate core count and task allocation strategy, and the results deviate by less than 15% from the ideal values calculated from Amdahl's Law, so the parallelization strategy obtained from this simulation can be further applied to the RTL-level design of the final multi-core system.
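For reference, the ideal speedup mentioned above is given by Amdahl's Law; writing p for the parallelizable fraction of the work and N for the number of cores,

\[ S(N) = \frac{1}{(1 - p) + p/N} \]

so, for example, p = 0.9 on N = 4 cores gives S(4) = 1/(0.1 + 0.225) ≈ 3.1, and the simulated core counts and task allocations reported above stay within 15% of such ideal values.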

