scholarly journals Balanced Sparsity for Efficient DNN Inference on GPU

Author(s):  
Zhuliang Yao ◽  
Shijie Cao ◽  
Wencong Xiao ◽  
Chen Zhang ◽  
Lanshun Nie

In trained deep neural networks, unstructured pruning can reduce redundant weights to lower storage cost. However, it requires the customization of hardwares to speed up practical inference. Another trend accelerates sparse model inference on general-purpose hardwares by adopting coarse-grained sparsity to prune or regularize consecutive weights for efficient computation. But this method often sacrifices model accuracy. In this paper, we propose a novel fine-grained sparsity approach, Balanced Sparsity, to achieve high model accuracy with commercial hardwares efficiently. Our approach adapts to high parallelism property of GPU, showing incredible potential for sparsity in the widely deployment of deep learning services. Experiment results show that Balanced Sparsity achieves up to 3.1x practical speedup for model inference on GPU, while retains the same high model accuracy as finegrained sparsity.

Author(s):  
Mário Pereira Vestias

High-performance reconfigurable computing systems integrate reconfigurable technology in the computing architecture to improve performance. Besides performance, reconfigurable hardware devices also achieve lower power consumption compared to general-purpose processors. Better performance and lower power consumption could be achieved using application-specific integrated circuit (ASIC) technology. However, ASICs are not reconfigurable, turning them application specific. Reconfigurable logic becomes a major advantage when hardware flexibility permits to speed up whatever the application with the same hardware module. The first and most common devices utilized for reconfigurable computing are fine-grained FPGAs with a large hardware flexibility. To reduce the performance and area overhead associated with the reconfigurability, coarse-grained reconfigurable solutions has been proposed as a way to achieve better performance and lower power consumption. In this chapter, the authors provide a description of reconfigurable hardware for high-performance computing.


Author(s):  
Mário Pereira Vestias

High-Performance Reconfigurable Computing systems integrate reconfigurable technology in the computing architecture to improve performance. Besides performance, reconfigurable hardware devices also achieve lower power consumption compared to General-Purpose Processors. Better performance and lower power consumption could be achieved using Application Specific Integrated Circuit (ASIC) technology. However, ASICs are not reconfigurable, turning them application specific. Reconfigurable logic becomes a major advantage when hardware flexibility permits to speed up whatever the application with the same hardware module. The first and most common devices utilized for reconfigurable computing are fine-grained FPGAs with a large hardware flexibility. To reduce the performance and area overhead associated with the reconfigurability, coarse-grained reconfigurable solutions has been proposed as a way to achieve better performance and lower power consumption. In this chapter we will provide a description of reconfigurable hardware for high performance computing.


Author(s):  
JIANYONG CHEN ◽  
QIUZHEN LIN ◽  
QINGBIN HU

In this paper, a novel clonal algorithm applied in multiobjecitve optimization (NCMO) is presented, which is designed from the improvement of search operators, i.e. dynamic mutation probability, dynamic simulated binary crossover (D-SBX) operator and hybrid mutation operator combining with Gaussian and polynomial mutations (GP-HM) operator. The main notion of these approaches is to perform more coarse-grained search at initial stage in order to speed up the convergence toward the Pareto-optimal front. Once the solutions are getting close to the Pareto-optimal front, more fine-grained search is performed in order to reduce the gaps between the solutions and the Pareto-optimal front. Based on this purpose, a cooling schedule is adopted in these approaches, reducing the parameters gradually to a minimal threshold, the aim of which is to keep a desirable balance between fine-grained search and coarse-grained search. By this means, the exploratory capabilities of NCMO are enhanced. When compared with various state-of-the-art multiobjective optimization algorithms developed recently, simulation results show that NCMO has remarkable performance.


Author(s):  
K. Liagkouras ◽  
K. Metaxiotis

In this paper, we present a novel Interval-Based Mutation (IBMU) operator. The proposed mutation operator is performing coarse-grained search at initial stage in order to speed up convergence toward more promising regions of the search landscape. Then, more fine-grained search is performed in order to guide the solutions towards the Pareto front. Computational experiments indicate that the proposed mutation operator performs better than conventional approaches for solving several well-known benchmarking problems.


Author(s):  
Vladimir Alexandrovich Frolov ◽  
Vadim Sanzharov ◽  
Vladimir Alexandrovich Galaktionov ◽  
Alexandr Scherbakov

We propose a novel high-level approach for software development on GPU using Vulkan API. Our goal is to speed-up development and performance studies for complex algorithms on GPU, which is quite difficult and laborious for Vulkan due to large number of HW features low level details. The proposed approach uses auto programming to translate ordinary C++ to optimized Vulkan implementation with automatic shaders generation, resource binding and fine-grained barriers placement. Our model is not general-purpose programming, but is extendible and customer-focused. For a single C++ input our tool can generate multiple different implementations of algorithm in Vulkan for different cases or types of hardware. For example, we automatically detect reduction in C++ source code and then generate several variants of parallel reduction on GPU: with optimization for different warp size, with or without atomics, using or not subgroup operations. Another example is GPU ray tracing applications for which we can generate different variants: pure software implementation in compute shader, using hardware accelerated ray queries, using full RTX pipeline. The goal of our work is to increase productivity of developers who are forced to use Vulkan due to various required hardware features in their software but still do care about cross-platform ability of the developed software and want to debug their algorithm logic on the CPU. Therefore, we assume that the user will take generated code and integrate it with hand-written Vulkan code.


2008 ◽  
Vol 363 (1512) ◽  
pp. 3977-3984 ◽  
Author(s):  
Alexandros Stamatakis ◽  
Michael Ott

The continuous accumulation of sequence data, for example, due to novel wet-laboratory techniques such as pyrosequencing, coupled with the increasing popularity of multi-gene phylogenies and emerging multi-core processor architectures that face problems of cache congestion, poses new challenges with respect to the efficient computation of the phylogenetic maximum-likelihood (ML) function. Here, we propose two approaches that can significantly speed up likelihood computations that typically represent over 95 per cent of the computational effort conducted by current ML or Bayesian inference programs. Initially, we present a method and an appropriate data structure to efficiently compute the likelihood score on ‘gappy’ multi-gene alignments. By ‘gappy’ we denote sampling-induced gaps owing to missing sequences in individual genes (partitions), i.e. not real alignment gaps. A first proof-of-concept implementation in RAxML indicates that this approach can accelerate inferences on large and gappy alignments by approximately one order of magnitude. Moreover, we present insights and initial performance results on multi-core architectures obtained during the transition from an OpenMP-based to a Pthreads-based fine-grained parallelization of the ML function.


Author(s):  
Sebastião Miranda ◽  
Jonas Feldt ◽  
Frederico Pratas ◽  
Ricardo A Mata ◽  
Nuno Roma ◽  
...  

A novel perturbative Monte Carlo mixed quantum mechanics (QM)/molecular mechanics (MM) approach has been recently developed to simulate molecular systems in complex environments. However, the required accuracy to efficiently simulate such complex molecular systems is usually granted at the cost of long executing times. To alleviate this problem, a new parallelization strategy of multi-level Monte Carlo molecular simulations is herein proposed for heterogeneous systems. It simultaneously exploits fine-grained (at the data level), coarse-grained (at the Markov chain level) and task-grained (pure QM, pure MM and QM/MM procedures) parallelism to ensure an efficient execution in heterogeneous systems composed of central processing units and multiple and possibly different graphical processing units. This is achieved by making use of the OpenCL library, together with appropriate dynamic load balancing schemes. From the conducted evaluation with real benchmarking data, a speed-up of 56x in the computational bottleneck part was observed, which results in a global speed-up of 38x for the whole simulation, reducing the time of a typical simulation from 80 hours to only 2 hours.


2021 ◽  
Vol 8 (3) ◽  
pp. 1-18
Author(s):  
James Edwards ◽  
Uzi Vishkin

Boolean satisfiability (SAT) is an important performance-hungry problem with applications in many problem domains. However, most work on parallelizing SAT solvers has focused on coarse-grained, mostly embarrassing, parallelism. Here, we study fine-grained parallelism that can speed up existing sequential SAT solvers, which all happen to be of the so-called Conflict-Directed Clause Learning variety. We show the potential for speedups of up to 382× across a variety of problem instances. We hope that these results will stimulate future research, particularly with respect to a computer architecture open problem we present.


2012 ◽  
Vol 2012 ◽  
pp. 1-15 ◽  
Author(s):  
Ilia Lebedev ◽  
Christopher Fletcher ◽  
Shaoyi Cheng ◽  
James Martin ◽  
Austin Doupnik ◽  
...  

We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture. The key benefits of this approach are that it (i) allows programmers to express parallelism through an API defined in a high-level programming language, (ii) supports coarse-grained multithreading and fine-grained threading while permitting bit-level resource control, and (iii) reduces the effort required to repurpose the system for different algorithms or different applications. We compare template-driven design to both full-custom and programmable approaches by studying implementations of a compute-bound data-parallel Bayesian graph inference algorithm across several candidate platforms. Specifically, we examine a range of template-based implementations on both FPGA and ASIC platforms and compare each against full custom designs. Throughout this study, we use a general-purpose graphics processing unit (GPGPU) implementation as a performance and area baseline. We show that our approach, similar in productivity to programmable approaches such as GPGPU applications, yields implementations with performance approaching that of full-custom designs on both FPGA and ASIC platforms.


2015 ◽  
Vol 57 (3) ◽  
Author(s):  
Lars Bauer ◽  
Jörg Henkel ◽  
Andreas Herkersdorf ◽  
Michael A. Kochte ◽  
Johannes M. Kühn ◽  
...  

AbstractAchieving system-level dependability is a demanding task. The manifold requirements and dependability threats can no longer be statically addressed at individual abstraction layers. Instead, all components of future multi-processor systems-on-chip (MPSoCs) have to contribute to this common goal in an adaptive manner.In this paper we target a generic heterogeneous MPSoC that combines general purpose processors along with dedicated application-specific hard-wired accelerators, fine-grained reconfigurable processors, and coarse-grained reconfigurable architectures. We present different


Sign in / Sign up

Export Citation Format

Share Document