Balanced Sparsity for Efficient DNN Inference on GPU

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33015676 ◽

2019 ◽

Vol 33 ◽

pp. 5676-5683 ◽

Cited By ~ 3

Author(s):

Zhuliang Yao ◽

Shijie Cao ◽

Wencong Xiao ◽

Chen Zhang ◽

Lanshun Nie

Keyword(s):

Deep Neural Networks ◽

General Purpose ◽

Coarse Grained ◽

Efficient Computation ◽

Model Accuracy ◽

Sparse Model ◽

Model Inference ◽

Fine Grained ◽

Practical Inference ◽

Speed Up

In trained deep neural networks, unstructured pruning can remove redundant weights to lower storage cost. However, it requires customized hardware to speed up practical inference. Another trend accelerates sparse model inference on general-purpose hardware by adopting coarse-grained sparsity to prune or regularize consecutive weights for efficient computation, but this method often sacrifices model accuracy. In this paper, we propose a novel fine-grained sparsity approach, Balanced Sparsity, to achieve high model accuracy efficiently on commercial hardware. Our approach adapts to the high-parallelism property of GPUs, showing strong potential for sparsity in the wide deployment of deep learning services. Experimental results show that Balanced Sparsity achieves up to 3.1x practical speedup for model inference on GPU while retaining the same high model accuracy as fine-grained sparsity.
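The abstract does not spell out how the sparsity is balanced. As a minimal sketch, assume each weight row is split into equal-sized blocks and the same number of largest-magnitude weights is kept in every block, so that parallel GPU threads would each receive the same amount of work. The function name `balanced_prune` and its parameters `block_size` and `keep_per_block` are hypothetical and chosen only for illustration.

```python
def balanced_prune(row, block_size, keep_per_block):
    """Keep the `keep_per_block` largest-magnitude weights in each block of `row`.

    Assumption (not stated in the abstract): balance means every
    equal-sized block ends up with the same number of non-zero weights.
    """
    if len(row) % block_size != 0:
        raise ValueError("row length must be a multiple of block_size")
    pruned = []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        # Rank positions inside the block by absolute weight, largest first.
        ranked = sorted(range(block_size), key=lambda i: abs(block[i]), reverse=True)
        keep = set(ranked[:keep_per_block])
        # Zero out everything that was not selected.
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(block))
    return pruned


row = [0.9, -0.1, 0.3, -0.7, 0.05, 0.6, -0.2, 0.4]
print(balanced_prune(row, block_size=4, keep_per_block=2))
# → [0.9, 0.0, 0.0, -0.7, 0.0, 0.6, 0.0, 0.4]
```

Every block keeps exactly two non-zero weights, which is the property that lets work be divided evenly. Plain unstructured pruning with a single global threshold would not guarantee this.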

High-Performance Reconfigurable Computing

Advances in Computer and Electrical Engineering - Advanced Methodologies and Technologies in Network Architecture, Mobile Computing, and Data Analytics ◽

10.4018/978-1-5225-7598-6.ch053 ◽

2019 ◽

pp. 731-744

Author(s):

Mário Pereira Vestias

Keyword(s):

Power Consumption ◽

Integrated Circuit ◽

Reconfigurable Computing ◽

High Performance ◽

General Purpose ◽

Reconfigurable Hardware ◽

Coarse Grained ◽

Lower Power ◽

Fine Grained ◽

Application Specific

High-performance reconfigurable computing systems integrate reconfigurable technology in the computing architecture to improve performance. Besides performance, reconfigurable hardware devices also achieve lower power consumption compared to general-purpose processors. Better performance and lower power consumption could be achieved using application-specific integrated circuit (ASIC) technology. However, ASICs are not reconfigurable, which makes them application specific. Reconfigurable logic becomes a major advantage when hardware flexibility makes it possible to speed up any application with the same hardware module. The first and most common devices used for reconfigurable computing are fine-grained FPGAs, which offer great hardware flexibility. To reduce the performance and area overhead associated with reconfigurability, coarse-grained reconfigurable solutions have been proposed as a way to achieve better performance and lower power consumption. In this chapter, the authors provide a description of reconfigurable hardware for high-performance computing.

High-Performance Reconfigurable Computing

Encyclopedia of Information Science and Technology, Fourth Edition ◽

10.4018/978-1-5225-2255-3.ch348 ◽

2018 ◽

pp. 4018-4029

Author(s):

Mário Pereira Vestias

Keyword(s):

Power Consumption ◽

Integrated Circuit ◽

Reconfigurable Computing ◽

High Performance ◽

General Purpose ◽

Reconfigurable Hardware ◽

Coarse Grained ◽

Lower Power ◽

Fine Grained ◽

Application Specific

High-Performance Reconfigurable Computing systems integrate reconfigurable technology in the computing architecture to improve performance. Besides performance, reconfigurable hardware devices also achieve lower power consumption compared to General-Purpose Processors. Better performance and lower power consumption could be achieved using Application-Specific Integrated Circuit (ASIC) technology. However, ASICs are not reconfigurable, which makes them application specific. Reconfigurable logic becomes a major advantage when hardware flexibility makes it possible to speed up any application with the same hardware module. The first and most common devices used for reconfigurable computing are fine-grained FPGAs, which offer great hardware flexibility. To reduce the performance and area overhead associated with reconfigurability, coarse-grained reconfigurable solutions have been proposed as a way to achieve better performance and lower power consumption. In this chapter we will provide a description of reconfigurable hardware for high-performance computing.

APPLICATION OF NOVEL CLONAL ALGORITHM IN MULTIOBJECTIVE OPTIMIZATION

International Journal of Information Technology & Decision Making ◽

10.1142/s0219622010003804 ◽

2010 ◽

Vol 09 (02) ◽

pp. 239-266 ◽

Cited By ~ 19

Author(s):

JIANYONG CHEN ◽

QIUZHEN LIN ◽

QINGBIN HU

Keyword(s):

Multiobjective Optimization ◽

Coarse Grained ◽

Pareto Optimal ◽

Pareto Optimal Front ◽

Fine Grained ◽

Initial Stage ◽

Speed Up ◽

Main Notion ◽

Hybrid Mutation Operator ◽

Cooling Schedule

In this paper, a novel clonal algorithm applied in multiobjective optimization (NCMO) is presented, which is designed by improving the search operators, i.e. a dynamic mutation probability, a dynamic simulated binary crossover (D-SBX) operator and a hybrid mutation operator combining Gaussian and polynomial mutations (GP-HM). The main notion of these approaches is to perform a more coarse-grained search at the initial stage in order to speed up convergence toward the Pareto-optimal front. Once the solutions get close to the Pareto-optimal front, a more fine-grained search is performed in order to reduce the gaps between the solutions and the Pareto-optimal front. To this end, a cooling schedule is adopted in these approaches, reducing the parameters gradually to a minimal threshold, with the aim of keeping a desirable balance between fine-grained and coarse-grained search. By this means, the exploratory capabilities of NCMO are enhanced. Simulation results show that NCMO performs remarkably well compared with various recently developed state-of-the-art multiobjective optimization algorithms.
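The abstract describes the cooling schedule only as reducing parameters gradually to a minimal threshold. A minimal sketch of that idea follows, assuming a geometric decay per generation; the decay form, the function name `cooled_parameter` and all numeric values are assumptions for illustration, not the schedule used in NCMO.

```python
def cooled_parameter(initial, minimum, decay, generation):
    """Return a search parameter that shrinks each generation but never
    drops below `minimum`.

    Assumption: geometric decay (initial * decay**generation). The abstract
    only says the parameters are reduced gradually to a minimal threshold.
    """
    return max(minimum, initial * decay ** generation)


# Large early values favour coarse-grained search; the floor preserves
# a fine-grained search step in later generations.
schedule = [cooled_parameter(1.0, 0.1, 0.5, g) for g in range(6)]
print(schedule)
# → [1.0, 0.5, 0.25, 0.125, 0.1, 0.1]
```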

An Experimental Analysis of a New Interval-Based Mutation Operator

International Journal of Computational Intelligence and Applications ◽

10.1142/s1469026815500182 ◽

2015 ◽

Vol 14 (03) ◽

pp. 1550018 ◽

Cited By ~ 3

Author(s):

K. Liagkouras ◽

K. Metaxiotis

Keyword(s):

Experimental Analysis ◽

Pareto Front ◽

Computational Experiments ◽

Coarse Grained ◽

Mutation Operator ◽

Fine Grained ◽

Initial Stage ◽

Speed Up ◽

Better Than

In this paper, we present a novel Interval-Based Mutation (IBMU) operator. The proposed mutation operator performs a coarse-grained search at the initial stage in order to speed up convergence toward more promising regions of the search landscape. Then, a more fine-grained search is performed in order to guide the solutions toward the Pareto front. Computational experiments indicate that the proposed mutation operator performs better than conventional approaches on several well-known benchmark problems.

An Auto-Programming Approach to Vulkan

10.20948/graphicon-2021-3027-150-165 ◽

2021 ◽

Author(s):

Vladimir Alexandrovich Frolov ◽

Vadim Sanzharov ◽

Vladimir Alexandrovich Galaktionov ◽

Alexandr Scherbakov

Keyword(s):

Performance Studies ◽

General Purpose ◽

Software Implementation ◽

Programming Approach ◽

Fine Grained ◽

Speed Up ◽

Cross Platform ◽

Increase Productivity ◽

And Performance ◽

High Level

We propose a novel high-level approach for software development on GPU using the Vulkan API. Our goal is to speed up development and performance studies for complex algorithms on GPU, which is quite difficult and laborious in Vulkan due to the large number of hardware features and low-level details. The proposed approach uses auto-programming to translate ordinary C++ into an optimized Vulkan implementation with automatic shader generation, resource binding and fine-grained barrier placement. Our model is not general-purpose programming, but it is extensible and customer-focused. For a single C++ input, our tool can generate multiple different Vulkan implementations of an algorithm for different cases or types of hardware. For example, we automatically detect a reduction in the C++ source code and then generate several variants of parallel reduction on GPU: optimized for different warp sizes, with or without atomics, and with or without subgroup operations. Another example is GPU ray tracing applications, for which we can generate different variants: a pure software implementation in a compute shader, hardware-accelerated ray queries, or the full RTX pipeline. The goal of our work is to increase the productivity of developers who are forced to use Vulkan because their software requires various hardware features, but who still care about the cross-platform portability of the developed software and want to debug their algorithm logic on the CPU. Therefore, we assume that the user will take the generated code and integrate it with hand-written Vulkan code.

Efficient computation of the phylogenetic likelihood function on multi-gene alignments and multi-core architectures

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2008.0163 ◽

2008 ◽

Vol 363 (1512) ◽

pp. 3977-3984 ◽

Cited By ~ 70

Author(s):

Alexandros Stamatakis ◽

Michael Ott

Keyword(s):

Likelihood Function ◽

Sequence Data ◽

Computational Effort ◽

Efficient Computation ◽

Fine Grained ◽

Processor Architectures ◽

Order Of Magnitude ◽

Speed Up ◽

Performance Results ◽

Continuous Accumulation

The continuous accumulation of sequence data, for example, due to novel wet-laboratory techniques such as pyrosequencing, coupled with the increasing popularity of multi-gene phylogenies and emerging multi-core processor architectures that face problems of cache congestion, poses new challenges with respect to the efficient computation of the phylogenetic maximum-likelihood (ML) function. Here, we propose two approaches that can significantly speed up likelihood computations that typically represent over 95 per cent of the computational effort conducted by current ML or Bayesian inference programs. Initially, we present a method and an appropriate data structure to efficiently compute the likelihood score on ‘gappy’ multi-gene alignments. By ‘gappy’ we denote sampling-induced gaps owing to missing sequences in individual genes (partitions), i.e. not real alignment gaps. A first proof-of-concept implementation in RAxML indicates that this approach can accelerate inferences on large and gappy alignments by approximately one order of magnitude. Moreover, we present insights and initial performance results on multi-core architectures obtained during the transition from an OpenMP-based to a Pthreads-based fine-grained parallelization of the ML function.

Efficient parallelization of perturbative Monte Carlo QM/MM simulations in heterogeneous platforms

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016649420 ◽

2016 ◽

Vol 31 (6) ◽

pp. 499-516 ◽

Cited By ~ 1

Author(s):

Sebastião Miranda ◽

Jonas Feldt ◽

Frederico Pratas ◽

Ricardo A Mata ◽

Nuno Roma ◽

...

Keyword(s):

Monte Carlo ◽

Heterogeneous Systems ◽

Coarse Grained ◽

Molecular Systems ◽

Fine Grained ◽

Central Processing ◽

Speed Up ◽

Graphical Processing ◽

Computational Bottleneck ◽

The Cost

A novel perturbative Monte Carlo mixed quantum mechanics (QM)/molecular mechanics (MM) approach has recently been developed to simulate molecular systems in complex environments. However, the accuracy required to simulate such complex molecular systems usually comes at the cost of long execution times. To alleviate this problem, a new parallelization strategy for multi-level Monte Carlo molecular simulations is herein proposed for heterogeneous systems. It simultaneously exploits fine-grained (at the data level), coarse-grained (at the Markov chain level) and task-grained (pure QM, pure MM and QM/MM procedures) parallelism to ensure efficient execution in heterogeneous systems composed of central processing units and multiple, possibly different, graphical processing units. This is achieved by making use of the OpenCL library, together with appropriate dynamic load balancing schemes. In an evaluation with real benchmarking data, a speed-up of 56x in the computational bottleneck part was observed, which results in a global speed-up of 38x for the whole simulation, reducing the time of a typical simulation from 80 hours to only 2 hours.
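The relation between the 56x bottleneck speed-up and the 38x global speed-up can be checked with Amdahl's law. The sketch below assumes that this simple model applies, meaning only the bottleneck part is accelerated and everything else runs at its original speed. That is an assumption made for illustration, not a claim from the abstract, and the function names are hypothetical.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: overall speed-up when `fraction` of the original
    runtime is accelerated by `local_speedup` and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def implied_fraction(overall, local_speedup):
    """Invert Amdahl's law to recover the accelerated runtime fraction."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / local_speedup)


# Fraction of the original runtime the bottleneck would have to occupy
# for a 56x local speed-up to yield a 38x global speed-up.
p = implied_fraction(38.0, 56.0)
print(round(p, 4))          # about 0.9914
print(round(80.0 / 38.0, 2))  # about 2.11 hours for an 80-hour run
```

Under this model the bottleneck would account for roughly 99% of the original runtime, and an 80-hour simulation sped up 38x takes about 2.1 hours, in line with the reported reduction to about 2 hours.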

Study of Fine-grained Nested Parallelism in CDCL SAT Solvers

ACM Transactions on Parallel Computing ◽

10.1145/3470639 ◽

2021 ◽

Vol 8 (3) ◽

pp. 1-18

Author(s):

James Edwards ◽

Uzi Vishkin

Keyword(s):

Computer Architecture ◽

Coarse Grained ◽

Future Research ◽

Sat Solvers ◽

Fine Grained ◽

Nested Parallelism ◽

Clause Learning ◽

Speed Up ◽

Fine Grained Parallelism ◽

Problem Instances

Boolean satisfiability (SAT) is an important performance-hungry problem with applications in many problem domains. However, most work on parallelizing SAT solvers has focused on coarse-grained, mostly embarrassing, parallelism. Here, we study fine-grained parallelism that can speed up existing sequential SAT solvers, which all happen to be of the so-called Conflict-Directed Clause Learning variety. We show the potential for speedups of up to 382× across a variety of problem instances. We hope that these results will stimulate future research, particularly with respect to a computer architecture open problem we present.

Exploring Many-Core Design Templates for FPGAs and ASICs

International Journal of Reconfigurable Computing ◽

10.1155/2012/439141 ◽

2012 ◽

Vol 2012 ◽

pp. 1-15 ◽

Cited By ~ 4

Author(s):

Ilia Lebedev ◽

Christopher Fletcher ◽

Shaoyi Cheng ◽

James Martin ◽

Austin Doupnik ◽

...

Keyword(s):

Graphics Processing Unit ◽

General Purpose ◽

Coarse Grained ◽

Processing Unit ◽

Fine Grained ◽

Data Parallel ◽

Level Data ◽

Graph Inference ◽

High Level ◽

Many Core

We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture. The key benefits of this approach are that it (i) allows programmers to express parallelism through an API defined in a high-level programming language, (ii) supports coarse-grained multithreading and fine-grained threading while permitting bit-level resource control, and (iii) reduces the effort required to repurpose the system for different algorithms or different applications. We compare template-driven design to both full-custom and programmable approaches by studying implementations of a compute-bound data-parallel Bayesian graph inference algorithm across several candidate platforms. Specifically, we examine a range of template-based implementations on both FPGA and ASIC platforms and compare each against full custom designs. Throughout this study, we use a general-purpose graphics processing unit (GPGPU) implementation as a performance and area baseline. We show that our approach, similar in productivity to programmable approaches such as GPGPU applications, yields implementations with performance approaching that of full-custom designs on both FPGA and ASIC platforms.

Adaptive multi-layer techniques for increased system dependability

it - Information Technology ◽

10.1515/itit-2014-1082 ◽

2015 ◽

Vol 57 (3) ◽

Author(s):

Lars Bauer ◽

Jörg Henkel ◽

Andreas Herkersdorf ◽

Michael A. Kochte ◽

Johannes M. Kühn ◽

...

Keyword(s):

General Purpose ◽

System Level ◽

Coarse Grained ◽

Common Goal ◽

Fine Grained ◽

Reconfigurable Processors ◽

Systems On Chip ◽

On Chip ◽

Application Specific ◽

Heterogeneous Mpsoc

Achieving system-level dependability is a demanding task. The manifold requirements and dependability threats can no longer be statically addressed at individual abstraction layers. Instead, all components of future multi-processor systems-on-chip (MPSoCs) have to contribute to this common goal in an adaptive manner. In this paper we target a generic heterogeneous MPSoC that combines general-purpose processors with dedicated application-specific hard-wired accelerators, fine-grained reconfigurable processors, and coarse-grained reconfigurable architectures. We present different
