Evaluation of State-of-the-Art Parallelizing Compilers Generating CUDA Code for Heterogeneous CPU/GPU Computing

Author(s):  
J.C. Juega ◽  
S. Verdoolaege ◽  
A. Cohen ◽  
J.I. Gómez ◽  
C. Tenllado ◽  
...  

2010 ◽  
Vol 19 (07) ◽  
pp. 1465-1481 ◽
Author(s):  
SUN YU ◽  
WEI ZHANG

This paper surveys state-of-the-art parallelization techniques for multiprocessor architectures and studies their implications for Java programs, which are typically compiled at run-time. First, the paper reviews the basic techniques of program parallelization in traditional static compilers, followed by a survey of successful parallelizing compilers. It then introduces the latest research topics in this area, focusing in particular on efforts to combine parallelization techniques with Java virtual machines, including parallel compilation and parallel real-time garbage collection. Finally, the paper summarizes the opportunities and challenges of parallel Java computing on multicore platforms.



Author(s):  
Piotr Sowa ◽  
Jacek Izydorczyk

The article’s goal is to survey the challenges and problems on the way from state-of-the-art CUDA-accelerated neural network code to multi-GPU code. For this purpose, the authors describe the journey of porting the existing fully featured CUDA-accelerated Darknet engine on GitHub to OpenCL. The article presents lessons learned and the techniques that were put in place to make this port happen. A few other implementations on GitHub leverage the OpenCL standard, and a few have also tried to port Darknet. Darknet is a well-known convolutional neural network (CNN) framework. The authors investigated all aspects of the porting and achieved a fully featured Darknet engine on OpenCL. The effort covered not only classification with the YOLO1, YOLO2, and YOLO3 CNN models, but also other aspects such as training neural networks and benchmarking to find the weak points of the implementation. The GPU computing code substantially improves Darknet computing time compared with the standard CPU version by exploiting otherwise underused hardware in existing systems; being OpenCL-based, it is also practically hardware independent. The authors report comparisons of computation and training performance against the existing CUDA-based Darknet engine on various computers, including single-board computers, and across different CNN use cases. They found that the OpenCL version can run as fast as the CUDA version in compute, but is slower in memory transfers between RAM (CPU memory) and VRAM (GPU memory); this depends only on the quality of the OpenCL implementation. Moreover, the looser hardware requirements of the OpenCL Darknet can broaden the application of DNNs, especially in energy-sensitive Artificial Intelligence (AI) and Machine Learning (ML) settings.
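
The abstract includes no source code, but the kernel-level surface of such a CUDA-to-OpenCL port can be sketched with a toy example. The CUDA kernel below is hypothetical and not taken from the Darknet code base; the comments note the mostly mechanical device-side correspondences, while the harder part of the port (and, per the authors' findings, the performance-sensitive part) is the host-side buffer and queue management.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Elementwise leaky-ReLU activation, a typical CNN building block.
// Device-side, the OpenCL port of a kernel like this is mostly mechanical:
//   __global__ void f(float *x, ...)  ->  __kernel void f(__global float *x, ...)
//   blockIdx.x * blockDim.x + threadIdx.x  ->  get_global_id(0)
// Host-side, cudaMalloc/cudaMemcpy/<<<...>>> become clCreateBuffer,
// clEnqueueWriteBuffer, clSetKernelArg and clEnqueueNDRangeKernel; that
// layer is where the memory-transfer costs discussed above are decided.
__global__ void leaky_relu(float *x, int n, float slope) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] > 0.0f ? x[i] : slope * x[i];
}

int main() {
    const int n = 8;
    float h[n] = {-2.f, -1.f, -0.5f, 0.f, 0.5f, 1.f, 2.f, 3.f};
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    leaky_relu<<<1, 32>>>(d, n, 0.1f);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%g ", h[i]);  // -0.2 -0.1 -0.05 0 ...
    printf("\n");
    cudaFree(d);
    return 0;
}
```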



2011 ◽  
Vol 21 (02) ◽  
pp. 245-272 ◽  
Author(s):  
DUANE MERRILL ◽  
ANDREW GRIMSHAW

The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compared to state-of-the-art GPU sorting. We also reverse the performance differentials observed between GPU and multi/many-core CPU architectures by recent comparisons in the literature, including those with 32-core CPU-based accelerators. Our average sorting rates exceed 1B 32-bit keys/sec on a single GPU microprocessor. Our sorting passes are constructed from a very efficient parallel prefix scan "runtime" that incorporates three design features: (1) kernel fusion for locally generating and consuming prefix scan data; (2) multi-scan for performing multiple related, concurrent prefix scans (one for each partitioning bin); and (3) flexible algorithm serialization for avoiding unnecessary synchronization and communication within algorithmic phases, allowing us to construct a single implementation that scales well across all generations and configurations of programmable NVIDIA GPUs.
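
The paper's fused scan runtime is not reproduced here, but the primitive it generalizes can be illustrated. Below is a minimal single-block Hillis–Steele inclusive prefix scan in CUDA; it is a deliberately simplified stand-in (one block, power-of-two size, no kernel fusion or multi-scan) for the machinery the abstract describes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Single-block inclusive prefix scan (Hillis-Steele) over BLOCK elements,
// double-buffered in shared memory. Radix-sort runtimes like the one
// described fuse many such scans (one per digit bin) into their scatter
// kernels; this standalone kernel shows only the core pattern.
#define BLOCK 256

__global__ void inclusive_scan(const unsigned *in, unsigned *out) {
    __shared__ unsigned buf[2][BLOCK];
    int t = threadIdx.x;
    int cur = 0;
    buf[cur][t] = in[t];
    __syncthreads();
    for (int offset = 1; offset < BLOCK; offset <<= 1) {
        int nxt = 1 - cur;
        buf[nxt][t] = (t >= offset) ? buf[cur][t] + buf[cur][t - offset]
                                    : buf[cur][t];
        cur = nxt;
        __syncthreads();
    }
    out[t] = buf[cur][t];
}

int main() {
    unsigned h[BLOCK], *d_in, *d_out;
    for (int i = 0; i < BLOCK; ++i) h[i] = 1;  // scan of all ones -> 1..BLOCK
    cudaMalloc(&d_in, sizeof(h));
    cudaMalloc(&d_out, sizeof(h));
    cudaMemcpy(d_in, h, sizeof(h), cudaMemcpyHostToDevice);
    inclusive_scan<<<1, BLOCK>>>(d_in, d_out);
    cudaMemcpy(h, d_out, sizeof(h), cudaMemcpyDeviceToHost);
    printf("last prefix sum = %u (expect %d)\n", h[BLOCK - 1], BLOCK);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```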



Author(s):  
Meurig T. Gallagher ◽  
David J. Smith

Stokes flow, discussed by G.G. Stokes in 1851, describes many microscopic biological flow phenomena, including cilia-driven transport and flagellar motility; the need to quantify and understand these flows has motivated decades of mathematical and computational research. Regularized stokeslet methods, which have been used and refined over the past 20 years, offer significant advantages in simplicity of implementation, with a recent modification based on nearest-neighbour interpolation providing substantial improvements in efficiency and accuracy. Moreover, the method can be implemented so that the majority of the computation takes place through built-in linear algebra, meaning that state-of-the-art hardware and software developments in the latter, in particular multicore and GPU computing, can be exploited through minimal modifications (‘passive parallelism’) to existing Matlab code. Hence, with widely available GPU hardware, significant improvements in the efficiency of the regularized stokeslet method can be obtained. The approach is demonstrated through computational experiments on three model biological flows: undulatory propulsion of multiple Caenorhabditis elegans, simulation of progression and transport by multiple sperm in a geometrically confined region, and left–right symmetry-breaking particle transport in the ventral node of the mouse embryo. In general, an order-of-magnitude improvement in efficiency is observed. This development further extends the complexity of biological flow systems that are accessible without extensive code development or specialist facilities. This article is part of the theme issue ‘Stokes at 200 (part 2)’.
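
The abstract does not restate the method's kernel. For orientation, one common choice, the regularized stokeslet of Cortez (2001), on which nearest-neighbour discretizations of this kind build, reads:

```latex
% Regularized stokeslet of Cortez (2001), for the blob
% \psi_\epsilon(r) = 15\epsilon^4 / (8\pi (r^2+\epsilon^2)^{7/2}).
% u_i is the velocity at x induced by N regularized point forces f^{[n]}
% located at x^{[n]}; \mu is the dynamic viscosity.
\[
  u_i(\mathbf{x}) = \frac{1}{8\pi\mu} \sum_{n=1}^{N}
      S^{\epsilon}_{ij}\bigl(\mathbf{x},\mathbf{x}^{[n]}\bigr)\, f^{[n]}_j,
  \qquad
  S^{\epsilon}_{ij}(\mathbf{x},\mathbf{x}_0)
    = \delta_{ij}\,\frac{r^2 + 2\epsilon^2}{(r^2+\epsilon^2)^{3/2}}
    + \frac{(x_i - x_{0,i})(x_j - x_{0,j})}{(r^2+\epsilon^2)^{3/2}},
\]
where $r = |\mathbf{x}-\mathbf{x}_0|$. Assembling $S^{\epsilon}_{ij}$ over
all force points gives the dense matrix whose products and solves are the
built-in linear algebra that parallelizes passively on multicore CPUs
and GPUs.
```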



2016 ◽  
pp. 2373-2384 ◽
Author(s):  
Yaser Jararweh ◽  
Moath Jarrah ◽  
Abdelkader Bousselham

Current state-of-the-art GPU-based systems offer unprecedented performance advantages by accelerating the most compute-intensive portions of applications by an order of magnitude. GPU computing presents a viable solution to the ever-increasing complexity of applications and the growing demand for immense computational resources. In this paper, the authors investigate different platforms of GPU-based systems, from Personal Supercomputing (PSC) to cloud-based GPU systems. They explore and evaluate these platforms and present a comparison against conventional high-performance cluster-based computing systems. Their evaluation shows the potential advantages of using GPU-based systems for high-performance computing applications across different scaling granularities.



2019 ◽  
Vol 9 (4) ◽  
pp. 778 ◽  
Author(s):  
Steffi Priyanka ◽  
Yuan-Kai Wang

Neural-network-based image denoising is one of the promising approaches to problems in image processing. In this work, a deep fully symmetric convolutional–deconvolutional neural network (FSCN) is proposed for image denoising. The proposed model comprises a novel architecture with a chain of successive symmetric convolutional–deconvolutional layers. This framework learns convolutional–deconvolutional mappings from corrupted images to clean ones in an end-to-end fashion, without using image priors. The convolutional layers act as feature extractors that encode the primary components of the image content while eliminating corruption, and the deconvolutional layers then decode these image abstractions to recover the image content details. An adaptive moment (Adam) optimizer is used to minimize the reconstruction loss, as it is well suited to large data and noisy images. Extensive denoising experiments were conducted to evaluate the FSCN model against existing state-of-the-art denoising algorithms. The results show that the proposed model achieves superior denoising, both qualitatively and quantitatively. This work also presents an efficient GPU implementation of the FSCN model, which makes it easy and attractive for practical denoising applications.
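
As a rough illustration of the GPU primitive underlying the convolutional layers described above, here is a naive single-channel 2D convolution kernel in CUDA. It is hypothetical code, unrelated to the authors' implementation, which would in practice use tiled or library (e.g., cuDNN) convolutions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive single-channel 2D convolution with zero padding: the basic
// primitive behind the convolutional (and, when transposed, the
// deconvolutional) layers in an encoder-decoder denoiser.
__global__ void conv2d(const float *img, const float *ker, float *out,
                       int h, int w, int kh, int kw) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float acc = 0.0f;
    for (int i = 0; i < kh; ++i)
        for (int j = 0; j < kw; ++j) {
            int yy = y + i - kh / 2, xx = x + j - kw / 2;
            if (yy >= 0 && yy < h && xx >= 0 && xx < w)
                acc += img[yy * w + xx] * ker[i * kw + j];
        }
    out[y * w + x] = acc;
}

int main() {
    const int h = 8, w = 8, kh = 3, kw = 3;
    float img[h * w], ker[kh * kw], out[h * w];
    for (int i = 0; i < h * w; ++i) img[i] = 1.0f;
    for (int i = 0; i < kh * kw; ++i) ker[i] = 1.0f / 9.0f;  // 3x3 box filter
    float *d_img, *d_ker, *d_out;
    cudaMalloc(&d_img, sizeof(img));
    cudaMalloc(&d_ker, sizeof(ker));
    cudaMalloc(&d_out, sizeof(out));
    cudaMemcpy(d_img, img, sizeof(img), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ker, ker, sizeof(ker), cudaMemcpyHostToDevice);
    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    conv2d<<<grid, block>>>(d_img, d_ker, d_out, h, w, kh, kw);
    cudaMemcpy(out, d_out, sizeof(out), cudaMemcpyDeviceToHost);
    printf("centre pixel = %g (expect 1)\n", out[4 * w + 4]);
    cudaFree(d_img);
    cudaFree(d_ker);
    cudaFree(d_out);
    return 0;
}
```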



Author(s):  
T. A. Welton

Various authors have emphasized the spatial information resident in an electron micrograph taken with adequately coherent radiation. In view of the completion of at least one such instrument, this opportunity is taken to summarize the state of the art of processing such micrographs. We use the usual symbols for the aberration coefficients, and supplement these with ℓ and δ for the transverse coherence length and the fractional energy spread respectively. We also assume a weak, biologically interesting sample, with principal interest lying in the molecular skeleton remaining after obvious hydrogen loss and other radiation damage has occurred.



Author(s):  
Carl E. Henderson

Over the past few years it has become apparent in our multi-user facility that the computer system and software supplied in 1985 with our CAMECA CAMEBAX-MICRO electron microprobe analyzer has the greatest potential for improvement and updating of any component of the instrument. While the standard CAMECA software running on a DEC PDP-11/23+ computer under the RSX-11M operating system can perform almost any task required of the instrument, the commands are not always intuitive and can be difficult to remember for the casual user (of which our laboratory has many). Given the widespread and growing use of other microcomputers (such as PCs and Macintoshes) by users of the microprobe, the PDP has become the “oddball” and has also fallen behind the state of the art in terms of processing speed and disk storage capabilities. Upgrade paths within DEC’s product line are considered too expensive for the benefits received. After using a Macintosh for other tasks in the laboratory, such as instrument use and billing records, word processing, and graphics display, its unique and “friendly” user interface suggested an easier-to-use system for computer control of the electron microprobe automation. Specifically, a Macintosh IIx was chosen for its capacity for third-party add-on cards used in instrument control.



2010 ◽  
Vol 20 (1) ◽  
pp. 9-13 ◽  
Author(s):  
Glenn Tellis ◽  
Lori Cimino ◽  
Jennifer Alberti

The purpose of this article is to provide clinical supervisors with information pertaining to state-of-the-art clinic observation technology. We use a novel video-capture technology, the Landro Play Analyzer, to supervise clinical sessions as well as to train students to improve their clinical skills. We can observe four clinical sessions simultaneously from a central observation center. In addition, speech samples can be analyzed in real time; saved on a CD, DVD, or flash/jump drive; viewed in slow motion; paused; and analyzed with Microsoft Excel. Procedures for applying the technology to clinical training and supervision are discussed.


