AN EVALUATION OF MULTIPLE FEED-FORWARD NETWORKS ON GPUs

The Graphics Processing Unit (GPU), originally designed for rendering graphics and once difficult to program for other tasks, has since evolved into a device suitable for general-purpose computation. As a result, graphics hardware has become progressively more attractive, yielding unprecedented performance at a relatively low cost. This makes it an ideal candidate for accelerating a wide variety of data-parallel tasks in many fields, including Machine Learning (ML). As problems become more demanding, parallel implementations of learning algorithms become crucial for practical applications. In particular, implementing Neural Networks (NNs) on GPUs can significantly reduce the long training times of the learning process. In this paper we present a GPU parallel implementation of the Back-Propagation (BP) and Multiple Back-Propagation (MBP) algorithms and describe the GPU kernels needed for this task. The results obtained on well-known benchmarks show faster training times and improved performance compared with implementations on traditional hardware, owing to maximized floating-point throughput and memory bandwidth. Moreover, a preliminary GPU-based Autonomous Training System (ATS) is developed, which aims to automatically find high-quality NN-based solutions for a given problem.
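
As a rough illustration of the Back-Propagation step mentioned in the abstract, the following is a minimal sketch in plain Python for a network with one hidden layer. It is not the paper's GPU kernels: the function names (`forward`, `loss`, `gradients`), the sigmoid activation, and the squared-error loss are all assumptions chosen for illustration. The point it tries to show is that each training sample contributes to the gradient independently, which is the kind of data parallelism a GPU implementation could exploit.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_out, x):
    """Forward pass. The last weight of every neuron is its bias."""
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    y = sigmoid(sum(w * v for w, v in zip(w_out, h + [1.0])))
    return h, y


def loss(w_hidden, w_out, samples):
    """Sum of squared errors over all (input, target) pairs."""
    return 0.5 * sum((forward(w_hidden, w_out, x)[1] - t) ** 2 for x, t in samples)


def gradients(w_hidden, w_out, samples):
    """Back-propagation: gradient of the loss w.r.t. every weight.

    The loop over samples has no cross-sample dependencies apart from the
    final accumulation, so the per-sample work could run in parallel.
    """
    g_hidden = [[0.0] * len(row) for row in w_hidden]
    g_out = [0.0] * len(w_out)
    for x, t in samples:
        h, y = forward(w_hidden, w_out, x)
        # Error signal at the output neuron (sigmoid derivative is y * (1 - y)).
        delta_out = (y - t) * y * (1.0 - y)
        for j, v in enumerate(h + [1.0]):
            g_out[j] += delta_out * v
        # Propagate the error signal back to each hidden neuron.
        for j, row in enumerate(w_hidden):
            delta_h = delta_out * w_out[j] * h[j] * (1.0 - h[j])
            for i, v in enumerate(x + [1.0]):
                g_hidden[j][i] += delta_h * v
    return g_hidden, g_out
```

A gradient-descent update would then subtract a small multiple of each gradient entry from the matching weight; the learning rate and stopping rule are left open here.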

Parallel implementation of an error diffusion halftoning algorithm with a general purpose graphics processing unit

2010 IEEE International Conference on Image Processing ◽

10.1109/icip.2010.5653503 ◽

2010 ◽

Cited By ~ 2

Author(s):

Becksang Seong ◽

Jaewoo Ahn ◽

Wonyong Sung

Keyword(s):

Graphics Processing Unit ◽

Parallel Implementation ◽

General Purpose ◽

Error Diffusion ◽

Processing Unit ◽

Graphics Processing

Massively parallel implementation of cyclic LDPC codes on a general purpose graphics processing unit

2009 IEEE Workshop on Signal Processing Systems ◽

10.1109/sips.2009.5336268 ◽

2009 ◽

Cited By ~ 10

Author(s):

Hyunwoo Ji ◽

Junho Cho ◽

Wonyong Sung

Keyword(s):

Graphics Processing Unit ◽

Parallel Implementation ◽

Ldpc Codes ◽

General Purpose ◽

Massively Parallel ◽

Processing Unit ◽

Graphics Processing

A fully data parallel WFST-based large vocabulary continuous speech recognition on a graphics processing unit

10.21437/interspeech.2009-343 ◽

2009 ◽

Cited By ~ 1

Author(s):

Jike Chong ◽

Ekaterina Gonina ◽

Youngmin Yi ◽

Kurt Keutzer

Keyword(s):

Speech Recognition ◽

Graphics Processing Unit ◽

Processing Unit ◽

Continuous Speech ◽

Continuous Speech Recognition ◽

Large Vocabulary ◽

Data Parallel ◽

Graphics Processing

Practical Implementation of Prestack Kirchhoff Time Migration on a General Purpose Graphics Processing Unit

Acta Geophysica ◽

10.1515/acgeo-2016-0033 ◽

2016 ◽

Vol 64 (4) ◽

pp. 1051-1063 ◽

Cited By ~ 2

Author(s):

Guofeng Liu ◽

Chun Li

Keyword(s):

Graphics Processing Unit ◽

General Purpose ◽

Practical Implementation ◽

Processing Unit ◽

Time Migration ◽

Graphics Processing

Fast X-Ray Diffraction (XRD) Tomography for Enhanced Identification of Materials

10.36227/techrxiv.17125448.v1 ◽

2021 ◽

Author(s):

Airidas Korolkovas ◽

Alexander Katsevich ◽

Michael Frenkel ◽

William Thompson ◽

Edward Morton

Keyword(s):

Graphics Processing Unit ◽

Low Cost ◽

Photon Counting ◽

Finite Size ◽

Processing Unit ◽

X Ray Diffraction ◽

X Ray ◽

Specific Material ◽

Xrd Patterns ◽

Graphics Processing

X-ray computed tomography (CT) can provide 3D images of density, and possibly of atomic number, for large objects such as passenger luggage. This information, while generally very useful, is often insufficient to identify threats such as explosives and narcotics, whose average composition can be similar to that of benign everyday materials such as plastics, glass, and light metals. A much more specific material signature can be measured with X-ray diffraction (XRD). Unfortunately, the XRD signal is very faint compared with the transmitted one, and it is also challenging to reconstruct for objects larger than a small laboratory sample. In this article we analyze a novel low-cost scanner design which captures CT and XRD signals simultaneously and uses the least possible collimation to maximize the flux. To simulate a realistic instrument, we derive a formula for the resolution of any diffraction pathway, taking into account the polychromatic spectrum and the finite size of the source, the detector, and each voxel. We then show how to reconstruct XRD patterns from a large phantom with multiple diffracting objects. Our approach includes a reasonable amount of photon-counting noise (Poisson statistics), as well as measurement bias, in particular incoherent Compton scattering. The resolution of our reconstruction is sufficient to provide significantly more information than standard CT, thus increasing the accuracy of threat detection. Our theoretical model is implemented in GPU (Graphics Processing Unit) accelerated software which can be used to assess and further optimize scanner designs for specific applications in security, healthcare, and manufacturing quality control.

Comprehensive regression-based model to predict performance of general-purpose graphics processing unit

Cluster Computing ◽

10.1007/s10586-019-03011-2 ◽

2019 ◽

Vol 23 (2) ◽

pp. 1505-1516 ◽

Cited By ~ 2

Author(s):

Mohammad Hossein Shafiabadi ◽

Hossein Pedram ◽

Midia Reshadi ◽

Akram Reza

Keyword(s):

Graphics Processing Unit ◽

General Purpose ◽

Processing Unit ◽

Graphics Processing

General purpose graphics-processing-unit implementation of cosmological domain wall network evolution

Physical Review E ◽

10.1103/physreve.96.043310 ◽

2017 ◽

Vol 96 (4) ◽

Cited By ~ 5

Author(s):

J. R. C. C. C. Correia ◽

C. J. A. P. Martins

Keyword(s):

Domain Wall ◽

Graphics Processing Unit ◽

General Purpose ◽

Network Evolution ◽

Processing Unit ◽

Graphics Processing

High throughput transmission optical projection tomography using low cost graphics processing unit

Optics Express ◽

10.1364/oe.17.022320 ◽

2009 ◽

Vol 17 (25) ◽

pp. 22320 ◽

Cited By ~ 20

Author(s):

Claudio Vinegoni ◽

Lyuba Fexon ◽

Paolo Fumene Feruglio ◽

Misha Pivovarov ◽

Jose-Luiz Figueiredo ◽

...

Keyword(s):

High Throughput ◽

Graphics Processing Unit ◽

Low Cost ◽

Processing Unit ◽

Optical Projection Tomography ◽

Optical Projection ◽

Graphics Processing

GPU-accelerated alignment of bisulfite-treated short-read sequences

10.1101/175729 ◽

2017 ◽

Author(s):

Richard Wilton ◽

Xin Li ◽

Andrew P. Feinberg ◽

Alexander S. Szalay

Keyword(s):

Dna Sequences ◽

Graphics Processing Unit ◽

General Purpose ◽

Processing Unit ◽

Short Read ◽

Wide Range ◽

Programming Logic ◽

Short Read Aligner ◽

Graphics Processing ◽

Better Than

Abstract: The alignment of bisulfite-treated DNA sequences (BS-seq reads) to a large genome involves a significant computational burden beyond that required to align non-bisulfite-treated reads. In the analysis of BS-seq data, this can present an important performance bottleneck that can potentially be addressed by appropriate software-engineering and algorithmic improvements. One strategy is to integrate this additional programming logic into the read-alignment implementation in a way that the software becomes amenable to optimizations that lead to both higher speed and greater sensitivity than can be achieved without this integration. We have evaluated this approach using Arioc, a short-read aligner that uses GPU (general-purpose graphics processing unit) hardware to accelerate computationally-expensive programming logic. We integrated the BS-seq computational logic into both GPU and CPU code throughout the Arioc implementation. We then carried out a read-by-read comparison of Arioc's reported alignments with the alignments reported by the most widely used BS-seq read aligners. With simulated reads, Arioc's accuracy is equal to or better than the other read aligners we evaluated. With human sequencing reads, Arioc's throughput is at least 10 times faster than existing BS-seq aligners across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.

Analysis of Fast Fourier Transformations algorithm for CUDA Architecture

Lietuvos matematikos rinkinys ◽

10.15388/lmr.b.2012.46 ◽

2012 ◽

Vol 53 ◽

Author(s):

Beatričė Andziulienė ◽

Evaldas Žulkas ◽

Audrius Kuprinavičius

Keyword(s):

Graphics Processing Unit ◽

General Purpose ◽

Fast Fourier Transformation ◽

Processing Unit ◽

Data Allocation ◽

Analysis Method ◽

Central Processing ◽

Execution Speed ◽

Cuda Architecture ◽

Graphics Processing

In this work, a Fast Fourier transform algorithm for general-purpose computing on graphics processing units (GPGPU) is discussed. The structure of the algorithm and the performance of its individual stages were analysed. Using a performance analysis method, the options for distributing the algorithm and allocating its data were determined, depending on the execution speed of the algorithm stages and on the algorithm structure. The ratio between CPU and GPU execution during Fast Fourier transform signal processing was determined using computer-generated data with frequency. When adapting CPU code for CUDA execution, it does not become more complex, even if the stream processor parallelization and data transfer stages of the algorithm are considered. But central processing unit serial execution).
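
To make the notion of "algorithm stages" in the abstract more concrete, here is a minimal sketch of an iterative radix-2 Fast Fourier transform in plain Python. It is not the authors' CUDA code: the function name `fft_radix2` and the choice of the radix-2 decimation-in-time variant are assumptions made for illustration. The transform runs in log2(N) stages, and within one stage every butterfly touches a distinct pair of elements, which is what would make each stage a candidate for data-parallel execution.

```python
import cmath


def fft_radix2(signal):
    """Iterative radix-2 FFT; the input length must be a power of two."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    # Stage 0: reorder the input by bit-reversed index.
    bits = n.bit_length() - 1
    data = [0j] * n
    for i, value in enumerate(signal):
        rev = int(format(i, "0{}b".format(bits))[::-1], 2) if bits else 0
        data[rev] = complex(value)

    # Stages 1..log2(n): combine transforms of length size/2 into length size.
    size = 2
    while size <= n:
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)  # twiddle-factor increment
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                # One butterfly: reads and writes exactly two elements.
                a = data[start + k]
                b = data[start + k + half] * w
                data[start + k] = a + b
                data[start + k + half] = a - b
                w *= step
        size *= 2
    return data
```

Calling `fft_radix2([1, 0, 0, 0])` returns four values equal to 1 (up to floating-point rounding), the flat spectrum of a unit impulse.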
