LDPC Decoding on GPU for Mobile Device

2016 ◽  
Vol 2016 ◽  
pp. 1-6 ◽  
Author(s):  
Yiqin Lu ◽  
Weiyue Su ◽  
Jiancheng Qin

This paper proposes a flexible software LDPC decoder for mobile devices that exploits data parallelism to decode multiple codewords simultaneously, supported by multithreading on OpenCL-based graphics processing units. By dividing the check matrix into several parts to make full use of both the local and private memory on the GPU, and by properly adjusting the code capacity on each pass, our implementation on a mobile phone achieves throughputs above 100 Mbps with a decoding delay below 1.6 ms, which makes high-speed communication such as video calling possible. To realize efficient software LDPC decoding on mobile devices, the LDPC decoding feature on the communication baseband chip could be replaced, saving cost and making it easier to upgrade the decoder for compatibility with a variety of channel access schemes.
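
To make the data-parallel idea concrete, below is a minimal NumPy sketch of batched min-sum LDPC decoding, where one batch axis stands in for the simultaneous multi-codeword decoding the paper maps onto OpenCL work-items. The dense message layout and all dimensions are illustrative, not the authors' memory-partitioning scheme.

    import numpy as np

    def minsum_decode(H, llr, iters=20):
        # Batched min-sum decoding: 'llr' holds channel LLRs for a whole
        # batch of codewords, shape (B, n), emulating the paper's
        # multi-codeword parallelism. LLR convention: log P(0)/P(1).
        m, n = H.shape
        mask = H.astype(bool)                       # (m, n) Tanner-graph edges
        v2c = np.where(mask, llr[:, None, :], 0.0)  # variable-to-check msgs
        c2v = np.zeros_like(v2c)
        for _ in range(iters):
            # Check-node update: extrinsic sign product and min magnitude.
            mag = np.where(mask, np.abs(v2c), np.inf)
            sgn = np.where(mask, np.sign(v2c) + (v2c == 0), 1.0)
            total_sgn = np.prod(sgn, axis=2, keepdims=True)
            min1 = np.min(mag, axis=2, keepdims=True)
            mag2 = mag.copy()                       # mask argmin, re-take min
            np.put_along_axis(mag2, np.argmin(mag, axis=2)[:, :, None],
                              np.inf, axis=2)
            min2 = np.min(mag2, axis=2, keepdims=True)
            c2v = np.where(mask,
                           total_sgn * sgn * np.where(mag == min1, min2, min1),
                           0.0)
            # Variable-node update: channel LLR plus all other check msgs.
            total = llr[:, None, :] + c2v.sum(axis=1, keepdims=True) - c2v
            v2c = np.where(mask, total, 0.0)
        posterior = llr + c2v.sum(axis=1)
        return (posterior < 0).astype(np.uint8)     # hard decisions, (B, n)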

Nanophotonics ◽  
2020 ◽  
Vol 9 (13) ◽  
pp. 4097-4108 ◽  
Author(s):  
Moustafa Ahmed ◽  
Yas Al-Hadeethi ◽  
Ahmed Bakry ◽  
Hamed Dalir ◽  
Volker J. Sorger

Abstract The technologically relevant task of feature extraction from data in deep-learning systems is routinely accomplished as repeated fast Fourier transforms (FFT) performed electronically in prevalent domain-specific architectures such as graphics processing units (GPUs). However, electronic systems are limited with respect to power dissipation and delay, due to wire-charging challenges related to interconnect capacitance. Here we present a silicon-photonics-based architecture for convolutional neural networks that harnesses the phase property of light to perform FFTs efficiently by executing the convolution as a multiplication in the Fourier domain. The algorithmic execution time is determined by the time of flight of the signal through this photonic reconfigurable passive FFT ‘filter’ circuit and is on the order of tens of picoseconds. A sensitivity analysis shows that this optical processor must be thermally phase-stabilized to within a few degrees. Furthermore, we find that for a small sample number, the obtainable number of convolutions per unit time, power, and chip area outperforms GPUs by about two orders of magnitude. Lastly, we show that, conceptually, the optical FFT and convolution-processing performance is directly linked to optoelectronic device-level performance, and improvements in plasmonics, metamaterials, or nanophotonics are fueling next-generation densely interconnected intelligent photonic circuits with relevance for edge-computing 5G networks by processing tensor operations optically.
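
The core identity the photonic circuit exploits is the convolution theorem: convolution in real space becomes pointwise multiplication in the Fourier domain. A short, purely numeric NumPy sketch of that identity follows; it is a stand-in for the idea, not a model of the optical device, whose transforms run passively at time-of-flight speed.

    import numpy as np

    def fft_conv2d(image, kernel):
        # Linear 2-D convolution via the convolution theorem:
        # FFT both operands, multiply pointwise, inverse-FFT.
        ih, iw = image.shape
        kh, kw = kernel.shape
        H, W = ih + kh - 1, iw + kw - 1   # pad to avoid circular wrap-around
        F = np.fft.rfft2(image, s=(H, W)) * np.fft.rfft2(kernel, s=(H, W))
        return np.fft.irfft2(F, s=(H, W))

    def direct_conv2d(image, kernel):
        # Brute-force reference convolution for the check below.
        ih, iw = image.shape
        kh, kw = kernel.shape
        out = np.zeros((ih + kh - 1, iw + kw - 1))
        for y in range(kh):
            for x in range(kw):
                out[y:y + ih, x:x + iw] += kernel[y, x] * image
        return out

    rng = np.random.default_rng(0)
    img, ker = rng.normal(size=(32, 32)), rng.normal(size=(5, 5))
    assert np.allclose(fft_conv2d(img, ker), direct_conv2d(img, ker))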


Author(s):  
Masafumi Niwano ◽  
Katsuhiro L Murata ◽  
Ryo Adachi ◽  
Sili Wang ◽  
Yutaro Tachibana ◽  
...  

Abstract We developed a high-speed image-reduction pipeline using graphics processing units (GPUs) as hardware accelerators. Astronomers want to detect the electromagnetic counterpart of gravitational-wave sources as soon as possible and to share it for systematic follow-up observations, so high-speed image processing is important. We developed a new image-reduction pipeline for our robotic telescope system that uses a GPU via the Python package CuPy for high-speed image processing. As a result, the new pipeline is more than 40 times faster than the current one while maintaining the same functions.
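
Since the speed-up comes from CuPy's NumPy-compatible GPU arrays, here is a minimal sketch of what a GPU-side reduction step can look like. The dark-subtraction and flat-fielding shown are generic CCD calibration steps, not the authors' actual pipeline.

    import cupy as cp  # NumPy-compatible arrays that live on the GPU

    def reduce_frames(raw, dark, flat):
        # Generic CCD reduction on the GPU: 'raw' is a stack of frames
        # (N, H, W); 'dark' and 'flat' are calibration frames (H, W).
        raw_g  = cp.asarray(raw,  dtype=cp.float32)   # host -> device
        dark_g = cp.asarray(dark, dtype=cp.float32)
        flat_g = cp.asarray(flat, dtype=cp.float32)
        flat_g /= cp.median(flat_g)                   # normalize the flat
        calibrated = (raw_g - dark_g) / flat_g        # broadcasts over N
        return cp.asnumpy(calibrated)                 # device -> host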


2019 ◽  
Vol 5 ◽  
pp. e185 ◽  
Author(s):  
Mahdi Abbasi ◽  
Razieh Tahouri ◽  
Milad Rafiee

Packet classification is a computationally intensive, highly parallelizable task in many advanced network systems, such as high-speed routers and firewalls, that enable different functionalities by discriminating incoming traffic. Recently, graphics processing units (GPUs) have been exploited as efficient accelerators for the parallel implementation of software classifiers. The aggregated bit vector is a highly parallelizable packet-classification algorithm. In this work, we first present a parallel kernel for running this algorithm on GPUs. Next, we adapt an asymptotic analysis method that predicts the empirical behavior of the proposed kernel. Experimental results not only confirm the efficiency of the proposed parallel kernel but also demonstrate the accuracy of the analysis method in predicting important trends in the experimental results.
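
For readers unfamiliar with the algorithm, the sketch below shows the aggregated-bit-vector idea on the CPU in plain Python/NumPy: each field lookup yields a rule bit vector plus a one-bit-per-word aggregate, and intersecting the aggregates first lets the search skip words that cannot contain a match. The paper's kernel parallelizes this on the GPU; the helper names here are illustrative.

    import numpy as np

    def make_bitvec(matching_rules, n_rules):
        # Pack a set of matching rule indices into a uint64 bit vector.
        words = np.zeros((n_rules + 63) // 64, dtype=np.uint64)
        for r in matching_rules:
            words[r // 64] |= np.uint64(1) << np.uint64(r % 64)
        return words

    def abv_match(per_field_vecs):
        # AND the aggregates (one bit per 64-rule word) first, then AND
        # only the surviving words; return the highest-priority match.
        aggs = [vec != 0 for vec in per_field_vecs]
        live = np.logical_and.reduce(aggs)          # words worth visiting
        for w in np.flatnonzero(live):
            word = np.bitwise_and.reduce([vec[w] for vec in per_field_vecs])
            if word:
                low = int(word & (~word + np.uint64(1))).bit_length() - 1
                return 64 * int(w) + low            # lowest rule index wins
        return None                                 # no rule matches

    # e.g. two fields whose lookups matched rules {0, 3, 70} and {3, 70, 90}:
    vecs = [make_bitvec({0, 3, 70}, 128), make_bitvec({3, 70, 90}, 128)]
    assert abv_match(vecs) == 3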


2021 ◽  
Vol 13 (2) ◽  
pp. 7 ◽
Author(s):  
Maria Pantoja

Currently, practical network packet processing used for Intrusion Detection Systems/Intrusion Prevention Systems (IDS/IPS) tends to belong to one of two disjoint categories: software-only implementations running on general-purpose CPUs, or highly specialized network hardware implementations using ASICs or FPGAs for the most common functions and general-purpose CPUs for the rest. These approaches try to maximize performance and minimize cost, but neither system, when implemented effectively, is affordable to any clients except those at the well-funded enterprise level. In this paper, we aim to improve the performance of affordable network packet processing in heterogeneous systems with consumer Graphics Processing Unit (GPU) hardware by optimizing latency-tolerant packet processing operations, notably IDS, to obtain the maximum throughput required by such systems in networks sophisticated enough to demand a dedicated IDS/IPS system but not sophisticated enough to justify the high cost of cutting-edge specialized hardware. In particular, this project investigated increasing the granularity of OSI layer-based packet batching over that of previous batching approaches. We demonstrate that highly granular GPU-enabled packet processing is generally impractical compared with existing methods by implementing our own solution, which we call Corvyd, a heterogeneous real-time packet-processing engine.
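
As a point of reference for the batching granularity under discussion, the schematic below groups packets so that each batch shares one L3/L4 protocol pair and can follow a single, divergence-free code path when handed to a GPU kernel. The field names and dispatch callback are illustrative stand-ins, not Corvyd's actual interfaces.

    from collections import defaultdict

    def batch_by_layers(packets, dispatch):
        # Group packets by their (layer-3, layer-4) protocol pair so a
        # whole batch takes one code path; finer-grained keys mean more,
        # smaller batches -- the granularity trade-off evaluated here.
        batches = defaultdict(list)
        for pkt in packets:
            batches[(pkt["l3_proto"], pkt["l4_proto"])].append(pkt)
        for key, batch in batches.items():
            dispatch(key, batch)   # e.g. one kernel launch per batch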


Author(s):  
Hua He ◽  
Jimmy Lin ◽  
Adam Lopez

Grammars for machine translation can be materialized on demand by finding source phrases in an indexed parallel corpus and extracting their translations. This approach is limited in practical applications by the computational expense of online lookup and extraction. For phrase-based models, recent work has shown that on-demand grammar extraction can be greatly accelerated by parallelization on general-purpose graphics processing units (GPUs), but these algorithms do not work for hierarchical models, which require matching patterns that contain gaps. We address this limitation by presenting a novel GPU algorithm for on-demand hierarchical grammar extraction that is at least an order of magnitude faster than a comparable CPU algorithm when processing large batches of sentences. In terms of end-to-end translation, with decoding on the CPU, we increase throughput by roughly two thirds on a standard MT evaluation dataset. The GPU necessary to achieve these improvements increases the cost of a server by about a third. We believe that GPU-based extraction of hierarchical grammars is an attractive proposition, particularly for MT applications that demand high throughput.
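
The indexed lookup that on-demand extraction starts from is typically a suffix array over the tokenized source side of the corpus. The short sketch below (Python 3.10+ for bisect's key argument) finds every occurrence of a contiguous source phrase; gapped hierarchical patterns, the case this paper accelerates on the GPU, need substantially more machinery and are not handled here.

    import bisect

    def build_suffix_array(tokens):
        # Sorted suffix start positions; O(n^2 log n) toy construction.
        return sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def find_phrase(tokens, sa, phrase):
        # Binary-search the suffix-array range whose suffixes begin
        # with 'phrase', then return the matching corpus positions.
        key = lambda i: tokens[i:i + len(phrase)]
        lo = bisect.bisect_left(sa, phrase, key=key)
        hi = bisect.bisect_right(sa, phrase, key=key)
        return sorted(sa[lo:hi])

    corpus = "the cat sat on the mat near the cat".split()
    sa = build_suffix_array(corpus)
    assert find_phrase(corpus, sa, ["the", "cat"]) == [0, 7]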

