Fast Low-Precision Computer-Generated Holography on GPU

Computer-generated holography (CGH) is a notoriously difficult computational problem: it simulates numerical diffraction, in which every scene point can affect every hologram pixel. To tackle this challenge, specialized software instructions and hardware solutions have been developed to significantly reduce calculation time and power consumption. In this work, we propose a novel algorithm for high-performance point-based CGH that leverages fixed-point integer representations, the separability of the Fresnel transform, and a new look-up-table-free cosine representation. We report up to a 3-fold speed-up over an optimized floating-point GPU implementation, as well as a 15 dB increase in quality over a state-of-the-art FPGA-based fixed-point integer solution.
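The separability mentioned in the abstract can be illustrated with a small floating-point sketch. It is not the authors' algorithm: it uses no fixed-point arithmetic and no special cosine representation, and the function names, the point format `(x, y, z, amplitude)`, and the example parameters are assumptions made for illustration. It only shows that, under the Fresnel approximation, the phase term of one scene point factors into an x-only and a y-only part, so its contribution to the hologram is an outer product of two 1-D vectors.

```python
import numpy as np

def pixel_coordinates(n, pitch):
    """Centered pixel coordinates (in metres) along one hologram axis."""
    return (np.arange(n) - n / 2) * pitch

def hologram_direct(points, nx, ny, pitch, wavelength):
    """Brute force: evaluate the full 2-D Fresnel phase for every point."""
    xs = pixel_coordinates(nx, pitch)
    ys = pixel_coordinates(ny, pitch)
    gx, gy = np.meshgrid(xs, ys)  # both have shape (ny, nx)
    field = np.zeros((ny, nx), dtype=np.complex128)
    for px, py, pz, amp in points:
        k = np.pi / (wavelength * pz)
        field += amp * np.exp(1j * k * ((gx - px) ** 2 + (gy - py) ** 2))
    return field

def hologram_separable(points, nx, ny, pitch, wavelength):
    """Same result, but each point costs two 1-D exponentials plus an outer product."""
    xs = pixel_coordinates(nx, pitch)
    ys = pixel_coordinates(ny, pitch)
    field = np.zeros((ny, nx), dtype=np.complex128)
    for px, py, pz, amp in points:
        k = np.pi / (wavelength * pz)
        row = np.exp(1j * k * (xs - px) ** 2)  # depends on x only, shape (nx,)
        col = np.exp(1j * k * (ys - py) ** 2)  # depends on y only, shape (ny,)
        field += amp * np.outer(col, row)
    return field
```

For an `nx` by `ny` hologram, the separable form evaluates `nx + ny` complex exponentials per point instead of `nx * ny`; the remaining per-pixel work is a single multiply-accumulate.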


An FPGA Implementation of Deep Spiking Neural Networks for Low-Power and Fast Classification

Neural Computation ◽ 10.1162/neco_a_01245 ◽ 2020 ◽ Vol 32 (1) ◽ pp. 182-204 ◽ Cited By ~ 3

Author(s): Xiping Ju ◽ Biao Fang ◽ Rui Yan ◽ Xiaoliang Xu ◽ Huajin Tang

Keyword(s): Neural Networks ◽ High Performance ◽ Large Scale ◽ Hardware Architecture ◽ Clock Frequency ◽ Data Set ◽ Speed Up ◽ Fast Classification ◽ Spike Signals ◽ GPU Implementation

A spiking neural network (SNN) is a biologically plausible model that processes information using spikes. Training a deep SNN effectively is challenging because spike signals are non-differentiable. Recent advances have shown that high-performance SNNs can be obtained by converting convolutional neural networks (CNNs). However, large-scale SNNs are poorly served by conventional architectures because of the dynamic nature of spiking neurons. In this letter, we propose a hardware architecture that enables efficient implementation of SNNs. All layers in the network are mapped onto one chip so that the computation of different time steps can be done in parallel to reduce latency. We propose a new spiking max-pooling method to reduce computational complexity. In addition, we apply approaches based on shift registers and coarse-grained parallelism to accelerate the convolution operation. We also investigate the effect of different encoding methods on SNN accuracy. Finally, we validate the hardware architecture on the Xilinx Zynq ZCU102. Experimental results on the MNIST data set show that it achieves an accuracy of 98.94% with eight-bit quantized weights. Furthermore, it achieves 164 frames per second (FPS) at a 150 MHz clock frequency and obtains a 41× speed-up compared with a CPU implementation and 22 times lower power than a GPU implementation.
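CNN-to-SNN conversion is commonly motivated by the observation that the firing rate of an integrate-and-fire neuron approximates a ReLU activation. The sketch below illustrates that general idea only; it is not the hardware architecture, the spiking max-pooling method, or the encoding scheme of the paper. The function names, the reset-by-subtraction rule, and the threshold of 1.0 are assumptions chosen for illustration.

```python
def relu(x):
    """Reference CNN activation."""
    return max(0.0, x)

def if_neuron_rate(input_current, steps, threshold=1.0):
    """Simulate one integrate-and-fire neuron driven by a constant input.

    The membrane potential accumulates the input each time step; when it
    reaches the threshold the neuron emits a spike and the threshold is
    subtracted (reset by subtraction). Returns spikes per time step.
    """
    membrane = 0.0
    spikes = 0
    for _ in range(steps):
        membrane += input_current
        if membrane >= threshold:
            spikes += 1
            membrane -= threshold
    return spikes / steps
```

For inputs between 0 and the threshold, the returned rate approaches `relu(input_current)` as `steps` grows, and negative inputs never produce a spike. This is why more time steps generally trade latency for accuracy in rate-coded SNNs.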


Multi–GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL

International Journal of Advances in Telecommunications Electrotechnics Signals and Systems ◽ 10.11601/ijates.v5i2.142 ◽ 2016 ◽ Vol 5 (2) ◽ pp. 101 ◽ Cited By ~ 1

Author(s): Jan Masek ◽ Radim Burget ◽ Lukas Povoda ◽ Malay Kishore Dutta

Keyword(s): High Performance ◽ Nearest Neighbor ◽ Learning Algorithm ◽ Data Sets ◽ K Nearest Neighbor ◽ Graphic Processing Units ◽ Speed Up ◽ Performance Of Algorithm ◽ Dual Core ◽ GPU Implementation

Modern graphics processing units (GPUs) are very useful for computing complex and time-consuming processes, providing high-performance computation at a good price. This paper deals with multi-GPU OpenCL and CUDA implementations of the k-nearest neighbor (k-NN) algorithm. It compares the performance of the OpenCL and CUDA implementations, each of which is suitable for a different number of attributes. The proposed CUDA algorithm achieves an acceleration of up to 880x compared with a single-threaded CPU version. The common k-NN algorithm was modified to be faster when a lower number of neighbors k is set. The performance of the algorithm was verified with two dual-core NVIDIA GeForce GTX 690 GPUs and an Intel Core i7 3770 CPU at 4.1 GHz. Speed-up was measured for one, two, three, and four GPUs. We performed several tests with data sets containing up to 4 million elements with various numbers of attributes.
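For readers unfamiliar with k-NN, a minimal CPU-side sketch of the brute-force classifier is given here. It is not the CUDA or OpenCL implementation from the paper and does not include the modification for small k; the function name `knn_predict` and the assumption that labels are non-negative integers are illustrative choices.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k):
    """Classify each query row by majority vote among its k nearest training rows.

    train_x: (n_train, n_attributes) float array
    train_y: (n_train,) array of non-negative integer class labels
    query_x: (n_query, n_attributes) float array
    """
    # Squared Euclidean distance between every query and every training row.
    diff = query_x[:, None, :] - train_x[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)  # shape (n_query, n_train)

    # Indices of the k smallest distances per query (order within the k is arbitrary).
    nearest = np.argpartition(dist2, k - 1, axis=1)[:, :k]

    # Majority vote over the labels of the selected neighbors.
    votes = train_y[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])
```

Every entry of the distance matrix is independent of the others, which is the kind of structure that typically maps well onto data-parallel hardware. The partial selection via `argpartition` avoids a full sort when k is small.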


State-of-the-Art Admixture for high performance SCC in China

SCC'2005-China - 1st International Symposium on Design, Performance and Use of Self-Consolidating Concrete ◽ 10.1617/2912143624.012 ◽ 2005

Author(s): S. Asmus

Keyword(s): High Performance ◽ State Of The Art


Review on biomass feedstocks, pyrolysis mechanism and physicochemical properties of biochar: State-of-the-art framework to speed up vision of circular bioeconomy

Journal of Cleaner Production ◽ 10.1016/j.jclepro.2021.126645 ◽ 2021 ◽ Vol 297 ◽ pp. 126645

Author(s): Gajanan Sampatrao Ghodake ◽ Surendra Krushna Shinde ◽ Avinash Ashok Kadam ◽ Rijuta Ganesh Saratale ◽ Ganesh Dattatraya Saratale ◽ ...

Keyword(s): Physicochemical Properties ◽ State Of The Art ◽ Pyrolysis Mechanism ◽ Biomass Feedstocks ◽ Speed Up


Multiple objects tracking in the UAV system based on hierarchical deep high-resolution network

Multimedia Tools and Applications ◽ 10.1007/s11042-020-10427-1 ◽ 2021

Author(s): Wei Huang ◽ Xiaoshu Zhou ◽ Mingchao Dong ◽ Huaiyu Xu

Keyword(s): High Resolution ◽ Object Tracking ◽ High Performance ◽ State Of The Art ◽ Class Imbalance ◽ Unified Framework ◽ Multiple Objects ◽ Tracking Process ◽ Objects Tracking ◽ Different Types

Robust and high-performance visual multi-object tracking is a major challenge in computer vision, especially in drone scenarios. In this paper, an online multi-object tracking (MOT) approach for UAV systems is proposed to handle the challenges of small-target detection and class imbalance; it integrates the merits of a deep high-resolution representation network and a data association method in a unified framework. Specifically, while applying a tracking-by-detection architecture to our tracking framework, a Hierarchical Deep High-resolution network (HDHNet) is proposed, which encourages the model to handle different types and scales of targets and to extract more effective and comprehensive features during online learning. The extracted features are then fed into different prediction networks to recognize targets of interest. In addition, an adjustable fusion loss function is proposed that combines focal loss and GIoU loss to address class imbalance and hard samples. During the tracking process, these detection results are passed in each frame to an improved DeepSORT MOT algorithm, which makes full use of target appearance features to match targets one by one in practice. The experimental results on the VisDrone2019 MOT benchmark show that the proposed UAV MOT system achieves the highest accuracy and the best robustness compared with state-of-the-art methods.
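The two ingredients of the fusion loss have standard definitions, sketched below for a single detection. How the paper combines and adjusts them is not stated in the abstract, so the weighted sum in `fusion_loss` and its `box_weight` parameter are assumptions, as are the function names and the box format `(x1, y1, x2, y2)`.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for predicted probability p and label y in {0, 1}.

    The factor (1 - p_t) ** gamma down-weights easy, well-classified samples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def giou(a, b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull

def fusion_loss(p, y, pred_box, true_box, box_weight=1.0):
    """Hypothetical combination: focal loss plus weighted GIoU loss (1 - GIoU)."""
    return focal_loss(p, y) + box_weight * (1.0 - giou(pred_box, true_box))
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes: it becomes negative and decreases as the boxes move apart, so the regression term still carries a gradient signal for missed small targets.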


Purinergic ATP triggers moxibustion-induced local anti-nociceptive effect on inflammatory pain model

Purinergic Signalling ◽ 10.1007/s11302-021-09815-5 ◽ 2021

Author(s): Hai-Yan Yin ◽ Ya-Peng Fan ◽ Juan Liu ◽ Dao-Tong Li ◽ Jing Guo ◽ ...

Keyword(s): High Performance Liquid Chromatography ◽ Liquid Chromatography ◽ Inflammatory Pain ◽ Analgesic Effect ◽ Intramuscular Injection ◽ High Performance ◽ ATP Hydrolysis ◽ Purinergic Signalling ◽ Pain Model ◽ Speed Up

Purinergic signalling through adenosine and its A1 receptors has been demonstrated to be involved in the mechanism of acupuncture (needling therapy) analgesia. However, it remains unclear whether purinergic signalling is also responsible for the local analgesic effect of moxibustion therapy, the predominant member of the acupuncture family of procedures, which can likewise trigger an analgesic effect in pain diseases. In this study, we applied moxibustion to generate an analgesic effect in rats with complete Freund’s adjuvant (CFA)-induced inflammatory pain and detected the purine released from the moxibustion-treated acupoint by high-performance liquid chromatography (HPLC). Intramuscular injection of ARL67156 into the acupoint Zusanli (ST36) to inhibit the breakdown of ATP increased the analgesic effect of moxibustion, while intramuscular injection of ATPase to speed up ATP hydrolysis reduced moxibustion-induced analgesia. These data imply that purinergic ATP at the ST36 acupoint is a potentially beneficial factor for moxibustion-induced analgesia.


Direct laser additive manufacturing of high performance oxide ceramics: a state-of-the-art review

Journal of the European Ceramic Society ◽ 10.1016/j.jeurceramsoc.2021.05.035 ◽ 2021

Author(s): Stefan Pfeiffer ◽ Kevin Florio ◽ Dario Puccio ◽ Marco Grasso ◽ Bianca Maria Colosimo ◽ ...

Keyword(s): Additive Manufacturing ◽ High Performance ◽ State Of The Art ◽ Oxide Ceramics ◽ Laser Additive Manufacturing ◽ Direct Laser


Evaluation of recent advances in recommender systems on Arabic content

Journal Of Big Data ◽ 10.1186/s40537-021-00420-2 ◽ 2021 ◽ Vol 8 (1)

Author(s): Mehdi Srifi ◽ Ahmed Oussous ◽ Ayoub Ait Lahcen ◽ Salma Mouline

Keyword(s): Recommender Systems ◽ High Performance ◽ Large Scale ◽ State Of The Art ◽ Experimental Results ◽ Recent Advances ◽ Research Gap ◽ Text Preprocessing

Various recommender systems (RSs) have been developed in recent years, and many of them have concentrated on English content. Thus, the majority of RSs in the literature were compared on English content, whereas investigations of RSs on content in other languages, such as Arabic, are minimal, and the field of Arabic RSs remains neglected. We therefore aim to fill this research gap by leveraging recent advances in the English RS field. Our main goal is to investigate recent RSs in an Arabic context. To that end, we first selected five state-of-the-art RSs originally devoted to English content and then empirically evaluated their performance on Arabic content. As a result of this work, we first built four publicly available large-scale Arabic datasets for recommendation purposes. Second, various text preprocessing techniques are provided for preparing the constructed datasets. Third, our investigation derived well-argued conclusions about the usage of modern RSs in the Arabic context. The experimental results showed that these systems achieve high performance when applied to Arabic content.


Communication Failure Resilient Distributed Neural Network for Edge Devices

Electronics ◽ 10.3390/electronics10141614 ◽ 2021 ◽ Vol 10 (14) ◽ pp. 1614

Author(s): Jonghun Jeong ◽ Jong Sung Park ◽ Hoeseok Yang

Keyword(s): Neural Network ◽ Neural Networks ◽ High Performance ◽ State Of The Art ◽ Wearable Devices ◽ Communication Failure ◽ Canadian Institute ◽ Multiple Devices ◽ Knowledge Distillation ◽ Partitioning Technique

Recently, the need to run high-performance neural networks (NNs) has been increasing even in resource-constrained embedded systems such as wearable devices. However, because of the high computational and memory requirements of NN applications, it is typically infeasible to execute them on a single device. Instead, it has been proposed to run a single NN application cooperatively on multiple devices, a so-called distributed neural network, in which the workload of a single large NN application is distributed over multiple tiny devices. While this approach can effectively alleviate the computation overhead, existing distributed NN techniques, such as MoDNN, still suffer from heavy traffic between the devices and vulnerability to communication failures. To eliminate such large communication overheads, a knowledge-distillation-based distributed NN called Network of Neural Networks (NoNN) was proposed, which partitions the filters in the final convolutional layer of the original NN into multiple independent subsets and derives a smaller NN from each subset. However, NoNN also has limitations: the partitioning result may be unbalanced, and it considerably compromises the correlation between filters in the original NN, which may result in unacceptable accuracy degradation in the case of communication failure. In this paper, to overcome these issues, we propose to enhance the partitioning strategy of NoNN in two respects. First, we increase the redundancy of the filters used to derive the multiple smaller NNs by means of averaging, which increases the immunity of the distributed NN to communication failure. Second, we propose a novel partitioning technique, modified from eigenvector-based partitioning, that preserves the correlation between filters as much as possible while keeping the number of filters distributed to each device consistent. Through extensive experiments with the CIFAR-100 (Canadian Institute For Advanced Research-100) dataset, it has been observed that the proposed approach maintains high inference accuracy (over 70%, a 1.53× improvement over the state-of-the-art approach), on average, even when half of the eight devices in a distributed NN fail to deliver their partial inference results.
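The idea of grouping correlated filters while keeping the groups equal in size can be illustrated with a generic technique. The sketch below is neither NoNN's partitioner nor the modified method proposed in the paper; it is ordinary spectral bisection on a graph whose edge weights are absolute activation correlations, with the split forced to be balanced by sorting the Fiedler vector. The function name, the activation-matrix input, and the two-way split are all assumptions made for illustration.

```python
import numpy as np

def balanced_spectral_bisection(activations):
    """Split filters into two equally sized, internally correlated groups.

    activations: (n_samples, n_filters) array of per-filter responses.
    Returns a boolean mask of length n_filters; True marks the first group.
    """
    # Edge weight between two filters = absolute correlation of their responses.
    weights = np.abs(np.corrcoef(activations, rowvar=False))
    np.fill_diagonal(weights, 0.0)

    # Graph Laplacian; eigh returns eigenvalues in ascending order, so column 1
    # is the eigenvector of the second-smallest eigenvalue (the Fiedler vector).
    laplacian = np.diag(weights.sum(axis=1)) - weights
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]

    # Thresholding at zero could give unequal groups; taking the lower half of
    # the sorted entries enforces an equal number of filters per group.
    order = np.argsort(fiedler)
    mask = np.zeros(len(order), dtype=bool)
    mask[order[: len(order) // 2]] = True
    return mask
```

Applying the same split recursively to each half would give four or eight equally sized groups, one per device, under the same assumptions.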
