Accelerating Deep Neuroevolution on Distributed FPGAs for Reinforcement Learning Problems

Reinforcement learning, augmented by the representational power of deep neural networks, has shown promising results on high-dimensional problems such as game playing and robotic control. However, the sequential nature of these problems poses a fundamental challenge for computational efficiency. Recently, alternative approaches such as evolutionary strategies and deep neuroevolution have demonstrated competitive results with faster training times on distributed CPU cores. Here we report record training times (running at about 1 million frames per second) for Atari 2600 games using deep neuroevolution implemented on distributed FPGAs. The acceleration comes from combining hardware implementations of the game console, image preprocessing, and the neural network in an optimized pipeline, multiplied by system-level parallelism. These results are the first application demonstration on the IBM Neural Computer, a custom-designed system that consists of 432 Xilinx FPGAs interconnected in a 3D mesh network topology. In addition to high performance, the experiments also showed improved accuracy for all games compared to the CPU implementation of the same algorithm.
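The abstract does not spell out the neuroevolution algorithm, so the following is only a minimal sketch of one common form of evolutionary search: a mutation-only genetic algorithm with truncation selection and elitism. The function names, the toy fitness function, and all hyperparameters (`pop_size`, `n_parents`, `sigma`) are illustrative assumptions, not the authors' FPGA implementation. In a reinforcement learning setting, the fitness call would be a full game episode, and because each individual is evaluated independently, that step can be spread across many workers.

```python
import random


def evolve(fitness, dim, pop_size=20, n_parents=5, sigma=0.1,
           generations=30, seed=0):
    """Mutation-only genetic algorithm with truncation selection and elitism.

    Returns the best parameter vector and the best fitness per generation.
    """
    rng = random.Random(seed)
    population = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Evaluate and rank every individual (independent, hence parallelizable).
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[:n_parents]  # truncation selection
        population = [ranked[0]]      # elitism: keep the best unchanged
        while len(population) < pop_size:
            parent = rng.choice(parents)
            # Gaussian mutation of every parameter.
            child = [w + sigma * rng.gauss(0.0, 1.0) for w in parent]
            population.append(child)
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


def toy_fitness(params):
    """Hypothetical stand-in for an episode return: peak at all-ones."""
    return -sum((w - 1.0) ** 2 for w in params)


best, history = evolve(toy_fitness, dim=4)
```

Because the elite individual is carried over unchanged, the best fitness in `history` never decreases from one generation to the next.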

Deep Reinforcement Learning in Ice Hockey for Context-Aware Player Evaluation

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/478 ◽

2018 ◽

Cited By ~ 9

Author(s):

Guiliang Liu ◽

Oliver Schulte

Keyword(s):

Reinforcement Learning ◽

Empirical Evaluation ◽

Professional Sports ◽

Ice Hockey ◽

New Approach ◽

The Neural Network ◽

Overall Performance ◽

Game Context ◽

Player Performance ◽

Q Function

A variety of machine learning models have been proposed to assess the performance of players in professional sports. However, they have only a limited ability to model how player performance depends on the game context. This paper proposes a new approach to capturing game context: we apply Deep Reinforcement Learning (DRL) to learn an action-value Q-function from 3M play-by-play events in the National Hockey League (NHL). The neural network representation integrates both continuous context signals and game history, using a possession-based LSTM. The learned Q-function is used to value players' actions under different game contexts. To assess a player's overall performance, we introduce a novel Game Impact Metric (GIM) that aggregates the values of the player's actions. Empirical evaluation shows that GIM is consistent throughout a season and correlates highly with standard success measures and future salary.
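The aggregation step described in the abstract can be illustrated with a small sketch. It assumes each action has already been given a scalar value derived from the learned Q-function, and it simply sums those values per player. The event tuples, player names, and the function name `aggregate_action_values` are hypothetical; the abstract does not state exactly how per-action values are computed.

```python
from collections import defaultdict


def aggregate_action_values(events):
    """Sum per-action values for each player.

    `events` is an iterable of (player, value) pairs, where `value` is
    assumed to come from a learned Q-function (not computed here).
    """
    totals = defaultdict(float)
    for player, value in events:
        totals[player] += value
    return dict(totals)


# Hypothetical play-by-play values, for illustration only.
events = [
    ("player_a", 0.5),
    ("player_b", -0.25),
    ("player_a", 0.25),
    ("player_b", 0.75),
]
totals = aggregate_action_values(events)
```

Because the value of an action depends on the game context through the Q-function, the same action type can contribute differently to a player's total in different situations.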

Deep Learning of COVID-19 Chest X-Rays: New Models or Fine Tuning?

10.36227/techrxiv.12656948.v1 ◽

2020 ◽

Author(s):

Tuan Pham

Keyword(s):

Deep Learning ◽

High Performance ◽

Data Augmentation ◽

Dominant Role ◽

Care Center ◽

Characteristic Curve ◽

Fine Tuning ◽

Urgent Care ◽

X Rays ◽

Training Time

Chest X-rays have been found to be very promising for assessing COVID-19 patients, especially for resolving emergency-department and urgent-care-center overcapacity. Deep-learning (DL) methods in artificial intelligence (AI) play a dominant role as high-performance classifiers in the detection of the disease using chest X-rays. While many new DL models are being developed for this purpose, this study aimed to investigate the fine tuning of pretrained convolutional neural networks (CNNs) for the classification of COVID-19 using chest X-rays. Three pretrained CNNs (AlexNet, GoogleNet, and SqueezeNet) were selected and fine-tuned without data augmentation to carry out 2-class and 3-class classification tasks using 3 public chest X-ray databases. In comparison with other recently developed DL models, the 3 pretrained CNNs achieved very high classification results in terms of accuracy, sensitivity, specificity, precision, F1 score, and area under the receiver-operating-characteristic curve. AlexNet, GoogleNet, and SqueezeNet require the least training time among pretrained DL models, yet with a suitable selection of training parameters these networks can achieve excellent classification results without data augmentation. The findings contribute to the urgent need for harnessing the pandemic by facilitating the deployment of AI tools that are fully automated and readily available in the public domain for rapid implementation.

VLIW DSP-Based Low-Level Instruction Scheme of Givens QR Decomposition for Real-Time Processing

Journal of Circuits System and Computers ◽

10.1142/s0218126617501298 ◽

2017 ◽

Vol 26 (09) ◽

pp. 1750129 ◽

Cited By ~ 2

Author(s):

Mohamed Najoui ◽

Mounir Bahtat ◽

Anas Hatim ◽

Said Belkouch ◽

Noureddine Chabini

Keyword(s):

High Performance ◽

Qr Decomposition ◽

Numerical Linear Algebra ◽

Instruction Level Parallelism ◽

Management Approach ◽

Real Time Processing ◽

Low Level ◽

Processor Architectures ◽

Efficient Data ◽

Level Parallelism

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in signal processing applications, and its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned to take advantage of the architectural features of these new processors. However, in some processor architectures, such as very long instruction word (VLIW), compiler efficiency is not enough to make effective use of the available computational resources. This paper presents an efficient and optimized approach to implementing Givens QRD on a low-power platform based on a VLIW architecture. To overcome the compiler's limited ability to parallelize most of the Givens arithmetic operations, we propose a low-level instruction scheme that maximizes the parallelism rate and minimizes clock cycles. The key contributions of this work are as follows: (i) a new parallel and fast design of the Givens algorithm based on the VLIW features (i.e., instruction-level parallelism (ILP) and data-level parallelism (DLP)), including the cache memory properties; (ii) an efficient data management approach to avoid cache misses and memory bank conflicts. Two DSP platforms, C6678 and AK2H12, were used as targets for implementation. The introduced parallel QR implementation method achieves, on average, more than 12× and 6× speedups over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme is at least 3.65 and 2.5 times faster than recent CPU and DSP implementations, respectively.
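For readers unfamiliar with the kernel being optimized, here is a minimal pure-Python sketch of the textbook Givens-rotation QR decomposition. It is a plain sequential reference version, not the paper's VLIW instruction scheme. The function names `givens_qr` and `matmul` and the example matrix are illustrative assumptions.

```python
import math


def givens_qr(a):
    """Factor an m x n matrix (list of rows) as A = Q R using Givens rotations.

    Q is m x m orthogonal and R is m x n upper triangular.
    """
    m, n = len(a), len(a[0])
    r = [[float(v) for v in row] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero column j below the diagonal, working from the bottom row up.
        for i in range(m - 1, j, -1):
            if r[i][j] == 0.0:
                continue
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            c, s = x / h, y / h
            # Apply the rotation G to rows i-1 and i of R (R <- G R).
            for k in range(n):
                top, bottom = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * top + s * bottom
                r[i][k] = -s * top + c * bottom
            # Accumulate Q <- Q G^T so that Q R stays equal to A.
            for k in range(m):
                left, right = q[k][i - 1], q[k][i]
                q[k][i - 1] = c * left + s * right
                q[k][i] = -s * left + c * right
    return q, r


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


a = [[6.0, 5.0, 0.0], [5.0, 1.0, 4.0], [0.0, 4.0, 3.0]]
q, r = givens_qr(a)
product = matmul(q, r)  # should reproduce `a` up to rounding error
```

Each rotation touches only two rows, so rotations acting on disjoint row pairs are independent of each other. That independence is the kind of structure a parallel implementation can exploit.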

RAMAN: Reinforcement Learning Inspired Algorithm for Mapping Applications onto Mesh Network-on-Chip

10.1109/slip52707.2021.00019 ◽

2021 ◽

Author(s):

Jitesh Choudhary ◽

Soumya J ◽

Linga Reddy Cenkeramaddi

Keyword(s):

Reinforcement Learning ◽

Network On Chip ◽

Mesh Network ◽

On Chip

Invited Talk: Increasing Representational Power and Scaling Inference in Reinforcement Learning

Lecture Notes in Computer Science - Recent Advances in Reinforcement Learning ◽

10.1007/978-3-642-29946-9_2 ◽

2012 ◽

pp. 2-2

Author(s):

Kristian Kersting

Keyword(s):

Reinforcement Learning ◽

Representational Power

Deep Reinforcement Learning with Different Rewards for Scheduling in High-Performance Computing Systems

10.1109/mwscas47672.2021.9531852 ◽

2021 ◽

Author(s):

Md Farhadur Reza ◽

Bo Zhao

Keyword(s):

Reinforcement Learning ◽

High Performance Computing ◽

High Performance ◽

Computing Systems ◽

Performance Computing

Signal processing algorithm for neural networks with integrodifferential splines as an activation function and its particular case of image classification

Highly available systems ◽

10.18127/j20729472-202102-02 ◽

2021 ◽

Author(s):

T.K. Biryukova

Keyword(s):

Neural Network ◽

Neural Networks ◽

Image Classification ◽

Activation Function ◽

Experimental Comparison ◽

Training Time ◽

Operation Speed ◽

The Neural Network ◽

Linear Algebraic Equations ◽

Network Operation

In classic neural networks, the trainable parameters include only the weights of neurons. This paper proposes parabolic integrodifferential splines (ID-splines), developed by the author, as a new kind of activation function (AF) for neural networks, where the ID-spline coefficients are also trainable parameters. The parameters of the ID-spline AF, together with the weights of neurons, vary during training in order to minimize the loss function, thus reducing the training time and increasing the operation speed of the neural network. The newly developed algorithm enables software implementation of the ID-spline AF as a tool for neural network construction, training and operation. It is proposed to use the same ID-spline AF for neurons in the same layer, but different ones for different layers. In this case, the parameters of the ID-spline AF for a particular layer change during the training process independently of the activation functions (AFs) of other network layers. In order to comply with the continuity condition for the derivative of the parabolic ID-spline on the interval $(x_0, x_n)$, its parameters $f_i$ ($i = 0, \ldots, n$) should be calculated using a tridiagonal system of linear algebraic equations. To solve the system it is necessary to use two more equations arising from the boundary conditions for specific problems. For example, the values of the grid function (if they are known) at the points $x_0$ and $x_n$ may be used for solving this system: $f_0 = f(x_0)$, $f_n = f(x_n)$. The parameters $I_i^{i+1}$ ($i = 0, \ldots, n-1$) are used as trainable parameters of neural networks. The grid boundaries and spacing of the nodes of the ID-spline AF are best chosen experimentally. The optimal selection of grid nodes improves the quality of results produced by the neural network. The formula for a parabolic ID-spline is such that the complexity of the calculations does not depend on whether the grid of nodes is uniform or non-uniform.
An experimental comparison was carried out of image classification results on the popular FashionMNIST dataset for convolutional neural networks with the ID-spline AFs and with the well-known AF $\mathrm{ReLU}(x) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}$. The results reveal that the use of the ID-spline AFs provides better accuracy of neural network operation than the ReLU AF. The training time for a network with two convolutional layers and two ID-spline AFs is about 2 times longer than with two instances of the ReLU AF. Doubling of the training time due to the complexity of the ID-spline formula is an acceptable price for significantly better accuracy of the network. The difference in operation speed between networks with ID-spline and ReLU AFs will be negligible. The use of trainable ID-spline AFs makes it possible to simplify the architecture of neural networks without losing their efficiency. Modifying well-known neural networks (ResNet etc.) by replacing traditional AFs with ID-spline AFs is a promising approach to increasing the accuracy of neural network operation. In a majority of cases, such a substitution does not require training the network from scratch, because neuron weights pre-trained on large datasets and supplied by standard software libraries for neural network construction can be reused, thus substantially shortening the training time.
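The abstract refers to a tridiagonal system of linear algebraic equations for the spline parameters but does not reproduce its coefficients. As a generic illustration of how such systems are usually solved in linear time, here is a sketch of the standard Thomas algorithm. The coefficient values in the example are made up for the demonstration and are not the ID-spline system. The function name and the argument convention (`lower[0]` and `upper[-1]` are ignored) are assumptions of this sketch.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    Row i of the system reads
        lower[i] * x[i-1] + diag[i] * x[i] + upper[i] * x[i+1] = rhs[i],
    with lower[0] and upper[-1] unused. All four lists have length n.
    No pivoting is done, so the matrix should be, for example,
    diagonally dominant.
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    # Forward sweep: eliminate the sub-diagonal.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


# Made-up 3 x 3 example whose exact solution is [1, 2, 3].
solution = solve_tridiagonal(
    lower=[0.0, 1.0, 1.0],
    diag=[2.0, 2.0, 2.0],
    upper=[1.0, 1.0, 0.0],
    rhs=[4.0, 8.0, 8.0],
)
```

The forward sweep and back substitution each visit every row once, so the cost grows linearly with the number of grid nodes.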

High Performance Storage for Big Data Analytics and Visualization

Advances in Data Mining and Database Management - Handbook of Research on Big Data Storage and Visualization Techniques ◽

10.4018/978-1-5225-3142-5.ch010 ◽

2018 ◽

pp. 254-275

Author(s):

Armando Fandango ◽

William Rivera

Keyword(s):

Big Data ◽

High Speed ◽

High Performance ◽

File System ◽

Predictive Analytics ◽

Big Data Analytics ◽

File Systems ◽

Distributed Applications ◽

System Level ◽

File Formats

Scientific Big Data being gathered at exascale needs to be stored, retrieved and manipulated. The storage stack for scientific Big Data includes a file system at the system level for physical organization of the data, and a file format and input/output (I/O) system at the application level for logical organization of the data; both must be of the high-performance variety to work at exascale. The high-performance file system is designed with concurrent access, high-speed transmission and fault tolerance characteristics. High-performance file formats and I/O are designed to give parallel and distributed applications easy and fast access to Big Data. These specialized file formats make it easier to store and access Big Data for scientific visualization and predictive analytics. This chapter provides a brief review of the characteristics of high-performance file systems such as Lustre and GPFS, and high-performance file formats such as HDF5, NetCDF, MPI-IO, and HDFS.

Advanced SiP Packaging Technologies of IPD for Mobile Applications

Additional Conferences (Device Packaging HiTEC HiTEN & CICMT) ◽

10.4071/2010dpc-wp31 ◽

2010 ◽

Vol 2010 (DPC) ◽

pp. 1-20

Author(s):

Geun Sik Kim ◽

Kai Liu ◽

Flynn Carson ◽

Seung Wook Yoon ◽

Meenakshi Padmanathan

Keyword(s):

High Performance ◽

Low Cost ◽

Autonomous Systems ◽

System Level ◽

Design And Technology ◽

Wafer Level ◽

Future Technology ◽

Rf Transceivers ◽

And Performance ◽

3D Stacking

IPD technology was originally developed as a way to replace bulky discrete passive components, but it's now gaining popularity in ESD/EMI protection applications, as well as in RF, high-brightness LED silicon sub-mounts, and digital and mixed-signal devices. Already well known as a key enabler of system-in-packages (SiPs), IPDs enable the assembly of increasingly complete and autonomous systems with the integration of diverse electronic functions such as sensors, RF transceivers, MEMS, power amplifiers, power management units, and digital processors. The application area for IPD will continue to evolve, especially as new packaging technologies, such as flip chip, 3D stacking, and wafer-level packaging, become available to provide vertical interconnections within the IPD. New applications like silicon interposers will become increasingly significant to the market. Currently the IPD market is being driven primarily by RF or wireless packages and applications including, but not limited to, cell phones, WiFi, GPS, WiMAX, and WiBro. In particular, applications and products in the emerging RF CMOS market that require low cost, smaller size, and high performance are driving demand. In order to get the right products in size and performance, packaging design and technology should be considered in device integration and implemented together in IPD designs. In addition, a comprehensive understanding of electrical and mechanical properties in component- and system-level design is important. This paper will highlight some of the recent advancements in SiP technology for IPD and integration, as well as what is being developed to address future technology requirements in IPD SiP solutions. The advantages and applications of SiP solutions for IPD will be presented with several examples of IPD products. The design, assembly and packaging challenges and performance characteristics will also be discussed.

End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013387 ◽

2019 ◽

Vol 33 ◽

pp. 3387-3395 ◽

Cited By ~ 21

Author(s):

Richard Cheng ◽

Gábor Orosz ◽

Richard M. Murray ◽

Joel W. Burdick

Keyword(s):

Reinforcement Learning ◽

System Dynamics ◽

Learning Process ◽

High Performance ◽

Continuous Control ◽

Barrier Functions ◽

Synthesis Algorithm ◽

Vehicle Communication ◽

Model Free ◽

Vehicle To Vehicle

Reinforcement Learning (RL) algorithms have found limited success beyond simulated applications, and one main reason is the absence of safety guarantees during the learning process. Real world systems would realistically fail or break before an optimal controller can be learned. To address this issue, we propose a controller architecture that combines (1) a model-free RL-based controller with (2) model-based controllers utilizing control barrier functions (CBFs) and (3) online learning of the unknown system dynamics, in order to ensure safety during learning. Our general framework leverages the success of RL algorithms to learn high-performance controllers, while the CBF-based controllers both guarantee safety and guide the learning process by constraining the set of explorable policies. We utilize Gaussian Processes (GPs) to model the system dynamics and its uncertainties. Our novel controller synthesis algorithm, RL-CBF, guarantees safety with high probability during the learning process, regardless of the RL algorithm used, and demonstrates greater policy exploration efficiency. We test our algorithm on (1) control of an inverted pendulum and (2) autonomous car-following with wireless vehicle-to-vehicle communication, and show that our algorithm attains much greater sample efficiency in learning than other state-of-the-art algorithms and maintains safety during the entire learning process.
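The idea of a barrier function constraining a learned controller can be shown on a deliberately tiny example. This sketch is not the RL-CBF algorithm: it assumes fully known one-dimensional dynamics dx/dt = u, uses no Gaussian Process model, and replaces the learned policy with a constant "nominal" command. The barrier h(x) = x_max - x, the gain `alpha`, and all function names are assumptions chosen for illustration. The CBF condition dh/dt >= -alpha * h(x) reduces here to u <= alpha * (x_max - x), so the filter simply clips the nominal command.

```python
def cbf_filter(x, u_nominal, x_max, alpha):
    """Return the command closest to u_nominal that satisfies the CBF condition.

    Barrier: h(x) = x_max - x, dynamics: dx/dt = u.
    Condition dh/dt >= -alpha * h(x) is equivalent to u <= alpha * (x_max - x).
    """
    return min(u_nominal, alpha * (x_max - x))


def simulate(x0, u_nominal, x_max=1.0, alpha=2.0, dt=0.05, steps=200):
    """Euler-integrate dx/dt = u with the safety filter applied at every step."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        u = cbf_filter(x, u_nominal, x_max, alpha)
        x = x + dt * u
        trajectory.append(x)
    return trajectory


# An aggressive nominal command that would overshoot x_max without the filter.
trajectory = simulate(x0=0.0, u_nominal=5.0)
```

With these settings the state approaches `x_max` from below and never crosses it, while commands that already satisfy the condition pass through unchanged. In the discrete-time simulation this relies on `alpha * dt <= 1`.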
