Learning 2-Opt Heuristics for Routing Problems via Deep Reinforcement Learning

Paulo da Costa; Jason Rhuggenaath; Yingqian Zhang; Alp Akcay; Uzay Kaymak

doi:10.1007/s42979-021-00779-2

SN Computer Science ◽

10.1007/s42979-021-00779-2 ◽

2021 ◽

Vol 2 (5) ◽

Author(s):

Paulo da Costa ◽

Jason Rhuggenaath ◽

Yingqian Zhang ◽

Alp Akcay ◽

Uzay Kaymak

Keyword(s):

Deep Learning ◽

Reinforcement Learning ◽

State Of The Art ◽

Gradient Algorithm ◽

Routing Problems ◽

Routing Problem ◽

Local Search Heuristic ◽

Policy Gradient ◽

Previous State ◽

The Traveling Salesman Problem

Recent works using deep learning to solve routing problems such as the traveling salesman problem (TSP) have focused on learning construction heuristics. Such approaches find good quality solutions but require additional procedures such as beam search and sampling to improve solutions and achieve state-of-the-art performance. However, few studies have focused on improvement heuristics, where a given solution is improved until reaching a near-optimal one. In this work, we propose to learn a local search heuristic based on 2-opt operators via deep reinforcement learning. We propose a policy gradient algorithm to learn a stochastic policy that selects 2-opt operations given a current solution. Moreover, we introduce a policy neural network that leverages a pointing attention mechanism, which can be easily extended to more general k-opt moves. Our results show that the learned policies can improve even over random initial solutions and approach near-optimal solutions faster than previous state-of-the-art deep learning methods for the TSP. We also show we can adapt the proposed method to two extensions of the TSP: the multiple TSP and the Vehicle Routing Problem, achieving results on par with classical heuristics and learned methods.
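For readers unfamiliar with the 2-opt operator mentioned in this abstract, the following is a minimal sketch of a single 2-opt move on a TSP tour: two edges are removed and the segment between them is reversed. The function names (`tour_length`, `two_opt_move`) and the toy instance are illustrative assumptions, not taken from the paper. In the paper a learned policy selects the positions `i` and `j`; here they are picked by hand.

```python
import math


def tour_length(points, tour):
    """Total length of the closed tour visiting `points` in the order given by `tour`."""
    return sum(
        math.dist(points[tour[k]], points[tour[(k + 1) % len(tour)]])
        for k in range(len(tour))
    )


def two_opt_move(tour, i, j):
    """Return a new tour with the segment tour[i..j] (inclusive) reversed."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]


# Toy instance (assumption): the four corners of a unit square,
# visited in an order whose edges cross each other.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
crossing = [0, 2, 1, 3]

# Reversing positions 1..2 removes the crossing edges.
improved = two_opt_move(crossing, 1, 2)

print(tour_length(points, crossing), tour_length(points, improved))
```

On this instance the move turns the self-crossing tour into the square's perimeter, which is shorter. A classical 2-opt local search would try all `(i, j)` pairs and keep improving moves; the abstract describes replacing that exhaustive choice with a stochastic policy.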

Deterministic Value-Policy Gradients

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5732 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3316-3323

Author(s):

Qingpeng Cai ◽

Ling Pan ◽

Pingzhong Tang

Keyword(s):

Reinforcement Learning ◽

State Of The Art ◽

Learning Algorithms ◽

Infinite Horizon ◽

Gradient Algorithm ◽

Continuous Control ◽

Model Bias ◽

Model Free ◽

Policy Gradient ◽

Analytical Gradients

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with an infinite horizon. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this theoretical guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which the number of rollout steps of the analytical gradients through the learned model trades off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. We finally conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.

Deep Deterministic Policy Gradient Algorithm Based on Convolutional Block Attention for Autonomous Driving

Symmetry ◽

10.3390/sym13061061 ◽

2021 ◽

Vol 13 (6) ◽

pp. 1061

Author(s):

Yanliang Jin ◽

Qianhong Liu ◽

Liquan Shen ◽

Leiji Zhu

Keyword(s):

Reinforcement Learning ◽

Supervised Learning ◽

State Of The Art ◽

Autonomous Driving ◽

Attention Mechanism ◽

Gradient Algorithm ◽

Excellent Performance ◽

Average Speed ◽

The Road ◽

Policy Gradient

Autonomous driving based on deep reinforcement learning algorithms is a research hotspot. Traditional autonomous driving requires human involvement, and autonomous driving algorithms based on supervised learning must be trained in advance using human experience. To deal with autonomous driving problems, this paper proposes an improved end-to-end deep deterministic policy gradient (DDPG) algorithm based on the convolutional block attention mechanism, called the multi-input attention prioritized deep deterministic policy gradient algorithm (MAPDDPG). Both the actor network and the critic network of the model have the same structure with symmetry. Meanwhile, the attention mechanism is introduced to help the vehicles focus on useful environmental information. The experiments are conducted in The Open Racing Car Simulator (TORCS), and the results of five experiment runs on the test tracks are averaged to obtain the final result. Compared with the state-of-the-art algorithm, the maximum reward increases from 62,207 to 116,347, and the average speed increases from 135 km/h to 193 km/h, while the number of success episodes to complete a circle increases from 96 to 147. Also, the variance of the distance from the vehicle to the center of the road is compared, and the result indicates that the variance of the DDPG is 0.6 m while that of the MAPDDPG is only 0.2 m. The above results indicate that the proposed MAPDDPG achieves excellent performance.

Toward Self-Driving Bicycles Using State-of-the-Art Deep Reinforcement Learning Algorithms

Symmetry ◽

10.3390/sym11020290 ◽

2019 ◽

Vol 11 (2) ◽

pp. 290 ◽

Cited By ~ 4

Author(s):

SeungYoon Choi ◽

Tuyen Le ◽

Quang Nguyen ◽

Md Layek ◽

SeungGwan Lee ◽

...

Keyword(s):

Reinforcement Learning ◽

Deep Neural Network ◽

Learning Algorithm ◽

State Of The Art ◽

The Other ◽

Gradient Algorithm ◽

Reward Function ◽

Policy Gradient ◽

Policy Optimization ◽

Start Location

In this paper, we propose a controller for a bicycle using the DDPG (Deep Deterministic Policy Gradient) algorithm, which is a state-of-the-art deep reinforcement learning algorithm. We use a reward function and a deep neural network to build the controller. By using the proposed controller, a bicycle can not only be stably balanced but also travel to any specified location. We confirm that the controller with DDPG shows better performance than the other baselines such as Normalized Advantage Function (NAF) and Proximal Policy Optimization (PPO). For the performance evaluation, we implemented the proposed algorithm in various settings such as fixed and random speed, start location, and destination location.

Preceding vehicle following algorithm with human driving characteristics

Proceedings of the Institution of Mechanical Engineers Part D Journal of Automobile Engineering ◽

10.1177/0954407020981546 ◽

2021 ◽

pp. 095440702098154

Author(s):

Feng Pan ◽

Hong Bao

Keyword(s):

Reinforcement Learning ◽

Weight Vector ◽

Gradient Algorithm ◽

Inner Product ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Human Driver ◽

Policy Gradient ◽

Preceding Vehicle ◽

Action Spaces

This paper proposes a new approach of using reinforcement learning (RL) to train an agent to perform the task of vehicle following with human driving characteristics. We draw on the idea of inverse reinforcement learning to design the reward function of the RL model. The factors that need to be weighed in vehicle following were vectorized into reward vectors, and the reward function was defined as the inner product of the reward vector and weights. Driving data of human drivers was collected and analyzed to obtain the true reward function. The RL model was trained with the deterministic policy gradient algorithm because the state and action spaces are continuous. We adjusted the weight vector of the reward function so that the value vector of the RL model could continuously approach that of a human driver. After dozens of rounds of training, we selected the policy with the nearest value vector to that of a human driver and tested it in the PanoSim simulation environment. The results showed the desired performance for the task of an agent following the preceding vehicle safely and smoothly.
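The reward construction described in this abstract, an inner product of a reward vector and a weight vector, can be sketched as follows. This is only an illustration under stated assumptions: the function name `reward`, the three example factors, and all numeric values are hypothetical and do not come from the paper.

```python
def reward(features, weights):
    """Scalar reward as the inner product of a reward (feature) vector and a weight vector."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights))


# Hypothetical per-step factors for vehicle following (assumed for illustration),
# e.g. a gap-keeping term, a relative-speed penalty, and a smoothness term.
features = [0.5, -1.0, 2.0]

# Hypothetical weight vector; the abstract describes adjusting such weights so the
# agent's value vector approaches that of a human driver.
weights = [2.0, 1.0, 0.5]

r = reward(features, weights)
print(r)
```

Keeping the reward linear in the weights means that changing the weight vector changes how the same measured factors are traded off, without redefining the factors themselves.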

Escape from solutions stagnation. A Study on Ant System solving TSP

Creative Mathematics and Informatics ◽

10.37193/cmi.2019.01.11 ◽

2019 ◽

Vol 28 (1) ◽

pp. 77-83

Author(s):

CAMELIA-M. PINTEA ◽

BARNA IANTOVICS ◽

PETRICA POP ◽

MATTHIAS DEHMER ◽

...

Keyword(s):

Traveling Salesman Problem ◽

Traveling Salesman ◽

Current Work ◽

Ant System ◽

Routing Problems ◽

Routing Problem ◽

The Traveling Salesman Problem ◽

Exit State

Nowadays, routing problems arise in different contexts, such as the distribution of goods and the transportation of commodities and people. Routing problems deal with traveling along a given network in an optimal way. One of the major goals in optimization, including optimization of routing problems, is to reduce the time of stagnation by finding an exit state. The current work is a study of the ability of ants to escape from solution stagnation on a particular routing problem, the Traveling Salesman Problem.

Robust Multi-Agent Reinforcement Learning via Minimax Deep Deterministic Policy Gradient

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014213 ◽

2019 ◽

Vol 33 ◽

pp. 4213-4220 ◽

Cited By ~ 12

Author(s):

Shihui Li ◽

Yi Wu ◽

Xinyue Cui ◽

Honghua Dong ◽

Fei Fang ◽

...

Keyword(s):

Reinforcement Learning ◽

Gradient Algorithm ◽

Training Environment ◽

Local Optima ◽

Continuous Action ◽

Agent Learning ◽

Policy Gradient ◽

Multi Agent ◽

Continuous Actions ◽

Computational Intractability

Despite the recent advances of deep reinforcement learning (DRL), agents trained by DRL tend to be brittle and sensitive to the training environment, especially in multi-agent scenarios. In the multi-agent setting, a DRL agent’s policy can easily get stuck in a poor local optimum w.r.t. its training partners – the learned policy may be only locally optimal to other agents’ current policies. In this paper, we focus on the problem of training robust DRL agents with continuous actions in the multi-agent learning setting so that the trained agents can still generalize when their opponents’ policies change. To tackle this problem, we propose a new algorithm, MiniMax Multi-agent Deep Deterministic Policy Gradient (M3DDPG), with the following contributions: (1) we introduce a minimax extension of the popular multi-agent deep deterministic policy gradient algorithm (MADDPG), for robust policy learning; (2) since the continuous action space leads to computational intractability in our minimax learning objective, we propose Multi-Agent Adversarial Learning (MAAL) to efficiently solve our proposed formulation. We empirically evaluate our M3DDPG algorithm in four mixed cooperative and competitive multi-agent environments, and the agents trained by our method significantly outperform existing baselines.

Stealthy and Efficient Adversarial Attacks against Deep Reinforcement Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6047 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5883-5891

Author(s):

Jianwen Sun ◽

Tianwei Zhang ◽

Xiaofei Xie ◽

Lei Ma ◽

Yan Zheng ◽

...

Keyword(s):

Deep Learning ◽

Reinforcement Learning ◽

Critical Point ◽

State Of The Art ◽

Great Success ◽

Severe Damage ◽

Minimal Set ◽

Adversarial Attack ◽

Attack Strategy ◽

Critical Moments

Adversarial attacks against conventional Deep Learning (DL) systems and algorithms have been widely studied, and various defenses were proposed. However, the possibility and feasibility of such attacks against Deep Reinforcement Learning (DRL) are less explored. As DRL has achieved great success in various complex tasks, designing effective adversarial attacks is an indispensable prerequisite towards building robust DRL algorithms. In this paper, we introduce two novel adversarial attack techniques to stealthily and efficiently attack the DRL agents. These two techniques enable an adversary to inject adversarial samples in a minimal set of critical moments while causing the most severe damage to the agent. The first technique is the critical point attack: the adversary builds a model to predict the future environmental states and agent's actions, assesses the damage of each possible attack strategy, and selects the optimal one. The second technique is the antagonist attack: the adversary automatically learns a domain-agnostic model to discover the critical moments of attacking the agent in an episode. Experimental results demonstrate the effectiveness of our techniques. Specifically, to successfully attack the DRL agent, our critical point technique only requires 1 (TORCS) or 2 (Atari Pong and Breakout) steps, and the antagonist technique needs fewer than 5 steps (4 Mujoco tasks), which are significant improvements over state-of-the-art methods.

Attention-Based Fault-Tolerant Approach for Multi-Agent Reinforcement Learning Systems

Entropy ◽

10.3390/e23091133 ◽

2021 ◽

Vol 23 (9) ◽

pp. 1133

Author(s):

Shanzhi Gu ◽

Mingyang Geng ◽

Long Lan

Keyword(s):

Reinforcement Learning ◽

Noise Intensity ◽

Fault Tolerant ◽

State Of The Art ◽

Learning Systems ◽

Noisy Environments ◽

Time Step ◽

Malicious Behavior ◽

Previous State ◽

Multi Agent

The aim of multi-agent reinforcement learning systems is to provide interacting agents with the ability to collaboratively learn and adapt to the behavior of other agents. Typically, an agent receives its private observations providing a partial view of the true state of the environment. However, in realistic settings, the harsh environment might cause one or more agents to show arbitrarily faulty or malicious behavior, which may suffice to make the current coordination mechanisms fail. In this paper, we study a practical scenario of multi-agent reinforcement learning systems considering the security issues in the presence of agents with arbitrarily faulty or malicious behavior. The previous state-of-the-art work that coped with extremely noisy environments was designed on the basis that the noise intensity in the environment was known in advance. However, when the noise intensity changes, the existing method has to adjust the configuration of the model to learn in new environments, which limits its practical applications. To overcome these difficulties, we present an Attention-based Fault-Tolerant (FT-Attn) model, which can select not only correct, but also relevant information for each agent at every time step in noisy environments. The multihead attention mechanism enables the agents to learn effective communication policies through experience concurrently with the action policies. Empirical results showed that FT-Attn beats previous state-of-the-art methods in some extremely noisy environments in both cooperative and competitive scenarios, coming much closer to the upper-bound performance. Furthermore, FT-Attn maintains a more general fault tolerance ability and does not rely on prior knowledge about the noise intensity of the environment.

A Simple and Efficient Tensor Calculus for Machine Learning

Fundamenta Informaticae ◽

10.3233/fi-2020-1984 ◽

2020 ◽

Vol 177 (2) ◽

pp. 157-179

Author(s):

Sören Laue ◽

Matthias Mitterreiter ◽

Joachim Giesen

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Efficient Method ◽

Automatic Differentiation ◽

State Of The Art ◽

Tensor Representation ◽

Tensor Calculus ◽

Online Tool ◽

Previous State ◽

Derivatives Of

Computing derivatives of tensor expressions, also known as tensor calculus, is a fundamental task in machine learning. A key concern is the efficiency of evaluating the expressions and their derivatives that hinges on the representation of these expressions. Recently, an algorithm for computing higher order derivatives of tensor expressions like Jacobians or Hessians has been introduced that is a few orders of magnitude faster than previous state-of-the-art approaches. Unfortunately, the approach is based on Ricci notation and hence cannot be incorporated into automatic differentiation frameworks from deep learning like TensorFlow, PyTorch, autograd, or JAX that use the simpler Einstein notation. This leaves two options, to either change the underlying tensor representation in these frameworks or to develop a new, provably correct algorithm based on Einstein notation. Obviously, the first option is impractical. Hence, we pursue the second option. Here, we show that using Ricci notation is not necessary for an efficient tensor calculus and develop an equally efficient method for the simpler Einstein notation. It turns out that turning to Einstein notation enables further improvements that lead to even better efficiency. The methods that are described in this paper for computing derivatives of matrix and tensor expressions have been implemented in the online tool www.MatrixCalculus.org.
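As a small worked illustration of the index-based summation convention this abstract refers to (repeated indices are summed over), consider the quadratic form of a vector $x$ and a matrix $A$. The example is a standard identity chosen for illustration; it is not taken from the paper and does not reproduce the paper's algorithm.

```latex
y = x_i A_{ij} x_j,
\qquad
\frac{\partial y}{\partial x_k}
  = \delta_{ik} A_{ij} x_j + x_i A_{ij} \delta_{jk}
  = A_{kj} x_j + x_i A_{ik}
```

In matrix form this is $\nabla_x y = (A + A^{\top})\,x$, the kind of derivative of a matrix expression that the tool named in the abstract is described as computing.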

A Method of Personalized Driving Decision for Smart Car Based on Deep Reinforcement Learning

Information ◽

10.3390/info11060295 ◽

2020 ◽

Vol 11 (6) ◽

pp. 295 ◽

Cited By ~ 1

Author(s):

Xinpeng Wang ◽

Chaozhong Wu ◽

Jie Xue ◽

Zhijun Chen

Keyword(s):

Reinforcement Learning ◽

Decision Model ◽

Gradient Algorithm ◽

Learning Goals ◽

Learning Method ◽

Automatic Driving ◽

Proposed Model ◽

Policy Gradient ◽

Self Learning ◽

Better Than

Automatic driving technology has become a hotspot in academia, and it is necessary to personalize automatic driving decisions for each passenger. The purpose of this paper is to propose a self-learning method for personalized driving decisions. First, driving data from different drivers are collected and analyzed to set learning goals. Then, the Deep Deterministic Policy Gradient algorithm is used to design a driving decision system. Furthermore, personalized factors are introduced for some observed parameters to build a personalized driving decision model. Finally, the proposed method is compared with classic Deep Reinforcement Learning algorithms. The results show that the personalized driving decision model performs better than the classic algorithms and behaves similarly to manual driving. Therefore, the proposed model can effectively learn the human-like personalized driving decisions of different drivers on structured roads. Based on this model, the smart car can accomplish personalized driving.
