Motion Planning with Energy Reduction for a Floating Robotic Platform Under Disturbances and Measurement Noise Using Reinforcement Learning

This paper investigates the use of reinforcement learning for the navigation of an over-actuated, i.e. more control inputs than degrees of freedom, marine platform in unknown environment. The proposed approach uses an online least-squared policy iteration scheme for value function approximation in order to estimate optimal policy, in conjunction with a low-level control system that controls the magnitude of the linear velocity, and the orientation of the platform. Primary goal of the proposed scheme is the reduction of the consumed energy. To that end, we propose a variable reward function that depends on the energy consumption of the platform. We evaluate our approach in a complex and realistic simulation environment and report results concerning its performance on estimating optimal navigation policies under different environmental disturbances, and position GPS measurement noise. The proposed framework is compared, in terms of energy consumption, to a baseline approach based on virtual potential fields. The results show that the marine platform successfully discovers the target point following a sub-optimal path, maintaining reduced energy consumption.

Download Full-text

End-to-End AUV Motion Planning Method Based on Soft Actor–Critic

Sensors ◽

10.3390/s21175893 ◽

2021 ◽

Vol 21 (17) ◽

pp. 5893

Author(s):

Xin Yu ◽

Yushan Sun ◽

Xiangbin Wang ◽

Guocheng Zhang

Keyword(s):

Reinforcement Learning ◽

Motion Planning ◽

Autonomous Underwater Vehicle ◽

Planning System ◽

Optimal Decision ◽

Target Point ◽

Planning Problem ◽

Training Time ◽

Reward Function ◽

End To End

This study aims to solve the problems of poor exploration ability, single strategy, and high training cost in autonomous underwater vehicle (AUV) motion planning tasks and to overcome certain difficulties, such as multiple constraints and a sparse reward environment. In this research, an end-to-end motion planning system based on deep reinforcement learning is proposed to solve the motion planning problem of an underactuated AUV. The system directly maps the state information of the AUV and the environment into the control instructions of the AUV. The system is based on the soft actor–critic (SAC) algorithm, which enhances the exploration ability and robustness to the AUV environment. We also use the method of generative adversarial imitation learning (GAIL) to assist its training to overcome the problem that learning a policy for the first time is difficult and time-consuming in reinforcement learning. A comprehensive external reward function is then designed to help the AUV smoothly reach the target point, and the distance and time are optimized as much as possible. Finally, the end-to-end motion planning algorithm proposed in this research is tested and compared on the basis of the Unity simulation platform. Results show that the algorithm has an optimal decision-making ability during navigation, a shorter route, less time consumption, and a smoother trajectory. Moreover, GAIL can speed up the AUV training speed and minimize the training time without affecting the planning effect of the SAC algorithm.

Download Full-text

Reinforcement learning of motor skills using Policy Search and human corrective advice

The International Journal of Robotics Research ◽

10.1177/0278364919871998 ◽

2019 ◽

Vol 38 (14) ◽

pp. 1560-1580 ◽

Cited By ~ 2

Author(s):

Carlos Celemin ◽

Guilherme Maeda ◽

Javier Ruiz-del-Solar ◽

Jan Peters ◽

Jens Kober

Keyword(s):

Reinforcement Learning ◽

Motor Skills ◽

Degrees Of Freedom ◽

Sources Of Information ◽

Stable Convergence ◽

Interactive Approach ◽

Policy Search ◽

Physical Constraints ◽

Reward Function ◽

Movement Primitives

Robot learning problems are limited by physical constraints, which make learning successful policies for complex motor skills on real systems unfeasible. Some reinforcement learning methods, like Policy Search, offer stable convergence toward locally optimal solutions, whereas interactive machine learning or learning-from-demonstration methods allow fast transfer of human knowledge to the agents. However, most methods require expert demonstrations. In this work, we propose the use of human corrective advice in the actions domain for learning motor trajectories. Additionally, we combine this human feedback with reward functions in a Policy Search learning scheme. The use of both sources of information speeds up the learning process, since the intuitive knowledge of the human teacher can be easily transferred to the agent, while the Policy Search method with the cost/reward function take over for supervising the process and reducing the influence of occasional wrong human corrections. This interactive approach has been validated for learning movement primitives with simulated arms with several degrees of freedom in reaching via-point movements, and also using real robots in such tasks as “writing characters” and the ball-in-a-cup game. Compared with standard reinforcement learning without human advice, the results show that the proposed method not only converges to higher rewards when learning movement primitives, but also that the learning is sped up by a factor of 4–40 times, depending on the task.

Download Full-text

Reinforcement and Imitation Learning Applied to Autonomous Aerial Robot Control

10.5753/wtdr_ctdr.2020.14956 ◽

2020 ◽

Author(s):

Gabriel Moraes Barros ◽

Esther Colombini

Keyword(s):

Reinforcement Learning ◽

Autonomous Systems ◽

Control Policy ◽

Primary Objective ◽

Imitation Learning ◽

Level Control ◽

Reward Function ◽

Long Time ◽

Learning Reinforcement ◽

Function Approximator

In robotics, the ultimate goal of reinforcement learning is to endow robots with the ability to learn, improve, adapt, and reproduce tasks with dynamically changing constraints based on exploration and autonomous learning. Reinforcement Learning (RL) aims at addressing this problem by enabling a robot to learn behaviors through trial-and-error. With RL, a Neural Network can be trained as a function approximator to directly map states to actuator commands making any predefined control structure not-needed for training. However, the knowledge required to converge these methods is usually built from scratch. Learning may take a long time, not to mention that RL algorithms need a stated reward function. Sometimes, it is not trivial to define one. Often it is easier for a teacher, human or intelligent agent, do demonstrate the desired behavior or how to accomplish a given task. Humans and other animals have a natural ability to learn skills from observation, often from merely seeing these skills’ effects: without direct knowledge of the underlying actions. The same principle exists in Imitation Learning, a practical approach for autonomous systems to acquire control policies when an explicit reward function is unavailable, using supervision provided as demonstrations from an expert, typically a human operator. In this scenario, this work’s primary objective is to design an agent that can successfully imitate a prior acquired control policy using Imitation Learning. The chosen algorithm is GAIL since we consider that it is the proper algorithm to tackle this problem by utilizing expert (state, action) trajectories. As reference expert trajectories, we implement state-of-the-art on and off-policy methods PPO and SAC. Results show that the learned policies for all three methods can solve the task of low-level control of a quadrotor and that all can account for generalization on the original tasks.

Download Full-text

Autonomous reconfiguration of homogeneous pivoting cube modular satellite by deep reinforcement learning

Proceedings of the Institution of Mechanical Engineers Part I Journal of Systems and Control Engineering ◽

10.1177/0959651820956738 ◽

2020 ◽

pp. 095965182095673

Author(s):

Qiliang Song ◽

Dong Ye ◽

Zhaowei Sun ◽

Bo Wang

Keyword(s):

Graph Theory ◽

Reinforcement Learning ◽

Energy Consumption ◽

State Transition ◽

Optimal Path ◽

Electrical Energy ◽

Electrical Energy Consumption ◽

Mechanical Loss ◽

Link Module ◽

Simulation Results

Modular satellite, which has the ability of self-repairing and accomplishing different tasks, draws more and more satellite designers’ attention recently. One of the trending topics is to design the algorithm of self-reconfigurable path planning, since searching a near-optimal path is an effective way to reduce electrical energy consumption and mechanical loss of satellites. A major thrust of this article is to examine a series of algorithms based on graph theory and deep reinforcement learning. We creatively propose the concept of link module and find the link module by calculating articulation points in the undirected connected graph of configuration. We propose a compressed algorithm of state transition and the deep reinforcement learning algorithms in the domain of self-reconfigurable modular satellites. The simulation results show the feasibility and effectiveness of the proposed planning algorithms.

Download Full-text

Path Planning of Coastal Ships Based on Optimized DQN Reward Function

Journal of Marine Science and Engineering ◽

10.3390/jmse9020210 ◽

2021 ◽

Vol 9 (2) ◽

pp. 210

Author(s):

Siyu Guo ◽

Xiuguo Zhang ◽

Yiquan Du ◽

Yisong Zheng ◽

Zhiying Cao

Keyword(s):

Path Planning ◽

Optimal Path ◽

Convergence Speed ◽

Stability And Convergence ◽

Experimental Comparison ◽

Target Point ◽

A Algorithm ◽

Reward Function ◽

Space Action ◽

Status Information

Path planning is a key issue in the field of coastal ships, and it is also the core foundation of ship intelligent development. In order to better realize the ship path planning in the process of navigation, this paper proposes a coastal ship path planning model based on the optimized deep Q network (DQN) algorithm. The model is mainly composed of environment status information and the DQN algorithm. The environment status information provides training space for the DQN algorithm and is quantified according to the actual navigation environment and international rules for collision avoidance at sea. The DQN algorithm mainly includes four components which are ship state space, action space, action exploration strategy and reward function. The traditional reward function of DQN may lead to the low learning efficiency and convergence speed of the model. This paper optimizes the traditional reward function from three aspects: (a) the potential energy reward of the target point to the ship is set; (b) the reward area is added near the target point; and (c) the danger area is added near the obstacle. Through the above optimized method, the ship can avoid obstacles to reach the target point faster, and the convergence speed of the model is accelerated. The traditional DQN algorithm, A* algorithm, BUG2 algorithm and artificial potential field (APF) algorithm are selected for experimental comparison, and the experimental data are analyzed from the path length, planning time, number of path corners. The experimental results show that the optimized DQN algorithm has better stability and convergence, and greatly reduces the calculation time. It can plan the optimal path in line with the actual navigation rules, and improve the safety, economy and autonomous decision-making ability of ship navigation.

Download Full-text

Autonomous Control of Urban Storm Water Networks Using Reinforcement Learning

10.29007/hx4d ◽

2018 ◽

Cited By ~ 1

Author(s):

Abhiram Mullapudi ◽

Branko Kerkez

Keyword(s):

Reinforcement Learning ◽

System Level ◽

Storm Water ◽

Level Control ◽

Water Network ◽

Reward Function ◽

Autonomous Operation ◽

Control Robustness ◽

Control Objective ◽

Urban Storm

We investigate the real-time and autonomous operation of a 12 km2 urban storm water network, which has been retrofitted with sensors and control valves. Specifically, we evaluate reinforcement learning, a technique rooted in deep learning, as a system-level control methodology. The controller opens and closes valves in the system, which enhances the performance in the storm water network by coordinating the discharges amongst spatially distributed storm water assets (i.e. detention basins and wetlands). A reinforcement learning control algorithm is implemented to control the storm water network across an urban watershed. Results show that control of valves using reinforcement learning shows great potential, but extensive research still needs to be conducted to develop a fundamental understanding of control robustness. We specifically discuss the role and importance of the reward function (i.e. heuristic control objective), which guides the autonomous controller towards achieving the desired water shed scale response.

Download Full-text

Dynamic Optimization of a Steerable Screw In-pipe Inspection Robot Using HJB and Turbine Installation

Robotica ◽

10.1017/s0263574719001784 ◽

2020 ◽

Vol 38 (11) ◽

pp. 2001-2022

Author(s):

H. Tourajizadeh ◽

V. Boomeri ◽

M. Rezaei ◽

A. Sedigh

Keyword(s):

Energy Consumption ◽

Path Planning ◽

Degrees Of Freedom ◽

Optimal Path ◽

Turbine Blades ◽

Pipe Inspection ◽

Inspection Robot ◽

Propulsion Force ◽

Hamilton Jacobi Bellman ◽

Power And Energy

SUMMARYIn this paper, two strategies are proposed to optimize the energy consumption of a new screw in-pipe inspection robot which is steerable. In the first method, optimization is performed using the optimal path planning and implementing the Hamilton–Jacobi–Bellman (HJB) method. Since the number of actuators is more than the number of degrees of freedom of the system for the proposed steerable case, it is possible to minimize the energy consumption by the aid of the dynamics of the system. In the second method, the mechanics of the robot is modified by installing some turbine blades through which the drag force of the pipeline fluid can be employed to decrease the required propulsion force of the robot. It is shown that using both of the mentioned improvements, that is, using HJB formulation for the steerable robot and installing the turbine blades can significantly save power and energy. However, it will be shown that for the latter case this improvement is extremely dependent on the alignment of the fluid stream direction with respect to the direction of the robot velocity, while this optimization is independent of this case for the former strategy. On the other hand, the path planning dictates a special pattern of speed functionality while for the robot equipped by blades, saving the energy is possible for any desired input path. The correctness of the modeling is verified by comparing the results of MATLAB and ADAMS, while the efficiency of the proposed optimization algorithms is checked by the aid of some analytic and comparative simulations.

Download Full-text

Inverse reinforcement learning in contextual MDPs

Machine Learning ◽

10.1007/s10994-021-05984-x ◽

2021 ◽

Author(s):

Stav Belogolovsky ◽

Philip Korsunsky ◽

Shie Mannor ◽

Chen Tessler ◽

Tom Zahavy

Keyword(s):

Reinforcement Learning ◽

Optimization Problem ◽

Decision Processes ◽

Inverse Reinforcement Learning ◽

Convex Optimization Problem ◽

Reward Function ◽

Dynamic Treatment Regime ◽

Markov Decision ◽

Dynamic Treatment ◽

Recorded Data

AbstractWe consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differential convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function which explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.

Download Full-text

Computation offloading through mobile vehicles in IoT-edge-cloud network

EURASIP Journal on Wireless Communications and Networking ◽

10.1186/s13638-020-01848-5 ◽

2020 ◽

Vol 2020 (1) ◽

Author(s):

Jun Long ◽

Yueyi Luo ◽

Xiaoyu Zhu ◽

Entao Luo ◽

Mingfeng Huang

Keyword(s):

Reinforcement Learning ◽

Energy Consumption ◽

Low Cost ◽

Computation Offloading ◽

Mobile Edge Computing ◽

Challenging Problem ◽

Learning Technique ◽

Cloud Network ◽

The City ◽

Task Offloading

AbstractWith the developing of Internet of Things (IoT) and mobile edge computing (MEC), more and more sensing devices are widely deployed in the smart city. These sensing devices generate various kinds of tasks, which need to be sent to cloud to process. Usually, the sensing devices do not equip with wireless modules, because it is neither economical nor energy saving. Thus, it is a challenging problem to find a way to offload tasks for sensing devices. However, many vehicles are moving around the city, which can communicate with sensing devices in an effective and low-cost way. In this paper, we propose a computation offloading scheme through mobile vehicles in IoT-edge-cloud network. The sensing devices generate tasks and transmit the tasks to vehicles, then the vehicles decide to compute the tasks in the local vehicle, MEC server or cloud center. The computation offloading decision is made based on the utility function of the energy consumption and transmission delay, and the deep reinforcement learning technique is adopted to make decisions. Our proposed method can make full use of the existing infrastructures to implement the task offloading of sensing devices, the experimental results show that our proposed solution can achieve the maximum reward and decrease delay.

Download Full-text

A 2D Optimal Path Planning Algorithm for Autonomous Underwater Vehicle Driving in Unknown Underwater Canyons

Journal of Marine Science and Engineering ◽

10.3390/jmse9030252 ◽

2021 ◽

Vol 9 (3) ◽

pp. 252

Author(s):

Yushan Sun ◽

Xiaokun Luo ◽

Xiangrui Ran ◽

Guocheng Zhang

Keyword(s):

Path Planning ◽

Obstacle Avoidance ◽

Autonomous Underwater Vehicles ◽

Optimal Path ◽

Small Scale ◽

Target Point ◽

Safe Driving ◽

Policy Gradient ◽

Planning Algorithm ◽

Path Planning Algorithm

This research aims to solve the safe navigation problem of autonomous underwater vehicles (AUVs) in deep ocean, which is a complex and changeable environment with various mountains. When an AUV reaches the deep sea navigation, it encounters many underwater canyons, and the hard valley walls threaten its safety seriously. To solve the problem on the safe driving of AUV in underwater canyons and address the potential of AUV autonomous obstacle avoidance in uncertain environments, an improved AUV path planning algorithm based on the deep deterministic policy gradient (DDPG) algorithm is proposed in this work. This method refers to an end-to-end path planning algorithm that optimizes the strategy directly. It takes sensor information as input and driving speed and yaw angle as outputs. The path planning algorithm can reach the predetermined target point while avoiding large-scale static obstacles, such as valley walls in the simulated underwater canyon environment, as well as sudden small-scale dynamic obstacles, such as marine life and other vehicles. In addition, this research aims at the multi-objective structure of the obstacle avoidance of path planning, modularized reward function design, and combined artificial potential field method to set continuous rewards. This research also proposes a new algorithm called deep SumTree-deterministic policy gradient algorithm (SumTree-DDPG), which improves the random storage and extraction strategy of DDPG algorithm experience samples. According to the importance of the experience samples, the samples are classified and stored in combination with the SumTree structure, high-quality samples are extracted continuously, and SumTree-DDPG algorithm finally improves the speed of the convergence model. Finally, this research uses Python language to write an underwater canyon simulation environment and builds a deep reinforcement learning simulation platform on a high-performance computer to conduct simulation learning training for AUV. Data simulation verified that the proposed path planning method can guide the under-actuated underwater robot to navigate to the target without colliding with any obstacles. In comparison with the DDPG algorithm, the stability, training’s total reward, and robustness of the improved Sumtree-DDPG algorithm planner in this study are better.

Download Full-text