Inaccuracy of State-Action Value Function For Non-Optimal Actions in Adversarially Trained Deep Neural Policies

Author(s):  
Ezgi Korkmaz
2020 ◽  
Vol 12 (21) ◽  
pp. 8883
Author(s):  
Kun Jin ◽  
Wei Wang ◽  
Xuedong Hua ◽  
Wei Zhou

As a key element of urban transportation, taxi services provide convenience and comfort for residents’ travel. In practice, however, they have not proved very efficient. Previous research mainly aimed to optimize order-dispatch policies for ride-hailing services, which cannot be applied to cruising taxi services. This paper develops a reinforcement learning (RL) framework to optimize driving policies for cruising taxi services. First, we formulate drivers’ behaviours as a Markov decision process (MDP), considering the long-run influence of each action. An RL framework using dynamic programming and data expansion is employed to calculate the state-action value function. Following the value function, drivers can determine the best choice and quantify the expected future reward at a particular state. Using historical order data from Chengdu, we analyse the spatial distribution of the function values and demonstrate how the model can optimize driving policies. Finally, a realistic simulation of the on-demand platform is built. Compared with other benchmark methods, the results verify that the new model performs better in increasing total revenue and answer rate and in decreasing waiting time, with relative changes of at most 4.8%, 6.2% and −27.27%, respectively.
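The dynamic-programming step described in this abstract can be illustrated with a minimal sketch of tabular Q-value iteration. This is not the authors' implementation: the two-zone MDP, the rewards, the discount factor and the names `q_value_iteration`, `P` and `R` are all hypothetical and chosen only to show how a state-action value function yields a greedy driving policy.

```python
import numpy as np


def q_value_iteration(P, R, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Compute Q(s, a) by repeated Bellman optimality backups.

    P has shape (S, A, S): P[s, a, s2] is the probability of reaching s2.
    R has shape (S, A): expected immediate reward for action a in state s.
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        V = Q.max(axis=1)            # value of acting greedily in each state
        Q_new = R + gamma * (P @ V)  # one-step lookahead for every (s, a)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q


# Hypothetical toy problem: two zones, actions 0 = stay, 1 = move to the other zone.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0  # staying keeps the driver in the same zone
P[0, 1, 1] = P[1, 1, 0] = 1.0  # moving switches zones
R = np.array([[1.0, 0.0],      # zone 0: staying earns 1, moving earns 0
              [3.0, 0.0]])     # zone 1: staying earns 3, moving earns 0

Q = q_value_iteration(P, R, gamma=0.5)
policy = Q.argmax(axis=1)      # greedy action per zone
```

In this toy setting the greedy policy moves from the low-revenue zone to the high-revenue one and then stays there, which is the kind of decision the abstract says drivers can read off the value function.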


2019 ◽  
Author(s):  
Jordão Memória ◽  
José Maia

In this work, a model and an algorithm based on multi-agent reinforcement learning are developed for the problem of elevator group dispatch. The main advantage is that, together with function approximation, this multi-agent solution reduces the state space, allowing complex states to be addressed with a synthesizing evaluation function. Each elevator is considered an agent that has to decide between two actions: answer or ignore the new call. Over a number of iterations, the agents learn the weights of an evaluation function that approximates the state-action value function. The performance of the solution (average waiting time, AWT), evaluated while varying the traffic pattern, flow of people, number of elevators and number of floors, is comparable to that of other current proposals reported in the literature.
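Learning the weights of an evaluation function that approximates the state-action value function can be sketched as a linear Q-learning update. The abstract does not give the features, reward or update rule, so everything below is an assumption: the feature vector, the learning rate, the negative-waiting-time reward and the names `q_values` and `td_update` are hypothetical.

```python
import numpy as np

ANSWER, IGNORE = 0, 1  # the two actions available to each elevator agent


def q_values(weights, features):
    """Linear evaluation function: one weight row per action."""
    return weights @ features  # shape (n_actions,)


def td_update(weights, features, action, reward, next_features,
              alpha=0.1, gamma=0.9):
    """Return updated weights and the temporal-difference error."""
    target = reward + gamma * q_values(weights, next_features).max()
    td_error = target - q_values(weights, features)[action]
    new_weights = weights.copy()
    # Only the weights of the action actually taken are adjusted.
    new_weights[action] += alpha * td_error * features
    return new_weights, td_error


# Hypothetical features, e.g. [bias, distance to call, current load].
weights = np.zeros((2, 3))
features = np.array([1.0, 0.0, 2.0])
next_features = np.zeros(3)

# Reward assumed to be the negative waiting time caused by the decision.
weights, td_error = td_update(weights, features, ANSWER, -1.0, next_features)
```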


Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1929
Author(s):  
Huan Shen ◽  
Yao Zhang ◽  
Jianguo Mao ◽  
Zhiwei Yan ◽  
Linwei Wu

To address the flight time problem of Unmanned Aerial Vehicles (UAVs), this paper proposes a set of energy management strategies based on reinforcement learning for hybrid agricultural UAVs. The battery is used to optimize the working point of the internal combustion engine to the greatest extent while addressing the high power demand of the UAV and the response problem of the internal combustion engine. First, a decision-oriented hybrid model and a UAV dynamic model are established. Owing to the characteristics of the energy management strategy (EMS) based on reinforcement learning (RL), an intelligent optimization algorithm that has emerged in recent years, complex theoretical formula derivation is avoided in the modeling process. For the EMS, a double Q-learning algorithm with strong convergence is adopted. The algorithm separates the state-action value function database used to derive decisions from the state-action value function database updated as a result of the decision, so as to avoid the delay and shock in the convergence process caused by maximum deviation. After this improvement, off-line training is carried out with a large amount of previously generated flight data. The simulation results demonstrate that the improved algorithm performs better, at a lower learning cost than before, by virtue of the search function strategy proposed in this paper. For the state space, time-based and residual-fuel-based selection are carried out successively, and the convergence rate and application effect are compared and analyzed. The results show that the learning algorithm has stronger robustness and faster convergence owing to the appropriate selection of the state space under different types of operating cycles. After 120,000 cycles of training, the fuel economy of the improved algorithm reaches more than 90% of that of the optimal solution, and the algorithm performs stably in actual flight.
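The separation of the two value function databases matches the general idea of tabular double Q-learning, which can be sketched as follows. This is the textbook update, not the paper's improved variant; the table sizes, step size and the name `double_q_update` are assumptions made for illustration.

```python
import numpy as np


def double_q_update(q_a, q_b, state, action, reward, next_state, rng,
                    alpha=0.1, gamma=0.9):
    """Update one of two Q-tables in place, chosen by a fair coin flip.

    The table being updated selects the greedy next action, and the other
    table evaluates it, which decouples action selection from evaluation.
    """
    if rng.random() < 0.5:
        select, evaluate = q_a, q_b
    else:
        select, evaluate = q_b, q_a
    best_next = int(np.argmax(select[next_state]))
    target = reward + gamma * evaluate[next_state, best_next]
    select[state, action] += alpha * (target - select[state, action])


# Hypothetical sizes: 2 states, 2 actions.
q_a = np.zeros((2, 2))
q_b = np.zeros((2, 2))
rng = np.random.default_rng(0)
double_q_update(q_a, q_b, state=0, action=1, reward=1.0, next_state=1, rng=rng)
```

Because the table that picks the greedy action is not the one that scores it, a single overestimated entry is less likely to propagate into the target than in ordinary Q-learning.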


2020 ◽  
Vol 34 (5) ◽  
pp. 1531-1559
Author(s):  
Guiliang Liu ◽  
Yudong Luo ◽  
Oliver Schulte ◽  
Tarak Kharrat

Author(s):  
Hassab Elgawi Osman

This paper contributes to the design of a robotic self-optimizing memory controller for non-Markovian reinforcement tasks. Rather than searching the whole memory contents holistically, the model adopts associated feature analysis to successively memorize a new event state-action pair as an action of past experience. Actor-Critic learning is used to adaptively tune the control parameters, while an on-line variant of the random forests (RF) learner is used as a memory-capable approximator for the policy of the Actor and the value function of the Critic. The learning capability of the proposed model is examined experimentally on a non-Markovian cart-pole balancing task. The results show that the self-optimizing memory controller acquires complex behaviors, such as balancing two poles simultaneously, and displays long-term planning and generalization capacity based on past experiences.


2020 ◽  
Vol 0 (0) ◽  
Author(s):  
Qiangang Zheng ◽  
Zhihua Xi ◽  
Chunping Hu ◽  
Haibo ZHANG ◽  
Zhongzhi Hu

To improve the response performance of the engine, a novel aero-engine control method based on Deep Q Learning (DQL) is proposed, and an engine controller based on DQL is designed. The model-free Q-learning algorithm, which can be performed online, is adopted to calculate the action value function. To improve the learning capacity of DQL, a deep learning algorithm, the On Line Sliding Window Deep Neural Network (OL-SW-DNN), is adopted to estimate the action value function. To reduce sensitivity to noise in the training data, OL-SW-DNN selects the nearest data points, up to a certain length, as training data. Finally, engine acceleration simulations of DQL and of the Proportional-Integral-Derivative (PID) controller, the algorithm most commonly used for engine control in industry, are both conducted to verify the validity of the proposed method. The results show that the acceleration time of the proposed method is 1.475 seconds shorter than that of the traditional controller while satisfying all engine limits.
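Selecting the nearest data points of a fixed length as training data can be illustrated with a small sliding-window buffer. This is only a sketch of the windowing idea, assuming that "nearest" means most recent; the class name `SlidingWindow` and the window length are hypothetical, and the neural network itself is omitted.

```python
from collections import deque


class SlidingWindow:
    """Keep only the most recent `length` training samples."""

    def __init__(self, length):
        # A bounded deque discards the oldest sample once it is full.
        self._items = deque(maxlen=length)

    def add(self, sample):
        self._items.append(sample)

    def batch(self):
        """Return the current window, oldest sample first."""
        return list(self._items)


window = SlidingWindow(length=3)
for step in range(5):
    window.add(step)  # in practice a (state, action, target) sample

recent = window.batch()
```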

