Inaccuracy of State-Action Value Function For Non-Optimal Actions in Adversarially Trained Deep Neural Policies

As a key element of urban transportation, taxi services provide convenience and comfort for residents’ travel. In practice, however, they are not very efficient. Previous research mainly aimed to optimize order-dispatch policies for ride-hailing services, which cannot be applied to cruising taxi services. This paper develops a reinforcement learning (RL) framework to optimize driving policies for cruising taxi services. First, we formulate the drivers’ behaviour as a Markov decision process (MDP), accounting for the long-run influence of each action. An RL framework using dynamic programming and data expansion is employed to calculate the state-action value function. Following the value function, drivers can determine the best choice and quantify the expected future reward at a particular state. Using historical order data from Chengdu, we analyse the spatial distribution of the function values and demonstrate how the model can optimize driving policies. Finally, a realistic simulation of the on-demand platform is built. Compared with benchmark methods, the new model performs better: total revenue and answer rate increase by up to 4.8% and 6.2%, respectively, and waiting time decreases by up to 27.27%.
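The dynamic-programming step mentioned in this abstract can be illustrated with a minimal sketch of Q-value iteration on a tabular MDP. The abstract does not give the paper's state, action, or reward definitions, so the two-zone toy problem below (actions "stay" and "move", rewards, discount factor) is entirely hypothetical and is not the authors' model.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Compute Q*(s, a) by repeated Bellman optimality backups.

    P[a, s, t] is the probability of moving from state s to state t
    under action a; R[s, a] is the expected immediate reward.
    """
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        V = Q.max(axis=1)  # greedy state value under the current Q
        Q_new = R + gamma * np.einsum("ast,t->sa", P, V)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

# Hypothetical toy problem: two zones, actions 0 = stay, 1 = move.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # stay: remain in the current zone
    [[0.0, 1.0], [1.0, 0.0]],  # move: switch to the other zone
])
R = np.array([
    [1.0, 0.0],  # zone 0: staying earns 1, moving earns 0
    [3.0, 0.0],  # zone 1: staying earns 3, moving earns 0
])
Q = q_value_iteration(P, R, gamma=0.5)
best_action = Q.argmax(axis=1)  # greedy policy per zone
```

In this toy setting the greedy policy moves out of the low-revenue zone and stays in the high-revenue one, which is the kind of reading of the value function the abstract describes.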

A solution for the Elevators Group Dispatch by Multiagent Reinforcement Learning

10.5753/eniac.2019.9322 ◽

2019 ◽

Author(s):

Jordão Memória ◽

José Maia

Keyword(s):

Reinforcement Learning ◽

Function Approximation ◽

Value Function ◽

The State ◽

Evaluation Function ◽

State Action ◽

Traffic Pattern ◽

Multiagent Reinforcement Learning ◽

Multi Agent ◽

Action Value

In this work, a model and an algorithm based on multiagent reinforcement learning are developed for the problem of elevator group dispatch. The main advantage is that, together with function approximation, this multi-agent solution reduces the state space, allowing complex states to be handled with a synthesizing evaluation function. Each elevator is considered an agent that has to decide between two actions: answer or ignore the new call. Over some iterations, the agents learn the weights of an evaluation function that approximates the state-action value function. The performance of the solution, measured by average waiting time (AWT) under varying traffic pattern, flow of people, number of elevators and number of floors, is comparable to other current proposals reported in the literature.
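One way to picture an evaluation function whose weights approximate the state-action value function is a linear model updated by semi-gradient Q-learning. The abstract does not state the features or the update rule the authors use, so the feature vectors, step size, and function names below are assumptions for illustration only.

```python
import numpy as np

ACTIONS = ("answer", "ignore")  # the two per-elevator actions named in the abstract

def q_value(w, phi):
    """Linear evaluation function: Q(s, a) is approximated by w . phi(s, a)."""
    return float(np.dot(w, phi))

def td_update(w, phi_sa, reward, next_phis, alpha=0.1, gamma=0.9):
    """One semi-gradient Q-learning step on the weight vector.

    phi_sa    : feature vector of the state-action pair just taken
    next_phis : feature vectors of each action in the next state
                (empty list if the episode ended)
    """
    best_next = max((q_value(w, p) for p in next_phis), default=0.0)
    td_error = reward + gamma * best_next - q_value(w, phi_sa)
    return w + alpha * td_error * phi_sa

# Hypothetical usage: two features, reward is the negative waiting time.
w = np.zeros(2)
w = td_update(w, phi_sa=np.array([1.0, 0.0]), reward=-2.0, next_phis=[])
```

Because every agent shares the same small weight vector over features, the table of all joint states never has to be stored, which is the state-space reduction the abstract refers to.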

Low-rank State-action Value-function Approximation

10.23919/eusipco54536.2021.9616008 ◽

2021 ◽

Author(s):

Sergio Rozada ◽

Victor Tenorio ◽

Antonio G. Marques

Keyword(s):

Function Approximation ◽

Value Function ◽

Low Rank ◽

Value Function Approximation ◽

State Action ◽

Action Value

Value-Based Continuous Control Without Concrete State-Action Value Function

Lecture Notes in Computer Science - Advances in Swarm Intelligence ◽

10.1007/978-3-030-78811-7_34 ◽

2021 ◽

pp. 352-364

Author(s):

Jin Zhu ◽

Haixian Zhang ◽

Zhen Pan

Keyword(s):

Value Function ◽

Continuous Control ◽

State Action ◽

Action Value

Energy Management of Hybrid UAV Based on Reinforcement Learning

Electronics ◽

10.3390/electronics10161929 ◽

2021 ◽

Vol 10 (16) ◽

pp. 1929

Author(s):

Huan Shen ◽

Yao Zhang ◽

Jianguo Mao ◽

Zhiwei Yan ◽

Linwei Wu

Keyword(s):

Reinforcement Learning ◽

State Space ◽

Energy Management ◽

Internal Combustion Engines ◽

Value Function ◽

Learning Algorithm ◽

The State ◽

Combustion Engines ◽

State Action ◽

Action Value

To address the flight time problem of Unmanned Aerial Vehicles (UAV), this paper proposes a set of energy management strategies based on reinforcement learning for hybrid agricultural UAV. The battery is used to optimize the working point of the internal combustion engine as far as possible while meeting the high power demand of the UAV and compensating for the slow response of the internal combustion engine. First, a decision-making oriented hybrid model and a UAV dynamic model are established. Because the energy management strategy (EMS) is based on reinforcement learning (RL), an intelligent optimization approach that has emerged in recent years, complex theoretical formula derivation is avoided in the modeling process. For the EMS, a double Q learning algorithm with strong convergence is adopted. The algorithm separates the state-action value function database used to derive decisions from the state-action value function database updated as a result of those decisions, so as to avoid delay and shock within the convergence process caused by maximum deviation. After this improvement, offline training is carried out on a large amount of previously generated flight data. The simulation results demonstrate that, by virtue of the search function strategy proposed in this paper, the improved algorithm performs better with less learning cost than before. In the state space, time-based and residual fuel-based selection are carried out successively, and the convergence rate and application effect are compared and analyzed. The results show that, with an appropriate selection of state space, the learning algorithm is more robust and converges faster under different types of operating cycles. After 120,000 cycles of training, the fuel economy of the improved algorithm reaches more than 90% of that of the optimal solution, and the algorithm performs stably in actual flight.
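The separation of the table used to choose actions from the table used to evaluate them is the defining step of tabular double Q-learning, which can be sketched as follows. This is the generic textbook update, not the paper's improved variant; the state and action labels, step size, and the `rng` parameter are assumptions for illustration.

```python
import random
from collections import defaultdict

def double_q_update(QA, QB, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.95, rng=random):
    """One tabular double Q-learning step.

    A coin flip picks which table learns. The learning table selects
    the greedy next action; the other table supplies its value, so a
    single table's overestimate does not feed back into itself.
    """
    if rng.random() < 0.5:
        learn, evaluate = QA, QB
    else:
        learn, evaluate = QB, QA
    a_star = max(actions, key=lambda b: learn[(s_next, b)])
    target = r + gamma * evaluate[(s_next, a_star)]
    learn[(s, a)] += alpha * (target - learn[(s, a)])

# Hypothetical usage with two placeholder actions.
QA, QB = defaultdict(float), defaultdict(float)
double_q_update(QA, QB, s=0, a="battery", r=1.0, s_next=1,
                actions=("battery", "engine"))
```

Acting greedily would typically use the sum or average of the two tables.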

Deep soccer analytics: learning an action-value function for evaluating soccer players

Data Mining and Knowledge Discovery ◽

10.1007/s10618-020-00705-9 ◽

2020 ◽

Vol 34 (5) ◽

pp. 1531-1559

Author(s):

Guiliang Liu ◽

Yudong Luo ◽

Oliver Schulte ◽

Tarak Kharrat

Keyword(s):

Value Function ◽

Soccer Players ◽

Action Value

Non-Markovian Reinforcement-Based on Self-Optimizing Memory Controller

Volume 3: ASME/IEEE 2009 International Conference on Mechatronic and Embedded Systems and Applications; 20th Reliability, Stress Analysis, and Failure Prevention Conference ◽

10.1115/detc2009-86326 ◽

2009 ◽

Author(s):

Hassab Elgawi Osman

Keyword(s):

Value Function ◽

Learning Capability ◽

Memory Controller ◽

State Action ◽

Proposed Model ◽

On Line ◽

Memory Contents ◽

Past Experiences ◽

The Value Function

This paper contributes to the design of a robotic self-optimizing memory controller for non-Markovian reinforcement tasks. Rather than a holistic search over the whole memory contents, the model adopts associated feature analysis to successively memorize a new event state-action pair as an action of past experience. Actor-Critic learning is used to adaptively tune the control parameters, while an on-line variant of the random forests (RF) learner is used as a memory-capable approximator of the policy of the Actor and the value function of the Critic. The learning capability of the proposed model is examined experimentally on a non-Markovian cart-pole balancing task. The results show that the self-optimizing memory controller acquires complex behaviors, such as balancing two poles simultaneously, and displays long-term planning and generalization capacity based on past experiences.

A Research on Aero-engine Control Based on Deep Q Learning

International Journal of Turbo and Jet Engines ◽

10.1515/tjj-2020-0009 ◽

2020 ◽

Vol 0 (0) ◽

Author(s):

Qiangang Zheng ◽

Zhihua Xi ◽

Chunping Hu ◽

Haibo Zhang ◽

Zhongzhi Hu

Keyword(s):

Value Function ◽

Control Method ◽

Learning Algorithm ◽

Training Data ◽

Engine Control ◽

Q Learning ◽

Model Free ◽

Deep Learning Algorithm ◽

Aero Engine ◽

Action Value

To improve the response performance of the engine, a novel aero-engine control method based on Deep Q Learning (DQL) is proposed, and an engine controller based on DQL is designed. The model-free Q learning algorithm, which can be performed online, is adopted to calculate the action value function. To improve the learning capacity of DQL, a deep learning algorithm, the On Line Sliding Window Deep Neural Network (OL-SW-DNN), is adopted to estimate the action value function. To reduce sensitivity to noise in the training data, OL-SW-DNN selects the nearest data points within a window of fixed length as training data. Finally, engine acceleration simulations of DQL and of Proportional-Integral-Derivative (PID) control, the most commonly used engine controller algorithm in industry, are both conducted to verify the validity of the proposed method. The results show that the acceleration time of the proposed method decreases by 1.475 seconds compared with the traditional controller while satisfying all engine limits.
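The sliding-window selection of training data can be pictured with a fixed-length buffer that keeps only the newest samples. This is a rough sketch that assumes "nearest" means most recent in time; the class name, the sample layout, and the window length are hypothetical and are not taken from the paper.

```python
from collections import deque

class SlidingWindowBuffer:
    """Keep only the most recent `window` training samples."""

    def __init__(self, window):
        # A deque with maxlen silently drops the oldest item when full.
        self._data = deque(maxlen=window)

    def add(self, state, action, target):
        self._data.append((state, action, target))

    def training_batch(self):
        """Return the samples currently inside the window, oldest first."""
        return list(self._data)

# Hypothetical usage: window of 3, five samples arrive online.
buf = SlidingWindowBuffer(window=3)
for step in range(5):
    buf.add(state=step, action=0, target=float(step))
batch = buf.training_batch()
```

A network retrained on `batch` after each new sample would only ever see the latest window of data.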

Reinforcement learning in the environment where optimal action value function is partly discontinuous

2016 55th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE) ◽

10.1109/sice.2016.7749277 ◽

2016 ◽

Author(s):

Shingo Shibusawa ◽

Takeshi Shibuya

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Optimal Action ◽

Action Value
