Simulation of partially observed Markov decision process and dynamic quality improvement

1997, Vol 32 (4), pp. 691-700
Author(s): Nancy Gautreau, Soumaya Yacout, Réjean Hall
Author(s): Kazuteru Miyazaki, Shigenobu Kobayashi

Exploitation-oriented learning (XoL) is a novel approach to goal-directed learning from interaction. Whereas reinforcement learning focuses on learning and ensures optimality in Markov decision process (MDP) environments, XoL aims to learn a rational policy that obtains rewards continuously and quickly. PS-r*, a form of XoL, learns a useful rational policy that is not inferior to a random walk in partially observable Markov decision process (POMDP) environments with a single type of reward. PS-r*, however, requires O(MN²) memory, where N is the number of sensory input types and M is the number of action types. We propose PS-r#, which learns a useful rational policy in the POMDP using only O(MN) memory. The effectiveness of PS-r# is confirmed through numerical examples.
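To make the stated memory bounds concrete, the following is a minimal illustrative sketch (not the PS-r* or PS-r# algorithms themselves): it only counts the table entries implied by an O(MN²) structure, e.g. one entry per (action, sensory input, sensory input) triple, versus an O(MN) structure with one entry per (action, sensory input) pair. The function names and the interpretation of the table layouts are assumptions for illustration.

```python
def entries_ps_r_star(n_inputs: int, n_actions: int) -> int:
    """Hypothetical O(M N^2) table size attributed to PS-r*."""
    return n_actions * n_inputs ** 2


def entries_ps_r_sharp(n_inputs: int, n_actions: int) -> int:
    """Hypothetical O(M N) table size attributed to PS-r#."""
    return n_actions * n_inputs


# With N = 1000 sensory input types and M = 10 action types, the
# quadratic table is a factor of N = 1000 larger than the linear one.
N, M = 1000, 10
print(entries_ps_r_star(N, M))   # 10_000_000 entries
print(entries_ps_r_sharp(N, M))  # 10_000 entries
```

The gap grows linearly in N, which is why reducing the bound from O(MN²) to O(MN) matters as the sensory input space gets large.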
