Planning with Expectation Models

Author(s):  
Yi Wan ◽  
Muhammad Zaheer ◽  
Adam White ◽  
Martha White ◽  
Richard S. Sutton

Distribution and sample models are two popular model choices in model-based reinforcement learning (MBRL). However, learning these models can be intractable, particularly when the state and action spaces are large. Expectation models, on the other hand, are comparatively easy to learn due to their compactness and have also been widely used for deterministic environments. For stochastic environments, it is not obvious how expectation models can be used for planning, as they only partially characterize a distribution. In this paper, we propose a sound way of using approximate expectation models for MBRL. In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in state features, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.
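The linearity condition in contribution 1 admits a one-line justification; the restatement below uses our own notation (features φ, weights w, expectation model φ̂) purely for illustration:

```latex
v(s) = w^{\top}\phi(s)
\;\;\Longrightarrow\;\;
\mathbb{E}_{s' \sim p(\cdot \mid s,a)}\!\left[v(s')\right]
= w^{\top}\,\mathbb{E}_{s' \sim p(\cdot \mid s,a)}\!\left[\phi(s')\right]
= w^{\top}\hat{\phi}(s,a)
```

so a planning backup that queries only the expected next feature vector φ̂(s, a) produces the same value estimates as one that integrates over the full next-state distribution.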

Author(s):  
Jie Zhong ◽  
Tao Wang ◽  
Lianglun Cheng

In actual welding scenarios, an effective path planner is needed to find a collision-free path in the configuration space for a welding manipulator surrounded by obstacles. However, the state-of-the-art sampling-based planners satisfy only probabilistic completeness, and their computational complexity is sensitive to the state dimension. In this paper, we propose a path planner for welding manipulators based on deep reinforcement learning for solving path planning problems in high-dimensional continuous state and action spaces. Compared with sampling-based methods, it is more robust and less sensitive to the state dimension. Specifically, to improve learning efficiency, we introduce an inverse kinematics module that provides prior knowledge and design a gain module that avoids locally optimal policies; both are integrated into the training algorithm. To evaluate the proposed planning algorithm along multiple dimensions, we conducted multiple sets of path planning experiments for welding manipulators. The results show that our method not only improves convergence but is also superior to most other planning algorithms in terms of optimality and robustness.
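The abstract does not detail how the inverse kinematics module injects prior knowledge. One common pattern, sketched below under our own assumptions (a two-link planar arm with closed-form IK, and a residual-action scheme; the names `two_link_ik`, `policy_action`, and `residual_policy` are ours, not the paper's), is to use the IK solution of the goal pose as a reference configuration that the learned policy only perturbs:

```python
import numpy as np

def two_link_ik(x, y, l1=1.0, l2=1.0):
    """Closed-form inverse kinematics for a planar two-link arm (elbow-down)."""
    c2 = (x**2 + y**2 - l1**2 - l2**2) / (2 * l1 * l2)
    c2 = np.clip(c2, -1.0, 1.0)          # guard against numerical overshoot
    theta2 = np.arccos(c2)
    theta1 = np.arctan2(y, x) - np.arctan2(l2 * np.sin(theta2), l1 + l2 * np.cos(theta2))
    return np.array([theta1, theta2])

def policy_action(obs, residual_policy, goal_xy):
    """IK supplies a prior joint target; the learned policy outputs only a residual."""
    ik_prior = two_link_ik(*goal_xy)      # prior knowledge from kinematics
    residual = residual_policy(obs)       # small correction learned by RL
    return ik_prior + residual

# Toy usage: an untrained "policy" that outputs a zero residual falls back to pure IK.
if __name__ == "__main__":
    goal = (1.2, 0.8)
    obs = np.zeros(4)                     # placeholder observation
    print(policy_action(obs, lambda o: np.zeros(2), goal))
```

Restricting the learner to a residual around a kinematic prior is one way such a module can speed up training, since the agent starts from configurations that already point toward the goal.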


2021 ◽  
Vol 11 (6) ◽  
pp. 803
Author(s):  
Jie Chai ◽  
Xiaogang Ruan ◽  
Jing Huang

Neurophysiological studies have shown that the hippocampus, striatum, and prefrontal cortex play different roles in animal navigation, but it remains unclear how these structures work together. In this paper, we establish a navigation learning model based on the hippocampal–striatal circuit (NLM-HS), which provides a possible explanation for the navigation mechanism in the animal brain. The hippocampal model generates a cognitive map of the environment and performs goal-directed navigation by using a place cell sequence planning algorithm. The striatal model performs reward-related habitual navigation by using the classic temporal difference learning algorithm. Since the two models may produce inconsistent behavioral decisions, the prefrontal cortex model chooses the most appropriate strategy by using a strategy arbitration mechanism. The cognitive and learning mechanism of the NLM-HS works in two stages: exploration and navigation. First, the agent uses the hippocampal model to construct a cognitive map of the unknown environment. Then, the agent uses the strategy arbitration mechanism in the prefrontal cortex model to decide directly which strategy to choose. To test the validity of the NLM-HS, the classical Tolman detour experiment was reproduced. The results show that the NLM-HS not only makes agents exhibit environmental cognition and navigation behavior similar to that of animals, but also makes behavioral decisions faster and achieves better adaptivity than either the hippocampal or striatal model alone.
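The striatal component relies on classic temporal difference learning. Purely as a reference point (not the paper's implementation), a minimal tabular TD(0) value update on a toy chain environment looks like this:

```python
import numpy as np

def td0_chain(n_states=5, episodes=200, alpha=0.1, gamma=0.95, seed=0):
    """Tabular TD(0) on a simple chain: move right, reward 1 on reaching the end."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states + 1)                          # value estimates, terminal state included
    for _ in range(episodes):
        s = 0
        while s < n_states:
            s_next = s + 1 if rng.random() < 0.9 else s  # noisy "move right"
            r = 1.0 if s_next == n_states else 0.0
            # TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V[:-1]

print(td0_chain())
```

The reward-prediction-error term in the update is the quantity classically associated with striatal dopamine signaling, which is why TD learning is the natural choice for the habitual pathway here.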


2021 ◽  
Author(s):  
Yiming Peng

Reinforcement Learning (RL) problems appear in diverse real-world applications and are gaining substantial attention in academia and industry. Policy Direct Search (PDS) is widely recognized as an effective approach to RL problems. However, existing PDS algorithms have some major limitations. First, many step-wise Policy Gradient Search (PGS) algorithms cannot effectively utilize informative historical gradients to accurately estimate policy gradients. Second, although evolutionary PDS algorithms do not rely on accurate policy gradient estimations and can explore learning environments effectively, they are not sample efficient at learning policies in the form of deep neural networks. Third, existing PGS algorithms often diverge easily due to the lack of reliable and flexible techniques for value function learning. Fourth, existing PGS algorithms have not provided suitable mechanisms to learn proper state features automatically.

To address these limitations, the overall goal of this thesis is to develop effective policy direct search algorithms for tackling challenging RL problems through technical innovations in four key areas. First, the thesis aims to improve the accuracy of policy gradient estimation by utilizing historical gradients through a Primal-Dual Approximation technique. Second, the thesis aims to surpass state-of-the-art performance by properly balancing the exploration-exploitation trade-off via Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and Proximal Policy Optimization (PPO). Third, the thesis seeks to stabilize value function learning via a self-organized Sandpile Model (SM) while generalizing the compatible condition to support flexible value function learning. Fourth, the thesis endeavors to develop innovative evolutionary feature learning techniques capable of automatically extracting useful state features so as to enhance various cutting-edge PGS algorithms.

In the thesis, we explore the four key technical areas by studying policies with increasing complexity. We start from a simple linear policy representation and then proceed to a complex neural-network-based policy representation. Next, we consider a more complicated situation where policy learning is coupled with value function learning. Subsequently, we consider policies modeled as a concatenation of two interrelated networks, one for feature learning and one for action selection.

To achieve the first goal, this thesis proposes a new policy gradient learning framework in which a series of historical gradients are jointly exploited to obtain accurate policy gradient estimations via the Primal-Dual Approximation technique. Under this framework, three new PGS algorithms for step-wise policy training have been derived from three widely used PGS algorithms, and the convergence properties of the new algorithms have been theoretically analyzed. Empirical results on several benchmark control problems further show that the newly proposed algorithms can significantly outperform their base algorithms.

To achieve the second goal, this thesis develops a new sample-efficient evolutionary deep policy optimization algorithm based on CMA-ES and PPO. The algorithm has a layer-wise learning mechanism to improve computational efficiency in comparison to CMA-ES. Additionally, it uses a surrogate model based on a performance lower bound for fitness evaluation, which reduces the sample cost to the state-of-the-art level. More importantly, the best policy found by CMA-ES at every generation is further improved by PPO to properly balance exploration and exploitation. The experimental results confirm that the proposed algorithm outperforms various cutting-edge algorithms on many benchmark continuous control problems.

To achieve the third goal, this thesis develops new value function learning methods that are both reliable and flexible so as to further enhance the effectiveness of policy gradient search. Two Actor-Critic (AC) algorithms have been developed from a commonly used PGS algorithm, Regular Actor-Critic (RAC). The first algorithm adopts SM to stabilize value function learning, and the second generalizes the logarithm function used by the compatible condition to provide a flexible family of new compatible functions. The experimental results show that, with the help of reliable and flexible value function learning, the newly developed algorithms are more effective than RAC on several benchmark control problems.

To achieve the fourth goal, this thesis develops innovative NeuroEvolution algorithms for automated feature learning to enhance various cutting-edge PGS algorithms. The newly developed algorithms not only extract useful state features but also learn good policies. The experimental analysis demonstrates that they achieve better performance on large-scale RL problems than both well-known PGS algorithms and NeuroEvolution techniques. Our experiments also confirm that the state features learned by NeuroEvolution on one RL task can easily be transferred to boost learning performance on similar but different tasks.
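The second contribution combines CMA-ES exploration with PPO refinement of each generation's elite. The sketch below, written under our own simplifying assumptions (a plain Gaussian evolution strategy on a toy quadratic objective, with a few gradient ascent steps standing in for PPO), only illustrates this evolve-then-refine pattern and is not the thesis algorithm:

```python
import numpy as np

def objective(theta):
    """Toy stand-in for policy return: peak at theta = [1, -2]."""
    return -np.sum((theta - np.array([1.0, -2.0])) ** 2)

def grad_objective(theta):
    return -2.0 * (theta - np.array([1.0, -2.0]))

def evolve_then_refine(generations=30, pop=16, sigma=0.5, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.zeros(2)
    for _ in range(generations):
        # Evolution step: sample a population around the current mean (CMA-ES stand-in).
        candidates = mean + sigma * rng.standard_normal((pop, 2))
        elite = candidates[np.argmax([objective(c) for c in candidates])]
        # Refinement step: improve the elite with a few gradient updates (PPO stand-in).
        for _ in range(5):
            elite = elite + lr * grad_objective(elite)
        mean = elite          # the next generation searches around the refined elite
    return mean

print(evolve_then_refine())   # converges toward [1, -2]
```

The design choice being illustrated is the division of labor: the population provides gradient-free exploration, while the gradient-based refinement exploits local structure around the best candidate found so far.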


2021 ◽  
Vol 235 ◽  
pp. 03035
Author(s):  
Jiaojiao Lv ◽  
Yingsi Zhao

Because no single recommendation algorithm is optimal for every product, the precision of recommendation systems has become a bottleneck. From the perspective of product marketing, this paper takes the product's inherent attributes as the classification standard and focuses on the core problem of “matching product classification and recommendation algorithms to users' purchase demand”. Three hypotheses are proposed: (1) the inherent attributes of a product directly affect user demand; (2) different product categories are suited to different recommendation algorithms; (3) integrating recommendation algorithms can achieve personalized customization. An empirical study of the relationship between the characteristics of recommendation information (independent variable) and purchase intention (dependent variable) concludes that the predictability and differentiation of recommendation information are not fully perceived and that the stimulation they provide is insufficient. Therefore, a dynamic network model based on the SIS (susceptible-infected-susceptible) epidemic model is constructed. It analyzes the spreading path of recommendation information and the “infection” state of consumers in order to improve the matching accuracy of the recommendation system.
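For context, the standard SIS compartmental dynamics that the abstract adapts (our restatement of the textbook formulation, with β the infection rate, γ the recovery rate, and N the population size) are:

```latex
\frac{dS}{dt} = -\beta \frac{S I}{N} + \gamma I, \qquad
\frac{dI}{dt} = \beta \frac{S I}{N} - \gamma I, \qquad S + I = N
```

In the paper's analogy, an “infected” consumer is presumably one currently influenced by the recommendation information, and “recovery” corresponds to that influence wearing off, after which the consumer can be re-exposed.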

