scholarly journals Surveys without Questions: A Reinforcement Learning Approach

Author(s):  
Atanu R Sinha ◽  
Deepali Jain ◽  
Nikhil Sheoran ◽  
Sopan Khosla ◽  
Reshmi Sasidharan

The ‘old world’ instrument, survey, remains a tool of choice for firms to obtain ratings of satisfaction and experience that customers realize while interacting online with firms. While avenues for survey have evolved from emails and links to pop-ups while browsing, the deficiencies persist. These include - reliance on ratings of very few respondents to infer about all customers’ online interactions; failing to capture a customer’s interactions over time since the rating is a one-time snapshot; and inability to tie back customers’ ratings to specific interactions because ratings provided relate to all interactions. To overcome these deficiencies we extract proxy ratings from clickstream data, typically collected for every customer’s online interactions, by developing an approach based on Reinforcement Learning (RL). We introduce a new way to interpret values generated by the value function of RL, as proxy ratings. Our approach does not need any survey data for training. Yet, on validation against actual survey data, proxy ratings yield reasonable performance results. Additionally, we offer a new way to draw insights from values of the value function, which allow associating specific interactions to their proxy ratings. We introduce two new metrics to represent ratings - one, customer-level and the other, aggregate-level for click actions across customers. Both are defined around proportion of all pairwise, successive actions that show increase in proxy ratings. This intuitive customer-level metric enables gauging the dynamics of ratings over time and is a better predictor of purchase than customer ratings from survey. The aggregate-level metric allows pinpointing actions that help or hurt experience. In sum, proxy ratings computed unobtrusively from clickstream, for every action, for each customer, and for every session can offer interpretable and more insightful alternative to surveys.

Author(s):  
Humoud Alsabah ◽  
Agostino Capponi ◽  
Octavio Ruiz Lacedelli ◽  
Matt Stern

Abstract We introduce a reinforcement learning framework for retail robo-advising. The robo-advisor does not know the investor’s risk preference but learns it over time by observing her portfolio choices in different market environments. We develop an exploration–exploitation algorithm that trades off costly solicitations of portfolio choices by the investor with autonomous trading decisions based on stale estimates of investor’s risk aversion. We show that the approximate value function constructed by the algorithm converges to the value function of an omniscient robo-advisor over a number of periods that is polynomial in the state and action space. By correcting for the investor’s mistakes, the robo-advisor may outperform a stand-alone investor, regardless of the investor’s opportunity cost for making portfolio decisions.


2005 ◽  
Vol 24 ◽  
pp. 81-108 ◽  
Author(s):  
P. Geibel ◽  
F. Wysotzki

In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are those states entering which is undesirable or dangerous. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We will show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.


Author(s):  
Zhuo Wang ◽  
Shiwei Zhang ◽  
Xiaoning Feng ◽  
Yancheng Sui

The environmental adaptability of autonomous underwater vehicles is always a problem for its path planning. Although reinforcement learning can improve the environmental adaptability, the slow convergence of reinforcement learning is caused by multi-behavior coupling, so it is difficult for autonomous underwater vehicle to avoid moving obstacles. This article proposes a multi-behavior critic reinforcement learning algorithm applied to autonomous underwater vehicle path planning to overcome problems associated with oscillating amplitudes and low learning efficiency in the early stages of training which are common in traditional actor–critic algorithms. Behavior critic reinforcement learning assesses the actions of the actor from perspectives such as energy saving and security, combining these aspects into a whole evaluation of the actor. In this article, the policy gradient method is selected as the actor part, and the value function method is selected as the critic part. The strategy gradient and the value function methods for actor and critic, respectively, are approximated by a backpropagation neural network, the parameters of which are updated using the gradient descent method. The simulation results show that the method has the ability of optimizing learning in the environment and can improve learning efficiency, which meets the needs of real time and adaptability for autonomous underwater vehicle dynamic obstacle avoidance.


Author(s):  
Nicholay Topin ◽  
Manuela Veloso

Though reinforcement learning has greatly benefited from the incorporation of neural networks, the inability to verify the correctness of such systems limits their use. Current work in explainable deep learning focuses on explaining only a single decision in terms of input features, making it unsuitable for explaining a sequence of decisions. To address this need, we introduce Abstracted Policy Graphs, which are Markov chains of abstract states. This representation concisely summarizes a policy so that individual decisions can be explained in the context of expected future transitions. Additionally, we propose a method to generate these Abstracted Policy Graphs for deterministic policies given a learned value function and a set of observed transitions, potentially off-policy transitions used during training. Since no restrictions are placed on how the value function is generated, our method is compatible with many existing reinforcement learning methods. We prove that the worst-case time complexity of our method is quadratic in the number of features and linear in the number of provided transitions, O(|F|2|tr samples|). By applying our method to a family of domains, we show that our method scales well in practice and produces Abstracted Policy Graphs which reliably capture relationships within these domains.


Author(s):  
Ling Pan ◽  
Qingpeng Cai ◽  
Qi Meng ◽  
Wei Chen ◽  
Longbo Huang

Value function estimation is an important task in reinforcement learning, i.e., prediction. The Boltzmann softmax operator is a natural value estimator and can provide several benefits. However, it does not satisfy the non-expansion property, and its direct use may fail to converge even in value iteration. In this paper, we propose to update the value function with dynamic Boltzmann softmax (DBS) operator, which has good convergence property in the setting of planning and learning. Experimental results on GridWorld show that the DBS operator enables better estimation of the value function, which rectifies the convergence issue of the softmax operator. Finally, we propose the DBS-DQN algorithm by applying the DBS operator, which outperforms DQN substantially in 40 out of 49 Atari games.


2020 ◽  
Vol 12 (21) ◽  
pp. 8883
Author(s):  
Kun Jin ◽  
Wei Wang ◽  
Xuedong Hua ◽  
Wei Zhou

As the key element of urban transportation, taxis services significantly provide convenience and comfort for residents’ travel. However, the reality has not shown much efficiency. Previous researchers mainly aimed to optimize policies by order dispatch on ride-hailing services, which cannot be applied in cruising taxis services. This paper developed the reinforcement learning (RL) framework to optimize driving policies on cruising taxis services. Firstly, we formulated the drivers’ behaviours as the Markov decision process (MDP) progress, considering the influences after taking action in the long run. The RL framework using dynamic programming and data expansion was employed to calculate the state-action value function. Following the value function, drivers can determine the best choice and then quantify the expected future reward at a particular state. By utilizing historic orders data in Chengdu, we analysed the function value’s spatial distribution and demonstrated how the model could optimize the driving policies. Finally, the realistic simulation of the on-demand platform was built. Compared with other benchmark methods, the results verified that the new model performs better in increasing total revenue, answer rate and decreasing waiting time, with the relative percentages of 4.8%, 6.2% and −27.27% at most.


2005 ◽  
Vol 17 (2) ◽  
pp. 335-359 ◽  
Author(s):  
Jun Morimoto ◽  
Kenji Doya

This letter proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both off-line learning using simulations and for online action planning. However, the difference between the model and the real environment can lead to unpredictable, and often unwanted, results. Based on the theory of H∞ control, we consider a differential game in which a “disturbing” agent tries to make the worst possible disturbance while a “control” agent tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the amount of the reward and the norm of the disturbance. We derive online learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function. We tested the paradigm, which we call robust reinforcement learning (RRL), on the control task of an inverted pendulum. In the linear domain, the policy and the value function learned by online algorithms coincided with those derived analytically by the linear H∞ control theory. For a fully nonlinear swing-up task, RRL achieved robust performance with changes in the pendulum weight and friction, while a standard reinforcement learning algorithm could not deal with these changes. We also applied RRL to the cart-pole swing-up task, and a robust swing-up policy was acquired.


Sign in / Sign up

Export Citation Format

Share Document