Q-learning with Long-term Action-space Shaping to Model Complex Behavior for Autonomous Lane Changes

Author(s):  
Gabriel Kalweit ◽  
Maria Huegle ◽  
Moritz Werling ◽  
Joschka Boedecker


Author(s):  
Eugene Ie ◽  
Vihan Jain ◽  
Jing Wang ◽  
Sanmit Narvekar ◽  
Ritesh Agarwal ◽  
...  

Reinforcement learning methods for recommender systems optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items---which may have interacting effects on user choice---methods are required to deal with the combinatorics of the RL action space. We develop SlateQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates. Under mild assumptions on user choice behavior, we show that the long-term value (LTV) of a slate can be decomposed into a tractable function of its component item-wise LTVs. We demonstrate our methods in simulation, and validate the scalability and effectiveness of decomposed TD-learning on YouTube.
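
For orientation, a minimal sketch of the kind of item-wise decomposition described above, in assumed notation (Q, P, and Q-bar are illustrative symbols, not quoted from the paper): given a user choice model P(i | s, A) for the probability that item i is consumed from slate A in user state s, the slate's long-term value can be written as a choice-weighted sum of item-wise long-term values:

    % Illustrative decomposition; all symbols are assumptions for this sketch.
    Q(s, A) = \sum_{i \in A} P(i \mid s, A)\, \bar{Q}(s, i)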


Aerospace ◽  
2021 ◽  
Vol 8 (4) ◽  
pp. 113
Author(s):  
Pedro Andrade ◽  
Catarina Silva ◽  
Bernardete Ribeiro ◽  
Bruno F. Santos

This paper presents a Reinforcement Learning (RL) approach to optimize the long-term scheduling of maintenance for an aircraft fleet. The problem considers fleet status, maintenance capacity, and other maintenance constraints to schedule hangar checks for a specified time horizon. The checks are scheduled within an interval, and the goal is to schedule them as close as possible to their due date. In doing so, the number of checks is reduced and fleet availability increases. A Deep Q-learning algorithm is used to optimize the scheduling policy. The model is validated in a real scenario using maintenance data from 45 aircraft. The maintenance plan generated with our approach is compared with a previous study, which presented a Dynamic Programming (DP) based approach, and with airline estimations for the same period. The results show a reduction in the number of checks scheduled, which indicates the potential of RL in solving this problem. The adaptability of RL is also tested by introducing small disturbances in the initial conditions. After training the model with these simulated scenarios, the results show the robustness of the RL approach and its ability to generate efficient maintenance plans in only a few seconds.
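
As a rough, hedged illustration of how such a scheduler could be framed (problem sizes, state encoding, and reward shaping below are assumptions, and a tabular Q-table stands in for the paper's deep Q-network), the reward can simply penalize the gap between a check's scheduled slot and its due date:

    import random
    import numpy as np

    # Minimal tabular sketch: pick which aircraft check to place in the next hangar slot.
    # State and reward shaping are illustrative assumptions, not the paper's design.
    N_AIRCRAFT, N_SLOTS = 5, 30          # toy problem size
    ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1   # learning rate, discount, exploration

    Q = np.zeros((N_SLOTS, N_AIRCRAFT))  # Q[slot, aircraft] ~ value of scheduling that check now
    due = np.random.randint(5, N_SLOTS, size=N_AIRCRAFT)  # toy due dates (slots)

    for episode in range(2000):
        scheduled = np.full(N_AIRCRAFT, -1)
        for slot in range(N_SLOTS):
            pending = [a for a in range(N_AIRCRAFT) if scheduled[a] < 0]
            if not pending:
                break
            if random.random() < EPS:                      # epsilon-greedy exploration
                a = random.choice(pending)
            else:
                a = max(pending, key=lambda i: Q[slot, i])
            scheduled[a] = slot
            reward = -abs(due[a] - slot)                   # closer to due date -> higher reward
            next_best = Q[slot + 1].max() if slot + 1 < N_SLOTS else 0.0
            Q[slot, a] += ALPHA * (reward + GAMMA * next_best - Q[slot, a])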


2021 ◽  
Author(s):  
Yongmin Cho ◽  
Rachael A Jonas-Closs ◽  
Lev Y Yampolsky ◽  
Marc W Kirschner ◽  
Leonid Peshkin

We present a novel platform for testing the effect of interventions on the life- and health-span of a short-lived, semi-transparent freshwater organism with complex behavior and physiology that is sensitive to drugs: the planktonic crustacean Daphnia magna. Within this platform, dozens of complex behavioural features of both routine motion and response to stimuli are continuously and accurately quantified for large homogeneous cohorts via an automated phenotyping pipeline. We build predictive machine learning models calibrated using chronological age and extrapolate onto phenotypic age. We further apply the model to estimate phenotypic age under pharmacological perturbation. Our platform provides a scalable framework for drug screening and characterization in both life-long and instant assays, as illustrated using a long-term dose-response profile of metformin and short-term assays of well-studied substances such as caffeine and alcohol.
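
A minimal sketch of the phenotypic-age idea under stated assumptions (synthetic data, an assumed feature count, and a generic scikit-learn regressor rather than the authors' models): fit a regressor mapping behavioral features to chronological age on control animals, then read its predictions on treated cohorts as phenotypic age.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Toy stand-in data: rows = individual animals, columns = behavioral features
    # (e.g., routine-motion and stimulus-response summaries); values are synthetic.
    rng = np.random.default_rng(0)
    control_features = rng.normal(size=(200, 12))
    control_age_days = rng.uniform(1, 60, size=200)        # known chronological age

    # Calibrate on the chronological age of untreated animals.
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(control_features, control_age_days)

    # Apply to a treated cohort: the prediction is interpreted as phenotypic age,
    # so an intervention that slows aging should yield predictions below chronological age.
    treated_features = rng.normal(size=(50, 12))
    phenotypic_age = model.predict(treated_features)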


2021 ◽  
Author(s):  
Danial Esmaeili Aliabadi ◽  
Katrina Chan

Abstract
Background: According to the sustainable development goals (SDGs), societies should have access to affordable, reliable, and sustainable energy. Deregulated electricity markets have been established to provide affordable electricity for end-users by promoting competition. Although these liberalized markets are expected to serve this purpose, they are far from perfect and are prone to threats such as collusion. Tacit collusion is a condition in which power generating companies (GenCos) disrupt competition by exploiting their market power.
Methods: In this manuscript, a novel deep Q-network (DQN) model is developed, which GenCos can use to determine bidding strategies that maximize average long-term payoffs using available information. In the presence of collusive equilibria, the results are compared with a conventional Q-learning model that relies solely on past outcomes. With that, this manuscript aims to investigate the impact of emerging DQN models on the establishment of collusive equilibria in markets with repetitive interactions among players.
Results and Conclusions: The outcomes show that GenCos may collude unintentionally while trying to improve long-term profits. Collusive strategies can lead to exorbitant electric bills for end-users, which is one of the influential factors in energy poverty. Thus, policymakers and market designers should be vigilant regarding the combined effect of information disclosure and autonomous pricing, as new models exploit information more effectively.
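
A hedged sketch of the bidding setup described above (the state features, markup grid, and network size are assumptions, not the paper's configuration): the agent observes recent market outcomes, selects a markup over marginal cost, and treats realized profit as the reward in a one-step Q-learning update.

    import torch
    import torch.nn as nn

    # Illustrative DQN bidding agent: action = discrete markup over marginal cost.
    MARKUPS = torch.tensor([0.0, 0.1, 0.2, 0.4, 0.8])   # assumed action grid
    STATE_DIM = 4                                        # e.g., recent clearing prices / demand

    q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, len(MARKUPS)))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    gamma = 0.99

    def choose_markup(state, eps=0.05):
        # epsilon-greedy over the assumed markup grid
        if torch.rand(()) < eps:
            idx = torch.randint(len(MARKUPS), ())
        else:
            idx = q_net(state).argmax()
        return int(idx), float(MARKUPS[int(idx)])

    def td_update(state, action_idx, profit, next_state):
        # One-step Q-learning target on the observed market outcome (profit = reward).
        with torch.no_grad():
            target = profit + gamma * q_net(next_state).max()
        pred = q_net(state)[action_idx]
        loss = (pred - target) ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()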


Sensors ◽  
2018 ◽  
Vol 18 (11) ◽  
pp. 3606 ◽  
Author(s):  
Wanli Xue ◽  
Zhiyong Feng ◽  
Chao Xu ◽  
Zhaopeng Meng ◽  
Chengwei Zhang

Although tracking research has achieved excellent performance from a mathematical standpoint, it is still meaningful to analyze tracking problems from multiple perspectives. This motivation not only promotes the independence of tracking research but also increases the flexibility of practical applications. This paper presents a tracking framework based on multi-dimensional state–action space reinforcement learning, termed multi-angle analysis collaboration tracking (MACT). MACT comprises a basic tracking framework and a strategic framework which assists the former. Notably, the strategic framework is extensible and currently includes a feature selection strategy (FSS) and a movement trend strategy (MTS). These strategies are abstracted from a multi-angle analysis of tracking problems (the observer's attention and the object's motion). The content of the analysis corresponds to specific actions in the multi-dimensional action space. Concretely, the tracker, regarded as an agent, is trained with the Q-learning algorithm and an ϵ-greedy exploration strategy, where we adopt a customized rewarding function to encourage robust object tracking. Numerous comparative experimental evaluations on the OTB50 benchmark demonstrate the effectiveness of the strategies and the improvement in speed and accuracy of the MACT tracker.
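
For reference, a generic sketch of the Q-learning update and ϵ-greedy selection mentioned above (the state/action discretization and reward are placeholders, not the MACT design):

    import numpy as np

    # Generic tabular Q-learning with epsilon-greedy exploration; a tracker-specific
    # state encoding, action set, and rewarding function would replace the toy ones here.
    n_states, n_actions = 100, 6          # e.g., discretized tracking cues vs. strategy choices
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, eps = 0.1, 0.9, 0.1
    rng = np.random.default_rng(0)

    def select_action(state):
        if rng.random() < eps:                      # explore
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[state]))             # exploit

    def update(state, action, reward, next_state):
        td_target = reward + gamma * Q[next_state].max()
        Q[state, action] += alpha * (td_target - Q[state, action])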


2021 ◽  
Vol 17 (4) ◽  
pp. e1008847
Author(s):  
Michael Foley ◽  
Rory Smead ◽  
Patrick Forber ◽  
Christoph Riedl

Can egalitarian norms or conventions survive the presence of dominant individuals who are assured of victory in conflicts? We investigate the interaction of power asymmetry and partner choice in games of conflict over a contested resource. Previous models of cooperation do not include both power inequality and partner choice. Furthermore, models that do include power inequalities assume a static game where a bully’s advantage does not change. They have therefore not attempted to model complex and realistic properties of social interaction. Here, we introduce three models to study the emergence and resilience of cooperation among unequals when interaction is random, when individuals can choose their partners, and where power asymmetries dynamically depend on accumulated payoffs. We find that the ability to avoid bullies with higher competitive ability afforded by partner choice mostly restores cooperative conventions and that the competitive hierarchy never forms. Partner choice counteracts the hyper-dominance of bullies who are isolated in the network and eliminates the need for others to coordinate in a coalition. When competitive ability dynamically depends on cumulative payoffs, complex cycles of coupled network-strategy-rank changes emerge. Effective collaborators gain popularity (and thus power), adopt aggressive behavior, get isolated, and ultimately lose power. Neither the network nor behavior converge to a stable equilibrium. Despite the instability of power dynamics, the cooperative convention in the population remains stable overall and long-term inequality is completely eliminated. The interaction between partner choice and dynamic power asymmetry is crucial for these results: without partner choice, bullies cannot be isolated, and without dynamic power asymmetry, bullies do not lose their power even when isolated. We analytically identify a single critical point that marks a phase transition in all three iterations of our models. This critical point is where the first individual breaks from the convention and cycles start to emerge.
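
A toy sketch of one mechanism discussed above, under assumed payoffs and parameters (not the authors' model): partner choice lets repeatedly exploited agents cut ties with a high-power bully, so the bully accumulates payoff but gradually loses partners.

    import numpy as np

    # Toy sketch: partner choice versus a high-power "bully"; parameters are assumptions.
    rng = np.random.default_rng(1)
    n = 20
    power = np.ones(n); power[0] = 5.0         # agent 0 starts as the bully
    willing = np.ones((n, n), dtype=bool)      # willing[i, j]: i accepts j as a partner
    np.fill_diagonal(willing, False)

    for step in range(5000):
        i = rng.integers(n)
        candidates = np.flatnonzero(willing[i])
        if candidates.size == 0:
            continue
        j = rng.choice(candidates)
        # Contest: the stronger side wins the resource; power accumulates with payoff.
        winner, loser = (i, j) if power[i] >= power[j] else (j, i)
        power[winner] += 1.0
        if power[winner] > 2 * power[loser]:   # repeatedly exploited agents cut ties,
            willing[loser, winner] = False     # so the bully gradually becomes isolated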


2012 ◽  
Vol 3 (2) ◽  
pp. 39-57 ◽  
Author(s):  
Ioan Sorin Comsa ◽  
Mehmet Aydin ◽  
Sijing Zhang ◽  
Pierre Kuonen ◽  
Jean–Frédéric Wagen

Intelligent packet scheduling is essential to make radio resource usage more efficient in recent high-bit-rate radio access technologies such as Long Term Evolution (LTE). The packet scheduling procedure can employ various dispatching rules with different behaviors. In the literature, a scheduling discipline is applied for the entire transmission session, and scheduler performance strongly depends on the chosen discipline. The method proposed in this paper provides a schedule within the transmission time interval (TTI) sub-frame using a mixture of dispatching disciplines per TTI instead of a single rule adopted across the whole transmission. The aim is to maximize system throughput while assuring the best user fairness. This requires a policy for how to mix the rules and a refinement procedure to select the best rule each time. Two scheduling policies for mixing the rules are proposed, and a Q-learning algorithm is used to refine them. Simulation results indicate that the proposed methods outperform existing scheduling techniques, maximizing system throughput without harming user fairness.
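
A minimal sketch of the per-TTI rule-selection idea under stated assumptions (the candidate disciplines, state buckets, and reward are illustrative, not the paper's setup): a Q-learning agent picks one dispatching rule for each TTI based on the current fairness/load situation.

    import numpy as np

    # Illustrative per-TTI rule selector; disciplines and state buckets are assumptions.
    RULES = ["round_robin", "max_throughput", "proportional_fair"]
    n_fairness_bins, n_load_bins = 5, 5
    Q = np.zeros((n_fairness_bins, n_load_bins, len(RULES)))
    alpha, gamma, eps = 0.1, 0.9, 0.05
    rng = np.random.default_rng(0)

    def pick_rule(fairness_bin, load_bin):
        if rng.random() < eps:
            return int(rng.integers(len(RULES)))
        return int(np.argmax(Q[fairness_bin, load_bin]))

    def update(state, rule_idx, reward, next_state):
        # reward could be TTI throughput, penalized when the fairness index drops.
        f, l = state
        nf, nl = next_state
        target = reward + gamma * Q[nf, nl].max()
        Q[f, l, rule_idx] += alpha * (target - Q[f, l, rule_idx])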


Author(s):  
Shiyu Huang ◽  
Hang Su ◽  
Jun Zhu ◽  
Ting Chen

Deep reinforcement learning (DRL) has surpassed human performance on Atari games, using raw pixels and rewards to learn everything. However, first-person-shooter (FPS) games in 3D environments contain higher-level human concepts (enemy, weapon, spatial structure, etc.) and a large action space. In this paper, we explore a novel method which can plan on temporally-extended action sequences, which we refer to as Combo-Action, to compress the action space. We further train a deep recurrent Q-learning network as a high-level controller, called the supervisory network, to manage the Combo-Actions. Our method can be boosted with auxiliary tasks (enemy detection and depth prediction), which enable the agent to extract high-level concepts in FPS games. Extensive experiments show that our method is efficient in the training process and outperforms previous state-of-the-art approaches by a large margin. Ablation experiments also indicate that our method boosts the performance of the FPS agent in a reasonable way.
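
A hedged sketch of a recurrent Q-network acting as a high-level controller over a handful of macro-actions (the macro-action names and layer sizes are assumptions; this is not the authors' architecture):

    import torch
    import torch.nn as nn

    # Illustrative recurrent Q-network: an LSTM over observation features outputs
    # one Q-value per macro-action ("combo-action"); names below are assumptions.
    COMBO_ACTIONS = ["attack", "dodge", "explore", "collect"]
    OBS_DIM, HIDDEN = 128, 64

    class RecurrentQNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Linear(OBS_DIM, HIDDEN)
            self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)
            self.head = nn.Linear(HIDDEN, len(COMBO_ACTIONS))

        def forward(self, obs_seq, hidden=None):
            # obs_seq: (batch, time, OBS_DIM) features, e.g. from a CNN over frames.
            x = torch.relu(self.encoder(obs_seq))
            x, hidden = self.lstm(x, hidden)
            return self.head(x), hidden        # Q-values per step, plus recurrent state

    net = RecurrentQNet()
    q_values, state = net(torch.randn(1, 8, OBS_DIM))   # shape (1, 8, num macro-actions)
    next_macro = COMBO_ACTIONS[int(q_values[0, -1].argmax())]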

