A View on Deep Reinforcement Learning in Imperfect Information Games

2020 · Vol 65 (2) · pp. 31
Author(s): T.V. Pricope

Many real-world applications can be described as large-scale games of imperfect information. These games are considerably harder than deterministic ones, as the search space is even larger. In this paper, I explore the power of reinforcement learning in such an environment by studying one of the most popular games of this type, no-limit Texas Hold’em Poker, which remains unsolved; I develop multiple agents with different learning paradigms and techniques and then compare their respective performances. When applied to no-limit Hold’em Poker, deep reinforcement learning agents clearly outperform agents with a more traditional approach. Moreover, while the latter rival a beginner human level of play, the agents based on reinforcement learning compare to an amateur human player. The main algorithm uses Fictitious Play in combination with artificial neural networks (ANNs) and some handcrafted metrics. We also applied the main algorithm to another game of imperfect information, less complex than Poker, to show the scalability of this solution and the increase in performance when put head-to-head with established classical approaches from the reinforcement learning literature.
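As a rough sketch of the fictitious-play core mentioned above (the toy payoff matrix and exact best response are illustrative stand-ins; the paper's agents replace them with ANN approximations and handcrafted poker metrics):

```python
import numpy as np

# Minimal sketch of normal-form fictitious play on a toy zero-sum game
# (rock-paper-scissors). Each iteration best-responds to the empirical
# average of past play; in self-play the averages approach equilibrium.
payoffs = np.array([[ 0, -1,  1],
                    [ 1,  0, -1],
                    [-1,  1,  0]])  # row player's payoff matrix

counts = np.ones(3)  # empirical action counts accumulated so far
for t in range(10000):
    avg_strategy = counts / counts.sum()              # empirical average strategy
    best_response = np.argmax(payoffs @ avg_strategy)  # exact best response
    counts[best_response] += 1                         # self-play update

print("average strategy:", counts / counts.sum())  # approaches (1/3, 1/3, 1/3)
```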

Electronics · 2021 · Vol 10 (17) · pp. 2087
Author(s): Jiahui Xu, Jing Chen, Shaofei Chen

In the development of artificial intelligence (AI), games have often served as benchmarks that drive remarkable breakthroughs in models and algorithms. No-limit Texas Hold’em (NLTH) is one of the most popular and challenging poker games. Despite numerous studies on this subject, some important problems remain unsolved, such as opponent exploitation, i.e., adaptively and effectively exploiting specific opponent strategies; this is acknowledged as a vital issue in NLTH and many real-world scenarios. Previous researchers tried to use an off-policy reinforcement learning (RL) method to train agents that learn directly from historical strategy interactions, but suffered from sparse rewards. Other researchers instead adopted a neuroevolutionary (NE) method to replace RL for policy parameter updates, but suffered from high sample complexity due to the large scale of NLTH. In this work, we propose NE_RL, a novel method combining NE with RL for opponent exploitation in NLTH. Our method uses a hybrid framework that exploits NE’s advantage of evolutionary computation with a long-term fitness metric to address the sparse reward feedback in NLTH, and retains RL’s gradient-based updates for higher learning efficiency. Experimental results against multiple baseline opponents demonstrate the feasibility of our method, with significant improvements over previous methods. We hope this paper provides an effective new approach for opponent exploitation in NLTH and other large-scale imperfect information games.
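A schematic sketch of such a hybrid evolution-plus-gradient loop is given below; `Policy`, `long_term_fitness`, and `gradient_update` are hypothetical placeholders, not the paper's actual architecture or operators:

```python
import copy
import random

# Schematic hybrid loop in the spirit of NE_RL: a population is selected by a
# long-term fitness (sidestepping sparse per-hand rewards) and the promising
# individuals are refined with gradient-based RL steps for sample efficiency.

class Policy:
    def __init__(self):
        self.weights = [random.gauss(0, 1) for _ in range(8)]

    def mutate(self, sigma=0.1):
        child = copy.deepcopy(self)
        child.weights = [w + random.gauss(0, sigma) for w in child.weights]
        return child

def long_term_fitness(policy):
    # Stand-in for average winnings over many NLTH hands against an opponent.
    return -sum(w * w for w in policy.weights)

def gradient_update(policy):
    # Stand-in for an off-policy RL gradient step on the policy parameters.
    policy.weights = [w * 0.99 for w in policy.weights]

population = [Policy() for _ in range(10)]
for generation in range(50):
    population.sort(key=long_term_fitness, reverse=True)
    elites = population[:5]
    for p in elites:                 # RL refines the promising individuals
        gradient_update(p)
    population = elites + [random.choice(elites).mutate() for _ in range(5)]

population.sort(key=long_term_fitness, reverse=True)
print("best fitness:", long_term_fitness(population[0]))
```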


2020 · Vol 34 (04) · pp. 4577-4584
Author(s): Xian Yeow Lee, Sambit Ghadai, Kai Liang Tan, Chinmay Hegde, Soumik Sarkar

The robustness of Deep Reinforcement Learning (DRL) algorithms towards adversarial attacks in real-world applications, such as those deployed in cyber-physical systems (CPS), is of increasing concern. Numerous studies have investigated the mechanisms of attacks on the RL agent's state space. Nonetheless, attacks on the RL agent's action space (corresponding to actuators in engineering systems) are equally pernicious, yet relatively less studied in the ML literature. In this work, we first frame the problem as an optimization problem of minimizing the cumulative reward of an RL agent with decoupled constraints as the attack budget. We propose the white-box Myopic Action Space (MAS) attack algorithm that distributes the attacks across the action-space dimensions. Next, we reformulate the optimization problem with the same objective function, but with a temporally coupled constraint on the attack budget to account for the approximated dynamics of the agent. This leads to the white-box Look-ahead Action Space (LAS) attack algorithm that distributes the attacks across the action and temporal dimensions. Our results show that, using the same amount of resources, the LAS attack degrades the agent's performance significantly more than the MAS attack. This reveals the possibility that, with limited resources, an adversary can exploit the agent's dynamics to craft attacks that cause the agent to fail. Additionally, we leverage these attack strategies as a tool to gain insight into the potential vulnerabilities of DRL agents.
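The myopic idea is simple enough to sketch: perturb the executed action along the direction that decreases the agent's value estimate, then project the perturbation back onto an attack budget. The quadratic critic below is a hypothetical stand-in for the agent's learned Q-function, not the paper's formulation:

```python
import numpy as np

# Sketch of a myopic action-space attack under an L2 budget: gradient descent
# on the value estimate with projection onto the budget ball.

def q_value(action):
    return -np.sum((action - 1.0) ** 2)   # toy critic: value peaks at [1, 1]

def q_gradient(action):
    return -2.0 * (action - 1.0)

def mas_attack(action, budget=0.3, steps=20, lr=0.05):
    delta = np.zeros_like(action)
    for _ in range(steps):
        delta -= lr * q_gradient(action + delta)  # descend the value estimate
        norm = np.linalg.norm(delta)
        if norm > budget:                         # project onto the budget ball
            delta *= budget / norm
    return action + delta

nominal = np.array([0.8, 1.2])
attacked = mas_attack(nominal)
print(q_value(nominal), q_value(attacked))  # attacked value is strictly worse
```

A look-ahead variant in this spirit would instead spread one shared budget over several future steps using an approximate dynamics model.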


Author(s): Daochen Zha, Kwei-Herng Lai, Songyi Huang, Yuanpu Cao, Keerthana Reddy, ...

We present RLCard, a Python platform for reinforcement learning research and development in card games. RLCard supports various card environments and several baseline algorithms with unified, easy-to-use interfaces, aiming to bridge reinforcement learning and imperfect information games. The platform provides flexible configuration of state representation, action encoding, and reward design. RLCard also supports visualizations for algorithm debugging. In this demo, we showcase two representative environments and their visualization results. We conclude with challenges and research opportunities brought by RLCard. A video is available on YouTube.
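A minimal usage sketch of the platform follows; attribute names such as `num_actions` and `num_players` follow recent RLCard releases and may differ in older versions:

```python
import rlcard
from rlcard.agents import RandomAgent

# Create a card environment and run one full game with random baseline agents.
env = rlcard.make('leduc-holdem', config={'seed': 42})
env.set_agents([RandomAgent(num_actions=env.num_actions)
                for _ in range(env.num_players)])

trajectories, payoffs = env.run(is_training=False)  # one complete hand
print('payoffs:', payoffs)
```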


Author(s): Noam Brown, Tuomas Sandholm

Counterfactual regret minimization (CFR) is a family of iterative algorithms that constitute the most popular and, in practice, fastest approach to approximately solving large imperfect-information games. In this paper we introduce novel CFR variants that 1) discount regrets from earlier iterations in various ways (in some cases differently for positive and negative regrets), 2) reweight iterations in various ways to obtain the output strategies, 3) use a non-standard regret minimizer, and/or 4) leverage “optimistic regret matching”. They lead to dramatically improved performance in many settings. In particular, we introduce a variant that outperforms CFR+, the prior state-of-the-art algorithm, in every game tested, including large-scale realistic settings. CFR+ is a formidable benchmark: no other algorithm has been able to outperform it. Finally, we show that, unlike CFR+, many of the important new variants are compatible with modern imperfect-information-game pruning techniques, and one is also compatible with sampling in the game tree.
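The discounting scheme is concrete enough to sketch at a single information set. The toy loop below applies the published DCFR weights (α = 1.5, β = 0, γ = 2) to random stand-in regrets; the full algorithm computes counterfactual regrets by traversing the game tree:

```python
import numpy as np

# Discounted regret updates at one information set, DCFR-style: on iteration t,
# positive regrets are scaled by t^alpha / (t^alpha + 1), negative regrets by
# t^beta / (t^beta + 1), and average-strategy contributions by (t/(t+1))^gamma.
alpha, beta, gamma = 1.5, 0.0, 2.0
num_actions = 3
regrets = np.zeros(num_actions)
strategy_sum = np.zeros(num_actions)

def regret_matching(r):
    positive = np.maximum(r, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.full(num_actions, 1.0 / num_actions)

rng = np.random.default_rng(0)
for t in range(1, 1001):
    strategy = regret_matching(regrets)
    regrets += rng.normal(size=num_actions)   # stand-in for counterfactual regrets
    pos_w = t**alpha / (t**alpha + 1)
    neg_w = t**beta / (t**beta + 1)
    regrets = np.where(regrets > 0, regrets * pos_w, regrets * neg_w)
    strategy_sum += ((t / (t + 1)) ** gamma) * strategy

print("average strategy:", strategy_sum / strategy_sum.sum())
```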


2021 · Vol 66 (2) · pp. 51
Author(s): T.-V. Pricope

Imperfect information games describe many practical applications found in the real world, as the information space is rarely fully available. This particular set of problems is challenging due to the randomness that makes even adaptive methods fail to correctly model the problem and find the best solution. Neural Fictitious Self-Play (NFSP) is a powerful algorithm for learning an approximate Nash equilibrium of imperfect information games from self-play. However, it uses only raw data as input, and its most successful experiment was on the limit version of Texas Hold’em Poker. In this paper, we develop a new variant of NFSP that combines the established fictitious self-play with neural gradient play in an attempt to improve performance on large-scale zero-sum imperfect information games and to solve the more complex no-limit version of Texas Hold’em Poker, using powerful handcrafted metrics and heuristics alongside raw data. When applied to no-limit Hold’em Poker, the agents trained through self-play outperformed the ones that used fictitious play with a normal-form single-step approach to the game. Moreover, we showed that our algorithm converges close to a Nash equilibrium within the limited training process of our agents on very limited hardware. Finally, our best self-play-based agent learnt a strategy that rivals an expert human level of play.
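For orientation, here is a schematic of the standard NFSP action-selection mixture that such variants build on: with probability η (the anticipatory parameter) the agent acts from its RL-trained best-response network and logs the decision for supervised averaging; otherwise it plays from its average-policy network. The stub networks are hypothetical placeholders:

```python
import random

ETA = 0.1  # anticipatory parameter

class StubNet:
    """Hypothetical stand-in for a neural policy over 3 actions."""
    def act(self, state):
        return random.randrange(3)

def select_action(state, best_response_net, average_policy_net, sl_buffer, eta=ETA):
    if random.random() < eta:
        action = best_response_net.act(state)   # RL best response (DQN-like)
        sl_buffer.append((state, action))       # data for the average-policy net
    else:
        action = average_policy_net.act(state)  # approximates the average strategy
    return action

sl_buffer = []
action = select_action(state=0,
                       best_response_net=StubNet(),
                       average_policy_net=StubNet(),
                       sl_buffer=sl_buffer)
print(action)
```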


2007 · Vol 19 (11) · pp. 3051-3087
Author(s): Hajime Fujita, Shin Ishii

Games constitute a challenging domain of reinforcement learning (RL) for acquiring strategies because many of them include multiple players and many unobservable variables in a large state space. The difficulty of solving such realistic multiagent problems with partial observability arises mainly from the fact that the computational cost of estimation and prediction over the whole state space, including unobservable variables, is too heavy. To overcome this intractability and enable an agent to learn in an unknown environment, an effective approximation method that explicitly learns the environmental model is required. We present a model-based RL scheme for large-scale multiagent problems with partial observability and apply it to the card game Hearts. This game is a well-defined example of an imperfect information game and can be approximately formulated as a partially observable Markov decision process (POMDP) for a single learning agent. To reduce the computational cost, we use a sampling technique in which the heavy integration required for estimation and prediction is approximated by a reasonable number of samples. Computer simulation results show that our method is effective in solving such a difficult, partially observable multiagent problem.
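A toy illustration of the sampling idea: instead of integrating over every hidden card configuration, draw a modest number of configurations consistent with what the agent has observed and average a value estimate over them. The `value_of` function is a hypothetical stand-in for the learned model, not the paper's estimator:

```python
import random

FULL_DECK = list(range(52))

def sample_hidden_hands(observed_cards, num_opponents=3, hand_size=5, num_samples=100):
    """Draw hidden-hand configurations consistent with the observation."""
    unseen = [c for c in FULL_DECK if c not in observed_cards]
    samples = []
    for _ in range(num_samples):
        deal = random.sample(unseen, num_opponents * hand_size)
        samples.append([deal[i * hand_size:(i + 1) * hand_size]
                        for i in range(num_opponents)])
    return samples

def value_of(action, hidden_hands):
    # Hypothetical value model over one sampled hidden configuration.
    return -float(action in sum(hidden_hands, []))

def estimate_action_value(action, observed_cards):
    samples = sample_hidden_hands(observed_cards)
    return sum(value_of(action, s) for s in samples) / len(samples)

print(estimate_action_value(action=20, observed_cards=set(range(13))))
```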


Biomimetics · 2021 · Vol 6 (1) · pp. 13
Author(s): Adam Bignold, Francisco Cruz, Richard Dazeley, Peter Vamplew, Cameron Foale

Interactive reinforcement learning methods utilise an external information source to evaluate decisions and accelerate learning. Previous work has shown that human advice can significantly improve learning agents’ performance. When evaluating reinforcement learning algorithms, it is common to repeat experiments as parameters are altered or to gain a sufficient sample size. In this regard, requiring human interaction every time an experiment is restarted is undesirable, particularly when the expense of doing so can be considerable. Additionally, reusing the same people for the experiments introduces bias, as they will learn the behaviour of the agent and the dynamics of the environment. This paper presents a methodology for evaluating interactive reinforcement learning agents by employing simulated users. Simulated users allow human knowledge, bias, and interaction to be modelled cheaply and repeatably. Their use enables the development and testing of reinforcement learning agents, and can provide indicative results of agent performance under defined human constraints. While simulated users are no replacement for actual humans, they do offer an affordable and fast alternative for evaluating assisted agents. We introduce a method for performing a preliminary evaluation utilising simulated users, showing how performance changes depending on the type of user assisting the agent. Moreover, we describe how human interaction may be simulated, and present an experiment illustrating the applicability of simulated users in evaluating agent performance when assisted by different types of trainers. Experimental results show that this methodology allows for greater insight into the performance of interactive reinforcement learning agents when advised by different users. The use of simulated users with varying characteristics allows for evaluation of the impact of those characteristics on the behaviour of the learning agent.
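A minimal sketch of one way such a simulated user might be parameterised, by availability (how often advice is offered) and accuracy (how often it is correct); all names are illustrative, and real simulated users can model far richer human biases:

```python
import random

class SimulatedUser:
    """Toy simulated trainer for interactive RL experiments."""
    def __init__(self, availability=0.5, accuracy=0.8, num_actions=4):
        self.availability = availability  # probability of offering advice
        self.accuracy = accuracy          # probability the advice is correct
        self.num_actions = num_actions

    def advise(self, state, optimal_action):
        if random.random() > self.availability:
            return None                             # user stays silent
        if random.random() < self.accuracy:
            return optimal_action                   # correct advice
        return random.randrange(self.num_actions)   # mistaken advice

user = SimulatedUser(availability=0.3, accuracy=0.9)
advice = user.advise(state=0, optimal_action=2)
print("advice:", advice)  # None when the user chose not to interact
```

Varying these parameters across runs lets an experimenter measure how agent performance responds to different trainer characteristics without recruiting new human participants.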

