We discuss some recent results on Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markovian, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that
(1) asymptotically, its value converges in mean to the optimal value, and
(2) given a recoverability assumption, its regret is sublinear.
We conclude with a discussion about optimality in reinforcement learning.
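
To make the sampling scheme concrete, the following is a minimal sketch of Thompson sampling over a finite class of Bernoulli bandits. This is a drastically simplified special case and not the general setting discussed above: bandits are Markovian and fully observable, and the class here is finite rather than countable. The function name `thompson_sampling`, the parameter `horizon` (the number of steps the agent commits to a sampled environment before resampling), and the example class are illustrative assumptions, not taken from the results summarized here.

```python
import random


def thompson_sampling(envs, prior, true_env, steps, horizon, seed=0):
    """Thompson sampling over a finite class of Bernoulli bandits (illustrative).

    envs     -- list of tuples; envs[i][a] is the success probability of arm a
                in candidate environment i (all assumed strictly inside (0, 1))
    prior    -- prior weights over envs, summing to 1
    true_env -- index of the environment that actually generates rewards
    horizon  -- assumed commitment length before a new environment is sampled
    """
    rng = random.Random(seed)
    posterior = list(prior)
    total_reward = 0
    t = 0
    while t < steps:
        # Sample one candidate environment from the current posterior.
        rho = rng.choices(range(len(envs)), weights=posterior)[0]
        # Act optimally for the sampled bandit: pull its best arm.
        arm = max(range(len(envs[rho])), key=lambda a: envs[rho][a])
        # Follow that policy for up to `horizon` steps, updating beliefs.
        for _ in range(min(horizon, steps - t)):
            reward = 1 if rng.random() < envs[true_env][arm] else 0
            total_reward += reward
            # Bayes update: reweight each candidate by the observation likelihood.
            likelihoods = [e[arm] if reward else 1 - e[arm] for e in envs]
            posterior = [w * l for w, l in zip(posterior, likelihoods)]
            z = sum(posterior)
            posterior = [w / z for w in posterior]
            t += 1
    return posterior, total_reward


if __name__ == "__main__":
    # Two hypothetical candidate bandits; the first one is the true environment.
    envs = [(0.9, 0.1), (0.1, 0.9)]
    posterior, total = thompson_sampling(envs, [0.5, 0.5], 0, steps=200, horizon=5)
    print(posterior, total)
```

In this toy class every arm pull is informative about which candidate is true, so the posterior concentrates quickly. In the general environments considered above no such guarantee holds, which is why a recoverability assumption is needed for the regret result.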