Key takeaways from ICML 2019

Pieter Abbeel’s Some Explorations of Exploration in Reinforcement Learning

The final presentation we’ll discuss from the 2019 International Conference on Machine Learning is Pieter Abbeel’s Some Explorations of Exploration in Reinforcement Learning.

Many of the reinforcement learning talks I attended were highly specialized, tailored to small sub-communities within the field. Pieter Abbeel, however, gave a terrific review of different methods for implementing exploration in a reinforcement learning algorithm. Abbeel also has a handful of YouTube videos (including ones from the UC Berkeley Deep RL Bootcamp) that do a great job of explaining the general principles of reinforcement learning.

Pieter Abbeel presenting at ICML 2019.

He started his talk with the methods that are simplest to implement and worked up to those that are more difficult to use. The simplest included action noise (epsilon-greedy, Boltzmann, and max-entropy exploration), parameter noise, and Q-ensembles. The action and parameter noise routes do exactly what they sound like: they add randomness to the agent’s actions (or to the policy’s parameters) using different formulations. In Q-ensembles, you learn an ensemble of Q-functions by training multiple neural networks with different initializations and different dropout masks, and then sample one ensemble member per rollout.
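To make the “simple” end of the spectrum concrete, here is a minimal sketch (not from the talk) of epsilon-greedy and Boltzmann action selection over a vector of Q-values; the toy Q-values, epsilon, and temperature are made up purely for illustration.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a uniformly random action, else the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                                 # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))

q = [0.2, 1.5, 0.7]                                        # toy Q-values for three actions
print(epsilon_greedy(q), boltzmann(q, temperature=0.5))
```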

Exploration bonuses were the next level up from “simplest to implement.” This is again exactly what it sounds like: the agent receives an extra reward for moving to states it has rarely or never explored. For tabular Markov Decision Processes (like tabular Q-learning), a Bayesian-optimal exploration bonus is typically proportional to 1/√n, where n is the number of times a state has been visited. For high-dimensional state spaces, visitation counts are trickier to maintain and are instead kept on hashes of a learned embedding space. Curiosity and Variational Information Maximization are other exploration-bonus methods; these two approaches involve model-based RL, with the bonus determined by the prediction error of the learned dynamics model.
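For the tabular case, the count-based bonus is easy to sketch. The class name, the beta coefficient, and the idea of hashing states into a count table are illustrative assumptions, not details from the talk:

```python
import numpy as np
from collections import defaultdict

class CountBonus:
    """Tabular count-based exploration bonus proportional to 1/sqrt(n(s))."""

    def __init__(self, beta=0.1):
        self.beta = beta                  # scale of the bonus (illustrative value)
        self.counts = defaultdict(int)    # n(s): visitation count per state key

    def __call__(self, state_key):
        # For high-dimensional observations, state_key would be a hash of a
        # learned embedding rather than the raw observation itself.
        self.counts[state_key] += 1
        return self.beta / np.sqrt(self.counts[state_key])

bonus = CountBonus()
# During training the agent would optimize r_env + bonus(state) instead of r_env alone.
print(bonus("s0"), bonus("s0"), bonus("s1"))   # the bonus decays as a state is revisited
```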

Meta-learning methods for exploration were also discussed. Latent Exploration Spaces (LAE) use generative models for exploration: the aim is to learn a better space in which to inject noise, rather than simply adding noise across the entire action or parameter space. GANs and Variational Auto-Encoders (VAEs) can be used to learn a latent space of behaviors that spans the input task distribution. In Model-Agnostic Exploration with Structured Noise (MAESN), the stochastic policy takes both a state and a latent vector z as input and outputs the action (as opposed to taking just the state). The latent vector z could potentially be learned with a VAE or GAN.
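As a rough illustration of the MAESN idea (and only that, not the actual MAESN implementation), a latent-conditioned policy can be sketched as a small network that concatenates the state with a per-episode latent vector z; the dimensions and layer sizes below are made up:

```python
import torch
import torch.nn as nn

class LatentConditionedPolicy(nn.Module):
    """Sketch of a MAESN-style policy: action = pi(state, z), where z is a
    latent exploration variable sampled once per episode rather than per step."""

    def __init__(self, state_dim, latent_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, z):
        # Conditioning on z lets the agent commit to a coherent exploration
        # "strategy" for a whole episode instead of adding per-step random noise.
        return self.net(torch.cat([state, z], dim=-1))

state_dim, latent_dim, action_dim = 8, 2, 3      # made-up dimensions
policy = LatentConditionedPolicy(state_dim, latent_dim, action_dim)
z = torch.randn(1, latent_dim)                   # sampled once at the start of the episode
action = policy(torch.randn(1, state_dim), z)    # the same z is reused at every step
```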

Abbeel quickly touched on some of the more complex ways to implement exploration, including Transfer + Bonuses, Behavior Diversity, and Hindsight Experience Replay (HER). Many of these methods would be useful to try out the next time we implement a reinforcement learning algorithm.
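Of these, HER is perhaps the easiest to illustrate. Below is a minimal sketch of the relabeling idea using the “final” goal of an episode; the dictionary transition format and the sparse reward function are assumptions made for this example, not part of Abbeel’s talk:

```python
import numpy as np

def her_relabel(episode, reward_fn):
    """Hindsight Experience Replay, 'final' strategy: replay each transition a
    second time, pretending the goal actually reached at the end of the episode
    was the intended goal all along."""
    achieved_at_end = episode[-1]["achieved_goal"]
    relabeled = [
        {**t, "goal": achieved_at_end,
         "reward": reward_fn(t["achieved_goal"], achieved_at_end)}
        for t in episode
    ]
    return episode + relabeled

# A hypothetical sparse reward: 1.0 only when the achieved goal matches the desired one.
reward_fn = lambda achieved, desired: float(np.allclose(achieved, desired))

episode = [
    {"achieved_goal": np.array([0.2]), "goal": np.array([1.0]), "reward": 0.0},
    {"achieved_goal": np.array([0.5]), "goal": np.array([1.0]), "reward": 0.0},
]
buffer = her_relabel(episode, reward_fn)  # the relabeled copy of the last step now has reward 1.0
```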

Overall, I thought ICML was very informative and gave the team and me excellent ideas for improving our current machine learning models. Attending was a great way to see how the bleeding edge of AI research is evolving and where we can bring state-of-the-art models into our product. The Panoramic AI Research Team is looking forward to presenting at ICML next year.