Introduction
Reinforcement learning comes with a wide range of closely related concepts, each slightly different from the last: value iteration, policy iteration, PPO, TRPO, REINFORCE, A2C, A3C… If these are confusing to you and muddle together into one blob of ideas, that is completely expected! Four decades of the proud tradition of categorization and naming have overloaded terms and letters, making it needlessly hard for newcomers to understand what's going on.
The goal of this article is not to explain each of these terms (we would be stuck here forever!) but to give you a mental model of the constraints that carve out the categories these algorithms fall into. My goal is that, by the end, you can look at an algorithm, place it in a set of categories, and understand its trade-offs and which domains it suits best.
Categorizations between RL Algorithms
Below is a flowchart of how to think about RL algorithms. I'll warn you before you peruse it: it is a simplification. In truth, the categories can be fuzzy (we'll see an example later), and the distinctions don't follow a clean hierarchical structure. But for a first venture into the topic, this reduction will do:
Since the leaves of this flowchart describe quite a few algorithms (some familiar faces include PPO, Q-Learning, and value iteration), I would rather focus on the branching points, and on how the constraints of our system have shaped the kinds of algorithms we have developed.
Model-Based vs Model-Free
Do not confuse the policy models we train with the "model" in this term; the word is overloaded. When we talk about a model here, we mean a learned model of the environment and the rewards it produces.
The core distinction between model-based and model-free is whether you have a learned model of the environment, or whether you are interacting with the environment directly and gleaning rewards from those experiences.
Imagine you are training a robot to play basketball. You can take as many shots at the basket as you'd like. You are not in a data-constrained regime, but simulating all the interactions between a basketball and the basket is not a trivial task. This is an excellent domain for model-free RL: training data is easy to collect directly.
But now imagine you couldn't actually collect that many training data points directly. You need to be more sample efficient and get more bang for your buck out of every shot your robot takes. Instead, you'd teach the robot about gravity, acceleration, and air resistance. The robot then builds a mental model of what will happen when it throws the ball, which will hopefully help it land more good shots. This is model-based RL.
Note that many model-free methods still interact with simulated environments; in self-driving, for example, policies are trained in simulation. This is still model-free learning, because the agent is not leveraging a learned environment model to predict future states or rewards.
Here is the split: if the only way to learn the reward of an action is to take that action, the algorithm is model-free. If instead you can query an environment model to get the value of an action, it is model-based.
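To make the split concrete, here is a minimal toy sketch in Python. The one-dimensional environment, the `step` function, and the `learned_model` name are all made up for illustration; the point is only that the model-free path has to act to observe a reward, while the model-based path can query a model to imagine the outcome first.

```python
import random

# Hypothetical toy environment: states 0..4, reaching state 4 gives reward 1.
def step(state, action):
    next_state = max(0, min(4, state + action))   # action is -1 or +1
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

# Model-free flavor: the only way to know the reward of an action
# is to actually take it in the (possibly simulated) environment.
def model_free_rollout(policy, start_state=0, horizon=10):
    state, total = start_state, 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = step(state, action)        # real interaction
        total += reward
    return total

# Model-based flavor: we also have (or learn) a model of the environment,
# and can query it to imagine outcomes without acting.
learned_model = step  # here we cheat and assume a perfect learned model

def model_based_plan(state):
    # Imagine both actions with the learned model and pick the better one.
    best_action, best_reward = None, float("-inf")
    for action in (-1, +1):
        _, imagined_reward = learned_model(state, action)  # no real interaction
        if imagined_reward > best_reward:
            best_action, best_reward = action, imagined_reward
    return best_action

random_policy = lambda s: random.choice((-1, +1))
print(model_free_rollout(random_policy))
print(model_based_plan(3))
```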
Learning Values vs Policies
In model-free training, you can really learn two things:
- What is the value of the current state I am in? That is, how close is this state to my goal?
- What is the value of the action I am considering? Will it get me closer to my goal?
And you can also learn both simultaneously.
What is the difference? When your action space is finite and discrete (so you can assign a number to each state and action), when you have clear final rewards, or when you need guarantees about finding the optimal policy, value learning is your friend.
On the other hand, policy learning shines when you have continuous or high-dimensional action spaces and a very large search space. It also adapts well when the optimal policy is stochastic or the reward structure is complex or sparse.
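If it helps to see the two in code, here is a rough sketch contrasting a tabular, Q-learning-style value update with a REINFORCE-style policy-gradient update. The table sizes, learning rate, and toy transition are arbitrary placeholders, not the settings of any particular published algorithm.

```python
import numpy as np

n_states, n_actions, alpha, gamma = 5, 2, 0.1, 0.99

# Value learning: maintain a table of action values and update it
# toward observed rewards (a Q-learning-style update).
Q = np.zeros((n_states, n_actions))

def value_update(s, a, r, s_next):
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# Policy learning: maintain policy parameters directly and nudge them
# to make rewarded actions more likely (a REINFORCE-style update).
theta = np.zeros((n_states, n_actions))

def softmax_policy(s):
    prefs = theta[s] - theta[s].max()
    return np.exp(prefs) / np.exp(prefs).sum()

def policy_update(s, a, discounted_return):
    probs = softmax_policy(s)
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0   # gradient of log pi(a|s) for a softmax policy
    theta[s] += alpha * discounted_return * grad_log_pi

# Example: one observed transition and one observed return.
value_update(s=0, a=1, r=0.0, s_next=1)
policy_update(s=0, a=1, discounted_return=1.0)
```

The "learn both simultaneously" case corresponds to actor-critic methods (A2C and A3C from the opening list), which keep a value estimate and policy parameters around and update them together.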
On-Policy vs Off-Policy
On-policy learning means the agent learns directly from actions it took using its current policy. The policy being improved is the same one generating the behavior. Conversely, off-policy learning means the agent can learn from actions that came from a different policy (behavior policy) than the one being optimized (target policy).
This means that an off-policy algorithm can learn from old trajectories or from an expert's trajectories, while an on-policy algorithm must learn from its own actions.
Off-policy algorithms therefore allow reusing data or alternative data-collection strategies, which can help with bootstrapping models or learning more efficiently. However, they can also be unstable when there is a mismatch between the behavior policy generating the data and the target policy being learned. On-policy methods also naturally balance the exploration-exploitation trade-off, whereas off-policy algorithms need this balance to be carefully engineered.
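One way to see the difference is in how the classic tabular updates bootstrap. The sketch below contrasts a SARSA-style on-policy update, which uses the action the behavior policy actually took next, with a Q-learning-style off-policy update, which bootstraps from the greedy action regardless of what was actually done. The toy table and hyperparameters are placeholders.

```python
import numpy as np

n_states, n_actions, alpha, gamma = 5, 2, 0.1, 0.99
Q = np.zeros((n_states, n_actions))

# On-policy (SARSA-style): the update uses a_next, the action the current
# behavior policy actually took in s_next, so learning is tied to the
# policy that generated the data.
def on_policy_update(s, a, r, s_next, a_next):
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# Off-policy (Q-learning-style): the update bootstraps from the greedy
# action in s_next, regardless of what the behavior policy actually did,
# so the transition could just as well come from an old replay buffer
# or an expert's trajectories.
def off_policy_update(s, a, r, s_next):
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# Both consume transitions; only the off-policy update is indifferent
# to which policy produced them.
on_policy_update(s=0, a=1, r=0.0, s_next=1, a_next=0)
off_policy_update(s=0, a=1, r=0.0, s_next=1)
```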
Pop Quiz: What is AlphaGo?
I think it’s helpful to try to categorize AlphaGo with everything we’ve learned today! So, how does AlphaGo work? It has two phases:
- Sampling Phase: You play a ton of games with a policy model and a value model, and collect the trajectories and final rewards (win/lose).
- Training Phase: You train the policy model on the trajectories and the value model on the corresponding rewards.
There is an additional detail about using Monte-Carlo Tree Search at the sampling phase to 1/ make better moves by trying out a bunch of them, and 2/ get a scoring distribution over sets of next actions to use in fine-tuning — but we will skip these details for now to focus on our core task.
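To make the two phases concrete, here is a deliberately over-simplified skeleton of the loop. Every function in it is a stub I made up (the real system uses neural networks, a Go engine, and MCTS); it only shows how sampling with the current models feeds a separate training step.

```python
import random

# Hypothetical stand-ins: in the real system these would be neural networks
# and a Go engine; here they are trivial stubs so the loop structure is clear.
def policy_model(state):          # returns a move
    return random.choice(["A", "B", "C"])

def value_model(state):           # returns an estimated win probability
    return 0.5

def play_game(policy):
    trajectory = [(t, policy(t)) for t in range(10)]   # (state, action) pairs
    outcome = random.choice([+1, -1])                  # win / lose
    return trajectory, outcome

def train_policy(trajectories):            # stub: fit the policy to sampled actions
    pass

def train_value(trajectories, outcomes):   # stub: regress states onto final outcomes
    pass

# Sampling phase: play many games with the current models.
games = [play_game(policy_model) for _ in range(100)]
trajectories = [g[0] for g in games]
outcomes = [g[1] for g in games]

# Training phase: fit the policy to the trajectories and the value
# model to the corresponding final rewards.
train_policy(trajectories)
train_value(trajectories, outcomes)
```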
Is AlphaGo on-policy or off-policy? Note how the sampling and training phases are distinct. The policy that took the actions during the sampling phase is not the policy that consumes them at, say, iteration 20 of the training phase. Therefore, AlphaGo is an off-policy algorithm.
Does AlphaGo learn a policy directly or a value function? As you read above… both! It learns a value function over states and a policy function over next actions. So AlphaGo is both a policy learning and value learning algorithm.
Is AlphaGo Model-Based or Model-Free? Hint: trick question. It's both! For the model itself, we're not leveraging the simulator for anything beyond taking many actions. However, when using Monte Carlo Tree Search on top, we're retaining information about previous sequences of actions that tells the model what to do next ("this branch is bad, this branch is good…"). So AlphaGo is both model-free and model-based.
Conclusion
I like AlphaGo as an example of how these categories help you think through trade-offs without being strict. For every split, an algorithm's answer might be one side, both, neither, or one at first and then the other, and so on.
In a domain of research as storied and long-standing as reinforcement learning, almost every rule comes with a paper that is its exception.