This is a series of notes associated with Steve Brunton’s excellent book “Data-Driven Science and Engineering”. I highly recommend his series of lectures available on YouTube as well as the full book itself. These notes only cover the RL sections (Chapter 11) of the book. Relatedly, some people might find my breakdown of Monte Carlo Tree Search helpful.

Introduction to Reinforcement Learning

RL Visualization

  • The core of RL is the system of interactions between the Agent and the Environment. Here it is, stated as simply as possible:
    • From the environment, the agent senses some state $s$.
    • Given $s$, it takes actions according to policy $\pi$.
      • The policy $\pi$ is optimized through learning to maximize future rewards $r$ received from the environment.
    • The action $a$ chosen by the policy affects the environment, which produces some new state $s'$ for the agent, and the loop continues.
  • The RL agent is able to learn from delayed rewards, which is critical for systems where the optimal solution involves a multi-step procedure.
    • Rewards = sporadic and time-delayed labels
    • This makes RL a form of semi-supervised learning.
  • RL policies are typically learned in two phases:
    • Exploration - where trial-and-error is used to learn the rules
    • Exploitation - where a strategy is chosen and optimized within the learned rules

Mathematical Formalism

  • The policy is the probability of taking action $a$, given state $s$, to maximize the total future rewards.

    $$ \pi(s, a) = Pr(a_k = a | s_k = s) $$

    • It is typically parameterized by some lower-dimensional vector $\theta$ (denoted $\pi_\theta$) which approximates the full policy (this function approximation is indeed the core of deep RL!)
  • The state is some partial measurement of a higher-dimensional environment (which is a stochastic, nonlinear, dynamical system).

    • For simplicity, we assume that the state evolution is a Markov Decision Process — the next state depends only on the current state and action, not on the full history.

      $$ P(s', s, a) = Pr(s_{k+1} = s' | s_k = s, a_k = a) $$

    • Even with this simplifying assumption, it can be hard to model the state evolution — if one cannot, model-free RL strategies can be used. If there is sufficient data to learn the MDP, then model-based RL strategies can be used.

  • The reward is another partial measurement assumed to be Markovian in nature — one only needs the current and the previous state to determine the current reward.

    $$ R(s', s, a) = Pr(r_{k+1}|s_{k+1} = s', s_k = s, a_k = a) $$

  • The value function is the desirability of being in a given state, given policy $\pi$. It is modeled as the expected reward over some time steps $k$, subject to some discount factor $\gamma$.

    $$ V_\pi(s) = E(\sum_{k} \gamma^kr_k | s_0 = s) $$

    • Note that the value function can be written recursively, i.e.

      $$ V(s) = \max_\pi E(r_0 + \gamma V(s')) $$

      • Here, $s'$ is the next state reached under the best policy, and $V(s')$ is the optimal value of that next state.
      • This is Bellman's equation, based on Bellman's principle of optimality. A simple way of understanding it: assuming $V(s')$ is optimal, the optimal $V(s)$ can be computed from it.
    • So $V$ can be thought of as the value of a state, assuming you follow the best possible policy $\pi$ from then on.

    • Given a well-calibrated value function, we can extract the optimal policy as:

      $$ \pi = \argmax_\pi E(r_0 + \gamma V(s')) $$

Goals and Challenges in RL

  • RL is the process of learning the policy, learning the value function, or jointly learning both.
  • RL often requires evaluating a large number of trials to determine an optimal policy. Thus, it is expensive to train, and not suitable for domains where testing a policy is expensive or potentially unsafe.
    • Use RL when 1/ evaluating a policy is inexpensive, 2/ there are sufficient resources to perform a near brute-force optimization, and 3/ no other control strategy works.
  • Many of the theoretical convergence results — and fundamental RL algorithms — only apply to finite MDPs. Many continuous dynamical systems can be approximated as a finite MDP through quantization or discretization.
    • These might succumb to the curse of dimensionality: as the number of features or dimensions grows (as needed in complex domains), the amount of data needed to generalize accurately grows exponentially, making learning increasingly challenging.
  • Supervision can be quite weak in RL. A core challenge is that rewards are often rare and significantly delayed.
    • This is the credit assignment problem — what action sequence was actually responsible for the reward ultimately received?

Taxonomy of RL techniques

RL Taxonomy

  • The biggest divide between techniques is when there is a known model for the environment and when there isn’t.
    • When there is a known environment model, dynamic programming-like techniques for policy and value iteration can be used.
    • When there is no model for the environment, alternate strategies like Q-learning must be employed.

Something that confused me: model-based and model-free are not talking about the policy model. It is not even talking about a model in the neural network sense. It simply distinguishes whether or not there is a known way to model the changes in the environment.

Model-Based RL

The simplest instance of this form of RL algorithm is dynamic programming — and most of this domain of RL relies on the Bellman equation: that is to say, we can optimize smaller sub-problems on our path to optimizing the bigger core problem that we care about.

Value Iteration

  • Use Bellman's optimality conditions to iteratively optimize the value function using a known model.

    $$ V(s) = \max_a \sum_{s'}P(s'|s, a) \underbrace{(R(s', s, a) + \gamma V(s'))}_{\text{Future total rewards}} $$

  • Here are the steps (a code sketch follows this list):

    • Create a table holding the value estimate for every state; these values can be initialized randomly. Assume that the environment model $P(s'|s, a)$ is known, and treat the current table entries as if they were the true $V(s')$.
    • Using the table (which may be a poor estimate), compute the expected future reward for each action and pick the best one. Then update $V(s)$ with that maximum expected future reward.
    • Even if the value function is initialized badly, each sweep of updates improves it, so the value function keeps getting better over time.
  • This only works, of course, if $P(s'|s,a)$ and $R(s', s, a)$ are well-calibrated. But given that we know the environment model, this assumption is sound.

  • This also only works if you can list out and initialize all possible states.
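
Here is a minimal sketch of value iteration on a tabular MDP, assuming the transition and reward arrays `P[s, a, s2]` and `R[s, a, s2]` are known; the tiny random MDP, the array names, and the tolerance are illustrative assumptions, not from the book:

```python
import numpy as np

# Hypothetical finite MDP: nS states, nA actions.
# P[s, a, s2] = Pr(s2 | s, a), R[s, a, s2] = reward for that transition.
nS, nA, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)      # normalize into valid probabilities
R = rng.random((nS, nA, nS))

V = np.zeros(nS)                       # value table, arbitrary initialization
for _ in range(500):
    # Q[s, a] = sum_{s2} P(s2|s, a) * (R(s2, s, a) + gamma * V(s2))
    Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
    V_new = Q.max(axis=1)              # Bellman optimality update
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

pi = Q.argmax(axis=1)                  # greedy policy extracted from V
```

Note that the full `P` and `R` tables only exist here because the environment model is assumed known, which is exactly what the model-free methods below avoid.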

Policy Iteration

  • Two-step optimization procedure to simultaneously find an optimal value function $V_\pi$ and the corresponding optimal policy $\pi$.

  • First, we lock in the assumed best policy $\pi$. This is similar to value iteration, but instead of considering every action $a$, we trust the policy to choose the action:

    $$ V_\pi(s) = \sum_{s'} P(s'|s, a=\pi(s))(R(s', s, \pi(s)) + \gamma V_\pi(s')) $$

  • Then, we lock in the value function, sweep through the actions, and update the policy such that:

    $$ \pi(s) = \argmax_{a} \underbrace{E(R(s', s, a) + \gamma V_\pi(s'))}_{\sum_{s'} P(s'|s, a)(R(s', s, a) + \gamma V_\pi(s'))} $$

    • For every state, find the action that maximizes the expected future rewards from that state.
  • This often converges faster than value iteration.
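
A corresponding sketch of policy iteration under the same assumed tabular `P` and `R` arrays, alternating the two locked-in steps described above:

```python
import numpy as np

# Same hypothetical tabular MDP setup as the value-iteration sketch above.
nS, nA, gamma = 5, 2, 0.9
rng = np.random.default_rng(1)
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA, nS))

V = np.zeros(nS)
pi = np.zeros(nS, dtype=int)           # start from an arbitrary policy
for _ in range(100):
    # 1) Policy evaluation (policy locked in): repeatedly apply
    #    V_pi(s) = sum_{s2} P(s2|s, pi(s)) * (R(s2, s, pi(s)) + gamma * V_pi(s2))
    for _ in range(200):
        V = np.einsum('ik,ik->i',
                      P[np.arange(nS), pi],
                      R[np.arange(nS), pi] + gamma * V[None, :])
    # 2) Policy improvement (value function locked in): greedy action w.r.t. V.
    Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
    pi_new = Q.argmax(axis=1)
    if np.array_equal(pi_new, pi):     # policy is stable: done
        break
    pi = pi_new
```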

Quality Function

  • In both iteration algorithms, we depend on the expectation of all future rewards — this is measured by the “quality function” (or Q-function):

    $$ Q(s, a) = E(R(s', s, a) + \gamma V(s')) = \sum_{s'} P(s'|s, a)(R(s', s, a) + \gamma V(s')) $$

  • In some sense, this is just renaming something we’ve already used.

  • But there is something critical to realize here — if we were to treat the quality function as a black box, we suddenly do not need to assume that we know what $P(s'|s, a)$ or $R(s', s, a)$ is. They can be learned implicitly through trial and error. This motivates model-free RL — by trial and error, optimizing the Q function directly can yield both the policy and the value function.

    $$ V(s) = \max_{a} Q(s, a), \qquad \pi(s) = \argmax_a Q(s, a) $$

Model-Free RL

If we don’t have access to the environmental model, then we must rely on trial and error to learn the reward structure of the environment.

Monte Carlo Learning

  • This is considered the simplest approach to Model-Free RL. This is an episodic learning algorithm — it needs to run through many states before accumulating a reward. Games are a good example of a domain suited for this.
  • In Monte Carlo learning, the total cumulative reward at the end of the task is used to estimate either the value function V or the quality function Q by dividing the final reward equally among state-action pairs.
  • This can be quite sample inefficient, especially for problems with sparse rewards.
  • This can be expressed as the following for any given state-action pair (s, a), where Q is the old function and Q’ is the new function:

$$ Q'(s, a) = Q(s, a) + \frac{1}{n}(R_{\Sigma} - Q(s, a)) $$

$$ V'(s) = V(s) + \frac{1}{n}(R_\Sigma - V(s)) $$

  • It is possible to use a constant learning rate $\alpha$ instead of $1/n$; this acts like a forgetting factor, weighting recent episodes more heavily.
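
A minimal sketch of the tabular Monte Carlo update, assuming `episode` is a list of the (state, action) pairs visited and `R_total` is the cumulative reward observed at the end of that episode (these names and the dictionary-based tables are illustrative):

```python
from collections import defaultdict

Q = defaultdict(float)   # Q(s, a) estimates
N = defaultdict(int)     # visit counts n for each (s, a)

def monte_carlo_update(episode, R_total):
    for (s, a) in episode:
        N[(s, a)] += 1
        # Running average: Q'(s, a) = Q(s, a) + (1/n) * (R_total - Q(s, a)).
        Q[(s, a)] += (R_total - Q[(s, a)]) / N[(s, a)]
        # A constant learning rate alpha could replace 1/n here.
```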

Temporal-Difference (TD) Learning

  • While Monte Carlo learning assumes that the decisive action could have occurred at any point in the past series of actions, TD learning assumes that there is a limited time window in which the relevant action would have happened.

  • Instead of being episodic like Monte Carlo learning, TD learning bootstraps using current estimates of the quality function. This makes it more sample efficient.

  • TD(0)

    $$ V’(s) = V(s) + \alpha \overbrace{(\underbrace{r(s) + \gamma V(s’)}_{\text{TD target }R_\Sigma} - V(s))}^\text{TD error} $$

    • This is similar to Monte Carlo learning but more step-wise / dynamic-programming flavored
  • TD(n)

    $$ V'(s_k) = V(s_k) + \alpha (\underbrace{\sum_{j=0}^n \gamma^j r_{k+j} + \gamma^{n + 1}V(s_{k+n+1})}_{\text{estimate}} - V(s_k)) $$

  • Monte Carlo estimates typically have high variance but no bias, while TD learning has lower variance but introduces a bias because of the bootstrapping.
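
A minimal sketch of the TD(0) value update, assuming a tabular value function and that the caller supplies each observed transition `(s, r, s_next)`; the constants are illustrative:

```python
from collections import defaultdict

V = defaultdict(float)    # tabular value estimates
alpha, gamma = 0.1, 0.99  # learning rate and discount factor (illustrative values)

def td0_update(s, r, s_next):
    # TD error: (r + gamma * V(s')) - V(s); bootstraps on the current estimate of V(s').
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error
```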

SARSA Learning

  • SARSA is a variation of TD learning that uses Q-functions instead of value functions.

  • The SARSA(0) version remains effectively the same:

    $$ Q’(s_k, a_k) = Q(s_k, a_k) + \alpha (r_k + \gamma Q(s_{k + 1}, a_{k+1}) - Q(s_k, a_k)) $$

  • For SARSA(n), we assume that the n-step reward can be written as follows:

    $$ R_\Sigma^{(n)} = \sum_{j=0}^{n} \gamma^j r_{k + j} + \gamma^{n+1}Q(s_{k + n + 1}, a_{k + n+1}) $$

    • Then

      $$ Q’(s_k, a_k) = Q(s_k, a_k) + \alpha(R_\Sigma^{(n)} - Q(s_k, a_k)) $$

  • SARSA is often safer and has better total reward while learning.
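
A sketch of the SARSA(0) update under the same tabular assumptions as above; note that the update bootstraps with the action `a_next` that the current policy actually takes in `s_next`, which is what makes it on-policy:

```python
from collections import defaultdict

Q = defaultdict(float)
alpha, gamma = 0.1, 0.99

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy target: bootstrap with the action actually taken next.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```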

Q-Learning

  • Q-learning can be thought of as an off-policy TD learning scheme for the Q function. The update equation is as follows:

    $$ Q’(s_k, a_k) = Q(s_k, a_k) + \alpha(\underbrace{r_k + \gamma \max_a Q(s_{k+1}, a)}_{\text{reward appx}} - Q(s_k, a_k)) $$

    • It is called an off-policy algorithm because the next Q-value in the update is chosen by maximizing over all possible next actions, rather than by following the current policy (which would make it “on-policy”)
  • Q-learning may take sub-optimal actions to explore, while still using the optimal action to update the Q function.

  • Q-learning often converges faster than SARSA, but has higher variance in the solution.

  • In SARSA, using actions that are sub-optimal (based on current policy) will degrade the Q function since it will have a flawed estimate of future rewards based on a sub-optimal action. However, in Q-learning, since the action is optimized over the current Q function in the update, it is possible to learn from experience resulting from sub-optimal actions.

  • To introduce a bit of random exploration in our Q-learning, we sometimes use $\epsilon$-greedy actions, where the agent takes the current optimal action with probability $1 - \epsilon$ and a random action with probability $\epsilon$.

  • Note that this off-policy behavior unlocks two important data features:

    • Experience replay: Being able to replay actions from the past, even if the policy function has been updated since then.
    • Imitation learning: Being able to learn from someone else's gameplay by following their actions, even if our Q-function is currently not very well-calibrated.
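
A sketch of tabular Q-learning with $\epsilon$-greedy exploration; the discrete `actions` list and the constants are illustrative assumptions:

```python
import random
from collections import defaultdict

Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = [0, 1, 2]      # hypothetical discrete action set

def epsilon_greedy(s):
    if random.random() < epsilon:
        return random.choice(actions)             # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit

def q_learning_update(s, a, r, s_next):
    # Off-policy target: max over all actions, regardless of the action taken next.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Comparing this with the SARSA sketch above, the only change is the max in the target, which is exactly the off-policy distinction.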

Policy Gradients

  • It is possible to use gradient optimization methods to optimize a parameterized policy and learn much faster than traditional iteration.

  • Here, instead of extracting the policy as the argument maximizing the value or quality functions, we want to directly optimize the parameters $\theta$.

  • Given a policy $\pi_\theta$, the cumulative reward is weighted by the distribution $\mu_\theta$ (the asymptotic probability of finding yourself in each state under the policy over a long time). So, the reward is a weighted sum over the states and the actions:

    $$ R_\Sigma = \sum_{s} \mu_\theta(s) \sum_{a} \pi_\theta(s, a)Q(s, a) $$

  • Then, the policy update rule looks like:

    $$ \theta_{new} = \theta_{old} + \alpha \nabla_{\theta} R_{\Sigma, \theta} = \theta_{old} + \alpha \mathbb{E}(Q(s, a)\nabla_{\theta} \log \pi_\theta(s, a)) $$
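
A sketch of this update for a tabular softmax policy, where `theta[s, a]` parameterizes $\pi_\theta$ and `Q_sa` is an estimate of $Q(s, a)$ supplied by the caller; the tabular softmax form is an assumption for illustration, since $\theta$ can be any parameterization:

```python
import numpy as np

nS, nA, alpha = 5, 2, 0.01
theta = np.zeros((nS, nA))             # policy parameters

def pi_theta(s):
    # Softmax policy over actions in state s.
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def policy_gradient_update(s, a, Q_sa):
    # For a softmax policy, grad_theta log pi(s, a) = one_hot(a) - pi(s, .).
    grad_log = -pi_theta(s)
    grad_log[a] += 1.0
    # theta_new = theta_old + alpha * Q(s, a) * grad_theta log pi_theta(s, a)
    theta[s] += alpha * Q_sa * grad_log
```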

Deep RL

Deep Q-Learning

  • Learning a quality function with a neural network — $Q(s, a, \theta)$ — where $\theta$ is lower dimensional than the entire search space, which helps avoid the curse of dimensionality. The loss for the neural network looks like:

    $$ L = \mathbb{E}[(r_k + \gamma \max_a Q(s_{k+1}, a, \theta) - Q(s_k, a_k, \theta))^2] $$
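
A sketch of this loss in PyTorch; the small fully connected network and the use of a gradient-stopped target are illustrative choices (in practice a separate, slowly updated target network is often used), not something prescribed above:

```python
import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))

def dqn_loss(s, a, r, s_next):
    # Q(s_k, a_k, theta) for the actions actually taken in the batch.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                              # target treated as fixed
        target = r + gamma * q_net(s_next).max(dim=1).values
    return ((target - q_sa) ** 2).mean()               # mean squared TD error
```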

Advantage Networks

  • In Deep Dueling Q Networks (DDQN), we take the Q-function and split it into two parts:

    $$ Q(s, a, \theta) = V(s, \theta_1) + A(s, a, \theta_2) $$

    • Where $A(\cdot)$ measures the advantage that action $a$ provides on top of the value of state $s$. This architecture is most helpful when the differences between the available actions are quite subtle. So now you can train two separate networks.
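
A sketch of this split in PyTorch; subtracting the mean advantage when recombining is a common stabilization trick and an assumption made here, not part of the notes above:

```python
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, n_states, n_actions, hidden=64):
        super().__init__()
        # V(s, theta_1): scalar value of the state.
        self.value_head = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 1))
        # A(s, a, theta_2): advantage of each action in that state.
        self.adv_head = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU(),
                                      nn.Linear(hidden, n_actions))

    def forward(self, s):
        v, adv = self.value_head(s), self.adv_head(s)
        # Q(s, a, theta) = V(s, theta_1) + A(s, a, theta_2), centered for stability.
        return v + adv - adv.mean(dim=1, keepdim=True)
```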

Actor-Critic Networks

  • We have two learners: the actor is policy-based, and the critic is value-based. Each attempts to measure and improve the other.

    $$ \theta_{k+1} = \theta_k + \alpha(r_k + \gamma V(s_{k+1}) - V(s_k)) $$

    • This is still the temporal difference error, but the value information helps update the policy instead.
  • In deep learning, there are Advantage Actor-Critic networks (A2C). Similarly:

    $$ \theta_{k+1} = \theta_k + \alpha \nabla (\log\pi(s_k, a_k, \theta) Q(s_k, a_k, \theta_2)) $$

    • Q-learning is purely value-based; here, you jointly optimize the value-based information and update the policy function.
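
A sketch of an actor-critic update step in PyTorch, where the critic's TD error scales the actor's policy-gradient step; the network sizes, optimizers, and batch interface are illustrative assumptions:

```python
import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_update(s, a, r, s_next):
    # Critic: temporal-difference error on the value estimate.
    v_s = critic(s).squeeze(-1)
    with torch.no_grad():
        td_target = r + gamma * critic(s_next).squeeze(-1)
    td_error = td_target - v_s
    critic_loss = (td_error ** 2).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor: policy gradient weighted by the (detached) TD error from the critic.
    log_pi = torch.log_softmax(actor(s), dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
    actor_loss = -(td_error.detach() * log_pi).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```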

Other Techniques

Reward Shaping

  • Designing customized proxy features that are indicative of a future reward, which can be used as an intermediate reward signal.
  • Requires expert human guidance to design; limits the upper end of the agent’s performance to that of the human expert.

Hindsight Experience Replay

  • Enriches the reward signal by taking failed trials and pretending they were successful at some other task.

Curiosity Driven Exploration

  • Problem: The agent may get stuck in a local minimum, where it over-optimizes for a small region of state space.
  • One approach to this problem is to augment the reward signal with a novelty reward.
  • The difference between TD learning and curiosity rewards is that in TD learning the reward discrepancy is used as feedback to improve the Q-function, whereas in curiosity-driven exploration the discrepancy is explicitly used as an additional reward signal.