# Chapter 5 Monte Carlo Methods

Monte Carlo methods require only experience - sample sequences of states, actions, and rewards. They do not require complete knowledge of the environment's dynamics.

## 5.1 Monte Carlo Prediction

This section is about learning the state-value function for a given policy, i.e., estimating $v_\pi(s)$.

First-visit MC Method: averages only the returns that follow first visits to $s$.

Every-visit MC Method: averages the returns that follow all visits to $s$.
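As a sketch, first-visit MC prediction could look like the following. The `sample_episode` interface (returning one episode as a list of `(state, reward)` pairs generated under $\pi$) is an assumption for illustration, not something fixed by the book.

```python
from collections import defaultdict

def first_visit_mc_prediction(sample_episode, num_episodes, gamma=1.0):
    """Estimate v_pi(s) by averaging returns following first visits to s.

    sample_episode() is a hypothetical interface: it returns one episode as a
    list of (S_t, R_{t+1}) pairs generated by following the policy pi.
    """
    returns = defaultdict(list)  # state -> list of observed returns
    for _ in range(num_episodes):
        episode = sample_episode()
        # Record the time step of the first visit to each state
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t
        # Walk backwards through the episode, accumulating the return G
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:  # only average returns from first visits
                returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

For the every-visit variant, drop the `first_visit` check and append `G` at every time step instead.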

Monte Carlo methods' advantages over Dynamic Programming:

• Can learn from actual experience, with no model of the environment
• Can learn from simulated experience
• Can estimate the value of a single state (or a small subset of states) without evaluating all the others, which can be computationally cheaper

## 5.2 Monte Carlo Estimation of Action Values

To solve the problem of maintaining exploration, we make the assumption of exploring starts: every episode starts in a state-action pair, and every state-action pair has a nonzero probability of being selected as the start.

Note: the reason we estimate $q(s,a)$ in Monte Carlo methods, instead of $v(s)$ as in Dynamic Programming, is this: $v(s)$ is cheaper to estimate (there are fewer states than state-action pairs), but using it for control requires a one-step lookahead through the dynamics $p(s',r|s,a)$. Without a model, state values alone cannot tell us which action to take, so we must estimate $q(s,a)$ directly.

## 5.3 Monte Carlo Control

Generalized policy iteration (GPI): a loop of policy evaluation and policy improvement.

2 approaches to the evaluation step:

1. Run policy evaluation to completion before each improvement step.
2. Truncate policy evaluation before convergence (one extreme case is value iteration, which uses a single sweep).

Monte Carlo ES (Exploring Starts): An algorithm for estimating $\pi\approx\pi_\ast$. See page 99 of the book.

## 5.4 Monte Carlo Control without Exploring Starts

On-policy methods and off-policy methods:

Background: exploring starts are often an unrealistic assumption. We therefore need some other way to ensure that all actions are selected infinitely often. There are 2 families of methods that do this: on-policy methods and off-policy methods.

On-policy methods: attempt to evaluate or improve the policy that is used to make decisions.

Off-policy methods: evaluate or improve a policy different from that used to generate the data. (Introduced in the next section).

Monte Carlo ES is an on-policy method.

On-policy first-visit MC control (for $\varepsilon$-soft policies): An algorithm for estimating $\pi\approx\pi_\ast$. See page 101 of the book.
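The $\varepsilon$-soft policy at the heart of that algorithm can be sketched as below. Treating `Q` as a dict mapping `(state, action)` pairs to value estimates is an assumed interface for illustration.

```python
def epsilon_soft_probs(Q, state, actions, epsilon=0.1):
    """Action probabilities for an epsilon-greedy (hence epsilon-soft) policy:
    every action gets probability at least epsilon / |A(s)|, and the greedy
    action gets the remaining 1 - epsilon on top of that.

    Q is assumed to be a dict mapping (state, action) -> estimated value.
    """
    greedy = max(actions, key=lambda a: Q.get((state, a), 0.0))
    base = epsilon / len(actions)
    return {a: base + (1.0 - epsilon if a == greedy else 0.0) for a in actions}
```

Because every action keeps probability at least $\varepsilon/|\mathcal{A}(s)|$, all actions continue to be selected, which is what removes the need for exploring starts.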

## 5.5 Off-policy Prediction via Importance Sampling

Idea: resolve the dilemma between behaving optimally and maintaining exploration by using off-policy learning: learn about one policy from data generated by a different, exploratory policy.

Target policy: the policy being learned about.

Behavior policy: the policy used to generate behavior.

On-policy:

• Simpler

Off-policy:

• Require additional concepts and notation
• Have greater variance and converge more slowly
• More powerful and general
• Have additional use in applications

On-policy methods can be viewed as a special case of off-policy methods in which the target policy and the behavior policy are identical.

Prediction problem: the target policy and the behavior policy are both fixed and given.

Assumption of coverage: every action taken under $\pi$ is also taken, at least occasionally, under the behavior policy $b$. That is, $\pi(a|s)>0 \Rightarrow b(a|s)>0$.

Importance sampling: estimating expected values under one distribution given samples from another.

Importance-sampling ratio:

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1}b(A_k|S_k)p(S_{k+1}|S_k,A_k)}=\prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$

Note: the transition probabilities cancel, so the ratio depends only on the two policies and the observed states and actions, not on the (possibly unknown) MDP dynamics.
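A minimal sketch of computing this ratio for a whole episode; representing `pi` and `b` as dicts mapping `(state, action)` pairs to probabilities is an assumed interface.

```python
def importance_ratio(episode, pi, b):
    """rho_{0:T-1}: product over all steps of pi(A_k|S_k) / b(A_k|S_k).

    episode is a list of (state, action) pairs; pi and b map
    (state, action) -> probability (an assumed representation).
    Coverage must hold: b[(s, a)] > 0 whenever pi[(s, a)] > 0.
    """
    rho = 1.0
    for s, a in episode:
        rho *= pi[(s, a)] / b[(s, a)]
    return rho
```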

Connect $v_\pi(s)$ with $\pi$ and $b$: $$\mathbb{E}[\rho_{t:T-1}G_t|S_t=s] = v_\pi(s)$$

Why: the returns $G_t$ are generated by following $b$, so on their own $\mathbb{E}[G_t|S_t=s]=v_b(s)$. Multiplying each return by $\rho_{t:T-1}$, the likelihood ratio of the trajectory under the two policies, reweights the samples so that their expectation becomes the expectation under $\pi$, which is $v_\pi(s)$.

Estimate $v_\pi(s)$ using ordinary importance sampling:

First visit: unbiased, but with possibly unbounded variance

Every visit: biased

$$V(s) \doteq \frac{ \sum_{t\in\mathcal{J}(s)} \rho_{t:T-1}G_t }{ |\mathcal{J}(s)| }$$

Estimate $v_\pi(s)$ using weighted importance sampling:

First visit: biased (though the bias converges to zero as more returns are observed), with bounded variance

Every visit: biased

$$V(s) \doteq \frac{ \sum_{t\in\mathcal{J}(s)} \rho_{t:T-1}G_t }{ \sum_{t\in\mathcal{J}(s)} \rho_{t:T-1} }$$
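The two estimators differ only in the denominator. A sketch, assuming each sample is a precomputed `(rho, G)` pair for one visit to $s$:

```python
def ordinary_is(samples):
    """Ordinary importance sampling: average of rho * G over all samples."""
    return sum(rho * G for rho, G in samples) / len(samples)

def weighted_is(samples):
    """Weighted importance sampling: sum of rho * G divided by sum of rho."""
    num = sum(rho * G for rho, G in samples)
    den = sum(rho for rho, _ in samples)
    return num / den if den != 0 else 0.0
```

With a single sample, `weighted_is` returns `G` itself regardless of `rho` (which is where its bias comes from), while `ordinary_is` returns `rho * G`, which can be arbitrarily large when `rho` is.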