Chapter 5 Monte Carlo Methods

Monte Carlo methods require only experience - sample sequence of states, actions, and rewards, while we do not have complete knowledge of the environment.

5.1 Monte Carlo Prediction

This section is about learning the state-value function for a given policy. In other words, estimate $v_\pi(s)$

First-visit MC Method: only averages the first visits to $s$.

Every-visit MC Method: averages the returns of all visits to $s$.

Monte Carlo Method's advantage over Dynamic Programming:

  • Learn from actural experience
  • Learn from simulated experience
  • Computation-friendly

5.2 Monte Carlo Estimation of Action Values

To solve the problem of maintaining exploration, we have the assumption of exploring starts - we assume that the episodes start in a state-action pair.

Note: the reason why we use $q(s,a)$ in Monte Carlo Methods instead of using $v(s)$ like we did in Dynamic Programming is that, using $v(s)$ is more computational-friendly than using $q(s,a)$. However, using $v(s)$ requires us to know the environment well, in other words, $p(s',r|s,a)$.

5.3 Monte Carlo Control

General policy iteration: a loop of policy evaluation and policy improvement.

2 approaches:

  1. Complete the whole policy evaluation
  2. Give up using the whole policy evaluation (one extreme case for this is value iteration).

Monte Carlo ES (Exploring Starts): An algorithm for estimating $\pi\approx\pi_\ast$. See page 99 of the book.

5.4 Monte Carlo Control without Exporing Starts

On-policy methods and off-policy methods:

Background: it is often unlikely to have exploring starts. As a result, we need to ensure all actions are selected infinitely often. There are 2 methods to ensure this: on-policy methods and off-policy methods.

On-policy methods: attempt to evaluate or improve the policy that is used to make decisions.

Off-policy methods: evaluate or improve a policy different from that used to generate the data. (Introduced in the next section).

Monte Carlo ES is one kind of on-policy method.

On-policy first-visit MC control (for $\varepsilon$-soft policies): An algorithm for estimating $\pi\approx\pi_\ast$. See page 101 of the book.

5.5 Off-policy Prediction via Importance Sampling

Idea: solve the dilemma of optimal behavior and exploration, using off-policy learning.

Target policy: the policy being learned about.

Behavior policy: the policy used to generate behavior.


  • Simpler


  • Require additional concepts and notations
  • Of greater variance and are slower to converge
  • More powerful and general
  • Have additional use in applications

One special case of off-policy methods is on-policy methods - where the target policy and the behavior policy are the same.

Prediction problem: in which the target and behavior are fixed.

Assumption of coverage: every action taken under $\pi$, it is also taken under $b$, the behavior policy. That is, $\pi(a|s)>0 \Rightarrow b(a|s)>0$.

Importance sampling: estimating expected values under one distribution given samples from another.

Importance-sampling ratio:

$$ \rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1}b(A_k|S_k)p(S_{k+1}|S_k,A_k)}=\prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)} $$ Note: the transition probabilities are cancelled in the equation above, thus not needed anymore.

Connect $v_\pi(s)$ with $\pi$ and $b$: $$ \mathbb{E}[\rho_{t:T-1}G_t|S_t=s] = v_\pi(s) $$


Estimate $v_\pi(s)$ using ordinary importance sampling:

First visit: unbiased, unbounded variance

Every visit: biased

$$ V(s) \doteq \frac{ \sum_{t\in\mathcal{J}(s)} \rho_{t:T-1}G_t }{ |\mathcal{J}(s)| } $$ Estimate $v_\pi(s)$ using weighted importance sampling:

First visit: biased, bounded variance

Every visit: biased

$$ V(s) \doteq \frac{ \sum_{t\in\mathcal{J}(s)} \rho_{t:T-1}G_t }{ \rho_{t:T-1}G_t } $$