Chapter 6 Temporal-Difference Learning

TD learning is a combination of Monte Carlo (MC) ideas and dynamic programming (DP) ideas: like MC, it learns directly from raw experience without a model of the environment; like DP, it updates estimates based in part on other learned estimates (bootstrapping).

6.1 TD Prediction

Comparison between Monte Carlo and Temporal-Difference

  • Monte Carlo (every visit, constant-$\alpha$ MC)

$$ V(S_t) \leftarrow V(S_t) + \alpha [G_t-V(S_t)] $$

  • TD method (TD(0), one-step TD)

$$ V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] $$

TD(0) is a bootstrapping method: it updates $V(S_t)$ in part on the basis of an existing estimate, $V(S_{t+1})$.
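
A minimal tabular sketch of the TD(0) update, assuming a Gymnasium-style `reset()`/`step()` interface and a fixed `policy(s)` callable (all names here are illustrative, not from the text):

```python
import numpy as np

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(S_t) += alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]."""
    V = np.zeros(env.observation_space.n)
    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Bootstrap on the current estimate of the next state (0 if terminal).
            target = r + gamma * V[s_next] * (not terminated)
            V[s] += alpha * (target - V[s])   # alpha * TD error
            s = s_next
    return V
```

Constant-$\alpha$ MC would instead wait until the episode ends and use the full return $G_t$ as the target.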

The differences among the three methods for policy evaluation:

  • DP: $v_\pi(s) = \mathbb{E}_\pi [R_{t+1} + \gamma v_\pi(S_{t+1})|S_t=s]$
  • MC: $v_\pi(s) = \mathbb{E}_\pi [G_t|S_t=s]$
  • TD: Combines the sampling of MC with the bootstrapping of DP

TD Error

$$ \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) $$

Note that $\delta_t$ is the error in the estimate made at time $t$, but it is not available until time $t+1$, because it depends on $R_{t+1}$ and $S_{t+1}$.
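
If $V$ is held fixed during the episode (as it is in Monte Carlo methods), the Monte Carlo error can be written as a sum of TD errors:

$$ G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k $$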

6.2 Advantages of TD Prediction Methods

  • No need for a model of the environment (its reward and next-state probability distributions), unlike DP
  • Online and fully incremental: they learn after every step rather than waiting until the end of an episode, unlike MC
  • Some MC methods must ignore or discount episodes on which experimental actions are taken, which can greatly slow learning; TD methods are much less affected by this

6.3 Optimality of TD(0)

Batch updating: increments are computed for every time step in the training data, but the value function is changed only once, after processing the complete batch; the batch is processed repeatedly until the value function converges.

Batch Monte Carlo: finds the estimates that minimize the mean-squared error on the training set.

Batch TD(0): finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process. The maximum-likelihood estimate of a parameter is the parameter value whose probability of generating the data is greatest.

Certainty-equivalence estimate: the estimate obtained by acting as if the estimated model of the underlying process were known with certainty rather than being approximated. Batch TD(0) converges to the certainty-equivalence estimate.

This helps explain why, in batch form, TD(0) converges faster than Monte Carlo methods: it computes the true certainty-equivalence estimate.
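
A sketch of batch TD(0), assuming the training data is a fixed list of episodes of `(s, r, s_next, terminated)` transitions (the data layout is illustrative):

```python
import numpy as np

def batch_td0(episodes, n_states, alpha=0.01, gamma=1.0, tol=1e-6):
    """Batch TD(0): sweep the whole training set, accumulate the increments,
    apply them once per sweep, and repeat until the value function stops changing."""
    V = np.zeros(n_states)
    while True:
        increment = np.zeros(n_states)
        for episode in episodes:
            for s, r, s_next, terminated in episode:
                target = r + gamma * V[s_next] * (not terminated)
                increment[s] += alpha * (target - V[s])
        V += increment
        # For a sufficiently small alpha, the increments shrink toward zero.
        if np.abs(increment).max() < tol:
            return V
```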

6.4 Sarsa: On-policy TD Control

Sarsa updates the action-value function after every transition from a nonterminal state $S_t$ (if $S_{t+1}$ is terminal, $Q(S_{t+1}, A_{t+1})$ is defined as zero): $$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)] $$

This algorithm converges with probability 1 to an optimal policy and action-value function as long as all state-action pairs are visited infinitely often and the policy converges in the limit to the greedy policy (e.g., an $\varepsilon$-greedy policy with $\varepsilon = 1/t$), together with the usual stochastic-approximation conditions on the step size $\alpha$.
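
A tabular Sarsa sketch with an $\varepsilon$-greedy behavior policy, again assuming a Gymnasium-style environment (names are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa(env, n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """On-policy TD control: the target uses the action actually taken next, Q(S_{t+1}, A_{t+1})."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s, _ = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            target = r + gamma * Q[s_next, a_next] * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```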

6.5 Q-learning: Off-policy TD Control

Q-learning updates the action-value function using the greedy action at $S_{t+1}$, so the learned $Q$ directly approximates $q_\ast$ regardless of the (exploratory) policy being followed: $$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)] $$
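
The corresponding Q-learning sketch differs only in the target, which bootstraps on $\max_a Q(S_{t+1}, a)$; it reuses `epsilon_greedy` from the Sarsa sketch above (same illustrative assumptions):

```python
import numpy as np

def q_learning(env, n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Off-policy TD control: the target bootstraps on max_a Q(S_{t+1}, a)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, epsilon, rng)   # behavior policy: epsilon-greedy
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            target = r + gamma * Q[s_next].max() * (not terminated)  # greedy (target-policy) value
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```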

6.7 Maximization Bias and Double Learning

Maximization bias: because a maximum over estimated action values is used implicitly as an estimate of the maximum true value, the resulting estimate has a positive bias, i.e., it tends to be larger than the true value.
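
A quick numerical illustration of the bias (the numbers are illustrative): every action's true value is 0, yet the maximum over noisy estimates is positive on average:

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_actions, n_samples = 100_000, 10, 5

# Every action's true value is 0; each Q estimate is the mean of a few noisy samples.
estimates = rng.normal(0.0, 1.0, size=(n_runs, n_actions, n_samples)).mean(axis=2)

print(estimates.mean())              # ~0.0: each individual estimate is unbiased
print(estimates.max(axis=1).mean())  # ~+0.7: the max over the estimates is biased upward
```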

Double learning: the basic idea is to use one estimate, $Q_1$, to determine the maximizing action, and another estimate, $Q_2$, to evaluate its value.

The two estimates take turns playing these two roles:

$$ A^\ast = \arg\max_a Q_1(a) $$ $$ Q_2(A^\ast) = Q_2(\arg\max_a Q_1(a)) $$

This estimate is unbiased in the sense that $ \mathbb{E}[Q_2(A^\ast)] = q(A^\ast) $, because $Q_2$ played no part in selecting $A^\ast$.

The double Q-learning update for $Q_1$ (with probability 0.5 the roles of $Q_1$ and $Q_2$ are swapped, so that $Q_2$ is updated instead):

$$ Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha[ R_{t+1} + \gamma Q_2 (S_{t+1}, \arg\max_a Q_1 (S_{t+1},a)) - Q_1(S_t,A_t) ] $$
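
A sketch of tabular double Q-learning: on each step one table is chosen at random to be updated, using the other table to evaluate its greedy action; `epsilon_greedy` is the helper from the Sarsa sketch (same illustrative assumptions):

```python
import numpy as np

def double_q_learning(env, n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Two independent tables; behave epsilon-greedily with respect to their sum."""
    rng = np.random.default_rng(seed)
    n_s, n_a = env.observation_space.n, env.action_space.n
    Q1, Q2 = np.zeros((n_s, n_a)), np.zeros((n_s, n_a))
    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q1 + Q2, s, epsilon, rng)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            if rng.random() < 0.5:
                # Update Q1: select the action with Q1, evaluate it with Q2.
                a_star = int(np.argmax(Q1[s_next]))
                target = r + gamma * Q2[s_next, a_star] * (not terminated)
                Q1[s, a] += alpha * (target - Q1[s, a])
            else:
                # Update Q2: select the action with Q2, evaluate it with Q1.
                a_star = int(np.argmax(Q2[s_next]))
                target = r + gamma * Q1[s_next, a_star] * (not terminated)
                Q2[s, a] += alpha * (target - Q2[s, a])
            s = s_next
    return Q1, Q2
```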