# Chapter 6 Temporal-Difference Learning

TD learning combines Monte Carlo ideas with DP ideas: like MC, it learns directly from experience without a model; like DP, it updates estimates based in part on other learned estimates.

## 6.1 TD Prediction

Comparison between Monte Carlo and Temporal-Difference

• Monte Carlo (every visit, constant-$\alpha$ MC)

$$V(S_t) \leftarrow V(S_t) + \alpha [G_t-V(S_t)]$$

• TD method (TD(0), one-step TD)

$$V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$

TD(0) is a bootstrapping method: it updates its estimate based in part on an existing estimate.

How the three approaches estimate $v_\pi$ for policy evaluation:

• DP: $v_\pi(s) = \mathbb{E}_\pi [R_{t+1} + \gamma v_\pi(S_{t+1})|S_t=s]$
• MC: $v_\pi(s) = \mathbb{E}_\pi [G_t|S_t=s]$
• TD: Combines the sampling of MC with the bootstrapping of DP

TD Error

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

Note that $\delta_t$ is not available until time $t+1$, since it depends on $R_{t+1}$ and $V(S_{t+1})$.
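As a concrete illustration, the TD(0) update above can be sketched on a small 5-state random-walk Markov reward process. The environment, episode count, and step size here are illustrative assumptions, not from the text:

```python
import random

# TD(0) prediction sketch: 5 non-terminal states (0..4), start in the
# middle, move left/right with equal probability; exiting to the right
# gives reward +1, exiting to the left gives 0 (hypothetical example).
def td0_random_walk(episodes=1000, alpha=0.1, gamma=1.0, seed=0):
    rng = random.Random(seed)
    V = [0.0] * 5                      # value estimates V(s)
    for _ in range(episodes):
        s = 2                          # start in the middle state
        while True:
            s_next = s + rng.choice([-1, 1])
            if s_next == 5:            # exited right: terminal, reward 1
                r, v_next, done = 1.0, 0.0, True
            elif s_next == -1:         # exited left: terminal, reward 0
                r, v_next, done = 0.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[s_next], False
            # TD(0) update: V(s) += alpha * (R + gamma*V(s') - V(s))
            V[s] += alpha * (r + gamma * v_next - V[s])
            if done:
                break
            s = s_next
    return V
```

For this walk the true values are $1/6, 2/6, \ldots, 5/6$, so the learned estimates should increase from left to right.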

## 6.2 Advantages of TD Prediction Methods

• No model of the environment is needed (no knowledge of reward or next-state probability distributions)
• Online and fully incremental: updates are made after each step, with no need to wait for the end of an episode
• Some MC methods must ignore or discount episodes on which experimental actions are taken, but TD methods do not need to

## 6.3 Optimality of TD(0)

Batch updating: updates are made only after processing each complete batch of training data.

Batch Monte Carlo: finds estimates that minimize MSE on the training set

Batch TD(0) finds the estimate that would be exactly correct for the maximum-likelihood model of the underlying Markov process. The maximum-likelihood estimate of a parameter is the parameter value under which the probability of generating the observed data is greatest.

Certainty-equivalence estimate: the value estimate obtained by assuming that the estimated model of the underlying process is exact rather than an approximation. Batch TD(0) converges to the certainty-equivalence estimate.

In batch form, TD(0) is faster than Monte Carlo methods because it computes the true certainty-equivalence estimate.

## 6.4 Sarsa: On-policy TD Control

Update the action-value function $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$$

Sarsa converges to an optimal policy and action-value function provided that all state-action pairs are visited infinitely often and the policy converges in the limit to the greedy policy (which can be arranged, for example, with $\varepsilon$-greedy policies where $\varepsilon = 1/t$).
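The Sarsa update can be sketched on a hypothetical one-dimensional corridor MDP: states 0..4, actions 0 = left and 1 = right, with reward +1 for reaching state 4 and 0 otherwise. The environment and hyperparameters are illustrative assumptions:

```python
import random

# Hypothetical corridor MDP: moving right from state 3 reaches the
# terminal state 4 with reward +1; all other transitions give reward 0.
def step(s, a):
    s_next = max(0, s + (1 if a == 1 else -1))
    return (s_next, 1.0, True) if s_next == 4 else (s_next, 0.0, False)

def eps_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return rng.randrange(2)
    best = max(Q[s])
    return rng.choice([a for a in range(2) if Q[s][a] == best])

def sarsa(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(5)]
    for _ in range(episodes):
        s = 0
        a = eps_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s_next, r, done = step(s, a)
            a_next = eps_greedy(Q, s_next, eps, rng)
            # on-policy target: uses the action actually selected next
            target = r + (0.0 if done else gamma * Q[s_next][a_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q
```

Because the target uses the action the $\varepsilon$-greedy policy actually takes, Sarsa evaluates the behavior policy itself; this is what makes it on-policy.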

## 6.5 Q-learning: Off-policy TD Control

Update the action-value function $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$$
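A minimal sketch of this update, on the same hypothetical corridor MDP used for Sarsa above (environment and hyperparameters are illustrative assumptions). The only change from Sarsa is the target, which maximizes over next actions regardless of what the behavior policy does:

```python
import random

# Hypothetical corridor MDP: states 0..4, right from state 3 is
# terminal with reward +1, everything else gives reward 0.
def step(s, a):
    s_next = max(0, s + (1 if a == 1 else -1))
    return (s_next, 1.0, True) if s_next == 4 else (s_next, 0.0, False)

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(5)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # behave epsilon-greedily (with random tie-breaking)
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                best = max(Q[s])
                a = rng.choice([x for x in range(2) if Q[s][x] == best])
            s_next, r, done = step(s, a)
            # off-policy target: max over next actions, independent of
            # the action the behavior policy will actually take
            best_next = 0.0 if done else max(Q[s_next])
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s_next
    return Q
```

Because the target uses $\max_a Q(S_{t+1}, a)$, Q-learning directly approximates $q_\ast$ even while following an exploratory behavior policy; this is what makes it off-policy.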

## 6.7 Maximization Bias and Double Learning

Maximization bias: when the same samples are used both to select the maximizing action and to estimate its value, the maximum of the estimates $Q(s,a)$ is biased upward, i.e., larger in expectation than the true maximum value.
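A small numerical demonstration (an illustrative assumption, not an example from the text): suppose every action's true value is 0, and each value is estimated by a noisy sample mean. The maximum of the estimates is then positive in expectation, even though the true maximum is 0.

```python
import random

# Maximization bias demo: n_actions actions, each with true value 0.
# Each estimate is the mean of a few N(0,1) samples; the max of these
# noisy estimates is biased upward relative to the true max (= 0).
def max_of_estimates(n_actions=10, n_samples=5, trials=2000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = [
            sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
            for _ in range(n_actions)
        ]
        total += max(estimates)   # E[max of estimates] > 0
    return total / trials
```

Running this returns a clearly positive average, illustrating why a Q-learning target built from $\max_a Q(s,a)$ systematically overestimates.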

Double learning: the basic idea is to use one estimate, $Q_1$, to determine the maximizing action, and a second independent estimate, $Q_2$, to evaluate that action.

The two estimates take turns playing these two roles:

$$A^\ast = \arg\max_a Q_1(a)$$ $$Q_2(A^\ast) = Q_2(\arg\max_a Q_1(a))$$

This removes the bias because $Q_2$ plays no part in selecting $A^\ast$, so $\mathbb{E}[Q_2(A^\ast)] = q(A^\ast)$.

An example of double learning is the double Q-learning update:

$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha[ R_{t+1} + \gamma Q_2 (S_{t+1}, \arg\max_a Q_1 (S_{t+1},a)) - Q_1(S_t,A_t) ]$$
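A hedged sketch of double Q-learning on the same hypothetical corridor MDP used above (the environment and hyperparameters are illustrative assumptions). On each step a coin flip decides which estimate is updated; the updated estimate selects the action, and the other evaluates it:

```python
import random

# Hypothetical corridor MDP: states 0..4, right from state 3 is
# terminal with reward +1, everything else gives reward 0.
def step(s, a):
    s_next = max(0, s + (1 if a == 1 else -1))
    return (s_next, 1.0, True) if s_next == 4 else (s_next, 0.0, False)

def double_q(episodes=1000, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q1 = [[0.0, 0.0] for _ in range(5)]
    Q2 = [[0.0, 0.0] for _ in range(5)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # behave epsilon-greedily w.r.t. the sum Q1 + Q2
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                sums = [Q1[s][x] + Q2[s][x] for x in range(2)]
                a = rng.choice([x for x in range(2) if sums[x] == max(sums)])
            s_next, r, done = step(s, a)
            if rng.random() < 0.5:
                # Q1 selects the maximizing action, Q2 evaluates it
                a_star = max(range(2), key=lambda x: Q1[s_next][x])
                target = r + (0.0 if done else gamma * Q2[s_next][a_star])
                Q1[s][a] += alpha * (target - Q1[s][a])
            else:
                # roles reversed: Q2 selects, Q1 evaluates
                a_star = max(range(2), key=lambda x: Q2[s_next][x])
                target = r + (0.0 if done else gamma * Q1[s_next][a_star])
                Q2[s][a] += alpha * (target - Q2[s][a])
    return Q1, Q2
```

Each estimate is updated on only about half the steps, so double Q-learning needs roughly twice the experience per estimate, but in exchange the evaluation is unbiased.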