# Chapter 6 Temporal-Difference Learning

TD learning combines Monte Carlo ideas with DP ideas: like MC, it learns directly from experience without a model; like DP, it updates estimates based in part on other learned estimates.

## 6.1 TD Prediction

Comparison between Monte Carlo and Temporal-Difference

• Monte Carlo (every visit, constant-$\alpha$ MC)

$$V(S_t) \leftarrow V(S_t) + \alpha [G_t-V(S_t)]$$

• TD method (TD(0), one-step TD)

$$V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$

TD(0) is a bootstrapping method: it updates its estimate based in part on an existing estimate.

How the three approaches estimate $v_\pi$ for policy evaluation:

• DP: $v_\pi(s) = \mathbb{E}_\pi [R_{t+1} + \gamma v_\pi(S_{t+1})|S_t=s]$
• MC: $v_\pi(s) = \mathbb{E}_\pi [G_t|S_t=s]$
• TD: Combines the sampling of MC with the bootstrapping of DP

TD Error

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

Note that $\delta_t$ is not available until time $t+1$, since it depends on $R_{t+1}$ and $V(S_{t+1})$.
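As a concrete illustration, the TD(0) update above can be sketched on a small 5-state random-walk Markov reward process. The environment, episode count, and step size here are illustrative assumptions, not from the text:

```python
import random

# TD(0) prediction sketch: 5 non-terminal states (0..4), start in the
# middle, move left/right with equal probability; exiting to the right
# gives reward +1, exiting to the left gives 0 (hypothetical example).
def td0_random_walk(episodes=1000, alpha=0.1, gamma=1.0, seed=0):
    rng = random.Random(seed)
    V = [0.0] * 5                      # value estimates V(s)
    for _ in range(episodes):
        s = 2                          # start in the middle state
        while True:
            s_next = s + rng.choice([-1, 1])
            if s_next == 5:            # exited right: terminal, reward 1
                r, v_next, done = 1.0, 0.0, True
            elif s_next == -1:         # exited left: terminal, reward 0
                r, v_next, done = 0.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[s_next], False
            # TD(0) update: V(s) += alpha * (R + gamma*V(s') - V(s))
            V[s] += alpha * (r + gamma * v_next - V[s])
            if done:
                break
            s = s_next
    return V
```

For this walk the true values are $1/6, 2/6, \ldots, 5/6$, so the learned estimates should increase from left to right.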

## 6.2 Advantages of TD Prediction Methods

• No model of the environment is needed (no knowledge of reward or next-state probability distributions)
• Online and fully incremental: updates are made after each step, with no need to wait for the end of an episode
• Some MC methods must ignore or discount episodes on which experimental actions are taken, but TD methods do not need to

## 6.3 Optimality of TD(0)

Batch updating: updates are made only after processing each complete batch of training data.

Batch Monte Carlo: finds estimates that minimize MSE on the training set

Batch TD(0) finds the estimate that would be exactly correct for the maximum-likelihood model of the underlying Markov process. The maximum-likelihood estimate of a parameter is the parameter value under which the probability of generating the observed data is greatest.

Certainty-equivalence estimate: the value estimate obtained by assuming that the estimated model of the underlying process is exact rather than an approximation. Batch TD(0) converges to the certainty-equivalence estimate.

In batch form, TD(0) is faster than Monte Carlo methods because it computes the true certainty-equivalence estimate.

## 6.4 Sarsa: On-policy TD Control

Update the action-value function $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$$

Sarsa converges to an optimal policy and action-value function provided that all state-action pairs are visited infinitely often and the policy converges in the limit to the greedy policy (which can be arranged, for example, with $\varepsilon$-greedy policies where $\varepsilon = 1/t$).
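The Sarsa update can be sketched on a hypothetical one-dimensional corridor MDP: states 0..4, actions 0 = left and 1 = right, with reward +1 for reaching state 4 and 0 otherwise. The environment and hyperparameters are illustrative assumptions:

```python
import random

# Hypothetical corridor MDP: moving right from state 3 reaches the
# terminal state 4 with reward +1; all other transitions give reward 0.
def step(s, a):
    s_next = max(0, s + (1 if a == 1 else -1))
    return (s_next, 1.0, True) if s_next == 4 else (s_next, 0.0, False)

def eps_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return rng.randrange(2)
    best = max(Q[s])
    return rng.choice([a for a in range(2) if Q[s][a] == best])

def sarsa(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(5)]
    for _ in range(episodes):
        s = 0
        a = eps_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s_next, r, done = step(s, a)
            a_next = eps_greedy(Q, s_next, eps, rng)
            # on-policy target: uses the action actually selected next
            target = r + (0.0 if done else gamma * Q[s_next][a_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q
```

Because the target uses the action the $\varepsilon$-greedy policy actually takes, Sarsa evaluates the behavior policy itself; this is what makes it on-policy.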

## 6.5 Q-learning: Off-policy TD Control

Update the action-value function $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$$
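A minimal sketch of this update, on the same hypothetical corridor MDP used for Sarsa above (environment and hyperparameters are illustrative assumptions). The only change from Sarsa is the target, which maximizes over next actions regardless of what the behavior policy does:

```python
import random

# Hypothetical corridor MDP: states 0..4, right from state 3 is
# terminal with reward +1, everything else gives reward 0.
def step(s, a):
    s_next = max(0, s + (1 if a == 1 else -1))
    return (s_next, 1.0, True) if s_next == 4 else (s_next, 0.0, False)

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(5)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # behave epsilon-greedily (with random tie-breaking)
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                best = max(Q[s])
                a = rng.choice([x for x in range(2) if Q[s][x] == best])
            s_next, r, done = step(s, a)
            # off-policy target: max over next actions, independent of
            # the action the behavior policy will actually take
            best_next = 0.0 if done else max(Q[s_next])
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s_next
    return Q
```

Because the target uses $\max_a Q(S_{t+1}, a)$, Q-learning directly approximates $q_\ast$ even while following an exploratory behavior policy; this is what makes it off-policy.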

## 6.7 Maximization Bias and Double Learning

Maximization bias: when the same samples are used both to select the maximizing action and to estimate its value, the maximum of the estimates $Q(s,a)$ is biased upward, i.e., larger in expectation than the true maximum value.
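A small numerical demonstration (an illustrative assumption, not an example from the text): suppose every action's true value is 0, and each value is estimated by a noisy sample mean. The maximum of the estimates is then positive in expectation, even though the true maximum is 0.

```python
import random

# Maximization bias demo: n_actions actions, each with true value 0.
# Each estimate is the mean of a few N(0,1) samples; the max of these
# noisy estimates is biased upward relative to the true max (= 0).
def max_of_estimates(n_actions=10, n_samples=5, trials=2000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = [
            sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
            for _ in range(n_actions)
        ]
        total += max(estimates)   # E[max of estimates] > 0
    return total / trials
```

Running this returns a clearly positive average, illustrating why a Q-learning target built from $\max_a Q(s,a)$ systematically overestimates.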

Double learning: the basic idea is to use one estimate, $Q_1$, to determine the maximizing action, and a second independent estimate, $Q_2$, to evaluate that action.

The two estimates take turns playing these two roles:

$$A^\ast = \arg\max_a Q_1(a)$$ $$Q_2(A^\ast) = Q_2(\arg\max_a Q_1(a))$$

This removes the bias because $Q_2$ plays no part in selecting $A^\ast$, so $\mathbb{E}[Q_2(A^\ast)] = q(A^\ast)$.

An example of double learning is the double Q-learning update:

$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha[ R_{t+1} + \gamma Q_2 (S_{t+1}, \arg\max_a Q_1 (S_{t+1},a)) - Q_1(S_t,A_t) ]$$
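A hedged sketch of double Q-learning on the same hypothetical corridor MDP used above (the environment and hyperparameters are illustrative assumptions). On each step a coin flip decides which estimate is updated; the updated estimate selects the action, and the other evaluates it:

```python
import random

# Hypothetical corridor MDP: states 0..4, right from state 3 is
# terminal with reward +1, everything else gives reward 0.
def step(s, a):
    s_next = max(0, s + (1 if a == 1 else -1))
    return (s_next, 1.0, True) if s_next == 4 else (s_next, 0.0, False)

def double_q(episodes=1000, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q1 = [[0.0, 0.0] for _ in range(5)]
    Q2 = [[0.0, 0.0] for _ in range(5)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # behave epsilon-greedily w.r.t. the sum Q1 + Q2
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                sums = [Q1[s][x] + Q2[s][x] for x in range(2)]
                a = rng.choice([x for x in range(2) if sums[x] == max(sums)])
            s_next, r, done = step(s, a)
            if rng.random() < 0.5:
                # Q1 selects the maximizing action, Q2 evaluates it
                a_star = max(range(2), key=lambda x: Q1[s_next][x])
                target = r + (0.0 if done else gamma * Q2[s_next][a_star])
                Q1[s][a] += alpha * (target - Q1[s][a])
            else:
                # roles reversed: Q2 selects, Q1 evaluates
                a_star = max(range(2), key=lambda x: Q2[s_next][x])
                target = r + (0.0 if done else gamma * Q1[s_next][a_star])
                Q2[s][a] += alpha * (target - Q2[s][a])
    return Q1, Q2
```

Each estimate is updated on only about half the steps, so double Q-learning needs roughly twice the experience per estimate, but in exchange the evaluation is unbiased.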