Chapter 4 Dynamic Programming

Dynamic programming: a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP.

4.1 Policy Evaluation (Prediction)

Policy evaluation (the prediction problem): compute the state-value function $v_\pi$ for a given policy $\pi$, which satisfies the Bellman equation $$v_\pi(s)=\sum_a\pi(a|s)\left[ r(s,a) + \gamma\sum_{s'}p(s'|s,a) v_\pi(s') \right] $$
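Since the right-hand side is linear in $v_\pi$, this is a system of $|\mathcal{S}|$ simultaneous linear equations that could in principle be solved directly (the iterative method below is what the chapter actually uses). A minimal sketch of the direct solve, assuming hypothetical arrays `P[s, a, s']` for $p(s'|s,a)$, `R[s, a]` for $r(s,a)$, and `pi[s, a]` for $\pi(a|s)$:

```python
import numpy as np

def evaluate_policy_exactly(P, R, pi, gamma):
    """Solve v_pi = r_pi + gamma * P_pi @ v_pi as a linear system.

    P:  (S, A, S) transition probabilities p(s'|s,a)  [assumed representation]
    R:  (S, A)    expected rewards r(s,a)
    pi: (S, A)    policy probabilities pi(a|s)
    """
    S = P.shape[0]
    # Marginalize over the policy to get the induced Markov chain and reward vector.
    P_pi = np.einsum('sa,sat->st', pi, P)   # (S, S)
    r_pi = np.einsum('sa,sa->s', pi, R)     # (S,)
    # Solve (I - gamma * P_pi) v = r_pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```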

Iterative Policy Evaluation

The sequence $\{v_k\}$ can be shown to converge to $v_\pi$ as $k\rightarrow\infty$ (provided $\gamma < 1$ or eventual termination is guaranteed under $\pi$), using the update

$$v_{k+1}(s)=\sum_a\pi(a|s)\left[ r(s,a) + \gamma\sum_{s'}p(s'|s,a) v_{k}(s') \right]$$

Expected update: each operation that updates $v(s)$ is based on an expectation over all possible next states, weighted by their probabilities, rather than on a single sampled next state.

For the updates, we can keep either two arrays (one for the old values $v_k$ and one for the new values $v_{k+1}$), or a single array updated in place, so that new values are used as soon as they are available. The in-place version usually converges faster.
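A minimal sketch of iterative policy evaluation with the single-array, in-place update, under the same assumed `P`, `R`, `pi` representation as above (the convergence threshold `theta` is an assumption, not from the text):

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, theta=1e-8):
    """In-place iterative policy evaluation.

    P:  (S, A, S) transition probabilities p(s'|s,a)
    R:  (S, A)    expected rewards r(s,a)
    pi: (S, A)    policy probabilities pi(a|s)
    """
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        delta = 0.0
        for s in range(S):
            v_old = V[s]
            # Expected update: average over actions and next states under pi.
            V[s] = np.sum(pi[s] * (R[s] + gamma * P[s] @ V))
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:   # stop when no state value changes by more than theta
            return V
```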

4.2 Policy Improvement

To evaluate whether selecting action $a$ in state $s$ and thereafter following $\pi$ is better than following $\pi$ throughout, we compare $v_\pi(s)$ with $$ q_\pi(s,a) = \sum_{s',r}p(s',r|s,a)\left[ r+\gamma v_\pi(s') \right] $$

Policy improvement theorem: if, for all states $s$,

$$ q_\pi(s,\pi'(s)) \geq v_\pi(s), $$

$$\text{then } v_{\pi'}(s) \geq v_\pi(s) \text{ for all } s, $$

i.e., policy $\pi'$ is as good as or better than policy $\pi$.

Change to the new greedy policy with respect to $v_\pi$:

$$\pi'(s) \doteq \arg \max_a q_\pi(s,a) = \arg \max_a\sum_{s',r}p(s',r|s,a)\left[ r+\gamma v_\pi(s') \right]$$
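A sketch of this greedy improvement step, again assuming the `(S, A, S)` transition array `P` and `(S, A)` expected-reward array `R` (using $r(s,a)$ in place of the sum over rewards, which is equivalent):

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """Return a deterministic policy that is greedy with respect to V.

    Computes q(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) * V(s')
    and picks argmax_a q(s, a) for every state.
    """
    Q = R + gamma * np.einsum('sat,t->sa', P, V)   # (S, A) action values
    return np.argmax(Q, axis=1)                    # one greedy action index per state
```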

4.3 Policy Iteration

Policy iteration: $$\pi_0 \xrightarrow{\text{E}} v_{\pi_0} \xrightarrow{\text{I}} \pi_1 \xrightarrow{\text{E}} ... \xrightarrow{\text{I}}\pi_\ast\xrightarrow{\text{E}} v_\ast $$ where $\xrightarrow{\text{E}}$ denotes policy evaluation, and $\xrightarrow{\text{I}}$ denotes policy improvement.
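Putting the E and I steps together, a minimal policy-iteration loop that reuses the hypothetical `iterative_policy_evaluation` and `greedy_policy` sketches above (representing the deterministic greedy policy as one-hot rows is a choice of this sketch, not prescribed by the text):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate evaluation (E) and greedy improvement (I) until the policy is stable."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)                           # start from the uniform random policy
    while True:
        V = iterative_policy_evaluation(P, R, pi, gamma)    # E: evaluate current policy
        greedy = greedy_policy(P, R, V, gamma)              # I: greedy improvement
        new_pi = np.eye(A)[greedy]                          # deterministic policy as one-hot rows
        if np.array_equal(new_pi, pi):
            return greedy, V                                # policy stable => optimal
        pi = new_pi
```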

Example 4.2 Jack's Car Rental

(This example is important and will be discussed in class.)

4.4 Value Iteration

One drawback of policy iteration: each iteration includes a full policy evaluation, which is itself an iterative computation that may require many sweeps (a long inner loop).

Value iteration truncates policy evaluation to a single sweep combined with policy improvement:

$$ v_{k+1}(s)\doteq \max_{a} \mathbb{E}\left[ R_{t+1}+\gamma v_k(S_{t+1}) \,\middle|\, S_t=s, A_t=a \right]$$

$$ v_{k+1}(s)\doteq \max_{a} \sum_{s',r}p(s',r|s,a)\left[r+\gamma v_k(s')\right]$$
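A minimal value-iteration sketch under the same assumed `P` and `R` arrays; the only change from policy evaluation is that the max over actions replaces the policy-weighted average:

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Value iteration: one evaluation sweep fused with greedy improvement per iteration."""
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        # q(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) * V(s')
        Q = R + gamma * np.einsum('sat,t->sa', P, V)
        V_new = Q.max(axis=1)                        # max over actions instead of averaging under pi
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=1)           # optimal values and a greedy (optimal) policy
        V = V_new
```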

4.5 Asynchronous Dynamic Programming