# Chapter 1 Introduction

## 1.1 Reinforcement Learning

• Two distinguishing features of reinforcement learning: trial-and-error search and delayed reward.

• The reinforcement learning problem can be framed as the optimal control of incompletely known Markov decision processes.

• Reinforcement learning is different from supervised learning and unsupervised learning.

• Trade off between exploration and exploitation.

• Reinforcement learning considers the whole problem of a goal-directed agent interacting with an uncertain environment.

• Of all forms of machine learning, reinforcement learning is the closest to the kind of learning that humans and animals do.
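The exploration–exploitation trade-off above is often illustrated with an epsilon-greedy rule (a standard example, not from the source): with a small probability the agent explores a random action, and otherwise it exploits the action with the highest current value estimate.

```python
import random

def epsilon_greedy(values, epsilon=0.1):
    """Return an action index: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(values))            # explore: random action
    return max(range(len(values)), key=values.__getitem__)  # exploit: greedy action
```

With `epsilon=0` the rule is purely greedy (pure exploitation); with `epsilon=1` it is purely random (pure exploration).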

## 1.3 Elements of Reinforcement Learning

Besides the agent and the environment, the other main elements are as follows:

### Policy

A policy defines the learning agent's way of behaving at a given time. Note: it describes how the agent decides what to do, not how the environment works.

### Reward Signal

A reward signal defines the goal of a reinforcement learning problem. It indicates what is good in an immediate sense.

### Value Function

A value function specifies what is good in the long run.

The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
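In symbols, this notion of value can be written (using standard notation not in the source, where $\gamma \in [0,1)$ is a discount factor and $\pi$ the policy being followed):

$$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

i.e., the expected cumulative (discounted) reward starting from state $s$.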

### Model of Environment (Optional)

Roughly speaking, it is a model built by the agent to mimic the behavior of the environment and predict how it will respond to actions. Modern reinforcement learning spans the spectrum from low-level, trial-and-error learning to high-level, deliberative planning.

## 1.4 Limitations and Scope

• Most of the reinforcement learning methods we consider are structured around estimating value functions, but estimating value functions is not strictly necessary.

• Evolutionary methods: do not estimate value functions; they try many policies, keep those that obtain the most reward, and carry their variants over to the next generation. These methods are effective if the space of policies is sufficiently small, or can be structured so that good policies are common or easy to find. In general, however, evolutionary methods are relatively ineffective, because they ignore the details of the individual interactions (which states were visited, which actions were taken) during an agent's lifetime.
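The "try policies, keep the best, generate variants" loop can be sketched as a simple hill-climbing search. This is a minimal illustration, not from the source: the `evaluate` function is a hypothetical stand-in for playing games with a policy and scoring the outcomes.

```python
import random

def evaluate(policy):
    """Hypothetical stand-in for a policy's total reward: here, policies
    whose weights are closer to a fixed target score higher."""
    target = [0.5, 0.2, 0.8]
    return -sum((p - t) ** 2 for p, t in zip(policy, target))

def evolve(generations=200, sigma=0.1, seed=0):
    """Keep the best policy found so far; mutate it and keep improvements."""
    rng = random.Random(seed)
    best = [rng.random() for _ in range(3)]       # random initial policy
    best_score = evaluate(best)
    for _ in range(generations):
        variant = [p + rng.gauss(0, sigma) for p in best]  # mutated variant
        score = evaluate(variant)
        if score > best_score:                    # keep only if it improves
            best, best_score = variant, score
    return best, best_score
```

Note what the loop never uses: which states were visited or which individual actions were good — only each policy's total score. That is the information value-function methods exploit and evolutionary methods discard.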

## 1.5 An Extended Example: Tic-Tac-Toe

Basic idea (temporal-difference learning): after each move, update the estimated value of the earlier state to be closer to the value of the state that follows it:

$$V(S_t) \leftarrow V(S_t) + \alpha [V(S_{t+1})-V(S_t)]$$

where $\alpha$ is a small positive fraction called the step-size parameter, which influences the rate of learning.
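The update rule above translates directly into code. A minimal sketch, assuming values are stored in a dict keyed by state (the default value `0.5` for unseen states is an illustrative choice, not from the source):

```python
def td_update(V, s, s_next, alpha=0.1):
    """One temporal-difference backup: move V(s) toward V(s_next).
    Implements V(s) <- V(s) + alpha * (V(s_next) - V(s))."""
    v, v_next = V.get(s, 0.5), V.get(s_next, 0.5)
    V[s] = v + alpha * (v_next - v)
    return V[s]
```

For example, if $V(S_t)=0.5$, $V(S_{t+1})=1.0$, and $\alpha=0.1$, one update moves $V(S_t)$ to $0.55$.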

### Evolutionary Methods and Value Function Methods

Evolutionary methods: hold the policy fixed over many games, estimate the probability of winning with that policy from the outcomes, and use this estimate to direct the next policy selection. If a game is won, all of the behavior in that game gets credit, even moves that did not contribute to the win.

Value function methods: allow individual states to be evaluated, so credit can be assigned to individual moves rather than to whole games.