## Reinforcement learning #

This is my summary of David Silver’s lectures. They are very information-dense and may be worth a second watch to really get all the ideas (or read along and do some exercises).

### Introduction to Reinforcement Learning #

• Science of decision making.
• Sits at the intersection of ML / OR / Neuroscience / Psychology / Economics.
• No supervisor, only a “reward signal”, and feedback may be delayed. Hence, time plays an important role. Data is not i.i.d. and the agent affects the data it receives.
• Reward hypothesis: Any goal can be described by the maximisation of expected cumulative reward (reward $$R_t$$ is a scalar feedback signal). An interesting discussion of the reward hypothesis here.
• At each step:
• The agent executes an action and receives an updated observation and the reward
• The environment receives the action, and emits the observation and the reward
• The history is the sequence of observations, actions and rewards to now. What happens next depends on the history: based on the history the agent selects actions, and the environment selects observations and rewards.
• Typically we look at the current state: a concise summary of the history, the information we will use to determine what happens next.
• Formally, state is a function of the history: $$S_t=f(H_t)$$.
• Environment state. The data the environment uses to determine what the next observation/reward is. Not usually visible to the agent. A formalism to help us understand things but not usable (or particularly relevant) in RL algorithms.
• Agent state. The internal representation used by the agent to decide the next action.
• Information / Markov state. Contains “all useful” information from the history. A state $$S_t$$ is Markov iff $P(S_{t+1}\mid S_t) = P(S_{t+1}\mid S_1,\ldots,S_t).$ The future is independent of the past given the present. $$H_{1:t}\to S_t\to H_{t+1:\infty}$$. The state is a sufficient statistic of the future.
• Markov state of the helicopter: position, velocity, angular velocities etc. Non-Markov state would be just position, so you need to “go back in time” and look at the history to determine the velocities.
• The environment state is Markov.
• The history $$H_t$$ (considered as a state) is Markov.
• Fully observable environment. Agent state = environment state = information state. This is a Markov decision process.
• Partially observable environment.
• Most interesting examples.
• Partially observable Markov decision process (POMDP).
• The agent must construct its own state representation.
• Simplest (quickly infeasible): $$S_t^a=H_t$$.
• Bayesian: keep a vector of probabilities of what the environment state is.
• Recurrent neural network $$S_t^a=\sigma(S_{t-1}^a W_s+O_tW_o)$$.
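
A rough numpy sketch of that recurrent state update (the dimensions, random weights, and the choice of sigmoid nonlinearity are my own assumptions for illustration, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: 4-dimensional agent state, 3-dimensional observation.
state_dim, obs_dim = 4, 3
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))   # recurrent weights (made up)
W_o = rng.normal(size=(obs_dim, state_dim))     # observation weights (made up)

def update_agent_state(s_prev, o_t):
    """One step of S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)."""
    return sigmoid(s_prev @ W_s + o_t @ W_o)

s = np.zeros(state_dim)          # initial agent state
o = rng.normal(size=obs_dim)     # one made-up observation
s = update_agent_state(s, o)
print(s)
```
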
• Major components of an RL agent
• Policy: the behaviour function. A map from state to action $$a=\pi(s)$$ (if deterministic), $$\pi(a\mid s)=P(A=a\mid S=s)$$ (if stochastic).
• Value function: scoring states and actions. Prediction of future reward, used to evaluate goodness/badness of states. $v_\pi(s)=\mathbb{E}_\pi(R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots\mid S_t=s).$
• Model: representation of the environment, predicts what the environment will do next. Not a requirement (many examples in the course will be model-free). Two parts:
• Transition model (predicts the next state) $$P(S'=s'\mid S=s,A=a)$$
• Reward model (predicts the next (immediate) reward) $$\mathbb{E}(R\mid S=s, A=a)$$.
• Maze example.
• Rewards: -1 per time step.
• Actions: N, E, S, W
• States: Agent’s location
• Policy: what to do in each state (arrows on the blocks in the maze)
• Value function: distance from goal based on this policy
• Categorising RL agents
• Value based. Value function only; the policy is implicit (just pick actions greedily with respect to the value function).
• Policy based. Policy, no value function.
• Actor critic. Both a policy and a value function.
• Model free. Policy and/or value function, no model.
• Model based. Policy and/or value function, plus a model.
• Two fundamental problems in sequential decision making.
• Reinforcement learning. Environment is initially unknown. The agent interacts with the environment and improves its policy.
• Planning. The model of the environment is known and the agent performs computations with its model and improves its policy.

### Markov Decision Processes #

#### Markov processes #

• State transition probability $$\mathcal{P}_{ss'}=P(S_{t+1}=s'\mid S_t=s)$$. A Markov process (or chain) is a tuple $$(\mathcal{S}, \mathcal{P})$$: a finite set of states and the matrix of state transition probabilities.
• An episode is a sample from the chain, run from a start state to a terminal state.

#### Markov reward processes #

• Add a reward function $$R_s=\mathbb{E}(R_{t+1}\mid S_t=s)$$
• Add a discount factor $$\gamma\in[0,1]$$. The present value of future rewards. 0 is near-sighted, 1 is far-sighted. The value of receiving reward $$R$$ after $$k+1$$ timesteps is $$\gamma^k R$$.
• The return $G_t=R_{t+1}+\gamma R_{t+2}+\ldots=\sum_{k=0}^\infty \gamma^k R_{t+k+1}.$
• We discount because there is more uncertainty in the future, but mainly to keep the maths easier (sums converge). It is also a decent cognitive model: behavioural experiments show a preference for immediate reward.
• If all sequences terminate it is possible to use undiscounted Markov reward processes.
• The value function $$v(s)$$ gives the long-term value of state $$s$$. Formally, $v(s) = \mathbb{E}(G_t\mid S_t=s),$ i.e. the value function is the expected return when starting from state $$s$$.
• The value function can be decomposed into two parts:
• the immediate reward $$R_{t+1}$$, and
• the discounted value of the successor state $$\gamma v(S_{t+1})$$.
• The Bellman equation: $v(s)= \mathbb{E}(R_{t+1}+\gamma v(S_{t+1})\mid S_t=s).$ A tautology essentially, describing what we mean by value. A “one-step lookahead search”. Can be expressed concisely using matrices: $$\mathbf{v}=\mathcal{R}+\gamma\mathcal{P}\mathbf{v}$$. $$\mathcal{R}$$ is the reward vector, $$\mathcal{P}$$ is the transition matrix.
• This is a linear equation that can be solved directly: $\mathbf{v}=(I-\gamma\mathcal{P})^{-1}\mathcal{R}.$
• Complexity is $$O(n^3)$$ for $$n$$ states, so not usually tractable.
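
As a quick illustration of the direct solve, here is a minimal numpy sketch on a made-up three-state Markov reward process (the chain and its numbers are invented for the example):

```python
import numpy as np

# Hypothetical 3-state MRP: states 0 and 1 are transient, state 2 is terminal.
P = np.array([[0.5, 0.5, 0.0],   # transition matrix P[s, s']
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])  # expected immediate reward R_s
gamma = 0.9

# Solve v = R + gamma * P v  <=>  (I - gamma * P) v = R.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)                          # long-term value of each state
```

The cubic cost of the solve is irrelevant at this size; the point is just that, for a fixed MRP, the value function is the solution of a linear system.
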

#### Markov decision processes #

• An MDP is a Markov reward process with decisions. We add a finite set of actions $$\mathcal{A}$$.

• $$\mathcal{P}$$, state transition probability matrix, now depends on which action we take $\mathcal{P}^a_{ss'}=P(S_{t+1}=s'\mid S_t=s, A_t=a).$

• The reward function also depends on the action.

• A policy $$\pi$$ is a distribution over actions given states, $\pi(a\mid s)=P(A_t=a\mid S_t=s).$

• A policy fully defines the behaviour of an agent.
• For an MDP, policies depend only on the current state (i.e. not on the history).
• Maximise the reward from now onwards.
• For a given MDP and a policy $$\pi$$, the state sequence $$S_1,S_2,\ldots$$ is a Markov process.

• The state and reward sequence is a Markov reward process.

• The state-value function $$v_\pi(s)$$ of an MDP is the expected return starting from state $$s$$ and then following policy $$\pi$$: $v_\pi(s)=\mathbb{E}_\pi(G_t\mid S_t=s).$

• The action-value function $$q_\pi(s,a)$$ is the expected return starting from state $$s$$, taking action $$a$$, and then following policy $$\pi$$: $q_\pi(s,a)=\mathbb{E}_\pi(G_t\mid S_t=s, A_t=a).$

• The state-value function and action-value function can again be decomposed into immediate reward plus discounted value of successor state with all the conditions on policy and action:

\begin{align*} v_\pi(s)&=\mathbb{E}_\pi(R_{t+1}+\gamma v_\pi(S_{t+1})\mid S_t=s)\\ q_\pi(s,a)&=\mathbb{E}_\pi(R_{t+1}+\gamma q_\pi(S_{t+1},A_{t+1})\mid S_t=s, A_t=a). \end{align*}

• There is a similar one-step lookahead interpretation here: $v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)\Bigl(\mathcal{R}_s^a+\gamma\sum_{s'\in \mathcal{S}}\mathcal{P}_{ss'}^av_\pi(s')\Bigr)$ and we can write the same thing for action-values (Bellman expectation equation).
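
One way to make the Bellman expectation equation concrete: for a fixed policy the MDP collapses into an MRP with $$\mathcal{R}^\pi_s=\sum_a\pi(a\mid s)\mathcal{R}^a_s$$ and $$\mathcal{P}^\pi_{ss'}=\sum_a\pi(a\mid s)\mathcal{P}^a_{ss'}$$, so $$v_\pi$$ comes from the same linear solve as before. A minimal sketch with an invented two-state, two-action MDP and a uniform random policy:

```python
import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9

# Hypothetical dynamics P[a, s, s'] and rewards R[s, a], invented for illustration.
P = np.array([[[0.8, 0.2],    # action 0
               [0.1, 0.9]],
              [[0.3, 0.7],    # action 1
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])

pi = np.full((n_states, n_actions), 0.5)        # uniform random policy pi(a|s)

# Collapse MDP + policy into an MRP.
R_pi = np.einsum('sa,sa->s', pi, R)             # R^pi_s
P_pi = np.einsum('sa,ast->st', pi, P)           # P^pi_{ss'}

# Bellman expectation equation in matrix form: v = R^pi + gamma * P^pi v.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(v_pi)
```
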

#### The optimal value function #

• The optimal state-value function $$v_*(s)=\mathrm{max}_\pi v_\pi(s)$$.
• The optimal action-value function $$q_*(s,a)=\mathrm{max}_\pi q_\pi(s,a)$$. If you know this, you’re done. An MDP is solved when we know the optimal value function.
• We define a partial ordering over policies by $$\pi\geq \pi'$$ if $$v_\pi(s)\geq v_{\pi'}(s)$$ for all $$s$$.
• Theorem. For any Markov decision process, there exists an optimal policy $$\pi_*$$ that is better than or equal to all other policies. All optimal policies achieve the optimal value function $$v_{\pi_*}(s)=v_*(s)$$ and the optimal action-value function $$q_{\pi_*}(s,a)=q_*(s,a)$$.
• We can find an optimal policy by maximising over $$q_*(s,a)$$: $\pi_*(a\mid s)=1, \quad\mbox{if } a=\mathrm{argmax}_{a\in\mathcal{A}}q_*(s,a)$ and $$0$$ otherwise.
• The optimal value functions are recursively related by the Bellman optimality equations: $v_*(s)=\max_a q_*(s,a).$
• Similarly for $$q_*$$: $q_*(s,a)=\mathcal{R}_s^a+\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_*(s').$
• And combining these: $v_*(s)=\max_a \Bigl(\mathcal{R}_s^a +\gamma\sum_{s'\in \mathcal{S}}\mathcal{P}^a_{ss'} v_*(s')\Bigr).$
• These equations are non-linear, with no closed-form solution. Usually we use iterative solution methods (value iteration, policy iteration, Q-learning, Sarsa).
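
The greedy-policy construction and the one-step lookahead above translate almost directly into code. A minimal sketch, reusing the invented two-state, two-action dynamics from the previous snippet and a placeholder estimate of $$v_*$$:

```python
import numpy as np

def q_from_v(v, P, R, gamma):
    """q(s, a) = R_s^a + gamma * sum_{s'} P^a_{ss'} v(s') for a given value estimate."""
    # P has shape (actions, states, states); R has shape (states, actions).
    return R + gamma * np.einsum('ast,t->sa', P, v)

def greedy_policy(q):
    """Deterministic policy: pick argmax_a q(s, a) in every state."""
    return np.argmax(q, axis=1)

# Same invented dynamics as before.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [2.0, -1.0]])
v = np.zeros(2)                       # placeholder estimate of v_*
q = q_from_v(v, P, R, gamma=0.9)
print(greedy_policy(q))               # best action in each state under this estimate
```
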

### Planning by Dynamic Programming #

#### Introduction #

• Dynamic programming: cf. linear programming (a “program” in the mathematical sense). “Dynamic” means there is some sequential or temporal component to the problem, i.e. it proceeds in steps.
• A method for solving complex problems by breaking them down into subproblems and then combining the solutions of the subproblems into a solution for the full problem.
• Needs two properties in the problem:
• Optimal substructure (optimal solutions from subproblems can be built into an optimal solution for the full problem: principle of optimality, see shortest path).
• Overlapping subproblems (subproblems occur again and again, so solutions can be cached and reused).
• MDPs satisfy both of these properties. The Bellman equation gives us the recursive decomposition. Value function gives us the cache and the ability to reuse solutions.
• Dynamic programming is used for planning in an MDP, assuming full knowledge of the MDP.
• Two cases:
1. Prediction: input MDP and policy, output value function.
2. Control: input MDP, output optimal value function and policy.
• Used in many other types of algorithms (shortest path, Viterbi, lattice models, string algorithms).

#### Policy evaluation #

• You are given an MDP and a policy: how much reward will we get? (Evaluate the policy.)
• Iterative application of the Bellman expectation equation.
• $$v_1\to v_2\to \cdots \to v_\pi$$. Start with an arbitrary initial value function, then, using synchronous backups (a code sketch follows below):
• at each iteration $$k+1$$,
• for all states $$s$$
• update $$v_{k+1}(s)$$ from $$v_k(s')$$
• where $$s'$$ is a successor state of $$s$$
• Convergence is proven later (contraction mapping theorem, a.k.a. the Banach fixed-point theorem).
• Small gridworld example. Very cool.
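
Here is a minimal sketch of iterative policy evaluation with synchronous backups, on a small gridworld of my own invention (4×4 grid, one terminal corner, reward −1 per step, uniform random policy) rather than the exact example from the lecture:

```python
import numpy as np

N = 4                      # hypothetical 4x4 gridworld, terminal state at (0, 0)
gamma = 1.0                # undiscounted; episodes terminate
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # N, S, W, E

def step(state, action):
    """Deterministic dynamics: move, staying put if we would leave the grid."""
    r, c = state
    dr, dc = action
    nr, nc = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
    return (nr, nc)

def policy_evaluation(tol=1e-6):
    v = np.zeros((N, N))
    while True:
        v_new = np.zeros_like(v)
        for r in range(N):
            for c in range(N):
                if (r, c) == (0, 0):           # terminal state keeps value 0
                    continue
                # Bellman expectation backup under the uniform random policy.
                v_new[r, c] = sum(0.25 * (-1 + gamma * v[step((r, c), a)])
                                  for a in actions)
        if np.max(np.abs(v_new - v)) < tol:    # synchronous sweep converged
            return v_new
        v = v_new

print(policy_evaluation().round(1))
```
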

#### Policy iteration #

• Start with a given policy $$\pi$$.
• Evaluate the policy using the above policy evaluation method.
• Improve the policy by acting greedily with respect to this value function.
• In the small gridworld example it only took one iteration; in general it can take many, but this process does always converge to the optimal policy $$\pi_*$$.
• Relationship to expectation maximisation?
• Consider a deterministic policy $$a=\pi(s)$$. We can improve the policy by acting greedily

$\pi'(s) = \mathrm{argmax}_{a\in\mathcal{A}} q_\pi(s, a).$

• This does indeed improve the value from any state $$s$$ over one step, since

$q_\pi(s, \pi'(s)) = \max_{a\in\mathcal{A}}q_\pi(s, a)\geq q_\pi(s, \pi(s))=v_\pi(s).$

• Therefore it improves the value function: $$v_{\pi'}(s)\geq v_\pi(s)$$ (by a telescoping argument).
• If improvements stop, i.e. we achieve equality, then the Bellman optimality equation has been satisfied by definition, and hence we have the optimal policy.
• Do we actually need policy evaluation to converge fully before improving? We can introduce stopping conditions or stop after a fixed number of sweeps.
• In the extreme, do just one sweep of evaluation and then improve: this is value iteration.
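
Before moving on, a minimal policy-iteration sketch on the invented two-state, two-action MDP from earlier: exact evaluation via the linear solve, then greedy improvement, stopping when the policy no longer changes:

```python
import numpy as np

# Invented 2-state, 2-action MDP (same shapes as the earlier snippets).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.3, 0.7], [0.6, 0.4]]])      # P[a, s, s']
R = np.array([[1.0, 0.0], [2.0, -1.0]])       # R[s, a]
gamma = 0.9
n_states = 2

def evaluate(policy):
    """Exact v_pi for a deterministic policy via the linear Bellman solve."""
    P_pi = P[policy, np.arange(n_states), :]          # P^pi_{ss'}
    R_pi = R[np.arange(n_states), policy]             # R^pi_s
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def policy_iteration():
    policy = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        v = evaluate(policy)                           # policy evaluation
        q = R + gamma * np.einsum('ast,t->sa', P, v)   # one-step lookahead
        new_policy = np.argmax(q, axis=1)              # greedy improvement
        if np.array_equal(new_policy, policy):         # improvement stopped: optimal
            return policy, v
        policy = new_policy

print(policy_iteration())
```
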

#### Value iteration #

• Any optimal policy can be subdivided into two components: an optimal first action $$A^*$$ followed by an optimal policy from the successor state $$S'$$ (principle of optimality).
• Formulate into a theorem: a policy $$\pi(a\mid s)$$ achieves the optimal value from state $$s$$ if and only if, for any state $$s'$$ reachable from $$s$$, $$\pi$$ achieves the optimal value from state $$s'$$ onwards.
• Use this to build a value iteration algorithm (“backwards induction”). If we know the solution to the subproblems $$v_*(s')$$, then the solution $$v_*(s)$$ can be found by one-step lookahead.
• The idea of value iteration is to apply these updates iteratively.
• Intuition: start with final rewards and work backwards. Even works with “loopy, stochastic” MDPs.
• $$v_1\to v_2\to \cdots \to v_*$$. Start with an initial arbitrary value function, then using iterative application of Bellman optimality:
• at each iteration $$k+1$$
• for all states $$s$$
• update $$v_{k+1}(s)$$ from $$v_k(s')$$
• Convergence to $$v_*$$ will be proven later (Banach fixed-point theorem).
• Unlike policy iteration, there is no explicit policy: intermediate value functions may not be the value function of any real policy.
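
A corresponding value-iteration sketch, applying the Bellman optimality backup directly to the same invented two-state MDP (see the summary table below for how the three algorithms line up):

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.3, 0.7], [0.6, 0.4]]])      # P[a, s, s']
R = np.array([[1.0, 0.0], [2.0, -1.0]])       # R[s, a]
gamma = 0.9

def value_iteration(tol=1e-8):
    v = np.zeros(2)
    while True:
        # Bellman optimality backup: v(s) <- max_a [ R_s^a + gamma * sum_{s'} P^a_{ss'} v(s') ]
        v_new = np.max(R + gamma * np.einsum('ast,t->sa', P, v), axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

v_star = value_iteration()
# Extract a greedy policy from the converged value function.
policy_star = np.argmax(R + gamma * np.einsum('ast,t->sa', P, v_star), axis=1)
print(v_star, policy_star)
```
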
| Problem | Bellman equation | Algorithm |
| --- | --- | --- |
| Prediction | Bellman expectation equation | Iterative policy evaluation |
| Control | Bellman expectation equation + greedy policy improvement | Policy iteration |
| Control | Bellman optimality equation | Value iteration |
• Toy example of a larger Gridworld.
• These ideas are actually quite intuitive, the notation makes it all a little cumbersome but it’s quite natural once you think it through. Silver’s explanation of the above table is a very coherent summary of this complicated-feeling lecture.
• Can also do similar things with the action-value function $$q$$; these give rise to higher-complexity algorithms, but they are definitely useful in the general RL problem.