Reinforcement learning #

This is my summary of David Silver’s lectures. They are very information-dense and may be worth a second watch to really absorb all the ideas (or read along and do some exercises).

Introduction to Reinforcement Learning #

  • Science of decision making.
  • Sits at the intersection of ML / OR / Neuroscience / Psychology / Economics.
  • No supervisor, only a “reward signal”, and feedback may be delayed. Hence, time plays an important role. Data is not i.i.d. and the agent affects the data it receives.
  • Reward hypothesis: Any goal can be described by the maximisation of expected cumulative reward (reward \(R_t\) is a scalar feedback signal). An interesting discussion of the reward hypothesis here.
  • At each step:
    • The agent executes an action and receives an updated observation and the reward
    • The environment receives the action, and emits the observation and the reward (this interaction loop is sketched in code at the end of this section)
    • The history is the sequence of observations, actions and rewards to now. What happens next depends on the history: based on the history the agent selects actions, and the environment selects observations and rewards.
    • Typically we look at the current state: a concise summary of the history, the information we will use to determine what happens next.
    • Formally, state is a function of the history: \(S_t=f(H_t)\).
  • Environment state. The data the environment uses to determine what the next observation/reward is. Not usually visible to the agent. A formalism to help us understand things but not usable (or particularly relevant) in RL algorithms.
  • Agent state. The internal representation used by the agent to decide the next action.
  • Information / Markov state. Contains “all useful” information from the history. A state \(S_t\) is Markov iff \[P(S_{t+1}\mid S_t) = P(S_{t+1}\mid S_1,\ldots,S_t).\] The future is independent of the past given the present. \(H_{1:t}\to S_t\to H_{t+1:\infty}\). The state is a sufficient statistic of the future.
    • Markov state of the helicopter: position, velocity, angular velocities etc. Non-Markov state would be just position, so you need to “go back in time” and look at the history to determine the velocities.
    • The environment state is Markov.
    • The history \(H_t\) (considered as a state) is Markov.
  • Fully observable environment. Agent state = environment state = information state. This is a Markov decision process.
  • Partially observable environment.
    • Most interesting examples.
    • Partially observable Markov decision process.
    • The agent must construct its own state representation.
    • Simplest (quickly infeasible): \(S_t^a=H_t\).
    • Bayesian: keep a vector of probabilities of what the environment state is.
    • Recurrent neural network \(S_t^a=\sigma(S_{t-1}^a W_s+O_tW_o)\).
  • Major components of an RL agent
    • Policy: the behaviour function. A map from state to action \(a=\pi(s)\) (if deterministic), \(\pi(a\mid s)=P(A=a\mid S=s)\) (if stochastic).
    • Value function: scoring states and actions. Prediction of future reward, used to evaluate goodness/badness of states. \[v_\pi(s)=\mathbb{E}_\pi(R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots\mid S_t=s).\]
    • Model: representation of the environment, predicts what the environment will do next. Not a requirement (many examples in the course will be model-free). Two parts:
      • Transition model (predicts the next state) \(P(S'=s'\mid S=s,A=a)\)
      • Reward model (predicts the next (immediate) reward) \(\mathbb{E}(R\mid S=s, A=a)\).
  • Maze example.
    • Rewards: -1 per time step.
    • Actions: N, E, S, W
    • States: Agent’s location
    • Policy: what to do in each state (arrows on the blocks in the maze)
    • Value function: distance from goal based on this policy
  • Categorising RL agents
    • Value based. Has a value function but no explicit policy (the policy is implicit: pick actions greedily with respect to the value function).
    • Policy based. Policy, no value function.
    • Actor critic. Both a policy and a value function.
    • Model free. Policy and/or value function, no model.
    • Model based. Policy and/or value function, plus a model.
  • Two fundamental problems in sequential decision making.
    • Reinforcement learning. Environment is initially unknown. The agent interacts with the environment and improves its policy.
    • Planning. The model of the environment is known and the agent performs computations with its model and improves its policy.
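
A minimal sketch of the agent-environment loop described above. The `RandomAgent` and `CoinFlipEnvironment` classes are made up for illustration, standing in for a real policy and environment (neither is from the lectures):

```python
import random

class RandomAgent:
    """Picks actions uniformly at random; a stand-in for a real policy."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, history):
        return random.choice(self.actions)

class CoinFlipEnvironment:
    """Toy environment: reward +1 if the action guesses the coin flip correctly."""
    def step(self, action):
        coin = random.choice(["heads", "tails"])
        observation = coin
        reward = 1.0 if action == coin else 0.0
        return observation, reward

agent = RandomAgent(actions=["heads", "tails"])
env = CoinFlipEnvironment()

history = []  # the sequence of (action, observation, reward) triples so far
for t in range(5):
    action = agent.act(history)                 # agent executes an action
    observation, reward = env.step(action)      # environment emits observation and reward
    history.append((action, observation, reward))
    print(t, action, observation, reward)
```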

Markov Decision Processes #

Markov processes #

  • State transition probability \(P_{ss'}=P(S_{t+1}=s'\mid S_t=s)\). A Markov process (or chain) is a tuple \((S, P)\): a finite set of states and the matrix of state transition probabilities.
  • An episode is a sample from the chain, running from the start state to a terminal state (see the sketch below).
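
A minimal sketch of sampling an episode from a Markov chain; the three-state chain, its names and its transition matrix are made up for illustration, not taken from the lectures:

```python
import numpy as np

# Illustrative 3-state Markov chain; state 2 ("Sleep") is terminal (absorbing).
states = ["Class", "Pub", "Sleep"]
P = np.array([[0.5, 0.2, 0.3],   # P[i, j] = P(S_{t+1} = j | S_t = i)
              [0.4, 0.4, 0.2],
              [0.0, 0.0, 1.0]])  # the terminal state loops back to itself

def sample_episode(P, start=0, terminal=2, seed=0):
    """Sample states from the chain until the terminal state is reached."""
    rng = np.random.default_rng(seed)
    s, episode = start, [start]
    while s != terminal:
        s = rng.choice(len(P), p=P[s])
        episode.append(s)
    return episode

print([states[s] for s in sample_episode(P)])  # e.g. ['Class', 'Pub', ..., 'Sleep']
```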

Markov reward processes #

  • Add a reward function \(R_s=\mathbb{E}(R_{t+1}\mid S_t=s)\)
  • Add a discount factor \(\gamma\in[0,1]\). The present value of future rewards. 0 is near-sighted, 1 is far-sighted. The value of receiving reward \(R\) after \(k+1\) timesteps is \(\gamma^k R\).
  • The return \[ G_t=R_{t+1}+\gamma R_{t+2}+\ldots=\sum_{k=0}^\infty \gamma^k R_{t+k+1}. \]
  • We discount because there is more uncertainty in the future, but mainly to keep the maths easier (the sums converge). It is also a decent cognitive model (supported by behavioural experiments).
  • If all sequences terminate it is possible to use undiscounted Markov reward processes.
  • The value function \(v(s)\) gives the long-term value of state \(s\). Formally, \[ v(s) = \mathbb{E}(G_t\mid S_t=s), \] i.e. the value function is the expected return when starting from state \(s\).
  • The value function can be decomposed into two parts:
    • the immediate reward \(R_{t+1}\), and
    • the discounted value of the successor state \(\gamma v(S_{t+1})\).
  • The Bellman equation: \[ v(s)= \mathbb{E}(R_{t+1}+\gamma v(S_{t+1})\mid S_t=s). \] A tautology essentially, describing what we mean by the value of a state. A “one-step lookahead search”. Can be expressed concisely using matrices: \(\mathbf{v}=\mathcal{R}+\gamma\mathcal{P}\mathbf{v}\). \(\mathcal{R}\) is the reward vector, \(\mathcal{P}\) is the transition matrix.
  • This is a linear equation that can be solved directly (see the sketch at the end of this section): \[ \mathbf{v}=(I-\gamma\mathcal{P})^{-1}\mathcal{R}. \]
  • Complexity is \(O(n^3)\) for \(n\) states, so not usually tractable.
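
A sketch of this direct solution for a small, made-up Markov reward process (the transition matrix, rewards and discount factor are all illustrative):

```python
import numpy as np

gamma = 0.9
# Illustrative 3-state MRP: P[s, s'] transition probabilities and R[s] the expected
# immediate reward on leaving state s; state 2 is absorbing with zero reward.
P = np.array([[0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, 1.0, 0.0])

# Direct solution of the Bellman equation v = R + gamma * P v.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)  # the O(n^3) solve is only feasible for small numbers of states
```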

Markov decision processes #

  • An MDP is a Markov reward process with decisions. We add a finite set of actions \(\mathcal{A}\).

  • The state transition probability matrix \(\mathcal{P}\) now depends on which action we take: \[ \mathcal{P}^a_{ss'}=P(S_{t+1}=s'\mid S_t=s, A_t=a). \]

  • The reward function also depends on the action: \(\mathcal{R}_s^a=\mathbb{E}(R_{t+1}\mid S_t=s, A_t=a)\).

  • A policy \(\pi\) is a distribution over actions given states, \[ \pi(a\mid s)=P(A_t=a\mid S_t=s). \]

    • A policy fully defines the behaviour of an agent.
    • For an MDP, policies depend only on the current state (i.e. not on the history).
    • Maximise the reward from now onwards.
  • For a given MDP and a policy \(\pi\), the state sequence \(S_1,S_2,\ldots\) is a Markov process.

  • The state and reward sequence is a Markov reward process.

  • The state-value function \(v_\pi(s)\) of an MDP is the expected return starting from state \(s\) and then following policy \(\pi\): \[ v_\pi(s)=\mathbb{E}_\pi(G_t\mid S_t=s). \]

  • The action-value function \(q_\pi(s,a)\) is the expected return starting from state \(s\), taking action \(a\), and then following policy \(\pi\): \[ q_\pi(s,a)=\mathbb{E}_\pi(G_t\mid S_t=s, A_t=a). \]

  • The state-value function and action-value function can again be decomposed into immediate reward plus discounted value of successor state with all the conditions on policy and action:

    \begin{align*} v_\pi(s)&=\mathbb{E}_\pi(R_{t+1}+\gamma v_\pi(S_{t+1})\mid S_t=s)\\ q_\pi(s,a)&=\mathbb{E}_\pi(R_{t+1}+\gamma q_\pi(S_{t+1},A_{t+1})\mid S_t=s, A_t=a). \end{align*}

  • There is a similar one-step lookahead decomposition here (sketched in code below): \[ v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)\Bigl(\mathcal{R}_s^a+\gamma\sum_{s'\in \mathcal{S}}\mathcal{P}_{ss'}^av_\pi(s')\Bigr) \] and we can write the same thing for action-values. (Bellman expectation equation).
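
A sketch of the Bellman expectation equation in code: averaging over the policy’s actions collapses a small, made-up MDP into an induced MRP with \(\mathcal{P}^\pi\) and \(\mathcal{R}^\pi\), which can then be solved directly as in the previous section (the arrays and the uniform random policy are illustrative, not from the lectures):

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2

# Illustrative MDP: P[a, s, s'] transition probabilities and R[s, a] expected
# immediate rewards; state 2 is absorbing with zero reward.
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.8, 0.2, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
P[1] = [[0.1, 0.0, 0.9], [0.6, 0.4, 0.0], [0.0, 0.0, 1.0]]
R = np.array([[-1.0, 0.0],
              [-2.0, 1.0],
              [ 0.0, 0.0]])

pi = np.full((n_states, n_actions), 0.5)   # uniform random policy pi(a|s) = 0.5

# Average over actions to get the induced MRP: P^pi and R^pi.
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = np.einsum("sa,sa->s", pi, R)

# Solve the Bellman expectation equation v_pi = R^pi + gamma * P^pi v_pi.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(v_pi)
```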

The optimal value function #

  • The optimal state-value function \(v_*(s)=\max_\pi v_\pi(s)\).
  • The optimal action-value function \(q_*(s,a)=\max_\pi q_\pi(s,a)\). If you know this, you’re done. An MDP is solved when we know the optimal value function.
  • We define a partial ordering over policies by \(\pi\geq \pi'\) if \(v_\pi(s)\geq v_{\pi'}(s)\) for all \(s\).
  • Theorem. For any Markov decision process, there exists an optimal policy \(\pi_*\) that is better than or equal to all other policies. All optimal policies achieve the optimal value function \(v_{\pi_*}(s)=v_*(s)\) and the optimal action-value function \(q_{\pi_*}(s,a)=q_*(s,a)\).
  • We can find an optimal policy by maximising over \(q_*(s,a)\): \[ \pi_*(a\mid s)=1, \quad\mbox{if } a=\mathrm{argmax}_{a\in\mathcal{A}}q_*(s,a) \] and \(0\) otherwise. (This greedy extraction is sketched in code at the end of this subsection.)
  • The optimal value functions are recursively related by the Bellman optimality equations: \[ v_*(s)=\max_a q_*(s,a). \]
  • Similarly for \(q_*\): \[ q_*(s,a)=\mathcal{R}_s^a+\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_*(s'). \]
  • And combining these: \[ v_*(s)=\max_a \Bigl(\mathcal{R}_s^a +\gamma\sum_{s'\in \mathcal{S}}\mathcal{P}^a_{ss'} v_*(s')\Bigr). \]
  • These equations are non-linear, with no closed-form solution. Usually we use iterative solution methods (value iteration, policy iteration, Q-learning, Sarsa).
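
A sketch of the one-step lookahead behind these equations: given \(v_*\), compute \(q_*(s,a)=\mathcal{R}_s^a+\gamma\sum_{s'}\mathcal{P}^a_{ss'}v_*(s')\) and act greedily. The array layout (`P[a, s, s']`, `R[s, a]`) and all the numbers are illustrative, not from the lectures:

```python
import numpy as np

def greedy_policy_from_v(P, R, v, gamma):
    """One-step lookahead: q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * v[s'],
    then pick the maximising action in every state."""
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    return np.argmax(q, axis=1), q

# Tiny illustrative example with 2 states and 2 actions.
P = np.zeros((2, 2, 2))
P[0] = [[1.0, 0.0], [0.0, 1.0]]   # action 0: stay where you are
P[1] = [[0.0, 1.0], [1.0, 0.0]]   # action 1: swap states
R = np.array([[0.0, 1.0],
              [5.0, 0.0]])
v_star = np.array([10.0, 14.0])   # pretend these are the optimal state values

policy, q_star = greedy_policy_from_v(P, R, v_star, gamma=0.9)
print(policy, q_star)
```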

Planning by Dynamic Programming #

Introduction #

  • Dynamic programming: cf. linear programming (a program in the mathematical sense). “Dynamic” means there is a sequential or temporal component to the problem, i.e. steps.
  • A method for solving complex problems by breaking them down into subproblems and then combining the solutions to the subproblems into a solution for the full problem.
  • Needs two properties in the problem:
    • Optimal substructure (optimal solutions from subproblems can be built into an optimal solution for the full problem: principle of optimality, see shortest path).
    • Overlapping subproblems (subproblems occur again and again, so solutions can be cached and reused).
  • MDPs satisfy both of these properties. The Bellman equation gives us the recursive decomposition. Value function gives us the cache and the ability to reuse solutions.
  • Dynamic programming is used for planning in an MDP, assuming full knowledge of the MDP.
  • Two cases:
    1. Prediction: input MDP and policy, output value function.
    2. Control: input MDP, output optimal value function and policy.
  • Used in many other types of algorithms (shortest path, Viterbi, lattice models, string algorithms).

Policy evaluation #

  • You are given an MDP and a policy: how much reward will we get? (Evaluate the policy.)
  • Iterative application of the Bellman expectation equation.
  • \(v_1\to v_2\to \cdots \to v_\pi\). Start with an arbitrary initial value function, then use synchronous backups (sketched in code below):
    • at each iteration \(k+1\),
    • for all states \(s\),
    • update \(v_{k+1}(s)\) from \(v_k(s')\),
    • where \(s'\) is a successor state of \(s\).
  • Prove convergence later (contraction mapping theorem a.k.a Banach fixed-point theorem).
  • Small gridworld example. Very cool.
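
A sketch of iterative policy evaluation with synchronous backups, using the illustrative array layout from the MDP section (`P[a, s, s']`, `R[s, a]`, `pi[s, a]`); none of the names or numbers are from the lectures:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterate the Bellman expectation backup until the value function stops changing."""
    v = np.zeros(R.shape[0])                          # arbitrary initial value function
    while True:
        q = R + gamma * np.einsum("ast,t->sa", P, v)  # lookahead using the old v_k
        v_new = np.einsum("sa,sa->s", pi, q)          # synchronous backup of every state
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

# Tiny illustrative example: 2 states, 2 actions, uniform random policy.
P = np.zeros((2, 2, 2))
P[0] = [[0.9, 0.1], [0.0, 1.0]]
P[1] = [[0.2, 0.8], [0.5, 0.5]]
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
print(policy_evaluation(P, R, pi))
```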

Policy iteration #

  • Start with a given policy \(\pi\).
  • Evaluate the policy using the above policy evaluation method.
  • Improve the policy by acting greedily with respect to this value function (the full loop is sketched in code at the end of this subsection).
  • In the small gridworld example it only took one iteration; in general it can take many but this process does always converge to the optimal policy \(\pi_*\).
  • Relationship to expectation maximisation?
  • Consider a deterministic policy \(a=\pi(s)\). We can improve the policy by acting greedily

\[ \pi'(s) = \mathrm{argmax}_{a\in\mathcal{A}} q_\pi(s, a). \]

  • This does indeed improve the value from any state \(s\) over one step, since

\[ q_\pi(s, \pi'(s)) = \max_{a\in\mathcal{A}}q_\pi(s, a)\geq q_\pi(s, \pi(s))=v_\pi(s). \]

  • Therefore it improves the value function, \(v_{\pi'}(s)\geq v_\pi(s)\) (by a telescoping argument).
  • If improvements stop, i.e. we achieve equality, then the Bellman optimality equation has been satisfied by definition, and hence we have the optimal policy.
  • Do we actually need to converge to the optimal policy? Maybe use stopping conditions (e.g. stop when the value function changes by less than some \(\epsilon\), or after a fixed number of iterations).
  • In the extreme, update the policy after just one sweep of evaluation and repeat: this is value iteration.
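
A sketch of policy iteration, alternating policy evaluation with greedy improvement until the policy stops changing (same illustrative array layout as before: `P[a, s, s']`, `R[s, a]`):

```python
import numpy as np

def evaluate(P, R, pi, gamma, theta=1e-8):
    """Iterative policy evaluation, as in the previous subsection."""
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v_new = np.einsum("sa,sa->s", pi, q)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)            # arbitrary deterministic start
    while True:
        pi = np.eye(n_actions)[policy]                # one-hot pi(a|s) for the current policy
        v = evaluate(P, R, pi, gamma)                 # policy evaluation
        q = R + gamma * np.einsum("ast,t->sa", P, v)  # q_pi(s, a)
        new_policy = np.argmax(q, axis=1)             # greedy improvement
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy

# Tiny illustrative example.
P = np.zeros((2, 2, 2))
P[0] = [[0.9, 0.1], [0.0, 1.0]]
P[1] = [[0.2, 0.8], [0.5, 0.5]]
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, R))
```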

Value iteration #

  • Any optimal policy can be subdivided into two components: an optimal first action \(A_*\) followed by an optimal policy from the successor state \(S'\) (principle of optimality).
  • Formulate into a theorem: a policy \(\pi(a\mid s)\) achieves the optimal value from state \(s\) if and only if for any state \(s’\) reachable from \(s\), \(\pi\) achieves the optimal value from state \(s’\) onwards.
  • Use this to build a value iteration algorithm (“backwards induction”). If we know the solution to the subproblems \(v_*(s')\), then the solution \(v_*(s)\) can be found by one-step lookahead.
  • The idea of value iteration is to apply these updates iteratively.
  • Intuition: start with final rewards and work backwards. Even works with “loopy, stochastic” MDPs.
  • \(v_1\to v_2\to \cdots \to v_*\). Start with an arbitrary initial value function, then iteratively apply the Bellman optimality backup (sketched in code below):
    • at each iteration \(k+1\),
    • for all states \(s\),
    • update \(v_{k+1}(s)\) from \(v_k(s')\).
  • Convergence to \(v_*\) will be proven later (Banach)
  • Unlike policy iteration, there is no explicit policy: intermediate value functions may not be the value function of any real policy.
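
A sketch of value iteration, applying the Bellman optimality backup directly with no explicit policy until the very end (same illustrative array layout: `P[a, s, s']`, `R[s, a]`):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Iterate v_{k+1}(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) v_k(s') ]."""
    v = np.zeros(R.shape[0])                          # arbitrary initial value function
    while True:
        q = R + gamma * np.einsum("ast,t->sa", P, v)  # one-step lookahead
        v_new = np.max(q, axis=1)                     # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # Only at the end is a (greedy) policy read off from the lookahead values.
    return v_new, np.argmax(q, axis=1)

# Tiny illustrative example.
P = np.zeros((2, 2, 2))
P[0] = [[0.9, 0.1], [0.0, 1.0]]
P[1] = [[0.2, 0.8], [0.5, 0.5]]
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(value_iteration(P, R))
```
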
| Problem    | Bellman equation                                          | Algorithm                   |
| ---------- | --------------------------------------------------------- | --------------------------- |
| Prediction | Bellman expectation equation                               | Iterative policy evaluation |
| Control    | Bellman expectation equation + greedy policy improvement   | Policy iteration            |
| Control    | Bellman optimality equation                                | Value iteration             |
  • Toy example of a larger Gridworld.
  • These ideas are actually quite intuitive, the notation makes it all a little cumbersome but it’s quite natural once you think it through. Silver’s explanation of the above table is a very coherent summary of this complicated-feeling lecture.
  • Can also do similar things with the action value function \(q\); these give rise to higher-complexity algorithms but they are definitely useful in the general RL problem.