A task is called episodic if it ends at some time step \(T\) when the game enters a terminal state. The episode is the sequence \(S_0, A_0, R_1, S_1, A_1, \ldots, R_T, S_T\). The opposite of an episodic task is a continuing task.
Reward hypothesis: All goals can be framed as the maximization of expected cumulative reward.
Markov Decision Process
The state space \(\mathcal{S}\) is the set of all nonterminal states. In continuing tasks this is equivalent to the set of all states. In episodic tasks \(\mathcal{S}^+\) is the set of all states. \(\mathcal{A}\) is the action space. \(\mathcal{A}(s)\) is the set of actions available in state \(s\). \(\mathcal{R}\) is the set of possible rewards.
The expected discounted return is \(G_t = \sum_{k=1}^\infty \gamma^{k-1} R_{t+k}\).
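A quick sanity check in Python (made-up rewards and \(\gamma = 0.9\)):

```python
# Minimal sketch: compute G_t for an episode given as a list of rewards [R_{t+1}, R_{t+2}, ...].
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):  # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```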
In an MDP the one-step dynamics are specified by:
\(p(s', r \mid s, a) = \mathbb{P}(S_{t+1} = s', R_{t+1}=r \mid S_t=s \land A_t = a)\)
A deterministic policy is a function \(\pi \colon \mathcal{S} \to \mathcal{A}\).
A stochastic policy is a function \(\pi \colon \mathcal{S} \times \mathcal{A} \to [0, 1]\).
The state-value function is a function \(v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]\). The optimal state-value function is denoted \(v_\ast\).
A policy \(\pi'\) dominates a policy \(\pi\), denoted as \(\pi' \ge \pi\), if and only if
$$
\forall s \in \mathcal{S}\colon \, v_{\pi'}(s) \ge v_{\pi}(s)
$$
An optimal policy \(\pi_\ast\) satisfies \(\pi_\ast \ge \pi\) for all policies \(\pi\). Optimal policies are guaranteed to exist but may not be unique.
The action-value function is a function \(q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]\) and yields the expected return if the agent starts in state \(s\), takes action \(a\), and then follows the policy \(\pi\). The optimal action-value function is denoted \(q_\ast\). An action-value function can simply be represented as a state \(\times\) action matrix.
Given an optimal action-value function \(q_\ast\) we can find an optimal policy by setting \(\pi_\ast(s) = \arg \max_{a\in \mathcal{A}(s)} q_\ast(s, a)\).
An \(\epsilon\)-greedy policy is a policy of the form
$$
\pi(a \mid s) = \begin{cases}
1 - \epsilon + \epsilon/|\mathcal{A}(s)|, & \text{if } a \text{ maximizes } Q(s,a)\\
\epsilon/|\mathcal{A}(s)|, & \text{else}
\end{cases}
$$
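A minimal sketch of \(\epsilon\)-greedy action selection from a Q-table; the Q-table layout (a mapping from states to arrays of action values) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy_action(Q, state, n_actions, epsilon):
    """Greedy action with probability 1 - epsilon, uniformly random action otherwise."""
    if rng.random() > epsilon:
        return int(np.argmax(Q[state]))   # exploit
    return int(rng.integers(n_actions))   # explore
```

Sampling this way gives exactly the probabilities above: the maximizing action is chosen with probability \(1 - \epsilon + \epsilon/|\mathcal{A}(s)|\), every other action with probability \(\epsilon/|\mathcal{A}(s)|\).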
Control Problem: Estimate the optimal policy.
The Monte Carlo control method alternates between policy evaluation and policy improvement to converge to the optimal policy \(\pi_\ast\).
We can use first-visit MC or every-visit MC, which estimate \(q_\pi(s, a)\) as the average of the returns following the first visit to \((s, a)\) or following all visits to \((s, a)\), respectively. Note: every-visit MC is biased whereas first-visit MC is not; every-visit MC has a lower MSE at first, but first-visit MC is better in the limit.
In order for MC control to converge to the optimal policy the Greedy in the Limit with Infinite Exploration (GLIE) conditions must be met:
- every \((s, a)\) pair must be visited infinitely many times
- the policy converges to the greedy policy based on the action-value estimate \(Q\)
When specifying an \(\epsilon\)-greedy policy both of the above conditions are satisfied if \(\forall i\colon \epsilon_i > 0\) and \(\epsilon_i \to 0\).
Two variations of MC control:
- Incremental mean: update the Q-table after every episode using a running arithmetic mean
- Constant alpha: use exponential averaging (a constant step size \(\alpha\)) instead of the arithmetic mean (see the sketch below)
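A minimal sketch of the constant-alpha update at the end of an episode; the episode layout (a list of (state, action, reward) tuples) and variable names are assumptions:

```python
def constant_alpha_mc_update(Q, episode, alpha=0.02, gamma=1.0):
    """Every-visit constant-alpha MC: nudge Q(s, a) towards the observed return."""
    states, actions, rewards = zip(*episode)
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):   # discounted return following each time step
        G = rewards[t] + gamma * G
        returns[t] = G
    for t, (s, a) in enumerate(zip(states, actions)):
        Q[s][a] += alpha * (returns[t] - Q[s][a])
    return Q
```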
Bellman Equations
The four Bellman equations show that value functions satisfy certain recursive relationships.
The Bellman Expectation Equation for \(v_\pi\) expresses the value of a state in the following way:
$$
v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]
$$
The expectation can be rewritten as
$$
v_\pi(s) = \sum_{s' \in \mathcal{S}^+,\, r \in \mathcal{R}} p(s', r \mid s, \pi(s)) \left(r + \gamma v_\pi(s')\right)
$$
for a deterministic policy \(\pi\) or as
$$
v_\pi(s) = \sum_{s' \in \mathcal{S}^+,\, r \in \mathcal{R},\, a \in \mathcal{A}} \pi(a \mid s)\, p(s', r \mid s, a) \left(r + \gamma v_\pi(s')\right)
$$
for a stochastic policy \(\pi\).
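A minimal sketch of iterative policy evaluation based on the stochastic-policy form of the equation; the dynamics are assumed to be given as `p[s][a] -> list of (prob, next_state, reward, done)`, in the style of the Gym toy-text environments:

```python
import numpy as np

def policy_evaluation(p, pi, n_states, gamma=0.9, theta=1e-8):
    """Repeatedly apply the Bellman expectation equation until V converges.
    pi[s][a] is the probability of taking action a in state s."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = 0.0
            for a, action_prob in enumerate(pi[s]):
                for prob, s_next, r, done in p[s][a]:
                    v += action_prob * prob * (r + gamma * V[s_next] * (not done))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```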
// TODO: add the other three Bellman equations, sections 3.5 and 3.6 of the textbook
Temporal Difference Learning
While MC control methods update their estimates only at the end of each episode, TD learning updates its estimates after every time step.
Sarsa, Sarsamax, and Expected Sarsa are all guaranteed to converge to the optimal action-value function \(q_\ast\) given that \(\alpha\) is sufficiently small and the GLIE conditions are met. In practice it is common to ignore the GLIE conditions and still recover an optimal policy.
Sarsa and Expected Sarsa are on-policy methods whereas Q-learning (Sarsamax) is an off-policy method. On-policy methods generally have better online performance.
Initializing Q-values optimistically (to large values) can improve performance; see this paper.
Sarsa
The Sarsa(0) update rule is
$$
Q(S_t, A_t) \leftarrow (1-\alpha)\, Q(S_t, A_t) + \alpha \left(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})\right)
$$
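A minimal sketch of this update for a tabular Q (names are illustrative):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # TD target uses the action A_{t+1} actually selected by the current policy
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * td_target  # same as Q += alpha * (target - Q)
    return Q
```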
Q-Learning / Sarsamax
Q-learning uses \(\max_{a \in \mathcal{A}} Q(S_{t+1}, a)\) instead of \(Q(S_{t+1}, A_{t+1})\), which means that, unless the action the agent takes in the next step happens to achieve the maximum, the update does not depend on that action at all. (The TD target is effectively based on the greedy state value of \(S_{t+1}\) rather than on the action-value of the action actually selected.)
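A minimal sketch of the Sarsamax update (assuming `Q[s]` is an array of action values):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # TD target uses the greedy value of the next state, not the action actually taken
    td_target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q
```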
See this paper for proof that Q-learning converges.
Expected Sarsa
Expected Sarsa replaces the max term with an expectation based on the probability that each action gets selected under the policy.
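A minimal sketch assuming an \(\epsilon\)-greedy behaviour policy whose probabilities are recomputed inline:

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, epsilon=0.1):
    n_actions = len(Q[s_next])
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1 - epsilon          # epsilon-greedy policy probabilities
    td_target = r + gamma * np.dot(probs, Q[s_next])    # expectation over next actions
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q
```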
See this paper for more details on Expected Sarsa.
Continuous Spaces
To apply RL in continuous spaces we can use
- Uniform-grid based discretization (a minimal sketch follows this list)
- Discretization via Tile Coding
  - generalizes better than uniform grid
- Discretization via Coarse Coding
  - sparser than tile coding
  - can adjust degree of generalization vs. resolution (and change smoothness of the Q-function)
  - can use radial basis functions to implement continuous tile membership
- Function approximation (for V/Q functions)
  - Linear function approximation (possibly with kernel functions like radial basis functions)
  - Non-linear function approximation (neural nets)
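A minimal sketch of uniform-grid discretization (grid bounds and bin counts are made-up example values):

```python
import numpy as np

def create_uniform_grid(low, high, bins):
    """One array of interior split points per state dimension."""
    return [np.linspace(l, h, b + 1)[1:-1] for l, h, b in zip(low, high, bins)]

def discretize(state, grid):
    """Map a continuous state to a tuple of integer bin indices."""
    return tuple(int(np.digitize(x, splits)) for x, splits in zip(state, grid))

grid = create_uniform_grid(low=[-1.0, -5.0], high=[1.0, 5.0], bins=[10, 10])
print(discretize([0.25, -3.7], grid))  # (6, 1)
```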
Deep Q-Learning
DeepMind Atari: input is 84x84 grayscale pixels × 4 stacked frames; a convnet outputs one Q-value per action; the replay memory holds 1 million frames; each selected action is repeated for k=4 time steps.
Deep Q-learning can suffer from two potentially harmful correlations:
- Correlation between sequential samples
- Correlation between Q-values for similar (state, action) pairs
which are addressed by experience replay and fixed Q-targets, respectively.
Experience Replay: keeping a replay buffer and training on samples randomly drawn from this buffer allows us to learn from each (possibly computationally expensive) sample multiple times. Furthermore the random sampling prevents oscillations or catastrophic divergence due to correlation between sequential samples.
Experience replay can be seen as reducing the reinforcement learning problem to supervised learning.
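A minimal sketch of a replay buffer (the transition layout is an assumption):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer; the oldest transitions are discarded once capacity is reached."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)

    def add(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size=64):
        # uniform random sampling breaks the correlation between sequential samples
        return random.sample(self.memory, batch_size)
```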
Fixed Q-targets address the issue that the Q-value estimate for a (state, action) pair is updated using another Q-value estimate. This is the Q-learning update rule for a function approximator \(\hat q\) with weights \(w\):
$$
\Delta w = \alpha \left(R + \gamma \max_a \hat q(S', a, w) - \hat q(S, A, w)\right) \nabla_w \hat q(S, A, w)
$$
In particular the update of \(w\) depends on the TD target, which in turn depends on \(w\). Instead we want a fixed Q-target parametrized by a separate set of weights \(w^-\) that is frozen and only updated (possibly via a soft update) every few steps:
$$
\Delta w = \alpha \left(R + \gamma \max_a \hat q(S', a, w^-) - \hat q(S, A, w)\right) \nabla_w \hat q(S, A, w)
$$
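A minimal PyTorch-style sketch of one update step with a frozen target network; the network/optimizer objects, tensor shapes (actions and dones as column tensors), and the soft-update factor `tau` are assumptions; the original DQN paper instead copies the weights wholesale every C steps:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_local, q_target, optimizer, batch, gamma=0.99):
    """One gradient step on the local network; the target network stays frozen."""
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        # TD target computed with the frozen parameters w^-
        q_next = q_target(next_states).max(dim=1, keepdim=True)[0]
        targets = rewards + gamma * q_next * (1 - dones)
    q_expected = q_local(states).gather(1, actions)
    loss = F.mse_loss(q_expected, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def soft_update(q_local, q_target, tau=1e-3):
    """Slowly blend the local weights into the target weights."""
    for p_t, p_l in zip(q_target.parameters(), q_local.parameters()):
        p_t.data.copy_(tau * p_l.data + (1 - tau) * p_t.data)
```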
See this video for a more detailed description.
This is the paper that introduced DQN.
Double Q-Learning
Double Q-learning (or Double DQN = DDQN) addresses the issue of action-value overestimation that can occur in basic DQN due to the unstable TD target:
$$
\gamma\max_a \hat q(S', a, w)
$$
Particularly in the early stages of training, the action values in a given state can fluctuate wildly; taking the maximum of these already noisy values amplifies the fluctuations and makes the network systematically overestimate Q-values.
For the exact reasoning see this paper.
Since the TD target can be re-expressed in the following way,
$$
R + \gamma \hat q(S', \arg \max_a \hat q(S', a, w), w)
$$
one way to reduce the overestimation is to use a different Q-network / set of weights for each of the two \(\hat q\) calls above (i.e. one network for selecting the maximizing action and one network for evaluating that action's value).
When using fixed Q-targets we can simply use the \(w^-\) network for this purpose; the TD target is then:
$$
R + \gamma \hat q(S', \arg \max_a \hat q(S', a, w), w^-)
$$
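A minimal sketch of computing this target with two sets of weights (tensor shapes are assumptions):

```python
import torch

def double_dqn_targets(q_local, q_target, rewards, next_states, dones, gamma=0.99):
    """Select the best next action with the online weights w,
    evaluate that action with the frozen weights w^-."""
    with torch.no_grad():
        best_actions = q_local(next_states).argmax(dim=1, keepdim=True)
        q_next = q_target(next_states).gather(1, best_actions)
        return rewards + gamma * q_next * (1 - dones)
```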
For a detailed description of DDQN see this paper.
Prioritized Experience Replay
The goal of prioritized experience replay is to sample rare or more important experiences more often. It's similar to techniques for dealing with class imbalances in supervised learning.
A transition's priority is either proportional to the TD error, \(p_i = |\delta_i| + \epsilon\), or rank-based, \(p_i = \frac{1}{\text{rank}(i)}\), where \(\text{rank}(i)\) is the rank of transition \(i\) when transitions are sorted by \(|\delta_i|\). The probability of sampling a transition from the replay buffer is \(P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}\), where \(\alpha\) controls the degree of prioritization.
For an efficient implementation of the replay buffer and interesting extensions of PER see the paper above.
Note that "the estimation of the expected value with stochastic updates relies on those updates corresponding to the same distribution as its expectation". For this reason additional importance-sampling weights (and an additional hyperparameter \(\beta\)) have to be introduced into the update equation.
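A minimal (deliberately naive, O(n)) sketch of proportional prioritized sampling with importance-sampling weights; the paper uses a sum-tree for efficiency, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng()

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4):
    """priorities[i] = |delta_i| + eps. Returns sampled indices and their IS weights."""
    p = np.asarray(priorities) ** alpha
    probs = p / p.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    # importance-sampling weights correct for the non-uniform sampling distribution
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights /= weights.max()   # normalize by the max weight for stability
    return idx, weights
```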
Dueling DQN
In dueling DQN the state-value and advantage functions are estimated by separate streams within a single network and then combined to obtain the action-value estimates. Additional constraints are imposed to ensure that both streams model their respective targets. (Somewhat similar to residual nets, because the default advantage of 0 is easy to represent?)
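A minimal sketch of the aggregation step, using the common zero-mean constraint on the advantage stream (layer sizes are made-up):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Combine a state-value stream V(s) and an advantage stream A(s, a)
    into action values Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, in_features=64, n_actions=4):
        super().__init__()
        self.value = nn.Linear(in_features, 1)
        self.advantage = nn.Linear(in_features, n_actions)

    def forward(self, x):
        v = self.value(x)                           # (batch, 1)
        a = self.advantage(x)                       # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # broadcast V across actions
```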
The dueling DQN paper uses DDQN with (and without) rank-based prioritized experience replay and uses the Expected Sarsa update rule.
Policy-Based Methods
Policy-based methods are simpler, allow us to learn truly stochastic policies, and are better suited for continuous action spaces (in value-based methods we would need to find the continuous action \(a\) that maximizes \(Q(s, a)\)).
A trajectory is a state-action sequence of the form \(\tau = (s_0, a_0, s_1, a_1, \ldots, s_H, a_H, s_{H+1})\), where \(H\) is the time horizon. The return over a trajectory \(\tau\) is denoted by \(R(\tau) = r_1 + r_2 + \ldots + r_H + r_{H+1}\).
The expected return of a policy with parameters \(\theta\) is \(U(\theta) = \sum_\tau P(\tau \mid \theta) R(\tau)\).
Maximizing expected return over trajectories lets us search for optimal policies in both episodic and continuing tasks.
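In practice the sum over all trajectories is intractable, so \(U(\theta)\) is usually estimated by averaging the returns of sampled trajectories; a minimal sketch, where the `policy` and `env` interfaces (env returning `(state, reward, done)`) are assumptions:

```python
def estimate_expected_return(policy, env, n_trajectories=20, horizon=1000):
    """Monte Carlo estimate of U(theta): average return over sampled trajectories."""
    returns = []
    for _ in range(n_trajectories):
        state, total = env.reset(), 0.0
        for _ in range(horizon):
            action = policy(state)                   # sample a ~ pi_theta(. | s)
            state, reward, done = env.step(action)   # assumed environment interface
            total += reward
            if done:
                break
        returns.append(total)
    return sum(returns) / len(returns)
```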