<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://omkar-ranadive.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://omkar-ranadive.github.io/" rel="alternate" type="text/html" /><updated>2026-04-11T16:50:22-07:00</updated><id>https://omkar-ranadive.github.io/feed.xml</id><title type="html">Omkar Ranadive</title><subtitle>personal description</subtitle><author><name>Omkar Ranadive</name><email>omkar.ranadive@u.northwestern.edu</email></author><entry><title type="html">Maximum Likelihood, Fisher Information, Cramer Rao Inequality</title><link href="https://omkar-ranadive.github.io/posts/stats-ml" rel="alternate" type="text/html" title="Maximum Likelihood, Fisher Information, Cramer Rao Inequality" /><published>2020-09-09T00:00:00-07:00</published><updated>2020-09-09T00:00:00-07:00</updated><id>https://omkar-ranadive.github.io/posts/maximum_likelihood</id><content type="html" xml:base="https://omkar-ranadive.github.io/posts/stats-ml"><![CDATA[<hr />

<p><strong>Credits: All images used in this post are courtesy of <a href="https://www.physics.northwestern.edu/people/faculty/core-faculty/michael-schmitt.html">Prof. Michael Schmitt</a> and <a href="https://www.youtube.com/user/SpartacanUsuals">Ben Lambert</a></strong></p>
<hr />

<p><strong>Maximum likelihood:</strong></p>

<p>In simple terms, maximum likelihood estimates a distribution by choosing
the parameters of a likelihood function such that the likelihood of the
observed data under that distribution is maximized.</p>

<p><strong>For example:</strong></p>

<p><img src="../../images/stats/Maximum_Likelihood/media/media/image1.png" alt="" /></p>

<p>The red points are the observed data samples. If we estimate the
distribution using the parameter alpha, only the middle diagram shows a
distribution that fits the data well (i.e., the joint probability of all
the observed samples under that distribution is high).</p>

<p>The likelihood function can be defined as follows:</p>

<p><img src="../../images/stats/Maximum_Likelihood/media/media/image2.png" alt="" /></p>

<p><strong>Likelihood is not a probability: it is a function of the parameters
for fixed data, so its integral over the parameters has no probabilistic
interpretation.</strong></p>

<p>If we assume that the samples are i.i.d., then the likelihood function
can be written as a product of the individual probabilities:</p>

<p><img src="../../images/stats/Maximum_Likelihood/media/media/image3.png" alt="" /></p>

<p>Here f(x<sub>i</sub>, theta) dx is the probability of observing a sample
in the interval from x<sub>i</sub> to x<sub>i</sub> + dx.</p>

<p><strong>How to maximize the likelihood?</strong></p>

<p><img src="../../images/stats/Maximum_Likelihood/media/media/image4.png" alt="" /></p>

<p>To maximize, we can take the first order derivative of the likelihood
function and set it to 0. In practice it’s more convenient to minimize
the negative log likelihood (NLL) instead: directly multiplying many
probabilities can lead to extremely small values, whereas taking the log
turns the product into a sum that is numerically well behaved.</p>
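
<p>A minimal sketch of minimizing the negative log likelihood, assuming an
exponential density f(x; rate) = rate * exp(-rate * x). The data, the
candidate grid and the function names are made up for illustration; a
real fit would use a proper optimizer instead of a grid.</p>

```python
import math

def negative_log_likelihood(rate, samples):
    """NLL of i.i.d. samples under the density f(x; rate) = rate * exp(-rate * x)."""
    return -sum(math.log(rate) - rate * x for x in samples)

def fit_rate_by_grid(samples, candidates):
    """Return the candidate rate with the smallest NLL (crude stand-in for an optimizer)."""
    return min(candidates, key=lambda rate: negative_log_likelihood(rate, samples))

samples = [0.5, 1.5, 1.0, 2.0, 1.0]               # hypothetical observations
candidates = [k / 1000 for k in range(1, 5001)]   # rates 0.001 .. 5.000
rate_ml = fit_rate_by_grid(samples, candidates)

# Setting the derivative of the NLL to 0 gives rate = n / sum(x) for this density.
closed_form = len(samples) / sum(samples)
print(rate_ml, closed_form)
```

<p>Summing log-densities instead of multiplying densities is what keeps
the computation stable when there are many samples.</p>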

<p><strong>Taylor series expansion of the NLL:</strong></p>

<p><img src="../../images/stats/Maximum_Likelihood/media/media/image5.png" alt="" /></p>

<p>The first term, -lnL(theta_ml), is the minimum of the NLL: by definition
of maximum likelihood, theta_ml maximizes L, so the negative log of L is
smallest there. The second term vanishes because the first order
derivative is 0 at theta_ml.</p>

<p>So, we are left with the following:</p>

<p><img src="../../images/stats/Maximum_Likelihood/media/media/image6.png" alt="" /></p>

<p><strong>So, when N is large, the negative log likelihood near theta_ml can
be approximated by a parabola.</strong></p>

<p><strong>The sharper the parabola, the lower the variance of the estimate.</strong></p>

<p><strong>But why does the second order derivative represent the curvature?</strong></p>

<p><img src="../../images/stats/Maximum_Likelihood/media/media/image7.png" alt="" /></p>

<p>To understand it intuitively, first let’s look at what the second order
derivative means. It basically means taking the derivative of the
gradient; i.e. it represents the rate of change of gradient.</p>

<p>From the figure above we can see that the pink distribution has a faster
rate of change than the yellow one (the curves are sharper for the pink).
Also, the pink distribution has a lower variance, as it is sharper than
the yellow one. <strong>So, we can roughly say that the variance of the
estimator is inversely proportional to the second order derivative of
the negative log likelihood.</strong></p>
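
<p>To make the link between curvature and variance concrete, here is a
small sketch that assumes Gaussian samples with a known sigma and
estimates the second derivative of the NLL numerically. The data and the
helper names are hypothetical.</p>

```python
def nll_gaussian_mean(mu, samples, sigma):
    """NLL (up to an additive constant) of i.i.d. Gaussian samples, as a function of mu."""
    return sum((x - mu) ** 2 for x in samples) / (2 * sigma ** 2)

def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of the second derivative of f at x."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

samples = [4.8, 5.1, 5.3, 4.9, 5.4, 4.5]   # hypothetical data
sigma = 0.5                                 # assumed to be known
mu_ml = sum(samples) / len(samples)         # ML estimate of the mean

curvature = second_derivative(lambda mu: nll_gaussian_mean(mu, samples, sigma), mu_ml)
variance_estimate = 1 / curvature           # sharper parabola -> smaller variance
print(curvature, variance_estimate)         # close to N / sigma^2 and sigma^2 / N
```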

<p><strong>Fisher Information</strong></p>

<p>Fisher Information tells us how much information we can get about the
parameter we are trying to estimate given a data sample X. Basically, it
tells us the expected amount of information which sample X carries about
the parameter theta.</p>

<p><img src="../../images/stats/Maximum_Likelihood/media/media/image8.png" alt="" /></p>

<p><img src="../../images/stats/Maximum_Likelihood/media/media/image9.png" alt="" /></p>

<p>So, the variance is proportional to the inverse of Fisher information.
This idea is formalized in the form of Cramer Rao Inequality as follows:</p>

<p><img src="../../images/stats/Maximum_Likelihood/media/media/image10.png" alt="" /></p>
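
<p>As a quick sanity check of the inequality, the sketch below assumes n
i.i.d. Bernoulli(p) samples (a model chosen here only for illustration).
It computes the Fisher information as the expected squared derivative of
the log-probability and compares the resulting bound with the variance of
the sample mean.</p>

```python
def fisher_information_bernoulli(p):
    """Expected squared score of a single Bernoulli(p) observation."""
    score_if_one = 1 / p            # d/dp log(p), used when x = 1
    score_if_zero = -1 / (1 - p)    # d/dp log(1 - p), used when x = 0
    return p * score_if_one ** 2 + (1 - p) * score_if_zero ** 2

p, n = 0.3, 50                                        # hypothetical parameter and sample size
info_per_sample = fisher_information_bernoulli(p)     # equals 1 / (p * (1 - p))
cramer_rao_bound = 1 / (n * info_per_sample)          # bound for unbiased estimators
variance_of_sample_mean = p * (1 - p) / n             # variance of the ML estimator
print(cramer_rao_bound, variance_of_sample_mean)
```

<p>In this particular model the ML estimator (the sample mean) attains
the bound exactly.</p>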

<p><strong>In brief, the Cramer Rao Inequality says that the variance of an unbiased estimator cannot be
lower than the inverse of the Fisher Information.</strong></p>]]></content><author><name>Omkar Ranadive</name><email>omkar.ranadive@u.northwestern.edu</email></author><category term="statistics" /><category term="statistics" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Lecture 10 - Applying RL to Games [Notes]</title><link href="https://omkar-ranadive.github.io/posts/rl-l10-ds" rel="alternate" type="text/html" title="Lecture 10 - Applying RL to Games [Notes]" /><published>2020-05-05T00:00:00-07:00</published><updated>2020-05-05T00:00:00-07:00</updated><id>https://omkar-ranadive.github.io/posts/rl-l10-ds</id><content type="html" xml:base="https://omkar-ranadive.github.io/posts/rl-l10-ds"><![CDATA[<hr />
<p><strong>Lecture Details</strong></p>
<ul>
  <li><strong>Title:</strong> Applying RL to Games</li>
  <li><strong>Description:</strong> The lecture notes are based on David Silver’s lecture video.</li>
  <li><strong>Video link:</strong> <a href="https://www.youtube.com/playlist?list=PLbPhAbAhvjUyrKlhnLEMyNmiF72ABB3Zh" target="_blank">RL Course by David Silver - Lecture 10</a></li>
  <li><strong>Lecture Slides:</strong>  <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html" target="_blank">Slides</a></li>
</ul>

<p><strong>Credits: All images used in this post are courtesy of David Silver</strong></p>

<hr />

<p><strong>Why study games?</strong></p>

<p><img src="../../images/rl/l10-ds/media/image1.png" alt="" /></p>

<p>Games have complicated rules and require logic to play. Playing games
optimally displays intelligence and acts as a test of IQ.</p>

<p>Games are called the Drosophila of AI. Drosophila is the fruit fly,
i.e., an insect used widely in biology for conducting experiments.</p>

<p>Therefore, as games require intelligence and logic, they are a good
testing ground for intelligent agents.</p>

<p><strong>Optimality in games:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image2.png" alt="" /></p>

<p>There are two main types of optimality in games.</p>

<ul>
  <li>
    <p><strong>Best response optimal policy</strong> is the optimal policy against some
fixed opponent policy. Example: consider a game of Rock Paper Scissors in
which the opponent always plays scissors. Then the best response would be
to always play rock. <strong>But remember that this best response is
w.r.t. this specific opponent only.</strong></p>
  </li>
  <li>
    <p><strong>Nash equilibrium</strong> is a joint policy in which every player
plays a best response to all the other players, so no player can gain by
deviating on their own. One way to approach it is self-play: agents play
against each other, each trying to give the best possible response to
every other agent. If this process of mutual improvement converges, the
point it reaches is a Nash equilibrium.</p>
  </li>
</ul>

<p>Note: The point of Nash equilibrium may or may not be unique.</p>
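
<p>The Rock Paper Scissors example can be sketched in a few lines. The
payoff matrix and helper names below are assumptions made for
illustration: the best response to an opponent who always plays scissors
is rock, while against the uniform policy every action is equally good,
which is why the uniform policy is an equilibrium of this game.</p>

```python
ACTIONS = ["rock", "paper", "scissors"]
# PAYOFF[i][j] is our reward when we play ACTIONS[i] and the opponent plays ACTIONS[j].
PAYOFF = [
    [0, -1, 1],    # rock: loses to paper, beats scissors
    [1, 0, -1],    # paper: beats rock, loses to scissors
    [-1, 1, 0],    # scissors: loses to rock, beats paper
]

def expected_payoffs(opponent_policy):
    """Expected reward of each of our actions against a fixed mixed opponent policy."""
    return [sum(p * r for p, r in zip(opponent_policy, row)) for row in PAYOFF]

def best_response(opponent_policy):
    """Action with the highest expected reward against the given opponent policy."""
    values = expected_payoffs(opponent_policy)
    return ACTIONS[values.index(max(values))]

always_scissors = [0.0, 0.0, 1.0]
uniform = [1 / 3, 1 / 3, 1 / 3]
print(best_response(always_scissors))   # rock
print(expected_payoffs(uniform))        # all three values are 0 up to rounding
```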

<p><strong>Single agent vs self-playing agents:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image3.png" alt="" /></p>

<p>In the single-agent problem, we consider the other players as part of the
environment. Hence, this becomes a normal RL problem where the
environment outputs some states and rewards. The best response policy is
the solution to this single-agent problem.</p>

<p>To find a Nash equilibrium we use self-play between agents. That is, agents
play one another and eventually improve. Note that a single agent
may play against itself rather than having multiple agents play against
each other.</p>

<p><strong>Two player zero-sum games:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image4.png" alt="" /></p>

<p>A zero-sum game is a game in which two players are trying to defeat each
other and hence have opposite rewards, i.e., a positive reward for player
1 is an equally large negative reward for player 2.</p>

<p><strong>Perfect and imperfect information games:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image5.png" alt="" /></p>

<p><strong>Note: There is no single perfect way of solving games.</strong>
<strong>Reinforcement Learning should be used with Searching to get best
possible results.</strong> <strong>That is, searching is an important process in game
agents and one shouldn’t solely rely on reinforcement learning.</strong></p>

<p><strong>Minimax:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image6.png" alt="" /></p>

<p>A minimax policy follows the minimax algorithm: a tree is built in which
one player picks the move that maximizes the value and the other picks
the move that minimizes it. <strong>A minimax policy is a Nash equilibrium.</strong></p>

<p><strong>Minimax search example:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image7.png" alt="" /></p>

<p>We can see that the values at the leaf nodes were obtained first and then
propagated upwards all the way to the root.</p>
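
<p>A minimal sketch of this propagation, assuming the game tree is given
as nested lists whose leaves are numbers (a representation picked here
for brevity, not the one used in the lecture):</p>

```python
def minimax(node, maximizing):
    """Return the minimax value of a game tree given as nested lists with numeric leaves."""
    if not isinstance(node, list):       # leaf: its value is already known
        return node
    child_values = [minimax(child, not maximizing) for child in node]
    return max(child_values) if maximizing else min(child_values)

# Hypothetical depth-2 tree: the root is a max node, its children are min nodes.
tree = [[3, 5], [2, 9], [0, 7]]
root_value = minimax(tree, maximizing=True)
print(root_value)   # the min nodes evaluate to 3, 2 and 0, so the root gets 3
```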

<p><strong>Problem with minimax:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image8.png" alt="" /></p>

<p>Exhaustive tree search is impractical because the size of the search tree
grows exponentially. Hence, function approximators like neural networks
are used to approximate the value. That is, minimax search is run only as
deep as is computationally feasible, and the approximate value function
is used to estimate the values at the leaf nodes of that truncated
search.</p>

<p><strong>Using pre-defined features: Binary Linear value functions</strong></p>

<p><img src="../../images/rl/l10-ds/media/image9.png" alt="" /></p>

<p>Games have many pre-defined features which are used to calculate
the “goodness” of a state. As we can see above, arrangements can be
shown using a binary vector where an entry is 1 if the
designated piece is actually at that place on the board and 0 otherwise.
Each such feature is given a weight. Taking the dot product of the
features and the weights gives a single scalar value telling us the
goodness (value function) of that state.
<strong>Note</strong>: Here negative weights indicate harmful features (i.e., the
opponent’s pieces).</p>
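
<p>A toy version of such a binary linear value function. The features and
weights below are invented for illustration and are not the ones on the
slide:</p>

```python
def linear_value(features, weights):
    """Value of a state as the dot product of binary features and their weights."""
    return sum(f * w for f, w in zip(features, weights))

# Hypothetical features: [our rook, our bishop, our pawn, opponent rook, opponent pawn]
features = [1, 0, 1, 1, 0]     # 1 if the piece is present on the board, else 0
weights = [5, 3, 1, -5, -1]    # negative weights for the opponent's pieces
value = linear_value(features, weights)
print(value)                   # 5 + 1 - 5 = 1
```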

<p><strong>Deep Blue: The Chess agent</strong></p>

<p><img src="../../images/rl/l10-ds/media/image10.png" alt="" /></p>

<p>Deep Blue used a combination of pre-defined features and search
heuristics.</p>

<p><strong>Chinook: Checkers agent</strong></p>

<p><img src="../../images/rl/l10-ds/media/image11.png" alt="" /></p>

<p>Along with a binary-linear value function (hand-crafted features) and
minimax search, Chinook <strong>used retrograde analysis. In this, the
agent searches backwards from won positions.</strong></p>

<p><strong>Self-play agents: Reinforcement learning</strong></p>

<p><img src="../../images/rl/l10-ds/media/image12.png" alt="" /></p>

<p>Now coming to reinforcement learning, the algorithms are pretty much the
same as we have seen before. Here, in the case of MC, v(S<sub>t</sub>, w) gives
the estimate of who would win from state S<sub>t</sub>. Then when the
game is finally over, it is updated towards the actual outcome (return)
G<sub>t</sub>.</p>

<p>In the case of TD, v(S<sub>t</sub>, w) gets updated towards the value of the
successor state. That is, the agent has some estimate of v at state S<sub>t</sub>.
Then when the agent actually plays and reaches state S<sub>t+1</sub>, there will be
some new estimate. We update v(S<sub>t</sub>, w) towards this new
estimate.</p>
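
<p>The two update targets can be contrasted with a small sketch. It uses
a lookup table instead of the parameterised v(S, w) from the slide, and
assumes no intermediate rewards and no discounting; the state names,
values and step size are made up.</p>

```python
def mc_update(values, state, final_return, alpha):
    """Move v(state) towards the actual outcome of the game."""
    values[state] += alpha * (final_return - values[state])

def td_update(values, state, next_state, alpha):
    """Move v(state) towards the current estimate at the successor state."""
    values[state] += alpha * (values[next_state] - values[state])

values = {"s0": 0.5, "s1": 0.8}                        # hypothetical win estimates
td_update(values, "s0", "s1", alpha=0.1)               # 0.5 + 0.1 * (0.8 - 0.5) = 0.53
mc_update(values, "s1", final_return=1.0, alpha=0.1)   # 0.8 + 0.1 * (1.0 - 0.8) = 0.82
print(values)
```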

<p><strong>Policy improvements using afterstates:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image13.png" alt="" /></p>

<p>In deterministic games, we know the rules. That is, we know that a knight
in chess moves two squares in one direction and one square sideways, and
so on. Hence, we know exactly where the piece will end up when that move
is played. In terms of the agent, we know exactly which state the game
will move to.</p>

<p>So, the policy can be improved by exploiting this idea of afterstates.
As we know the rules and the states which the game will move to, we can
simply try out these states in our mind and then choose the action which
maximises the afterstate value.</p>
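
<p>A minimal sketch of greedy action selection over afterstates, assuming
we are handed a deterministic successor function and a value estimate.
The toy game on a number line is hypothetical.</p>

```python
def greedy_afterstate_action(state, legal_actions, successor, value):
    """Pick the action whose resulting afterstate has the highest estimated value."""
    return max(legal_actions, key=lambda action: value(successor(state, action)))

def successor(state, action):
    """Known, deterministic rules of a toy game: an action shifts the position."""
    return state + action

def value(state):
    """Hypothetical value estimate: positions close to 7 are good."""
    return -abs(state - 7)

best_action = greedy_afterstate_action(4, [-1, 1, 2, 3], successor, value)
print(best_action)   # 3, because it leads to position 7
```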

<p><strong>Self-play TD: Logistello in the game Othello</strong></p>

<p><img src="../../images/rl/l10-ds/media/image14.png" alt="" /></p>

<p>Logistello, the Othello-playing agent, was able to develop its own
features from some basic input features.</p>

<p><img src="../../images/rl/l10-ds/media/image15.png" alt="" /></p>

<p>It used generalized policy iteration to perform RL.</p>

<p><strong>TD Gammon: Non-linear approximator for Backgammon</strong></p>

<p><img src="../../images/rl/l10-ds/media/image16.png" alt="" /></p>

<p>The TD Gammon feature vector basically flattens out the board and counts
how many of the checkers are there on each location. This is then passed
to a neural network which spits out the state value function.</p>

<p>So the main difference between TD Gammon and previous agents is that TD
Gammon uses a non-linear function approximation (neural network) to
approximate the value function.</p>

<p><strong>Self-play TD in TD Gammon:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image17.png" alt="" /></p>

<p>Notice that TD Gammon uses temporal difference learning with <strong>only
greedy policy improvement.</strong> That is, epsilon-greedy or other techniques
of exploration are not used. Yet, the algorithm manages to converge in
practice. <strong>This is because Backgammon uses dice. The dice
bring stochasticity to the game and thus there is implicit
exploration. Due to this, a greedy policy is able to reach optimality.</strong></p>

<p><img src="../../images/rl/l10-ds/media/image18.png" alt="" /></p>

<p>When this TD learning was combined with searching and hand-crafted
features, the agent was able to defeat the world champion.</p>

<p><strong>Combining reinforcement learning and minimax search:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image19.png" alt="" /></p>

<p>As mentioned before, searching is an important step in game-playing
agents. It shouldn’t be ignored.</p>

<p>One naïve way of combining RL with Minimax search is to first use TD to
update towards the successor value and then use this approximated value
function to perform minimax search.</p>

<p><strong>Results of using Simple TD:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image20.png" alt="" /></p>

<p>As we can see, for complicated games like Chess and Checkers it leads to
poor performance.</p>

<p><strong>TD root:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image21.png" alt="" /></p>

<p>In TD root, we first perform a minimax search from state S<sub>t</sub>
and store the outcome. Then we play for real and reach the next state
S<sub>t+1</sub>, where we perform another minimax search
and get its outcome. <strong>The value function at
S<sub>t</sub> is updated towards the outcome of the minimax search from
state S<sub>t+1</sub>.</strong> The important thing to note is that we are
updating the value function towards the outcome of the minimax search
and not towards the value function at S<sub>t+1</sub>. In the figure, the
search from state S<sub>t+1</sub> leads to the green node. So we update
the value at S<sub>t</sub> such that it moves closer to the value of the green node.
<strong>That is, the node S<sub>t</sub> should be able to predict the green
outcome without performing the minimax search.</strong></p>

<p><strong>TD Leaf:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image22.png" alt="" /></p>

<p>In case of TD leaf, we again use minimax search at state S<sub>t</sub>
and state S<sub>t+1</sub>. But now, we also update the leaf value node
attained at state S<sub>t</sub> towards the leaf value node attained at
state S<sub>t+1</sub>. <strong>That is, we are not only creating a better
value estimate at state S<sub>t</sub> but also a better minimax search
estimate from state S<sub>t</sub>.</strong></p>

<p><strong>Tree Strap:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image23.png" alt="" /></p>

<p>In TreeStrap we update nodes at every level towards nodes at deeper
levels. That is, the nodes at a higher level should be able to predict what
the nodes at a deeper level are predicting.</p>

<p>Note that this algorithm can only work well if there are a sufficient number
of real-world estimates within the search tree. Otherwise, we would just
be updating towards a biased fake optimal solution (as we are simulating
the search tree).</p>

<p><strong>Simulation based search:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image24.png" alt="" /></p>

<p>The UCT algorithm basically uses the Monte-Carlo Tree Search seen in
previous lectures but additionally treats every node as a bandit.
That is, every node now has an upper confidence bound associated with it.</p>

<p><strong>Performance of Monte Carlo Tree Search and UCT:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image25.png" alt="" /></p>

<p><strong>Simple Monte-Carlo search in Scrabble: (Imperfect information game
problems)</strong></p>

<p><img src="../../images/rl/l10-ds/media/image26.png" alt="" /></p>

<p>Notice that Scrabble is an imperfect information game as the other
player’s letters are not visible to us.</p>

<p><strong>Game tree search in imperfect games:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image27.png" alt="" /></p>

<p>Notice that in imperfect information games like Poker, <strong>every player
has a different search tree.</strong></p>

<p><strong>Solutions to imperfect information games:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image28.png" alt="" /></p>

<p><strong>Smooth UCT:</strong></p>

<p><img src="../../images/rl/l10-ds/media/image29.png" alt="" /></p>

<p>Smooth UCT is a variant of the normal UCT algorithm in which the current
experience is taken into consideration. The agent takes the current
opponent’s behaviour into consideration and learns to respond against
the <strong>average of it.</strong> With a high probability the action is picked by
the UCT algorithm, and with a small probability the agent picks an action
which plays best against the average behaviour.</p>

<p><img src="../../images/rl/l10-ds/media/image30.png" alt="" /></p>

<p>In conclusion, an RL agent should be combined with searching methods,
function approximators and hand-crafted features to form a robust and
intelligent agent.</p>]]></content><author><name>Omkar Ranadive</name><email>omkar.ranadive@u.northwestern.edu</email></author><category term="reinforcement-learning" /><category term="reinforcement-learning" /><summary type="html"><![CDATA[Lecture Details Title: Appling RL to Games Description: The lecture notes are based on David Silver’s lecture video. Video link: RL Course by David Silver - Lecture 7 Lecture Slides: Slides]]></summary></entry><entry><title type="html">Lecture 9 - Advanced Exploration [Notes]</title><link href="https://omkar-ranadive.github.io/posts/rl-l9-ds" rel="alternate" type="text/html" title="Lecture 9 - Advanced Exploration [Notes]" /><published>2020-05-04T00:00:00-07:00</published><updated>2020-05-04T00:00:00-07:00</updated><id>https://omkar-ranadive.github.io/posts/rl-l9-ds</id><content type="html" xml:base="https://omkar-ranadive.github.io/posts/rl-l9-ds"><![CDATA[<hr />
<p><strong>Lecture Details</strong></p>
<ul>
  <li><strong>Title:</strong> Advanced Exploration</li>
  <li><strong>Description:</strong> The lecture notes are based on David Silver’s lecture video.</li>
  <li><strong>Video link:</strong> <a href="https://www.youtube.com/playlist?list=PLbPhAbAhvjUyrKlhnLEMyNmiF72ABB3Zh" target="_blank">RL Course by David Silver - Lecture 9</a></li>
  <li><strong>Lecture Slides:</strong>  <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html" target="_blank">Slides</a></li>
</ul>

<p><strong>Credits: All images used in this post are courtesy of David Silver</strong></p>

<hr />

<p>We have already studied the importance of exploration. In brief, the
idea is to try out new things in the hope of getting more reward.</p>

<p><img src="../../images/rl/l9-ds/media/image1.png" alt="" /></p>

<p><strong>While exploring, we make short-term sacrifices to reward for long-term
advantage.</strong></p>

<p><strong>Types of exploration:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image2.png" alt="" /></p>

<p>Until now, we used the naïve approach of epsilon-greedy to explore. That
is, we explored randomly with a small probability epsilon and acted
greedily with a large probability 1 - epsilon. <strong>However, randomly
exploring the environment is obviously not optimal.</strong></p>
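
<p>For reference, a minimal sketch of epsilon-greedy action selection.
The action values, the seed and the function name are assumptions made
for illustration.</p>

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if epsilon > rng.random():
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q_values = [1.0, 2.5, 0.3]                                       # hypothetical estimates
greedy_choice = epsilon_greedy(q_values, epsilon=0.0, rng=rng)   # always action 1
choices = [epsilon_greedy(q_values, 0.1, rng) for _ in range(1000)]
print(greedy_choice, choices.count(1) / len(choices))
```

<p>Note that the random branch ignores everything the agent has learned
so far, which is exactly why this kind of exploration is wasteful.</p>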

<p><strong>Better approaches than naïve exploration:</strong></p>

<p><strong>Optimism in the face of uncertainty:</strong> The idea behind optimism in the
face of uncertainty is to be optimistic about the unknown. <strong>Example</strong>:
If one action gives a reward of 100 with a 70% chance and another action
may give a reward of 1000 with a 30% chance, then we should explore
the action with the 30% chance.</p>

<p><strong>Information state search:</strong> Information state search uses previous
information to make informed decisions. <strong>Example</strong>: Compare being in a room
where we know what is behind some door with being in a room where we don’t
know what is behind it. The first case is much more useful as we
can use previously known information to our advantage.</p>

<p><strong>State action exploration vs parametric exploration:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image3.png" alt="" /></p>

<p>In state action-exploration we <strong>systematically</strong> try out new things.
<strong>Example</strong>: Consider we have been in the state s before and had taken a
right from that point. So, when we are in that same state again, we
would likely take a left in state action-exploration; that is,
systematically try out different things.</p>

<p>In parameter exploration, we control our agent using some parameterised
policy. Once we choose the parameters, <strong>we try it out for a while. This
introduces consistency.</strong></p>

<p><strong>Example</strong>: An agent which would explore based on some fixed policy
(parameters) is better than an agent which takes some random action at
different states. That is, taking random actions may not lead to useful
results but taking consistent actions while exploring may lead to better
results.</p>

<p><strong>Multi-arm bandits</strong></p>

<p><img src="../../images/rl/l9-ds/media/image4.png" alt="" /></p>

<p>The multi-arm bandit problem can be thought of as having many one-step
slot machines. That is, say we have 10 slot machines in front of us, and
get to pull the lever of one of the machines. Doing this leads to some
reward R. We need to maximise the cumulative reward by pulling these
levers one at a time. Hence, mathematically, a multi-arm bandit is a
tuple (A, R) of a set of actions A and the reward distributions R of
those actions. <strong>Notice that it is state-less.</strong></p>

<p><strong>Example</strong>: One of the slot machines may have a 70% chance of giving a
reward of 100, another may have 20% chance of giving a reward of 200
etc. We need to maximize the cumulative reward.</p>

<p><strong>Regret</strong></p>

<p><img src="../../images/rl/l9-ds/media/image5.png" alt="" /></p>

<p>Instead of maximizing the cumulative reward, expressing the problem as
minimizing the total regret has certain advantages.</p>

<p>Firstly, regret is the difference between the best we could have done
(V*) and the value of the action we actually took, Q(a<sub>t</sub>). <strong><em>Note that here
we are assuming that we somehow know the value V* (the optimal
value).</em></strong></p>

<p><strong>Advantage of using regret instead of reward:</strong> By expressing the
problem in the form of regret, we get to compare different algorithms in
terms of exploration. That is, under every algorithm the cumulative
reward keeps on increasing, which makes the curves hard to compare; by
comparing the regret, we can see whether an algorithm’s regret curve
keeps growing linearly or flattens out.</p>

<p><strong>Counting regret:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image6.png" alt="" /></p>

<p>Regret can be expressed as shown above. Every time we take some action
a, we increase its count N<sub>t</sub>(a). The gap of an action is the
difference between the optimal value V* and the value of that action.
<strong>So, basically, we want to build an algorithm that keeps the counts
small for actions with large gaps.</strong></p>
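
<p>The count-and-gap form of regret is easy to compute once the true
action values are known. In the sketch below those values are simply
assumed, since in a real bandit problem they are hidden from the agent.</p>

```python
def total_regret(counts, action_values):
    """Total regret as the sum over actions of count * gap, with gap = V* - Q(a)."""
    v_star = max(action_values)
    return sum(n * (v_star - q) for n, q in zip(counts, action_values))

action_values = [0.2, 0.5, 0.9]     # hypothetical true values, normally unknown
counts = [10, 20, 70]               # how often each action was taken
regret = total_regret(counts, action_values)
print(regret)                       # about 10 * 0.7 + 20 * 0.4 + 70 * 0 = 15
```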

<p><strong>Linear vs sublinear regret:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image7.png" alt="" /></p>

<p>As shown above, for greedy and epsilon-greedy the <strong>cumulative regret</strong>
keeps on increasing linearly. However, decaying e-greedy works very well:
its regret curve flattens out (sublinear regret). That is, eventually we
end up taking the optimal action almost all the time and stop adding to
the regret.</p>

<p><strong>Analysis of greedy algorithm:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image8.png" alt="" /></p>

<p>In the greedy algorithm, as we select the action with the maximum
estimated value every time, we may end up locking onto a suboptimal
action forever. Example: say one machine has an 80% chance of giving a
reward of 10 and another has a 50% chance of giving a reward of 100. We
try machine 1 and get a reward of 10, then try machine 2 and get a reward
of 0 (we got unlucky). From now on we will choose machine 1 every time
(a suboptimal action).</p>

<p><strong>Optimistic initialization:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image9.png" alt="" /></p>

<p>In optimistic initialization, the initial values of Q(a) for all actions are
high. Hence, every action is likely to be chosen early on. The values of
bad actions become smaller over time. Say some bad action a has an
initial value of 100. If the agent takes it and receives a paltry reward
of 10, the action value is decreased: with a step size of ½ we get
100 + ½ * (10 - 100) = 100 - 45 = 55. However, 55 is still
relatively high. Hence, the action will be tried a few more times until
its value becomes substantially smaller. That is, we try every action
enough times before concluding that it is bad.</p>
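
<p>The arithmetic above can be replayed for a few more steps. The sketch
assumes the same step size of ½ and a bad action that keeps paying 10.</p>

```python
def update(q, reward, step_size):
    """Move the estimate q a fraction step_size of the way towards the observed reward."""
    return q + step_size * (reward - q)

q = 100.0                      # optimistic initial value
history = []
for _ in range(4):             # the bad action keeps paying 10
    q = update(q, reward=10, step_size=0.5)
    history.append(q)
print(history)                 # [55.0, 32.5, 21.25, 15.625]
```

<p>The estimate only approaches the true value of 10 gradually, so the
action gets tried several times before it stops looking attractive.</p>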

<p>However, if we are really unlucky, that is a good action ends up giving
bad rewards for say 4-5 tries then we will never try it again and we may
end up locking into a suboptimal solution.</p>

<p><strong>Epsilon-greedy:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image10.png" alt="" /></p>

<p>In epsilon-greedy, as we continue exploring <strong>forever (epsilon does not
decay), we keep accumulating regret at every step.</strong></p>

<p><strong>Decaying epsilon-greedy:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image11.png" alt="" /></p>

<p>Consider the decaying schedule shown above. Here d is the smallest gap,
that is, the difference between the values of the <strong>best action and the
second-best action.</strong> Intuitively, if the difference is large then the
term c|A|/(d<sup>2</sup>t) becomes smaller, i.e., we explore the
second-best action and the remaining actions a lot less, as they are
significantly worse than the best action. On the other hand, if the
difference is small, the term c|A|/(d<sup>2</sup>t) is larger. That is,
we explore these actions with a higher probability.</p>

<p><strong>Note</strong>: This cannot be done in practice as it requires advance
knowledge of the gaps (we would need to know the optimal value V* and the
true value of each action).</p>
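
<p>A small sketch of this schedule, assuming the form epsilon_t = min(1,
c|A|/(d<sup>2</sup>t)) with the constant c and the gap d supplied by hand
(which is exactly the knowledge we would not have in practice):</p>

```python
def decaying_epsilon(t, num_actions, smallest_gap, c=1.0):
    """Schedule epsilon_t = min(1, c * |A| / (d^2 * t)) for t = 1, 2, ..."""
    return min(1.0, c * num_actions / (smallest_gap ** 2 * t))

# Hypothetical problem with 3 actions, evaluated at t = 1000.
eps_large_gap = decaying_epsilon(1000, num_actions=3, smallest_gap=0.5)   # 3 / 250
eps_small_gap = decaying_epsilon(1000, num_actions=3, smallest_gap=0.1)   # about 3 / 10
print(eps_large_gap, eps_small_gap)
```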

<p><strong>Lower bound for regret:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image12.png" alt="" /></p>

<p>It can be proved using the KL divergence that, asymptotically, the total
regret is bounded below by a logarithmic function of the number of steps.
<strong>That is, the optimal regret curve will be a logarithmic curve.</strong></p>

<p><strong>Hence, the best algorithm will be that which leads to a logarithmic
curve of regret. (Example – decaying e-greedy).</strong></p>

<p><strong>Optimism in the face of uncertainty:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image13.png" alt="" /></p>

<p>Consider the three action value distributions (Q(a1), Q(a2), Q(a3)). We
can see that the distribution of Q(a1) is the widest, so it is the most
uncertain and has a small chance of giving the highest value amongst the
three actions. Hence, an optimistic agent will choose a1.</p>

<p><img src="../../images/rl/l9-ds/media/image14.png" alt="" /></p>

<p>Consider that after choosing Q(a1) (blue curve), we got a small reward.
Now, we would update the distribution and now we would be less likely to
choose Q(a1) again.</p>

<p><strong>Problem: We need to find some way of forming these distributions or
obviating their need</strong></p>

<p><strong>Upper Confidence Bounds:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image15.png" alt="" /></p>

<p>Upper confidence bounds obviate the need for forming the distribution.
For every action we define an upper confidence U<sub>t</sub>(a) such that,
with high probability, the true action value Q(a) is no larger than the
estimate Q_hat<sub>t</sub>(a) plus U<sub>t</sub>(a). <strong>As we try out
the action more and more, the value of U<sub>t</sub>(a) decreases, that is,
we become less and less uncertain about its value.</strong></p>

<p><strong>Hoeffding’s inequality:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image16.png" alt="" /></p>

<p>Hoeffding’s inequality (for i.i.d. random variables in [0, 1]) says that
the probability of the true mean E[X] exceeding the sample mean
X<sub>t</sub> by more than u is at most e<sup>-2tu<sup>2</sup></sup>.</p>

<p>We can use this to represent our upper confidence bound and find the
value of U(a).</p>

<p><strong>Calculating the upper confidence bounds:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image17.png" alt="" /></p>

<p>We start by selecting a probability p, say p = 0.95, that the true value
exceeds the upper confidence bound; a large p means we start out very
uncertain about the bound. Setting the Hoeffding bound equal to p, taking
the log on both sides and solving for U<sub>t</sub>(a) gives the
expression shown above. As we can see, the N<sub>t</sub>(a) in the
denominator ensures that U<sub>t</sub>(a) becomes smaller as the number
of times the action gets chosen increases.</p>

<p>We should also reduce the value of p as we observe more rewards (for
example p = t<sup>-4</sup>), so that the bound holds with increasing
confidence.</p>

<p><img src="../../images/rl/l9-ds/media/image18.png" alt="" /></p>

<p>By using this strategy, we achieve logarithmic total regret.</p>
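
<p>A minimal sketch of the resulting rule. It assumes rewards in [0, 1],
every action tried at least once, and the choice p = t<sup>-4</sup>, for
which the bonus becomes sqrt(2 log t / N<sub>t</sub>(a)). The statistics
in the example are made up.</p>

```python
import math

def hoeffding_bonus(p, count):
    """U such that exp(-2 * count * U^2) = p, i.e. U = sqrt(-log(p) / (2 * count))."""
    return math.sqrt(-math.log(p) / (2 * count))

def ucb1_action(q_estimates, counts, t):
    """Pick argmax of Q(a) + sqrt(2 * log(t) / N(a)), the bonus obtained with p = t^-4."""
    scores = [q + math.sqrt(2 * math.log(t) / n) for q, n in zip(q_estimates, counts)]
    return scores.index(max(scores))

# Hypothetical statistics after t = 100 pulls: action 2 looks worst but is barely explored.
q_estimates = [0.60, 0.55, 0.40]
counts = [60, 38, 2]
chosen = ucb1_action(q_estimates, counts, t=100)
print(chosen)   # 2, because its large bonus outweighs its low estimate
```

<p>The rarely tried action wins here purely because of its uncertainty
bonus; after a few more pulls the bonus shrinks and the estimates take
over.</p>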

<p><strong>Bayesian Bandits:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image19.png" alt="" /></p>

<p>In Bayesian bandits, <strong>we exploit prior knowledge of rewards.</strong> That
is, we have a history of knowledge involving some actions and their
respective rewards. Using this, we can build up distributions for each
Q(a).</p>

<p><img src="../../images/rl/l9-ds/media/image20.png" alt="" /></p>

<p>Assume a Gaussian distribution. We can make use of the prior knowledge
to build up distributions as shown above. Then, we can estimate the
upper confidence bound by using the standard deviation.</p>

<p>That is, c * standard_deviation / sqrt(N(a)) is added to the mean as
the upper confidence bonus. Now, we simply pick the action that maximizes
this bound, as in the UCB1 algorithm.</p>

<p><strong>This algorithm will only work well if the prior knowledge is
accurate.</strong></p>

<p><strong>Probability matching:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image21.png" alt="" /></p>

<p>In probability matching, actions are picked in proportion to the
probability that they are the best action. Example: If there are two actions, one of which has a 70%
chance of being the best and the other a 30% chance of being the best,
then the first will be picked 70% of the time and the other
30% of the time.</p>

<p><strong>Thompson Sampling:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image22.png" alt="" /></p>

<p>Thompson sampling is one of the earliest and simplest ideas, and it implements probability matching. In it,
we sample a value from each of the distributions and pick the action with the maximum sample. For
example, consider the following distributions:</p>

<p><img src="../../images/rl/l9-ds/media/image23.png" alt="" /></p>

<p>We will sample a value from Q(a1), Q(a2) and Q(a3). That is, we get
three values from those three distributions (actions). Now, we simply
pick the max of the values.</p>
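<p>A minimal sketch of this sample-then-pick-the-max loop is shown below. It assumes Bernoulli rewards with a Beta posterior under a uniform prior (the figure above may use a different distribution family), and the function names are hypothetical.</p>

```python
import random

def thompson_select(wins, losses, rng):
    """Draw one sample per action from its Beta posterior and return the arg max."""
    samples = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
    return max(range(len(samples)), key=samples.__getitem__)

def run_bernoulli_bandit(true_probs, steps, seed=0):
    """Play `steps` rounds and return the per-action win and loss counts."""
    rng = random.Random(seed)
    wins = [0] * len(true_probs)
    losses = [0] * len(true_probs)
    for _ in range(steps):
        a = thompson_select(wins, losses, rng)
        p = true_probs[a]
        # Simulated environment: reward 1 with probability p, otherwise 0.
        reward = rng.choices([1, 0], weights=[p, 1 - p])[0]
        wins[a] += reward
        losses[a] += 1 - reward
    return wins, losses
```

<p>Because actions with wide posteriors occasionally produce large samples, they still get tried, while clearly inferior actions are picked less and less often.</p>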

<p><strong>Value of information:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image24.png" alt="" /></p>

<p>The value of information can be thought of as: how much is taking an
uncertain exploratory action worth? Mathematically, it can be thought of
as the long-term reward after getting the information minus the immediate reward.</p>

<p><strong>Information State-Space:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image25.png" alt="" /></p>

<p>In the information state space, we store the information accumulated about the environment
and use it to explore the environment better. This converts the bandit problem from a one-step
decision making problem into a sequential decision making problem. We define an information state S_tilde
and a transition matrix P_tilde over information states. By definition they satisfy the Markov
property and hence, we can now express the problem as an MDP.</p>

<p>Example: Bernoulli Bandit</p>

<p><img src="../../images/rl/l9-ds/media/image26.png" alt="" /></p>

<p>A Bernoulli bandit is a special case of the multi-armed bandit in which an arm issues a
reward of 1 with probability p and a reward of 0 with probability 1 –
p.</p>

<p>Consider the problem where we win a game with probability u. For this
problem, we maintain a simple information state which is basically a
count of the times we won and lost. Alpha counts the times we lost and Beta
counts the times we won.</p>
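<p>As a small sketch, the information state is just a pair of counters that changes after every game. The function names are hypothetical, and the uniform prior used for the win-probability estimate is an assumption.</p>

```python
def update_information_state(state, reward):
    """state is (alpha, beta): alpha counts losses (reward 0), beta counts wins (reward 1)."""
    alpha, beta = state
    if reward == 1:
        return (alpha, beta + 1)
    return (alpha + 1, beta)

def estimated_win_probability(state):
    alpha, beta = state
    # Posterior mean under a uniform prior (an assumption): (wins + 1) / (plays + 2).
    return (beta + 1) / (alpha + beta + 2)
```

<p>Each outcome moves us to a new information state, which is what turns the one-step bandit into a sequential problem.</p>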

<p><img src="../../images/rl/l9-ds/media/image27.png" alt="" /></p>

<p>The resulting MDP over information states has countably infinitely many states, and in principle we can use any
reinforcement learning algorithm to solve it.</p>

<p><strong>Gittins indices (Bayes Adaptive RL):</strong></p>

<p><img src="../../images/rl/l9-ds/media/image28.png" alt="" /></p>

<p>Gittins indices are a dynamic programming solution to the bandit
problem. We form a tree consisting of different scenarios and
use it to update our information state. At each node of the
tree, we summarize everything we know about the actions using
distributions. We can see that we started with two drugs with
some initial distributions and they got updated at every node. This is
the Bayes-adaptive approach. <strong>Note that solving the Bayes-adaptive MDP
using dynamic programming gives the solution known as the Gittins index.</strong></p>

<p><img src="../../images/rl/l9-ds/media/image29.png" alt="" /></p>

<p>In practice, the exact solution is typically intractable to compute.</p>

<p><strong>Summary:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image30.png" alt="" /></p>

<p><strong>Contextual bandits:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image31.png" alt="" /></p>

<p>In contextual bandits, we also introduce the idea of states (context).
So, now the tuple becomes (A, S, R) instead of (A, R).</p>

<p>Example: Consider the problem of ad placement. The ads are placed
based on the user who has entered the website. That is, if the user is
an Indian male, then a particular placement of ads is chosen, and so on.
Basically, we are taking actions based on the context (S).</p>

<p><strong>Using linear regression to estimate the value function:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image32.png" alt="" /></p>

<p>We can use a linear approximation of the action-value function which
would improve over time (the parameters will improve).</p>

<p><strong>UCB:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image33.png" alt="" /></p>

<p>We can also estimate the variance to calculate the upper confidence
bound U.</p>

<p><strong>Geometric interpretation:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image34.png" alt="" /></p>

<p>We are essentially defining an ellipsoid around the parameter theta. This
ellipsoid accounts for the uncertainty (upper confidence bound).</p>

<p>Hence, we get:</p>

<p><img src="../../images/rl/l9-ds/media/image35.png" alt="" /></p>

<p><strong>Extending the algorithms of bandits to MDP:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image36.png" alt="" /></p>

<p><strong>UCB in MDPs</strong></p>

<p><img src="../../images/rl/l9-ds/media/image37.png" alt="" /></p>

<p><strong>Problem:</strong> When we are dealing with MDPs, the Q(s, a) value itself
keeps changing as the policy improves. So, there is uncertainty not only
in the bound U(s, a) but also in Q(s, a). Hence, the problem becomes
much harder in the case of MDPs.</p>

<p><strong>Example 2: Optimistic initialization in MDPs</strong></p>

<p><img src="../../images/rl/l9-ds/media/image38.png" alt="" /></p>

<p>The idea of the R-max algorithm is to build an optimistic model by
imagining that every unexplored transition leads to heaven (the best scenario). Once we
actually start exploring, we find that many of those states are actually
bad and we can reduce the values appropriately.</p>

<p><strong>Information state space MDP:</strong></p>

<p><img src="../../images/rl/l9-ds/media/image39.png" alt="" /></p>

<p>We basically combine the actual state s with the information state
s_tilde into an augmented state S_tilde.</p>]]></content><author><name>Omkar Ranadive</name><email>omkar.ranadive@u.northwestern.edu</email></author><category term="reinforcement-learning" /><category term="reinforcement-learning" /><summary type="html"><![CDATA[Lecture Details Title: Advanced Exploration Description: The lecture notes are based on David Silver’s lecture video. Video link: RL Course by David Silver - Lecture 9 Lecture Slides: Slides]]></summary></entry><entry><title type="html">Lecture 8 - Integrating Learning and Planning [Notes]</title><link href="https://omkar-ranadive.github.io/posts/rl-l8-ds" rel="alternate" type="text/html" title="Lecture 8 - Integrating Learning and Planning [Notes]" /><published>2020-05-03T00:00:00-07:00</published><updated>2020-05-03T00:00:00-07:00</updated><id>https://omkar-ranadive.github.io/posts/rl-l8-ds</id><content type="html" xml:base="https://omkar-ranadive.github.io/posts/rl-l8-ds"><![CDATA[<hr />
<p><strong>Lecture Details</strong></p>
<ul>
  <li><strong>Title:</strong> Integrating Learning and Planning</li>
  <li><strong>Description:</strong> The lecture notes are based on David Silver’s lecture video.</li>
  <li><strong>Video link:</strong> <a href="https://www.youtube.com/playlist?list=PLbPhAbAhvjUyrKlhnLEMyNmiF72ABB3Zh" target="_blank">RL Course by David Silver - Lecture 8</a></li>
  <li><strong>Lecture Slides:</strong>  <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html" target="_blank">Slides</a></li>
</ul>

<p><strong>Credits: All images used in this post are courtesy of David Silver</strong></p>

<hr />

<p>Until now, we have learned the policy/value directly from experience.
But it is also possible to learn a model of the environment from
experience. Such a model helps in planning, and planning in turn helps
construct a value/policy function.</p>

<p><img src="../../images/rl/l8-ds/media/image1.png" alt="" /></p>

<p><strong>Model Based RL:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image2.png" alt="" /></p>

<p>In model-based RL, we have a simulated representation of the
environment. This simulated representation can be used for planning
future actions. <strong>Basically, a model-based RL agent will learn the
transition probabilities and the rewards, i.e. it will learn the
MDP.</strong></p>

<p><strong>Advantages and Disadvantages of Model Based RL:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image3.png" alt="" /></p>

<p>The model of the environment is learned from real-world
experience. That is, the agent first acts in the real world,
gets some <strong>real-world experience (S, A, R, S’)</strong> and uses these tuples
to build the model. So, this is a supervised learning problem and hence,
any supervised learning technique can be used.</p>

<p>Because the model is our knowledge about the environment, we can also
reason about the things we are uncertain about and hence, we can handle
uncertainty better.</p>

<p><strong>Disadvantages:</strong> Here, we are performing a two-step process. First we
learn a model and then we learn the value/policy from it, as opposed to
the previous model-free techniques where we only learned the
value/policy. Hence, there are two sources of approximation error.</p>

<p><strong>Formal definition of a model:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image4.png" alt="" /></p>

<p>So, a model is a representation of an MDP (S, A, P, R). That is, given a
state S<sub>t</sub> and action A<sub>t</sub>, the model can estimate the
next state S<sub>t+1</sub> and the reward R<sub>t+1</sub>.</p>

<p><strong>Note:</strong> Here we are assuming that the state space S and action space A
are known. But in more convoluted problems, it is possible that this is
not the case and the state/action space would also need to be estimated.</p>

<p><strong>Why even do model based RL?</strong></p>

<p>Consider a maze game where a new maze is generated every episode. A
model-free agent will have extreme difficulty navigating the environment
as it won’t be able to learn the value functions properly (as the maze
changes every episode). On the other hand, the model-based agent can
simply learn the rules of the game, i.e. going up moves it north,
going left moves it west, etc. By learning such simple rules, the
model-based agent would be able to solve the maze much better.</p>

<p><strong>How to learn a model?</strong></p>

<p><img src="../../images/rl/l8-ds/media/image5.png" alt="" /></p>

<p>As we previously saw, the model learning problem is a supervised
learning problem.</p>

<p><strong>Types of models which can be learned:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image6.png" alt="" /></p>

<p>There are endless possibilities of models which can be learned. It is up
to the programmer to choose the right model representation.</p>

<p><strong>Table lookup model:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image7.png" alt="" /></p>

<p>The simplest way to “learn” a model is by creating a table. The
transition probabilities can be estimated by counting the transitions
actually experienced from each state-action pair and dividing by the number of
visits to that pair. Similarly, the reward estimate is the average of all
rewards observed for that pair.</p>

<p>Example:</p>

<p><img src="../../images/rl/l8-ds/media/image8.png" alt="" /></p>

<p>Consider the simple problem shown above. Here, the transition from state
A to state B is seen once (first tuple), and it is the only transition seen from A. Hence, the transition matrix
will show it with 100% probability. State B leads to the
terminal state 8 times, 6 of which get a reward of 1 and 2 of which get a reward
of 0. Hence, we can form a model which says that 75% (6/8) of the time
we get r = 1 and 25% (2/8) of the time we get r = 0.</p>
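<p>The counting procedure can be sketched as follows for this example. The episode encoding and the name fit_table_model are illustrative assumptions, and since the example has a single action per state, the table is keyed by state only.</p>

```python
from collections import defaultdict

# Each episode is a list of (state, reward, next_state) steps; None marks termination.
episodes = (
    [[("A", 0, "B"), ("B", 0, None)]]
    + [[("B", 1, None)]] * 6
    + [[("B", 0, None)]]
)

def fit_table_model(episodes):
    """Estimate transition probabilities and mean rewards by counting."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[s][s_next] = times seen
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:
        for s, r, s_next in episode:
            counts[s][s_next] += 1
            reward_sum[s] += r
            visits[s] += 1
    P = {s: {s2: n / visits[s] for s2, n in nxt.items()} for s, nxt in counts.items()}
    R = {s: reward_sum[s] / visits[s] for s in visits}
    return P, R

P, R = fit_table_model(episodes)
print(P["A"]["B"], R["B"])  # 1.0 0.75
```

<p>With actions, the same tables would simply be keyed by (state, action) instead of state.</p>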

<p><strong>Problem of exploration:</strong></p>

<p>One evident issue with this idea is that if our real world experiences
aren’t a good representation of the environment then the model
constructed will have a heavy bias. That is, in our example, we have
only seen one experience (tuple) with state A. Based on that single
tuple, we learned a model which says that A -&gt; B has a 100% chance of
occurrence. But in reality, this is unlikely to be true and may simply
be a result of learning from too few real-world experiences. Hence, the
amount of exploration needs to be appropriately handled.</p>

<p><strong>Planning with a model:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image9.png" alt="" /></p>

<p><strong>Once we have a model representation, we can use this model to generate
new tuples (experiences) and learn from those experiences.</strong> Hence, it’s
important to have a good simulated representation as we are learning
from the generated simulated experiences. This learning can be done
using any planning algorithm like value iteration, policy iteration etc.</p>

<p><strong>Sample based planning:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image10.png" alt="" /></p>

<p>The simplest approach is to use the model to generate new samples of
data and apply model-free RL algorithms like MC, SARSA to those samples.</p>

<p><strong>Example:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image11.png" alt="" /></p>

<p>Once we create the model shown above, we can use that model to generate
samples (sampled experiences) and <strong>apply RL algorithms to those sampled
experiences.</strong></p>

<p>In this case, as the sample A, 0, B, 1 was generated twice by the model,
V(A) becomes 1, since both sampled episodes starting in state A led to a final reward of 1.
(<strong>Note that we aren’t considering the real world experience A, 0, B,
0).</strong> The real world experiences are only used to create the model
and then the learning of the value function is based on the sampled
experiences from the model.</p>

<p><strong>What if our model sucks?</strong></p>

<p><img src="../../images/rl/l8-ds/media/image12.png" alt="" /></p>

<p>Again, a big problem is that the learned policy is only as good as
the model which we have learned. If the simulated environment
representation doesn’t represent the real-world dynamics well, then the
policy learned won’t be optimal. One way to solve this is to fall back
on model-free RL when the model is wrong, and another way is to use
more sophisticated approaches which reason explicitly about the
model’s uncertainty and weight the model accordingly.</p>

<p><strong>Integrated architectures:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image13.png" alt="" /></p>

<p>A more robust solution is to use both the real experience and simulated
experience. If used properly, the resulting agent performs better than
one that relies on only one of them.</p>

<p><strong>DYNA:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image14.png" alt="" /></p>

<p>Dyna is an integrated architecture which learns the model from real
experience and learns and plans the value function using <strong>both</strong> real
and simulated experience.</p>

<p><strong>Architecture:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image15.png" alt="" /></p>

<p>Here we can see that the value/policy part is formed using planning via
the learned model and directly via the real experiences.</p>

<p><strong>Algorithm: Dyna Q</strong></p>

<p><img src="../../images/rl/l8-ds/media/image16.png" alt="" /></p>

<p>Basically, in Dyna-Q we execute an action A in the real environment and
observe the reward R and next state S’. The Q-value is updated using the
Q-learning update. This (S, A, R, S’) tuple is also used to
improve the model. <strong>Then we update the Q-values again, n times, using tuples
sampled from the model.</strong></p>

<p>Hence, in every iteration, the first update comes from real-world
experience and then n updates are from the samples generated from the
learned model.</p>
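<p>The loop described above can be sketched as a small tabular implementation. This is a hedged illustration: it assumes a deterministic environment (so the model just remembers the last outcome of each state-action pair), and the chain environment chain_step and all hyperparameter values are hypothetical.</p>

```python
import random
from collections import defaultdict

def dyna_q(step, start, actions, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q. step(s, a) must return (reward, next_state, done)."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}  # (s, a) maps to the last observed (reward, next_state, done)

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    def q_update(s, a, r, s2, done):
        target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            explore = rng.choices([True, False], weights=[epsilon, 1 - epsilon])[0]
            a = rng.choice(actions) if explore else greedy(s)
            r, s2, done = step(s, a)
            q_update(s, a, r, s2, done)        # one update from real experience
            model[(s, a)] = (r, s2, done)      # model learning
            for _ in range(n_planning):        # n updates from simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                q_update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q

# Hypothetical environment: states 0..4 on a line, reward 1 for reaching state 4.
def chain_step(s, a):
    s2 = min(max(s + a, 0), 4)
    done = s2 == 4
    return (1.0 if done else 0.0), s2, done
```

<p>Calling dyna_q(chain_step, 0, [-1, 1]) returns the learned Q table. Setting n_planning to 0 reduces the loop to plain Q-learning, which makes it easy to compare how much the simulated updates help.</p>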

<p><strong>Interesting problems:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image17.png" alt="" /></p>

<p>Consider the environment shown above. Here the gray tiles represent a
wall and hence, the agent cannot travel through that space. Now, the
solution is to go all the way right from state S and go all the way up
(as per the first diagram). But say that we change the environment
mid-way. Now, if the agent follows the same solution it won’t get the
optimal reward. It will take a lot of time to find the new correct way.
This can be solved by an improved architecture called the Dyna Q+.
<strong>Dyna Q+ rewards exploration.</strong> Hence, in dynamic environments which
may change, algorithms like Dyna Q+ will perform better than Dyna Q.</p>

<p><img src="../../images/rl/l8-ds/media/image18.png" alt="" /></p>

<p>The same is true even if the changed environment is easier. Dyna Q will
keep following the same previous path even though the new path is
shorter and better.</p>

<p><strong>Simulation-based search:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image19.png" alt="" /></p>

<p>Forward search algorithms <strong>focus on the present state.</strong> That is,
we don’t care about some random sample from any random state (as we have
seen till now) but we care about the samples starting from the present
state S<sub>t</sub>. So, we are looking ahead from the present state
S<sub>t</sub>. Now, this state S<sub>t</sub> may be some intermediate
state. <strong>That is fine, as by solving just the sub-MDP starting from now we are still improving
our understanding and policy.</strong></p>

<p><strong>Simulated forward search:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image20.png" alt="" /></p>

<p>This idea of forward search can be executed by sampling from our learned
model. Then we can apply model free RL techniques to these simulated
episodes.</p>

<p>Formally,</p>

<p><img src="../../images/rl/l8-ds/media/image21.png" alt="" /></p>

<p>where k indexes the simulated episodes.</p>

<p><strong>Monte-Carlo search:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image22.png" alt="" /></p>

<p>Monte-Carlo search samples episodes from the model, starting at the current state, and
evaluates each action by the mean return of those episodes (MC evaluation). <strong>Notice that, in simple
Monte-Carlo search we aren’t improving the simulation policy.</strong></p>
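<p>A minimal sketch of simple Monte-Carlo search is given below, assuming undiscounted returns and a fixed simulation policy. The corridor model, the random simulation policy and all names are hypothetical stand-ins for a learned model.</p>

```python
import random

def monte_carlo_search(model_step, state, actions, rollout_policy, n_episodes=200, seed=0):
    """Return (best_action, q) where q[a] is the mean simulated return after taking a in state."""
    rng = random.Random(seed)
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(n_episodes):
            reward, s, done = model_step(state, a)
            episode_return = reward
            while not done:
                # The simulation policy is fixed; it is never improved here.
                reward, s, done = model_step(s, rollout_policy(s, rng))
                episode_return += reward
            total += episode_return
        q[a] = total / n_episodes
    return max(q, key=q.get), q

# Hypothetical model: a corridor of states 0..4 where 0 and 4 are terminal
# and only reaching state 4 gives a reward of 1.
def corridor_model(s, a):
    s2 = s + a
    return (1.0 if s2 == 4 else 0.0), s2, s2 in (0, 4)

def random_policy(s, rng):
    return rng.choice([-1, 1])
```

<p>From the middle state 2, stepping towards state 4 should get the higher estimate, since a random walk from state 3 reaches the rewarding end more often than one from state 1.</p>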

<p><strong>Monte-Carlo tree search:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image23.png" alt="" /></p>

<p><strong>In tree search, we build a search tree that stores values for every state visited. In
the case of simple Monte-Carlo search we were only updating values for the root state
S<sub>t</sub>.</strong></p>

<p><img src="../../images/rl/l8-ds/media/image24.png" alt="" /></p>

<p>The simulation policy is improved epsilon greedily. There might also be
states which we have never explored. The actions in such states can be
chosen randomly.</p>

<p><strong>Example: Game of Go</strong></p>

<p><strong>Rules:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image25.png" alt="" /></p>

<p>In go, we essentially try to surround as much territory as possible.
Also, if we surround an opposite colored stone, that stone will be
eliminated. That is, if blacks surround a white stone, the white stone
gets eliminated.</p>

<p><strong>Evaluating the positions:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image26.png" alt="" /></p>

<p><strong>Notice that, max min v<sub>pi</sub>(s) says that the agent needs to
consider the optimal play (mini-max algorithm).</strong></p>

<p><strong>Applying forward search:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image27.png" alt="" /></p>

<p>We can see that by simulating the outcomes starting from some current
position s, we can start approximating the value function. In the
example shown above, as 2 of the 4 simulated outcomes lead to a black victory, our V(s)
will be 2/4 = 0.5.</p>

<p><strong>Tree evaluation</strong></p>

<p><img src="../../images/rl/l8-ds/media/image28.png" alt="" /></p>

<p>Suppose we start by sampling some branch which leads to black
winning (the terminal state has reward 1). Hence, we write the root as 1/1
(1 win in 1 episode).</p>

<p>Executing another sample:</p>

<p><img src="../../images/rl/l8-ds/media/image29.png" alt="" /></p>

<p>Now, as the terminal state has reward 0, we write 1/2 in the root and 0/1 in its
child. Notice that we are slowly building the tree.</p>

<p>Executing another sample:</p>

<p><img src="../../images/rl/l8-ds/media/image30.png" alt="" /></p>

<p>Now, we explore another branch and black wins. So, we update the values
at the explored nodes.</p>

<p>This way we keep adding new information to the tree and improve the tree policy in the
direction of the best branch. That is, the tree policy will choose the
most promising branch (epsilon greedily) and hence, we will follow the best
known path. Once we are outside the tree, we choose the
actions using some default policy.</p>
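<p>The win/visit bookkeeping in the walkthrough above can be sketched as follows. This covers only the backup step (no selection or rollout), and the class SearchTree and the move labels are hypothetical.</p>

```python
from collections import defaultdict

class SearchTree:
    """Stores [black_wins, visits] for every node, keyed by the move sequence from the root."""

    def __init__(self):
        self.stats = defaultdict(lambda: [0, 0])

    def backup(self, moves, black_won):
        # Update the root and every node along the in-tree part of the simulated path.
        for depth in range(len(moves) + 1):
            node = tuple(moves[:depth])
            self.stats[node][0] += int(black_won)
            self.stats[node][1] += 1

    def value(self, node=()):
        wins, visits = self.stats[tuple(node)]
        return wins / visits if visits else 0.0

tree = SearchTree()
tree.backup([], True)        # first simulation: black wins, root becomes 1/1
tree.backup(["a"], False)    # second: root 1/2, child "a" 0/1
tree.backup(["b"], True)     # third: another branch, root 2/3, child "b" 1/1
```

<p>A tree policy would then prefer the child with the higher win ratio (here "b"), while still exploring other children occasionally.</p>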

<p><strong>Advantages of MC:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image31.png" alt="" /></p>

<p><strong>TD search:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image32.png" alt="" /></p>

<p>Instead of MC, it is also possible to use TD search and use the idea of
TD for tree evaluation etc.</p>

<p><img src="../../images/rl/l8-ds/media/image33.png" alt="" /></p>

<p>Formally,</p>

<p><img src="../../images/rl/l8-ds/media/image34.png" alt="" /></p>

<p><strong>Dyna-2 architecture:</strong></p>

<p><img src="../../images/rl/l8-ds/media/image35.png" alt="" /></p>]]></content><author><name>Omkar Ranadive</name><email>omkar.ranadive@u.northwestern.edu</email></author><category term="reinforcement-learning" /><category term="reinforcement-learning" /><summary type="html"><![CDATA[Lecture Details Title: Integrating Learning and Planning Description: The lecture notes are based on David Silver’s lecture video. Video link: RL Course by David Silver - Lecture 8 Lecture Slides: Slides]]></summary></entry><entry><title type="html">Lecture 7 - Policy Gradients [Notes]</title><link href="https://omkar-ranadive.github.io/posts/rl-l7-ds" rel="alternate" type="text/html" title="Lecture 7 - Policy Gradients [Notes]" /><published>2020-05-02T00:00:00-07:00</published><updated>2020-05-02T00:00:00-07:00</updated><id>https://omkar-ranadive.github.io/posts/rl-l7-ds</id><content type="html" xml:base="https://omkar-ranadive.github.io/posts/rl-l7-ds"><![CDATA[<hr />
<p><strong>Lecture Details</strong></p>
<ul>
  <li><strong>Title:</strong> Policy Gradients</li>
  <li><strong>Description:</strong> The lecture notes are based on David Silver’s lecture video.</li>
  <li><strong>Video link:</strong> <a href="https://www.youtube.com/playlist?list=PLbPhAbAhvjUyrKlhnLEMyNmiF72ABB3Zh" target="_blank">RL Course by David Silver - Lecture 7</a></li>
  <li><strong>Lecture Slides:</strong>  <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html" target="_blank">Slides</a></li>
</ul>

<p><strong>Credits: All images used in this post are courtesy of David Silver</strong></p>

<hr />

<p>Instead of estimating the value functions and then reaching the optimal
policy by following the updates epsilon greedily, we can directly tinker
with the parameters of the policy function.</p>

<p><img src="../../images/rl/l7-ds/media/image1.png" alt="" /></p>

<p>The types can be categorized as follows:</p>

<p><img src="../../images/rl/l7-ds/media/image2.png" alt="" /></p>

<p>As we can see, value-based methods have <strong>an implicit policy, that is, by
following the values epsilon greedily, we would reach the optimal
solution without stating any explicit policy.</strong> On the other hand,
policy-based algorithms have an explicit policy which our agent will
improve over time.</p>

<p>The actor-critic model takes the best of both worlds: the agent
navigates the environment using a policy-based approach while the action
values are updated using a value-based approach.</p>

<p><strong>What’s the point of policy-based approach if Value based exists?</strong></p>

<p><img src="../../images/rl/l7-ds/media/image3.png" alt="" /></p>

<p>In value-based approaches, we follow the values greedily and eventually
reach convergence. However, a small change in the values can cause a large
change in the greedy policy, whereas directly adjusting the policy parameters
changes the policy smoothly. Hence, policy-based approaches tend to have better
convergence properties.</p>

<p>Value-based approaches deal with action-value functions. So, if the action
space has many dimensions or is continuous, computing the max over actions can be expensive and
hence, policy-based methods are more effective in this case.</p>

<p>As value-based methods take the max, the resulting policy is deterministic (even
though several actions may look plausible, we are always
choosing a single one). Policy-based approaches can learn stochastic
policies.</p>

<p><strong>Advantages of stochastic policies:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image4.png" alt="" /></p>

<p>Consider the game of rock-paper-scissors. In this, by having a
deterministic policy like always choosing rock, the opponent can easily
exploit it by always choosing paper.</p>

<p>Hence, this is one of the scenarios where the optimal policy is a
uniform random policy. A policy gradient algorithm can
find this way of playing rock-paper-scissors, whereas a value-based
algorithm that acts greedily cannot.</p>

<p><strong>Alias world: Incomplete MDPs</strong></p>

<p><img src="../../images/rl/l7-ds/media/image5.png" alt="" /></p>

<p>Policy gradient techniques are also useful when the environment
is only partially observable (i.e. the agent does not have all the information about the state).</p>

<p>Consider the example of the aliased gridworld. Here, the gray states cannot be
differentiated by the agent; that is, they look identical
to it.</p>

<p>If we solve the problem using value functions, we get the following
solution:</p>

<p><img src="../../images/rl/l7-ds/media/image6.png" alt="" /></p>

<p>Both gray states get either a left arrow (W direction) or a right arrow
(E direction). This is not optimal, as the first gray state should have
a right arrow while the second should have a left arrow. But as the agent cannot
differentiate between them, they get the same direction.</p>

<p>On the other hand, a policy gradient approach leads to the following
solution:</p>

<p><img src="../../images/rl/l7-ds/media/image7.png" alt="" /></p>

<p>As stochastic policies are allowed, the agent may now move left with a
50% chance and right with a 50% chance in the gray states. Hence, the goal will be
reached much faster on average than with the deterministic value-based solution, which can get stuck.</p>

<p><strong>Measuring quality of policy based objective function:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image8.png" alt="" /></p>

<p><strong>Here the objective cannot be an error function.</strong> We want to
measure the quality of a policy, hence it needs to be something
different. If we know the starting state (episodic environments), it can be the
expected return from that state. In continuing environments (which never terminate), we
can use the average value or the average reward per time step.</p>

<p><strong>Here d<sup>pi_theta</sup>(s) is the stationary distribution over states, i.e. the long-run probability of being in state s
when following policy pi with parameters theta.</strong></p>

<p><img src="../../images/rl/l7-ds/media/image9.png" alt="" /></p>

<p>As gradient-based methods are usually the most efficient, we optimise
policies using those. However, it’s not necessary to restrict policy
optimization to them; gradient-free methods such as hill climbing or genetic algorithms can also be used.</p>

<p>Exploiting the sequential structure is updating the policy right after
getting a few sequential pieces of information instead of waiting till
the end of the episode to do so.</p>

<p><strong>Policy gradient:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image10.png" alt="" /></p>

<p><strong>Note that, in the case of value-based methods we were performing
gradient descent as we were trying to minimize the error (hence trying
to find the minimum of the function). In the case of policy gradient, we are
trying to maximize the objective function. Hence, we perform gradient
ascent.</strong></p>

<p><strong>Computing gradients using finite differences:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image11.png" alt="" /></p>
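<p>As a rough sketch of the finite-difference idea, each parameter is nudged by a small amount and the change in the objective is measured. The function name and the toy objective in the usage note are illustrative assumptions; in RL, J(theta) would itself be a noisy estimate obtained by running the policy.</p>

```python
def finite_difference_gradient(J, theta, eps=1e-5):
    """Estimate dJ/dtheta_k by perturbing one parameter at a time."""
    base = J(theta)
    grad = []
    for k in range(len(theta)):
        perturbed = list(theta)
        perturbed[k] += eps
        grad.append((J(perturbed) - base) / eps)
    return grad
```

<p>For a toy objective such as J(theta) = -(theta[0] - 1)**2, the estimate at theta = [0.0] is close to 2. The method needs one extra evaluation of J per parameter, which is why it is simple but inefficient for large policies; it does work even when the policy is not differentiable.</p>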

<p><strong>Score function:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image12.png" alt="" /></p>

<p>We assume that policy is differentiable whenever it is non-zero. <strong>This
means that the policy need NOT be differentiable everywhere. It only
needs to be differentiable at the right places</strong>.</p>

<p>Here, the gradient of the policy is rewritten using the gradient of the log
of the policy (the score function), since grad(pi) = pi * grad(log pi). The two forms are equivalent and, more
importantly, the log simplifies the derivative calculation.</p>

<p>That is, for softmax and Gaussian policies, the e^(x) terms
simplify because log(e^(x)) = x, and hence calculating the gradient
becomes much simpler.</p>

<p><strong>Softmax policy:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image13.png" alt="" /></p>

<p>Here, the score function can be interpreted as the feature vector of the action the agent
took (phi(s, a)) minus the average feature vector over all actions (E[phi(s, .)]).
Basically, it measures how much more or less of each feature the chosen action had than the
expected (average) amount. The parameters are then updated in that direction,
scaled by how good the action turned out to be.</p>

<p><strong>Softmax policy is usually used for discrete action spaces.</strong></p>
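<p>The softmax score function can be sketched directly from this interpretation. The code assumes a linear softmax policy over a small discrete action set in one fixed state; the function names and the feature layout are hypothetical.</p>

```python
import math

def softmax_probs(theta, features):
    """features[a] is the feature vector phi(s, a) for one fixed state s."""
    prefs = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def softmax_score(theta, features, action):
    """grad_theta log pi(s, a) = phi(s, a) - E_pi[phi(s, .)]."""
    probs = softmax_probs(theta, features)
    dim = len(theta)
    expected = [sum(p * phi[k] for p, phi in zip(probs, features)) for k in range(dim)]
    return [features[action][k] - expected[k] for k in range(dim)]
```

<p>With theta = [0, 0] and one-hot features [[1, 0], [0, 1]], both actions have probability 0.5 and the score for the first action is [0.5, -0.5].</p>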

<p><strong>Gaussian policy:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image14.png" alt="" /></p>

<p><strong>One-step MDPs:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image15.png" alt="" /></p>

<p>Consider a rudimentary scenario where the episode lasts only for a
single step. After taking one step, it is terminated, and a reward r is
obtained. The policy problem is to find a policy which would maximise
this reward.</p>

<p>We can use likelihood ratios to compute the policy gradients as shown
above. For the computation, remember the log trick.</p>

<p>We know,
<img src="../../images/rl/l7-ds/media/log_der.jpg" alt="" /></p>

<p>So, we can rewrite the gradient of the policy distribution using the log trick. The
reason we want to do this is that the result is an expectation under the policy pi
(shown above), which we can estimate by sampling actions from the policy.</p>

<p>To do this, we multiply and divide by the policy
distribution in the gradient of the cost function. That is,</p>

<p><img src="../../images/rl/l7-ds/media/grad.jpg" alt="" /></p>

<p>If we compare this with the derivative of the log function, we can see
how we got the final gradient shown in the image above. The sums weighted by d(s)
and pi(s, a) left in the equation are exactly an expectation over states and actions. Hence, we
are simply calculating an expected value, which can be estimated from samples!</p>
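<p>To make the log trick concrete, the sketch below checks numerically that the likelihood-ratio form matches the ordinary gradient of the expected reward. It assumes a one-step MDP with a single state and a softmax policy with one preference parameter per action; the reward values in the usage note are made up for illustration.</p>

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(theta, rewards):
    """J(theta) = sum_a pi(a) * r(a)."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def likelihood_ratio_gradient(theta, rewards):
    """sum_a pi(a) * grad log pi(a) * r(a), with grad_k log pi(a) = 1[a == k] - pi(k)."""
    pi = softmax(theta)
    n = len(theta)
    return [
        sum(pi[a] * ((1.0 if a == k else 0.0) - pi[k]) * rewards[a] for a in range(n))
        for k in range(n)
    ]

def numerical_gradient(theta, rewards, eps=1e-6):
    """Forward-difference gradient of J, used only as a cross-check."""
    base = expected_reward(theta, rewards)
    grad = []
    for k in range(len(theta)):
        shifted = list(theta)
        shifted[k] += eps
        grad.append((expected_reward(shifted, rewards) - base) / eps)
    return grad
```

<p>For example, with theta = [0.1, -0.3, 0.5] and rewards = [1.0, 0.0, 2.0], the two gradients agree up to the finite-difference error. In practice the sum over actions is replaced by an average over sampled actions.</p>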

<p><strong>Generalization of this idea:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image16.png" alt="" /></p>

<p>The policy gradient theorem states that by simply replacing the one-step
instantaneous reward r by the long-term action value Q, we get the
policy gradient for multi-step MDPs.</p>

<p><strong>Monte-Carlo policy gradient:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image17.png" alt="" /></p>

<p>Here, the long-term value Q is estimated by the return observed from that time step until the end of
the episode, which is an unbiased sample of Q.</p>

<p><strong>Reducing variance:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image18.png" alt="" /></p>

<p>The problem with the previous policy-gradient update is that it
has high variance. This can be reduced using the actor-critic
model.</p>

<p>Here, the actor is the one who actually takes the decisions and performs
the action. The critic is only there to evaluate. The actor will navigate
the environment, take some action and get some reward. The critic then
evaluates how good/bad the action taken was and updates the
action-value function accordingly. The actor then updates the policy in
the direction suggested by the critic.</p>

<p><strong>Job of the critic:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image19.png" alt="" /></p>

<p>The job of the critic is to perform policy evaluation. This can be done
using MC evaluation, TD(0) or TD(lambda).</p>

<p><strong>Action value actor critic:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image20.png" alt="" /></p>

<p>Hence, the critic uses linear TD(0) to approximate the action-value
function and update it while the policy gets updated using policy
gradient.</p>

<p><strong>Bias in actor-critic algorithms:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image21.png" alt="" /></p>

<p>As we are approximating the gradient (notice that the true Q value is not
used; the actor-critic model uses a linear approximation of it), bias
is introduced in the algorithm. Hence, the right solution may not be
reached.</p>

<p><strong>This problem can be solved using the compatible function approximation theorem:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image22.png" alt="" /></p>

<p><strong>Basically, over here the features are the score of the policy.</strong></p>

<p><strong>Reducing variance:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image23.png" alt="" /></p>

<p>One way of reducing variance is to subtract a baseline function B(s).
The baseline term contributes 0 to the expected gradient, so the
expectation of the update is unchanged. We can see this because B(s) and
the gradient operator can be taken out of the summation over actions, the
summation of pi(s, a) over actions is 1, and gradient(1) = 0.</p>
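
<p>The zero-expectation argument can be verified numerically for a single state. This is a small sketch assuming a softmax policy; the preferences and the baseline value are arbitrary, hypothetical numbers.</p>

```python
import math

def softmax(prefs):
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

theta = [0.2, 1.5, -0.7]   # hypothetical action preferences in one state
pi = softmax(theta)
baseline = 3.7             # any value that does not depend on the action

# E_pi[ score(a) * B ] = sum_a pi(a) * d log pi(a)/d theta_k * B, one entry per k
expected_baseline_term = [
    sum(pi[a] * ((1.0 if k == a else 0.0) - pi[k]) * baseline
        for a in range(len(theta)))
    for k in range(len(theta))
]
```

<p>Every entry of <code>expected_baseline_term</code> is zero up to floating-point rounding, whatever value is chosen for the baseline.</p>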

<p>Now, a good choice of baseline is the state value function V(s): it
depends only on the state and not on the action, so it satisfies the
condition above, and it is the representative value of that state.</p>

<p>So, we can subtract the state value V from the action value Q. The
resulting value, A = Q – V, basically tells us how much “advantage” we
are gaining by taking action a in state s.</p>

<p>Then the score function gradient can be updated by considering this
advantage function.</p>

<p><strong>How to estimate the advantage function?</strong></p>

<p><img src="../../images/rl/l7-ds/media/image24.png" alt="" /></p>

<p>One way is to use two function approximators (one for V and one for Q)
and update both over time to get better approximations.</p>

<p><strong>Representing advantage function in the form of TD error:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image25.png" alt="" /></p>

<p>The TD error can also be used as a sample estimate of the advantage
function. This is because the Q value is, in expectation, r + gamma*V(s’)
(as per Bellman’s equation), and the TD error is this quantity – V(s).</p>
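
<p>As a small illustration, the TD error for one transition can be computed as follows. The numbers are made up, and the <code>done</code> flag is an added assumption for handling terminal transitions.</p>

```python
def td_error(reward, value_s, value_next, gamma, done=False):
    """delta = r + gamma * V(s') - V(s); V(s') counts as 0 after a terminal step."""
    target = reward + (0.0 if done else gamma * value_next)
    return target - value_s

# hypothetical transition: r = 1, V(s) = 2, V(s') = 3, gamma = 0.9
delta = td_error(reward=1.0, value_s=2.0, value_next=3.0, gamma=0.9)
```

<p>A positive <code>delta</code> means the action turned out better than the critic expected from state s, so the actor should make it more likely.</p>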

<p><strong>Critics at different time scales:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image26.png" alt="" /></p>

<p>Value function can be estimated at different
time steps (scales) using the techniques shown above.</p>

<p><strong>Actors at different time steps:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image27.png" alt="" /></p>

<p>Similarly, actors can perform at different time steps.</p>

<p><strong>Actor update with eligibility traces:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image28.png" alt="" /></p>

<p><strong>Alternative policy gradient directions:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image29.png" alt="" /></p>

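<p>One of the problems with the vanilla policy gradient is that it is
sensitive to how the policy is parametrized: a policy can often be
reparametrized without changing its action probabilities, yet the
gradient direction changes. So, the direction we follow is not
necessarily a good ascent direction, and convergence may take a lot of
time.</p>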

<p><strong>Natural policy gradients:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image30.png" alt="" /></p>

<p>The natural policy gradient is independent of the parametrization. It
rescales the vanilla gradient by the inverse of the Fisher information
matrix of the policy, which gives the ascent direction closest to the
vanilla gradient when the policy is changed by a small, fixed amount.</p>

<p><strong>Natural actor-critic:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image31.png" alt="" /></p>

<p><strong>Summary:</strong></p>

<p><img src="../../images/rl/l7-ds/media/image32.png" alt="" /></p>

<p>Hence, we can see that policy gradient has many different forms. The
different forms will reduce variance etc differently.</p>]]></content><author><name>Omkar Ranadive</name><email>omkar.ranadive@u.northwestern.edu</email></author><category term="reinforcement-learning" /><category term="reinforcement-learning" /><summary type="html"><![CDATA[Lecture Details Title: Policy Gradients Description: The lecture notes are based on David Silver’s lecture video. Video link: RL Course by David Silver - Lecture 7 Lecture Slides: Slides]]></summary></entry><entry><title type="html">Lecture 6 - Value Function Approximation [Notes]</title><link href="https://omkar-ranadive.github.io/posts/rl-l6-ds" rel="alternate" type="text/html" title="Lecture 6 - Value Function Approximation [Notes]" /><published>2020-05-01T00:00:00-07:00</published><updated>2020-05-01T00:00:00-07:00</updated><id>https://omkar-ranadive.github.io/posts/rl-l6-ds</id><content type="html" xml:base="https://omkar-ranadive.github.io/posts/rl-l6-ds"><![CDATA[<hr />
<p><strong>Lecture Details</strong></p>
<ul>
  <li><strong>Title:</strong> Value function approximation</li>
  <li><strong>Description:</strong> The lecture notes are based on David Silver’s lecture video.</li>
  <li><strong>Video link:</strong> <a href="https://www.youtube.com/playlist?list=PLbPhAbAhvjUyrKlhnLEMyNmiF72ABB3Zh" target="_blank">RL Course by David Silver - Lecture 6</a></li>
  <li><strong>Lecture Slides:</strong>  <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html" target="_blank">Slides</a></li>
</ul>

<p><strong>Credits: All images used in this post are courtesy of David Silver</strong></p>

<hr />

<p><strong>Why are function approximators required?</strong></p>

<p>Complex reinforcement learning problems like learning the game of Go
have a huge state space (10^170 states for Go). Finding the exact value
of all such states is not computationally feasible. Hence, function
approximators are required to solve real-world, large-scale problems.</p>

<p><img src="../../images/rl/l6-ds/media/image1.png" alt="" /></p>

<p>One huge advantage of function approximators is that we <strong>can generalize
from seen states to unseen states. That is, we don’t need to visit all
the states to estimate their values. Once we approximate a function well
enough, any state can be approximated well.</strong></p>

<p><strong>Types of function approximator:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image2.png" alt="" /></p>

<p>Function approximators may take only the state as input or the state
action pair (s, a) as input. Then we can output the state-value
function, action value or action values for all actions as shown above.</p>

<p><strong>Note – Here w = weight matrix</strong></p>

<p><strong>Different approximators:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image3.png" alt="" /></p>

<p><img src="../../images/rl/l6-ds/media/image4.png" alt="" /></p>

<p>Neural networks and linear combinations of features are widely used as
<strong>they are differentiable</strong>.</p>

<p>Note that, in RL, the data is non-stationary (we are learning
while exploring the environment) and it is also non-iid (iid =
independent and identically distributed); that is, the time sequence of
data matters.</p>

<p><strong>Incremental methods:</strong></p>

<p><strong>Basics of gradient descent:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image5.png" alt="" /></p>

<p><strong>Value function approximation using stochastic gradient descent:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image6.png" alt="" /></p>

<p><strong>Here, assume that the actual value v<sub>pi</sub>(S) is known to us.</strong>
Then we are simply calculating the squared error between the predicted
value v_cap and the actual known value.</p>

<p><strong>Feature vectors:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image7.png" alt="" /></p>

<p>A state can be represented using features. This is useful because we can
now pass features to the neural network to better approximate the value
function.</p>

<p><strong>Linear value function approximation:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image8.png" alt="" /></p>

<p>The value function can be represented as a linear combination of the
features x(S) and weight matrix w. (x(S)<sup>T</sup>*W)</p>

<p><strong>Intuitive thinking:</strong> By representing it in such a way, we can see
that the squared error will become quadratic in nature. Hence, the plot
of J(W) will be a quadratic curve. We know that quadratic functions have
a global optimum and hence, we can say that this algorithm will converge
to the global optimum.</p>
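
<p>A minimal sketch of the resulting stochastic gradient update for a linear approximator is shown below. It assumes, as in the text above, that the true values are known; the three training pairs are hypothetical and were generated from the weights [2, -1].</p>

```python
def predict(w, x):
    """Linear value estimate v_hat = x^T w."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_step(w, x, target, alpha):
    """w = w + alpha * (target - x^T w) * x, a gradient step on the squared error."""
    error = target - predict(w, x)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

# hypothetical (features, true value) pairs, consistent with w* = [2, -1]
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = [0.0, 0.0]
for _ in range(200):
    for x, v in data:
        w = sgd_step(w, x, v, alpha=0.1)
```

<p>Because the error surface is quadratic in w, repeated small steps settle at the single minimum, which here is the weight vector that generated the data.</p>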

<p><strong>Table lookup:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image9.png" alt="" /></p>

<p>The case of table lookup can also be shown in the form of features.
<strong>The feature vector in this case will have rows = number of states and
the ith entry will be 1 if the current state is Si, else it will be 0.</strong></p>

<p><strong>Note this is only to show the relationship between the previous table
lookup algorithm and current neural net implementation.</strong></p>

<p><strong>Incremental Prediction algorithm:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image10.png" alt="" /></p>

<p>Till now we assumed that Vpi(S) was known to us. But this won’t be the
case in reality. <strong>Hence, we approximate it using return G<sub>t</sub>
for Monte Carlo and the usual TD estimate for TD. Similarly, the lambda
return G<sub>t</sub> is used for TD(lambda).</strong></p>

<p><strong>Why isn’t the derivative of v_cap(S<sub>t+1</sub> , W) calculated in
TD(0)?</strong></p>

<p>The interesting thing is that in TD(0) the “actual” value Vpi is
estimated using R<sub>t+1</sub> + gamma*V_cap(S<sub>t+1</sub>, W).
<strong>Here the V_cap entry is the value output by the neural network
itself.</strong> Hence, we are using the neural network’s approximation to
improve the neural network. This works over time as R<sub>t+1</sub> is
the actual reward. Hence, by updating it every time step, we slowly
bring it closer to the true estimate. But notice that we are ignoring
the derivative of V_cap(S<sub>t+1</sub>, W) and only calculating it for
V_cap(S<sub>t</sub> , W). <strong>This is because we want to move forward.</strong>
When we calculate the fastest rate of change from state S at time step
t, we get the direction in which to move the current estimate towards
the target. If we also calculated it for S at time step t+1, we would be
pulling the target towards the current estimate at the same time, i.e.
pulling in both directions.</p>
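
<p>The following sketch shows what “ignoring the derivative of the target” looks like for a linear approximator. The two-state chain (A, then B, then termination, with a reward of 1 on the last step) and the one-hot features are hypothetical choices made to keep the example small.</p>

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def td0_step(w, x, reward, x_next, alpha, gamma, done=False):
    """TD(0) step: the target is treated as a constant, so only the
    gradient of v_hat(S_t, w), which is x for a linear model, appears."""
    target = reward + (0.0 if done else gamma * predict(w, x_next))
    delta = target - predict(w, x)
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]

features = {"A": [1.0, 0.0], "B": [0.0, 1.0]}   # one-hot features
w = [0.0, 0.0]
for _ in range(500):
    # episode: A -> B with reward 0, then B -> terminal with reward 1
    w = td0_step(w, features["A"], 0.0, features["B"], alpha=0.1, gamma=1.0)
    w = td0_step(w, features["B"], 1.0, None, alpha=0.1, gamma=1.0, done=True)
```

<p>Both weights approach 1, the true value of either state in this chain, even though the target for A was always built from the network’s own estimate of B.</p>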

<p><strong>However, in some cases taking both derivatives may provide better
results.</strong></p>

<p><strong>Monte-Carlo with value-function approximation:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image11.png" alt="" /></p>

<p>Remember in Monte Carlo we first run through the entire episode. Hence,
we would collect tuples (S1, G1), (S2, G2)..(St, Gt) at the end of each
episode. Then these tuples can be used to perform an update in the right
direction. Hence, these tuples can be thought of as training data and
the problem reduces down to a supervised learning problem per episode.</p>

<p><strong>TD learning for value function approximation:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image12.png" alt="" /></p>

<p>In case of TD learning, we aren’t getting the actual rewards. It’s an
estimate hence, the training data will also be an estimate. <strong>Also, the
update will be performed each time step.</strong></p>

<p><strong>TD(lambda) with value-function approximations:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image13.png" alt="" /></p>

<p>Notice that in backward-view linear TD(lambda), the eligibility trace at
time step t is the decayed trace from time step t-1 (scaled by
gamma*lambda) plus <strong>x(St). Here we consider the features at step t
(for the linear case). Note that this is basically the gradient
of v_cap(St, w), which in the case of a linear combination reduces to
x(St).</strong></p>
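
<p>One step of this backward-view update could be sketched as follows for the linear case. The function name, the starting trace and the transition are illustrative assumptions.</p>

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def td_lambda_step(w, e, x, reward, x_next, alpha, gamma, lam, done=False):
    """Backward-view linear TD(lambda): decay the trace, add the current
    features, then move every weight in proportion to its trace."""
    e = [gamma * lam * ei + xi for ei, xi in zip(e, x)]
    target = reward + (0.0 if done else gamma * predict(w, x_next))
    delta = target - predict(w, x)
    w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]
    return w, e

# hypothetical step: the first feature was active earlier (trace 0.5),
# the second feature is active now, and the episode ends with reward 1
w, e = [0.0, 0.0], [0.5, 0.0]
w, e = td_lambda_step(w, e, x=[0.0, 1.0], reward=1.0, x_next=None,
                      alpha=0.1, gamma=0.9, lam=0.5, done=True)
```

<p>The weight of the earlier feature also receives part of the credit for the reward, scaled down by its decayed trace.</p>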

<p><strong>Control with value-function approximation:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image14.png" alt="" /></p>

<p>As we saw previously, action-value functions need to be used over
state-value functions in case of model-free environments. Hence, we
would instead approximate the action-value function in such cases.</p>

<p><img src="../../images/rl/l6-ds/media/image15.png" alt="" /></p>

<p><strong>Linear action-value representation:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image16.png" alt="" /></p>

<p><strong>Incremental algorithms:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image17.png" alt="" /></p>

<p><strong>Bootstrapping:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image18.png" alt="" /></p>

<p>The graphs show that in most of the cases, bootstrapping (choosing TD
lambda with lambda between 0 and 1) is <strong>usually a good idea.</strong></p>

<p><strong>Convergence of prediction algorithms:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image19.png" alt="" /></p>

<p>It’s important to understand which algorithm may not converge as in some
cases, the derivatives may shoot in the wrong direction and give
catastrophic results.</p>

<p><strong>Improvements: Gradient TD</strong></p>

<p><img src="../../images/rl/l6-ds/media/image20.png" alt="" /></p>

<p>Remember that in TD, the target Rt+1 + gamma*q_cap(St+1, a, W) was
produced by the neural network itself, and we ignored its derivative.
Hence, TD didn’t follow the true gradient of any objective function.
Gradient TD solves this problem by following the true gradient of the
projected Bellman error.</p>

<p><strong>Convergence of Control: (Note that control algorithms will optimal
solution)</strong></p>

<p><img src="../../images/rl/l6-ds/media/image21.png" alt="" /></p>

<p><strong>Batch reinforcement learning:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image22.png" alt="" /></p>

<p>In incremental reinforcement learning we were using the (S, A) tuples
<strong>only once.</strong> After updating, we were throwing away that tuple.
Updating the gradient once is not enough to squeeze out all information
from the tuple.</p>

<p><strong>Example:</strong> A game may have different levels. After starting level 2
which may be different from level 1, our agent will start losing
information of level 1 as it will be overshadowed and forgotten due to
the current incoming tuples of level 2.</p>

<p><img src="../../images/rl/l6-ds/media/image23.png" alt="" /></p>

<p>This can be solved by using experience replay where we store all the
tuples and then choose a random sample from it at every time step.</p>
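
<p>A bare-bones replay memory might look like the sketch below. The class name, the fixed capacity and the tuple layout are assumptions for illustration; real implementations usually store more fields (for example a terminal flag).</p>

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and hands back random mini-batches."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)   # oldest tuples drop out when full
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # sampling without replacement from everything stored so far
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):                              # hypothetical transitions
    buffer.push(state=t, action=0, reward=0.0, next_state=t + 1)
batch = buffer.sample(2)
```

<p>Each update then uses <code>batch</code> instead of only the most recent transition, so old experience keeps contributing to learning.</p>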

<p><img src="../../images/rl/l6-ds/media/image24.png" alt="" /></p>

<p><strong>Experience replay also converges to least square solution.</strong></p>

<p><strong>Experience Replay in Deep Q-networks (DQN):</strong></p>

<p><img src="../../images/rl/l6-ds/media/image25.png" alt="" /></p>

<p>State-of-the-art DQNs use experience replay to solve the problem of
forgetting the previous tuples and to squeeze the maximum information out
of each tuple by keeping the tuples in memory and sampling a batch from
them in every iteration. <strong>This also helps mitigate the strong
correlation between consecutive samples.</strong></p>

<p><strong>Fixed Q-targets:</strong></p>

<p>The other improvement used is the fixed Q targets. This is like the off
policy learning of Q learning where we had two policies: behavior and
target. The q-value of state s’ was chosen from target policy while the
current action was chosen from the behavior policy.</p>

<p><strong>Similarly, here we keep a copy of the old q-learning targets. That is
there are two networks. Old DQN and the present DQN. After every n
iterations, say 1000 iterations, Old will be set to present.</strong></p>

<p><strong>But within those iterations the Q(s’, a’, w_) will be chosen from the
old DQN. This helps stabilize the network.</strong></p>
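
<p>The idea of a frozen copy can be sketched with a linear Q-function standing in for the network. Everything here (the function names, the transition format and the numbers) is a simplified assumption; an actual DQN would use a neural network and mini-batches from the replay memory.</p>

```python
def q_value(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_with_fixed_targets(transitions, n_features, alpha, gamma, sync_every):
    """Q-learning updates whose targets come from a frozen copy of the weights.

    Each transition is (x, reward, next_xs): the features of the chosen
    (s, a) pair, the reward, and the features of every action available
    in s' (an empty list if s' is terminal).
    """
    w = [0.0] * n_features        # "present" weights, updated every step
    w_old = list(w)               # "old" weights, used only for targets
    for step, (x, reward, next_xs) in enumerate(transitions, start=1):
        best_next = max((q_value(w_old, nx) for nx in next_xs), default=0.0)
        delta = reward + gamma * best_next - q_value(w, x)
        w = [wi + alpha * delta * xi for wi, xi in zip(w, x)]
        if step % sync_every == 0:
            w_old = list(w)       # old is set to present
    return w, w_old

# hypothetical data: the same terminal transition seen three times
transitions = [([1.0, 0.0], 1.0, [])] * 3
w, w_old = train_with_fixed_targets(transitions, n_features=2,
                                    alpha=0.5, gamma=0.9, sync_every=2)
```

<p>Between two synchronizations the targets do not move, which is what keeps the updates from chasing their own changes.</p>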

<p><strong>Linear least square prediction:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image26.png" alt="" /></p>

<p>For fairly small problems (where the number of features is small), we
can instead use linear algebra to directly get the approximate values
instead of using a neural network.</p>

<p><img src="../../images/rl/l6-ds/media/image27.png" alt="" /></p>

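<p>As we can see, the w matrix is calculated by taking the inverse of the
matrix obtained by summing x(St)*x(St)<sup>T</sup> over all samples and
multiplying it by the sum of x(St)*Vt. <strong>This only works when N
(the number of features) is small, as solving for w directly takes on
the order of N^3 operations.</strong></p>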
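
<p>A direct solution along these lines can be sketched in plain Python. The small Gauss-Jordan solver and the three sample pairs are assumptions for illustration; in practice a linear algebra library would be used instead.</p>

```python
def solve(a, b):
    """Solve a w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares_weights(samples):
    """w = (sum of x x^T)^-1 (sum of x v) for (features, value) samples."""
    n = len(samples[0][0])
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for x, v in samples:
        for i in range(n):
            b[i] += x[i] * v
            for j in range(n):
                a[i][j] += x[i] * x[j]
    return solve(a, b)

# hypothetical (features, value) pairs, consistent with w = [2, -1]
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = least_squares_weights(samples)
```

<p>The weights are obtained in one shot, without any step size or repeated passes over the data.</p>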

<p>In practice:</p>

<p><img src="../../images/rl/l6-ds/media/image28.png" alt="" /></p>

<p><img src="../../images/rl/l6-ds/media/image29.png" alt="" /></p>

<p>Convergence:</p>

<p><img src="../../images/rl/l6-ds/media/image30.png" alt="" /></p>

<p><strong>Hence, linear algorithms will lead to the global optimum.</strong></p>

<p><strong>Least square policy evaluation:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image31.png" alt="" /></p>

<p>We can use Q-learning with least squared error between the q values for
evaluating policies.</p>

<p><img src="../../images/rl/l6-ds/media/image32.png" alt="" /></p>

<p><strong>Least square control:</strong></p>

<p><img src="../../images/rl/l6-ds/media/image33.png" alt="" /></p>

<p><img src="../../images/rl/l6-ds/media/image34.png" alt="" /></p>]]></content><author><name>Omkar Ranadive</name><email>omkar.ranadive@u.northwestern.edu</email></author><category term="reinforcement-learning" /><category term="reinforcement-learning" /><summary type="html"><![CDATA[Lecture Details Title: Value function approximation Description: The lecture notes are based on David Silver’s lecture video. Video link: RL Course by David Silver - Lecture 6 Lecture Slides: Slides]]></summary></entry><entry><title type="html">Importance Sampling</title><link href="https://omkar-ranadive.github.io/posts/stats-IS" rel="alternate" type="text/html" title="Importance Sampling" /><published>2020-04-27T00:00:00-07:00</published><updated>2020-04-27T00:00:00-07:00</updated><id>https://omkar-ranadive.github.io/posts/importance-sampling</id><content type="html" xml:base="https://omkar-ranadive.github.io/posts/stats-IS"><![CDATA[<hr />
<p><strong>Original Video link:</strong></p>
<ul>
  <li><a href="https://youtu.be/S3LAOZxGcnk">Importance Sampling: Intro</a></li>
  <li><a href="https://youtu.be/3Mw6ivkDVZc">Importance Sampling: Intuition</a></li>
  <li><a href="https://youtu.be/gYvlnu5AAzE">Importance Sampling: Normalizing constants</a></li>
</ul>

<p><strong>Credits: All images used in this post are courtesy of <a href="https://www.youtube.com/channel/UCcAtD_VYwcYwVbTdvArsm7w">Mathematical Monk</a></strong></p>
<hr />

<p><strong>Importance Sampling: Introduction</strong></p>

<p><img src="../../images/stats/Importance_Sampling/media/image1.png" alt="" /></p>

<p>Sampling is actually a misnomer. Using Importance Sampling we are
essentially approximating the <strong>expected value</strong> of a function f(X)
under some distribution p(X) using samples drawn from another
distribution q(X).</p>

<p>We use importance sampling when it is difficult to grab samples from
the original distribution p(x), so we sample from q(x) instead. We might
also use importance sampling when we want to give “importance” to certain
areas of the original distribution. Say we want to grab more samples from
regions of the original distribution that occur rarely: we can design
our q(x) such that we grab more samples from those regions.</p>

<p>Another important thing to note is, <strong>even though we are assuming it’s
difficult to grab samples from p(x), we still should be able to
calculate value of p(x) given some x.</strong></p>

<p>Mathematically, we just multiply and divide the expected value formula
by q(x) as shown above. (Here q(x) must be greater than zero wherever
p(x) is greater than zero; equivalently, p(x) must be zero wherever
q(x) = 0. This is called absolute continuity.)</p>

<p>The p(x)/q(x) term can be basically thought of as a “weight” term.
Hence, the formula becomes:</p>

<p><img src="../../images/stats/Importance_Sampling/media/image2.png" alt="" /></p>
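
<p>The weighted average can be tried out on a toy problem. In this sketch p is a standard normal, q is a wider normal and f(x) = x^2; all three are assumptions chosen because the exact answer, E_p[x^2] = 1, is known.</p>

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def importance_estimate(f, p_pdf, q_pdf, q_sampler, n):
    """Average of f(x) * p(x) / q(x) over n samples drawn from q."""
    total = 0.0
    for _ in range(n):
        x = q_sampler()
        total += f(x) * p_pdf(x) / q_pdf(x)   # weight w(x) = p(x) / q(x)
    return total / n

rng = random.Random(0)
estimate = importance_estimate(
    f=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 0.0, 1.0),   # target p = N(0, 1)
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),   # proposal q = N(0, 2^2)
    q_sampler=lambda: rng.gauss(0.0, 2.0),
    n=100_000,
)
```

<p>No sample is ever drawn from p itself; p is only evaluated at the points produced by q, and <code>estimate</code> still lands close to 1.</p>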

<p><strong>Can importance sampling estimate even better than original P(X)?</strong></p>

<p><img src="../../images/stats/Importance_Sampling/media/image3.png" alt="" /></p>

<p>So, the cool thing about importance sampling is that if we choose our
q(x) correctly then we might even be able to estimate better than
directly estimating from p(x). <strong>We do this by reducing the variance
term.</strong></p>

<p><strong>Intuition behind choosing a “good” Q:</strong></p>

<p><img src="../../images/stats/Importance_Sampling/media/image4.png" alt="" /></p>

<p>Consider the figure shown above. Let the red line denote the return
which we get (here the return is just the value f(x) of the function
whose expectation we want), and let p(x) represent our probability
distribution. As we can see, most of the density is away from the huge
negative spike in return. So, if we use plain Monte Carlo sampling from
p(x), we won’t be able to estimate the true average properly, as we will
mostly be grabbing samples from the dense area, and it’s very unlikely
that we draw an x where f(x) hits that huge negative spike. But from the
equation we can see that the spike greatly affects the expected value:
even though p(x) is small there, |f(x)| is so large that the f(x)*p(x)
entries still influence the overall expected value. If we choose q(x) as
shown above, we can solve this problem.</p>

<p><strong>Example of a bad q(x):</strong></p>

<p><img src="../../images/stats/Importance_Sampling/media/image5.png" alt="" /></p>

<p>In this example, we can see that q(x) covers irrelevant regions and so
it will be a bad estimate of the actual expected value.</p>

<p><strong>Looking at this from another perspective:</strong> However, this also shows
the power of importance sampling. If for some reason we want to sample
more from these “irrelevant” regions then we can simply design our q(x)
such that we end up sampling from these regions. So, using importance
sampling we are able to choose regions of importance to sample from.</p>

<p><strong>But in general, we choose q(x) such that q(x) is large wherever
|f(x)|*p(x) is large.</strong></p>

<p><strong>Importance sampling without normalization</strong></p>

<p><img src="../../images/stats/Importance_Sampling/media/image6.png" alt="" /></p>

<p>Till now we were assuming that we could calculate f(x) and w(x)
(p(x)/q(x)) efficiently for all values of x. However, in reality, we
might know p(x) and q(x) only up to a normalizing constant. So, how do
we calculate w(x) efficiently in this case?</p>

<p>Look at the image shown above. We can see how p(x) and q(x) can be
expressed in terms of normalizing constants. Here we say Zp and Zq are
unknown to us. The right-hand side integrals hold because the integrals
of p(x) and q(x) must each be 1; hence p_tilde(x) and q_tilde(x) must
integrate to Zp and Zq respectively, so that each fraction integrates
to 1.</p>

<p>So, let’s start by rewriting the equations in terms of normalizing
constants:</p>

<p><img src="../../images/stats/Importance_Sampling/media/image7.png" alt="" /></p>

<p>But we still don’t know Zq/Zp.</p>

<p>However, we can perform Monte Carlo approximations of these as follows:</p>

<p><img src="../../images/stats/Importance_Sampling/media/image8.png" alt="" /></p>
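
<p>As a sketch of this approximation, assume the unnormalized densities below (a standard normal and a normal with standard deviation 2, both with their constants dropped). Averaging p_tilde/q_tilde over samples from q estimates Zp/Zq, whose exact value for this choice is 0.5, and dividing the weighted sum by the sum of the weights gives a self-normalized estimate of E_p[x^2], which should be close to 1.</p>

```python
import math
import random

def p_tilde(x):
    return math.exp(-0.5 * x * x)     # N(0, 1) without its constant Zp

def q_tilde(x):
    return math.exp(-x * x / 8.0)     # N(0, 2^2) without its constant Zq

rng = random.Random(0)
samples = [rng.gauss(0.0, 2.0) for _ in range(100_000)]   # drawn from q
weights = [p_tilde(x) / q_tilde(x) for x in samples]      # unnormalized weights

# Monte Carlo approximation of Zp / Zq
ratio_estimate = sum(weights) / len(weights)

# self-normalized estimate of E_p[x^2]; Zp and Zq are never needed
self_normalized = sum(w * x * x for w, x in zip(weights, samples)) / sum(weights)
```

<p>Neither normalizing constant is ever computed, which is the whole point of this variant.</p>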

<p>So, the final equation becomes:</p>

<p><img src="../../images/stats/Importance_Sampling/media/image9.png" alt="" /></p>]]></content><author><name>Omkar Ranadive</name><email>omkar.ranadive@u.northwestern.edu</email></author><category term="statistics" /><category term="statistics" /><summary type="html"><![CDATA[Original Video link: Importance Sampling: Intro Importance Sampling: Intuition Importance Sampling: Normalizing constants]]></summary></entry><entry><title type="html">Variational Lower Bounds (ELBO)</title><link href="https://omkar-ranadive.github.io/posts/stats-ELBO" rel="alternate" type="text/html" title="Variational Lower Bounds (ELBO)" /><published>2020-04-20T00:00:00-07:00</published><updated>2020-04-20T00:00:00-07:00</updated><id>https://omkar-ranadive.github.io/posts/ELBO</id><content type="html" xml:base="https://omkar-ranadive.github.io/posts/stats-ELBO"><![CDATA[<p><strong>Original Video link:</strong></p>
<ul>
  <li><a href="https://youtu.be/pStDscJh2Wo">Variational Lower Bounds</a></li>
</ul>

<p><strong>Credits: All images used in this post are courtesy of <a href="https://www.youtube.com/channel/UCiDouKcxRmAdc5OeZdiRwAg">Hugo Larochelle</a></strong></p>
<hr />

<p>The idea of putting a lower bound on the log of a function is based on
the concavity of the logarithm:</p>

<p><img src="../../images/stats/ELBO/media/image1.png" alt="" /></p>

<p>As the logarithmic function is concave, it is true that log(sum(wi*ai)) &gt;=
sum(wi*log(ai)) whenever the weights wi are non-negative and sum to 1
(this is Jensen’s inequality).</p>
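
<p>A quick numerical check of the inequality, with arbitrary, made-up weights and positive values:</p>

```python
import math

weights = [0.2, 0.5, 0.3]     # non-negative and summing to 1
values = [1.0, 4.0, 10.0]     # any positive numbers

log_of_mean = math.log(sum(w * a for w, a in zip(weights, values)))
mean_of_log = sum(w * math.log(a) for w, a in zip(weights, values))
```

<p>Here <code>log_of_mean</code> is the larger of the two; they would only be equal if all the values were the same.</p>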

<p>We exploit this idea to have lower bounds on our likelihood functions.</p>

<p><img src="../../images/stats/ELBO/media/image2.png" alt="" /></p>

<p>Here h<sup>(1)</sup> is the latent variable; i.e., we say that the
probability of our data x is based on latent variables h (variables
which we cannot directly observe but which we can use to make inferences
about our data).</p>

<p>As we can see, we have simply applied the log concavity idea to get the
&gt;= shown above.</p>

<p><strong>Also, note that here the distribution q(h|x) is any arbitrary
distribution. The choice of this distribution is up to us.</strong></p>

<p><img src="../../images/stats/ELBO/media/image3.png" alt="" /></p>

<p>We can see an interesting property. If the chosen distribution q(h|x) is
same as p(h|x) then the right hand side equation reduces to log(p(x));
hence we are just calculating the likelihood directly.</p>
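
<p>This property can be seen on a tiny discrete example. The joint probabilities below are hypothetical; h takes only two values and x is a single fixed observation.</p>

```python
import math

def elbo(joint, q):
    """sum over h of q(h) * log(p(x, h) / q(h)) for one fixed observation x."""
    return sum(qh * math.log(ph / qh) for ph, qh in zip(joint, q) if qh > 0)

joint = [0.1, 0.3]                            # hypothetical p(x, h) for h = 0, 1
evidence = sum(joint)                         # p(x)
posterior = [ph / evidence for ph in joint]   # p(h|x)

log_px = math.log(evidence)
bound_uniform = elbo(joint, [0.5, 0.5])       # an arbitrary choice of q
bound_posterior = elbo(joint, posterior)      # q set to the true posterior
```

<p>The bound for the arbitrary q lies strictly below log(p(x)), while the bound for the posterior matches it up to rounding.</p>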

<p><strong>Now, we know p(x, h) = p(x|h)*p(h), hence, log(p(x, h)) will be
log(p(x|h)) + log(p(h)).</strong></p>

<p>We use this idea as follows:</p>

<p><img src="../../images/stats/ELBO/media/image4.png" alt="" /></p>

<p>Then, we get the following:</p>

<p><img src="../../images/stats/ELBO/media/image5.png" alt="" /></p>

<p>Which can be refactored as follows:</p>

<p><img src="../../images/stats/ELBO/media/image6.png" alt="" /></p>

<p><img src="../../images/stats/ELBO/media/image7.png" alt="" /></p>

<p><strong>Note:</strong> The negative sign in front of KL divergence is because we need
to invert log(p(x)/q(x)) to</p>

<p>-log(q(x)/p(x)) to bring it in KL divergence form.</p>]]></content><author><name>Omkar Ranadive</name><email>omkar.ranadive@u.northwestern.edu</email></author><category term="statistics" /><category term="statistics" /><summary type="html"><![CDATA[Original Video link: Variational Lower Bounds]]></summary></entry><entry><title type="html">Lecture 4 - Model Free Techniques - MC and TD[Notes]</title><link href="https://omkar-ranadive.github.io/posts/rl-l4-ds" rel="alternate" type="text/html" title="Lecture 4 - Model Free Techniques - MC and TD[Notes]" /><published>2020-03-10T00:00:00-07:00</published><updated>2020-03-10T00:00:00-07:00</updated><id>https://omkar-ranadive.github.io/posts/rl-l4-ds</id><content type="html" xml:base="https://omkar-ranadive.github.io/posts/rl-l4-ds"><![CDATA[<hr />
<p><strong>Lecture Details</strong></p>
<ul>
  <li><strong>Title:</strong> Model Free Techniques - Monte Carlo and Temporal Difference</li>
  <li><strong>Description:</strong> The lecture notes are based on David Silver’s lecture video.</li>
  <li><strong>Video link:</strong> <a href="https://www.youtube.com/playlist?list=PLbPhAbAhvjUyrKlhnLEMyNmiF72ABB3Zh" target="_blank">RL Course by David Silver - Lecture 4</a></li>
  <li><strong>Lecture Slides:</strong>  <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html" target="_blank">Slides</a></li>
</ul>

<p><strong>Credits: All images used in this post are courtesy of David Silver</strong></p>

<hr />

<p><strong>Model-free reinforcement learning:</strong></p>

<p>In model free techniques, the model of the environment is not known.
<strong>Hence, we have no knowledge about the MDP’s transitions/rewards.</strong>
Such an environment is closer to what actual complex problems will have.</p>

<p>Hence, we are trying to solve an <strong>unknown</strong> MDP.</p>

<p><strong>Monte-Carlo Reinforcement Learning:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image1.png" alt="" /></p>

<p>In Monte-Carlo, the agent learns through episodes. An episode is one
complete sample of the environment. That is, going through some states
which eventually <strong>leads to a terminating state</strong>. Hence, Monte-Carlo
methods learn from <strong>complete (terminating)</strong> episodes only.</p>

<p>To get the value of different states, we simply calculate the mean of
the returns observed by the <strong>end of each episode.</strong></p>

<p><strong>Policy Evaluation using Monte Carlo:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image2.png" alt="" /></p>

<p>One important point in MC policy evaluation is that <strong>we don’t calculate
the expected return. We are calculating the actual (empirical) mean
return.</strong></p>

<p><strong>Types of Monte-Carlo:</strong></p>

<p><strong>First-visit Monte-Carlo:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image3.png" alt="" /></p>

<p>For <strong>every</strong> episode, we update the state s only the <strong>first</strong> time
it is visited. We do so by incrementing the counter and the total return.
Also note that G<sub>t</sub> is the empirical (actual) return in this
case. This would be obtained by visiting some sample of states in that
episode.</p>

<p>It means that if the same state s is encountered multiple times in the
same episode then we won’t be updating it. Example: If we go left, then
go right again we would end up in the same state but it won’t be updated
the second time.</p>

<p>Again, note that this will be done for many episodes and the values of
counter, total return etc will persist across the different episodes.</p>

<p><strong>Every-visit Monte Carlo:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image4.png" alt="" /></p>

<p>In every-visit Monte Carlo, the state s is updated every time it is
visited in the same episode.</p>
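
<p>The two variants differ by a single check, as the following sketch shows. The episode format (a list of (state, reward) pairs, with the reward received on leaving the state) and the example episode are assumptions for illustration.</p>

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=1.0, first_visit=True):
    """Monte-Carlo policy evaluation; counters persist across episodes."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for episode in episodes:
        # returns G_t for every step, accumulated backwards
        g, returns = 0.0, []
        for _, reward in reversed(episode):
            g = reward + gamma * g
            returns.append(g)
        returns.reverse()

        seen = set()
        for (state, _), g_t in zip(episode, returns):
            if first_visit and state in seen:
                continue                  # later visits are ignored
            seen.add(state)
            counts[state] += 1            # increment counter N(s)
            totals[state] += g_t          # increment total return S(s)
    return {s: totals[s] / counts[s] for s in counts}

# hypothetical episode in which state "A" is visited twice
episode = [("A", 1.0), ("B", 0.0), ("A", 2.0)]
v_first = mc_evaluate([episode], first_visit=True)
v_every = mc_evaluate([episode], first_visit=False)
```

<p>For state A the first-visit estimate uses only the return from the first visit (3), while the every-visit estimate averages the returns from both visits (3 and 2).</p>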

<p>Monte Carlo Black Jack example:</p>

<p><img src="../../images/rl/l4-ds/media/image5.png" alt="" /></p>

<p>Consider, the simplified version of Black Jack. Here, we are defining
two actions – stick and twist. The reward function is also defined by
us. Notice that there is no probability transition matrix as we do not
know the working of the environment.</p>

<p>For simplicity, we only choose an action when our current sum is
between 12 and 21; if the sum of cards is &lt; 12 we automatically twist
(because there is no point in sticking and showing our hand if the
sum is small). The state also records whether we have a usable ace and
which card the dealer is showing.
<strong>(Note: Usable Ace can take a value of 11)</strong></p>

<p><img src="../../images/rl/l4-ds/media/image6.png" alt="" /></p>

<p>The value function after Monte Carlo learning can be seen in the diagram
shown above. We can see that eventually, after 500,000 episodes, the
graph peaks at player sum = 21 (i.e. it gives a reward of +1) and is
flat in other regions.</p>

<p><strong>Incremental Mean:</strong></p>

<p>We can rewrite the mean as shown below:</p>

<p><img src="../../images/rl/l4-ds/media/image7.png" alt="" /></p>

<p>That is, the mean of k points can be written as the mean of k-1 points
plus 1/k times the difference between the present point (k) and the
previous mean.</p>

<p>More intuitively, we are expecting the value x<sub>k</sub> to be near
u<sub>k-1</sub>, so x<sub>k</sub> – u<sub>k-1</sub> can be thought of as
the error in our estimate. If the error is zero, then
u<sub>k</sub> = u<sub>k-1</sub>; otherwise the mean moves in the
direction of the error.</p>

<p><img src="../../images/rl/l4-ds/media/image8.png" alt="" /></p>

<p>The incremental mean can be used to rewrite the value function update in
MC. Then we can replace 1/N(S<sub>t</sub>) with alpha. This alpha helps
us change the equation to an exponential decay form where we can control
how much of the old episodes we want to remember.</p>
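
<p>Both forms of the update fit in a few lines. The sequence of returns used here is made up.</p>

```python
def incremental_mean(values):
    """u_k = u_(k-1) + (x_k - u_(k-1)) / k, without storing past values."""
    mean = 0.0
    for k, x in enumerate(values, start=1):
        mean += (x - mean) / k
    return mean

def running_estimate(values, alpha):
    """Same update with 1/k replaced by a constant step size alpha."""
    estimate = 0.0
    for x in values:
        estimate += alpha * (x - estimate)
    return estimate

returns = [2.0, 4.0, 9.0]                         # hypothetical returns
exact_mean = incremental_mean(returns)
recency_weighted = running_estimate(returns, alpha=0.5)
```

<p>With a constant alpha the most recent return carries the largest weight and older returns fade out exponentially, which is useful when the problem keeps changing.</p>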

<p><strong>Temporal Difference Learning:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image9.png" alt="" /></p>

<p>The main difference between Monte Carlo method and TD methods is that in
TD the <strong>update is done while the episode is ongoing.</strong> That is, we can
learn from incomplete episodes. <strong>This is done by estimating the
remaining rewards instead of actually observing them. This idea is called
bootstrapping.</strong></p>

<p>Example: Suppose the exit door of a classroom is the end state. In
Monte-Carlo, all episodes must end with this state and the states would
be updated only when the episode has ended. In TD Learning, we may
travel halfway through the classroom and estimate the reward for the
remaining distance.</p>

<p><img src="../../images/rl/l4-ds/media/image10.png" alt="" /></p>

<p>As we can see, in temporal difference learning, R<sub>t+1</sub> is the
actual reward which we receive after acting at time step t, and
gamma*V(S<sub>t+1</sub>) is the <strong>estimated future reward.</strong></p>
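
<p>A minimal sketch of this one-step update in Python, assuming the value function is stored in a dictionary. The function name td0_update and the state names are hypothetical and only echo the classroom example.</p>

```python
def td0_update(V, state, reward, next_state, alpha, gamma, terminal=False):
    """Apply one TD(0) update to V[state] in place and return the TD error."""
    # Bootstrapping: the rest of the return is replaced by the current estimate.
    bootstrap = 0.0 if terminal else gamma * V[next_state]
    td_error = reward + bootstrap - V[state]
    V[state] += alpha * td_error
    return td_error


V = {"desk": 0.0, "halfway": 0.5, "door": 0.0}
err = td0_update(V, "desk", 0.0, "halfway", alpha=0.5, gamma=1.0)
print(err, V["desk"])  # 0.5 0.25
```

<p>The update happens immediately after a single transition, without waiting for the episode to reach the door.</p>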

<p><strong>Driving Home Example:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image11.png" alt="" /></p>

<p>Consider the driving from office to home example. Here, we are
travelling from office to home and there are 6 states. As we can see,
the predicted time estimate changes with every state; that is we are
updating the estimate while running through the episode.</p>

<p><img src="../../images/rl/l4-ds/media/image12.png" alt="" /></p>

<p>In Monte Carlo, we can only update after the end of the episode. So, we
wait till the episode has ended (arrive home state reached) before
updating the state value estimates. In case of temporal difference
learning we can update after every state. That is, right after the
leaving office state we can immediately update our value estimate based
on the reward received so far and the estimate for the next state.</p>

<p><strong>Advantages and Disadvantages of MC and TD:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image13.png" alt="" /></p>

<p><strong>Bias/Variance trade-off:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image14.png" alt="" /></p>

<p>As the returns in Monte Carlo are actual returns sampled from the
environment, they are unbiased estimates of the true value.
However, each return depends on every random action, transition and
reward until the end of the episode, so it has high variance (the return
includes any noise/outlier encountered during the episode).</p>

<p>On the other hand, the TD target relies on the current estimate
V(S<sub>t+1</sub>), which may be wrong, and hence it is biased. As it
depends on only one random action, transition and reward, it has low
variance.</p>

<p>Therefore:</p>

<p><strong>TD – some bias, low variance</strong></p>

<p><strong>MC – zero bias, high variance</strong></p>

<p><strong>Advantages/Disadvantages continued:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image15.png" alt="" /></p>

<p>Here, function approximation is the idea of approximating the value
functions instead of calculating them as it is very time consuming to
actually calculate value functions for complex problems.</p>

<p><strong>Random-walk example:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image16.png" alt="" /></p>

<p>To compare MC and TD, consider a scenario where we repeatedly select a
sample episode k from a fixed batch and apply either MC or TD(0) to it (here 0 is
the value of lambda, meaning the target looks only one step ahead).</p>

<p><img src="../../images/rl/l4-ds/media/image17.png" alt="" /></p>

<p>Consider the above sample episodes. Here, B appears in 8 episodes and is
followed by a reward of 1 six times and a reward of 0 twice, so V(B) for both MC and TD will be 6/8.</p>

<p>TD would form an implicit MDP as follows:</p>

<p><img src="../../images/rl/l4-ds/media/image18.png" alt="" /></p>

<p>For TD, we will use the update target: 
R<sub>t+1</sub> + gamma * V(S<sub>t+1</sub>) = 0 + 1*(6/8) = 0.75, so V(A) = 0.75 <br />
For MC, the only return observed from A was 0, hence V(A) = 0. To put it more concretely, see
the image below:</p>

<p><img src="../../images/rl/l4-ds/media/image19.png" alt="" /></p>

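<p>Here, we can see that MC converges to the solution that minimizes the mean squared error
between the observed returns G<sub>t</sub> and V(S<sub>t</sub>). For state A, the squared error
(G<sub>t</sub> - V(S<sub>t</sub>))<sup>2</sup> is minimal when V(A) is 0, because the only return
observed from A is 0.</p>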

<p>On the other hand, TD(0) converges to the solution of the maximum likelihood Markov model that best fits the data.
Notice the 1 in the double summations. It is an indicator function
which equals 1 whenever an observed transition matches s to s’, so the sum counts how often that transition occurred.</p>

<p>So here, V(B) will be (0 + 1 + 1 + 1 + 1 + 1 + 1 + 0)/8 = 6/8, and since A always
transitions to B with reward 0, V(A) = 0 + 1*(6/8) = 6/8. (Remember it’s
based on the equation R<sub>t+1</sub> + gamma*V(S<sub>t+1</sub>).)</p>
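
<p>The two answers for V(A) can be reproduced with a short sketch. It assumes the batch of eight episodes described above (one A, 0, B, 0 episode, six B, 1 episodes and one B, 0 episode) and gamma = 1. The tuple layout (state, reward, next state) is a choice made for this example only.</p>

```python
# Each episode is a list of (state, reward, next_state); None marks termination.
episodes = [[("A", 0, "B"), ("B", 0, None)]] + [[("B", 1, None)]] * 6 + [[("B", 0, None)]]

# Batch MC: average the full returns observed from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    rewards = [r for (_, r, _) in ep]
    for i, (s, _, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))
v_mc = {s: sum(g) / len(g) for s, g in returns.items()}

# Batch TD(0): repeatedly replace V(s) by the average of r + V(s') over the
# transitions observed from s, which is the fixed point batch TD(0) reaches.
transitions = [t for ep in episodes for t in ep]
v_td = {"A": 0.0, "B": 0.0}
for _ in range(50):
    for s in v_td:
        targets = [r + (v_td[nxt] if nxt is not None else 0.0)
                   for (cur, r, nxt) in transitions if cur == s]
        v_td[s] = sum(targets) / len(targets)

print(v_mc)  # {'A': 0.0, 'B': 0.75}
print(v_td)  # {'A': 0.75, 'B': 0.75}
```

<p>Both methods agree on V(B) and differ only on V(A): MC looks at the single return seen from A, while TD passes the estimate of V(B) back to A.</p>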

<p><img src="../../images/rl/l4-ds/media/image20.png" alt="" /></p>

<p><strong>Backups:</strong></p>

<p><strong>Monte-Carlo Backup:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image21.png" alt="" /></p>

<p>In Monte-Carlo we are basically traversing one random path of states
which eventually leads to a terminating state. Hence, it will traverse
through the <strong>depth</strong> and end with a terminating state.</p>

<p><strong>TD Backup:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image22.png" alt="" /></p>

<p>In TD, we only look one step ahead and then estimate the rest. That is
R<sub>t+1</sub> + gamma*V(S<sub>t+1</sub>).</p>

<p><strong>Dynamic programming backup:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image23.png" alt="" /></p>

<p>In DP, we consider <strong>all possible states</strong> one level ahead, i.e.
the entire breadth of level+1.</p>

<p>As opposed to this, in MC and TD we are only considering a limited
space.</p>

<p>Summary:</p>

<p><img src="../../images/rl/l4-ds/media/image24.png" alt="" /></p>

<p>Remember that bootstrapping is estimating the future rewards.</p>

<p><strong>Unified view of RL:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image25.png" alt="" /></p>

<p>As we can see, TD Learning has shallow backups while Monte Carlo has
deep backups (it goes all the way till the terminating state).</p>

<p>Using TD(lambda) we can adjust this level of backups as per our need.</p>

<p><img src="../../images/rl/l4-ds/media/image26.png" alt="" /></p>

<p>Instead of looking ahead by 1-step we can look ahead by any arbitrary n
steps.</p>

<p><strong>N-step return:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image27.png" alt="" /></p>

<p>Now, in n-step TD we look n steps ahead before bootstrapping: the target is
the first n actual rewards plus the discounted value estimate of the state
reached after n steps. Hence, the
return G<sub>t</sub> can be rewritten as shown above.</p>
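
<p>As a rough sketch, the n-step return can be computed from one recorded episode as follows. The indexing convention is an assumption of this example: rewards[k] holds R<sub>k+1</sub>, values[k] holds V(S<sub>k</sub>), and values has one extra entry of 0 for the terminal state.</p>

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t: n real rewards, then bootstrap from V."""
    T = len(rewards)              # episode length
    end = min(t + n, T)           # do not look past the terminal state
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    g += gamma ** (end - t) * values[end]   # values[T] is 0 at termination
    return g


rewards = [1, 1, 1]
values = [0.5, 0.5, 0.5, 0.0]
print(n_step_return(rewards, values, t=0, n=1, gamma=1.0))  # 1.5, the TD(0) target
print(n_step_return(rewards, values, t=0, n=3, gamma=1.0))  # 3.0, the MC return
```

<p>With n = 1 this is the TD(0) target, and once n reaches the end of the episode it is the Monte Carlo return.</p>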

<p><strong>Averaging n-step returns:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image28.png" alt="" /></p>

<p>Instead of selecting a fixed value of n (we can’t decisively decide the
best value of n), it’s better to average these returns. For example, we
can decide to calculate 2-step return and 4-step return and then average
them together to update the value of the value function.</p>

<p><strong>Lambda return:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image29.png" alt="" /></p>

<p>Just like this, we can combine the information from all n-step returns, i.e. 1 step,
2 step, 3 step etc. using the lambda return. The n-step return is weighted by
(1 - lambda) * lambda<sup>n-1</sup>, so the weights form a geometric series that sums to 1.
Geometric weighting is computationally efficient and allows TD(lambda) to be computed at nearly the
same cost as TD(0).</p>

<p><img src="../../images/rl/l4-ds/media/image30.png" alt="" /></p>

<p>It is a decaying function; hence, the longer n-step returns are given smaller
weight.</p>
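
<p>A small sketch of the lambda return for an episodic task is given below. It assumes the same hypothetical layout as a recorded episode: rewards[k] holds R<sub>k+1</sub> and values[k] holds V(S<sub>k</sub>), with a final 0 for the terminal state. All weight left over after the episode ends is placed on the full return.</p>

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t, truncated at the end of the episode."""
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + gamma ** (end - t) * values[end]


def lambda_return(rewards, values, t, lam, gamma):
    """Geometrically weighted average of all n-step returns from time t."""
    T = len(rewards)
    g_lam = 0.0
    for n in range(1, T - t):
        weight = (1 - lam) * lam ** (n - 1)
        g_lam += weight * n_step_return(rewards, values, t, n, gamma)
    # Remaining weight lam^(T-t-1) goes to the complete (Monte Carlo) return.
    g_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return g_lam


rewards = [1, 1, 1]
values = [0.5, 0.5, 0.5, 0.0]
print(lambda_return(rewards, values, 0, lam=0.0, gamma=1.0))  # 1.5
print(lambda_return(rewards, values, 0, lam=0.5, gamma=1.0))  # 2.125
print(lambda_return(rewards, values, 0, lam=1.0, gamma=1.0))  # 3.0
```

<p>Setting lambda to 0 recovers the one-step TD target and setting it to 1 recovers the Monte Carlo return; values in between blend the two.</p>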

<p><strong>Forward-view TD(lambda)</strong></p>

<p><img src="../../images/rl/l4-ds/media/image31.png" alt="" /></p>

<p>One disadvantage of forward-view TD(lambda) is that we need to wait till the end of
the episode to perform any update. This is because the lambda return combines
all n-step returns up to the last time step, and hence the
update can only be computed at the end of the episode, like MC.</p>

<p>This can be solved using the backward view with eligibility traces.</p>

<p><strong>Backward view: Eligibility traces</strong></p>

<p><img src="../../images/rl/l4-ds/media/image32.png" alt="" /></p>

<p>Consider the three bells, one light, shock scenario. Did the shock happen
because the bell rang three times or because of the light? It depends on
whether we are giving importance to frequency of occurrence or recency
of occurrence. Eligibility traces combine both these heuristics.
Every time a state is visited, the value of its trace is increased;
otherwise it decays.</p>

<p><img src="../../images/rl/l4-ds/media/image33.png" alt="" /></p>

<p>So, basically we will be updating every state with the information
obtained till now in the form of eligibility traces.</p>

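<p>The original TD(0) update was V(S<sub>t</sub>) + alpha*delta<sub>t</sub>, where
delta<sub>t</sub> = R<sub>t+1</sub> + gamma*V(S<sub>t+1</sub>) - V(S<sub>t</sub>) is the TD error; now we
multiply the TD error by the eligibility trace E<sub>t</sub>(s) and apply the update to every state s.</p>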

<p>Hence, we don’t need to wait till the end of the episode to update. We
can keep updating at every time step.</p>
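
<p>A minimal sketch of the backward view over one episode, assuming accumulating traces and a dictionary-based value function. The function name and the (state, reward, next state) layout are assumptions of this example.</p>

```python
def td_lambda_episode(V, transitions, alpha, gamma, lam):
    """Backward-view TD(lambda) over one episode, updating V in place.

    transitions is a list of (state, reward, next_state); next_state is None
    when the episode terminates.
    """
    E = {s: 0.0 for s in V}                  # eligibility trace per state
    for s, r, s_next in transitions:
        v_next = 0.0 if s_next is None else V[s_next]
        delta = r + gamma * v_next - V[s]    # one-step TD error
        for x in E:
            E[x] *= gamma * lam              # every trace decays
        E[s] += 1.0                          # the visited state is bumped
        for x in V:
            V[x] += alpha * delta * E[x]     # all states share the same error
    return V


episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, None)]
V = td_lambda_episode({"A": 0.0, "B": 0.0, "C": 0.0}, episode,
                      alpha=0.5, gamma=1.0, lam=0.5)
print(V)  # {'A': 0.125, 'B': 0.25, 'C': 0.5}
```

<p>The single reward at the end is credited to every state on the path, with more recently visited states receiving a larger share.</p>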

<p><img src="../../images/rl/l4-ds/media/image34.png" alt="" /></p>

<p>When lambda = 0, every trace decays to 0 immediately, so only the current state
S<sub>t</sub> has a non-zero eligibility trace (equal to 1). Hence, only the current state is
updated and the update will be the same as TD(0).</p>

<p>Note that we are talking about the following lambda:</p>

<p><img src="../../images/rl/l4-ds/media/image35.png" alt="" /></p>

<p><strong>TD(lambda) and MC:</strong></p>

<p><img src="../../images/rl/l4-ds/media/image36.png" alt="" /></p>

<p>Another interesting property is that TD(1) will have the same total
update as MC.</p>

<p><strong>(Refer to PPT for proof)</strong></p>]]></content><author><name>Omkar Ranadive</name><email>omkar.ranadive@u.northwestern.edu</email></author><category term="reinforcement-learning" /><category term="reinforcement-learning" /><summary type="html"><![CDATA[Lecture Details Title: Model Free Techniques - Monte Carlo and Temporal Difference Description: The lecture notes are based on David Silver’s lecture video. Video link: RL Course by David Silver - Lecture 4 Lecture Slides: Slides]]></summary></entry><entry><title type="html">Lecture 5 - Optimizing Model Free Techniques (Model Free Control) [Notes]</title><link href="https://omkar-ranadive.github.io/posts/rl-l5-ds" rel="alternate" type="text/html" title="Lecture 5 - Optimizing Model Free Techniques (Model Free Control) [Notes]" /><published>2020-03-10T00:00:00-07:00</published><updated>2020-03-10T00:00:00-07:00</updated><id>https://omkar-ranadive.github.io/posts/rl-l5-ds</id><content type="html" xml:base="https://omkar-ranadive.github.io/posts/rl-l5-ds"><![CDATA[<hr />
<p><strong>Lecture Details</strong></p>
<ul>
  <li><strong>Title:</strong> Optimizing Model Free Techniques</li>
  <li><strong>Description:</strong> The lecture notes are based on David Silver’s lecture video.</li>
  <li><strong>Video link:</strong> <a href="https://www.youtube.com/playlist?list=PLbPhAbAhvjUyrKlhnLEMyNmiF72ABB3Zh" target="_blank">RL Course by David Silver - Lecture 5</a></li>
  <li><strong>Lecture Slides:</strong>  <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html" target="_blank">Slides</a></li>
</ul>

<p><strong>Credits: All images used in this post are courtesy of David Silver</strong></p>

<hr />

<p><strong>Why is model free control even required?</strong></p>

<p>Many real-world problems are too complex and thus, their MDP is unknown
or the MDP might be known but it will be too large/complex. Hence, model
free techniques (like Monte Carlo, TD Learning) can be used to solve
this problem.</p>

<p><strong>On Policy vs Off Policy learning:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image1.png" alt="" /></p>

<p>Learning can be segregated into two types. On-policy learning is
learning on the job, that is, the agent learns by exploring the
environment and understanding the experiences by itself. Mathematically,
it uses policy pi to explore the environment and also improves policy pi
itself.</p>

<p>On the other hand, off-policy learning is learning using multiple
policies. Example: A human may demonstrate how to perform a particular
task, then the agent can understand the human’s policy and then use it
to create its own policy. Mathematically, the agent learns policy pi
based on some other policy mu.</p>

<p><strong>Policy Iteration:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image2.png" alt="" /></p>

<p>In Dynamic Programming techniques we had studied policy iteration
algorithm which evaluates a policy and then improves it <strong>greedily</strong>.</p>

<p><strong>Policy Iteration: For model free environments</strong></p>

<p><img src="../../images/rl/l5-ds/media/image3.png" alt="" /></p>

<p>For evaluating the policy, can we replace the iterative policy
evaluation with monte-carlo policy evaluation?</p>

<p>The answer is not with state-value functions: Monte Carlo can estimate V, but
greedy policy improvement over V requires the
probability transition matrix, and in model-free environments we don’t
know the probability transition matrix. It can be explained more
concretely as follows:</p>

<p><img src="../../images/rl/l5-ds/media/image4.png" alt="" /></p>

<p>As we can see, greedily improving over state-value functions is not
possible as P<sub>ss’</sub> is required, which is not available in a
model-free environment. Instead we can use the action-value function
Q(s, a) and simply pick the action with the highest value.</p>

<p>So, now our updated algorithm looks as follows:</p>

<p><img src="../../images/rl/l5-ds/media/image5.png" alt="" /></p>

<p>The second question is, can we use greedy policy improvement for
model-free techniques? The answer is no as we are running through <strong>a
sample of states. That is, we do not explore every possibility in each
episode; in terms of Monte Carlo, we explore only one branch of the tree.</strong></p>

<p>So, we might end up missing the actual optimal policy if we keep
improving the policy greedily.</p>

<p><strong>Example</strong>:</p>

<p><img src="../../images/rl/l5-ds/media/image6.png" alt="" /></p>

<p>Consider the example with two doors. The agent starts by randomly
choosing an action (as it is the first action) and opens the left door.
The reward is 0. Now, if the agent tries the right door and gets a
reward of +1 then, as per the greedy policy, the agent will choose the
right door (because left = 0, right = 1, the greedy choice is
right). Similarly, if we keep getting +3, +2 from the right door, the agent
keeps choosing it. However, this is not the correct policy as we have
explored the left door only once. We are comparing against the 0 reward
which we got once. It might be that it could produce a large reward like
+100 in subsequent explorations. But our agent will fail to explore it
in a greedy policy improvement.</p>

<p><strong>Hence, complete exploitation doesn’t work. The agent must perform
exploration too.</strong></p>

<p><strong>Epsilon greedy exploration:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image7.png" alt="" /></p>

<p>The idea in epsilon greedy exploration is to choose greedily with a high
probability but to still explore with a low probability.</p>

<p>Example: if epsilon = 0.2, and there are 4 actions (m):</p>

<p>Then the greedy action is chosen with probability 0.2/4 + 1 – 0.2 = 0.85 (85%),
and each of the other three actions with probability 0.2/4 = 0.05.</p>

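<p>Note that we divide epsilon by m because the exploration probability is
spread uniformly over all m actions (including the greedy one). If there are many
actions, the greedy action receives a smaller share of epsilon, so we would be exploring
more often.</p>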
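
<p>The probabilities from the example above can be checked with a short sketch. The helper names are hypothetical, and ties between equal Q values are broken by taking the first maximal action, which is a choice made here for simplicity.</p>

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Probability of each action under an epsilon-greedy policy."""
    m = len(q_values)
    greedy = max(range(m), key=lambda a: q_values[a])
    probs = [epsilon / m] * m        # exploration share for every action
    probs[greedy] += 1.0 - epsilon   # exploitation share for the greedy action
    return probs


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample one action index according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q = [0.0, 1.0, 0.5, 0.2]
print(epsilon_greedy_probs(q, 0.2))   # roughly [0.05, 0.85, 0.05, 0.05]
print(epsilon_greedy_action(q, 0.2, random.Random(0)))
```

<p>With epsilon = 0 the policy is purely greedy, and with epsilon = 1 every action is equally likely.</p>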

<p><strong>Proof that epsilon greedy leads to policy improvement:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image8.png" alt="" /></p>

<p>Consider that pi’ is the new policy determined by the e-greedy
algorithm. So, now the action value function q(s, pi’(s)) can be
split into two parts (an exploration part with weight e/m on every action and an
exploitation part with weight (1-e) on the greedy action).</p>

<p>So, the idea is that taking the max of q(s, a) will be better than or at
least equal to taking any other weighted average of q(s, a). That is, the
new policy is at least as good as the original as it is based on
taking max.</p>

<p>Hence, our improved policy iteration now looks as follows:</p>

<p><img src="../../images/rl/l5-ds/media/image9.png" alt="" /></p>

<p><strong>Improving the algorithm further:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image10.png" alt="" /></p>

<p>There is no need to evaluate the action-value function <strong>fully</strong> to
improve the policy. We can evaluate and improve the policy on <strong>every
episode.</strong> That is, we can update the policy even without considering
all scenarios of the value function.</p>

<p><strong>Problems with exploration:</strong></p>

<p>While we saw that greedily improving a policy won’t lead to the optimal
policy, eventually we want to reduce the exploration to zero as by
definition the optimal policy is a policy where there shouldn’t be
randomness (unlike exploration). That is, we want to decrease the
probability of exploring as we gain more and more knowledge.</p>

<p><strong>Solution: GLIE</strong></p>

<p><img src="../../images/rl/l5-ds/media/image11.png" alt="" /></p>

<p>GLIE (Greedy in the Limit with Infinite Exploration) requires that all state-action pairs
are explored infinitely many times and that the policy eventually converges
to a greedy policy (that is, the probability of exploration reduces
to 0). Setting e = 1/k, where k is the episode number, satisfies both conditions.</p>

<p>Basically, if e = 1/1, 1/2, 1/3… the probability of exploration keeps
decreasing.</p>

<p>GLIE Monte-Carlo control can be shown as follows:</p>

<p><img src="../../images/rl/l5-ds/media/image12.png" alt="" /></p>

<p><strong>Using TD instead of Monte-Carlo:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image13.png" alt="" /></p>

<p><strong>Updating action value functions using SARSA in TD Learning:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image14.png" alt="" /></p>

<p>In TD Learning, we are looking ahead by one time step instead of waiting
for an entire episode to finish. Hence, we can update the action-value
function and <strong>improve the policy every time step.</strong></p>

<p>The update is called SARSA, after the tuple (S, A, R, S’, A’) it uses. Here (S, A) are the original
state and action pair. That is, given the state S, when the agent takes
action A, it will get a reward R and it will end up in some state S’.
Now the agent will take some action A’ from that state S’.</p>
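
<p>A minimal sketch of a single SARSA update, assuming Q is stored as a dictionary keyed by (state, action). The function name, the integer states and the action labels are made up for this example.</p>

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """Move Q(S, A) towards R + gamma * Q(S', A'), where A' is actually taken next."""
    target = r if terminal else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)          # unseen state-action pairs start at 0
Q[(1, "right")] = 2.0
new_value = sarsa_update(Q, 0, "right", -1.0, 1, "right", alpha=0.5, gamma=1.0)
print(new_value)  # 0.5
```

<p>Because A’ is chosen by the same policy that is being improved, this is an on-policy update.</p>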

<p><strong>Algorithm:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image15.png" alt="" /></p>

<p>Convergence of SARSA:</p>

<p><img src="../../images/rl/l5-ds/media/image16.png" alt="" /></p>

<p>The first Robbins-Monro condition (the step sizes sum to infinity) says that the
step sizes are sufficiently large that the Q values can be moved as far as needed to reach the
desired point.</p>

<p>The second condition (the squared step sizes have a finite sum) states that the steps
shrink quickly enough that the updates eventually die out; that is, eventually the Q values will reach their optimal
value and stop changing.</p>

<p><strong>Note:</strong> In practice, we use SARSA even if these conditions aren’t met</p>

<p>Example: Windy Grid world</p>

<p><img src="../../images/rl/l5-ds/media/image17.png" alt="" /></p>

<p>Consider the gridworld game where S is the start point and G is the
goal. The arrows represent wind; if the agent is in a windy
column, it will be blown “upwards”. The number of cells it will be blown
upwards is indicated by the number shown along the X axis, e.g. 2 means
that the agent will be blown two steps upwards.</p>

<p><strong>After applying SARSA (Using standard moves not King’s move)</strong></p>

<p><img src="../../images/rl/l5-ds/media/image18.png" alt="" /></p>

<p>From the graph we can see that initially, the agent required a lot of
timesteps to complete episodes. This makes sense as initially the agent
is unaware of the environment. But as time went by, the rate at which
episodes are completed increased drastically as the agent is now aware of
the environment.</p>

<p><strong>n-step SARSA:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image19.png" alt="" /></p>

<p>Instead of bootstrapping after a single step, we can use the n-step return: n actual
rewards followed by the Q value of the state-action pair reached after n steps.</p>

<p><strong>Forward view SARSA(lambda)</strong></p>

<p><img src="../../images/rl/l5-ds/media/image20.png" alt="" /></p>

<p>Just like we had studied in the TD learning process, we can take the
information from all n-step returns into account using the lambda return. The n-step returns are
combined using geometric weights for efficient update and
calculation.</p>

<p>Problem with forward view: As all n steps need to be calculated before
updating the Q values, we need to <strong>wait till the end of the episode</strong>
to actually update the values.</p>

<p>To solve this problem, we use the <strong>backward view SARSA:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image21.png" alt="" /></p>

<p><strong>Backward SARSA algorithm:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image22.png" alt="" /></p>

<p><strong>Example:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image23.png" alt="" /></p>

<p>Consider the above example. Let the first image be the path taken in an
episode. If we use one-step SARSA then only the grid box right below the
reward (asterisk point) will get updated. This is because the rest of
the path leads to no reward (with respect to one step). Then in the next
episode, the grid point next to it will get updated and so on. <strong>Point
being, the update w.r.t one-step SARSA is much slower than
SARSA(lambda).</strong></p>

<p>With SARSA(lambda) we can see that the entire path was updated in the
direction of the reward. The grid points closer to the goal were updated
more strongly (due to eligibility traces) and those before were updated
a little less strongly.</p>

<p><strong>Off policy learning:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image24.png" alt="" /></p>

<p>The main idea of off-policy learning is to observe some other
policy/policies and use that information to update our target policy.</p>

<p>This could be useful in the following scenarios:</p>

<ul>
  <li>
    <p>Learning from other humans. That is, the agent can observe the
policy which a human uses and accordingly optimize its target
policy.</p>
  </li>
  <li>
    <p>The old policies which the agent might have used need not be
completely discarded. The important bits from those old policies can
be used to form the new policy.</p>
  </li>
  <li>
    <p>Learning optimal policy while following exploratory policy. That is,
we follow some exploratory policy, i.e explore a lot and use that
information to update our target policy optimally. <strong>(used in Q
learning)</strong></p>
  </li>
</ul>

<p><strong>Importance Sampling:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image25.png" alt="" /></p>

<p>An expectation over some distribution P(X) can be
estimated using samples from some other distribution Q(X) by
multiplying and dividing by Q(X), which weights each sample by the ratio P(X)/Q(X). <strong>Basically, the closer P(X) and Q(X) are,
the closer the value of P(X)/Q(X) will be to 1. The more different they
are, the further the ratio is from 1.</strong></p>
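
<p>A small numerical sketch of this idea, using two made-up discrete distributions over {0, 1}. The distributions, the sample size and the seed are arbitrary choices for illustration.</p>

```python
import random


def importance_sampling_estimate(f, p, q, samples):
    """Estimate the expectation of f under p using samples drawn from q."""
    return sum(p[x] / q[x] * f(x) for x in samples) / len(samples)


rng = random.Random(0)
p = {0: 0.5, 1: 0.5}   # distribution we care about
q = {0: 0.8, 1: 0.2}   # distribution we can actually sample from
samples = rng.choices(list(q), weights=list(q.values()), k=20000)

estimate = importance_sampling_estimate(lambda x: float(x), p, q, samples)
print(estimate)   # close to 0.5, the true expectation of X under p
```

<p>The plain average of the samples would be close to 0.2 (the mean under Q); the ratio P(X)/Q(X) corrects it back towards the mean under P.</p>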

<p><strong>Importance Sampling for off-policy Monte Carlo:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image26.png" alt="" /></p>

<p>The idea is to follow policy mu and use the returns generated by mu to
evaluate policy pi. That is, based on the observations attained through
policy mu, we are <strong>correcting</strong> our target policy pi. <strong>But this is a
bad way because this technique has extremely high variance. Hence,
importance sampling shouldn’t be used with Monte Carlo in practice.</strong></p>

<p><strong>Importance Sampling for Off-policy TD learning:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image27.png" alt="" /></p>

<p>In case of TD Learning, the importance sampling will only be <strong>done over
1 step</strong> instead of the entire episode (all time steps) <strong>hence, the
variance is much lower.</strong></p>

<p><strong>Q-learning:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image28.png" alt="" /></p>

<p>In Q-Learning, we have two policies: The <strong>behavior policy</strong> and the
target policy.</p>

<p>From a state S, we choose the action A that is actually executed using the <strong>behavior policy</strong>.
After taking the action, we receive reward R and end up in state S’ (S<sub>t+1</sub>). Now,
from this state we consider an alternative <strong>successor</strong> action A’ drawn from our <strong>target
policy</strong>.</p>

<p>Then we update Q(S, A) towards the reward plus the discounted <strong>value of the alternate action
A’.</strong></p>

<p><img src="../../images/rl/l5-ds/media/image29.png" alt="" /></p>

<p><strong>Basically, Q Learning uses an exploratory policy to find the optimal
policy.</strong></p>

<p>That is, the behavior policy is epsilon greedy (exploratory) while the
target policy is completely greedy.</p>

<p>Therefore, the behavior policy will help us explore scenarios while the
target policy is updated strictly based on the maximum values.</p>
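
<p>A minimal sketch of one Q-learning update, assuming Q is a dictionary keyed by (state, action) and the target policy is greedy. The function name, states and action labels are hypothetical.</p>

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, terminal=False):
    """Move Q(S, A) towards R + gamma * max over a' of Q(S', a')."""
    # The greedy target policy picks the best next action, whatever the
    # exploratory behavior policy actually does next.
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


actions = ["left", "right"]
Q = defaultdict(float)
Q[(1, "left")] = 1.0
Q[(1, "right")] = 3.0
new_value = q_learning_update(Q, 0, "right", -1.0, 1, actions, alpha=0.5, gamma=1.0)
print(new_value)  # 1.0
```

<p>The only difference from the SARSA update is the max over next actions, which is what makes the update off-policy.</p>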

<p><strong>Convergence:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image30.png" alt="" /></p>

<p><strong>Algorithm:</strong></p>

<p><img src="../../images/rl/l5-ds/media/image31.png" alt="" /></p>]]></content><author><name>Omkar Ranadive</name><email>omkar.ranadive@u.northwestern.edu</email></author><category term="reinforcement-learning" /><category term="reinforcement-learning" /><summary type="html"><![CDATA[Lecture Details Title: Optimizing Model Free Techniques Description: The lecture notes are based on David Silver’s lecture video. Video link: RL Course by David Silver - Lecture 5 Lecture Slides: Slides]]></summary></entry></feed>