Omkar Ranadive

Maximum Likelihood, Fisher Information, Cramer Rao Inequality

2020-09-09T00:00:00-07:00

Credits: All images used in this post are courtesy of Prof. Michael Schmitt and Ben Lambert

Maximum likelihood:

In simple terms, maximum likelihood estimates a distribution using a likelihood function (which depends on some parameters) such that the likelihood of the observed data under that distribution is maximized.

For example:

The red points are the observed data samples. If we estimate the distribution using the parameter alpha, only the distribution in the middle diagram fits the data correctly (i.e., the joint probability of all the observed samples under that distribution is high).

The likelihood function can be defined as follows:

Likelihood is not a probability: it is a function of the parameters for fixed data, so it need not integrate to 1 over the parameters and its integral has no direct interpretation.

If we assume that the samples are i.i.d., then the likelihood function can be written as a product of the individual probability densities:

L(theta) = f(x1, theta) * f(x2, theta) * ... * f(xN, theta)

Where f(xi, theta) dx is the probability of observing xi in the interval [x, x + dx].

How to maximize the likelihood?

To maximize, we can take the first-order derivative of the likelihood function with respect to the parameters and set it to 0. In practice it’s more convenient to minimize the negative log likelihood (NLL) instead: directly multiplying many probabilities can lead to extremely small values, whereas the log turns the product into a sum, which is numerically stable and easier to differentiate.
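To make this concrete, here is a minimal sketch of minimizing the NLL numerically. It assumes an exponential density f(x, rate) = rate * exp(-rate * x) and a handful of made-up samples; neither the density nor the names `neg_log_likelihood` and `grid_search_mle` come from the lecture.

```python
import math


def neg_log_likelihood(rate, samples):
    """NLL of i.i.d. samples under the density f(x, rate) = rate * exp(-rate * x)."""
    # The log of a product of densities is the sum of the log densities.
    return -sum(math.log(rate) - rate * x for x in samples)


def grid_search_mle(samples, candidates):
    """Return the candidate rate with the smallest negative log likelihood."""
    return min(candidates, key=lambda rate: neg_log_likelihood(rate, samples))


samples = [0.5, 1.5, 1.0, 2.0, 1.0]             # made-up observations
candidates = [0.01 * k for k in range(1, 501)]  # rates 0.01, 0.02, ..., 5.00
rate_ml = grid_search_mle(samples, candidates)
closed_form = len(samples) / sum(samples)       # analytic MLE: 1 / sample mean
print(rate_ml, closed_form)
```

For this particular density, setting the derivative of the NLL to 0 gives the closed form 1 / (sample mean), so the grid search should land on the candidate right next to it.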

Taylor series expansion of the NLL:

The first term, -lnL(theta_ml), is the minimum of the NLL: by definition of maximum likelihood, theta_ml maximizes L, so the negative log of L is smallest there. The second term vanishes because the first-order derivative is 0 at theta_ml.

So, we are left with the following:

So, when N is large, the negative log likelihood around theta_ml can be approximated by a parabola.

The curvature of this parabola is inversely related to the variance of the estimate: the sharper the parabola, the lower the variance.
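Written out, the expansion described above looks roughly as follows. This is a sketch reconstructed from the text, assuming a single scalar parameter theta; the symbol \sigma_{\hat\theta} for the standard deviation of the estimate is notation introduced here, not taken from the original figures.

```latex
-\ln L(\theta) \approx -\ln L(\theta_{ml})
  + (\theta - \theta_{ml})
    \left.\frac{\partial(-\ln L)}{\partial\theta}\right|_{\theta_{ml}}
  + \frac{1}{2}(\theta - \theta_{ml})^2
    \left.\frac{\partial^2(-\ln L)}{\partial\theta^2}\right|_{\theta_{ml}}

% The first-order term vanishes at the minimum, which leaves a parabola:
-\ln L(\theta) \approx -\ln L(\theta_{ml})
  + \frac{(\theta - \theta_{ml})^2}{2\sigma_{\hat\theta}^2},
\qquad
\frac{1}{\sigma_{\hat\theta}^2}
  = \left.\frac{\partial^2(-\ln L)}{\partial\theta^2}\right|_{\theta_{ml}}
```

A large second derivative means a narrow parabola and hence a small variance.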

But why does the second order derivative represent the curvature?

To understand it intuitively, first let’s look at what the second order derivative means. It basically means taking the derivative of the gradient; i.e. it represents the rate of change of gradient.

From the figure above we can see that the gradient of the pink distribution changes faster than that of the yellow one (the curves are sharper for the pink). The pink distribution also has a lower variance, as it is more sharply peaked than the yellow one. So, we can roughly say that the variance of the estimator is inversely proportional to the second-order derivative of the negative log likelihood function.

Fisher Information

Fisher Information tells us how much information we can get about the parameter we are trying to estimate given a data sample X. Basically, it tells us the expected amount of information which sample X carries about the parameter theta.

So, the variance is proportional to the inverse of Fisher information. This idea is formalized in the form of Cramer Rao Inequality as follows:

In brief, the Cramer Rao Inequality says that the variance of an unbiased estimator cannot be lower than the inverse of the Fisher Information.
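As a small worked example, consider estimating the success probability p of a coin from n i.i.d. flips. The Bernoulli setting and the function names below are assumptions chosen for illustration. The Fisher information of n flips is n / (p * (1 - p)), and the sample mean is an unbiased estimator whose variance sits exactly on the Cramer Rao bound.

```python
def fisher_information(p, n):
    """Fisher information about p in n i.i.d. Bernoulli(p) samples: n / (p * (1 - p))."""
    return n / (p * (1.0 - p))


def cramer_rao_bound(p, n):
    """Lower bound on the variance of any unbiased estimator of p."""
    return 1.0 / fisher_information(p, n)


def variance_of_sample_mean(p, n):
    """Exact variance of the sample mean of n i.i.d. Bernoulli(p) draws."""
    return p * (1.0 - p) / n


for n in (10, 100, 1000):
    print(n, cramer_rao_bound(0.3, n), variance_of_sample_mean(0.3, n))
```

More samples mean more information, so the bound (and the variance of the sample mean) shrinks as n grows.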

Lecture 10 - Applying RL to Games [Notes]

2020-05-05T00:00:00-07:00

Lecture Details

Title: Applying RL to Games
Description: The lecture notes are based on David Silver’s lecture video.
Video link: RL Course by David Silver - Lecture 10
Lecture Slides: Slides

Credits: All images used in this post are courtesy of David Silver

Why study games?

Games have complicated rules and require logic to play. Playing games optimally displays intelligence and acts as a test of it.

Games are called the Drosophila of AI. Drosophila is the fruit fly, an insect widely used in biology for conducting experiments.

Therefore, as games require intelligence and logic, they are a good testing ground for intelligent agents.

Optimality in games:

There are two main types of optimality in games.

Best response: the optimal policy against some fixed opponent policy. Example: consider a game of Rock Paper Scissors in which the opponent always plays scissors. Then the best response is to always play rock. But remember that this best response is w.r.t. this specific opponent only.
Nash equilibrium: a joint policy in which every player’s policy is a best response to the policies of all the other players, so no player gains by deviating. One way to reach it is self-play: we make agents play against themselves, with every agent trying to give the best possible response against every other agent. By improving the agents in such a scenario, the agents eventually stop changing their policies, and this fixed point is known as the Nash equilibrium.
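The best response idea can be sketched in a few lines. This is only an illustration under the assumption that the opponent's policy is known and given as a dictionary of action probabilities; the payoff of +1 / 0 / -1 for win / draw / loss is also an assumption.

```python
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(mine, theirs):
    """+1 if my action beats theirs, -1 if it loses, 0 for a draw."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def best_response(opponent_policy):
    """Action with the highest expected payoff against a fixed opponent policy."""
    def expected_payoff(mine):
        return sum(prob * payoff(mine, theirs)
                   for theirs, prob in opponent_policy.items())
    return max(ACTIONS, key=expected_payoff)


print(best_response({"scissors": 1.0}))            # opponent always plays scissors
print(best_response({"rock": 0.5, "paper": 0.5}))  # opponent mixes rock and paper
```

Against the uniform random opponent every action has expected payoff 0, which is why uniform play is the Nash equilibrium of this game: no response does better than any other.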

Note: The point of Nash equilibrium may or may not be unique.

Single agent vs self-playing agents:

In a single-agent problem, we consider the other players as part of the environment. Hence, this becomes a normal RL problem where the environment outputs some states and rewards. The best response policy is the solution to this single-agent problem.

To find a Nash equilibrium, we use self-play between agents. That is, agents play one another and eventually improve. Note that a single agent may play against itself rather than having multiple agents play against each other.

Two player zero-sum games:

A zero-sum game is a game in which two players are trying to defeat each other and hence have equal and opposite rewards, i.e. a positive reward for player 1 is a negative reward of the same size for player 2.

Perfect and imperfect information games:

Note: There is no single perfect way of solving games. Reinforcement learning should be combined with search to get the best possible results. That is, search is an important component of game-playing agents and one shouldn’t solely rely on reinforcement learning.

Minimax:

A minimax policy follows the minimax algorithm: a tree is built in which the two players alternately choose the best possible move for themselves at each stage (one player maximizes the value, the other minimizes it). A minimax policy is a Nash equilibrium.

Minimax search example:

We can see that the values at leaf node were obtained and propagated upwards all the way to the root.
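The propagation of leaf values up to the root can be sketched as a short recursion. The tree below is made up for illustration (it is not the tree from the lecture figure), and representing a node as either a number or a list of children is an assumption of this sketch.

```python
def minimax(node, maximizing):
    """Minimax value of a node: a leaf is a number, an inner node is a list of children."""
    if not isinstance(node, list):
        return node  # leaf: its value is given
    child_values = [minimax(child, not maximizing) for child in node]
    return max(child_values) if maximizing else min(child_values)


# Root is a max node with three min nodes below it, each with two leaves.
tree = [[3, 5], [2, 9], [0, 7]]
root_value = minimax(tree, maximizing=True)
print(root_value)
```

Here the min player would pick 3, 2 and 0 in the three subtrees, so the max player at the root ends up with a value of 3.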

Problem with minimax:

Exhaustive tree search is impractical as the size of the search tree grows exponentially with depth. Hence, in practice function approximators like neural networks are used to approximate the value. That is, minimax search is run only to a depth that is computationally feasible, and the value function approximator is used to estimate the values at the leaf nodes of that truncated tree.

Using pre-defined features: Binary Linear value functions

Games have many pre-defined sets of features which are used to calculate the “goodness” of a state. As we can see above, an arrangement can be represented by a binary vector which gets a 1 if the designated piece is actually at that place on the board and 0 otherwise. Each such feature is given a weight. After we multiply the feature vector with the weight vector (a dot product), we get a single scalar telling us the goodness (value function) of that state. Note: Here negative weights correspond to harmful features (i.e. the opponent’s pieces).
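As a tiny illustration of the dot product described above (the feature vector and weights are made up, not the ones from the slide):

```python
def binary_linear_value(features, weights):
    """Dot product of a binary feature vector with a weight vector."""
    return sum(f * w for f, w in zip(features, weights))


features = [1, 0, 1, 1]   # which features are present in the position
weights = [5, 3, -3, 1]   # made-up weights; the negative one marks an opponent piece
value = binary_linear_value(features, weights)
print(value)              # 5 - 3 + 1
```

Only the weights of the features that are switched on contribute to the value.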

Deep Blue: The Chess agent

Deep Blue used a combination of pre-defined features and search heuristics.

Chinook: Checkers agent

Along with a binary linear value function (hand-crafted features) and minimax search, the Chinook agent used retrograde analysis. In this, the agent searches backwards from won positions.

Self-play agents: Reinforcement learning

Now coming to reinforcement learning, the algorithms are pretty much the same as we have seen before. In the case of MC, v(S_t, W) gives the estimate of who would win from state S_t. When the game is finally over, it is updated towards the actual outcome (return) G_t.

In the case of TD, v(S_t, W) gets updated towards the value of the successor state. That is, the agent has some estimate of v at state S_t. When the agent actually plays and reaches state S_t+1, there will be some new estimate v(S_t+1, W). We update v(S_t, W) towards this new estimate.
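The two updates can be sketched side by side. This assumes a linear value function v(S, W) = features . W, no intermediate rewards and no discounting (the only reward is the final outcome of the game); the function names and the step size are illustrative.

```python
def value(features, weights):
    """Linear value estimate v(S, W) = features . W."""
    return sum(f * w for f, w in zip(features, weights))


def mc_update(weights, features, final_return, step_size):
    """Move v(S_t, W) towards the actual outcome G_t of the game."""
    error = final_return - value(features, weights)
    return [w + step_size * error * f for w, f in zip(weights, features)]


def td_update(weights, features, next_features, step_size):
    """Move v(S_t, W) towards the current estimate v(S_t+1, W) of the successor."""
    error = value(next_features, weights) - value(features, weights)
    return [w + step_size * error * f for w, f in zip(weights, features)]


print(mc_update([0.0, 0.0], [1, 0], final_return=1.0, step_size=0.5))
print(td_update([0.0, 1.0], [1, 0], next_features=[0, 1], step_size=0.5))
```

The only difference between the two is the target: MC waits for the end of the game, TD bootstraps from the next state's estimate.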

Policy improvements using afterstates:

In deterministic games, we know the rules. For example, we know that a knight in chess moves two squares in one direction and one square perpendicular to it. Hence, we know exactly where the piece will end up when that move is played. Or in terms of the agent, we know exactly the state which the game will move to.

So, the policy can be improved by exploiting this idea of afterstates. As we know the rules and the states which the game will move to, we can simply try out these states in our mind and then choose the action which maximises the afterstate value.
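A minimal sketch of afterstate-based action selection follows. The three callables (`legal_actions`, `successor`, `value`) are hypothetical stand-ins for the game rules and the learned value function, and the toy game is made up purely so that the snippet runs.

```python
def greedy_afterstate_action(state, legal_actions, successor, value):
    """Choose the action whose afterstate has the highest estimated value."""
    return max(legal_actions(state),
               key=lambda action: value(successor(state, action)))


# Hypothetical toy game: the state is a number, a move adds to it,
# and positions closer to 10 are better.
def toy_legal_actions(state):
    return [-1, 1, 2]


def toy_successor(state, action):
    return state + action


def toy_value(afterstate):
    return -abs(10 - afterstate)


best_move = greedy_afterstate_action(8, toy_legal_actions, toy_successor, toy_value)
print(best_move)
```

In a two-player game the minimizing player would use `min` instead of `max` over the same afterstate values.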

Self-play TD: Logistello in the game of Othello

The Othello agent Logistello was able to develop its own features using some basic input features.

It used generalized policy iteration to perform RL.

TD Gammon: Non-linear approximator for Backgammon

The TD Gammon feature vector basically flattens out the board and counts how many of the checkers are there on each location. This is then passed to a neural network which spits out the state value function.

So the main difference between TD Gammon and previous agents is that TD Gammon uses a non-linear function approximation (neural network) to approximate the value function.

Self-play TD in TD Gammon:

Notice that TD Gammon uses temporal difference learning with only greedy policy improvement. That is, epsilon-greedy or other exploration techniques are not used. Yet, the algorithm manages to converge in practice. This is because Backgammon uses dice. The dice bring stochasticity to the game and thus there is implicit exploration. Due to this, a greedy policy is able to reach optimality.

When this TD learning was combined with searching and hand-crafted features, the agent was able to defeat the world champion.

Combining reinforcement learning and minimax search:

As mentioned before, search is an important step in game-playing agents. It shouldn’t be ignored.

One naïve way of combining RL with Minimax search is to first use TD to update towards the successor value and then use this approximated value function to perform minimax search.

Results of using Simple TD:

As we can see, for complicated games like Chess and Checkers it leads to poor performance.

TD root:

In TD root, we first perform a minimax search from state S_t and store the outcome. Then we play for real and reach the next state S_t+1. We perform a minimax search from S_t+1 as well and get its outcome. The value function at S_t is updated towards the outcome of the minimax search from S_t+1. The important thing to note is that we are updating the value function towards the outcome of the minimax search and not towards the value function at S_t+1. In the figure, the search from S_t+1 leads to the green node, so we update the value at S_t such that it moves closer to the value of the green node. That is, the value function at S_t should be able to predict the green outcome without performing the minimax search.

TD Leaf:

In case of TD leaf, we again use minimax search at state S_t and state S_t+1. But now, we also update the leaf value node attained at state S_t towards the leaf value node attained at state S_t+1. That is, we are not only creating a better value estimate at state S_t but also a better minimax search estimate from state S_t.

TreeStrap:

In TreeStrap we update the nodes at every level towards the nodes at deeper levels. That is, the nodes at a higher level should be able to predict what the nodes at a deeper level are predicting.

Note, this algorithm can only work well if there are a sufficient number of real-world estimates within the search tree. Otherwise, we would just be updating towards a biased fake optimal solution (as we are simulating the search tree).

Simulation based search:

The UCT algorithm basically uses the Monte-Carlo Tree Search seen in previous lectures but additionally treats every node as a bandit. That is, every node now has an upper confidence bound associated with it.

Performance of Monte Carlo Tree Search and UCT:

Simple Monte-Carlo search in Scrabble: (Imperfect information game problems)

Notice that Scrabble is an imperfect information game as the other player’s letters are not visible to us.

Game tree search in imperfect games:

Notice that in imperfect information games like Poker, every player has a different search tree.

Solutions to imperfect information games:

Smooth UCT:

Smooth UCT is a variant of the normal UCT algorithm in which the opponent’s observed behaviour is taken into consideration. The agent keeps track of the opponent’s average behaviour and learns to respond against it. With a high probability the action is picked by the UCT algorithm, and with a small probability the agent picks the action which plays best against the average behaviour.

In conclusion, an RL agent should be combined with search methods, function approximators and hand-crafted features to form a robust and intelligent agent.

Lecture 9 - Advanced Exploration [Notes]

2020-05-04T00:00:00-07:00

Lecture Details

Title: Advanced Exploration
Description: The lecture notes are based on David Silver’s lecture video.
Video link: RL Course by David Silver - Lecture 9
Lecture Slides: Slides

Credits: All images used in this post are courtesy of David Silver

We have already studied the importance of exploration. In brief, the idea is to try out new things in the hope of getting more reward.

While exploring, we make short-term sacrifices to reward for long-term advantage.

Types of exploration:

Till now, we used the naïve approach of epsilon-greedy to explore. That is, we explored randomly with a small probability epsilon and acted greedily with a large probability 1 – epsilon. However, randomly exploring the environment is obviously not optimal.
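For reference, epsilon-greedy action selection can be sketched as below. Passing the random number generator in as `rng` is a choice made here to keep the example reproducible; it is not part of the lecture.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Explore uniformly with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # random exploratory action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # greedy action


rng = random.Random(0)
picks = [epsilon_greedy([1.0, 5.0, 2.0], 0.1, rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # mostly the greedy action 1
```

Note that the exploratory step ignores the value estimates completely, which is exactly why it is called naive.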

Better approaches than naïve exploration:

Optimism in the face of uncertainty: The idea behind optimism in the face of uncertainty is to be optimistic about the unknown. Example: If one action gives a reward of 100 with a 70% chance and another action may give a reward of 1000 with a 30% chance, then we should explore the second action.

Information state search: Information state search uses previous information to make informed decisions. Example: If we are in a room and we know what is behind some door vs if we are in a room and we don’t know what is behind some door. The first case is much more useful as we are using previously known information to our advantage.

State action exploration vs parametric exploration:

In state-action exploration we systematically try out new things. Example: Consider we have been in the state s before and had taken a right from that point. So, when we are in that same state again, we would likely take a left in state-action exploration; that is, systematically try out different things.

In parameter exploration, we control our agent using some parameterised policy. Once we choose the parameters, we try it out for a while. This introduces consistency.

Example: An agent which would explore based on some fixed policy (parameters) is better than an agent which takes some random action at different states. That is, taking random actions may not lead to useful results but taking consistent actions while exploring may lead to better results.

Multi-arm bandits

The multi-arm bandits problem can be thought of as having many one-step slot machines. That is, say we have 10 slot machines in front of us, and get to pull the lever of one of the machines. Doing this leads to some reward R. We need to maximise the cumulative reward by pulling these levers one at a time. Hence, mathematically, a multi-arm bandit is a tuple (A, R), where A is the set of actions (arms) and R gives the reward distribution of each action. Notice that it is state-less.

Example: One of the slot machines may have a 70% chance of giving a reward of 100, another may have 20% chance of giving a reward of 200 etc. We need to maximize the cumulative reward.

Regret

Instead of maximizing the cumulative reward, expressing the problem as minimizing the total regret has certain advantages.

Firstly, regret is the difference between the best we could have done (V*) and the value of the action which we took, Q(a_t). Note that here we are assuming that we somehow know the optimal value V*.

Advantage of using regret instead of reward: By expressing the problem in terms of regret, we get to compare different algorithms in terms of exploration. The cumulative reward of almost every algorithm keeps on increasing, so that curve tells us little. The cumulative regret curve, on the other hand, shows whether an algorithm keeps losing reward at a constant rate (the curve keeps growing linearly) or eventually learns the best action (the curve flattens out).

Counting regret:

Regret can be expressed as shown above. Every time we take some action a, we increase its count N_t(a). The gap of an action is the difference between the value of the best action and the value of that action. The total regret is then the sum, over all actions, of count times gap. So, basically, we want to build an algorithm which takes the actions with large gaps only a small number of times.
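The counting form of regret is easy to sketch. The numbers reuse the slot machine example from earlier in these notes (a 70% chance of a reward of 100 has expected value 70, a 20% chance of 200 has expected value 40); the pull counts are made up.

```python
def total_regret(action_values, counts):
    """Sum over actions of (number of pulls) * (gap to the best action)."""
    v_star = max(action_values)
    return sum(n * (v_star - q) for q, n in zip(action_values, counts))


action_values = [70.0, 40.0]  # true expected rewards of the two machines
counts = [8, 2]               # how often each machine was pulled
print(total_regret(action_values, counts))  # 2 pulls * gap of 30
```

Pulls of the best machine contribute nothing, so all the regret comes from the two pulls of the worse machine.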

Linear vs sublinear regret:

As shown above, greedy and epsilon-greedy never end up plateauing. That is, their cumulative regret keeps on increasing linearly. Decaying epsilon-greedy, however, works very well: its regret is sublinear and the curve flattens out. That is, eventually we end up taking the optimal action almost all the time.

Analysis of greedy algorithm:

In the greedy algorithm, as we select the action with the highest estimated value every time, we may end up locking onto a suboptimal action forever. Example: Say one machine has an 80% chance of giving a reward of 10, another has a 50% chance of giving a reward of 100. Now, we try machine 1 and get a reward of 10, then try machine 2 and get a reward of 0 (we got unlucky). Now, we will end up choosing machine 1 every time (a suboptimal action).

Optimistic initialization:

In optimistic initialization, the initial values of Q(a) for all a’s are high. Hence, every action is highly likely to be chosen (as they have high initial values). The values of bad actions will become smaller over time. That is, say some bad action a has an initial value of 100. If the agent takes it and finds that a paltry reward was received, the action value will be decreased. Say the reward received was 10 and the step size is ½. Then we have 100 + ½ * (10 - 100) = 100 - 45 = 55. However, 55 is still relatively high. Hence, the action will be tried a few more times till its value becomes substantially smaller. That is, we are trying every action enough times before determining that it is bad.
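The arithmetic above can be replayed for a few more pulls. The step size of ½ and the repeated reward of 10 follow the example in the text; the helper name `update_estimate` is just for illustration.

```python
def update_estimate(old_value, reward, step_size):
    """Move the value estimate a fraction step_size towards the observed reward."""
    return old_value + step_size * (reward - old_value)


estimates = [100.0]  # optimistic initial value
for _ in range(4):   # the bad action keeps paying out only 10
    estimates.append(update_estimate(estimates[-1], reward=10.0, step_size=0.5))
print(estimates)     # [100.0, 55.0, 32.5, 21.25, 15.625]
```

The estimate only approaches the true value of 10 gradually, which is what gives the action several more tries.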

However, if we are really unlucky, that is a good action ends up giving bad rewards for say 4-5 tries then we will never try it again and we may end up locking into a suboptimal solution.

Epsilon-greedy:

In epsilon greedy as we continue exploring forever (as it is non-decaying scenario), we will end up accumulating regret in each turn.

Decaying epsilon-greedy:

Consider the decaying schedule shown above. Here d is the smallest gap, that is the difference between the value of the best action and the value of the second-best action. Intuitively, if the difference is large then the term c|A|/(d^2 * t) becomes smaller, i.e. we explore the second-best action and the worse actions a lot less as they are significantly worse than the best action. On the other hand, if the difference is small, the c|A|/(d^2 * t) term is larger. That is, we explore these actions with a higher probability.

Note: This cannot be done in practice as it requires advance knowledge of the gaps (we would need to know the optimal value V* and the true value of each action).
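The schedule itself is a one-liner. This sketch assumes the usual form epsilon_t = min(1, c|A| / (d^2 * t)); the argument names and the default c = 1 are illustrative.

```python
def decaying_epsilon(t, num_actions, smallest_gap, c=1.0):
    """Exploration probability at step t: min(1, c * |A| / (d^2 * t))."""
    return min(1.0, c * num_actions / (smallest_gap ** 2 * t))


for t in (1, 10, 100, 1000):
    # small gap (0.5) vs large gap (1.0) with 3 actions
    print(t, decaying_epsilon(t, 3, 0.5), decaying_epsilon(t, 3, 1.0))
```

With the larger gap the exploration probability drops off sooner, matching the intuition above.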

Lower bound for regret:

It can be proved (the bound involves the KL divergence between the reward distributions) that the asymptotic total regret is at least logarithmic in the number of steps. That is, the best achievable regret curve is a logarithmic curve.

Hence, the best algorithm will be that which leads to a logarithmic curve of regret. (Example – decaying e-greedy).

Optimism in the face of uncertainty:

Consider the three action-value distributions (Q(a1), Q(a2), Q(a3)). We can see that the distribution of Q(a1) is the widest, so although it is the most uncertain, a1 has a chance of having the highest value amongst the three actions. Hence, an optimistic agent will choose a1.

Consider that after choosing a1 (blue curve), we got a small reward. We would then update the distribution and be less likely to choose a1 again.

Problem: We need to find some way of forming these distributions or obviating their need

Upper Confidence Bounds:

The upper confidence bound obviates the need for forming the distribution. For every action we define an upper confidence U_t(a) such that, with high probability, the true action value Q(a) is no larger than the estimate Q_cap(a) plus U_t(a). As we try out the action more and more, U_t(a) decreases, that is, we become less and less uncertain about its value.

Hoeffding’s inequality:

Hoeffding’s inequality says that, for i.i.d. random variables in [0, 1], the probability of the true mean E[X] being greater than the sample mean X_t of t samples by more than u is at most e^{-2tu^2}.

We can use this to represent our upper confidence bound and find the value of U(a).

Calculating the upper confidence bounds:

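We start by selecting a probability p that the true value exceeds the upper confidence bound, say p = 0.05 (i.e. we want the bound to hold with 95% confidence). We can then calculate U_t(a) as shown above (we have set e^{-2 N_t(a) U_t(a)^2} = p, taken the log on both sides and solved for U_t(a), which gives U_t(a) = sqrt(-log p / (2 N_t(a)))). As we can see, the N_t(a) in the denominator ensures that U_t(a) becomes smaller as the number of times the action gets chosen increases.

We should also reduce the value of p as we observe more rewards, e.g. p = t^-4, which gives U_t(a) = sqrt(2 log t / N_t(a)).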

By using this strategy, we achieve logarithmic total regret.
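A minimal sketch of the resulting UCB1 selection rule follows, assuming rewards in [0, 1] and the schedule p = t^-4 so that the bonus is sqrt(2 log t / N_t(a)). Trying every untried action first is a common convention adopted here to avoid dividing by zero.

```python
import math


def hoeffding_bound(p, count):
    """U_t(a) obtained by solving exp(-2 * count * U^2) = p for U."""
    return math.sqrt(-math.log(p) / (2.0 * count))


def ucb1_action(q_estimates, counts, t):
    """Pick argmax of Q(a) + sqrt(2 log t / N_t(a)); untried actions go first."""
    for action, count in enumerate(counts):
        if count == 0:
            return action
    scores = [q + math.sqrt(2.0 * math.log(t) / n)
              for q, n in zip(q_estimates, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])


print(ucb1_action([0.5, 0.4], [100, 1], t=101))  # rarely tried action gets a big bonus
print(ucb1_action([0.9, 0.1], [50, 50], t=100))  # equal counts: the better estimate wins
```

The bonus shrinks with N_t(a) and grows slowly with t, so an action that has not been tried for a long time eventually gets picked again.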

Bayesian Bandits:

In Bayesian bandits, we exploit prior knowledge of rewards. That is, we have a history of knowledge involving some actions and their respective rewards. Using this, we can build up distributions for each Q(a).

Assume a Gaussian distribution. We can make use of the prior knowledge to build up distributions as shown above. Then, we can estimate the upper confidence bound by using standard deviation.

I.e. c * standard_deviation / sqrt(N(a)) is the upper confidence bound. Now, as in the UCB1 algorithm, we simply pick the action which maximizes the mean plus this bound.

This algorithm will only work well if the prior knowledge is accurate.

Probability matching:

In probability matching, the actions are picked in proportion to their probability of being the best action. Example: If there are two actions, one of which has a 70% chance of being the best and the other a 30% chance of being the best, then the first one will be picked 70% of the time and the other will be picked 30% of the time.

Thompson Sampling:

Thompson sampling is one of the earliest and simplest of ideas. In it, we sample from each of the distributions and pick the maximum. For example, consider the following distribution:

We will sample a value from Q(a1), Q(a2) and Q(a3). That is, we get three values from those three distributions (actions). Now, we simply pick the max of the values.
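Here is a small sketch of Thompson sampling for a bandit with win/lose rewards. It assumes a Beta(wins + 1, losses + 1) posterior for each action (i.e. a uniform prior), which is a standard choice but not something stated in these notes; the true win probabilities are made up for the demo.

```python
import random


def thompson_action(wins, losses, rng):
    """Sample one value per action from its Beta posterior and pick the largest sample."""
    samples = [rng.betavariate(w + 1, x + 1) for w, x in zip(wins, losses)]
    return max(range(len(samples)), key=lambda a: samples[a])


rng = random.Random(0)
wins, losses = [0, 0, 0], [0, 0, 0]
true_win_prob = [0.2, 0.5, 0.8]  # hidden from the agent, made up for the demo
for _ in range(2000):
    a = thompson_action(wins, losses, rng)
    if rng.random() < true_win_prob[a]:
        wins[a] += 1
    else:
        losses[a] += 1
pulls = [w + x for w, x in zip(wins, losses)]
print(pulls)
```

Actions whose posterior is still wide occasionally produce a large sample and get explored; once the posteriors sharpen, the best action wins almost every time.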

Value of information:

The value of information can be thought of as the answer to the question: how much is taking an uncertain exploratory action worth? Roughly, it is the long-term reward we obtain after getting the information minus the immediate reward we give up to get it.

Information State-Space:

In information-state space, we store information about the environment and use it to explore the environment better. The bandit problem can be converted to a sequential decision making problem from a one-step decision making problem. Hence, we define an information state S_tilde and a probability matrix P_tilde. By definition they satisfy the Markov property and hence, we can now express the problem as an MDP.

Example: Bernoulli Bandit

Bernoulli Bandit is a special case of multi-arm bandit which issues a reward of 1 with probability p and a reward of 0 with a probability 1 – p.

Consider the problem where we win a game with probability u. For this problem, we maintain a simple information state which is basically a count of the times we won and lost. Alpha counts the times we lost and Beta counts the times we won.
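The information state and its update can be sketched as follows. The tuple layout (alpha, beta) follows the convention in the text (alpha counts losses, beta counts wins); the helper names and the default of 0.5 before any pull are assumptions.

```python
def update_information_state(info_state, reward):
    """info_state = (alpha, beta): alpha counts losses (reward 0), beta counts wins (reward 1)."""
    alpha, beta = info_state
    return (alpha + 1, beta) if reward == 0 else (alpha, beta + 1)


def estimated_win_probability(info_state):
    """Fraction of wins so far; 0.5 is an arbitrary default before any pull."""
    alpha, beta = info_state
    return beta / (alpha + beta) if alpha + beta > 0 else 0.5


state = (0, 0)
for reward in [1, 0, 1, 1]:
    state = update_information_state(state, reward)
print(state, estimated_win_probability(state))
```

Each pull moves us to a new information state, and the next state depends only on the current counts and the reward, which is what makes this a Markov process.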

This gives us an MDP over a countably infinite set of information states, which can in principle be solved with any RL algorithm.

Gittins indices (Bayes Adaptive RL):

Gittins indices are a dynamic programming solution to the bandits problem. We basically form a tree consisting of different scenarios and use it to update our information state. Basically, at each node of the tree, we are summarizing everything we know about the actions using distributions. We can see that we started with two initial drugs with some initial distribution and they got updated at every node. This is the Bayes Adaptive approach. Note that solving the Bayes Adaptive MDP using dynamic programming leads to the Gittins index.

In reality, the exact solution cannot be found in tractable time.

Summary:

Contextual bandits:

In contextual bandits, we also introduce the idea of states (context). So, now the tuple becomes (A, S, R) instead of (A, R).

Example: Consider the problem of ad-placement. The ads will be placed based on the user who has entered the website. That is, if the user is an Indian, male then there will be some particular placement of ads etc. Basically, we are taking actions based on the context (S).

Using linear regression to estimate the value function:

We can use a linear approximation of the action-value function which would improve over time (the parameters will improve).

UCB:

We can also estimate the variance to calculate the upper confidence bound U.

Geometric interpretation:

We are essentially defining an ellipsoid around the parameter theta. This ellipsoid will account for the uncertainty (upper confidence bound).

Hence, we get:

Extending the algorithms of bandits to MDP:

UCB in MDPs

Problem: When we are dealing with MDPs the Q(s, a) value itself keeps improving as the policy improves. So, there is not only uncertainty w.r.t U(s, a) but also Q(s, a). Hence, the problem becomes much harder in case of MDPs.

Example 2: Optimistic initialization in MDPs

The idea of the R-max algorithm is to build an optimistic model by imagining that every transition leads to heaven (best scenario). Once we actually start solving, we find that many of those states are actually bad and we can reduce the values appropriately.

Information state space MDP:

We basically combine the actual state s with the information state s_tilde into an augmented state (s, s_tilde).

Lecture 8 - Integrating Learning and Planning [Notes]

2020-05-03T00:00:00-07:00

Lecture Details

Title: Integrating Learning and Planning
Description: The lecture notes are based on David Silver’s lecture video.
Video link: RL Course by David Silver - Lecture 8
Lecture Slides: Slides

Credits: All images used in this post are courtesy of David Silver

Till now, we have learned the policy/value directly from the experience. But it is also possible to learn the model of the environment from the experience. Such a model helps in planning. This planning helps construct a value/policy function.

Model Based RL:

In model based RL, we have a simulated representation of the environment. This simulated representation can be used for planning future actions. Basically, a model based RL agent will learn the probability transition matrix and the rewards i.e it will learn the MDP.

Advantages and Disadvantages of Model Based RL:

The model of the environment is learned from real-world experience. That is, the agent first acts in the real world, gets some real-world experience tuples (S, A, R, S’) and uses these tuples to build the model. So, this is a supervised learning problem and hence any supervised learning technique can be used.

Because our model is our knowledge about the environment, we also know what we are uncertain about and hence can reason about that uncertainty.

Disadvantages: Here, we are performing a two-step process. First we learn a model and then we learn the value/policy from it, as opposed to the previous model-free techniques where we only learned the value/policy. Hence, there are two sources of approximation error.

Formal definition of a model:

So, a model is a representation of an MDP (S, A, P, R). That is, given a state S_t and action A_t, the model can estimate the next state S_t+1 and the reward R_t+1.

Note: Here we are assuming that the state space S and action space A are known. But in more convoluted problems, it is possible that this is not the case and the state/action space would also need to be estimated.

Why even do model based RL?

Consider a maze game where a new maze is formed each new episode. A model free agent will have extreme difficulty navigating the environment as it won’t be able to learn the value functions properly (as the maze changes every episode). On the other hand, the model based agent can simply learn the rules of the game, i.e going up makes me go north, going left makes me go west etc. By learning such trivial rules, the model based agent would be able to solve the maze much better.

How to learn a model?

As we previously saw, the model learning problem is a supervised learning problem.

Types of models which can be learned:

There are endless possibilities of models which can be learned. It is up to the programmer to choose the right model representation.

Table lookup model:

The simplest way to “learn” a model is by creating a table. The transition probabilities can be estimated by counting the transitions actually experienced from each state-action pair and dividing by the number of visits to that pair. Similarly, the reward can be the average of all the rewards received from that pair.

Example:

Consider the simple problem shown above. Here, the transition from state A to state B is seen once (first tuple). Hence, the transition matrix will show it with 100% probability. There are 8 B’s leading to a terminal state, 6 of which get a reward of 1 and 2 of which get a reward of 0. Hence, we can form a model which says that 6/8 (75%) of the times, we get r = 1 and 25% (2/8) of the times we get r = 0.
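The counting procedure for this example can be sketched as below. The episode list is reconstructed from the counts in the text (one A -> B transition, eight visits to B with six rewards of 1 and two of 0); since the example has no choice of action, the table is keyed by state only, which is a simplification made here.

```python
from collections import defaultdict


def build_table_model(episodes):
    """Estimate transition probabilities and mean rewards by counting.

    episodes: list of episodes, each a list of (state, reward, next_state) tuples.
    """
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:
        for state, reward, next_state in episode:
            counts[state][next_state] += 1
            reward_sums[state] += reward
            visits[state] += 1
    transitions = {s: {s2: c / visits[s] for s2, c in nxt.items()}
                   for s, nxt in counts.items()}
    rewards = {s: reward_sums[s] / visits[s] for s in visits}
    return transitions, rewards


episodes = ([[("A", 0, "B"), ("B", 0, "end")]]
            + [[("B", 1, "end")]] * 6
            + [[("B", 0, "end")]])
transitions, rewards = build_table_model(episodes)
print(transitions["A"], rewards["B"])
```

The single visit to A is enough for the table to claim that A always leads to B, which is exactly the weakness discussed next.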

Problem of exploration:

One evident issue with this idea is that if our real world experiences aren’t a good representation of the environment then the model constructed will have a heavy bias. That is, in our example, we have only seen one experience (tuple) with state A. Based on that single tuple, we learned a model which says that A -> B has a 100% chance of occurrence. But in reality, this is unlikely to be true and may simply be a result of learning from too few real-world experiences. Hence, the amount of exploration needs to be appropriately handled.

Planning with a model:

Once we have a model representation, we can use this model to generate new tuples (experiences) and learn from those experiences. Hence, it’s important to have a good simulated representation as we are learning from the generated simulated experiences. This learning can be done using any planning algorithm like value iteration, policy iteration etc.

Sample based planning:

The simplest approach is to use the model to generate new samples of data and apply model-free RL algorithms like MC, SARSA to those samples.

Example:

Once we create the model shown above, we can use that model to generate samples (sampled experiences) and apply RL algorithms to those sampled experiences.

In this case, as the sample A, 0, B, 1 was generated twice by the model, V(A) becomes 1, as both sampled episodes starting in state A led to a final reward of 1. (Note that we aren’t considering the real-world experience A, 0, B, 0.) The real-world experiences are only used to create the model; the value function is then learned from the experiences sampled from the model.
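Sample-based planning on this example can be sketched as follows: we sample episodes from the learned model only and evaluate them with Monte-Carlo averaging. The `MODEL` dictionary is written out by hand from the counts in the text, and its layout is an assumption of this sketch.

```python
import random

# Table-lookup model of the A/B example, written out by hand from the counts in the text.
MODEL = {
    "A": {"next": "B", "reward_prob": 0.0},    # A -> B always, reward 0
    "B": {"next": None, "reward_prob": 0.75},  # B -> terminal, reward 1 in 6 of 8 visits
}


def sample_return(start, rng):
    """Sample one episode from the model and return its undiscounted return."""
    state, total = start, 0.0
    while state is not None:
        entry = MODEL[state]
        if rng.random() < entry["reward_prob"]:
            total += 1.0
        state = entry["next"]
    return total


def mc_value(start, num_episodes, rng):
    """Monte-Carlo estimate of V(start) from simulated episodes only."""
    return sum(sample_return(start, rng) for _ in range(num_episodes)) / num_episodes


rng = random.Random(0)
print(mc_value("A", 10000, rng), mc_value("B", 10000, rng))
```

With only a handful of sampled episodes the estimate can be far off (V(A) = 1 in the text comes from just two samples); with many samples it approaches 0.75 under this model.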

What if our model sucks?

Again, a big problem is that the learned policy is only as good as the model which we have learned. If the simulated environment representation doesn’t represent the real-world dynamics well, then the policy learned won’t be optimal. One way to deal with this is to jettison the model-based approach and fall back to model-free RL, and another way is to use sophisticated approaches like Deep Belief Networks which would weight the model based on its uncertainty.

Integrated architectures:

A more robust solution is to use both the real experience and simulated experience. If used properly, the resulting agent performs better than normal.

DYNA:

Dyna is an integrated architecture which learns the model from real experience and does the planning of the value function using both real and simulated experience.

Architecture:

Here we can see that the value/policy part is formed using planning via the learned model and directly via the real experiences.

Algorithm: Dyna Q

Basically, in Dyna-Q we execute an action A in the real environment and observe the reward R and the next state S’. The Q-value is updated using the usual Q-learning update. This (S, A, R, S’) tuple is also used to improve the model. Then we update the Q-values again, n times, using tuples sampled from the model.

Hence, in every iteration, the first update comes from real-world experience and then n updates are from the samples generated from the learned model.
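The loop described above can be sketched in a few lines. This is only a minimal tabular sketch under assumptions that are not in the lecture: a hypothetical deterministic corridor environment (`step`), a table-lookup model, and arbitrary hyperparameters.

```python
import random
from collections import defaultdict

def dyna_q(n_states=6, episodes=50, n_planning=10, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Dyna-Q on a toy corridor: states 0..n_states-1, goal at the right end."""
    rng = random.Random(seed)
    actions = (-1, +1)                      # move left / move right
    q = defaultdict(float)                  # Q[(state, action)]
    model = {}                              # learned model: (s, a) -> (reward, next_state)
    goal = n_states - 1

    def step(s, a):                         # hypothetical "real" environment (deterministic)
        s2 = min(max(s + a, 0), goal)
        return (1.0 if s2 == goal else 0.0), s2

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def q_update(s, a, r, s2):              # one Q-learning backup
        target = r if s2 == goal else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.choice(actions) if rng.random() < eps else greedy(s)
            r, s2 = step(s, a)              # real experience
            q_update(s, a, r, s2)           # first update: from the real transition
            model[(s, a)] = (r, s2)         # model learning
            for _ in range(n_planning):     # then n updates from simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                q_update(ps, pa, pr, ps2)
            s = s2
    return q
```

Setting `n_planning=0` reduces this to plain Q-learning, which makes it easy to compare how many real steps each variant needs.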

Interesting problems:

Consider the environment shown above. Here the gray tiles represent a wall and hence, the agent cannot travel through that space. Now, the solution is to go all the way right from state S and then all the way up (as per the first diagram). But say that we change the environment midway. Now, if the agent follows the same solution it won’t get the optimal reward, and it will take a lot of time to find the new correct way. This can be solved by an improved architecture called Dyna Q+. Dyna Q+ rewards exploration by giving a bonus for trying state-action pairs that have not been tried for a long time. Hence, in dynamic environments which may change, algorithms like Dyna Q+ will perform better than Dyna Q.

The same is true even if the changed environment is easier. Dyna Q will keep following the same previous path even though the new path is shorter and better.

Simulation-based search:

Forward search algorithms focus on the present state. That is, we don’t care about some random sample from any random state (as we have seen till now); we care about samples starting from the present state S_t. So, we are looking ahead from the present state S_t. Now, this state S_t may be some intermediate state. That is fine, as by solving just the sub-MDP starting from S_t we are still improving our understanding and policy.

Simulated forward search:

This idea of forward search can be executed by sampling from our learned model. Then we can apply model free RL techniques to these simulated episodes.

Formally,

Where k = episode

Monte-Carlo search:

Monte-Carlo search basically samples episodes from the model and evaluates those using the MC algorithm. Notice that, in simple monte-carlo search we aren’t improving the simulation policy.

Monte-Carlo tree search:

In tree search, we build a search tree containing every state (and action) visited. In the case of simple Monte-Carlo search we were only updating the root state S_t.

The simulation policy is improved epsilon greedily. There might also be states which we have never explored. The actions to such states can be taken randomly.

Example: Game of Go

Rules:

In Go, we essentially try to surround as much territory as possible. Also, if we surround a stone of the opposite color, that stone is captured. That is, if black stones surround a white stone, the white stone is removed from the board.

Evaluating the positions:

Notice that, max min v_pi(s) says that the agent needs to consider the optimal play (mini-max algorithm).

Applying forward search:

We can see that by simulating the outcomes starting from some current position s, we can start approximating the value function. In the example shown above, as 2 in 4 outcomes led to a black victory, our V(s) will be 2/4 = 0.5.

Tree evaluation

Consider that we start by sampling some branch which led to black winning (the terminal state has reward 1). Hence, we write the root as 1/1 (1 win in 1 episode).

Executing another sample:

Now, as the terminal state = 0, we write 1/2 in the root and 0/1 in its child. Notice that we are slowly building the tree.

Executing another sample:

Now, we explore another branch and black wins. So, we update the values at the explored nodes.

This way we keep adding new information to the tree policy and go in the direction of the best branch. That is, the tree policy will choose the likely branch (epsilon greedily) and hence, we will follow the best known path. Once we are out of the tree policy domain, we choose the actions using some default policy.
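The bookkeeping in the three samples above (1/1, then 1/2 and 0/1, then another win) can be sketched as a wins/visits table that is updated along the simulated path. The move names `"a"` and `"b"` and the helper `backup` are hypothetical, and the sketch leaves out the tree policy and default policy entirely.

```python
from collections import defaultdict

# stats[node] = [wins, visits]; a node is identified by the tuple of moves from the root
stats = defaultdict(lambda: [0, 0])

def backup(path, black_won):
    """Update wins/visits for the root and every in-tree node along `path`."""
    for depth in range(len(path) + 1):
        node = tuple(path[:depth])          # () is the root
        stats[node][0] += int(black_won)
        stats[node][1] += 1

def value(node):
    """Monte-Carlo value estimate of a node: fraction of simulations black won."""
    wins, visits = stats[node]
    return wins / visits

backup([], True)        # first sample: black wins, only the root is in the tree
backup(["a"], False)    # second sample goes through child "a": black loses
backup(["b"], True)     # third sample explores another child "b": black wins
```

After these three calls the root holds 2 wins in 3 visits, child `"a"` holds 0/1 and child `"b"` holds 1/1, so a greedy tree policy would prefer `"b"`.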

Advantages of MC:

TD search:

Instead of MC, it is also possible to use TD search and use the idea of TD for tree evaluation etc.

Formally,

Dyna-2 architecture:

Lecture 7 - Policy Gradients [Notes]

2020-05-02T00:00:00-07:00

Lecture Details

Title: Policy Gradients
Description: The lecture notes are based on David Silver’s lecture video.
Video link: RL Course by David Silver - Lecture 7
Lecture Slides: Slides

Credits: All images used in this post are courtesy of David Silver

Instead of estimating the value functions and then reaching the optimal policy by following the updates epsilon greedily, we can directly tinker with the parameters of the policy function.

The types can be categorized as follows:

As we can see, value based has an implicit policy, that is by following the values epsilon greedily, we would reach the optimal solution without stating any explicit policy. On the other hand, policy based algorithms have an explicit policy which our agent will improve over time.

The actor-critic model takes the best of both worlds and the agent navigates the environment using policy based approach while action values are updated using value-based approach.

What’s the point of policy-based approach if Value based exists?

In value based approaches, we follow the values greedily and eventually reach convergence. However, in some cases following the values greedily may be slower than directly tweaking the parameters of the policy, and hence, policy based approaches tend to have better convergence properties.

Value based approaches deal with action-value functions and need to take a max over actions. So, if the action space has many dimensions or is continuous, the updates can be slow, and hence policy based approaches are more effective in this case.

As value based takes the max, the policy is always deterministic (even though the probable actions are in terms of probabilities, we are always choosing a single output). Policy based approach can give stochastic policies.

Advantages of stochastic policies:

Consider the game of rock-paper-scissors. In this, by having a deterministic policy like always choosing rock, the opponent can easily exploit it by always choosing paper.

Hence, this is one of the scenarios where the optimal policy is a uniform random policy. A policy gradient algorithm can represent such a policy and would therefore find a better way of playing rock-paper-scissors than a value based algorithm.
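A quick way to see this is to compute how badly each policy does against an opponent who plays the best response. The sketch below assumes the usual payoffs of +1 for a win, -1 for a loss and 0 for a draw; the helper names are made up for illustration.

```python
MOVES = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, theirs):
    """+1 if `mine` beats `theirs`, -1 if it loses, 0 for a draw."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

def expected_payoff(policy, opp_move):
    """Expected payoff of a stochastic policy against a fixed opponent move."""
    return sum(p * payoff(m, opp_move) for m, p in policy.items())

def worst_case(policy):
    """Expected payoff when the opponent plays the best response to `policy`."""
    return min(expected_payoff(policy, o) for o in MOVES)

always_rock = {"rock": 1.0, "paper": 0.0, "scissors": 0.0}
uniform = {m: 1 / 3 for m in MOVES}
```

Here `worst_case(always_rock)` is -1 (the opponent always plays paper), while `worst_case(uniform)` is 0: no opponent can exploit the uniform random policy.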

Alias world: Incomplete MDPs

Policy gradient techniques are also useful when the environment is only partially observable, i.e. the agent does not have all the information about the state.

Consider the example of alias world. Here, the gray states cannot be differentiated by the agent; that is, the agent is unable to understand them.

If we solve the problem using value functions, we get the following solution:

Both gray states get either a left arrow (W direction) or a right arrow (E direction). This is not optimal, as the first gray state should have pointed right while the second should have pointed left. But as the agent cannot differentiate between them, they will get the same direction.

On the other hand, a policy gradient approach leads to the following solution:

As stochastic policies are allowed, the agent may now move left with a 50% chance and right with a 50% chance. Hence, the solution will be reached much faster than the value-based approach.

Measuring quality of policy based objective function:

Here the cost function cannot be the error function. We want to measure the quality of a policy hence it needs to be something different. If we know the starting state then it can be the total expected return. In continuing environments (it never terminates), we can choose the average value.

Here d^pi_theta (s) is the probability of being in state s with policy pi and parameters theta.

As gradient based methods provide the greatest efficiency, we optimise policies using those. However, it’s not necessary to restrict policy optimization to those. Anything can be used.

Exploiting the sequential structure is updating the policy right after getting a few sequential pieces of information instead of waiting till the end of the episode to do so.

Policy gradient:

Note that, in the case of value based methods we were performing gradient descent as we were trying to minimize the error (hence trying to find the minimum of the function). In the case of policy gradient, we are trying to maximize the objective function J(theta). Hence, we perform gradient ascent.

Computing gradients using finite differences:

Score function:

We assume that policy is differentiable whenever it is non-zero. This means that the policy need NOT be differentiable everywhere. It only needs to be differentiable at the right places.

Here, the gradient of the policy is rewritten in the form of a log function. This is done as the output produced is equivalent and more importantly log will simplify the derivative calculation.

That is, for softmax and Gaussian policies, the log cancels the exponential (log(e^x) = x), and hence calculating the gradient becomes much simpler.

Softmax policy:

Here, the gradient of the log policy (the score) can be interpreted as the feature of the action the agent took, phi(s, a), minus the average feature over all actions, E[phi(s, .)]. Basically, how much greater or smaller was phi(s, a) than the expected (average) value. The parameters are then updated that strongly in the direction of the action taken.

Softmax policy is usually used for discrete action spaces.
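The "feature minus average feature" form of the score can be checked with a small sketch. It assumes a linear softmax policy, pi(a | s) proportional to exp(phi(s, a) . theta); the function names and the list-of-lists layout for the features are choices made here, not something taken from the lecture.

```python
import math

def softmax_policy(theta, features):
    """features[a] is phi(s, a); returns pi(a | s) proportional to exp(phi(s, a) . theta)."""
    prefs = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    m = max(prefs)                                   # subtract max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def score(theta, features, action):
    """grad_theta log pi(action | s) = phi(s, action) - E_pi[phi(s, .)]."""
    pi = softmax_policy(theta, features)
    expected = [sum(p * phi[i] for p, phi in zip(pi, features))
                for i in range(len(theta))]
    return [features[action][i] - expected[i] for i in range(len(theta))]
```

For example, with `theta = [0.0, 0.0]` and one-hot features `[[1, 0], [0, 1]]`, both actions have probability 0.5 and `score(theta, features, 0)` is `[0.5, -0.5]`: taking action 0 pushes its own weight up and the other weight down.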

Gaussian policy:

One-step MDPs:

Consider a rudimentary scenario where the episode lasts only for a single step. After taking one step, it is terminated, and a reward r is obtained. The policy problem is to find a policy which would maximise this reward.

We can use likelihood ratios to compute the policy gradients as shown above. For the computation, remember the log trick.

We know,

So, we can get rid of the policy distribution using the log trick. The reason we want to get rid of it is because we don’t have direct knowledge about the policy distribution pi (shown above).

So, to get rid of it, we can divide and multiply by the policy distribution in the gradient of the cost function. That is,

If we compare this with the derivative of the log function, we can see how we got the final gradient shown in the image above. The sums over d(s) and pi(s, a) together form an expectation over the states and actions visited while following the policy. Hence, we are simply calculating an expected value now, which we can estimate by sampling!

Generalization of this idea:

The policy gradient theorem states that by simply replacing the one-step instantaneous reward r by the total long-term value Q, the same expression gives the policy gradient for multi-step MDPs.

Monte-Carlo policy gradient:

Here, the long-term value Q is replaced by the return obtained at the end of the episode, which is an unbiased sample of Q.
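As a rough sketch of this Monte-Carlo policy gradient update, consider the one-step MDP from earlier with two actions and a softmax policy over two preferences. The rewards (0 for action 0, 1 for action 1), the step size and the function name are assumptions made purely for illustration.

```python
import math
import random

def reinforce_bandit(steps=2000, alpha=0.1, seed=0):
    """Monte-Carlo policy gradient on a two-action, one-step MDP."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]          # one preference per action
    rewards = [0.0, 1.0]        # assumed toy rewards: action 1 is better

    def policy():
        m = max(theta)
        e = [math.exp(t - m) for t in theta]
        z = sum(e)
        return [x / z for x in e]

    for _ in range(steps):
        pi = policy()
        a = 0 if rng.random() < pi[0] else 1     # sample an action from the policy
        g = rewards[a]                           # return of this one-step episode
        for b in range(2):
            # score of a softmax over preferences: indicator(b == a) - pi(b)
            grad_log = (1.0 if b == a else 0.0) - pi[b]
            theta[b] += alpha * grad_log * g     # gradient ascent step
    return policy()
```

After training, the returned policy puts most of its probability on action 1. Because the update uses sampled returns directly, runs with noisy rewards would show the high variance that the next section tries to reduce.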

Reducing variance:

The problem with the previous policy-gradient update is that there is still a lot of variance. This can be reduced using the actor-critic model.

Here, the actor is the one who actually takes the decisions and performs the action. The critic is only there to evaluate. The actor will navigate the environment, take some action and get some reward. The critic then evaluates how good/bad the action taken was and updates the action-value function accordingly. The actor then updates the policy in the direction suggested by the critic.

Job of the critic:

The job of the critic is to perform policy evaluation. This can be done using MC evaluation, TD evaluation or TD lambda.

Action value actor critic:

Hence, the critic uses linear TD(0) to approximate the action-value function and update it while the policy gets updated using policy gradient.

Bias in actor-critic algorithms:

As we are approximating the policy gradient (notice that the true Q value is not used; the actor-critic model uses a linear approximation of it), bias is introduced into the algorithm. Hence, the right solution may not be reached.

This problem can be solved using compatible approximation theorem:

Basically, over here the features are the score of the policy.

Reducing variance:

One way of reducing variance is to subtract a baseline function B(s) from the policy gradient. The baseline term has an expected value of 0, so it changes the variance of the gradient but not its expectation. We can see this because B(s) and the gradient operator can be taken out of the summation over actions; the summation of pi(s, a) over actions is 1, and gradient(1) = 0.

A good choice of baseline is the state-value function V(s), as it depends only on the state and represents the value we expect from that state on average.

So, we can subtract the state value V from the action value Q. The resulting value basically tells us how much “advantage” we are gaining by taking action a in state s.

Then the score function gradient can be updated by considering this advantage function.

How to estimate the advantage function?

One way is to use two function approximators (one for V and one for Q) and update both over time to get better approximations.

Representing advantage function in the form of TD error:

The advantage function can also be estimated by the TD error. This is because, in expectation, the Q value is simply r + gamma*V(s’) (as per the Bellman equation) and the TD error is this value minus V(s).
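Written out, the argument above is the following short derivation. It assumes the true value function V of the current policy is used; with an approximate V the equality only holds approximately.

$$
\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)
$$

$$
\mathbb{E}_{\pi_\theta}\left[\delta^{\pi_\theta} \mid s, a\right]
= \mathbb{E}_{\pi_\theta}\left[r + \gamma V^{\pi_\theta}(s') \mid s, a\right] - V^{\pi_\theta}(s)
= Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)
= A^{\pi_\theta}(s, a)
$$

So the TD error is an unbiased sample of the advantage, and the critic only needs to estimate V rather than both V and Q.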

Critics at different time scales:

Value function can be estimated at different time steps (scales) using the techniques shown above.

Actors at different time steps:

Similarly, actors can perform at different time steps.

Actor update with eligibility traces:

Alternative policy gradient directions:

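One of the problems with vanilla policy gradients is that they are sensitive to how the policy is parametrized: a policy can often be reparametrized without changing the action probabilities, yet the gradient direction changes. So, the plain gradient is not necessarily the best ascent direction, and convergence may take a lot of time.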

Natural policy gradients:

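The natural policy gradient is independent of the parametrization. It finds the ascent direction that is closest to the vanilla gradient when the policy is changed by a small, fixed amount, and it is obtained by multiplying the vanilla gradient by the inverse of the Fisher information matrix of the policy.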

Natural actor-critic:

Summary:

Hence, we can see that the policy gradient has many different forms. The different forms trade off bias and variance differently.

Lecture 6 - Value Function Approximation [Notes]

2020-05-01T00:00:00-07:00

Lecture Details

Title: Value function approximation
Description: The lecture notes are based on David Silver’s lecture video.
Video link: RL Course by David Silver - Lecture 6
Lecture Slides: Slides

Credits: All images used in this post are courtesy of David Silver

Why are function approximators required?

Complex reinforcement learning problems like learning the game of Go have a huge state space (10^170 states for Go). Finding the exact value of all such states is not computationally feasible. Hence, function approximators are required to solve real-world, large scale problems.

One huge advantage of function approximators is that we can generalize from seen states to unseen states. That is, we don’t need to visit all the states to estimate their values. Once we approximate a function well enough, any state can be approximated well.

Types of function approximator:

Function approximators may take only the state as input or the state action pair (s, a) as input. Then we can output the state-value function, action value or action values for all actions as shown above.

Note – Here w = weight matrix

Different approximators:

Neural networks and linear combinations of features are widely used as they are differentiable.

Note that, in RL, the data is non-stationary (we are learning while exploring the environment) and it is also non-iid (iid = independent and identically distributed); that is, the time sequence of the data matters.

Incremental methods:

Basics of gradient descent:

Value function approximation using stochastic gradient descent:

Here, assume that the actual value v_pi(S) is known to us. Then we are simply calculating the squared error between the predicted value v_cap and the actual known value.

Feature vectors:

A state can be represented using features. This is useful because we can now pass features to the neural network to better approximate the value function.

Linear value function approximation:

The value function can be represented as a linear combination of the features x(S) and the weights w, i.e. v_cap(S, w) = x(S)^T * w.

Intuitive thinking: By representing it in such a way, we can see that the squared error will become quadratic in nature. Hence, the plot of J(W) will be a quadratic curve. We know that quadratic functions have a global optimum and hence, we can say that this algorithm will converge to the global optimum.

Table lookup:

The case of table lookup can also be shown in the form of features. The feature vector in this case will have one entry per state, and the ith entry will be 1 if the current state is Si, else it will be 0.

Note this is only to show the relationship between the previous table lookup algorithm and current neural net implementation.

Incremental Prediction algorithm:

Till now we assumed that Vpi(S) was known to us. But this won’t be the case in reality. Hence, we approximate it using return G_t for Monte Carlo and the usual TD estimate for TD. Similarly, the lambda return G_t is used for TD(lambda).

Why isn’t the derivative of v_cap(S_t+1 , W) calculated in TD(0)?

The interesting thing is that in TD(0) the “actual” value Vpi is estimated using the TD target R_t+1 + gamma*V_cap(S_t+1, W). Here the V_cap entry is the value output by the neural network itself. Hence, we are using the neural network’s approximation to improve the neural network. This works over time as R_t+1 is the actual reward. Hence, by updating at every time step, we slowly bring the estimate closer to the true value. But notice that we are ignoring the derivative of V_cap(S_t+1, W) and only calculating it for V_cap(S_t, W). This is because we treat the target as a fixed quantity and only move the current estimate V_cap(S_t, W) towards it. If we also differentiated the target, we would be adjusting the estimate at S_t+1 towards the estimate at S_t as well, so we would be kind of pulling in both directions.
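A minimal sketch of this semi-gradient TD(0) update with a linear approximator is shown below. The environment is an assumed deterministic chain (0 -> 1 -> ... -> terminal, reward 1 on the last transition) with one-hot features, chosen so the true values gamma^k are easy to check; none of the names come from the lecture.

```python
def one_hot(s, n):
    """Table-lookup features: entry s is 1, all others are 0."""
    return [1.0 if i == s else 0.0 for i in range(n)]

def v_hat(x, w):
    """Linear value estimate x(S)^T w."""
    return sum(xi * wi for xi, wi in zip(x, w))

def semi_gradient_td0(n_states=4, episodes=200, alpha=0.1, gamma=0.9):
    """Linear TD(0) on a chain 0 -> 1 -> ... -> terminal, reward 1 on the final step."""
    w = [0.0] * n_states
    for _ in range(episodes):
        for s in range(n_states):
            x = one_hot(s, n_states)
            last = s == n_states - 1
            r = 1.0 if last else 0.0
            v_next = 0.0 if last else v_hat(one_hot(s + 1, n_states), w)
            target = r + gamma * v_next          # treated as a constant: no gradient taken
            delta = target - v_hat(x, w)
            # gradient of v_hat(S_t, w) w.r.t. w is just x(S_t) for a linear approximator
            for i in range(n_states):
                w[i] += alpha * delta * x[i]
    return w
```

With these settings the weights approach `[0.9**3, 0.9**2, 0.9, 1.0]`. Only the features of the current state S_t appear in the weight update, which is exactly the "ignored derivative" discussed above.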

However, in some cases taking both derivatives may provide better results.

Monte-Carlo with value-function approximation:

Remember in Monte Carlo we first run through the entire episode. Hence, we would collect tuples (S1, G1), (S2, G2)..(St, Gt) at the end of each episode. Then these tuples can be used to perform an update in the right direction. Hence, these tuples can be thought of as training data and the problem reduces down to a supervised learning problem per episode.

TD learning for value function approximation:

In the case of TD learning, the target is not the actual return. It’s an estimate; hence, the training data will also be an estimate. Also, the update will be performed at each time step.

TD(lambda) with value-function approximations:

Notice that in backward-view linear TD(lambda), the eligibility trace at time step t is the decayed trace from time step t-1 plus x(St). Here we consider the features at step t (for the linear case). Note that this is basically the gradient of v_cap(St, w), which in the case of a linear combination reduces to x(St).

Control with value-function approximation:

As we saw previously, action-value functions need to be used over state-value functions in case of model-free environments. Hence, we would instead approximate the action-value function in such cases.

Linear action-value representation:

Incremental algorithms:

Bootstrapping:

The graphs show that in most cases, bootstrapping (choosing TD(lambda) with lambda between 0 and 1) is a good idea.

Convergence of prediction algorithms:

It’s important to understand which algorithm may not converge as in some cases, the derivatives may shoot in the wrong direction and give catastrophic results.

Improvements: Gradient TD

Remember that in TD, the target Rt+1 + gamma*q_cap(St+1, a, W) contained q_cap, which was approximated by the neural network itself, and we ignored the gradient of that target. Hence, the update didn’t follow the true gradient of any objective function. Gradient TD solves this problem by following the true gradient of the projected Bellman error.

Convergence of Control: (Note that control algorithms will optimal solution)

Batch reinforcement learning:

In incremental reinforcement learning we were using the (S, A) tuples only once. After updating, we were throwing away that tuple. Updating the gradient once is not enough to squeeze out all information from the tuple.

Example: A game may have different levels. After starting level 2 which may be different from level 1, our agent will start losing information of level 1 as it will be overshadowed and forgotten due to the current incoming tuples of level 2.

This can be solved by using experience replay where we store all the tuples and then choose a random sample from it at every time step.
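A minimal replay memory might look like the sketch below. The class name, the fixed capacity (old tuples are dropped once the buffer is full, which the text above does not require) and the seeding are assumptions for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state) tuples and samples them at random."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest tuples are dropped when full
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Return a random batch (without replacement) of stored tuples."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)
```

At every time step the agent would call `add(...)` with the newest transition and then train on `sample(batch_size)`, so tuples from level 1 keep showing up in updates even while the agent is playing level 2.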

Experience replay also converges to the least squares solution.

Experience Replay in Deep Q-networks (DQN):

State-of-the-art DQNs use experience replay to solve the problem of forgetting previous tuples and to squeeze the maximum information out of each tuple, by keeping the tuples in memory and sampling a batch from them in every iteration. This also helps reduce the strong correlation between consecutive samples.

Fixed Q-targets:

The other improvement used is the fixed Q targets. This is like the off policy learning of Q learning where we had two policies: behavior and target. The q-value of state s’ was chosen from target policy while the current action was chosen from the behavior policy.

Similarly, here we keep a copy of the old q-learning targets. That is there are two networks. Old DQN and the present DQN. After every n iterations, say 1000 iterations, Old will be set to present.

But within those iterations the Q(s’, a’, w_) will be chosen from the old DQN. This helps stabilize the network.

Linear least square prediction:

For fairly small problems (where the number of features is small), we can instead use linear algebra to directly get the approximate values instead of using a neural network.

As we can see, the weights w are calculated by inverting the sum of the outer products x(St)*x(St)^T and multiplying the result by the sum of x(St)*Vt. Inverting an N x N matrix costs O(N^3), so this only works when N (the number of features) is small.
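For intuition, the closed-form solution can be sketched for the special case of two features, where the 2 x 2 matrix can be inverted by hand. The function name and the toy data in the usage note are hypothetical; a real implementation would use a linear algebra library for general N.

```python
def least_squares_weights(xs, vs):
    """Solve (sum_t x_t x_t^T) w = sum_t x_t v_t for 2-dimensional feature vectors."""
    # entries of the symmetric 2 x 2 matrix A = sum_t x_t x_t^T
    a11 = sum(x[0] * x[0] for x in xs)
    a12 = sum(x[0] * x[1] for x in xs)
    a22 = sum(x[1] * x[1] for x in xs)
    # entries of the vector b = sum_t x_t v_t
    b1 = sum(x[0] * v for x, v in zip(xs, vs))
    b2 = sum(x[1] * v for x, v in zip(xs, vs))
    det = a11 * a22 - a12 * a12          # must be non-zero for A to be invertible
    # w = A^{-1} b using the explicit inverse of a 2 x 2 matrix
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]
```

For example, with features `[[1, 0], [1, 1], [1, 2], [1, 3]]` and targets `[2, 5, 8, 11]` (generated from v = 2 + 3 * x[1]), the function recovers the weights `[2, 3]` in one shot, with no step size and no iteration.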

In practice:

Convergence:

Hence, linear algorithms will lead to the global optimum.

Least square policy evaluation:

We can use Q-learning with least squared error between the q values for evaluating policies.

Least square control:

Importance Sampling

2020-04-27T00:00:00-07:00

Original Video link:

Credits: All images used in this post are courtesy of Mathematical Monk

Importance Sampling: Introduction

Sampling is actually a misnomer. Using importance sampling we are essentially approximating the expected value of some function f(X) under a distribution p(X) by using samples from another distribution q(X).

We use importance sampling when it is difficult to grab samples from the original distribution p(x), so we sample from q(x) instead. We might also use importance sampling when we want to give “importance” to certain areas of the original distribution. Basically, say we want to grab more samples from regions of the original distribution which occur rarely. We can design our q(x) such that we grab more samples from these regions.

Another important thing to note is, even though we are assuming it’s difficult to grab samples from p(x), we still should be able to calculate value of p(x) given some x.

Mathematically, we just multiply and divide the expected value formula by q(x) as shown above. (Here q(x) must be non-zero wherever p(x) is non-zero, i.e. q(x) = 0 implies p(x) = 0; this is called absolute continuity.)

The p(x)/q(x) term can be basically thought of as a “weight” term. Hence, the formula becomes:
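As a rough numerical illustration of the weighted form, the sketch below estimates E_p[f(X)] by averaging w(x)*f(x) over samples drawn from q. The concrete choices are assumptions made for the example only: p is a standard normal, q is a wider normal with standard deviation 2, and the helper names are made up.

```python
import math
import random

def normal_pdf(x, sigma):
    """Density of a zero-mean normal distribution with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_estimate(f, n=50000, seed=0):
    """Estimate E_p[f(X)] for p = N(0, 1) using samples from q = N(0, 2^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)                       # sample from q, not from p
        w = normal_pdf(x, 1.0) / normal_pdf(x, 2.0)   # weight w(x) = p(x) / q(x)
        total += w * f(x)
    return total / n
```

Calling `importance_estimate(lambda x: x * x)` gives a value close to 1, the variance of the standard normal, even though no sample was ever drawn from p itself.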

Can importance sampling estimate even better than original P(X)?

So, the cool thing about importance sampling is that if we choose our q(x) correctly then we might even be able to estimate better than directly estimating from p(x). We do this by reducing the variance term.

Intuition behind choosing a “good” Q:

Consider the figure shown above. Let the red line denote the function f(x) (the “return”) whose expected value we want, and let p(x) represent our probability distribution. As we can see, most of the density is away from the huge negative spike in f(x). So, if we use plain Monte Carlo sampling from p(x), we won’t be able to estimate the true average properly, as we will mostly be grabbing samples from the dense area. It’s very unlikely that we get an x where f(x) has that huge negative spike. But from the equation we can see that the huge negative spike greatly affects the expected value: even though p(x) is small there, |f(x)| is significantly large, hence the overall expected value is influenced by such f(x)*p(x) entries. If we choose q(x) as shown above we can solve this problem.

Example of a bad q(x):

In this example, we can see that q(x) covers irrelevant regions and so it will be a bad estimate of the actual expected value.

Looking at this from another perspective: However, this also shows the power of importance sampling. If for some reason we want to sample more from these “irrelevant” regions then we can simply design our q(x) such that we end up sampling from these regions. So, using importance sampling we are able to choose regions of importance to sample from.

But in general, we choose q(x) such that q(x) is large where |f(x)|*p(x) is large.

Importance sampling without normalization

Till now we were assuming that we could calculate f(x) and w(x) (p(x)/q(x)) efficiently for all values of x. However, in reality, we might know p(x) and q(x) only up to a normalizing constant. So, how do we calculate w(x) efficiently in this case?

Look at the image shown above. We can see how p(x) and q(x) can be expressed in terms of normalizing constants. Here we say Zp and Zq are unknown to us. The right-hand side integrals are true because the integrals of p(x) and q(x) must be 1; hence p_tilde(x) and q_tilde(x) should integrate to Zp and Zq respectively, so that each fraction equates to 1.

So, let’s start by rewriting the equations in terms of normalizing constants:

But we still don’t know Zq/Zp.

However, we can perform Monte Carlo approximations of these as follows:

So, the final equation becomes:
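The resulting estimator divides the weighted sum of f by the sum of the unnormalized weights, so Zp and Zq never need to be known. The sketch below assumes, for illustration only, p_tilde(x) = exp(-x^2 / 2) (a standard normal without its constant) and q_tilde(x) = exp(-x^2 / 8), with samples drawn from the corresponding normal with standard deviation 2.

```python
import math
import random

def self_normalised_estimate(f, n=50000, seed=0):
    """Estimate E_p[f(X)] using only unnormalized densities p_tilde and q_tilde."""
    rng = random.Random(seed)
    num = 0.0
    den = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)               # sample from q = N(0, 2^2)
        p_tilde = math.exp(-0.5 * x * x)      # p known only up to Z_p
        q_tilde = math.exp(-x * x / 8.0)      # q known only up to Z_q
        w = p_tilde / q_tilde                 # unnormalized weight
        num += w * f(x)
        den += w                              # the sum of weights estimates n * Z_p / Z_q
    return num / den
```

`self_normalised_estimate(lambda x: x * x)` again comes out close to 1. The price of not knowing the constants is a small bias for finite n, which vanishes as the number of samples grows.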

Variational Lower Bounds (ELBO)

2020-04-20T00:00:00-07:00

Original Video link:

Variational Lower Bounds

Credits: All images used in this post are courtesy of Hugo Larochelle

The idea of putting a lower bound on log of a function is based on idea of log concavity:

As the logarithmic function is concave, it is true that log(sum(wi*ai)) >= sum(wi*log(ai)) for non-negative weights wi that sum to 1.

We exploit this idea to have lower bounds on our likelihood functions.

Here h⁽¹⁾ is the latent variable; i.e., we say that the probability of our data x is based on latent variables h (variables which we cannot directly observe but which we can use to make inferences about our data).

As we can see, we have simply applied the log concavity idea to get the >= shown above.

Also, note that here the distribution q(h|x) is any arbitrary distribution. The choice of this distribution is up to us.

We can see an interesting property. If the chosen distribution q(h|x) is same as p(h|x) then the right hand side equation reduces to log(p(x)); hence we are just calculating the likelihood directly.
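Both the bound and this tightness property can be checked numerically on a tiny model. The sketch assumes a single binary latent variable h with made-up numbers for the prior p(h) and the likelihood p(x | h) of one observed x.

```python
import math

p_h = [0.6, 0.4]            # prior p(h) for h in {0, 1} (hypothetical numbers)
p_x_given_h = [0.2, 0.7]    # likelihood p(x | h) of the single observed x (hypothetical)

def elbo(q):
    """Lower bound: sum_h q(h|x) * log( p(x, h) / q(h|x) )."""
    return sum(qh * math.log(ph * px / qh)
               for qh, ph, px in zip(q, p_h, p_x_given_h) if qh > 0)

# exact log-likelihood log p(x) = log sum_h p(x | h) p(h)
log_px = math.log(sum(ph * px for ph, px in zip(p_h, p_x_given_h)))

# true posterior p(h | x) = p(x, h) / p(x)
posterior = [ph * px / math.exp(log_px) for ph, px in zip(p_h, p_x_given_h)]
```

Any arbitrary q, such as `[0.5, 0.5]`, gives `elbo(q) <= log_px`, while `elbo(posterior)` equals `log_px` up to floating point error.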

Now, we know p(x, h) = p(x|h)*p(h); hence, log(p(x, h)) will be log(p(x|h)) + log(p(h)).

We use this idea as follows:

Then, we get the following:

Which can be refactored as follows:

Note: The negative sign in front of the KL divergence is because we need to invert log(p(x)/q(x)) to -log(q(x)/p(x)) to bring it into KL divergence form.

Lecture 4 - Model Free Techniques - MC and TD[Notes]

2020-03-10T00:00:00-07:00

Lecture Details

Title: Model Free Techniques - Monte Carlo and Temporal Difference
Description: The lecture notes are based on David Silver’s lecture video.
Video link: RL Course by David Silver - Lecture 4
Lecture Slides: Slides

Credits: All images used in this post are courtesy of David Silver

Model-free reinforcement learning:

In model free techniques, the model of the environment is not known. Hence, we have no knowledge about the MDP’s transitions/rewards. Such an environment is closer to what actual complex problems will have.

Hence, we are trying to solve an unknown MDP.

Monte-Carlo Reinforcement Learning:

In Monte-Carlo, the agent learns through episodes. Episodes are one complete sample of the environment. That is, going through some states which eventually leads to a terminating state. Hence, Monte-Carlo methods learn from complete (terminating) episodes only.

To get the value of different states, we simply calculate the mean of the returns observed from each state over the episodes.

Policy Evaluation using Monte Carlo:

One important point in MC policy evaluation is that we don’t calculate the expected return. Instead, we calculate the actual (empirical) mean return.

Types of Monte-Carlo:

First-visit Monte-Carlo:

For every episode, we update the state s only the first time it is visited. We do so by incrementing the counter and adding to the total return. Also note that G_t is the empirical (actual) return in this case. This would be obtained by visiting some sample of states in that episode.

It means that if the same state s is encountered multiple times in the same episode then we won’t be updating it. Example: If we go left, then go right again we would end up in the same state but it won’t be updated the second time.

Again, note that this will be done for many episodes and the values of counter, total return etc will persist across the different episodes.

Every-visit Monte Carlo:

In every-visit Monte Carlo, the state s is updated every time it is visited in the same episode.
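The difference between the two variants is a single check, as the sketch below tries to show. The episode format (a list of (state, reward) pairs, where the reward is received after leaving that state) and the function name are assumptions made for this example.

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=1.0, first_visit=True):
    """Monte-Carlo policy evaluation; returns {state: mean return}."""
    n = defaultdict(int)          # visit counter N(s), persists across episodes
    total = defaultdict(float)    # total return S(s), persists across episodes
    for episode in episodes:
        # compute the return G_t for every time step by walking backwards
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()
        seen = set()
        for state, g in returns:
            if first_visit and state in seen:
                continue          # first-visit: ignore repeat visits in the same episode
            seen.add(state)
            n[state] += 1
            total[state] += g
    return {s: total[s] / n[s] for s in n}
```

For the single episode `[("A", 1), ("B", 0), ("A", 2)]`, first-visit MC gives V(A) = 3, while every-visit MC averages the returns from both visits to A (3 and 2) and gives V(A) = 2.5.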

Monte Carlo Black Jack example:

Consider, the simplified version of Black Jack. Here, we are defining two actions – stick and twist. The reward function is also defined by us. Notice that there is no probability transition matrix as we do not know the working of the environment.

For simplicity, we are only taking an action if our current sum is between 12 and 21; if the sum of cards is < 12 we automatically twist (because there is no point in sticking and showing our hand if the sum is small). We are also considering whether we have a usable ace or not, and which card the dealer is currently showing. (Note: a usable ace can take a value of 11.)

The value function after Monte Carlo learning can be seen in the diagram shown above. We can see that eventually, after 500,000 episodes, the graph peaks near player sum = 21 (where we usually get a reward of +1) and is fairly flat in the other regions.

Incremental Mean:

We can rewrite the mean as shown below:

That is, the mean of k points can be written as the mean of the first k-1 points plus 1/k times the difference between the present point x_k and the previous mean.

More intuitively, we are expecting the value x_k to be near u_k-1, so x_k – u_k-1 can be thought of as the error in our estimate. If they are exactly the same, then u_k = u_k-1; otherwise the mean will change based on the error.

The incremental mean can be used to rewrite the value function update in MC. Then we can replace 1/N(S_t) with alpha. This alpha helps us change the equation to an exponential decay form where we can control how much of the old episodes we want to remember.
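Both forms of the update fit in a few lines. The function names below are made up, and the starting estimate of 0 in the constant-alpha version is an arbitrary choice.

```python
def incremental_mean(xs):
    """Running mean: mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k
    return mu

def running_estimate(xs, alpha):
    """Same update with a constant step size alpha instead of 1/k."""
    v = 0.0
    for x in xs:
        v += alpha * (x - v)      # old values are forgotten at rate (1 - alpha)
    return v
```

`incremental_mean` reproduces the ordinary mean without storing the data. With a constant alpha, each older sample's weight is multiplied by (1 - alpha) at every step, which is the exponential forgetting mentioned above; for instance `running_estimate([0, 0, 10], 0.5)` is 5.0, whereas the plain mean is about 3.33.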

Temporal Difference Learning:

The main difference between Monte Carlo methods and TD methods is that in TD the update is done while the episode is ongoing. That is, we can learn from incomplete episodes. This is done by estimating the remaining rewards instead of actually getting them. This idea is called bootstrapping.

Example: Consider that the exit door of a classroom is the end state. In Monte-Carlo, all episodes must end with this state and the states would be updated only when the episode has ended. In TD learning, we may travel halfway through the classroom and estimate the reward for the remaining distance.

As we can see, in temporal difference learning, R_t+1 is the actual reward which we get at time step t, and the gamma*V(S_t+1) is the estimated future reward.

Driving Home Example:

Consider the driving from office to home example. Here, we are travelling from office to home and there are 6 states. As we can see, the predicted time estimate changes with every state; that is we are updating the estimate while running through the episode.

In Monte Carlo, we can only update after the end of the episode. So, we wait till the episode has ended (the arrive home state is reached) before updating the state value estimates. In the case of temporal difference learning we can update after every state. That is, after leaving the office state we can immediately update our value estimates based on the rewards we got till now.

Advantages and Disadvantages of MC and TD:

Bias/Variance trade-off:

As the returns in Monte Carlo are actual returns, they are unbiased estimates of the value. That is, they are true samples obtained from the environment. However, this also means high variance: each return depends on every random action, transition and reward in the rest of the episode, so it includes all the noise and outliers encountered along the way.

On the other hand, the TD target relies on the current estimate V(S_t+1), which may be wrong, and hence it is biased. As it depends on only one random step before bootstrapping, it has low variance.

Therefore:

TD – high bias, low variance

MC – zero bias, high variance

Advantages/Disadvantages continued:

Here, function approximation is the idea of approximating the value functions instead of calculating them as it is very time consuming to actually calculate value functions for complex problems.

Random-walk example:

To compare MC and TD, consider a scenario where we repeatedly sample from the same finite set of episodes and run both MC and TD(0) on them (here the 0 is the value of lambda: the target looks one step ahead and the update is done at every time step).

Consider the above set of sample episodes. State B is visited 8 times, with a reward of 1 six times and 0 twice, so V(B) for both MC and TD will be 6/8.

TD would form an implicit MDP as follows:

For TD, we will use the update rule: V(A) = R_t+1 + gamma * V(S_t+1) = 0 + 1*(6/8) = 0.75
For MC, the only return observed from A was 0, hence V(A) = 0. To put it more concretely, see the image below:

Here, we can see that MC minimizes the mean squared error between the observed returns G_t and V(S_t). The only return observed from A is 0, so G_t - V(S_t) is minimized when V(A) = 0.

On the other hand, TD(0) converges to the solution of the maximum-likelihood Markov model that best fits the data. Notice the 1 in the double summations. It is an indicator function, which is 1 when the transition from s to s’ was actually observed and 0 otherwise.

So here, V(B) will be (0 + 1 + 1 + 1 + 1 + 1 + 1 + 0)/8 = 6/8, as we are basically averaging over the observed transitions out of B, and V(A) = 0 + gamma*V(B) = 6/8. (Remember, it’s based on the equation R_t+1 + gamma*V(S_t+1).)
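The two answers can be checked numerically. The sketch below assumes the eight episodes are: A with reward 0 followed by B with reward 0, then B with reward 1 six times, then B with reward 0 once. Instead of running TD(0) to convergence, it solves the maximum-likelihood Markov model directly, which is the value TD(0) converges to on this data. All names are illustrative.

```python
from collections import defaultdict

# Each episode is a list of (state, reward received on leaving that state).
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]


def mc_values(episodes, gamma=1.0):
    """Every-visit Monte Carlo: average the observed returns per state."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


def certainty_equivalence_values(episodes, gamma=1.0, sweeps=50):
    """Fit transition counts and mean rewards, then evaluate that model."""
    next_counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:
        for i, (state, reward) in enumerate(episode):
            nxt = episode[i + 1][0] if i + 1 < len(episode) else None  # None = terminal
            next_counts[state][nxt] += 1
            reward_sum[state] += reward
            visits[state] += 1
    values = {state: 0.0 for state in visits}
    for _ in range(sweeps):  # repeated Bellman expectation backups on the fitted model
        for state in values:
            expected_next = sum(
                count * values[nxt]
                for nxt, count in next_counts[state].items()
                if nxt is not None
            ) / visits[state]
            values[state] = reward_sum[state] / visits[state] + gamma * expected_next
    return values


print(mc_values(episodes))                     # V(A) = 0.0, V(B) = 0.75
print(certainty_equivalence_values(episodes))  # V(A) = 0.75, V(B) = 0.75
```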

Backups:

Monte-Carlo Backup:

In Monte-Carlo we are basically traversing one random path of states which eventually leads to a terminating state. Hence, it will traverse through the depth and end with a terminating state.

TD Backup:

In TD, we only look one step ahead and then estimate the rest. That is, R_t+1 + gamma*V(S_t+1).

Dynamic programming backup:

In DP, we consider all possible successor states one level ahead, i.e. the entire breadth of the next level.

As opposed to this, in MC and TD we are only considering a limited space.

Summary:

Remember that bootstrapping is estimating the future rewards.

Unified view of RL:

As we can see, TD Learning has shallow backups while Monte Carlo has deep backups (it goes all the way to the terminating state).

Using TD(lambda) we can adjust this level of backups as per our need.

Instead of looking ahead by 1-step we can look ahead by any arbitrary n steps.

N-step return:

Now, in n-step TD we would be updating the value function after n-steps. (that is, we would be backing up after n steps). Hence, the reward return G_t can be rewritten as shown above.
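A minimal sketch of the n-step return for one recorded episode. It assumes `rewards[k]` holds R_k+1 (the reward after leaving state S_k) and `values[k]` holds the current estimate V(S_k), with a final entry of 0 for the terminal state. These conventions and the function name are assumptions for illustration.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """Sum n real rewards from time t, then bootstrap from V(S_{t+n}).

    If the episode ends before n steps, the sum stops at the terminal
    state, whose value is 0, so the result is the full Monte Carlo return.
    """
    end = min(t + n, len(rewards))
    g = 0.0
    for k in range(t, end):
        g += gamma ** (k - t) * rewards[k]
    return g + gamma ** (end - t) * values[end]


rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 1.5, 0.0]  # last entry is the terminal state
print(n_step_return(rewards, values, t=0, n=1))  # TD(0) target: 1 + V(S_1)
print(n_step_return(rewards, values, t=0, n=3))  # full Monte Carlo return
```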

Averaging n-step returns:

Instead of selecting a fixed value of n (no single value of n is best for every problem), it’s better to average these returns. For example, we can calculate the 2-step return and the 4-step return and then average them to form the target for the value function update.

Lambda return:

In the same way, we can combine the information from all n-step returns, i.e. 1-step, 2-step, 3-step etc., using the lambda return. The n-step return is given the weight (1 - lambda) * lambda^(n-1), so the weights form a geometric series. A geometric weighting needs no extra memory and lets us compute the update at nearly the same cost as TD(0).

The weights decay geometrically; hence, returns that look more steps ahead are given smaller weight.
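A minimal sketch of the lambda return for an episodic task, assuming the same conventions as a recorded episode with `rewards[k]` = R_k+1 and `values[k]` = V(S_k), terminal value 0. The weight left over after the last non-terminal n-step return is assigned to the full Monte Carlo return so that the weights sum to 1. Function names are illustrative.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n real rewards from time t, then bootstrap from V(S_{t+n})."""
    end = min(t + n, len(rewards))
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + gamma ** (end - t) * values[end]


def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Geometrically weighted average of all n-step returns from time t."""
    steps_left = len(rewards) - t
    g = 0.0
    for n in range(1, steps_left):
        weight = (1 - lam) * lam ** (n - 1)
        g += weight * n_step_return(rewards, values, t, n, gamma)
    # Remaining weight goes to the return that reaches the end of the episode.
    g += lam ** (steps_left - 1) * n_step_return(rewards, values, t, steps_left, gamma)
    return g


rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 1.5, 0.0]
print(lambda_return(rewards, values, t=0, lam=0.0))  # same as the 1-step TD target
print(lambda_return(rewards, values, t=0, lam=1.0))  # same as the Monte Carlo return
print(lambda_return(rewards, values, t=0, lam=0.5))  # a blend of the two
```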

Forward-view TD(lambda)

One disadvantage of forward-view TD(lambda) is that we need to wait till the end of the episode to perform any update. This is because the lambda return combines all n-step returns, up to the one that reaches the last time step, and hence the update can only be performed at the end of the episode, like MC.

This can be solved using Backward pass using eligibility traces.

Backward pass: Eligibility traces

Consider a scenario where a bell rings three times, then a light comes on, and then a shock occurs. Did the shock happen because the bell rang three times or because of the light? It depends on whether we give importance to frequency of occurrence or recency of occurrence. Eligibility traces combine both these heuristics. Every time a state is visited, the value of its trace is increased; otherwise it decays.

So, basically we will be updating every state with the information obtained till now in the form of eligibility traces.

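The original TD(0) update was V(S_t) <- V(S_t) + alpha*delta_t, where delta_t = R_t+1 + gamma*V(S_t+1) - V(S_t) is the TD error. Now we scale the error by the eligibility trace and apply it to every state: V(s) <- V(s) + alpha*delta_t*E_t(s).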

Hence, we don’t need to wait till the end of the episode to update. We can keep updating at every time step.
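A minimal sketch of backward-view TD(lambda) with accumulating eligibility traces, assuming values are stored in a dict and an episode is given as a list of (state, reward, next_state) tuples with next_state set to None at the end. The function name and data layout are assumptions for illustration.

```python
from collections import defaultdict


def td_lambda_episode(V, transitions, alpha=0.1, gamma=1.0, lam=0.9):
    """Update V online, one transition at a time, using eligibility traces."""
    E = defaultdict(float)  # eligibility trace per state, starts at 0
    for s, r, s_next in transitions:
        v_next = 0.0 if s_next is None else V[s_next]
        delta = r + gamma * v_next - V[s]  # one-step TD error
        E[s] += 1.0                        # bump the trace of the visited state
        for state in list(E):
            V[state] += alpha * delta * E[state]  # credit in proportion to the trace
            E[state] *= gamma * lam               # decay every trace for the next step
    return V


V = {"a": 0.0, "b": 0.0, "c": 0.0}
episode = [("a", 0.0, "b"), ("b", 0.0, "c"), ("c", 1.0, None)]
print(td_lambda_episode(V, episode, alpha=0.5, gamma=1.0, lam=0.5))
```

In this run the reward at the end reaches all three states in a single pass, with the states visited most recently receiving the largest share.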

When lambda = 0, every trace decays to zero immediately, so the trace is 1 for the state currently being visited and 0 for all other states. Hence only the current state is updated, and the update is the same as TD(0).

Note that we are talking about the following lambda:

TD(lambda) and MC:

Another interesting property is that TD(1) will have the same total update as MC.

(Refer to PPT for proof)

Lecture 5 - Optimizing Model Free Techniques (Model Free Control) [Notes]

2020-03-10T00:00:00-07:00

Lecture Details

Title: Optimizing Model Free Techniques
Description: The lecture notes are based on David Silver’s lecture video.
Video link: RL Course by David Silver - Lecture 5
Lecture Slides: Slides

Credits: All images used in this post are courtesy of David Silver

Why is model free control even required?

Many real-world problems are too complex and thus, their MDP is unknown or the MDP might be known but it will be too large/complex. Hence, model free techniques (like Monte Carlo, TD Learning) can be used to solve this problem.

On Policy vs Off Policy learning:

Learning can be segregated into two types. On-policy learning is learning on the job, that is, the agent learns by exploring the environment and understanding the experiences by itself. Mathematically, it uses policy pi to explore the environment and also improves policy pi itself.

On the other hand, off-policy learning is learning using multiple policies. Example: A human may demonstrate how to perform a particular task, then the agent can understand the human’s policy and then use it to create its own policy. Mathematically, the agent learns policy pi based on experience generated by some other policy mu.

Policy Iteration:

In Dynamic Programming we studied the policy iteration algorithm, which evaluates a policy and then improves it greedily.

Policy Iteration: For model free environments

For evaluating the policy, can we replace the iterative policy evaluation with monte-carlo policy evaluation?

The answer is that we can’t do that directly: acting greedily with respect to a state-value function requires the transition probabilities, and in model-free environments we don’t know them. It can be explained more concretely as follows:

Greedily improving over the state-value function is not possible because P_ss’ is required, which is not available in a model-free environment. Instead, we can evaluate the action-value function Q(s, a) and improve the policy by picking the action with the highest Q value, which needs no model.

So, now our updated algorithm looks as follows:

The second question is, can we use greedy policy improvement for model-free techniques? The answer is no, because we only see a sample of states. That is, we do not explore every possibility in each episode; in Monte Carlo terms, we explore only one branch of the tree.

So, we might end up missing the actual optimal policy if we keep improving the policy greedily.

Example:

Consider the example with two doors. The agent starts by choosing an action at random (as it is the first action) and opens the left door. The reward is 0. Now, if the agent tries the right door and gets a reward of +1, then the greedy choice is the right door (because left = 0, right = 1). Similarly, if the right door keeps giving +3, +2 and so on, the agent keeps choosing it. However, this is not necessarily the correct policy, as we have explored the left door only once and are comparing against the single 0 reward we got from it. The left door might produce a large reward like +100 in subsequent tries, but under greedy policy improvement the agent will never find out.

Hence, complete exploitation doesn’t work. The agent must perform exploration too.

Epsilon greedy exploration:

The idea in epsilon greedy exploration is to choose greedily with a high probability but to still explore with a low probability.

Example: if epsilon = 0.2, and there are 4 actions (m):

Then the greedy action is chosen with probability 0.2/4 + 1 - 0.2 = 0.85 (85%), and each of the other three actions with probability 0.2/4 = 0.05.

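Note that we divide epsilon by m because the exploration probability is spread uniformly over all m actions, including the greedy one. With many actions, the greedy action’s probability approaches 1 - epsilon and each individual non-greedy action is tried less often.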
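A minimal sketch of epsilon-greedy action selection, assuming the action values for the current state are given as a list indexed by action. The function names are illustrative, and ties are broken in favour of the lowest action index.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Return the selection probability of each action under epsilon-greedy."""
    m = len(q_values)
    greedy = max(range(m), key=lambda a: q_values[a])
    probs = [epsilon / m] * m       # every action gets the exploration share
    probs[greedy] += 1.0 - epsilon  # the greedy action gets the rest on top
    return probs


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample an action index according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q = [0.0, 1.0, 0.5, 0.2]
print(epsilon_greedy_probs(q, epsilon=0.2))   # about 0.85 for action 1, 0.05 for the others
print(epsilon_greedy_action(q, epsilon=0.2))  # usually 1, occasionally another action
```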

Proof that epsilon greedy leads to policy improvement:

Consider that pi’ is the new policy determined by the e-greedy algorithm. The action value q(s, pi’(s)) can then be split into two parts: an exploration part weighted by e/m and an exploitation part weighted by (1-e).

The idea is that the max of q(s, a) is at least as large as any weighted average of q(s, a). That is, the new policy is at least as good as the original one, because its exploitation part is based on taking the max.
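Written out, the argument can be sketched as follows. It assumes the original policy pi is itself e-greedy, so that the weights (pi(a|s) - e/m)/(1 - e) are non-negative and sum to 1; the notation q_pi, v_pi follows the usual convention and is not spelled out in the text above.

```latex
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_a \pi'(a \mid s)\, q_\pi(s, a) \\
&= \frac{\epsilon}{m} \sum_a q_\pi(s, a) + (1 - \epsilon) \max_a q_\pi(s, a) \\
&\ge \frac{\epsilon}{m} \sum_a q_\pi(s, a)
   + (1 - \epsilon) \sum_a \frac{\pi(a \mid s) - \epsilon/m}{1 - \epsilon}\, q_\pi(s, a) \\
&= \sum_a \pi(a \mid s)\, q_\pi(s, a) = v_\pi(s)
\end{aligned}
```

The inequality in the third line is exactly the statement that a max is at least as large as any weighted average.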

Hence, our improved policy iteration now looks as follows:

Improving the algorithm further:

There is no need to evaluate the action-value function fully to improve the policy. We can evaluate and improve the policy on every episode. That is, we can update the policy even without considering all scenarios of the value function.

Problems with exploration:

While we saw that purely greedy improvement won’t lead to the optimal policy, eventually we want to reduce the exploration to zero, because the optimal policy is deterministic and should contain no random exploratory actions. That is, we want to decrease the probability of exploring as we gain more and more knowledge.

Solution: GLIE

GLIE (Greedy in the Limit with Infinite Exploration) requires two things: all state-action pairs are explored infinitely many times, and the policy eventually converges to a greedy policy. Epsilon-greedy satisfies both if we set e = 1/k, where k is the episode number (that is, the probability of exploration reduces to 0).

Basically, with e = 1/1, 1/2, 1/3, ... the probability of exploration keeps decreasing.

GLIE Monte-Carlo control can be shown as follows:
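A minimal sketch of the per-episode update, assuming an every-visit variant, Q and N stored in dicts keyed by (state, action), and an episode given as a list of (state, action, reward) tuples. The function name and the example states are illustrative.

```python
from collections import defaultdict


def glie_mc_update(Q, N, episode, k, gamma=1.0):
    """Update Q from one finished episode; return epsilon = 1/k.

    k is the 1-based index of the episode that just finished. The returned
    epsilon is meant for the epsilon-greedy policy of the next episode.
    """
    g = 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g  # return from this step onwards
        N[(state, action)] += 1
        # Incremental mean of the returns seen for this state-action pair.
        Q[(state, action)] += (g - Q[(state, action)]) / N[(state, action)]
    return 1.0 / k


Q = defaultdict(float)
N = defaultdict(int)
eps = glie_mc_update(Q, N, [("s0", "right", 0.0), ("s1", "right", 1.0)], k=1)
print(eps, dict(Q))
eps = glie_mc_update(Q, N, [("s0", "right", 0.0), ("s1", "right", 3.0)], k=2)
print(eps, dict(Q))
```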

Using TD instead of Monte-Carlo:

Updating action value functions using SARSA in TD Learning:

In TD Learning, we are looking ahead by one time step instead of waiting for an entire episode to finish. Hence, we can update the action-value function and improve the policy every time step.

This update is called SARSA, after the tuple (S, A, R, S’, A’) it uses. Here (S, A) is the original state-action pair. That is, given the state S, when the agent takes action A, it gets a reward R and ends up in some state S’. The agent then takes some action A’ from that state S’.
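A minimal sketch of a single SARSA update, assuming Q is a dict keyed by (state, action) and that the next action A’ has already been chosen by the current policy (for example epsilon-greedy). The function name, the `terminal` flag and the example states are assumptions.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0, terminal=False):
    """Move Q(S, A) towards R + gamma * Q(S', A')."""
    target = r + (0.0 if terminal else gamma * Q[(s_next, a_next)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)
Q[("s1", "up")] = 2.0
print(sarsa_update(Q, "s0", "right", 1.0, "s1", "up", alpha=0.5, gamma=0.9))
```

Because A’ comes from the same policy that is being improved, this is an on-policy update.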

Algorithm:

Convergence of SARSA:

The first Robbins-Monro condition says that the step sizes alpha_t must sum to infinity. That is, the steps are large enough in total that the Q values can be moved to the desired point no matter where they start.

The second condition says that the squared step sizes must have a finite sum. That is, the steps shrink fast enough that the Q values eventually settle at their optimal values and stop changing.

Note: In practice, we use SARSA even if these conditions aren’t met

Example: Windy Grid world

Consider the gridworld game where S is the start point and G is the goal. The arrows represent wind, so if the agent enters that area it will be blown “upwards”. The number of steps it is blown upwards is indicated by the number shown along the X axis, i.e. a 2 means that the agent will be blown two steps upwards.

After applying SARSA (Using standard moves not King’s move)

From the graph we can see that initially the agent needed a lot of time steps to complete each episode. This makes sense as initially the agent is unaware of the environment. But as time went by, episodes were completed in far fewer time steps, because the agent had learned about the environment.

n-step SARSA:

Instead of bootstrapping from Q after a single step, we can look ahead n steps and then bootstrap, just as in n-step TD.

Forward view SARSA(lambda)

Just like we studied in TD learning, we can take the information from all n-step returns into account by using SARSA(lambda). The n-step returns are combined with geometrically decaying weights, which keeps the update cheap to compute.

Problem with forward view: As all n steps need to be calculated before updating the Q values, we need to wait till the end of the episode to actually update the values.

To solve this problem, we use the backward view SARSA:

Backward SARSA algorithm:

Example:

Consider the above example. Let the first image be the path taken in an episode. If we use one-step SARSA, then only the grid box right below the reward (the asterisk) gets updated. This is because the rest of the path led to no reward (with respect to one step). Then in the next episode, the grid point next to it gets updated, and so on. The point is that learning with one-step SARSA is much slower than with SARSA(lambda).

With SARSA(lambda) we can see that the entire path was updated in the direction of the reward. The grid points closer to the goal were updated more strongly (due to eligibility traces) and the earlier ones a little less strongly.

Off policy learning:

The main idea of off-policy learning is to observe some other policy/policies and use that information to update our target policy.

This could be useful in the following scenarios:

Learning from other humans. That is, the agent can observe the policy which a human uses and accordingly optimize its target policy.
The old policies which the agent might have used need not be completely discarded. The important bits from those old policies can be used to form the new policy.
Learning the optimal policy while following an exploratory policy. That is, we follow a policy that explores a lot and use that information to update our target policy optimally (used in Q-learning).

Importance Sampling:

Suppose we want the expectation of f(X) under a distribution P(X) but our samples come from another distribution Q(X). By multiplying and dividing by Q(X), the expectation under P can be rewritten as an expectation under Q: E_P[f(X)] = E_Q[(P(X)/Q(X)) f(X)]. Basically, the closer P(X) and Q(X) are, the closer the ratio P(X)/Q(X) will be to 1. The more different they are, the further the ratio moves from 1.
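A minimal sketch of the identity on a small discrete example, assuming both distributions are given as dicts mapping outcomes to probabilities. The sums are computed exactly rather than from samples so that the two sides can be compared directly. All names and numbers are illustrative.

```python
def expectation(dist, f):
    """Exact expectation of f(X) when X follows the discrete distribution dist."""
    return sum(prob * f(x) for x, prob in dist.items())


def importance_weighted_expectation(p, q, f):
    """E_P[f(X)] written as E_Q[(P(X) / Q(X)) * f(X)].

    Requires q[x] > 0 wherever p[x] > 0, otherwise part of P is never seen.
    """
    return sum(
        q_prob * (p.get(x, 0.0) / q_prob) * f(x)
        for x, q_prob in q.items()
        if q_prob > 0
    )


p = {0: 0.5, 1: 0.5}    # distribution we care about
q = {0: 0.25, 1: 0.75}  # distribution the data comes from


def f(x):
    return 10.0 * x


print(expectation(p, f))                         # direct expectation under P
print(importance_weighted_expectation(p, q, f))  # same value, computed under Q
```

In practice the outer sum is replaced by an average over samples drawn from Q, each multiplied by its ratio P(x)/Q(x).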

Importance Sampling for off-policy Monte Carlo:

The idea is to follow policy mu and use the returns generated by mu to evaluate policy pi. That is, based on the observations obtained through policy mu, we correct our estimate for the target policy pi. But the correction ratio has to be multiplied over every step of the episode, which gives this technique extremely high variance. Hence, importance sampling shouldn’t be used with Monte Carlo in practice.

Importance Sampling for Off-policy TD learning:

In case of TD Learning, the importance sampling will only be done over 1 step instead of the entire episode (all time steps) hence, the variance is much lower.
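A minimal sketch of the one-step correction, assuming we know the probability of the action actually taken under both the target policy (`pi_prob`) and the behavior policy (`mu_prob`). The function and parameter names are assumptions for illustration.

```python
def off_policy_td_update(V, s, r, s_next, pi_prob, mu_prob, alpha=0.1, gamma=1.0):
    """TD(0) update where the target is weighted by a single ratio pi/mu."""
    rho = pi_prob / mu_prob               # importance sampling ratio for this one step
    target = rho * (r + gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V[s]


V = {"s0": 0.0, "s1": 1.0}
# The action taken was twice as likely under the target policy as under the behavior policy.
print(off_policy_td_update(V, "s0", 1.0, "s1", pi_prob=0.9, mu_prob=0.45, alpha=0.5))
```

Only one ratio appears here, whereas the Monte Carlo version multiplies one such ratio per time step of the episode.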

Q-learning:

In Q-Learning, we have two policies: The behavior policy and the target policy.

From a state s, we choose some action A using the behavior policy. After choosing the action, we end up in state S’ (S_t+1). Now, from this state we choose a successor action A’ from our target policy.

Then we update Q(S, A) towards R + gamma*Q(S’, A’), the value of the alternative action A’.

Basically, Q Learning uses an exploratory policy to find the optimal policy.

That is, the behavior policy is epsilon greedy (exploratory) while the target policy is completely greedy.

Therefore, the behavior policy will help us explore scenarios while the target policy is updated strictly based on the maximum values.
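A minimal sketch of a single Q-learning update, assuming Q is a dict keyed by (state, action) and `actions` lists the actions available in the next state. The action A that produced this transition would be chosen by the exploratory behavior policy, while the target uses the max, i.e. the greedy target policy. Names and example values are illustrative.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0, terminal=False):
    """Move Q(S, A) towards R + gamma * max over a' of Q(S', a')."""
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 3.0
print(q_learning_update(Q, "s0", "left", 1.0, "s1", ["left", "right"], alpha=0.5, gamma=0.9))
```

The target ignores whichever action the agent actually takes next from S’, which is what makes the update off-policy.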

Convergence:

Algorithm: