Tic-Tac-Toe

[Interactive Tic-Tac-Toe demo: Player One's Move, Show Move Values, Highlight Best Move. Find the code on GitHub.]

Q-Learning: How it Works

Q-Learning, in the simplest terms, builds a lookup table that holds the value of any given (state, action) pair. In Tic-Tac-Toe, the state is the board and the action is an open square you can play. In the above demo, when you enable seeing the values, it is simply looking up the value of each available move given the game's current state. I normalized the values to be between 0 and 1 to make them easier to read.
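Concretely, the table can be as simple as a dictionary keyed by (state, action) pairs. Here's a rough sketch (not the demo's actual code) of the lookup and the normalization, where a state is any hashable encoding of the board:

```python
# A rough sketch of the lookup, not the demo's actual code.
# A state here is any hashable encoding of the board (e.g. a string or tuple).

q_table = {}  # maps (state, action) -> value, filled in during training

def move_values(q_table, state, legal_moves):
    """Raw Q-value for every legal move in this state (0 if never seen)."""
    return {a: q_table.get((state, a), 0.0) for a in legal_moves}

def normalized_move_values(q_table, state, legal_moves):
    """Rescale the raw values into [0, 1] so they're easier to read."""
    values = move_values(q_table, state, legal_moves)
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {a: 0.5 for a in values}  # every move currently looks the same
    return {a: (v - lo) / (hi - lo) for a, v in values.items()}
```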

The natural follow-up question, once you know that Q-Learning builds a lookup table of values, is: how does it fill that table in? The answer lies with the Bellman equation (link to Wiki). Let's review what this equation is. I'll do my best to break it down, since the first time I learned about it I had to spend a decent bit of time reading multiple articles.

Bellman Equation:

Q(s,a) = (1-alpha) * Q(s,a) + alpha * (R(s,a) + lambda * max(Q(s', a')))

Let's break it down into each individual component.

Q(s,a) -> This is the lookup table entry, i.e. the value of the (state, action) pair.

(1-alpha) * Q(s,a) -> This looks confusing, but it just controls how much of our previous estimate we keep when updating the Q-table. Alpha is the learning rate: the larger it is, the more weight new information gets over what we already knew.

R(s,a) -> This is the reward for making action "a" at some state "s". In Tic-Tac-Toe, all rewards are 0 unless the move leads to a win or loss state, in which case the reward is 1 or -1 respectively.

max(Q(s', a')) -> We want to know the maximum future reward after performing action "a" at state "s". Here "s'" and "a'" refer to the next state and the set of moves available from it.

lambda * max(Q(s', a')) -> We add a discounted future reward. The discount factor lambda is what propagates future rewards (such as winning) back to the earlier moves that led to them.

When this update is performed many times, the table eventually reaches a steady state. This means the Q-table is no longer changing and the value of each (state, action) pair has converged to its true value.
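In code, a single application of this update is only a few lines. Here's a generic sketch (the helper names are mine, and `lam` stands in for lambda, which is a reserved word in Python):

```python
# One Q-learning (Bellman) update, written generically.
# q_table maps (state, action) -> value and defaults to 0 for unseen pairs.

def q_update(q_table, state, action, reward, next_state, next_actions,
             alpha=0.05, lam=0.9):
    """Q(s,a) = (1 - alpha) * Q(s,a) + alpha * (R + lam * max Q(s', a'))."""
    old = q_table.get((state, action), 0.0)
    # If the next state is terminal (no actions left), the future value is 0.
    future = max((q_table.get((next_state, a), 0.0) for a in next_actions),
                 default=0.0)
    q_table[(state, action)] = (1 - alpha) * old + alpha * (reward + lam * future)
```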

A Simple Game and Example

Let's examine a simple game where the only goal is to move to the right. The game starts in state 1000, and each move to the right shifts the 1 one position to the right. So one move right from the starting state gives 0100. We can easily list all the states of this game: 1000, 0100, 0010, and 0001. The reward function is simple: if the state is 0001, the reward is 1; otherwise it is 0.
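Here's one way to sketch this toy game in Python (the string encoding of states and the helper names are just my choices for illustration):

```python
# A sketch of the move-right toy game.
STATES = ["1000", "0100", "0010", "0001"]

def step_right(state):
    """Shift the 1 one spot to the right; 0001 is the terminal state."""
    i = state.index("1")
    return STATES[i + 1] if state != "0001" else state

def reward(next_state):
    """Reward is 1 for landing on 0001, otherwise 0."""
    return 1 if next_state == "0001" else 0
```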

The Q-table initially looks like the below.

Move / State | 1000 | 0100 | 0010 | 0001
Right        | 0    | 0    | 0    | 0

Let's assume we want to run an update for the state 0010 and the action "Right". First, let's find the values we need, using alpha = 0.05 and lambda = 0.9.

Q(state, action) = Q(0010, Right) = 0

R(state, action) = Reward(0010, Right) = reward for landing on state 0001 = 1

Max(Q(s', a')) = the max value over all moves at the state reached after taking action Right from 0010, i.e. state 0001. Since the table is still all zeros, Max(Q(0001, a')) = 0.

Now that we have all the values, we can put them together.
New Q(s,a) = (1-alpha) * Q(0010, Right) + alpha * [R(0010, Right) + lambda * Max(Q(0001, a))]
New Q(s,a) = (1 - .05) * 0 + .05 * [1 + .9 * 0]
New Q(s,a) = 0.05
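
As a quick sanity check, here is the same arithmetic in Python (just the numbers plugged into the formula):

```python
alpha, lam = 0.05, 0.9

old_q = 0.0       # Q(0010, Right) before the update
r = 1             # Reward(0010, Right): the move lands on 0001
max_future = 0.0  # Max(Q(0001, a')) in the still-empty table

new_q = (1 - alpha) * old_q + alpha * (r + lam * max_future)
print(new_q)  # 0.05
```

The Q-table now looks like this: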

Move / State | 1000 | 0100 | 0010 | 0001
Right        | 0    | 0    | 0.05 | 0

You can now do the same thing for the other states. For example, updating Q(0100, Right) uses R(0100, Right) = 0 and Max(Q(0010, a')) = 0.05:

New Q(s,a) = (1 - .05) * 0 + .05 * [0 + .9 * .05]
New Q(s,a) = 0.00225

Here is the updated table after finding the value for 0100.

Move / State | 1000 | 0100    | 0010 | 0001
Right        | 0    | 0.00225 | 0.05 | 0

As you keep applying these updates, the values eventually converge. The converged values should be the optimal playing values: if you use them to pick moves, your AI should perform pretty well.
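Here's a sketch of that process for the toy game: sweep the update over every state until the table stops changing. The loop structure and stopping threshold are just illustrative choices.

```python
# Sweep the Bellman update over the toy game until the Q-table stops changing.
STATES = ["1000", "0100", "0010", "0001"]  # 0001 is terminal
alpha, lam = 0.05, 0.9

q = {(s, "Right"): 0.0 for s in STATES[:-1]}  # no moves from the terminal state

for sweep in range(100_000):
    biggest_change = 0.0
    for i, s in enumerate(STATES[:-1]):
        s_next = STATES[i + 1]
        r = 1 if s_next == "0001" else 0
        future = q.get((s_next, "Right"), 0.0)  # 0 for the terminal state
        new = (1 - alpha) * q[(s, "Right")] + alpha * (r + lam * future)
        biggest_change = max(biggest_change, abs(new - q[(s, "Right")]))
        q[(s, "Right")] = new
    if biggest_change < 1e-9:
        break

# Converges toward Q(0010) = 1, Q(0100) = 0.9, Q(1000) = 0.81
print(q)
```

Notice how states closer to 0001 end up with higher values; that's the discount factor lambda at work.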

My Implementation

You can find my implementation on github here:

I implemented the Bellman equation as written above. However, since this is a two-player game, I had to do some extra thinking. Normally the reward can be observed right after making a move, but in a two-player game the opponent must reply before you see the result, unless the game has already ended. To handle this, I trained two "Q-Learning agents" at the same time. When training player 1, I had player 2 make a move right after player 1 played so the reward and next state could be read off; to train player 2, I had player 1 play right after player 2 made a move. I trained for 500k games, which took about 2-3 minutes on my Mac. I highly suggest looking at my code if you want the nitty-gritty details of how it works.
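To give a flavor of that setup, here's a rough sketch of the two-agent training loop described above. This is not my actual code; the `TicTacToe` class and its methods (state, legal_moves, play, game_over, winner) are hypothetical stand-ins for the real environment in the repo.

```python
import random

# Sketch of two-agent self-play training.  TicTacToe and its methods are
# hypothetical stand-ins, not the real environment from the repo.

def choose_move(q, state, moves, epsilon=0.1):
    """Epsilon-greedy: usually pick the highest-valued move, sometimes explore."""
    if random.random() < epsilon:
        return random.choice(moves)
    return max(moves, key=lambda m: q.get((state, m), 0.0))

def q_update(q, s, a, r, s2, moves2, alpha=0.05, lam=0.9):
    """The same Bellman update as before, applied to one player's table."""
    future = max((q.get((s2, m), 0.0) for m in moves2), default=0.0)
    q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * (r + lam * future)

def play_training_game(q_learner, q_opponent, learner_player):
    """One game in which only q_learner is updated. Player 1 always moves first."""
    env = TicTacToe()  # hypothetical environment
    if learner_player == 2:
        env.play(choose_move(q_opponent, env.state(), env.legal_moves()))
    while not env.game_over():
        s = env.state()
        a = choose_move(q_learner, s, env.legal_moves())
        env.play(a)
        # Let the opponent reply before reading off the reward and next state,
        # unless the learner's move already ended the game.
        if not env.game_over():
            env.play(choose_move(q_opponent, env.state(), env.legal_moves()))
        if env.winner() == learner_player:
            r = 1
        elif env.winner() is None:   # game still going, or a draw
            r = 0
        else:
            r = -1
        q_update(q_learner, s, a, r, env.state(), env.legal_moves())

def train(games=500_000):
    q1, q2 = {}, {}
    for i in range(games):
        # Alternate which player is being trained, as described above.
        if i % 2 == 0:
            play_training_game(q1, q2, learner_player=1)
        else:
            play_training_game(q2, q1, learner_player=2)
    return q1, q2
```

Again, this is just the shape of the idea; the repo has the real environment and training code.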