After having to play my own game (Moody Man) for a while, I start to hit the score limit of 40-ish and couldn’t break my own record. On the other hand, I heard that one of my colleague’s daughter has a high score of 50, amazing! In order to achieve a higher score than her, I start to think of building an AI to learn to play the game. This post will be talking about the process of building the AI and see if it can achieve that 😀
Source code of the AI can be found here: https://github.com/bklim5/Deep-Moody-Man
Video of the AI playing Moody Man:
Teaching an AI how to play a game can be seen as a classification problem. For example, at certain stage of a game, we want to classify which action to take that will maximize the potential gain the player can get (eg: in Moody Man it means getting a score / not clashing with any object, or in DotA it could be getting gold from last hitting creeps). So one approach towards that goal is to prepare tons of training examples for our model. In our case, the features will be the pixels of the game screen and the label will be the action to take. However, this approach is not so feasible especially for complex game like DotA when there could be millions or billions of possibility in every second of gameplay. (maybe Moody Man can, lol, but it’s not the purpose of this post)
This is what reinforcement learning tries to solve.
Reinforcement learning is the process of learning by interacting with an environment through feedback.
One example of reinforcement learning is Q-learning, which I am going to talk about today.
Q-learning minimizes the behavior of a system over time through trial and error.
The idea is very intuitive. First of all, we have a representation of the environment states, we can call it S. At those states, there are a bunch of possible actions that the agent or the bot can carry out – A. Each of these actions can be taken to result in a different state which we can then evaluate the state to determine if the action taken is good or bad. Of course, initially we wouldn’t know whether the action taken is good or bad, so we allow the agent to randomly perform actions and we evaluate the state that the action leads to. For example, if you did an action and your game agent die, that’s a bad move and we can put a negative value on the state-action mapping. Vice-versa, we can assign a positive reward value if the action is considered a good move. We can call the value we assign to the state-action mapping the Q-value (which stands for Quality I think). One thing to note is that we’re updating the value for previous state-action mapping, i.e. we only update Q after we evaluate the results from the previous state -> action. Eventually, we will have a set of policy, i.e. which action to take in each state which is basically the highest value of Q value.
That leads to 2 questions:
- Once a positive Q-value is assigned to a state-action mapping, we tend to choose that action everytime we reach to that state. A very good example is myself: all my colleagues know that I will always go to Amoy Street 滨海菜饭 for lunch. It has so high Q-value that every day during lunchtime (the environment’s state), I will go to the stall to dabao (the action) since it brings me the highest satisfaction with lowest cost (the Q-value). However how do I ensure that there are no better actions that can bring me higher Q-value? That’s the exploration vs exploitation dilemma problem
- Sometimes the action taken at this state doesn’t reflect the reward immediately, but is a necessary step for the huge reward in the future state. Use back the lunchtime example above, I have to first take an action to go to Amoy before I can buy the 菜饭. The action of going to Amoy doesn’t have the huge reward like eating the food, but it’s a prerequisite step to reach to that reward. So we have to take into consideration of the future reward in the current state-action value. On the other hand, since the environment is stochastic (eg: 菜饭 stall close down when I arrive?!), we cannot be 100% sure that we will get the same reward if we perform the same actions. The more into the future the more uncertainty there is. Therefore we will use the future reward, but in a discounted manner. That’s the discounted future reward formula
Exploration vs. Exploitation
As mentioned previously, once we have a Q-value assigned to certain state-action, we tend to exploit the same action since we know that the action will give us a better reward compare to other states (at the time being). This can be overcome by using a technique called ε-greedy exploration. Meaning at each state we generate a random number, and if the random number is smaller (or greater) than ε, we make a random action (to explore) instead of going the usual highest Q-value approach (to exploit).
We can start with a higher epsilon value and slowly decrease/increase the number depending on how you do the check, but exploration should decrease over time as when time past, the agent should know the best / converged policy and should just use that policy to perform an action.
Well…. maybe my epsilon has already become 0 that’s why I am not exploring options for my lunch anymore 😀
Discounted Future Reward
Not going to spend too much time here, as some other post on the internet has great explanation of the formula. eg: here.
In summary, the discounted future reward formula is as follow:
Where r is the immediate reward, gamma is the discounted factor (between 0 – 1, 0 meaning consider only immediate reward) and Rt+1 is the reward value in the next time step.
By now I am slightly tired and brain dead lol, I will probably write another post for the following:
- Feature engineering and preprocessing (if you wonder why there is no score and background displayed, this is why!)
- Q-function + convolution neural network model
- Perhaps also slightly on the pygame wrapper for Moody Man
Here’s the pseudocode for using neural network to approximate the weight of the Q-function.
Surprisingly, just after about 30000 timesteps in the game, the agent starts to grasp the idea on how to avoid the obstacle, and around 40000 it knows how to beat the game (similar to the strategy I adopt! which is to move under the gap and start tapping madly so that the Moody Man stay around the x-coordinates XD). Using CPU only it took about few hours for the AI to master the game while it took me few days. And definitely, the AI will become better since human will make more mistake in a longer run.
Since we built the AI based on game screen pixel as input, it make sense that this bot is generic enough to learn other games as long as we can represent the game environment some way (using pixels) and quantify the action taken with rewards. In fact, this AI is built largely based on code from this, which is a Flappy Bird AI using Deep Q-learning as well. By just providing different pixel value as input feature and different rewarding mechanism, the AI learns how to play Moody Man.
Reinforcement learning is highly related to our daily life. Since the day we are born to this world, we start to learn how to interact with the surrounding, avoid taking actions that will penalize us and continue to behave in a certain way if the outcome is positive. It’s interesting that such concepts can be built into machine and allow them to “learn” like how the human did. Really looking forward how can this be applied even further!
Also a video from Siraj, highly recommended for all his other machine learning videos as well!