This post is about implementing a – quite basic – Neural Network that is able to play the game Tic-Tac-Toe. For sure there is not really a need for any Neural Network or Machine Learning model to implement a good – well, basically perfect – computer player for this game. This could be easily achieved by using a brute-force approach. But as this is the author’s first excursion into the world of Machine Learning, opting for something simple seems to be a good idea.
The motivation to start working on this post and the related project can be comprised in one word: AlphaGo. The game of Go is definitely the queen of competitive games. Before the age of AlphaGo it was assumed that it would take a really long time until any computer program could beat the best human players, if ever. But unlike the predominant chess programs, AlphaGo is based on some super-advanced – the word really fits here – Neural Network implementation. With this it simply swept away any human top player in the world. Depending on the viewpoint this is amazing, sad, scary or a bit of all.
If this is about the game of Go then why is there a video embedded about playing chess? The engine behind AlphaGo has been developed further. Its latest incarnation is called AlphaZero and it is so generic that it can teach itself different games based only on the rules. There is no human input required anymore, but learning is completely performed using self-play. This is really fascinating, isn’t it? AlphaZero had already easily defeated all its predecessors in the game of Go when it was trained to conquer the world of chess. After only four hours (!) of self-training it crushed the best chess engine around, which in turn would beat any human chess player.
So far for the motivation to start this project, which obviously cannot – and is not intended to – even scratch on the surface of what has been achieved with AlphaZero. Though the project name is clearly inspired by it ;-).
Then what should be achieved? Learning about and implementing a Neural Network with some kind of self-learning approach to start with. As the rules of Tic-Tac-Toe are very simple – just play on an empty field – not too much time must be spent implementing the game mechanic as such. This allows focusing on the Neural Network and the learning approach.
Ideally the program should play the game perfectly in the end. This would mean it will never loose to any human player and win if that player does not play the best moves. Tic-Tac-Toe cannot be won by any player if both players are playing decent moves.
The basic – and a little bit less ambitious – objective is that it wins a fair amount of games when playing against a random computer player after some amount of self-learning.
Playing a random computer player
Playing a random computer player is the first assessment of what has been achieved. Then we are going to take a closer look at the implementation, the ideas that did not work and the ideas that worked out in the end.
The complete implementation of Gamma-Tic-Tac-Toe can be found here: https://github.com/ThomasJaspers/gamma-tic-tac-toe. That page also includes instructions on how to compile and run it.
Self-play against the random computer player is implemented in a way that allows independent matches with any amount of games. The Neural Network is re-initialized and trained again between two matches. By default each match consists of 10.000 games and 50 matches are performed. All these values are configurable. The amount of training games are of course also configurable as this is an interesting parameter to test the ability of the Neural Network to learn.
The match between two random computer players is used to crosscheck the implementation. It is expected that the results are almost totally even as can be also seen in the following chart.
It is easy to make mistakes when validating the results using self-play. In the beginning the Neural Network was always playing the first move. Of course in a game like Tic-Tac-Toe this has let to wrong results. Having two random computer players playing the game this could be detected as it was quite suspicious that one random player was winning far more often than the other one.
The next match is the random computer player vs. an untrained gamma-engine (the fancy name used instead of writing “the Neural Net playing Tic-Tac-Toe”). This is interesting as the matches are going back and forth, but without a clear overall winner or loser. The individual matches are often won quite clearly in comparison to the games played between two random computer players.
Now we are having a gamma-engine that is trained in 50 games against the random computer player before each match. It can be seen that the amount of matches won is clearly increasing in comparison to the untrained version. But there are still quite some matches lost, sometimes even pretty clearly.
With 250 training games things are improving a lot. All but one match is won and often quite clearly.
Interestingly the results are pretty much the same as with 250 training runs. This time even two matches are lost. Still it is obvious that the training has a positive effect on playing.
So let’s perform 1500 training games before each match. The result is again not changing dramatically, but there is still some improvement.
Finally let’s make a huge jump to 15000 training runs before each match. With this amount of training the Neural Network is winning very consistently and on a high level. This result has been double-checked by executing it several times. The same is true for the other results as well.
The journey to gamma-engine stage-1
The results presented in the previous chapter are based on stage-1 of the gamma-engine. The different stages are intended to differ regarding the amount of learning that is applied. This is not related to the number of training runs, but the factors used to learn something about the game. In stage-1 the “learning algorithm” is based on the following factors: If a game is won the first and the last move of that game are “rewarded”.
This “rewarding the right decisions” is a kind of backpropagation that is often used to train Neural Networks. Even though what has been done here seems to be a bit simpler than what is described in that article.
Therefore the output weights of the corresponding neurons triggering those moves are increased. This does not seem to be a lot of learning at all, but it is enough for the results shown above. Of course this is only possible due to the fact that Tic-Tac-Toe is such a trivial game to play.
There are a lot of articles dealing with Neural Networks and Machine Learning. The corresponding Wikipedia page for example is quite extensive. Therefore this article is focusing on the practical approach towards the specific problem at hand and not so much on the theoretical side of Neural Networks. Still we need some theoretical background to start with.
A Neural Network is composed of different layers. It has an input layer, any amount of hidden layers and an output layer. Theoretically each layer can have any amount of neurons. But the amount of input and output nodes are precluded by the data and the task at hand. Thus practically only hidden layers can have any number of nodes. In this implementation dense layers are used where each neuron of one layer is connected to each neuron of the next layer. There are lots of other layer types where this is not the case. The input to a neuron is an input value representing (a part of) the task to be solved and a weight assigned to that connection. By modifying those weights the Neural Network can learn. The following diagram shows an example of such a Neural Network.
The input layer and the output layer are defined by the values to be processed and the result to be produced. It is pretty clear that there will be one output neuron as we need to generate one move in the end. That move will be the output of that single output neuron.
For the input neurons things are not that straightforward. It is clear that the game state must be passed to the input neurons. On first sight the different possible board representations after making each valid move have been considered as the input. But it is hard to do something meaningful with this in the hidden layer. Furthermore the input neurons would have a different semantic every time. This makes the learning difficult. The next idea has been to map one field from the board to one input neuron. That worked to some extend. The final solution has three input neurons for each field on the board. Those are representing the possible game states: empty, occupied by computer player and occupied by opponent. With this approach it is important that the same field – with its corresponding state – is assigned to the same input neuron every time. This is depicted in the following diagram.
In addition some input value is required. This is defined based on the different fields and whether or not that field is empty, occupied by the computer player or occupied by the opponent.
Neurons in the hidden layer are calculating a “positional score” for the candidate moves. This is done based on the field values and the input weights. Hereby each neuron in the hidden layer always exactly represents a move to a certain field on the board.
In the beginning every neuron in the hidden layer was calculating a candidate move out of all possible moves. But this approach felt too much like an algorithmic solution through the backdoor.
That’s why there are nine neurons in the hidden layer, as there is at any time a maximum of nine possible moves.
Thus the first neuron in the hidden layer stands for a move on the first field, neuron two for a move on the second field and so on. This implies that some neurons cannot “fire” a valid move as the corresponding field is already occupied. This is the equivalent to a threshold that decides whether or not a neuron is activated (fires) or not. If no neuron in the hidden layer can be activated the game is anyway finished as there are no more valid moves anymore.
Activation functions are a vital part of any Neural Network implementation. They are using input values and input weights to calculate output values. Those are the input to the neurons of the next layer or the result computed by the Neural Network. Common to all layers is the randomized generation of output weights when the Neural Network is (re-)initialized. All neurons of one layer are sharing the same implementation of the activation function.
The activation function of this layer is rather simple. It stores the field information that it has retrieved as an input. Based on this it calculates a value depending on the field state and location on the board. Basically this includes a kind of threshold function. Only one of the three neurons reflecting one field is used as an input in the hidden layer.
For each neuron in the hidden layer a so-called position value is calculated based on the input weights and values. This is done applying the formula below where the sum over all input neurons is created. By doing so the complete board state is considered.
For every set of input neurons that are reflecting the same field the input weight and value of the corresponding neuron is used. This depends on whether the field is empty, owned by the computer or owned by the opponent. Thus from the 27 input neurons always only the nine relevant neurons from the input layer are used for this calculation.
Z = ∑ FIELD_VALUE * INPUT_WEIGHT
Then the sigmoid function is applied to Z. The sigmoid function is quite commonly used in activation functions of Neural Networks.
S(Z) = 1 / 1 + e-Z
The resulting value is the positional score for this neuron and thus this candidate move.
In the output layer again a value Z is calculated. But this time not as a sum, but for each of the candidate moves.
Z = POSITION_VALUE * INPUT_WEIGHT
Then again the sigmoid function is applied to Z. The candidate move where S(Z) is the maximum is chosen as the move to execute.
Summary and Outlook
This has been one of the most fun projects for quite some time. Not having any idea where to start was a really interesting experience. Playing around with different parameters like number of neurons and weight changes applied and then seeing how this affects the outcome of playing the game was really fascinating.
Luckily there is still plenty of room for improvements. First of all a more thorough training algorithm can be applied like rewarding all moves that lead to a win and not only the first and the last one. Another idea is to decrease the output weight of neurons if a move has let to a loss.
Then the structure of the Neural Network can be evolved by introducing additional hidden layers and thus increasing the number of neurons and connections between those.
Pretty sure there will be a follow-up to this blog post as one of the main objectives is not yet achieved: A Neural Network that learns to play Tic-Tac-Toe flawlessly :).