“The first duty in life is to be as artificial as possible. What the second duty is no one has as yet discovered.” Oscar Wilde

# Game of Nim

This post starts a mini series of two posts in which we want to solve the Game of Nim using Reinforcement learning. The first part of this mini series is devoted to having a look at using our own Nim-specific custom environment for OpenAI.

The Game of Nim is a simple two player game. The rules of the game are indeed very simple, however, to play it well is difficult for the uninitiated.

It is a mathematical game and technically it is a zero sum complete information game, which sounds complicated but it is not. It means there are no dice in this game and a loss for one side is a win for the other.

In the popular variant, which we adopt here, the starting position is formed by three random sized heaps of objects (“beans”). One Player “A” starts by taking from any one heap any number of beans. Then the second Player “B” does likewise, i.e. s/he takes any number of beans from any one heap. The players take turns.

The game is typically played in the misère variant, that is the last person who plays a move loses (misère is french for destitution, that’s a bad situation, in case you did not know).

We’ve discussed how to extend the environment in previous posts. The main structural addition here are unit tests. These tests aim to add stability to the code and guidance to the user if he tries an action which is not allowed.

The action is supposed to be a list of two numbers, specifying the heap and the amount of beans taken, e.g. action = [0, 40] means from heap 0 take 40 (bear in mind that heap numbering starts at 0, so this is the first heap).

We use various assertions in the GameNimEnv to check that the action conforms to this pattern. In the unit test we specify non-conforming actions, in order to check whether these assertions actually work.

So first here are the assertions from GameNimEnv:

and here are the unit tests checking the proper working of the assertions by submitting non-conforming actions:

To simulate the game we use our own custom OpenAI environment : https://github.com/hfwittmann/gym_game_nim.

In [144]:

Now import custom environment …

In [145]:

We reset the environment, to intialize, then we do a random playout

In [148]:

Let’s now plot the game’s progress, and make a few observations:
1) The heap sizes should go down as a fucntion of the number of steps taken. In Plot ‘Random Playouts: Heap size’ we see that they do.

In [154]:

2) At each step, the maximum of the current heap sizes should be bigger (or equal) than the chosen’s heap size, which in turn should be bigger than the chosen number of beans that are taken away. In plot ‘Random Playouts Inequalities’ we see that they are.

In [155]:

In this post we had a look at extending OpenAI with a custom environment. We then played a random game and analysed with some plots.

In the second part, we’ll take a bite at learning the optimal strategy, using reinforcement learning.