“The first duty in life is to be as artificial as possible. What the second duty is no one has as yet discovered.” Oscar Wilde
 
Game of Nim

This post starts a two-part mini-series in which we want to solve the Game of Nim using reinforcement learning. This first part is devoted to having a look at our own Nim-specific custom environment for OpenAI Gym.

The Game of Nim is a simple two-player game. The rules of the game are indeed very simple; however, playing it well is difficult for the uninitiated.

It is a mathematical game, and technically it is a zero-sum, complete-information game, which sounds complicated but is not. It means there is no chance element such as dice, both players see the full position, and a loss for one side is a win for the other.

More information can be found here: https://en.wikipedia.org/wiki/Nim

In the popular variant, which we adopt here, the starting position is formed by three randomly sized heaps of objects (“beans”). Player A starts by taking any number of beans from any one heap. Then Player B does likewise, i.e. takes any number of beans from any one heap. The players keep taking turns.

The game is typically played in the misère variant, that is, the player who makes the last move loses (misère is French for destitution, i.e. a bad situation, in case you did not know).
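To make the misère rule concrete, here is a tiny hand-played example in plain Python (the heap sizes are made up purely for illustration):

# A hand-played misère Nim game with made-up heap sizes (illustration only).
heaps = [1, 0, 2]      # three heaps of beans
heaps[2] -= 2          # Player A takes 2 beans from heap 2 -> [1, 0, 0]
heaps[0] -= 1          # Player B takes the last bean       -> [0, 0, 0]
# Misère rule: the player who made the last move loses, so Player B loses here.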

We’ve discussed how to extend OpenAI Gym with a custom environment in previous posts. The main structural addition here is unit tests. These tests aim to add stability to the code and to guide the user if they try an action which is not allowed.

The action is supposed to be a list of two numbers, specifying the heap and the number of beans taken, e.g. action = [0, 40] means take 40 beans from heap 0 (bear in mind that heap numbering starts at 0, so this is the first heap).
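For instance, a conforming action is passed to the environment like this (a minimal sketch, assuming the environment env has already been created with gym.make('game_nim-v0') as shown further below, and that heap 0 actually holds at least 40 beans):

# A conforming action: a list of two numbers, [heap index, number of beans to take].
action = [0, 40]                                     # take 40 beans from heap 0, i.e. the first heap
next_state, reward, done, info = env.step(action)    # standard gym step signature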

We use various assertions in GameNimEnv to check that the action conforms to this pattern. In the unit tests we deliberately specify non-conforming actions in order to check that these assertions actually work.

So first here are the assertions from GameNimEnv:

        assert isinstance(action, list), 'Wrong type. ' + \
            "Type is: {}. Should be <class 'list'>".format(type(action))

        assert len(action) == 2, 'Wrong length. ' + \
            'Length is: {}. Should be 2'.format(len(action))

        heapnumber = action[0]
        beansnumber = action[1]

        heapsize = self.state[heapnumber]

        assert heapsize >= beansnumber, 'Heap is not big enough. ' + \
            "Heap number is: {}. You tried to take {}, but there are only {} beans in this heap".format(
                heapnumber, beansnumber, heapsize)

and here are the unit tests that check the assertions work properly by submitting non-conforming actions:

        G.reset()
        # use first heap
        firstHeap = G.state[0]

        # produce failures to check asserts are working
        # 1 failure
        with self.assertRaises(AssertionError) as context:
            # use wrong type for action (action should be a list of 2 values)
            G.step(0)
        self.assertTrue('Wrong type' in str(context.exception), str(context.exception))

        # 2 failure
        with self.assertRaises(AssertionError) as context:
            # use wrong list length for action (action should be a list of 2 values)
            G.step([0])
        self.assertTrue('Wrong length' in str(context.exception), str(context.exception))

        # 3 failure
        with self.assertRaises(AssertionError) as context:
            # try to take more than is in the first heap to cause error
            G.step([0, firstHeap + 1])
        self.assertTrue('Heap is not big enough' in str(context.exception), str(context.exception))
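For context, these checks live inside a standard unittest test case. A minimal skeleton looks roughly like this (the class and method names here are illustrative and not necessarily those used in the repository):

import unittest
import gym
import gym_game_nim

class TestGameNimEnv(unittest.TestCase):      # illustrative name
    def setUp(self):
        # G is the environment instance used in the checks above
        self.G = gym.make('game_nim-v0')

    def test_assertions(self):
        G = self.G
        G.reset()
        with self.assertRaises(AssertionError):
            G.step(0)    # wrong type: the action must be a list of two numbers
        # ... the remaining checks shown above go here ...

if __name__ == '__main__':
    unittest.main()      # or from the command line: python -m unittest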

To simulate the game we use our own custom OpenAI Gym environment: https://github.com/hfwittmann/gym_game_nim.

In [144]:
# import packages
import sys
import numpy as np
import matplotlib.style as style
import matplotlib.pyplot as plt
style.use('ggplot')

# for debugging
from IPython.core.debugger import set_trace

Now import the custom environment …

In [145]:
import gym
import gym_game_nim
env = gym.make('game_nim-v0')

We reset the environment to initialize it, then we do a random playout.

In [148]:
env.reset()
state = env.state.copy()
done = False
reward = 0

# game development
REWARDS = []
BEANSNUMBERS = []
STATES = []
HEAPSIZES = []
HEAPNUMBERS = []

while not done:
    
    # first choose heap with non-zero size
    nonZeroHeaps = np.arange(3)[env.state != 0]

    #  zero index heapnumber
    heapnumber = np.random.choice(nonZeroHeaps)

    # size of the chosen heap
    heapsize = env.state[heapnumber]

    # take a random number of beans, of course fewer than the heap size
    beansnumber = np.random.choice(heapsize) + 1 # np.random.choice(heapsize) starts at zero, therefore + 1

    next_state, reward, done, info = env.step([heapnumber, beansnumber])
        
    # set_trace()
    STATES.append(state)
    REWARDS.append(reward)
    BEANSNUMBERS.append(beansnumber)
    HEAPSIZES.append(heapsize)
    HEAPNUMBERS.append(heapnumber)
    
    state = next_state.copy()
starting in state: [35 41 99]

Let’s now plot the game’s progress, and make a few observations:
1) The heap sizes should go down as a function of the number of steps taken. In the plot ‘Random Playouts: Heap size’ we see that they do.

In [154]:
STATES = np.array(STATES)

for heapnumber in range(3):
    plt.plot(STATES[:,heapnumber], label = 'Heap' + str(heapnumber) )


plt.title('Random Playouts: Heap size')
plt.legend(loc='best')
plt.xlabel('Steps taken')
plt.ylabel('Heap size')
plt.show()
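The same observation can also be checked numerically: along the playout no heap should ever grow. A small sanity check on the STATES array built above:

# Heap sizes must be non-increasing from one step to the next.
assert (np.diff(STATES, axis=0) <= 0).all(), 'A heap size increased during the playout'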

2) At each step, the maximum of the current heap sizes should be greater than or equal to the chosen heap’s size, which in turn should be greater than or equal to the number of beans taken away. In the plot ‘Random Playouts Inequalities’ we see that these inequalities hold.

In [155]:
plt.plot(STATES.max(axis=1), label = 'Max Heapsize')
plt.plot(HEAPSIZES, label = "Chosen Heaps' sizes")
plt.plot(BEANSNUMBERS, label = "Chosen beansnumbers")
plt.title("Random Playouts Inequalities: \n Max Heap size is bigger o.e. than \nChosen Heaps' sizes is bigger o.e. than \nChosen beansnumbers\n (o.e. or equal )")
plt.legend(loc='best')
plt.show()
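Again, this can be verified numerically with a small sanity check on the arrays recorded during the playout:

HEAPSIZES_arr = np.array(HEAPSIZES)
BEANSNUMBERS_arr = np.array(BEANSNUMBERS)

# At every step: max heap size >= chosen heap's size >= number of beans taken.
assert (STATES.max(axis=1) >= HEAPSIZES_arr).all()
assert (HEAPSIZES_arr >= BEANSNUMBERS_arr).all()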

In this post we had a look at extending OpenAI Gym with a custom environment for the Game of Nim. We then played a random game and analysed it with some plots.

In the second part, we’ll take a stab at learning the optimal strategy using reinforcement learning.
