# Game of Nim, Reinforcement Learning

## Intro

In this post we report success in using reinforcement learning to learn the game of nim. We had previously cited two theses (Erik Järleberg (2011) and Paul Graham & William Lord (2015)) that used Q-learning to learn the game of nim. However, in this setting the scaling issues of Q-learning are much more severe than those of value learning. In this post we therefore use a value-based approach with a lookup table. Because the value-based approach is much more efficient than Q-learning, no function approximation is needed up to reasonable heap sizes.

## Scaling

| maxHeapSize | numberOfHeaps | #States | #Actions | Size (Q-table) |
|------------:|--------------:|--------:|---------:|---------------:|
| 7  | 4 | 3.30E+02 | 2.40E+03 | 1.96E+06 |
| 9  | 5 | 2.00E+03 | 5.90E+04 | 4.82E+07 |
| 11 | 6 | 1.24E+04 | 1.77E+06 | 1.45E+09 |
| 13 | 6 | 2.71E+04 | 4.83E+06 | 3.94E+09 |
| 14 | 8 | 3.20E+05 | 1.48E+09 | 1.20E+12 |

Scaling in the game of nim, as a function of the maximum heap size and the number of heaps.

Given k heaps with a maximum heap size of n, the number of states, which is the number of different positions, can be calculated exactly: it is the number of multisets of cardinality k with elements taken from a finite set of cardinality n. This is called the multiset coefficient or multiset number. (We use the hash symbol # to signify "number of", so #States means "number of states".)

(i) #Different starting positions = #States = ${\displaystyle \textstyle \left(\!\!{n \choose k}\!\!\right)}$

It can be evaluated as

(ii) #States = ${\displaystyle \textstyle \left(\!\!{n \choose k}\!\!\right)} = {n + k - 1 \choose k}$
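As a quick numerical check, the multiset coefficient is available directly via Python's `math.comb`. Note that if empty heaps (size 0) are allowed, the set of possible sizes has cardinality n + 1, which reproduces the figures in the scaling table above:

```python
from math import comb

def n_states(n: int, k: int) -> int:
    """Multiset coefficient: k unordered heaps drawn from n possible sizes."""
    return comb(n + k - 1, k)

# Sizes 1..n only:
print(n_states(9, 5))       # 1287
# Allowing empty heaps, i.e. sizes 0..n (n + 1 possible sizes):
print(n_states(9 + 1, 5))   # 2002, as in the table row for maxHeapSize 9, 5 heaps
```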

To estimate the number of actions we use the same approach as Graham & Lord, namely

(iii) #Actions = ${\displaystyle \textstyle \prod_{i=1}^{k} h_{i}}$, where $h_{i}$ is the maximum size of heap $i$.

Therefore the number of entries in the Q-table is

(iv) Size (Q-table) = #States * #Actions = ${\displaystyle {n + k - 1 \choose k} \cdot \prod_{i=1}^{k} h_{i}}$

whereas the number of entries in the Value tables is

(v) Size (Value-table) = #States = ${\displaystyle \textstyle \left(\!\!{n \choose k}\!\!\right)} = {n + k - 1 \choose k}$

It is apparent that the scalability is much better for the value-table-based approach, as the Q-table is bigger by the actions factor of equation (iii).
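As a minimal sketch of this gap, we can compare the two table sizes directly, taking every $h_i$ at its maximum $n$ in equation (iii), which gives an upper bound $n^k$ on the actions factor:

```python
from math import comb

def v_table_entries(n: int, k: int) -> int:
    # equation (v): one entry per state
    return comb(n + k - 1, k)

def q_table_entries_bound(n: int, k: int) -> int:
    # equation (iv) with every heap at its maximum, h_i = n:
    # an upper bound on #States * #Actions
    return v_table_entries(n, k) * n ** k

for n, k in [(7, 4), (9, 5), (11, 6)]:
    ratio = q_table_entries_bound(n, k) // v_table_entries(n, k)
    print(f"n={n}, k={k}: V-table {v_table_entries(n, k)}, "
          f"Q-table <= {q_table_entries_bound(n, k)} (factor {ratio})")
```

The factor between the two grows as $n^k$, which is why the value-table approach stays tractable at heap sizes where a Q-table would not.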

## Python package with tests

For the solution we created a package, complete with tests, which can be found on GitHub:

https://github.com/hfwittmann/MEAV

### Description of the package

### Tricks

• Memory replay: the package uses standard memory replay to make optimal use of the history of played-out moves.
• Moves are not stored in the history when the choice was suboptimal, as sometimes happens during exploration (cf. the exploration-exploitation trade-off).
• The exploration-exploitation trade-off is implemented using the standard epsilon-greedy mechanism.
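A minimal sketch of that mechanism (the names are illustrative, not the package's actual API): with probability epsilon a uniformly random legal move is chosen, and only greedy moves are flagged for storage in the history, matching the second point above.

```python
import random

def epsilon_greedy(legal_moves, afterstate_value, epsilon=0.1):
    """Return (move, was_greedy).

    `afterstate_value(move)` is a hypothetical lookup of the value-table
    entry for the position the move leads to. Exploratory moves are
    tagged was_greedy=False so they can be excluded from the history.
    """
    if random.random() < epsilon:
        return random.choice(legal_moves), False      # explore
    return max(legal_moves, key=afterstate_value), True  # exploit
```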

## 3 heaps size 10 = (10, 10, 10)


Train the system …
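The training code itself lives in the package; what follows is a minimal, self-contained sketch of tabular value learning via self-play for 3 heaps of size 10. All names are illustrative and simplified, not the actual MEAV API, and sorting the heaps into a canonical order is one plausible way to implement the redundancy reduction mentioned in the summary.

```python
import random

HEAPS, MAX_SIZE = 3, 10
EPSILON, ALPHA, EPISODES = 0.1, 0.5, 5000

V = {}  # value table: canonical state -> value estimate for the player to move

def canonical(state):
    # heaps are unordered, so sorting removes redundant permutations
    return tuple(sorted(state))

def legal_moves(state):
    return [(i, take) for i, h in enumerate(state) for take in range(1, h + 1)]

def apply_move(state, move):
    i, take = move
    s = list(state)
    s[i] -= take
    return tuple(s)

def value(state):
    return V.get(canonical(state), 0.0)

random.seed(0)
for _ in range(EPISODES):
    state, player, history = (MAX_SIZE,) * HEAPS, 0, []
    while sum(state) > 0:
        moves = legal_moves(state)
        if random.random() < EPSILON:
            move = random.choice(moves)  # exploratory move: kept out of the history
        else:
            # greedy: leave the opponent the worst possible position
            move = min(moves, key=lambda m: value(apply_move(state, m)))
            history.append((canonical(state), player))
        state = apply_move(state, move)
        player = 1 - player
    winner = 1 - player  # whoever took the last object wins (normal play)
    for s, p in history:
        target = 1.0 if p == winner else -1.0
        old = V.get(s, 0.0)
        V[s] = old + ALPHA * (target - old)
```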
Now let’s plot the result:

Next we’ll look at two different starting positions, namely:

## 3 heaps size 20 = (20, 20, 20)

If you run this in the same directory as a previous run, make sure you delete the cache folder inside your working directory.

You can use the same code as above, except for one line that needs to be changed.

## 4 heaps size 7 = (7, 7, 7, 7)

Next we restrict the heaps to those positions that can be reached from a (7, 7, 7, 7) starting position. Among these is the classical, so-called Marienbad position, namely (1, 3, 5, 7).
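Game theory gives an independent check here: in normal-play nim, a position is a loss for the player to move exactly when the nim-sum, i.e. the XOR of the heap sizes, is zero, and the Marienbad position satisfies this:

```python
from functools import reduce
from operator import xor

def nim_sum(heaps):
    """XOR of the heap sizes; zero means a loss for the player to move."""
    return reduce(xor, heaps, 0)

print(nim_sum((1, 3, 5, 7)))  # 0: the Marienbad position is a loss for the mover
print(nim_sum((7, 7, 7, 7)))  # 0: the starting position itself is also a loss
```

So a well-trained agent should steer its opponent into such zero-nim-sum positions, which is what the experiment below checks.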

If you run this in the same directory as a previous run, make sure you delete the cache folder inside your working directory.

You can use the same code as above, except for one line that needs to be changed.

Success!

## Summary

We have implemented a system that shows several improvements compared to earlier work.
The system is more scalable, by virtue of the value-based approach, coupled with a redundancy-reducing trick.

We have also packaged the implementation, complete with an extensive set of test cases.

The result is a system which can play the game of nim up to reasonable heap sizes without the need for function approximation.