Intro
In this post we report success in using reinforcement learning to learn the game of nim. We had previously cited two theses (Erik Järleberg (2011) and Paul Graham & William Lord (2015)) that used Q-learning to learn the game of nim. However, in this setting the scaling issues with Q-learning are much more severe than with value learning. In this post we therefore use a value-based approach with a table. Because the value-based approach is much more efficient than Q-learning, no function approximation is needed up to reasonable heap sizes.
Scaling
maxHeapSize | numberOfHeaps | #States | #Actions | Q-table entries
---|---|---|---|---
7 | 4 | 3.30E+02 | 2.40E+03 | 1.96E+06
9 | 5 | 2.00E+03 | 5.90E+04 | 4.82E+07
11 | 6 | 1.24E+04 | 1.77E+06 | 1.45E+09
13 | 6 | 2.71E+04 | 4.83E+06 | 3.94E+09
14 | 8 | 3.20E+05 | 1.48E+09 | 1.20E+12
Given k heaps with a maximum heap size n, the number of states, which is the number of different starting positions, can be calculated exactly. It is the number of multisets of cardinality k with elements taken from a finite set of cardinality n. This is called the multiset coefficient or multiset number, and its notation is shown below (we use the hash symbol # to signify "number of", so #States means the number of states):
(i) #Different starting positions = #States = $\left(\!\!\binom{n}{k}\!\!\right)$
It can be evaluated as
(ii) #States = $\left(\!\!\binom{n}{k}\!\!\right) = \binom{n + k - 1}{k}$
To estimate the number of actions we use the same approach as Graham & Lord, namely
(iii) #Actions = $\prod_{i=1}^{k} h_{i}$, where $h_{i}$ denotes the size of heap $i$
Therefore the number of entries in the Q-table is
(iv) Size(Q-table) = #States * #Actions = $\binom{n + k - 1}{k} \cdot \prod_{i=1}^{k} h_{i}$
whereas the number of entries in the value table is
(v) Size(Value-table) = #States = $\left(\!\!\binom{n}{k}\!\!\right) = \binom{n + k - 1}{k}$
It is apparent that scalability is much better for the value-table based approach, as the Q-table is larger by the actions factor of equation (iii).
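As an illustrative sanity check of equations (ii) to (v), the sizes can be computed directly; the helper below is not part of the MEAV package, and it assumes that heap contents range over 0, …, n (so n + 1 possible values per heap), which reproduces the #States and #Actions columns of the table above.

from math import comb

def table_sizes(n: int, k: int) -> dict:
    """Table sizes for k heaps of maximum size n, following equations (ii)-(v).

    Assumes each heap can hold 0..n items, i.e. n + 1 possible values per heap,
    so the multiset coefficient is taken over n + 1 elements.
    """
    n_states = comb((n + 1) + k - 1, k)   # equation (ii): multiset coefficient
    n_actions = n ** k                    # equation (iii): product of the k maximal heap sizes
    return {
        "states": n_states,               # equation (v): size of the value table
        "actions": n_actions,
        "q_table": n_states * n_actions,  # equation (iv): size of the Q-table
    }

print(table_sizes(7, 4))  # states = 330 and actions = 2401, matching the first row of the table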
Python package, with tests
For the solution we created a Python package, complete with tests, which can be found on GitHub:
https://github.com/hfwittmann/MEAV
Description of the package
Tricks
- Memory replay: the package uses standard memory replay to make optimal use of the history of played-out moves.
- Moves are not stored in the history when the choice was suboptimal, as is sometimes the case during exploration (cf. the exploration/exploitation trade-off).
- The exploration/exploitation trade-off is implemented using the standard epsilon-greedy mechanism (see the sketch after this list).
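To make the last two points concrete, here is a minimal, hypothetical sketch of epsilon-greedy selection combined with history filtering; the names select_move, value_of and memory.store are illustrative and not the actual MEAV API.

import random

def select_move(state, legal_moves, value_of, epsilon=0.1):
    """Epsilon-greedy selection: usually exploit the value table, occasionally explore.

    Also returns a flag indicating whether the greedy move was chosen, so that
    exploratory (possibly suboptimal) moves can be kept out of the replay history.
    """
    greedy_move = max(legal_moves, key=lambda move: value_of(state, move))
    if random.random() < epsilon:
        chosen = random.choice(legal_moves)   # explore: random legal move
    else:
        chosen = greedy_move                  # exploit: best move according to the value table
    return chosen, chosen == greedy_move

# Usage sketch: only greedy moves are stored, so exploration does not pollute the history.
# move, was_greedy = select_move(state, legal_moves, value_of)
# if was_greedy:
#     memory.store(state, move)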
Results
3 heaps, size 10
Load the packages…
import matplotlib.style as style
import matplotlib.pyplot as plt
style.use('ggplot')
from MEAV import Memory, Agent, Environment, ValueTable, PerfectTest, RunOnPolicy
rOP = RunOnPolicy(Memory, Environment, Agent, ValueTable, PerfectTest, maxHeapSize=10)

accuracy_predictions, tableSize, memorySize, reward, \
    stats_rewards, stats_differences, stats_accuracy = \
    rOP.runEpisodes(desired_accuracy_predictions=0.99, \
                    maxHeapSize=7, numberOfHeaps=3)
plt.plot(stats_accuracy)
plt.xlabel('Number of episodes')
plt.ylabel('Out of sample accuracy')
plt.show()
plt.plot(stats_differences, color='Blue')
plt.xlabel('Number of episodes')
plt.ylabel('Sum of Differences to Optimal moves (Exploration)')
plt.show()
Next we’ll look at two different starting positions, namely
3 heaps, size 20 = (20, 20, 20)
If you run this in the same directory as a previous run, make sure you delete the cache folder inside your working directory.
You can use the same code as above, except for one line that needs to be changed.
accuracy_predictions, tableSize, memorySize, reward, \
    stats_rewards, stats_differences, stats_accuracy = \
    rOP.runEpisodes(desired_accuracy_predictions=0.99, \
                    maxHeapSize=20, numberOfHeaps=3)
4 heaps, size 7 = (7, 7, 7, 7)
Next we restrict the heaps to those positions that can be reached from a (7, 7, 7, 7) starting position. Among these is the classic so-called Marienbad position, (1, 3, 5, 7).
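As a reminder, perfect play in nim is characterised by the nim-sum, the bitwise XOR of the heap sizes: in normal play the player to move loses against optimal opposition exactly when the nim-sum is zero (and the same holds in the misère version whenever at least one heap has more than one object). The following plain-Python check, independent of the MEAV package, confirms this for the two positions just mentioned.

from functools import reduce
from operator import xor

def nim_sum(heaps):
    """Bitwise XOR of the heap sizes; zero means the player to move is in a losing position."""
    return reduce(xor, heaps, 0)

print(nim_sum((1, 3, 5, 7)))   # 0: the Marienbad position is lost for the player to move
print(nim_sum((7, 7, 7, 7)))   # 0: so is the (7, 7, 7, 7) starting position itself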
If you run this in the same directory as a previous run, make sure you delete the cache folder inside your working directory.
You can use the same code as above, except for one line that needs to be changed.
accuracy_predictions, tableSize, memorySize, reward, \
    stats_rewards, stats_differences, stats_accuracy = \
    rOP.runEpisodes(desired_accuracy_predictions=0.99, \
                    maxHeapSize=7, numberOfHeaps=4)
Success!
Summary
We have implemented a system that shows several improvements compared to earlier work.
The system is more scalable, by virtue of the value-based approach coupled with a redundancy-reducing trick.
We have also implemented a package version that makes extensive use of test cases.
The result is a system which can play the game of nim up to reasonable heap sizes without the need for function approximation.