Intro
In this post we report success in using reinforcement learning to learn the game of nim. We had previously cited two theses (Erik Järleberg (2011) and Paul Graham & William Lord (2015)) that used Q-learning to learn the game of nim. However, in this setting the scaling issues with Q-learning are much more severe than with value learning. In this post we therefore use a value-based approach with a table. Because the value-based approach is much more efficient than Q-learning, no function approximation is needed up to reasonable heap sizes.
Scaling
maxHeapSize | numberOfHeaps | #States | #Actions | Q-table entries
---|---|---|---|---
7 | 4 | 3.30E+02 | 2.40E+03 | 1.96E+06
9 | 5 | 2.00E+03 | 5.90E+04 | 4.82E+07
11 | 6 | 1.24E+04 | 1.77E+06 | 1.45E+09
13 | 6 | 2.71E+04 | 4.83E+06 | 3.94E+09
14 | 8 | 3.20E+05 | 1.48E+09 | 1.20E+12
Given k heaps with a maximum heap size n, the number of states, which is the number of different starting positions, can be calculated exactly. It is the number of multisets of cardinality k with elements taken from a finite set of cardinality n. This is called the multiset coefficient or multiset number, and its notation is shown below (we use the hash symbol # to signify "number of", so #States means the number of states):
(i) #Different starting positions = #States = $\left(\!\!\binom{n}{k}\!\!\right)$
It can be evaluated as
(ii) #States = $\left(\!\!\binom{n}{k}\!\!\right) = \binom{n + k - 1}{k}$
To estimate the number of actions we use the same approach as Graham & Lord, namely
(iii) #Actions = $\prod_{i=1}^{k} h_{i}$, where $h_{i}$ denotes the size of heap $i$
Therefore the number of entries in the Q-table is
(iv) Size(Q-table) = #States * #Actions = $\binom{n + k - 1}{k} \cdot \prod_{i=1}^{k} h_{i}$
whereas the number of entries in the value table is
(v) Size(Value-table) = #States = $\left(\!\!\binom{n}{k}\!\!\right) = \binom{n + k - 1}{k}$
It is apparent that scalability is much better for the value-table based approach, as the Q-table is larger by the actions factor of equation (iii).
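As an illustrative sanity check of equations (ii) to (v), the sizes can be computed directly; the helper below is not part of the MEAV package, and it assumes that heap contents range over 0, …, n (so n + 1 possible values per heap), which reproduces the #States and #Actions columns of the table above.

from math import comb

def table_sizes(n: int, k: int) -> dict:
    """Table sizes for k heaps of maximum size n, following equations (ii)-(v).

    Assumes each heap can hold 0..n items, i.e. n + 1 possible values per heap,
    so the multiset coefficient is taken over n + 1 elements.
    """
    n_states = comb((n + 1) + k - 1, k)   # equation (ii): multiset coefficient
    n_actions = n ** k                    # equation (iii): product of the k maximal heap sizes
    return {
        "states": n_states,               # equation (v): size of the value table
        "actions": n_actions,
        "q_table": n_states * n_actions,  # equation (iv): size of the Q-table
    }

print(table_sizes(7, 4))  # states = 330 and actions = 2401, matching the first row of the table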
Python package, with tests
For the solution we created a Python package, complete with tests, which can be found on GitHub:
https://github.com/hfwittmann/MEAV
Description of the package
Tricks
- Memory replay: the package uses standard memory replay to make optimal use of the history of played-out moves.
- Moves are not stored in the history when the choice was suboptimal, as is sometimes the case during exploration (cf. the exploration/exploitation trade-off).
- The exploration/exploitation trade-off is implemented using the standard epsilon-greedy mechanism (see the sketch after this list).
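To make the last two points concrete, here is a minimal, hypothetical sketch of epsilon-greedy selection combined with history filtering; the names select_move, value_of and memory.store are illustrative and not the actual MEAV API.

import random

def select_move(state, legal_moves, value_of, epsilon=0.1):
    """Epsilon-greedy selection: usually exploit the value table, occasionally explore.

    Also returns a flag indicating whether the greedy move was chosen, so that
    exploratory (possibly suboptimal) moves can be kept out of the replay history.
    """
    greedy_move = max(legal_moves, key=lambda move: value_of(state, move))
    if random.random() < epsilon:
        chosen = random.choice(legal_moves)   # explore: random legal move
    else:
        chosen = greedy_move                  # exploit: best move according to the value table
    return chosen, chosen == greedy_move

# Usage sketch: only greedy moves are stored, so exploration does not pollute the history.
# move, was_greedy = select_move(state, legal_moves, value_of)
# if was_greedy:
#     memory.store(state, move)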
Results
3 heaps, size 10
Load the packages…
import matplotlib.style as style
import matplotlib.pyplot as plt
style.use('ggplot')
from MEAV import Memory, Agent, Environment, ValueTable, PerfectTest, RunOnPolicy
rOP = RunOnPolicy(Memory, Environment, Agent, ValueTable, PerfectTest, maxHeapSize=10)

accuracy_predictions, tableSize, memorySize, reward, \
    stats_rewards, stats_differences, stats_accuracy = \
    rOP.runEpisodes(desired_accuracy_predictions=0.99, \
                    maxHeapSize=7, numberOfHeaps=3)
plt.plot(stats_accuracy)
plt.xlabel('Number of episodes')
plt.ylabel('Out of sample accuracy')
plt.show()
plt.plot(stats_differences, color='Blue')
plt.xlabel('Number of episodes')
plt.ylabel('Sum of Differences to Optimal moves (Exploration)')
plt.show()
Next we’ll look at two different starting positions, namely
3 heaps, size 20 = (20, 20, 20)
If you run this in the same directory as a previous run, make sure you delete the cache folder inside your working directory.
You can use the same code as above, except for one line that needs to be changed.
accuracy_predictions, tableSize, memorySize, reward, \
    stats_rewards, stats_differences, stats_accuracy = \
    rOP.runEpisodes(desired_accuracy_predictions=0.99, \
                    maxHeapSize=20, numberOfHeaps=3)
4 heaps, size 7 = (7, 7, 7, 7)
Next we restrict the heaps to those positions that can be reached from a (7, 7, 7, 7) starting position. Among these is the classic so-called Marienbad position, (1, 3, 5, 7).
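As a reminder, perfect play in nim is characterised by the nim-sum, the bitwise XOR of the heap sizes: in normal play the player to move loses against optimal opposition exactly when the nim-sum is zero (and the same holds in the misère version whenever at least one heap has more than one object). The following plain-Python check, independent of the MEAV package, confirms this for the two positions just mentioned.

from functools import reduce
from operator import xor

def nim_sum(heaps):
    """Bitwise XOR of the heap sizes; zero means the player to move is in a losing position."""
    return reduce(xor, heaps, 0)

print(nim_sum((1, 3, 5, 7)))   # 0: the Marienbad position is lost for the player to move
print(nim_sum((7, 7, 7, 7)))   # 0: so is the (7, 7, 7, 7) starting position itself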
If you run this in the same directory as a previous run, make sure you delete the cache folder inside your working directory.
You can use the same code as above, except for one line that needs to be changed.
accuracy_predictions, tableSize, memorySize, reward, \
    stats_rewards, stats_differences, stats_accuracy = \
    rOP.runEpisodes(desired_accuracy_predictions=0.99, \
                    maxHeapSize=7, numberOfHeaps=4)
Success!
Summary
We have implemented a system that shows several improvements compared to earlier work.
The system is more scalable, by virtue of the value-based approach coupled with a redundancy-reducing trick.
We have also implemented a package version that makes extensive use of test cases.
The result is a system which can play the game of nim up to reasonable heap sizes without the need for function approximation.