Game of Nim, Reinforcement Learning

Intro

In this post we report success in using reinforcement learning to learn the game of Nim. We had previously cited two theses (Erik Järleberg (2011) and Paul Graham & William Lord (2015)) that used Q-learning to learn the game of Nim. However, in this setting the scaling issues with Q-learning are much more severe than with value learning. In this post we use a value-based approach with a table. Because the value-based approach is much more efficient than Q-learning, no function approximation is needed up to reasonable heap sizes.
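As a rough illustration of why such a table stays manageable, here is a minimal sketch; keying states by the sorted heap tuple is our assumption for illustration, not necessarily the exact setup used in the post:

```python
# Minimal sketch: count the distinct states a tabular value function has to cover.
# Assumption: states are keyed by the sorted heap tuple (heap order does not matter).
from itertools import product

def num_states(max_heaps):
    """Number of distinct heap configurations, ignoring heap order."""
    return len({tuple(sorted(h)) for h in product(*(range(m + 1) for m in max_heaps))})

V = {}  # value table: sorted heap tuple -> estimated value
print(num_states([15, 15, 15]))  # 816 entries for three heaps of up to 15 beans
```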


Game of Nim, Supervised Learning

There are entire theses devoted to reinforcement learning of the game of Nim, in particular those of Erik Järleberg (2011) and Paul Graham & William Lord (2015).

Both were successful in training a reinforcement-learning agent to play the game of Nim with a high percentage of accurate moves. However, they used lookup tables as their evaluation functions, which leads to scalability problems. Further, there is no particular advantage in using Q-learning as opposed to a value-based approach. This is due to the fact that the “environment’s response” to a particular action (“take b beans from heap h”) is entirely known, and particularly simple. This is different from, e.g., games whose rules are not stated explicitly and must be learned by the agent, as is the case in the video. In the game of Nim the rules are stated explicitly. Indeed, if the action “take b beans from heap h” is possible, i.e. there are at least b beans on heap h, then the update rule is:

heapSize(h) -> heapSize(h) - b

In other words, the size of heap h is reduced by the b beans taken away from it. Therefore, as stated, there is no advantage in using Q-learning over a value-based approach. The curse of dimensionality, however, is worse for the Q-learning setup: for a heap vector (h0, h1, …, hn-1) there is a single value, but, without paying special attention to duplicates, ~ h0 + h1 + … + hn-1 possible actions, and therefore ~ h0 + h1 + … + hn-1 Q-values for that one state. We will therefore use a value-based approach.
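A minimal sketch of these fully known dynamics and of the per-state action count (the function names are ours, chosen for illustration):

```python
# The environment's response to "take b beans from heap h" is deterministic and known.
def step(heaps, h, b):
    """Apply 'take b beans from heap h': heapSize(h) -> heapSize(h) - b."""
    assert 1 <= b <= heaps[h], "illegal move"
    new_heaps = list(heaps)
    new_heaps[h] -= b
    return tuple(new_heaps)

def legal_actions(heaps):
    """All (heap, amount) pairs; there are h0 + h1 + ... + hn-1 of them."""
    return [(h, b) for h, size in enumerate(heaps) for b in range(1, size + 1)]

heaps = (3, 5, 7)
print(step(heaps, 1, 2))           # (3, 3, 7)
print(len(legal_actions(heaps)))   # 15 actions (and Q-values) for this one state value
```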

In other words, we want to use a neural-network approximation for the evaluation function rather than a lookup table. A priori, however, it is by no means clear that this type of function approximation will work. Yet the game of Nim is in a sense easy, as there is a complete solution of the game (Bouton's theorem: in normal play, the player to move has lost exactly when the bitwise XOR of the heap sizes is zero). We can use this solution to our advantage: it lets us estimate whether the mentioned network approximation is likely to work.

A simple way is to use supervised learning as a test.

The simplest case is to test whether a network can classify a position as winning or losing.
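A minimal sketch of such a test might look as follows. The labels come from the exact solution (Bouton's theorem); the bitwise encoding, the network size and the choice of library are our assumptions for illustration, not the post's actual setup:

```python
# Supervised test: can a small network classify Nim positions as winning or losing?
# Labels come from the exact solution: the player to move has lost iff the XOR
# of the heap sizes is zero (Bouton's theorem).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
N_HEAPS, MAX_SIZE, BITS = 3, 15, 4

def encode(heaps):
    """Binary-encode each heap size; the XOR structure lives in these bits."""
    return [(int(h) >> i) & 1 for h in heaps for i in range(BITS)]

positions = rng.integers(0, MAX_SIZE + 1, size=(20000, N_HEAPS))
X = np.array([encode(p) for p in positions])
y = np.array([int(np.bitwise_xor.reduce(p) != 0) for p in positions])  # 1 = winning

clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0)
clf.fit(X[:15000], y[:15000])
print("majority-class baseline:", y[15000:].mean())   # losing positions are rare (~1/16)
print("held-out accuracy:", clf.score(X[15000:], y[15000:]))
```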
