Monte Carlo Tree Search: Improving the rollouts using the Rapid Action Value Estimation (RAVE) algorithm

Let’s implement the Rapid Action Value Estimation (RAVE) and All Moves As First (AMAF) algorithms to enhance the quality of simulations in the MCTS algorithm.

Masoud Masoumi Moghadam
6 min read · Sep 3, 2023

Here is what we covered in the previous articles:

  • Reinforcement learning and its components (Article 1).
  • Monte Carlo Tree Search (MCTS) and how it works (Article 2).
  • Implementing the UCT algorithm on a board game using Python (Article 3).

For those in a hurry, this link provides the project code:

Let’s review what we have learned about MCTS so far. We learned that board games are the kind of environment modeled by a Markov Decision Process (MDP), and that because of the huge state-action space, it is impractical to search every possible state.

To deal with that problem, the Monte Carlo Tree Search algorithm accumulates rewards on nodes (which represent game states) and uses them to find promising trajectories in the search tree. It consists of four phases: selection, expansion, simulation (also called rollout or playout), and backpropagation (also called backup).

  • Selection: starting from the root, we iterate through the previously visited nodes and pick the child with the maximum UCT score.
  • Expansion: when we reach a leaf node (one that does not correspond to the end of the game), we generate its children.
  • Simulation: from the new node, we keep choosing actions according to some predefined policy (such as random selection) until we reach a terminal state.
  • Backpropagation: we reward the nodes chosen in the selection phase with 1 if the winning player took those actions, and with 0 (or -1) if the losing player did.

So each time the agent visits a node, two values get updated: the number of visits and the accumulated reward.
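To see how those four phases fit together in code, here is a minimal sketch of one plain UCT iteration. The `Node` class, the exploration constant, and the game-state methods (`turn`, `legal_moves`, `play`, `is_terminal`, `winner`) are illustrative assumptions, not the names used in the project code:

```python
import math
import random

EXPLORATION = 0.5  # exploration constant; a tunable assumption


class Node:
    """Minimal tree node (a sketch, not the project's actual Node class)."""

    def __init__(self, move=None, player=None, parent=None):
        self.move = move        # move that led to this node
        self.player = player    # player who made that move
        self.parent = parent
        self.children = {}      # move -> Node
        self.visits = 0         # number of UCT visits
        self.reward = 0         # accumulated reward

    def uct_score(self):
        # Unvisited nodes are tried first; otherwise exploitation + exploration.
        if self.visits == 0:
            return float("inf")
        exploit = self.reward / self.visits
        explore = EXPLORATION * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore


def one_iteration(root, state):
    """One pass of selection, expansion, simulation, and backpropagation."""
    node, path = root, [root]

    # Selection: follow the child with the maximum UCT score.
    while node.children:
        node = max(node.children.values(), key=Node.uct_score)
        state.play(node.move)
        path.append(node)

    # Expansion: generate the leaf's children if the game is not over.
    if not state.is_terminal():
        for move in state.legal_moves():
            node.children[move] = Node(move, state.turn(), parent=node)
        move, node = random.choice(list(node.children.items()))
        state.play(move)
        path.append(node)

    # Simulation (rollout): play random moves until a terminal state.
    while not state.is_terminal():
        state.play(random.choice(state.legal_moves()))
    winner = state.winner()

    # Backpropagation: update visits and rewards along the selected path.
    for n in path:
        n.visits += 1
        if n.player is not None:
            n.reward += 1 if n.player == winner else 0
```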

The number of simulations and the number of wins are the two statistics we keep in each tree node, and the UCT formula uses them to calculate each node’s score. Let’s illustrate that with an image:

Illustration of UCT updates: as you can see, it takes some time for the tree to reach the promising nodes where it is more likely to win. Image by author.

As you can see in the image above, UCT only updates the nodes chosen in the selection phase. This process demands a lot of time for the tree to get warmed up. “Warming up the tree” here means reaching a state where the game states have been sampled enough that the decision made in each turn is quite confident. But what should we do before the tree gets warmed up? Is there a way to make up for the lack of simulations, especially at the beginning of the game, when there are no statistics on the nodes yet? To overcome this issue, we are going to apply RAVE and AMAF in our work.

All Moves as First (AMAF):

We know that when the number of simulations (UCT visits and updates) on a node is low, the value of the node is not statistically reliable. To overcome this problem, Helmbold [1] suggested updating not only the actions chosen in the selection phase (as UCT does) but also the actions chosen in the simulation phase. This is how it looks in contrast to UCT updates:

Illustration of AMAF updates. Image by author.

AMAF is more time-consuming than UCT since it updates more nodes in different layers of the tree.

The AMAF scores can help the MCTS agent choose better moves while a node’s visit counts are still low. Once a node’s UCT visits are sufficient, we can rely solely on the UCT evaluation. This is the idea behind the AMAF and RAVE algorithms.

In UCT, we only update the moves played in the selection phase, but AMAF updates the moves played in both the selection and simulation phases. This means that the reward estimate for an action A from a state S is updated whenever A is encountered during a playout, even if A was not the actual move chosen from S [1].
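To make that rule concrete, here is a small sketch of an AMAF update for a single node after one playout; the attribute names (`amaf_visits`, `amaf_reward`, `player`) are assumptions for illustration only:

```python
def amaf_update(node, playout_moves, winner):
    """Update the AMAF statistics of node's children after one playout.

    playout_moves is the list of (player, move) pairs actually played,
    both during selection and during the random simulation.
    """
    for player, move in playout_moves:
        child = node.children.get(move)
        if child is None or child.player != player:
            continue
        # The move occurred somewhere later in the playout, so it is counted
        # "as first", even though it was not the move chosen from this node.
        child.amaf_visits += 1
        if player == winner:
            child.amaf_reward += 1
```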

Now the question is how we can blend the AMAF and UCT algorithms. It’s simple: we use the RAVE scoring formula [2]:

V = α · A + (1 - α) · U

In this formula, V is the value of each node, A is the AMAF score which is calculated by dividing the number of AMAF wins by the total AMAF simulations, and U is the UCT score. The α value is calculated by this formula [2]:

α = max(0, (C_rave - n) / C_rave), where n is the node’s number of UCT visits

Each node starts with 0 UCT and AMAF visits. Imagine C-rave is 20: after a few simulations, when a node has only 5 UCT visits, α is still large (0.75 here) because the UCT visits are low, so according to the RAVE scoring function the AMAF score gets more weight than the UCT score. As the UCT visits grow toward 20, α drops to 0, and the UCT score gets more weight than the AMAF score. This is how the UCT evaluation becomes dominant in the RAVE formula as the UCT visits increase.

The parameter C-rave states how many UCT visits a node must have before the RAVE values are no longer used at all [2]. C-rave is a constant, typically between 200 and 700, that can be tuned experimentally.
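As a quick sanity check of that schedule, here is the α computation with the toy value C-rave = 20 from the example above (a throwaway helper, not part of the project code):

```python
def alpha(visits, c_rave=20):
    # The AMAF weight fades linearly to zero as UCT visits approach c_rave.
    return max(0.0, (c_rave - visits) / c_rave)

for n in (0, 5, 10, 20, 40):
    print(n, alpha(n))   # -> 0 1.0, 5 0.75, 10 0.5, 20 0.0, 40 0.0
```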

Implementation of the RAVE algorithm

To implement this algorithm, I added another file named rave_mctsagent.py with two classes, RaveNode and RaveMctsAgent, which replace the previous Node and UctMctsAgent classes.
This is the RaveNode:
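The gist embedded in the original post is not reproduced here, so the following is only a minimal sketch of what such a node could look like; the attribute and parameter names (`N`, `Q`, `N_amaf`, `Q_amaf`, `explore`, `rave_const`) are placeholders rather than the exact names in rave_mctsagent.py:

```python
import math


class RaveNode:
    """Sketch of a tree node that keeps both UCT and AMAF statistics."""

    def __init__(self, move=None, player=None, parent=None):
        self.move = move
        self.player = player      # player who made the move into this node
        self.parent = parent
        self.children = {}        # move -> RaveNode
        self.N = 0                # UCT visits
        self.Q = 0                # UCT reward (wins)
        self.N_amaf = 0           # AMAF visits
        self.Q_amaf = 0           # AMAF reward (wins)

    def value(self, explore=0.5, rave_const=300):
        # Unvisited nodes get the highest priority while exploring.
        if self.N == 0:
            return float("inf") if explore > 0 else 0.0
        # alpha fades the AMAF contribution out as UCT visits grow.
        alpha = max(0.0, (rave_const - self.N) / rave_const)
        uct = self.Q / self.N + explore * math.sqrt(
            math.log(self.parent.N) / self.N)
        amaf = self.Q_amaf / self.N_amaf if self.N_amaf else 0.0
        return alpha * amaf + (1 - alpha) * uct
```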

As you can see, the value method is redesigned by combining AMAF and UCT.

Now let’s write RaveMctsAgent:

In lines 4, 16, 36, and 103 we replaced the Node used in UctMctsAgent with RaveNode. In the roll_out method (lines 121-131), after the rollout is finished, we store the white and black cells, and then in the backup method (lines 145-153) we reward the rave cells with AMAF updates. Accordingly, we update the search method to use the new roll_out and backup methods.
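Since that code is also embedded as a gist (the line numbers above refer to the repository file, not to anything shown here), a rough sketch of what such a roll_out and backup pair could look like for a board with black and white cells follows; every name and structure below is an illustrative assumption:

```python
import random


def roll_out(state):
    """Random playout that also records which cells each colour claimed."""
    moves_played = []
    while not state.is_terminal():
        move = random.choice(state.legal_moves())
        moves_played.append((state.turn(), move))
        state.play(move)
    # Separate the playout moves by colour so backup can do the AMAF updates.
    black_cells = [m for p, m in moves_played if p == "black"]
    white_cells = [m for p, m in moves_played if p == "white"]
    return state.winner(), black_cells, white_cells


def backup(path, winner, black_cells, white_cells):
    """Standard UCT backup for the selected path plus AMAF updates on siblings."""
    rave_cells = {"black": black_cells, "white": white_cells}
    for node in reversed(path):
        # Usual UCT statistics for the node that was actually visited.
        node.N += 1
        node.Q += 1 if node.player == winner else 0
        # AMAF: credit every child whose move shows up later in the playout
        # for the same colour, as if it had been played first.
        for colour, cells in rave_cells.items():
            for cell in cells:
                child = node.children.get(cell)
                if child is not None and child.player == colour:
                    child.N_amaf += 1
                    child.Q_amaf += 1 if colour == winner else 0
```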

Now rewrite lines 23–31 of gui.py by adding RaveMctsAgent to the AGENT dictionary:
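The change itself can be as small as registering the new class; the import paths and dictionary keys below are assumptions based on the file names mentioned in this article, not the exact contents of gui.py:

```python
# gui.py (module and key names here are assumptions, not the exact repo code)
from uct_mctsagent import UctMctsAgent
from rave_mctsagent import RaveMctsAgent

# The GUI looks up the selected agent class by its menu name.
AGENT = {
    "uct": UctMctsAgent,
    "rave": RaveMctsAgent,
}
```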

That’s it. We have now implemented the RAVE algorithm in our Monte Carlo Tree Search, which can come up with better moves during the game. Although the RAVE rollouts and backups take more time per iteration, the tree warms up a lot faster.

In the next articles, we will implement different simulation strategies that further improve performance.

References:

[1] D. P. Helmbold and A. Parker-Wood, “All-Moves-As-First Heuristics in Monte-Carlo Go,” in Proc. Int. Conf. Artif. Intell., Las Vegas, Nevada, 2009, pp. 605–610.

[2] C. B. Browne et al., “A Survey of Monte Carlo Tree Search Methods,” in IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, March 2012, doi: 10.1109/TCIAIG.2012.2186810.
