Monte Carlo Tree Search: Improving the rollouts using the Decisive Move, Last Good Reply (LGR), and PoolRAVE algorithms.

We can improve the quality of the simulations by using a good policy for move selection. In this episode, we are going to explain and implement the Decisive Move, Last Good Reply (LGR), and PoolRAVE algorithms.

Masoud Masoumi Moghadam
4 min read · Sep 3, 2023

Hello again. We’re back with another article explaining MCTS and how we can make it stronger in gameplay. So far we have implemented the UCT and RAVE algorithms in the previous parts. The code in this article builds on the RAVE implementation, so I recommend making sure you understand the code from those parts first.

In this article, we review and implement three helpful algorithms that are easy to combine with our RaveMctsAgent module.

Decisive Move

In every game there are situations where an action can immediately result in a win or is otherwise critical for winning. Fabien and Olivier Teytaud call these decisive moves [1].

They suggest the following algorithm in their paper:

Pseudocode for the decisive-move playout (image from [1])

In other words, when choosing an action during the playout, we first check whether any of the available moves would end the game in a win or a loss. If such a move is found, it is a critical move and we play it; if not, we pick a move with the default policy (e.g., random selection).

To code this, we need a mechanism for detecting critical moves. So we open the file gamestate.py and add a function, would_lose, which checks whether the selected cell would result in a win for color or not:
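A minimal sketch of such a check could look like the one below. The place_stone and winner calls are assumptions made for illustration; the actual GameState API in the repository may use different names.

```python
from copy import deepcopy

def would_lose(self, cell, color):
    """Return True if playing `cell` gives `color` an immediate win
    (i.e. the other player would lose by leaving that cell open).

    Sketch only: it assumes the game state can be deep-copied, that
    `place_stone(cell, color)` is a hypothetical helper that puts a
    stone of `color` on `cell`, and that `winner()` returns the winning
    color or None.
    """
    probe = deepcopy(self)            # never disturb the live state
    probe.place_stone(cell, color)    # try the move on the copy
    return probe.winner() == color
```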

Now we can implement the DecisiveMoveAgent:

As you can see, during the playout we choose moves one by one, and at each step we check whether any of the available moves is critical (lines 12–24, with the check itself on line 18). If none is found, we choose randomly (lines 26–29). Finally, after the state transition, we swap the players' good moves (line 33), and then the RAVE part follows (lines 38–43).
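As a rough, condensed illustration of that playout loop (not line-for-line identical to the code above; the moves, turn, other, play, and winner calls are assumed helper names, and players are assumed to be encoded as 1 and 2):

```python
import random

def decisive_roll_out(state):
    """Condensed sketch of a decisive-move playout with illustrative names."""
    moves = list(state.moves())            # remaining empty cells
    played = {1: [], 2: []}                # moves per player, for the RAVE update

    while state.winner() is None and moves:
        to_move = state.turn()
        opponent = state.other(to_move)
        move = None
        for cell in moves:
            # decisive move: it either wins for us or must be blocked right now
            if state.would_lose(cell, to_move) or state.would_lose(cell, opponent):
                move = cell
                break
        if move is None:                   # no critical move: default (random) policy
            move = random.choice(moves)
        state.play(move)                   # state transition, turn swaps inside
        played[to_move].append(move)
        moves.remove(move)

    return state.winner(), played          # the RAVE backup consumes `played`
```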

Last Good Reply (LGR)

This algorithm was originally introduced by Baier and Drake in 2010 [2]. The idea is simple but effective: some replies to an opponent's actions tend to be successful. So, after each simulation, we keep the opponent's actions and the succeeding replies in memory. In the next simulations we reuse those stored replies (with some probability) as if they were good immediate answers to the opponent's actions.

Here’s the code for LGRMctsAgent:

We keep the white and black replies in dictionaries and initialize them in the __init__ and set_gamestate methods (lines 5–6 and 15–16).

In the roll_out method, we initialize current_reply and other_reply as variables to keep track of each player's replies. Then, during move selection, with probability 0.5 we choose the previously stored reply if one exists (the LGR policy, lines 36–40); otherwise we fall back to the default policy.
The RAVE section starts on lines 55–60. Then, at the end, we update the moves and their stored replies.
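Putting those pieces together, here is a compressed sketch of the LGR-specific parts only. The attribute names, the players-encoded-as-1/2 assumption, and the GameState calls follow the earlier sketches and are not guaranteed to match the actual gist.

```python
import random

class LGRSketch:
    """Only the LGR additions; inheritance from the RAVE agent is omitted."""

    def __init__(self, state):
        self.root_state = state
        self.black_reply = {}          # opponent move -> black's last winning reply
        self.white_reply = {}          # opponent move -> white's last winning reply

    def set_gamestate(self, state):
        self.root_state = state
        self.black_reply = {}          # a new game invalidates the stored replies
        self.white_reply = {}

    def roll_out(self, state):
        moves = list(state.moves())
        replies = {1: self.black_reply, 2: self.white_reply}   # assumes black=1, white=2
        history = {1: [], 2: []}       # (previous opponent move, our reply) pairs
        last_move = None

        while state.winner() is None and moves:
            to_move = state.turn()
            stored = replies[to_move].get(last_move)
            # LGR policy: with probability 0.5 reuse the stored reply if it is legal
            if stored in moves and random.random() < 0.5:
                move = stored
            else:
                move = random.choice(moves)
            state.play(move)
            history[to_move].append((last_move, move))
            moves.remove(move)
            last_move = move

        winner = state.winner()
        # remember the winner's replies for future simulations
        for prev, reply in history[winner]:
            if prev is not None:
                replies[winner][prev] = reply
        return winner, history
```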

PoolRave

Hoock et al. [3] describe the poolRAVE enhancement, which modifies the MCTS simulation step as follows (a short sketch follows the list):

- Build a pool of the k best moves according to RAVE.

- Choose one move m from the pool.

- Play m with probability p; otherwise, use the default policy.
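A minimal sketch of that selection step, with placeholder values for k and p and a plain dictionary standing in for the RAVE statistics:

```python
import random

def poolrave_choice(rave_score, moves, k=10, p=0.5):
    """Pick a simulation move the poolRAVE way (k and p are placeholders)."""
    # pool of the k best moves according to their RAVE values
    pool = sorted(moves, key=lambda m: rave_score.get(m, 0.0), reverse=True)[:k]
    if pool and random.random() < p:
        return random.choice(pool)     # play a move m drawn from the pool
    return random.choice(moves)        # otherwise fall back to the default policy
```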

It was found to yield improvements for Havannah and Go programs [3], [4].

Now let’s have a look at the code:

We initialize the class with black_rave and white_rave (lines 5–6 and 15–16). In the roll_out method, we sort the moves and pick the k best ones according to their poolRAVE scores (lines 26–44). Then we run the rollout until we reach the end of the game (lines 45–60).

In lines 65–91, we keep the RAVE points for the black and white players. If the rollout finished in a win, we increase the poolRAVE score of each move the winner played (lines 71–74 for black, 83–86 for white); if not, we decrease it (lines 75–79 for black, 87–91 for white).
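A tiny sketch of that post-rollout update, with an illustrative step size and 'black' / 'white' as assumed player labels:

```python
def update_pool_scores(black_rave, white_rave, black_moves, white_moves,
                       winner, step=0.1):
    """Reward the winner's rollout moves and penalise the loser's (sketch)."""
    for move in black_moves:
        delta = step if winner == 'black' else -step
        black_rave[move] = black_rave.get(move, 0.0) + delta
    for move in white_moves:
        delta = step if winner == 'white' else -step
        white_rave[move] = white_rave.get(move, 0.0) + delta
```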

Conclusion

All of the algorithms introduced in this article are improvements dedicated to the simulation phase of MCTS. They demand more processing time, so we will see fewer simulations per second compared to plain UCT and RAVE, but they help the tree grow smarter. In the next article, we are going to implement another algorithm, called Quality-Based Rewards, which modifies both the backup and roll_out phases and is applicable to both UCT and RAVE.

References

  • [1] F. Teytaud and O. Teytaud, “On the huge benefit of decisive moves in Monte-Carlo Tree Search algorithms,” Proceedings of the 2010 IEEE Conference on Computational Intelligence and Games, Copenhagen, Denmark, 2010, pp. 359–364, doi: 10.1109/ITW.2010.5593334.
  • [2] H. Baier and P. D. Drake, “The Power of Forgetting: Improving the Last-Good-Reply Policy in Monte Carlo Go,” in IEEE Transactions on Computational Intelligence and AI in Games, vol. 2, no. 4, pp. 303–309, Dec. 2010, doi: 10.1109/TCIAIG.2010.2100396.
  • [3] J.-B. Hoock, C.-S. Lee, A. Rimmel, F. Teytaud, O. Teytaud, and M.-H. Wang, “Intelligent Agents for the Game of Go,” in IEEE Computational Intelligence Magazine, vol. 5, no. 4, pp. 28–42, Nov. 2010.
  • [4] C. B. Browne et al., “A Survey of Monte Carlo Tree Search Methods,” in IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, March 2012, doi: 10.1109/TCIAIG.2012.2186810.
