Advantage Actor Critic!

Hey guys! After about a week of researching it, I finally got the A2C (Advantage Actor Critic) algorithm working on Snake! My previous post on neural networks featured an AI learning to play Snake with a DQN. The DQN was a LOT simpler to implement than this, but A2C's performance and training speed turned out much better. For my implementation of A2C, I decided to use two separate networks (some people implement it with one shared network). The math and pseudocode were SO MUCH HARDER to understand than Q-learning; in hindsight, I probably should have researched policy gradient methods before diving into this.

A2C explanation:

(My implementation of) A2C basically functions with two neural networks: an actor and a critic. The actor's job is to output a probability distribution over actions and learn to pick the actions that maximize long-term reward. The critic's job is to evaluate how good a state is, and its value estimates tell the actor how much better or worse an action turned out than expected. It's like a parent and child scenario. In a regular DQN, there is only one neural network, which estimates the action-value function (the Q value), and actions are chosen by picking the one with the highest Q value (with epsilon-greedy exploration, of course). That makes the implementation much easier to do from scratch, which is why I was able to write the DQN in vanilla Processing.
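To make the actor/critic split concrete, here's a rough sketch of the one-step update both networks train on. This is a pure-Python illustration with made-up numbers, not my actual DL4J code; in the real thing, `v_s`, `v_s_next`, and `pi_a` would come out of the critic and actor networks.

```python
import math

# One-step advantage actor-critic update targets (illustrative numbers).
gamma = 0.95          # discount factor (same value used in my run)
reward = 1.0          # e.g. the snake just ate food
v_s = 0.4             # critic's value estimate for the current state
v_s_next = 0.6        # critic's value estimate for the next state
pi_a = 0.5            # actor's probability for the action actually taken

# TD target and advantage: how much better the action was than expected.
td_target = reward + gamma * v_s_next
advantage = td_target - v_s

# Losses each network would be trained on:
actor_loss = -math.log(pi_a) * advantage   # policy gradient, scaled by advantage
critic_loss = advantage ** 2               # regression toward the TD target

print(round(advantage, 4))    # 1.17
print(round(actor_loss, 4))   # ~0.811
```

The advantage is the whole trick: instead of pushing the actor toward whatever got reward, you push it proportionally to how much the outcome beat the critic's expectation.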

However, for A2C, the network architecture and gradient computation are much more complex, so I had to migrate to Processing for Eclipse and the DL4J library.

Now, the hyperparameters:

  • 11 inputs (same as before), two hidden layers of 128 neurons each with LeakyReLU activation, and two output heads: a softmax head for the actor (3 outputs, one per action) and an identity head for the critic (1 output, the state-value estimate).
  • gamma = 0.95, learning rate = 0.001, 20x20 board (same as last time)
  • one update per game
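The two output heads behave differently: the softmax head turns the actor's 3 raw outputs into action probabilities, while the identity head passes the critic's single value through unchanged. Here's a small sketch of that (pure Python, made-up logits; sampling from the policy is my assumption of how actions get chosen, versus epsilon-greedy in a DQN):

```python
import math
import random

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs from the shared hidden layers:
actor_logits = [2.0, 0.5, -1.0]   # 3 actions, e.g. straight / left / right
critic_output = 0.42              # identity activation: state value, unchanged

probs = softmax(actor_logits)     # sums to 1, so it's a valid distribution
action = random.choices(range(3), weights=probs)[0]
print(probs, action)
```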

I noticed some pretty significant performance differences between the networks: the DQN took 800 games to reach a max score of 21, while the actor-critic network reached a max score of 23 after only 350 games. Another important thing to note is that the DQN used experience replay and this one didn't, so the algorithm ran drastically faster (from 30 fps up to 240), and far fewer updates were needed to reach this level of play.
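The "one update per game" setup means you play a whole game, remember the rewards, and then compute the discounted return for every step at once before doing a single training pass. A minimal sketch of that return computation (the reward values are illustrative, not from my actual Snake run):

```python
gamma = 0.95  # same discount factor as above

def discounted_returns(rewards, gamma):
    # Walk the episode backwards so each step folds in everything after it
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# e.g. the snake ate food mid-game, then ran into itself at the end
episode_rewards = [0.0, 0.0, 1.0, 0.0, -1.0]
print(discounted_returns(episode_rewards, gamma))
```

Walking backwards makes this O(n) for the whole episode, which is part of why skipping experience replay made each game so cheap to train on.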

Here is a picture (from a model I saved and loaded):

If anyone sees this, have a great day! I might do PPO or something in the future, but I might just take a break for now.