Proximal Policy Optimization!

Hey guys! So I decided that I wasn’t gonna take a break after all (lol), because I wanted to challenge myself and code an AI for Pong with PPO in only one day, learning as I go. Since I’d never exposed myself to such a big-boy algorithm before, I was expecting it to be the ultimate challenge, but it’s actually really similar to A2C and A3C. The main differences are a fixed trajectory length (a set number of steps in the environment before each training update), a couple of extra hyperparameters, like lambda (a smoothing parameter) for generalized advantage estimation (new!!), and a new loss function. The big idea is that we don’t want huge policy updates, so the ratio of how likely an action is under the new policy versus the old one gets clipped to stay inside a small trust region. Beyond that, the loss function looks a lot like the one from A2C.
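In case that word salad made no sense, here’s roughly what the clipped loss looks like in PyTorch. This is just a sketch of my understanding, not my exact training code, and all the tensor names are made up:

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01, entropy=None):
    # ratio of "how likely the action is now" vs "how likely it was before the update"
    ratio = torch.exp(new_log_probs - old_log_probs)

    # clipped surrogate objective: don't let the ratio leave [1 - eps, 1 + eps]
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # value loss (and optional entropy bonus), same flavor as A2C
    value_loss = (returns - values).pow(2).mean()
    loss = policy_loss + value_coef * value_loss
    if entropy is not None:
        loss = loss - entropy_coef * entropy.mean()
    return loss
```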

Bruh, the math notation was wack as heck, and I spent a good hour or two just trying to decipher what was up with all the letters with hats. After I kind of understood what was going on, I tried implementing it, but of course it didn’t work. I pretty much failed my one-day challenge, though honestly I expected that. I went at it again in the morning, and I was able to get the algorithm to work!
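For anyone else staring at the hats: the hatted A’s are the generalized advantage estimates. Here’s roughly how I ended up computing them for each fixed-length trajectory. Again, just a sketch (assuming you’ve already collected rewards, value predictions, and done flags as arrays), not my exact code:

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    # values[t] is V(s_t); last_value is V(s) for the state right after the last step
    advantages = np.zeros(len(rewards), dtype=np.float32)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]                                       # cut the trace at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]  # one-step TD error
        running = delta + gamma * lam * running * mask              # lambda-smoothed sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages
```

The returns used in the value loss are then just the advantages plus the predicted values.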

However, its performance was dogwater bruh when I tried putting it on snake and pong. It did WORSE than A3C on snake, getting a max score of 12 after 125 episodes before going haywire. It worked kind of well on pong though, beating a simple bot I coded 4 times out of 10 after 200 episodes (which is better than me, since I could only manage it twice out of 10).

Then again, I was kind of foolish hoping that my pong AI would beat a bot that plays basically perfectly (it can’t win every time, but still), especially with only 6 inputs.

I am thinking about looking into convolutions in the future, and maybe throwing some convolutional layers onto PPO to make it work better (PPO is a powerful algorithm and is supposed to handle lots of inputs well), but I don’t know the first thing about CNNs, so that will be the next thing I research.

Here’s a picture of pong:

If anyone sees this, have a great day!