I started following the Deep Learning Curriculum (DLC) written by Jacob Hilton. Here is what I experienced and learnt from the exercise in Topic 6 - Reinforcement Learning. My solution is in the Colab notebook T6-RL-solution.ipynb.

It took me around 40 hours to finish the exercise. I first spent around 15 hours on the ARENA RL exercises to get familiar with the different components of RL, and then spent the remaining 25 hours on the DLC exercise.

To implement and debug the RL algorithm, I referred to the posts Debugging RL, Without the Agonizing Pain and The 37 Implementation Details of Proximal Policy Optimization.

I used a Colab Pro+ environment for background execution and more compute. The experiments were run on one V100 GPU.

I have generated a result report in Weights & Biases. It shows:

1. Episode return increases over training.
2. A reasonable fraction of ratios is clipped by PPO.
3. Approximate KL is small and fairly stable.
4. Policy entropy (relative entropy) falls gradually.
5. Value residual variance (1 - value explained variance) tends to a positive value.
6. The mean and standard deviation used for advantage normalization are fairly stable, and the mean is close to zero.
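For readers who want to log similar diagnostics, here is a minimal sketch of how the clip fraction, approximate KL, policy entropy, and advantage-normalization statistics could be computed. It is not the code from my notebook: the function names, the default `clip_eps=0.2`, and the use of plain Python lists instead of tensors are assumptions made for illustration. The approximate KL uses the `(ratio - 1) - log(ratio)` estimator described in The 37 Implementation Details of Proximal Policy Optimization.

```python
import math


def ppo_diagnostics(new_logp, old_logp, clip_eps=0.2):
    """Return (clip_fraction, approx_kl) for a batch of sampled actions.

    new_logp / old_logp are log-probabilities of the taken actions under the
    current policy and under the policy that collected the data.
    clip_eps=0.2 is an assumed default, not a value taken from the report.
    """
    log_ratios = [n - o for n, o in zip(new_logp, old_logp)]
    ratios = [math.exp(lr) for lr in log_ratios]
    # Fraction of samples whose ratio lies outside [1 - eps, 1 + eps].
    clip_fraction = sum(abs(r - 1.0) > clip_eps for r in ratios) / len(ratios)
    # Estimator of KL(old || new) from samples of the old policy.
    approx_kl = sum((r - 1.0) - lr for r, lr in zip(ratios, log_ratios)) / len(ratios)
    return clip_fraction, approx_kl


def categorical_entropy(probs):
    """Entropy (in nats) of one categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalize_advantages(advantages, eps=1e-8):
    """Return (normalized advantages, batch mean, batch std) for logging."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    std = math.sqrt(var)
    normalized = [(a - mean) / (std + eps) for a in advantages]
    return normalized, mean, std
```

In a real training loop these would typically be computed on tensors once per minibatch and averaged per update before logging.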

A closer look at the episode return shows that the IMPALA model performs better than a traditional CNN, especially in the bigfish environment.

Beyond these results, I found it quite helpful to first implement the PPO algorithm on simple custom probe environments. It also helps to reach decent performance on CartPole and Atari environments before moving on to the harder Procgen environments.
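To give a sense of what such probe environments can look like, here is a minimal sketch of three of them. The class names are hypothetical, and the `reset()` / `step()` methods only mimic a simplified classic Gym-style interface (`obs, reward, done, info`); they are not the actual Gym or Gymnasium API and not the exact environments from my notebook. Each probe isolates one part of the algorithm, so a failure points to a specific component.

```python
import random


class ConstantRewardProbe:
    """One-step episodes, a single observation, reward always +1.

    After training, the value estimate for the only observation should be
    close to 1.0. If it is not, the value loss or optimizer is suspect.
    """

    def reset(self):
        return 0.0

    def step(self, action):
        # obs, reward, done, info
        return 0.0, 1.0, True, {}


class ObservationRewardProbe:
    """One-step episodes where the reward equals the observation (+1 or -1).

    Checks that the value network can learn observation-dependent values.
    """

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.obs = 0.0

    def reset(self):
        self.obs = self.rng.choice([-1.0, 1.0])
        return self.obs

    def step(self, action):
        return 0.0, self.obs, True, {}


class ActionRewardProbe:
    """One-step episodes with two actions: action 1 pays +1, action 0 pays -1.

    Checks that the policy update actually shifts probability to action 1.
    """

    def reset(self):
        return 0.0

    def step(self, action):
        reward = 1.0 if action == 1 else -1.0
        return 0.0, reward, True, {}
```

Because every episode ends after one step, these probes train in seconds, which makes them a cheap first check before CartPole.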

Tracking metrics extensively also helps to debug where things went wrong or well, though relying on any single one of them might be inadequate. Earlier, I was confused by the residual variance oscillating in the CartPole environment, i.e. the residual variance didn’t go smoothly towards a positive value, but oscillated wildly. It turns out that this can be due to the inherent simplicity of the CartPole environment: it doesn’t need much value estimation and relies more on planning. This causes the value estimation to be unstable.
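For reference, here is a minimal sketch of the residual variance metric, taken as 1 - explained variance, i.e. the variance of `returns - values` divided by the variance of the returns. The function name and the choice to return NaN when the returns have zero variance are assumptions for illustration.

```python
def residual_variance(values, returns):
    """Var(returns - values) / Var(returns), i.e. 1 - explained variance.

    0.0 means the value function predicts the returns perfectly; 1.0 means it
    does no better than predicting a constant. Returns NaN when the returns
    have zero variance, since the ratio is undefined in that case.
    """

    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    var_returns = variance(returns)
    if var_returns == 0.0:
        return float("nan")
    residuals = [r - v for r, v in zip(returns, values)]
    return variance(residuals) / var_returns
```

One thing the formula makes visible is that the denominator is the batch variance of the returns, so when that variance is small, small changes in the value predictions can move the metric a lot.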

Overall, I found this exercise quite helpful for understanding the PPO algorithm and the general structure of RL algorithms.