Reinforcement learning algorithms are ridiculously fragile. Older deep Q-learning algorithms may fail to train simply because of a different random seed. Newer algorithms such as PPO and SAC are much more stable: well-configured implementations can train on a wide range of tasks with the same or similar hyper-parameters. Because of that, these algorithms are increasingly used in real-world settings, rather than remaining purely a research interest.
Implementing PPO and SAC from scratch correctly, however, can still be challenging. The wonderful blog post The 37 Implementation Details of Proximal Policy Optimization covered many aspects of this. Tianshou’s training book also contains many gems. The problem, as 37 Implementation Details pointed out, can be traced to the fact that many libraries have logic distributed across different places, which makes it harder to understand what tricks an author applied to reach a certain score.
I am going to highlight a few more implementation details here that are not covered in 37 Implementation Details. Hopefully this will help other RL practitioners implement these algorithms.
Action Scaling and Clipping
37 Implementation Details mentioned action clipping, and casually noted that “the original unclipped action is stored as part of the episodic data”. What this means is that when computing log probabilities on the distribution, we should use the unclipped action value. Using the clipped action value would make training unstable, causing NaNs in the middle of a run, because the clipped action value can drift too far from the distribution’s mean.
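A minimal sketch of this bookkeeping, using plain Python and an illustrative 1-D Gaussian (the function and variable names are mine, not from any particular library):

```python
import math

def gaussian_log_prob(x, mu, sigma):
    # Element-wise log-density of N(mu, sigma) at x.
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

mu, sigma = 0.0, 1.0
raw_action = 2.7                              # sampled from the policy; store THIS in the buffer
env_action = max(-1.0, min(1.0, raw_action))  # clipped copy, used only to step the environment

# During the PPO update, evaluate log-prob on the stored, unclipped action.
log_prob = gaussian_log_prob(raw_action, mu, sigma)
# Using env_action here instead would bias the probability ratio and can
# eventually produce NaNs as clipped values drift away from the mean.
```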
While clipping to the action range directly works for simpler tasks such as InvertedPendulum or Ant, for Humanoid it is better to clip to the range [-1, 1] and then scale to the action range. The latter approach proved more stable across all MuJoCo tasks.
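A sketch of the clip-then-scale mapping (the function name and the example action range are mine):

```python
import numpy as np

def scale_action(raw_action, low, high):
    # Clip the policy output to [-1, 1] first, then affinely map it
    # onto the environment's action range [low, high].
    clipped = np.clip(raw_action, -1.0, 1.0)
    return low + (clipped + 1.0) * 0.5 * (high - low)

# Hypothetical action range, e.g. a torque limit of +/-0.4.
low, high = np.array([-0.4]), np.array([0.4])
action = scale_action(np.array([2.0]), low, high)  # saturates at the upper bound
```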
Clipping Conditional Sigma
SAC, when trained on continuous action spaces such as the MuJoCo tasks, needs to sample from a Gaussian distribution with a conditional sigma. The conditional sigma is derived from the input and conveniently shares the same network as the action mean. It turns out to be helpful to constrain the range of sigma when sampling actions; this avoids NaNs when sigma grows too large. A common choice is [-20, 2]. Did you notice the -20 on the lower bound? That’s because we apply exp to the sigma value (thus, the network actually produces “log sigmas”), which is reflected in the range.
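As a sketch, the clamp happens on the log-sigma before exponentiation (bounds as quoted above; the helper name is mine):

```python
import math

LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0

def bounded_sigma(log_sigma):
    # The network outputs log-sigma; clamp to [-20, 2] before exp, so the
    # resulting sigma lies in [exp(-20), exp(2)] and sampling never NaNs.
    clamped = min(max(log_sigma, LOG_STD_MIN), LOG_STD_MAX)
    return math.exp(clamped)
```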
Updates to Reward Normalization (or “Reward Scaling”)
Depending on your implementation, you may collect data from the vectorized environments into many micro-batches, which sometimes makes it easier to treat truncated and terminated returns differently. Reward normalization collects statistics (the standard deviation) from the returns in these micro-batches, and those statistics are then used to normalize subsequent returns. It is recommended to update the statistics on a per-batch basis: for the current batch, always use the statistics calibrated from the previous batches. Keeping the statistics updated within the current batch and applying them directly to that same batch can cause further instabilities during training.
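A simplified sketch of the update order (a running pooled variance without mean tracking; the class and method names are mine):

```python
import numpy as np

class ReturnNormalizer:
    """Normalizes each batch of returns with statistics frozen BEFORE that batch."""

    def __init__(self, eps=1e-8):
        self.var = 1.0    # running variance estimate; starts as a no-op scale
        self.count = 0
        self.eps = eps

    def normalize(self, returns):
        # Step 1: scale using statistics calibrated from previous batches only.
        scaled = returns / (np.sqrt(self.var) + self.eps)
        # Step 2: only afterwards fold this batch into the running statistics.
        n = len(returns)
        total = self.count + n
        self.var = (self.var * self.count + np.var(returns) * n) / total
        self.count = total
        return scaled
```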
Without observation normalization, clipping is not needed. After normalization, however, clipping helps avoid sharp changes in losses, making training smoother.
Logarithmic Probability (Entropy) on Gaussian and Corrected for Tanh Squashing
Log probabilities are computed element-wise. This is not a problem for PPO because there is no correction term and it is easy to balance its actor losses (only two terms: one from the “entropy” (log probability), one to regularize the sampling sigma). For SAC, it is not so easy. Entropies are corrected for tanh squashing and mixed with rewards, resulting in a target / loss with 3 to 4 terms. This matters especially if you implement the log_prob function yourself.
Summing along the feature dimension matches the SAC paper. Mismatching this (i.e. averaging for log_prob but summing for the correction terms) can cause sudden spikes in critic / actor losses that are hard to recover from.
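A NumPy sketch, assuming the pre-tanh sample is available (the stable log(1 − tanh²) identity via softplus is standard; the function name is mine):

```python
import numpy as np

def squashed_log_prob(pre_tanh, mu, sigma):
    # Element-wise Gaussian log-density of the pre-tanh sample u ~ N(mu, sigma).
    log_prob = (-0.5 * ((pre_tanh - mu) / sigma) ** 2
                - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
    # Change-of-variables term log(1 - tanh(u)^2), written in the numerically
    # stable form 2 * (log 2 - u - softplus(-2u)).
    correction = 2.0 * (np.log(2.0) - pre_tanh - np.log1p(np.exp(-2.0 * pre_tanh)))
    # Sum BOTH terms over the feature (action) dimension; averaging one while
    # summing the other is exactly the mismatch described above.
    return np.sum(log_prob - correction, axis=-1)
```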
Reward Scaling
This is different from “reward normalization” in PPO. SAC computes the current target value as n-step rewards + future value + action entropy. Reward scaling here refers to applying a coefficient to the n-step rewards to balance the critics’ estimation against the near-term reward. This is important for SAC, again because of the entropy term’s scale. Higher reward scaling results in more exploitation, while lower reward scaling results in more exploration.
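A toy illustration of where the coefficient enters the target (the scalar values and the name reward_scale are mine; gamma and alpha are the usual discount factor and entropy temperature):

```python
def sac_target(reward, next_q, next_log_prob, gamma=0.99, alpha=0.2, reward_scale=5.0):
    # Soft Q target: scaled n-step reward plus the entropy-regularized future
    # value. A larger reward_scale shrinks the entropy term's relative weight,
    # pushing the policy toward exploitation.
    return reward_scale * reward + gamma * (next_q - alpha * next_log_prob)

low = sac_target(1.0, 0.5, -1.0, reward_scale=1.0)
high = sac_target(1.0, 0.5, -1.0, reward_scale=5.0)
```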