Precompute log probability in PPO (#430)
Merged
keisuke-nakata merged 6 commits into chainer:master on Apr 17, 2019
Conversation
Author (Member): Seems like this PR (PPO ParallelLink Precompute) further improves PPO over #427.
Author (Member): [Learning-curve plots comparing results before and after this PR.]
Author (Member): After the changes in #427 (stop scaling weight initialization at the value function and stop decaying clip eps), I ran the comparison again with 10 random seeds each, and confirmed again that this PR improves performance.
Merge #427 before this PR.

Before this PR, for every minibatch (e.g. 64 transitions) in a batch (e.g. 2048 transitions) used for an iteration of PPO, both the current and old models are evaluated to compute the probability ratio. This PR instead precomputes the outputs of the old model once per batch, so that only the current model is evaluated for each minibatch. The reasons behind this PR are:
- `obs_normalizer` is updated after collecting a batch of transitions and before updating the policy. To correctly implement importance sampling, the policy used for collecting transitions should also be the one used to compute the importance weights. Thus, it is better to compute the probabilities before updating `obs_normalizer`, which can affect the policy. https://github.com/chainer/chainerrl/blob/master/chainerrl/agents/ppo.py#L225
- OpenAI Baselines' PPO2 takes the same approach with its precomputed `neglogpacs`: https://github.com/openai/baselines/blob/master/baselines/ppo2/runner.py#L33

This PR also adds `n_updates` to PPO's statistics to help debugging.

Todos
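To illustrate the idea, here is a minimal, hypothetical sketch (not ChainerRL's actual implementation; the function and variable names are invented): the old (behavior) policy's log-probabilities are computed once on the full batch, before `obs_normalizer` or the policy parameters change, and then reused to form the probability ratio in every minibatch update.

```python
import numpy as np

def precompute_old_log_probs(old_log_prob_fn, observations, actions):
    """One full-batch forward pass through the old policy, done before
    obs_normalizer or the policy parameters are updated."""
    return old_log_prob_fn(observations, actions)

def ppo_ratio(current_log_probs, old_log_probs):
    """Importance weight pi_theta(a|s) / pi_theta_old(a|s), computed in
    log space for numerical stability."""
    return np.exp(current_log_probs - old_log_probs)

# With a batch of 2048 transitions, minibatches of 64, and 10 epochs,
# the old model previously ran on (2048 // 64) * 10 = 320 minibatches;
# with precomputation it runs exactly once, on the full batch.
```

Besides halving the per-minibatch forward passes, this ordering keeps the importance weights consistent with the policy that actually collected the data, which is the correctness argument made above.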