Batch PPO Implementation #295

Merged: toslunar merged 38 commits into chainer:master from ljvmiranda921:batch-ppo on Nov 12, 2018
Conversation
- This reverts commit a2b4c8a.
- This commit adds `# NOQA` comments to some top-level imports in order to please flake8 (specifically E402). Signed-off-by: ljvmiranda921 <ljvmiranda@gmail.com>
- This commit adds a batch implementation of the Proximal Policy Optimization algorithm. It is meant to interact with the `VecEnv` environment in `envs.vec_env.py`; the `batch_act()` and `batch_observe()` methods are implemented to achieve this task.
- This commit adds a `train_agent_batch` function in the experiments module.
- This commit adds another test class, `TestBatchPPO`, to test the batch implementation of PPO.
- This commit adds a gym example for the batch PPO implementation.
- This commit refactors `batch_act` and `batch_observe` into `batch_act_and_train` and `batch_observe_and_train`; an additional set of `batch_act` and `batch_observe` methods is also implemented. The idea is that during training we use all the `*_and_train` methods, similar to the standard API, and call their counterparts during testing/evaluation.
- This commit fixes the BatchPPO algorithm by applying a set of accumulators to keep the episode memories in check and prevent leaking episodes without advantage computations. A deque was also implemented in `train_agent_batch` to control the resolution of the reported `mean_r`.
- This commit implements the `BatchEvaluator` and updates `train_agent_batch` so that the agent is evaluated at some timestep.
- This commit fixes the return value of `_batch_act` in order to handle the bug in Pendulum-v0.
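The deque mentioned in the `mean_r` commit can be sketched as follows. This is a minimal illustration, not the PR's code: the names `record_episode`, `mean_r`, and the window size of 100 are assumptions.

```python
from collections import deque

# Bounded buffer of recent episode returns: old episodes fall off the
# left end, so the reported mean tracks current performance.
recent_returns = deque(maxlen=100)  # maxlen sets the resolution of the mean

def record_episode(episode_r):
    """Store the return of a finished episode."""
    recent_returns.append(episode_r)

def mean_r():
    """Mean return over the most recent episodes (0.0 if none yet)."""
    return sum(recent_returns) / len(recent_returns) if recent_returns else 0.0
```

With `maxlen=100`, the reported mean reflects only the last 100 episodes rather than the whole training history.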
muupan reviewed Sep 10, 2018
```python
env = gym.make(args.env)
# Use different random seeds for train and test envs
env_seed = 2 ** 32 - 1 - args.seed if test else args.seed
env.seed(env_seed)
```
Each env in VectorEnv should be assigned a different random seed. See train_a3c_gym.py for an example of how to assign them.
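One way to derive a distinct seed per env, in the spirit of the reviewer's suggestion. This is a sketch: the offsetting scheme and the name `make_env_seed` are assumptions, not the exact code in train_a3c_gym.py.

```python
def make_env_seed(process_idx, base_seed, test):
    """Distinct seed for each parallel env.

    Test envs draw from the opposite end of the 32-bit seed space so
    train and test seeds never collide.
    """
    if test:
        return 2 ** 32 - 1 - base_seed - process_idx
    return base_seed + process_idx

# Each env in the VectorEnv would then be seeded with, e.g.,
# env.seed(make_env_seed(i, args.seed, test)) for i in range(num_envs).
```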
muupan reviewed Sep 10, 2018
```python
# Start new episode for those with mask
episode_r *= masks
episode_len *= masks
t += 1
```
t should be the total number of transitions experienced so far, so it should increase by num_envs, not by 1. That way we can keep the other hyperparameters fixed and only change num_envs to trade CPUs for computation time.
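The point can be illustrated with a toy loop (hypothetical names): each synchronized step of the vectorized env yields `num_envs` transitions, so the global counter should advance by that amount, keeping step-based hyperparameters (total steps, eval intervals) comparable across different `num_envs` values.

```python
def count_transitions(num_envs, steps):
    """Run until at least `steps` total transitions have been experienced."""
    t = 0
    iterations = 0
    while t < steps:
        # one synchronized step across all parallel envs
        t += num_envs        # not t += 1: every env produced a transition
        iterations += 1
    return t, iterations
```

With 4 envs, the same transition budget is reached in a quarter of the loop iterations, which is exactly the CPU-for-time trade the reviewer describes.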
Refactor branch: https://github.com/ljvmiranda921/chainerrl/pull/13
Description
This PR is built on top of @iory's A2C implementation in #149. It provides a batch/parallel implementation of Proximal Policy Optimization, using #149's `VecEnv` environment to achieve this task. Here are the main changes:

- `batch_act()` and `batch_observe()` methods
- `chainerrl.experiments.train_agent_batch()`

Status: In progress

Changes in data structure
Previously, the computation resided in the `self.memory` and `self.last_episode` attributes of PPO. Now we're also using `self.batch_memory` to handle this task during a batch run, and `self.last_episode` is restructured in the same way.

New methods: `batch_act()` and `batch_observe()`

These methods handle all batch computations during the run. The method `batch_act(batch_obs)` performs a set of actions given a set of observations, while `batch_observe(obs, r, done, info)` updates the model. This assumes that the environment in `VectorEnv` sends a `reset` signal in the form of a dictionary entry in `info`.
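A minimal sketch of how `batch_act()` and `batch_observe()` interact with a vectorized env. All class and variable names here are stand-ins for illustration, not ChainerRL code; the only assumption carried over from the description is that `step` returns batched `(obs, r, done, info)` with a `reset` entry in each `info` dict.

```python
import numpy as np

class DummyVecEnv:
    """Toy stand-in for VectorEnv: fixed-length episodes, reward 1 per step."""

    def __init__(self, num_envs=2, episode_len=3):
        self.num_envs = num_envs
        self.episode_len = episode_len
        self.t = np.zeros(num_envs, dtype=int)

    def reset(self):
        self.t[:] = 0
        return np.zeros(self.num_envs)

    def step(self, actions):
        self.t += 1
        done = self.t >= self.episode_len
        obs = self.t.astype(float)
        rewards = np.ones(self.num_envs)
        # Finished envs signal a reset via info, then start a new episode.
        infos = [{"reset": bool(d)} for d in done]
        self.t[done] = 0
        return obs, rewards, done, infos

class RandomBatchAgent:
    """Stand-in agent exposing the batch_act/batch_observe interface."""

    def batch_act(self, batch_obs):
        return [0 for _ in batch_obs]  # one (dummy) action per observation

    def batch_observe(self, batch_obs, batch_r, batch_done, batch_info):
        pass  # a real agent would store transitions and update the model here

env = DummyVecEnv()
agent = RandomBatchAgent()
obs = env.reset()
total_r = 0.0
for _ in range(6):
    actions = agent.batch_act(obs)       # batched action selection
    obs, rs, dones, infos = env.step(actions)
    agent.batch_observe(obs, rs, dones, infos)  # batched model update
    total_r += rs.sum()
```

The loop never calls `env.reset()` mid-training; episode boundaries are communicated entirely through `done` and the `reset` flags in `infos`.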