Batch PPO Implementation #295

Merged: toslunar merged 38 commits into chainer:master from ljvmiranda921:batch-ppo on Nov 12, 2018
Conversation
- This reverts commit a2b4c8a.
- This commit adds `# NOQA` comments to some top-level imports in order to please flake8 (specifically E402). Signed-off-by: ljvmiranda921 <ljvmiranda@gmail.com>
- This commit adds a batch implementation of the Proximal Policy Optimization algorithm. It is meant to interact with the `VecEnv` environment in `envs.vec_env.py`; the `batch_act()` and `batch_observe()` methods are implemented to achieve this task.
- This commit adds a `train_agent_batch` function in the experiments module.
- This commit adds another test class, `TestBatchPPO`, to test the batch implementation of PPO.
- This commit adds a gym example for the batch PPO implementation.
- This commit refactors `batch_act` and `batch_observe` into `batch_act_and_train` and `batch_observe_and_train`; an additional set of `batch_act` and `batch_observe` methods is also implemented. The idea is that during training we use all the `*_and_train` methods, similar to the standard API, and call their counterparts during testing/evaluation.
- This commit fixes the BatchPPO algorithm by applying a set of accumulators to keep the episode memories in check and prevent leaking episodes without advantage computations. A deque was also implemented in `train_agent_batch` to control the resolution of the reported `mean_r`.
- This commit implements the `BatchEvaluator` and updates `train_agent_batch` so that the agent is evaluated at some timestep.
- This commit fixes the return value of `_batch_act` in order to handle the bug in Pendulum-v0.
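The deque mentioned in the `mean_r` commit can be sketched as follows. This is a minimal illustration, not the PR's code: the names `record_episode`, `mean_r`, and the window size of 100 are assumptions.

```python
from collections import deque

# Bounded buffer of recent episode returns: old episodes fall off the
# left end, so the reported mean tracks current performance.
recent_returns = deque(maxlen=100)  # maxlen sets the resolution of the mean

def record_episode(episode_r):
    """Store the return of a finished episode."""
    recent_returns.append(episode_r)

def mean_r():
    """Mean return over the most recent episodes (0.0 if none yet)."""
    return sum(recent_returns) / len(recent_returns) if recent_returns else 0.0
```

With `maxlen=100`, the reported mean reflects only the last 100 episodes rather than the whole training history.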
muupan reviewed Sep 10, 2018
```python
env = gym.make(args.env)
# Use different random seeds for train and test envs
env_seed = 2 ** 32 - 1 - args.seed if test else args.seed
env.seed(env_seed)
```
Each env in VectorEnv should be assigned a different random seed. See train_a3c_gym.py for an example of how to assign them.
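One way to derive a distinct seed per env, in the spirit of the reviewer's suggestion. This is a sketch: the offsetting scheme and the name `make_env_seed` are assumptions, not the exact code in train_a3c_gym.py.

```python
def make_env_seed(process_idx, base_seed, test):
    """Distinct seed for each parallel env.

    Test envs draw from the opposite end of the 32-bit seed space so
    train and test seeds never collide.
    """
    if test:
        return 2 ** 32 - 1 - base_seed - process_idx
    return base_seed + process_idx

# Each env in the VectorEnv would then be seeded with, e.g.,
# env.seed(make_env_seed(i, args.seed, test)) for i in range(num_envs).
```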
muupan reviewed Sep 10, 2018
```python
# Start new episode for those with mask
episode_r *= masks
episode_len *= masks
t += 1
```
t should be the total number of transitions experienced so far, so it should increase by num_envs, not by 1. That way we can keep the other hyperparameters fixed and only change num_envs to trade CPUs for computation time.
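The point can be illustrated with a toy loop (hypothetical names): each synchronized step of the vectorized env yields `num_envs` transitions, so the global counter should advance by that amount, keeping step-based hyperparameters (total steps, eval intervals) comparable across different `num_envs` values.

```python
def count_transitions(num_envs, steps):
    """Run until at least `steps` total transitions have been experienced."""
    t = 0
    iterations = 0
    while t < steps:
        # one synchronized step across all parallel envs
        t += num_envs        # not t += 1: every env produced a transition
        iterations += 1
    return t, iterations
```

With 4 envs, the same transition budget is reached in a quarter of the loop iterations, which is exactly the CPU-for-time trade the reviewer describes.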
Refactor branch: https://github.com/ljvmiranda921/chainerrl/pull/13
Description
This PR is built on top of @iory's A2C implementation in #149. It provides a batch/parallel implementation of Proximal Policy Optimization, using #149's `VecEnv` environment to achieve this task. Here are the main changes:

- `batch_act()` and `batch_observe()` methods
- `chainerrl.experiments.train_agent_batch()`

Status: In progress

Changes in data structure
Previously, the computation resided in the `self.memory` and `self.last_episode` attributes of PPO. Now we're also using `self.batch_memory` to handle this task during a batch run, and `self.last_episode` is restructured in the same way.

New methods: `batch_act()` and `batch_observe()`

These methods handle all batch computations during the run. The method `batch_act(batch_obs)` performs a set of actions given a set of observations, while `batch_observe(obs, r, done, info)` updates the model. This assumes that the environment in `VectorEnv` sends a `reset` signal in the form of a dictionary entry in `info`.
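A minimal sketch of how `batch_act()` and `batch_observe()` interact with a vectorized env. All class and variable names here are stand-ins for illustration, not ChainerRL code; the only assumption carried over from the description is that `step` returns batched `(obs, r, done, info)` with a `reset` entry in each `info` dict.

```python
import numpy as np

class DummyVecEnv:
    """Toy stand-in for VectorEnv: fixed-length episodes, reward 1 per step."""

    def __init__(self, num_envs=2, episode_len=3):
        self.num_envs = num_envs
        self.episode_len = episode_len
        self.t = np.zeros(num_envs, dtype=int)

    def reset(self):
        self.t[:] = 0
        return np.zeros(self.num_envs)

    def step(self, actions):
        self.t += 1
        done = self.t >= self.episode_len
        obs = self.t.astype(float)
        rewards = np.ones(self.num_envs)
        # Finished envs signal a reset via info, then start a new episode.
        infos = [{"reset": bool(d)} for d in done]
        self.t[done] = 0
        return obs, rewards, done, infos

class RandomBatchAgent:
    """Stand-in agent exposing the batch_act/batch_observe interface."""

    def batch_act(self, batch_obs):
        return [0 for _ in batch_obs]  # one (dummy) action per observation

    def batch_observe(self, batch_obs, batch_r, batch_done, batch_info):
        pass  # a real agent would store transitions and update the model here

env = DummyVecEnv()
agent = RandomBatchAgent()
obs = env.reset()
total_r = 0.0
for _ in range(6):
    actions = agent.batch_act(obs)       # batched action selection
    obs, rs, dones, infos = env.step(actions)
    agent.batch_observe(obs, rs, dones, infos)  # batched model update
    total_r += rs.sum()
```

The loop never calls `env.reset()` mid-training; episode boundaries are communicated entirely through `done` and the `reset` flags in `infos`.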