
Allow envs to send a 'needs_reset' signal#356

Merged
muupan merged 30 commits into chainer:master from muupan:continuing-time-limit
Dec 12, 2018

Conversation

@muupan
Member

@muupan muupan commented Nov 16, 2018

This PR enables an env to signal, via the info dict returned by env.step, that it needs a reset without setting done=True.

This functionality is needed to strictly follow DeepMind's training protocol on Atari games, which caps the number of frames per episode regardless of life losses, while the agent still sees episodes that terminate on a life loss.

  • add ContinuingTimeLimit wrapper
  • handle needs_reset signal in both training and evaluation
  • check how it affects DQN on Atari
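
The ContinuingTimeLimit wrapper described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the PR's exact implementation (the gym.Wrapper base class is omitted to keep the sketch dependency-free, and the names are chosen to match the PR's description):

```python
class ContinuingTimeLimit:
    """Cap episode length without setting done=True.

    After max_episode_steps calls to step(), info['needs_reset'] is set
    to True in the returned info dict while done is left untouched, so
    the agent still sees life-loss terminations as episode ends.
    """

    def __init__(self, env, max_episode_steps):
        self.env = env
        self._max_episode_steps = max_episode_steps
        self._elapsed_steps = None

    def step(self, action):
        assert self._elapsed_steps is not None, 'Call reset() before step()'
        obs, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        if self._elapsed_steps >= self._max_episode_steps:
            info['needs_reset'] = True
        return obs, reward, done, info

    def reset(self):
        self._elapsed_steps = 0
        return self.env.reset()
```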

@muupan
Member Author

muupan commented Nov 26, 2018

I confirmed that the change introduced by this PR does not affect the scores on Atari.

old: examples/ale/train_dqn_ale.py --eval-interval 1000000 --env {env_id} (before this PR)
new: examples/ale/train_dqn_ale.py --eval-interval 1000000 --max-frames 108000 --env {env_id} (after this PR)

BeamRiderNoFrameskip-v4
BreakoutNoFrameskip-v4
SeaquestNoFrameskip-v4

@muupan muupan changed the title [WIP] Allow envs to send a 'needs_reset' signal Allow envs to send a 'needs_reset' signal Nov 26, 2018
Contributor

@prabhatnagarajan prabhatnagarajan left a comment

I still have to dig deeper in another review.

if done or episode_len == max_episode_len or t == steps:
reset = (episode_len == max_episode_len
or info.get('needs_reset', False))
if done or reset or t == steps:
Contributor

Later in this "if" statement (https://github.com/chainer/chainerrl/pull/356/files#diff-a2caf3ec0e2750a1d16edb375789daa5R81), you reset the environment. Why do you reset the environment if done is true? What if reset = False?

Member Author

@muupan muupan Nov 27, 2018

Even if reset == False, we need to call env.reset() if done == True. In other words, the reset variable can be false when we reset the env due to done == True. It is possible to rename reset as non_done_reset or something, but it would be verbose.
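
The distinction can be condensed into a small helper (a hypothetical sketch based on the quoted diff, not code from the PR):

```python
def needs_env_reset(done, episode_len, max_episode_len, info):
    # env.reset() must be called when the env terminated (done), and
    # also on a non-done reset: hitting the episode-length limit or
    # receiving the 'needs_reset' signal via the info dict.
    reset = (episode_len == max_episode_len
             or info.get('needs_reset', False))
    return done or reset
```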

Contributor

So this is targeted primarily at environments that reset based off of a max-episode-length or via done, and not environments where done is True but you still do not reset the environment?

Member Author

Correct. Currently, ChainerRL assumes that env.reset must be called when done==True.

outdir, global_t, local_t, episode_r)
logger.info('statistics:%s', agent.get_statistics())

# Evalaute the current agent
Contributor

Nit: Evaluate*, not "Evalaute"

@@ -86,9 +88,6 @@ def train_agent_batch(agent, env, steps, outdir, log_interval=None,
# 5. reset the env to start a new episode
Contributor

Should these comments be revised?

Member Author

I think these comments are still correct except that 3-5 are skipped when training is finished. I'll clarify this in the comments.


from gym import spaces

import chainerrl
Contributor

Since you're changing the file, perhaps we should change the header from "This file is a fork from a MIT-licensed project" to "This file adapted from an MIT-licensed project..."

Member Author

"fork" means it already has changes, no?

each episode, except that done=False is returned and that
info['needs_reset'] is set to True when past the limit.

Code that calls env.step is repsonsible for checking the info dict, the
Contributor

"responsible" - typo

@muupan muupan mentioned this pull request Nov 27, 2018
parser.add_argument('--max-episode-len', type=int,
default=30 * 60 * 60 // 4, # 30 minutes with 60/4 fps
help='Maximum number of timesteps for each episode.')
parser.add_argument('--max-frames', type=int,
Contributor

Is train_dqn_ale the only file that should have max_frames? At least atari/train_dqn.py should also incorporate these changes, right? Why did you only change the DQN example?

Member Author

Other examples that use atari_wrappers are affected as well. If the changes to examples/ale/train_dqn_ale.py look ok to you, I can apply the same changes to other examples as well, though I believe they would work without changes.

Contributor

Sounds good. I think we should still make the changes to other examples using atari_wrappers for completeness/consistency.

Contributor

@prabhatnagarajan prabhatnagarajan left a comment

I've read through this PR, and everything seems to be okay. I was slightly confused by some of the tests, but they all seemed fine. This is ready for approval, but please make the following minor changes before merging with master:

  • Address the comments (e.g., typos)
  • Make the changes to train_dqn_ale.py apply to other relevant files

@muupan
Member Author

muupan commented Dec 7, 2018

I addressed the comments and made the same change to other examples.

Contributor

@prabhatnagarajan prabhatnagarajan left a comment

LGTM.

@muupan muupan merged commit ea69e24 into chainer:master Dec 12, 2018
@muupan muupan deleted the continuing-time-limit branch December 12, 2018 06:30
@muupan muupan added this to the v0.6 milestone Feb 26, 2019