
[cherry-pick] Fix group rewards for POCA, add warning for non-POCA trainers #5120


Merged
ervteng merged 2 commits into release_15_branch from release-group-rew-fix on Mar 16, 2021

Conversation

@ervteng (Contributor) commented on Mar 15, 2021

Proposed change(s)

Cherry-pick #5113

Types of change(s)

  • Bug fix
  • New feature
  • Code refactor
  • Breaking change
  • Documentation update
  • Other (please describe)

Checklist

  • Added tests that prove my fix is effective or that my feature works
  • Updated the changelog (if applicable)
  • Updated the documentation (if applicable)
  • Updated the migration guide (if applicable)

Other comments

…5113)

* Fix end episode for POCA, add warning for group reward if not POCA

* Add missing imports
@ervteng ervteng requested a review from andrewcoh March 15, 2021 23:28
@ervteng ervteng changed the base branch from main to release_15_branch March 15, 2021 23:28
    Warn if the trainer receives a Group Reward but isn't a multiagent trainer (e.g. POCA).
    """
    if not self._has_warned_group_rewards:
        group_reward = np.sum(buffer[BufferKey.GROUP_REWARD])
Contributor:

is sum enough?

Suggested change:
-        group_reward = np.sum(buffer[BufferKey.GROUP_REWARD])
+        group_reward = np.sum(np.abs(buffer[BufferKey.GROUP_REWARD]))

Contributor:

If the user only uses group penalties, it will not be caught.
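A minimal numpy sketch of the point above, with made-up values: when the group rewards are penalty-only, the plain sum stays at or below zero, so a `> 0.0` check never fires, while an absolute sum (or np.any) does catch them.

import numpy as np

# Penalty-only group rewards: the plain sum is negative.
group_rewards = np.array([0.0, -0.5, 0.0, -1.0])

print(np.sum(group_rewards) > 0.0)           # False -> warning would be missed
print(np.sum(np.abs(group_rewards)) > 0.0)   # True  -> penalties detected
print(bool(np.any(group_rewards)))           # True  -> also detected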

Contributor:

Nice catch

ervteng (Author):

Replaced with np.any, which is faster and doesn't have this issue.
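For reference, a standalone sketch of the np.any-based check described here; the function name, logger setup, and message text are illustrative assumptions, not the exact code merged in this PR.

import logging

import numpy as np

logger = logging.getLogger(__name__)


def warn_if_group_reward(group_rewards: np.ndarray, already_warned: bool) -> bool:
    """Return True once the warning has been emitted (now or earlier)."""
    if already_warned:
        return True
    # np.any catches positive rewards and penalties alike, unlike a plain sum.
    if np.any(group_rewards):
        logger.warning(
            "Group rewards were received, but this trainer is not a "
            "multiagent trainer (e.g. POCA); they will be ignored."
        )
        return True
    return False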

"""
Warn if the trainer receives a Group Reward but isn't a multiagent trainer (e.g. POCA).
"""
if not self._has_warned_group_rewards:
Contributor:

If there are no group rewards at all, this check will run every time a trajectory is processed. I wonder if we should only check the first n trajectories (although that might have its own issues).

ervteng (Author):

Yeah, we could set n to a large number (like 1000) and it should cover most of the cases. It wouldn't catch situations where the group reward is sparse and doesn't occur until halfway through the training session. But then again, in that case is the warning actually useful? 🤔
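For illustration only, a sketch of the "check only the first n trajectories" idea floated here; the cap, class, and counter names are hypothetical, and this thread does not show whether anything like it was adopted.

import numpy as np

MAX_CHECKED_TRAJECTORIES = 1000  # the "large number" mentioned above


class GroupRewardWarningTracker:
    """Stops inspecting group rewards after a fixed number of trajectories."""

    def __init__(self) -> None:
        self.trajectories_checked = 0
        self.warned = False

    def should_warn(self, group_rewards: np.ndarray) -> bool:
        if self.warned or self.trajectories_checked >= MAX_CHECKED_TRAJECTORIES:
            return False
        self.trajectories_checked += 1
        if np.any(group_rewards):
            self.warned = True
            return True
        return False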

ervteng (Author):

a = np.ones(10240)

%timeit np.sum(np.abs(a)) # 11.8 us
%timeit np.any(a) # 7.51 us
%timeit np.count_nonzero(a) # 23.2 us

It looks like if we go with np.any() we can speed this up a bit. For a trajectory of 1000 elements (the longest we have), each check costs us about 3.5 us. np.count_nonzero is also great for shorter arrays (<2000-ish elements) but doesn't scale well.
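The %timeit lines above are IPython magics; a rough standard-library equivalent (timings will of course vary by machine and array size) would be:

import timeit

import numpy as np

a = np.ones(10240)
n = 10_000

for label, fn in [
    ("np.sum(np.abs(a))", lambda: np.sum(np.abs(a))),
    ("np.any(a)", lambda: np.any(a)),
    ("np.count_nonzero(a)", lambda: np.count_nonzero(a)),
]:
    per_call_us = timeit.timeit(fn, number=n) / n * 1e6
    print(f"{label}: {per_call_us:.2f} us per call")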

"""
if not self._has_warned_group_rewards:
group_reward = np.sum(buffer[BufferKey.GROUP_REWARD])
if group_reward > 0.0:
Contributor:

Will there be floating point issues? I think we should do something like:

Suggested change:
-        if group_reward > 0.0:
+        if group_reward > self.EPSILON:  # Do not use this suggestion as-is; there is no static EPSILON in RLTrainer

ervteng (Author):

There shouldn't be, as an unassigned group reward is exactly 0.0
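A small numpy illustration of that point, with made-up values: an unassigned group reward is stored as a literal 0.0 rather than the result of arithmetic, so an exact comparison (or np.any) needs no epsilon.

import numpy as np

unassigned = np.zeros(1000)       # every entry is exactly 0.0
print(bool(np.any(unassigned)))   # False -> no warning

assigned = unassigned.copy()
assigned[500] = 0.1               # a single real group reward
print(bool(np.any(assigned)))     # True  -> warning fires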

@ervteng ervteng merged commit 7cbb832 into release_15_branch Mar 16, 2021
@delete-merged-branch delete-merged-branch bot deleted the release-group-rew-fix branch March 16, 2021 21:17
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 17, 2022