[cherry-pick] Fix group rewards for POCA, add warning for non-POCA trainers #5120
Conversation
…5113)

* Fix end episode for POCA, add warning for group reward if not POCA
* Add missing imports
```python
        Warn if the trainer receives a Group Reward but isn't a multiagent trainer (e.g. POCA).
        """
        if not self._has_warned_group_rewards:
            group_reward = np.sum(buffer[BufferKey.GROUP_REWARD])
```
is sum enough?
```diff
- group_reward = np.sum(buffer[BufferKey.GROUP_REWARD])
+ group_reward = np.sum(np.abs(buffer[BufferKey.GROUP_REWARD]))
```
If the user only uses group penalties, it will not be caught.
Nice catch
Replaced with `np.any`, which is faster and doesn't have this issue.
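A small standalone sketch of the point made above: a `sum > 0` check misses buffers that contain only group penalties, while `np.any` flags any nonzero entry regardless of sign (the array values here are made up for illustration):

```python
import numpy as np

# Hypothetical group-reward buffers (illustrative values, not the ML-Agents buffer API):
penalties_only = np.array([0.0, -1.0, 0.0, -0.5])  # only group penalties assigned
no_group_reward = np.zeros(4)                       # no group rewards at all

# The sum-based check misses pure penalties: the sum is negative, not > 0.
print(np.sum(penalties_only) > 0.0)   # False -> warning would not fire

# np.any flags any nonzero entry, positive or negative.
print(np.any(penalties_only))         # True  -> warning fires
print(np.any(no_group_reward))        # False
```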
""" | ||
Warn if the trainer receives a Group Reward but isn't a multiagent trainer (e.g. POCA). | ||
""" | ||
if not self._has_warned_group_rewards: |
If there are no group rewards at all, this check runs every time a trajectory is processed. I wonder if we should only check the first n trajectories (although that might have issues of its own).
Yeah, we could set `n` to a large number (like 1000) and it should cover most of the cases. It wouldn't catch situations where the group reward is sparse and doesn't occur until halfway through the training session. But then again, in that case is the warning actually useful? 🤔
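The "check only the first n trajectories" idea could be sketched like this. All names here are hypothetical, not the actual `RLTrainer` implementation, and the known limitation discussed above (sparse rewards appearing after the budget is spent) is visible in the early return:

```python
import numpy as np

class GroupRewardWarnerSketch:
    """Hypothetical sketch: warn once on group rewards, but stop
    checking after MAX_CHECKS processed trajectories."""

    MAX_CHECKS = 1000  # illustrative value for "n" from the discussion

    def __init__(self) -> None:
        self._has_warned_group_rewards = False
        self._trajectories_checked = 0

    def maybe_warn(self, group_rewards: np.ndarray) -> bool:
        """Return True if a warning should be emitted for this trajectory."""
        if self._has_warned_group_rewards:
            return False
        if self._trajectories_checked >= self.MAX_CHECKS:
            # Sparse group rewards arriving after this point are never caught.
            return False
        self._trajectories_checked += 1
        if np.any(group_rewards):
            self._has_warned_group_rewards = True
            return True
        return False
```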
```python
a = np.ones(10240)
%timeit np.sum(np.abs(a))    # 11.8 us
%timeit np.any(a)            # 7.51 us
%timeit np.count_nonzero(a)  # 23.2 us
```

It looks like if we go with `np.any()` we can speed this up a bit. For a trajectory of 1000 elements (the longest we have) each check costs us about 3.5 us. `np.count_nonzero` is also great for shorter arrays (under ~2000 elements) but doesn't scale well.
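For anyone outside IPython, the `%timeit` comparison above can be reproduced with the standard `timeit` module; absolute numbers vary by machine, only the relative ordering matters:

```python
import timeit
import numpy as np

# Plain-Python version of the micro-benchmark in the comment above.
a = np.ones(10240)
checks = {
    "sum(abs) > 0":      lambda: np.sum(np.abs(a)) > 0.0,
    "any":               lambda: bool(np.any(a)),
    "count_nonzero > 0": lambda: np.count_nonzero(a) > 0,
}
for name, fn in checks.items():
    per_call_us = timeit.timeit(fn, number=2000) / 2000 * 1e6
    print(f"{name:18s} {per_call_us:7.2f} us")
```

All three expressions agree on whether any nonzero group reward is present; they differ only in speed.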
""" | ||
if not self._has_warned_group_rewards: | ||
group_reward = np.sum(buffer[BufferKey.GROUP_REWARD]) | ||
if group_reward > 0.0: |
Will there be floating point issues? I think we should do something like:

```diff
- if group_reward > 0.0:
+ if group_reward > self.EPSILON:  # Do not use this suggestion; there is no static EPSILON in RLTrainer
```
There shouldn't be, as an unassigned group reward is exactly 0.0
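A quick demonstration of why no epsilon is needed here: an unassigned group reward is written as exactly `0.0`, so there is no floating-point noise to threshold away, and `np.any` still detects arbitrarily small assigned rewards (the array below is illustrative):

```python
import numpy as np

# Unassigned group rewards are exactly 0.0, not merely "close to" zero.
unassigned = np.zeros(1000, dtype=np.float32)
print(np.sum(unassigned) == 0.0)   # True: exact equality holds
print(np.any(unassigned))          # False: nothing to warn about

# Any genuinely assigned reward, however small, is still detected.
unassigned[500] = 1e-7
print(np.any(unassigned))          # True
```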
Proposed change(s)
Cherry-pick #5113
Types of change(s)
Checklist
Other comments