[cherry-pick] Fix group rewards for POCA, add warning for non-POCA trainers #5120
Conversation
…5113)

* Fix end episode for POCA, add warning for group reward if not POCA
* Add missing imports
```python
        Warn if the trainer receives a Group Reward but isn't a multiagent trainer (e.g. POCA).
        """
        if not self._has_warned_group_rewards:
            group_reward = np.sum(buffer[BufferKey.GROUP_REWARD])
```
is sum enough?
```diff
- group_reward = np.sum(buffer[BufferKey.GROUP_REWARD])
+ group_reward = np.sum(np.abs(buffer[BufferKey.GROUP_REWARD]))
```
If the user only uses group penalties, it will not be caught.
Nice catch
Replaced with `np.any`, which is faster and doesn't have this issue.
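A small standalone sketch of the point made above: a `sum > 0` check misses buffers that contain only group penalties, while `np.any` flags any nonzero entry regardless of sign (the array values here are made up for illustration):

```python
import numpy as np

# Hypothetical group-reward buffers (illustrative values, not the ML-Agents buffer API):
penalties_only = np.array([0.0, -1.0, 0.0, -0.5])  # only group penalties assigned
no_group_reward = np.zeros(4)                       # no group rewards at all

# The sum-based check misses pure penalties: the sum is negative, not > 0.
print(np.sum(penalties_only) > 0.0)   # False -> warning would not fire

# np.any flags any nonzero entry, positive or negative.
print(np.any(penalties_only))         # True  -> warning fires
print(np.any(no_group_reward))        # False
```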
""" | ||
Warn if the trainer receives a Group Reward but isn't a multiagent trainer (e.g. POCA). | ||
""" | ||
if not self._has_warned_group_rewards: |
If there are no group rewards at all, this check runs every time a trajectory is processed. I wonder if we should only check the first n trajectories (although that might have issues of its own).
Yeah, we could set `n` to a large number (like 1000) and it should cover most of the cases. It wouldn't catch situations where the group reward is sparse and doesn't occur until halfway through the training session. But then again, in that case is the warning actually useful? 🤔
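The "check only the first n trajectories" idea could be sketched like this. All names here are hypothetical, not the actual `RLTrainer` implementation, and the known limitation discussed above (sparse rewards appearing after the budget is spent) is visible in the early return:

```python
import numpy as np

class GroupRewardWarnerSketch:
    """Hypothetical sketch: warn once on group rewards, but stop
    checking after MAX_CHECKS processed trajectories."""

    MAX_CHECKS = 1000  # illustrative value for "n" from the discussion

    def __init__(self) -> None:
        self._has_warned_group_rewards = False
        self._trajectories_checked = 0

    def maybe_warn(self, group_rewards: np.ndarray) -> bool:
        """Return True if a warning should be emitted for this trajectory."""
        if self._has_warned_group_rewards:
            return False
        if self._trajectories_checked >= self.MAX_CHECKS:
            # Sparse group rewards arriving after this point are never caught.
            return False
        self._trajectories_checked += 1
        if np.any(group_rewards):
            self._has_warned_group_rewards = True
            return True
        return False
```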
```python
a = np.ones(10240)
%timeit np.sum(np.abs(a))    # 11.8 us
%timeit np.any(a)            # 7.51 us
%timeit np.count_nonzero(a)  # 23.2 us
```

It looks like if we go with `np.any()` we can speed this up a bit. For a trajectory of 1000 elements (the longest we have) each check costs us about 3.5 us. `np.count_nonzero` is also great for shorter arrays (under ~2000 elements) but doesn't scale well.
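For anyone outside IPython, the `%timeit` comparison above can be reproduced with the standard `timeit` module; absolute numbers vary by machine, only the relative ordering matters:

```python
import timeit
import numpy as np

# Plain-Python version of the micro-benchmark in the comment above.
a = np.ones(10240)
checks = {
    "sum(abs) > 0":      lambda: np.sum(np.abs(a)) > 0.0,
    "any":               lambda: bool(np.any(a)),
    "count_nonzero > 0": lambda: np.count_nonzero(a) > 0,
}
for name, fn in checks.items():
    per_call_us = timeit.timeit(fn, number=2000) / 2000 * 1e6
    print(f"{name:18s} {per_call_us:7.2f} us")
```

All three expressions agree on whether any nonzero group reward is present; they differ only in speed.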
""" | ||
if not self._has_warned_group_rewards: | ||
group_reward = np.sum(buffer[BufferKey.GROUP_REWARD]) | ||
if group_reward > 0.0: |
Will there be floating point issues? I think we should do something like:

```diff
- if group_reward > 0.0:
+ if group_reward > self.EPSILON:  # Do not use this suggestion; there is no static EPSILON in RLTrainer
```
There shouldn't be, as an unassigned group reward is exactly 0.0
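A quick demonstration of why no epsilon is needed here: an unassigned group reward is written as exactly `0.0`, so there is no floating-point noise to threshold away, and `np.any` still detects arbitrarily small assigned rewards (the array below is illustrative):

```python
import numpy as np

# Unassigned group rewards are exactly 0.0, not merely "close to" zero.
unassigned = np.zeros(1000, dtype=np.float32)
print(np.sum(unassigned) == 0.0)   # True: exact equality holds
print(np.any(unassigned))          # False: nothing to warn about

# Any genuinely assigned reward, however small, is still detected.
unassigned[500] = 1e-7
print(np.any(unassigned))          # True
```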
Proposed change(s)
Cherry-pick #5113
Types of change(s)
Checklist
Other comments