This repository was archived by the owner on Oct 7, 2024. It is now read-only.

Commit beb1630

DeepMind authored and copybara-github committed
Calculate best episode using the full episode return in cartpole_swingup.

Return is non-monotonic in this problem; the previous code cherry-picked the peak return reached mid-episode rather than the final episode return. The same change is applied to the base cartpole for consistency and efficiency, though cartpole's return is monotonic, so it was not a bug there.

PiperOrigin-RevId: 308033113
Change-Id: I9add00d41f8e87d518e00c3fef9cd9ad7ad18d0b
1 parent f9b74bf commit beb1630

File tree

2 files changed: +2, −2 lines


bsuite/environments/cartpole.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -145,9 +145,9 @@ def step(self, action):
     reward = 1. if is_reward else 0.
     self._raw_return += reward
     self._episode_return += reward
-    self._best_episode = max(self._episode_return, self._best_episode)

     if self._state.time_elapsed > self._max_time or not is_reward:
+      self._best_episode = max(self._episode_return, self._best_episode)
       self._reset_next_step = True
       return dm_env.termination(reward=reward, observation=self.observation)
     return dm_env.transition(reward=reward, observation=self.observation)
```

bsuite/experiments/cartpole_swingup/cartpole_swingup.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -111,11 +111,11 @@ def step(self, action):
     self._total_upright += 1
     self._raw_return += reward
     self._episode_return += reward
-    self._best_episode = max(self._episode_return, self._best_episode)

     is_end_of_episode = (self._state.time_elapsed > self._max_time
                          or np.abs(self._state.x) > self._x_threshold)
     if is_end_of_episode:
+      self._best_episode = max(self._episode_return, self._best_episode)
       self._reset_next_step = True
       return dm_env.termination(reward=reward, observation=self.observation)
     else:  # continuing transition.
```
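The effect of the fix can be seen in isolation with a minimal sketch (hypothetical reward sequence, standalone helper names not from the repo): updating the running maximum on every step records the mid-episode peak, while taking the maximum only at episode end records the true full-episode return. For a non-monotonic return, as in cartpole_swingup where per-step rewards can be negative, the two disagree.

```python
def best_episode_per_step(rewards):
    """Old behaviour: update the best-episode max on every step."""
    episode_return, best = 0.0, float("-inf")
    for r in rewards:
        episode_return += r
        best = max(best, episode_return)  # cherry-picks the mid-episode peak
    return best


def best_episode_at_end(rewards):
    """New behaviour: compare against the max only once the episode ends."""
    return sum(rewards)  # the full episode return


# Return rises to 2.0, then falls back to 1.0 by episode end.
rewards = [1.0, 1.0, -0.5, -0.5]
print(best_episode_per_step(rewards))  # 2.0 (inflated mid-episode peak)
print(best_episode_at_end(rewards))    # 1.0 (true episode return)
```

With a monotonic return (all non-negative rewards, as in base cartpole) both functions agree, which is why the commit message notes the base-cartpole change is for consistency rather than a bug fix.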
