Self play: Add a ghost trainer and track agents elo rating #1975
Conversation
- Change the rewards to promote competition
- Send a text observation with the winner of each match
- Use a Learning Brain and a ghost brain playing against each other
- Add logging to track the win rates and whether the agents are getting better

A ghost trainer manages several policies and assigns them randomly to several agents. These policies' graphs are loaded randomly from saved models of a master trainer on each academy reset. An elo parameter is added to the models to track their Elo rating, which is updated from the results of matches between agents, sent as text observations.
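To make the checkpoint swapping concrete, here is a minimal sketch of how a ghost policy could restore a randomly sampled checkpoint from the master trainer's model directory on an academy reset. It uses the TensorFlow 1.x `tf.train` API that ML-Agents used at the time; `GhostPolicy`, `swap_on_reset` and `model_dir` are illustrative names, not the actual implementation in this PR.

```python
import random
import tensorflow as tf


class GhostPolicy:
    """Illustrative stand-in for one frozen policy managed by the ghost trainer."""

    def __init__(self, graph: tf.Graph, sess: tf.Session, saver: tf.train.Saver):
        self.graph = graph
        self.sess = sess
        self.saver = saver
        self.current_ckpt = None


def swap_on_reset(policy: GhostPolicy, model_dir: str) -> None:
    """On academy reset, restore a randomly chosen checkpoint of the master trainer."""
    ckpt_state = tf.train.get_checkpoint_state(model_dir)
    if ckpt_state is None or not ckpt_state.all_model_checkpoint_paths:
        return  # the master trainer has not saved anything yet
    ckpt_path = random.choice(ckpt_state.all_model_checkpoint_paths)
    with policy.graph.as_default():
        policy.saver.restore(policy.sess, ckpt_path)
    policy.current_ckpt = ckpt_path
```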
Hi - this is awesome. We need a couple of weeks to review, but wanted to put it on your radar that we are looking at it.
Oki doki, thanks for the heads up :)
Hi @LeSphax, thanks for your contribution. I've tried to use it on my project but I'm having some issues and some questions.
Here's a screenshot of my learning scene: the red player has to score in the blue zone and vice-versa. Thanks,
Hi @AcelisWeaven, unfortunately I am on holiday without my computer and it has been a while since I looked at this, but let's see if I can help anyway :)
If that doesn't help, the way I debugged that kind of problem was to have only two agents training, not switch the brain between them, and log all the match results to check that they make sense.
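As an illustration of that debugging tip, here is a minimal sketch of the kind of match-result logging I mean, assuming results arrive as the `opponent_agent_id|match_result` strings described in this PR; the helper name and data structure are made up for the example.

```python
from collections import defaultdict

# win/loss counters per (agent_id, opponent_id) pair, purely for sanity checking
results = defaultdict(lambda: {"win": 0, "loss": 0})


def log_match_result(agent_id: str, text_obs: str) -> None:
    """Record one 'opponent_id|result' observation and print the running win rate."""
    opponent_id, result = text_obs.split("|")
    if result == "playing":
        return  # match still in progress, nothing to record
    results[(agent_id, opponent_id)][result] += 1
    counts = results[(agent_id, opponent_id)]
    total = counts["win"] + counts["loss"]
    print(f"{agent_id} vs {opponent_id}: {counts['win']}/{total} wins")
```

With only two agents and fixed brain assignments, one agent's win count against the other should mirror the other's loss count; any mismatch points at the result reporting.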
I will try to get hold of a computer this weekend and see if I can investigate that memory leak.
Thanks @LeSphax for your quick answer!
Edit: I also forgot to mention that I'm using Anaconda on Windows 10 to manage my env, and that I'm running tensorflow-cpu. I'm using your fork with the latest master updates (using a rebase).
… using them so the buffer just got bigger and bigger
Hey @AcelisWeaven, I finally spent some time investigating this memory leak (actually there were two). I believe my two latest commits should fix it. There is also a third memory leak that happens when you run only in inference mode, but it seems like that one was already in the framework in that version. To fix it I will need to update the PR to the latest develop branch.
Hey @LeSphax! I've been running it for 24h+ and everything works flawlessly. Thanks! Edit: The longer sessions allowed me to find some issues in my environment, and since I fixed those, the training process is a lot better. I still have a little issue after 48h+ of training (without a crash!), where nothing happens anymore. It may be coming from my environment so I'll take a closer look. Edit 2: It definitely comes from my environment (the ball can sometimes get stuck "inside" the wall).
```python
policy_used_ckpts = all_ckpts
# There is a 1-p_load_from_all_checkpoints probability that we sample the policy
# only from the last load_from_last_N_checkpoints policies
if random.random() > self.ghost_prob_sample_only_recent:
```
@LeSphax Shouldn't this condition be reversed, according to the explanation? To me it seems that it'll only have a 40% chance to pick from the recent checkpoints list.
ghost_prob_sample_only_recent: Probability that the ghost will only sample recent checkpoints instead of the whole history. 0.6 means 60% chance to sample recent and 40% to sample the whole history.
Suggested change:

```diff
- if random.random() > self.ghost_prob_sample_only_recent:
+ if random.random() < self.ghost_prob_sample_only_recent:
```
Sorry, this got automatically closed when we deleted the `develop` branch.
Changing target from develop to master.
This has diverged a lot from master. I wouldn't mind doing a rebase but I can't figure out why this PR wasn't merged/declined months ago and don't want to waste the effort if the maintainers have a reason why they didn't like this.
Hey @elliott-omosheye, I haven't worked on this PR in a while, but the last time I discussed it with the maintainers was in issue #2559, which describes the status at that time. I don't mind doing the rebase myself if/when we want to merge it.
This feature has been added in #3194. Thanks for the initial PR!
Does that mean this feature will be available in the next release?
Yes.
See #1070 for more context around this pull-request.
Ghost trainer
This pull request adds a new type of trainer, the ghost, which allows an agent to play against past versions of itself.
The ghost manages several policies, each holding a different past version of the agent.
The ghost brain doesn't update its weights; it is assigned to a master PPO trainer and uses that trainer's past checkpoints. On each academy reset, the ghost randomly samples from the PPO trainer's checkpoints and loads them into its policies.
Ghost configuration in `trainer_config.yaml`:

```yaml
TennisLearning:
  normalize: true
  max_steps: 2e5
  use_elo_rating: true

TennisGhost:
  trainer: ghost
  ghost_master_brain: TennisLearning
  ghost_num_policies: 3
  ghost_prob_sample_only_recent: 0.6
  ghost_recent_ckpts_threshold: 10
```
I also wanted to give some control over how checkpoints are sampled, so I added the `ghost_prob_sample_only_recent` and `ghost_recent_ckpts_threshold` parameters shown in the configuration above; a sketch of the sampling they control follows below.
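As a rough illustration (not the exact code in this PR), the sampling these two parameters control could look something like this, where `all_ckpts` is assumed to be the master trainer's checkpoint paths ordered from oldest to newest. It uses the `<` comparison suggested in the review discussion above.

```python
import random


def sample_ghost_checkpoint(all_ckpts, prob_sample_only_recent=0.6, recent_ckpts_threshold=10):
    """Pick one checkpoint path for a ghost policy.

    With probability `prob_sample_only_recent`, sample only from the most recent
    `recent_ckpts_threshold` checkpoints; otherwise sample from the whole history.
    """
    if not all_ckpts:
        raise ValueError("The master trainer has not saved any checkpoints yet.")
    if random.random() < prob_sample_only_recent:
        candidates = all_ckpts[-recent_ckpts_threshold:]
    else:
        candidates = all_ckpts
    return random.choice(candidates)
```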
The Unity side can decide which agent is using the ghost brain and which agent is using the learning brain. Changing those assignments is also supported.
Elo rating:
When using self-play, measuring rewards is not very helpful because better agents might be good at preventing each other from getting rewards.
One way to measure progression, then, is Elo rating.
Each agent has an Elo rating, and agents play matches against each other. The winner of a match gains rating and the loser loses rating; the amount they gain or lose depends on their respective ratings before the match.
The Elo rating is stored in the models, allowing each policy to know its own rating when loading a graph.
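For reference, here is a minimal sketch of the standard Elo update described above; the K-factor of 16 is just a common example value, not necessarily what this PR uses.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability) of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 16.0):
    """Return the new (rating_a, rating_b) after one decided match."""
    expected_a = elo_expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```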
The match results are sent by agents in the CollectObservations method as a text observation of the form `opponent_agent_id|match_result`, for example `"1234|win"`, `"1234|loss"` or `"4321|playing"`.
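On the Python side, turning those strings into rating updates could look roughly like this. This is only a sketch: it reuses the `update_elo` helper from the previous snippet and keeps ratings in an in-memory dict, whereas the PR stores the rating in the models themselves.

```python
ratings = {}  # agent_id -> Elo rating; new agents start at a default value


def process_match_result(agent_id: str, text_obs: str, default_rating: float = 1200.0) -> None:
    """Parse an 'opponent_id|result' observation and update both players' ratings."""
    opponent_id, result = text_obs.split("|")
    if result == "playing":
        return  # no rating change while the match is still running
    rating_a = ratings.get(agent_id, default_rating)
    rating_b = ratings.get(opponent_id, default_rating)
    ratings[agent_id], ratings[opponent_id] = update_elo(
        rating_a, rating_b, a_won=(result == "win")
    )
```

Since both players report the same match, a real implementation also needs to make sure each finished match is only applied once, for example by processing the report of one side only.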
Tennis Environment
What I just described was implemented on the Python side, and I also modified the Tennis example to use self-play.
I added the TennisGhost, some win-rate metrics to help me debug, and I changed the reward settings to encourage winning matches instead of encouraging as many passes as possible.
So you can test this on the Tennis environment. The results are a bit noisy, but it usually reaches an Elo rating of around 1240-1270 at 25,000 steps with the `trainer_config.yaml` above and `--save-freq=5000`. There are a few things I am not sure about. So this pull-request might not be complete enough yet, but I felt like I should ask for feedback to make sure that this is something that should be in ML-Agents before proceeding.
If anything doesn't make sense feel free to ask me :)