
Self play: Add a ghost trainer and track agents' Elo rating #1975


Closed
LeSphax wants to merge 6 commits

Conversation

@LeSphax (Contributor) commented Apr 23, 2019

See #1070 for more context around this pull-request.

Ghost trainer

This pull request adds a new type of trainer, the ghost, which allows an agent to play against past versions of itself.
The ghost manages several policies, each holding a different past version of the agent.
The ghost brain doesn't update its weights; it is assigned to a master PPO trainer and uses that trainer's past checkpoints. On each academy reset, the ghost randomly samples from the PPO trainer's checkpoints and loads them into its policies.

Ghost configuration in trainer_config.yaml:

TennisLearning:
    normalize: true
    max_steps: 2e5
    use_elo_rating: true

TennisGhost:
    trainer: ghost
    ghost_master_brain: TennisLearning
    ghost_num_policies: 3
    ghost_prob_sample_only_recent: 0.6
    ghost_recent_ckpts_threshold: 10

  • ghost_master_brain: The brain from which the ghost loads past checkpoints
  • ghost_num_policies: Number of policies that the ghost manages at the same time

I also wanted to give some control over the sampling of checkpoints, so I added these parameters (see the sketch after this list):

  • ghost_prob_sample_only_recent: Probability that the ghost will only sample recent checkpoints instead of the whole history. 0.6 means 60% chance to sample recent and 40% to sample the whole history.
  • ghost_recent_ckpts_threshold: Number of checkpoints that are considered recent.
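
To make this concrete, here is a minimal sketch of how a checkpoint could be picked for each ghost policy on an academy reset, assuming a list of checkpoint paths ordered from oldest to newest. The function name and signature are illustrative only, not the exact code in this PR:

    import random

    def sample_ghost_checkpoint(all_ckpts, prob_sample_only_recent, recent_ckpts_threshold):
        # With probability ghost_prob_sample_only_recent, restrict the pool to the
        # most recent ghost_recent_ckpts_threshold checkpoints; otherwise sample
        # from the whole history.
        if random.random() < prob_sample_only_recent:
            pool = all_ckpts[-recent_ckpts_threshold:]
        else:
            pool = all_ckpts
        return random.choice(pool)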

The Unity side can decide which agent uses the ghost brain and which agent uses the learning brain. Changing those assignments is also supported.

Elo rating

When using self-play, measured rewards are not very informative, because better agents also get better at preventing each other from earning rewards.
One way to measure progress instead is Elo rating.
Each agent has an Elo rating, and agents play matches against each other. The winner of a match gains rating and the loser loses rating; the amount gained or lost depends on their respective ratings before the match.

The Elo rating is stored in the model, so each policy knows its own rating when loading a graph.

Match results are sent by agents in the CollectObservations method as a text observation of the form "opponent_agent_id|match_result", for example "1234|win", "1234|loss" or "4321|playing".
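
For reference, this is roughly how such a result string can be parsed and how a standard Elo update works. The helper names and the K-factor of 32 are illustrative assumptions, not necessarily what this PR implements:

    def parse_match_result(text_obs):
        # "1234|win" -> ("1234", "win")
        opponent_id, result = text_obs.split("|")
        return opponent_id, result

    def update_elo(rating, opponent_rating, score, k=32.0):
        # score is 1.0 for a win, 0.0 for a loss, 0.5 for a draw;
        # a "playing" result would not trigger an update.
        expected = 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))
        return rating + k * (score - expected)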

Tennis Environment

Everything described above is implemented on the Python side, and I also modified the Tennis example to use self-play.
I added the TennisGhost brain and some win-rate metrics to help me debug, and I changed the reward settings to encourage winning matches rather than encouraging as many passes as possible.
So you can test this on the Tennis environment. The results are a bit noisy, but the agent usually reaches an Elo rating of around 1240-1270 at 25,000 steps with trainer_config.yaml and --save-freq=5000.

There are a few things I am not sure about:

  • At the moment, the match results are passed through text observations. This would make it hard for users to use text observations while using self-play. Should I add a field to the agent_info protobuf message instead?
  • I built self-play to work with two agents, but the match results will need a different format for team-vs-team play (maybe adding a match ID).
  • I modified the Tennis Environment to add self-play and put some additional metrics there to help me debug it. Not sure if it should stay like that or if there should be a self-play and non-self-play version.

So this pull request might not be complete enough yet, but I felt I should ask for feedback to make sure this is something that should be in ML-Agents before proceeding.
If anything doesn't make sense, feel free to ask me :)

LeSphax added 3 commits April 22, 2019 08:53
Change the rewards to promote competition
Send a text observation with the winner of each match
Use a Learning Brain and a ghost brain playing against each other
Add logging to track the win rates and whether the agents are getting better.
    A ghost trainer manages several policies and assigns them randomly to several agents.
    These policies' graphs are loaded randomly from saved models of a master trainer on each academy reset.
    Add an elo parameter to models to track their Elo rating.
    The Elo rating is updated from the results of matches between agents, received through text observations.
@LeSphax mentioned this pull request Apr 23, 2019
@shihzy (Contributor) commented Apr 30, 2019

Hi - this is awesome. We need a couple of weeks to review, but we wanted to put it on your radar that we are looking at it.

@LeSphax (Contributor, Author) commented May 2, 2019 via email

@Unity-Technologies deleted a comment May 2, 2019
@AcelisWeaven (Contributor) commented Jul 16, 2019

Hi @LeSphax,

Thanks for your contribution, I've tried to use it on my project but I'm having some issues and some questions.

  • The Elo rating keeps going down in my game. This doesn't happen in your modified Tennis environment, however. It may be because my game is a bit more complex. Do you have an idea of what could cause this? Here's a sample output from my training:
INFO:mlagents.trainers: league3-0: PlayerLearning: Step: 354000. Time Elapsed: 27112.426 s Mean Reward: 2.748. Std of Reward: 0.579. Elo Rating: 357.2 Training.
INFO:mlagents.trainers: league3-0: PlayerLearningGhost: Step: 354000. Time Elapsed: 27112.503 s Mean Reward: -2.748. Std of Reward: 0.579. Elo Rating: 1200.0 Not Training.
  • Do you have a solution to limit RAM usage? Even on the demo env, ml-agents allocates about 2 MB/second. After seven hours of training, it crashes because of this (I have 16 GB of RAM). Here's my env configuration, for reference:
PlayerLearning:
    batch_size: 1024
    buffer_size: 102400
    max_steps: 10000000
    hidden_units: 64
    num_layers: 3
    summary_freq: 1000
    use_elo_rating: true

PlayerLearningGhost:
    trainer: ghost
    ghost_master_brain: PlayerLearning
    ghost_num_policies: 3
    ghost_recent_ckpts_threshold: 10
    ghost_prob_sample_only_recent: 0.5
  • Is Curiosity supported?
  • Did you have any issues with Tensorboard? With Tennis I can see the graphs fine, but with my env nothing even shows up. Maybe that's an issue on my side.
  • Also... any plan for team training? :)

Here's a screenshot of my learning scene; the red player has to score in the blue zone and vice versa.
[screenshot of the learning scene]
Actions are discrete (move in 8 directions, jump)

Thanks,

@LeSphax (Contributor, Author) commented Jul 16, 2019

Hi @AcelisWeaven,

Unfortunately I am on holiday without my computer, and it has been a while since I looked at this, but let's see if I can help anyway :)

  • For the Elo rating going down, my guess would be that the match results are not attributed correctly. The line that matters for the Elo computation is SetTextObs(opponent.Id + "|" + matchState);

If that doesn't help, the way I debugged that kind of problem was to train only two agents, not switch the brains between them, and log all the match results to check that they make sense.

  • For the memory leak, I don't remember having problems when running overnight with 16 GB, but I also never looked into it, so if it happens on the Tennis environment there is surely a problem.

  • I think curiosity should work since the checkpoints should include the curiosity model but I didn't try it.

  • I don't think I had problems with Tensorboard.

  • For teams I think the ghost could work the same way but we would need to change the elo rating calculation and send a list of opponents and teammates with the result of the match.

I will try to get hold of a computer this weekend and see if I can investigate that memory leak.

@AcelisWeaven (Contributor) commented Jul 16, 2019

Thanks @LeSphax for your quick answer!

  • For the SetTextObs part, you were totally right. I had a logic issue where both agents got the win state.
  • No rush for the memory issue :)
  • I couldn't reproduce the issue with Tensorboard. Now the graphs load just fine, but I've got another small issue using the --load flag; the time starts over at zero, causing overlapping graphs. I'm not 100% sure this is happening because of this PR, but I didn't have this issue using the latest ml-agents release.
    [screenshot of overlapping Tensorboard graphs]
    Edit: Actually the issue happened again; I can't find a way to reproduce it. On this screenshot, the training has been running for about 6 hours, and there are no scalars.
    [screenshot of Tensorboard showing no scalars]
  • I thought using Elo ranking for teams was not possible if the teammates use different brains. Is that true? Some people suggested a ranking system like TrueSkill (not open source), but I'm not sure that's possible. In the meantime, a friend suggested using a single brain to control multiple players of the same team, so I'll try something like that and see how it plays out.

Edit: I also forgot to mention that I'm using Anaconda on Windows 10 to manage my env, and that I'm running tensorflow-cpu. I'm using your fork with the latest master updates (using a rebase).
Edit 2: Besides in-game stats, what does switching agents do?
Edit 3: I figured out why my Tensorboard showed the wrong data: it was actually the ghost's data! Reversing the brains in the Academy object (first the ghost brain, then the learning brain) fixed the issue for me, so this seems like a bug. This also fixed the "graph continuity" issue.

@CLAassistant commented Sep 13, 2019

CLA assistant check
All committers have signed the CLA.

Sebastien Kerbrat added 2 commits October 1, 2019 10:43
@LeSphax (Contributor, Author) commented Oct 2, 2019

Hey @AcelisWeaven ,

I finally spent some time investigating this memory leak (actually there were two). I believe my two latest commits should fix them.

There is also a third memory leak that happens when you run only in inference mode, but it seems that one was already in the framework in that version. To fix it, I will need to update the PR to the latest develop branch.

@AcelisWeaven (Contributor) commented Oct 4, 2019

Hey @LeSphax !

I've been running it for 24h+ and everything works flawlessly.

Thanks!

Edit: The longer sessions allowed me to find some issues in my environment, and since I fixed those the training process is a lot better. I still have a little issue after 48h+ of training (without a crash!) where nothing happens anymore. It may be coming from my environment, so I'll take a closer look.
Edit 2: It definitely comes from my environment (the ball can sometimes get stuck "inside" the wall).

policy_used_ckpts = all_ckpts

# There is a 1-p_load_from_all_checkpoints probability that we sample the policy only from the last load_from_last_N_checkpoints policies
if random.random() > self.ghost_prob_sample_only_recent:

A Contributor commented on the diff above:

@LeSphax Shouldn't this condition be reversed, according to the explanation? To me it seems there is only a 40% chance of picking from the recent checkpoints list.

ghost_prob_sample_only_recent: Probability that the ghost will only sample recent checkpoints instead of the whole history. 0.6 means 60% chance to sample recent and 40% to sample the whole history.

Suggested change
if random.random() > self.ghost_prob_sample_only_recent:
if random.random() < self.ghost_prob_sample_only_recent:

@chriselion closed this Dec 5, 2019
@chriselion (Contributor) commented:
Sorry, this got automatically closed when we deleted the develop branch, and it looks like I can't reopen or change the target branch.

@chriselion reopened this Dec 6, 2019
@chriselion (Contributor) commented:
Changing target from develop to master.

@chriselion changed the base branch from develop to master December 6, 2019 19:30
@elliott-omosheye commented:
This has diverged a lot from master. I wouldn't mind doing a rebase, but I can't figure out why this PR wasn't merged or declined months ago, and I don't want to waste the effort if the maintainers have a reason they didn't like it.

@LeSphax (Contributor, Author) commented Dec 25, 2019

Hey @elliott-omosheye,

I haven't worked on this PR in a while, but the last time I discussed it with the maintainers was in issue #2559.

The status at that time was: "We are still quite interested in the contribution! We are planning on taking a look at it as part of a wider look at multi-agent, which we will be taking soon."

I don't mind doing the rebase myself if/when we want to merge it.

@andrewcoh (Contributor) commented:
This feature has been added in #3194. Thanks for the initial PR!

@andrewcoh closed this Jan 29, 2020
@roboserg commented:
This feature has been added in #3194. Thanks for the initial PR!

Does that mean this feature will be available in the next release?

@andrewcoh (Contributor) commented:
Yes.

@github-actions bot locked as resolved and limited conversation to collaborators May 16, 2021