Skip to content

dist.ddp: make rendezvous work out of the box on all schedulers #400

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

d4l3k
Copy link
Member

@d4l3k d4l3k commented Feb 23, 2022

Every OSS scheduler we support provides a special environment variable for the rank0 host address. This creates a new macro that provides that for all the schedulers and updates dist.ddp to use it.

This macro is a bit different since it's a rank0_env macro which provides the name of the ENV variable to read from. To use that, the component either needs to read that or do something like f"/bin/bash -c main.py $${macros.rank0_env}. We wrap dist.ddp with /bin/bash to achieve this. Due to the macro template language we need to use $$ instead of just $.

Other misc changes:

  • fixed ray tests to skip if ray isn't installed
  • deleted TORCHX_IMAGE_EXAMPLES since it doesn't exist anymore
  • updated dist.ddp to have -m, --max_restarts and set LOGLEVEL by default

Test plan:

Updated slurm, docker and aws batch integ tests to use dist.ddp instead of bespoke usage.

Tested k8s and local_cwd manually.

I haven't tested ray

Updated component test framework to use two workers

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 23, 2022
@codecov
Copy link

codecov bot commented Feb 23, 2022

Codecov Report

Merging #400 (06a0433) into main (be2a579) will decrease coverage by 0.07%.
The diff coverage is 86.53%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #400      +/-   ##
==========================================
- Coverage   94.57%   94.50%   -0.08%     
==========================================
  Files          66       66              
  Lines        3593     3638      +45     
==========================================
+ Hits         3398     3438      +40     
- Misses        195      200       +5     
Impacted Files Coverage Δ
torchx/schedulers/slurm_scheduler.py 98.11% <ø> (ø)
torchx/version.py 100.00% <ø> (ø)
torchx/components/dist.py 77.77% <76.66%> (-4.58%) ⬇️
torchx/schedulers/api.py 93.24% <100.00%> (+0.59%) ⬆️
torchx/schedulers/aws_batch_scheduler.py 85.39% <100.00%> (+0.25%) ⬆️
torchx/schedulers/docker_scheduler.py 97.20% <100.00%> (+1.16%) ⬆️
torchx/schedulers/kubernetes_scheduler.py 92.67% <100.00%> (+0.15%) ⬆️
torchx/schedulers/local_scheduler.py 92.95% <100.00%> (+0.01%) ⬆️
torchx/schedulers/ray_scheduler.py 93.93% <100.00%> (ø)
torchx/specs/api.py 98.90% <100.00%> (+<0.01%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update be2a579...06a0433. Read the comment docs.

@facebook-github-bot
Copy link
Contributor

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@d4l3k has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@d4l3k has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@d4l3k has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@d4l3k has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:
Every OSS scheduler we support provides a special environment variable for the rank0 host address. This creates a new macro that provides that for all the schedulers and updates `dist.ddp` to use it.

This macro is a bit different since it's a `rank0_env` macro which provides the name of the ENV variable to read from. To use that, the component either needs to read that or do something like `f"/bin/bash -c main.py $${macros.rank0_env}`. We wrap dist.ddp with `/bin/bash` to achieve this. Due to the macro template language we need to use `$$` instead of just `$`.

Other misc changes:
* fixed ray tests to skip if `ray` isn't installed
* deleted TORCHX_IMAGE_EXAMPLES since it doesn't exist anymore
* updated dist.ddp to have `-m`, `--max_restarts` and set LOGLEVEL by default

Pull Request resolved: #400

Test Plan:
Updated slurm, docker and aws batch integ tests to use `dist.ddp` instead of bespoke usage.
Tested k8s and local_cwd manually.
I haven't tested ray
Updated component test framework to use two workers

Reviewed By: kiukchung

Differential Revision: D34437397

Pulled By: d4l3k

fbshipit-source-id: f2f78e7d293c9b50393e75cf4e9ca0eb5388d0b4
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D34437397

@d4l3k d4l3k deleted the distrank0 branch February 28, 2022 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants