-
Notifications
You must be signed in to change notification settings - Fork 136
dist.ddp: make rendezvous work out of the box on all schedulers #400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #400 +/- ##
==========================================
- Coverage 94.57% 94.50% -0.08%
==========================================
Files 66 66
Lines 3593 3638 +45
==========================================
+ Hits 3398 3438 +40
- Misses 195 200 +5
Continue to review full report at Codecov.
|
@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@d4l3k has updated the pull request. You must reimport the pull request before landing. |
@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@d4l3k has updated the pull request. You must reimport the pull request before landing. |
@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@d4l3k has updated the pull request. You must reimport the pull request before landing. |
@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@d4l3k has updated the pull request. You must reimport the pull request before landing. |
@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: Every OSS scheduler we support provides a special environment variable for the rank0 host address. This creates a new macro that provides that for all the schedulers and updates `dist.ddp` to use it. This macro is a bit different since it's a `rank0_env` macro which provides the name of the ENV variable to read from. To use that, the component either needs to read that or do something like `f"/bin/bash -c main.py $${macros.rank0_env}`. We wrap dist.ddp with `/bin/bash` to achieve this. Due to the macro template language we need to use `$$` instead of just `$`. Other misc changes: * fixed ray tests to skip if `ray` isn't installed * deleted TORCHX_IMAGE_EXAMPLES since it doesn't exist anymore * updated dist.ddp to have `-m`, `--max_restarts` and set LOGLEVEL by default Pull Request resolved: #400 Test Plan: Updated slurm, docker and aws batch integ tests to use `dist.ddp` instead of bespoke usage. Tested k8s and local_cwd manually. I haven't tested ray Updated component test framework to use two workers Reviewed By: kiukchung Differential Revision: D34437397 Pulled By: d4l3k fbshipit-source-id: f2f78e7d293c9b50393e75cf4e9ca0eb5388d0b4
This pull request was exported from Phabricator. Differential Revision: D34437397 |
Every OSS scheduler we support provides a special environment variable for the rank0 host address. This creates a new macro that provides that for all the schedulers and updates
dist.ddp
to use it.This macro is a bit different since it's a
rank0_env
macro which provides the name of the ENV variable to read from. To use that, the component either needs to read that or do something likef"/bin/bash -c main.py $${macros.rank0_env}
. We wrap dist.ddp with/bin/bash
to achieve this. Due to the macro template language we need to use$$
instead of just$
.Other misc changes:
ray
isn't installed-m
,--max_restarts
and set LOGLEVEL by defaultTest plan:
Updated slurm, docker and aws batch integ tests to use
dist.ddp
instead of bespoke usage.Tested k8s and local_cwd manually.
I haven't tested ray
Updated component test framework to use two workers