schedulers/kubernetes_scheduler: add workspace/patching support #384

Closed
wants to merge 1 commit into from

d4l3k
Member

@d4l3k d4l3k commented Feb 9, 2022

This adds workspace patching support to the Kubernetes scheduler. It requires the image_repo config option, which names the Docker repository that patched images are pushed to.

If the dryrun/schedule methods find a local image such as sha256:..., the scheduler remaps it to a tag in the image_repo repository and pushes it during schedule.
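The remapping described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `remap_local_image` and its return shape are hypothetical, and the `<image_repo>:<hash>` naming is inferred from the "pushing image" line in the test plan log.

```python
from typing import Optional, Tuple

# Local images built by the workspace step are identified by a bare image ID.
LOCAL_PREFIX = "sha256:"


def remap_local_image(image: str, image_repo: Optional[str]) -> Tuple[str, bool]:
    """Return (image_to_schedule, needs_push).

    Hypothetical helper: images that already carry a repository name are
    returned unchanged, while local image IDs are rewritten to
    "<image_repo>:<hash>" and flagged as needing a push.
    """
    if not image.startswith(LOCAL_PREFIX):
        return image, False
    if not image_repo:
        raise KeyError("image_repo must be set to push a locally built image")
    image_hash = image[len(LOCAL_PREFIX):]
    return f"{image_repo}:{image_hash}", True


if __name__ == "__main__":
    # example.com/team/repo is a placeholder repository, not a real one.
    print(remap_local_image("sha256:d1cd394f", "example.com/team/repo"))
    print(remap_local_image("ghcr.io/pytorch/torchx:0.1.2dev0", None))
```

Returning a flag alongside the remapped name keeps the decision (made when the request is built) separate from the push itself, which matches the description of pushing only during schedule.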

Test plan:

pyre
pytest torchx/schedulers/test/kubernetes_scheduler_test.py
(torchx) tristanr@tristanr-arch2 ~/D/torchx-proj> torchx run --scheduler kubernetes -c queue=default,image_repo=495572122715.dkr.ecr.us-west-2.amazonaws.com/torchx/integration-tests --wait --log utils.sh sh foo.sh
torchx 2022-02-09 15:51:12 INFO     loaded configs from /home/tristanr/Developer/torchx-proj/.torchxconfig
torchx 2022-02-09 15:51:12 INFO     building patch images for workspace: file:///home/tristanr/Developer/torchx-proj...
torchx 2022-02-09 15:51:13 INFO     built image sha256:d1cd394f88861a5ca18de88cc0801513cd6c3dc7d945f7cbfe7121bb1d552bec from ghcr.io/pytorch/torchx:0.1.2dev0
torchx 2022-02-09 15:51:14 INFO     pushing image 495572122715.dkr.ecr.us-west-2.amazonaws.com/torchx/integration-tests:d1cd394f88861a5ca18de88cc0801513cd6c3dc7d945f7cbfe7121bb1d552bec...
torchx 2022-02-09 15:51:14 INFO     docker: {'status': 'The push refers to repository [495572122715.dkr.ecr.us-west-2.amazonaws.com/torchx/integration-tests]'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': '004e5e059580'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': 'de1d3a8ac491'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': 'e6d41c036803'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': '0827b8e37332'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': 'a8496aa14f72'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': '1f84c52a7d38'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': '0f801b69538d'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': '354dfcbe6a14'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': 'f15a0881ce19'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': '824bf068fd3d'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Waiting', 'progressDetail': {}, 'id': '1f84c52a7d38'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Waiting', 'progressDetail': {}, 'id': '824bf068fd3d'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Waiting', 'progressDetail': {}, 'id': 'f15a0881ce19'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Waiting', 'progressDetail': {}, 'id': '0f801b69538d'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Waiting', 'progressDetail': {}, 'id': '354dfcbe6a14'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': '0827b8e37332'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': 'de1d3a8ac491'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': 'e6d41c036803'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': '004e5e059580'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': 'a8496aa14f72'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': '824bf068fd3d'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': '0f801b69538d'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': '1f84c52a7d38'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': 'f15a0881ce19'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': '354dfcbe6a14'}
torchx 2022-02-09 15:51:16 INFO     docker: {'status': 'd1cd394f88861a5ca18de88cc0801513cd6c3dc7d945f7cbfe7121bb1d552bec: digest: sha256:da9fba179cb37f2f6d6d09c16dc4f0c39ca84a6fbb767c0aff7b77738b608805 size: 2413'}
torchx 2022-02-09 15:51:16 INFO     docker: {'progressDetail': {}, 'aux': {'Tag': 'd1cd394f88861a5ca18de88cc0801513cd6c3dc7d945f7cbfe7121bb1d552bec', 'Digest': 'sha256:da9fba179cb37f2f6d6d09c16dc4f0c39ca84a6fbb767c0aff7b77738b608805', 'Size': 2413}}
kubernetes://torchx/default:sh-n71zqm25lrk61
torchx 2022-02-09 15:51:17 INFO     Launched app: kubernetes://torchx/default:sh-n71zqm25lrk61
torchx 2022-02-09 15:51:17 INFO     AppStatus:
  msg: <NONE>
  num_restarts: -1
  roles: []
  state: PENDING (2)
  structured_error_msg: <NONE>
  ui_url: null

torchx 2022-02-09 15:51:17 INFO     Job URL: None
torchx 2022-02-09 15:51:17 INFO     Waiting for the app to finish...
torchx 2022-02-09 15:51:17 INFO     Waiting for app to start before logging...
torchx 2022-02-09 15:51:22 INFO     Job finished: SUCCEEDED
sh/0 2022-02-09T23:51:21.702981500Z foo
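The `docker: {...}` lines in the log above are decoded push progress events. As a hedged sketch, assuming each event is a dict shaped like those log lines and that a failed push reports an `error` key, a caller could check the stream and recover the pushed digest like this. The helper name `digest_from_push_events` is hypothetical and is not part of torchx.

```python
from typing import Any, Dict, Iterable, Optional


def digest_from_push_events(events: Iterable[Dict[str, Any]]) -> Optional[str]:
    """Scan decoded push events and return the final image digest, if any.

    Raises RuntimeError if any event carries an "error" key (assumed to be
    how a failed push is reported in the stream).
    """
    digest: Optional[str] = None
    for event in events:
        if "error" in event:
            raise RuntimeError(f"docker push failed: {event['error']}")
        aux = event.get("aux")
        if isinstance(aux, dict) and "Digest" in aux:
            digest = aux["Digest"]
    return digest


# Events copied from the log above (abbreviated to three entries).
events = [
    {"status": "Preparing", "progressDetail": {}, "id": "004e5e059580"},
    {"status": "Layer already exists", "progressDetail": {}, "id": "004e5e059580"},
    {
        "progressDetail": {},
        "aux": {
            "Tag": "d1cd394f88861a5ca18de88cc0801513cd6c3dc7d945f7cbfe7121bb1d552bec",
            "Digest": "sha256:da9fba179cb37f2f6d6d09c16dc4f0c39ca84a6fbb767c0aff7b77738b608805",
            "Size": 2413,
        },
    },
]

if __name__ == "__main__":
    print(digest_from_push_events(events))
```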

@facebook-github-bot facebook-github-bot added the CLA Signed label Feb 9, 2022
@facebook-github-bot
Contributor

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D34125887

d4l3k added a commit that referenced this pull request Feb 10, 2022
Summary:
This adds patching support to the kubernetes scheduler. It requires you to specify `image_repo` as a config option with the docker repository to push to.

If the dryrun/schedule methods find a local image such as `sha256:...` it'll remap it to a remote repo package and push it during schedule.

Pull Request resolved: #384

Test Plan:
```
pyre
pytest torchx/schedulers/test/kubernetes_scheduler_test.py
```
```
(torchx) tristanr@tristanr-arch2 ~/D/torchx-proj> torchx run --scheduler kubernetes -c queue=default,image_repo=495572122715.dkr.ecr.us-west-2.amazonaws.com/torchx/integration-tests --wait --log utils.sh sh foo.sh
torchx 2022-02-09 15:51:12 INFO     loaded configs from /home/tristanr/Developer/torchx-proj/.torchxconfig
torchx 2022-02-09 15:51:12 INFO     building patch images for workspace: file:///home/tristanr/Developer/torchx-proj...
torchx 2022-02-09 15:51:13 INFO     built image sha256:d1cd394f88861a5ca18de88cc0801513cd6c3dc7d945f7cbfe7121bb1d552bec from ghcr.io/pytorch/torchx:0.1.2dev0
torchx 2022-02-09 15:51:14 INFO     pushing image 495572122715.dkr.ecr.us-west-2.amazonaws.com/torchx/integration-tests:d1cd394f88861a5ca18de88cc0801513cd6c3dc7d945f7cbfe7121bb1d552bec...
torchx 2022-02-09 15:51:14 INFO     docker: {'status': 'The push refers to repository [495572122715.dkr.ecr.us-west-2.amazonaws.com/torchx/integration-tests]'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': '004e5e059580'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': 'de1d3a8ac491'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': 'e6d41c036803'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': '0827b8e37332'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': 'a8496aa14f72'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': '1f84c52a7d38'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': '0f801b69538d'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': '354dfcbe6a14'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': 'f15a0881ce19'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Preparing', 'progressDetail': {}, 'id': '824bf068fd3d'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Waiting', 'progressDetail': {}, 'id': '1f84c52a7d38'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Waiting', 'progressDetail': {}, 'id': '824bf068fd3d'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Waiting', 'progressDetail': {}, 'id': 'f15a0881ce19'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Waiting', 'progressDetail': {}, 'id': '0f801b69538d'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Waiting', 'progressDetail': {}, 'id': '354dfcbe6a14'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': '0827b8e37332'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': 'de1d3a8ac491'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': 'e6d41c036803'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': '004e5e059580'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': 'a8496aa14f72'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': '824bf068fd3d'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': '0f801b69538d'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': '1f84c52a7d38'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': 'f15a0881ce19'}
torchx 2022-02-09 15:51:15 INFO     docker: {'status': 'Layer already exists', 'progressDetail': {}, 'id': '354dfcbe6a14'}
torchx 2022-02-09 15:51:16 INFO     docker: {'status': 'd1cd394f88861a5ca18de88cc0801513cd6c3dc7d945f7cbfe7121bb1d552bec: digest: sha256:da9fba179cb37f2f6d6d09c16dc4f0c39ca84a6fbb767c0aff7b77738b608805 size: 2413'}
torchx 2022-02-09 15:51:16 INFO     docker: {'progressDetail': {}, 'aux': {'Tag': 'd1cd394f88861a5ca18de88cc0801513cd6c3dc7d945f7cbfe7121bb1d552bec', 'Digest': 'sha256:da9fba179cb37f2f6d6d09c16dc4f0c39ca84a6fbb767c0aff7b77738b608805', 'Size': 2413}}
kubernetes://torchx/default:sh-n71zqm25lrk61
torchx 2022-02-09 15:51:17 INFO     Launched app: kubernetes://torchx/default:sh-n71zqm25lrk61
torchx 2022-02-09 15:51:17 INFO     AppStatus:
  msg: <NONE>
  num_restarts: -1
  roles: []
  state: PENDING (2)
  structured_error_msg: <NONE>
  ui_url: null

torchx 2022-02-09 15:51:17 INFO     Job URL: None
torchx 2022-02-09 15:51:17 INFO     Waiting for the app to finish...
torchx 2022-02-09 15:51:17 INFO     Waiting for app to start before logging...
torchx 2022-02-09 15:51:22 INFO     Job finished: SUCCEEDED
sh/0 2022-02-09T23:51:21.702981500Z foo
```

Reviewed By: kiukchung

Differential Revision: D34125887

Pulled By: d4l3k

fbshipit-source-id: e03d6c0ea70f4827b1eb5d24c8ad973c6c75a859
fbshipit-source-id: bc1177ee17e33d5e6bdd340234755c5c53670293
@codecov

codecov bot commented Feb 10, 2022

Codecov Report

Merging #384 (f107376) into main (c7efc75) will decrease coverage by 0.35%.
The diff coverage is 80.00%.

@@            Coverage Diff             @@
##             main     #384      +/-   ##
==========================================
- Coverage   94.70%   94.34%   -0.36%     
==========================================
  Files          63       63              
  Lines        3359     3398      +39     
==========================================
+ Hits         3181     3206      +25     
- Misses        178      192      +14     
| Impacted Files | Coverage Δ |
| --- | --- |
| torchx/schedulers/local_scheduler.py | 93.25% <ø> (ø) |
| torchx/schedulers/slurm_scheduler.py | 98.00% <ø> (ø) |
| torchx/schedulers/kubernetes_scheduler.py | 89.86% <78.57%> (-4.56%) ⬇️ |
| torchx/runner/workspaces.py | 100.00% <100.00%> (ø) |
| torchx/schedulers/docker_scheduler.py | 96.27% <100.00%> (-0.94%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c7efc75...f107376.

@d4l3k d4l3k deleted the k8spatch branch February 15, 2022 20:12