Closed
Labels: bug, module: runner, slurm
Description
🐛 Bug
According to aws/aws-parallelcluster#2198, ParallelCluster has problems running jobs that set explicit memory requirements.
We need to modify our Slurm scheduler to address this.
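One possible direction, sketched below with hypothetical helper names (this is not the actual torchx.schedulers code): only emit the `#SBATCH --mem` directive when the user explicitly requests memory, so that on ParallelCluster the job falls back to the node/partition default instead of failing.

```python
# Hypothetical sketch, not the real TorchX implementation: build sbatch
# directives and make the memory request optional, since ParallelCluster
# mishandles jobs with an explicit --mem (aws/aws-parallelcluster#2198).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SlurmResource:
    cpus: int
    gpus: int
    mem_mb: Optional[int]  # None means "do not request memory explicitly"


def sbatch_directives(resource: SlurmResource, partition: str, time: str) -> List[str]:
    """Build #SBATCH lines, omitting --mem when no explicit memory is requested."""
    lines = [
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={time}",
        f"#SBATCH --cpus-per-task={resource.cpus}",
    ]
    if resource.gpus > 0:
        lines.append(f"#SBATCH --gpus-per-task={resource.gpus}")
    # Only emit a memory directive when the caller explicitly asked for one;
    # on ParallelCluster this stays unset so Slurm uses the node default.
    if resource.mem_mb is not None:
        lines.append(f"#SBATCH --mem={resource.mem_mb}M")
    return lines


if __name__ == "__main__":
    res = SlurmResource(cpus=8, gpus=0, mem_mb=None)
    print("\n".join(sbatch_directives(res, partition="compute", time="10")))
```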
Module (check all that apply):
- [ ] torchx.spec
- [ ] torchx.component
- [ ] torchx.apps
- [ ] torchx.runtime
- [ ] torchx.cli
- [x] torchx.schedulers
- [ ] torchx.pipelines
- [ ] torchx.aws
- [ ] torchx.examples
- [ ] other
To Reproduce
Steps to reproduce the behavior:
- SSH to the Slurm cluster
- Create a main.py that prints "hello world" (see the minimal example after this list)
- Run `torchx run -s slurm --scheduler_args partition=compute,time=10 dist.ddp --script main.py`
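For reference, the main.py used in the second step can be as simple as:

```python
# main.py - minimal script used to reproduce the issue
print("hello world")
```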
Expected behavior
Job successfully executed