[Slurm scheduler] Add better support for specifying resources in slurm #359

@aivanou

🐛 Bug

According to aws/aws-parallelcluster#2198, AWS ParallelCluster (PCluster) has problems running jobs that specify explicit memory requirements.

We need to modify our Slurm scheduler so that jobs submitted through it still run on such clusters.
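
One possible direction, sketched below, is to let the memory request be left out of the options passed to `sbatch`. This is only an illustration under stated assumptions: `build_sbatch_opts`, `to_flags`, and the `include_mem` switch are hypothetical names, not part of the existing torchx Slurm scheduler. The flags themselves (`--cpus-per-task`, `--mem`, `--gpus-per-task`) are standard `sbatch` options.

```python
from typing import Dict, List, Optional


def build_sbatch_opts(
    cpu: int,
    mem_mb: Optional[int],
    gpu: int = 0,
    include_mem: bool = True,  # hypothetical switch; set False on clusters that reject --mem
) -> Dict[str, str]:
    """Map a resource request to sbatch options, optionally omitting memory."""
    opts: Dict[str, str] = {}
    if cpu > 0:
        opts["cpus-per-task"] = str(cpu)
    # Only emit --mem when the caller allows it and a positive value is given.
    if include_mem and mem_mb is not None and mem_mb > 0:
        opts["mem"] = str(mem_mb)  # sbatch interprets a bare number as megabytes
    if gpu > 0:
        opts["gpus-per-task"] = str(gpu)
    return opts


def to_flags(opts: Dict[str, str]) -> List[str]:
    """Render the options as command-line flags for sbatch."""
    return [f"--{key}={value}" for key, value in opts.items()]
```

With `include_mem=False`, a request for 4 CPUs and 8192 MB would render only `--cpus-per-task=4`, so the job carries no explicit memory requirement.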

Module (check all that apply):

  • [ ] torchx.spec
  • [ ] torchx.component
  • [ ] torchx.apps
  • [ ] torchx.runtime
  • [ ] torchx.cli
  • [x] torchx.schedulers
  • [ ] torchx.pipelines
  • [ ] torchx.aws
  • [ ] torchx.examples
  • [ ] other

To Reproduce

Steps to reproduce the behavior:

  1. SSH into the Slurm cluster.
  2. Create a main.py that prints "hello world".
  3. torchx run -s slurm --scheduler_args partition=compute,time=10 dist.ddp --script main.py
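
For step 2, a minimal main.py could look like the sketch below. The issue does not include the actual script, so this is only an assumed stand-in; any script that prints a line should reproduce the behavior.

```python
# main.py: assumed minimal script for the reproduction steps
# (the original script is not shown in the issue).


def main() -> str:
    message = "hello world"
    print(message)
    return message


if __name__ == "__main__":
    main()
```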

Expected behavior

The job executes successfully.

Labels: bug (Something isn't working); module: runner (issues related to the torchx.runner and torchx.scheduler modules); slurm (slurm scheduler)
