When networkTopology.mode is not hard, but subgroups exist, scheduling is not possible. #4871

@zhengchenyu

Description

When networkTopology.mode is not hard, but subgroups exist, scheduling is not possible.

Steps to reproduce the issue

Create a job from the YAML below:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vcjob-with-subgroups
  namespace: default
spec:
  schedulerName: volcano
  queue: test
#  networkTopology:
#    mode: soft
#    highestTierAllowed: 1
  tasks:
    - name: worker
      replicas: 2
      partitionPolicy:
        totalPartitions: 2
        partitionSize: 1
        minPartitions: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: nginx:latest
              resources:
                requests:
                  cpu: "2"
                  memory: "40Gi"

The pods cannot be scheduled:

[user@host dir]# kubectl get vcjob
NAME                   STATUS    MINAVAILABLE   RUNNINGS   AGE
vcjob-with-subgroups   Pending   2                         5s
[user@host dir]# kubectl get pg
NAME                                                        STATUS    MINMEMBER   RUNNINGS   AGE
vcjob-with-subgroups-fd291270-29cb-4f7b-9fb6-62ff5dc3354b   Inqueue   2                      8s
[user@host dir]# kubectl get pod
NAME                            READY   STATUS    RESTARTS   AGE
vcjob-with-subgroups-worker-0   0/1     Pending   0          12s
vcjob-with-subgroups-worker-1   0/1     Pending   0          12s

Describe the results you received and expected

Expected: the pods are scheduled and running. Received: both pods stay Pending.

What version of Volcano are you using?

master

Any other relevant information

The scheduler has the following logs:

E1224 06:03:19.760817       1 allocate.go:330] "Can not find default subJob or tasks for job" job="default/vcjob-with-subgroups-fd291270-29cb-4f7b-9fb6-62ff5dc3354b" subJobExist=false tasksExist=false

Based on the logs, the cause is easy to identify. A job that has subgroups but no hard topology takes the else branch during allocation. Because subgroups exist, however, job.SubJob and actx.tasksNoHardTopology are indexed by each subgroup's own SubJobID rather than by the default one. The lookups behind sjExist and tasksExist therefore both return false, and scheduling fails.
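The lookup miss described above can be illustrated with a minimal, self-contained Go sketch. It is not the Volcano code: the SubJobID type, the defaultSubJobID key, the lookupDefault helper and the "worker-0"/"worker-1" keys are all hypothetical stand-ins, and the real maps hold scheduler structs rather than pod names.

```go
package main

import "fmt"

// SubJobID is a hypothetical stand-in for the scheduler's sub-job key type.
type SubJobID string

// defaultSubJobID is an assumed name for the key used when a job has no subgroups.
const defaultSubJobID SubJobID = "default"

// lookupDefault mimics the failing branch: it only looks under the default key
// in both maps and reports whether each lookup found an entry.
func lookupDefault(subJobs, tasks map[SubJobID][]string) (sjExist, tasksExist bool) {
	_, sjExist = subJobs[defaultSubJobID]
	_, tasksExist = tasks[defaultSubJobID]
	return sjExist, tasksExist
}

func main() {
	// With subgroups, entries are keyed per subgroup (illustrative keys),
	// so nothing is stored under the default key.
	subJobs := map[SubJobID][]string{
		"worker-0": {"vcjob-with-subgroups-worker-0"},
		"worker-1": {"vcjob-with-subgroups-worker-1"},
	}
	tasks := map[SubJobID][]string{
		"worker-0": {"vcjob-with-subgroups-worker-0"},
		"worker-1": {"vcjob-with-subgroups-worker-1"},
	}

	sjExist, tasksExist := lookupDefault(subJobs, tasks)
	fmt.Printf("subJobExist=%v tasksExist=%v\n", sjExist, tasksExist)
}
```

Both lookups miss, which matches the subJobExist=false tasksExist=false fields in the scheduler log.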

Solution:

  • Use allocateForJob for scheduling when a hard topology is set or a subgroup exists.
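The proposed condition can be sketched as a small predicate. This is only an illustration under assumptions: the jobInfo struct, its fields and the useAllocateForJob helper are hypothetical names, not the scheduler's actual types or the eventual patch.

```go
package main

import "fmt"

// jobInfo is a hypothetical, simplified view of a job for this sketch.
type jobInfo struct {
	hardTopology bool // true when networkTopology.mode is hard
	subGroups    int  // number of subgroups (partitions) the job defines
}

// useAllocateForJob reports whether the job should go through the
// allocateForJob path: either a hard topology is set or subgroups exist.
func useAllocateForJob(j jobInfo) bool {
	return j.hardTopology || j.subGroups > 0
}

func main() {
	cases := []struct {
		name string
		job  jobInfo
	}{
		{"hard topology, no subgroups", jobInfo{hardTopology: true}},
		{"no hard topology, 2 subgroups", jobInfo{subGroups: 2}},
		{"no hard topology, no subgroups", jobInfo{}},
	}
	for _, c := range cases {
		fmt.Printf("%s: allocateForJob=%v\n", c.name, useAllocateForJob(c.job))
	}
}
```

The second case corresponds to the job in this report: without the subgroup check it would fall into the else branch and fail the default-key lookup.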

Performance:

Using allocateForJob does enter the Network Topology scheduling logic, which uses two nested loops. However, since no hard topology is set, those loops do not get caught up in the Network Topology logic, so there are no performance issues.

Metadata

Labels: kind/bug (categorizes issue or PR as related to a bug)
