[WIP] Add e2e test for KubeRay NativeWorkloadScheduling#6227

Draft
mboersma wants to merge 17 commits into
kubernetes-sigs:mainfrom
mboersma:kuberay-native-scheduling-e2e

@mboersma

Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add a new e2e test that exercises the unreleased NativeWorkloadScheduling feature from the kuberay workload-poc branch. This feature uses the Kubernetes-native scheduling.k8s.io/v1alpha2 API (Workload + PodGroup) for gang scheduling of Ray pods.

Changes:

  • New cluster template ci-version-native-scheduling with K8s feature gates GenericWorkload, GangScheduling, and runtime config for scheduling.k8s.io/v1alpha2
  • InstallHelmChartFromPath and InstallKubeRayOperatorFromSource helpers for installing kuberay from a local chart with custom image
  • KubeRayNativeSchedulingSpec test that creates a RayCluster with the opt-in annotation, verifies Workload and PodGroup resources are created, and confirms all pods reach Running state
  • New Ginkgo test case tagged [KubeRay] [NativeScheduling]

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

NONE

@k8s-ci-robot

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. labels Apr 10, 2026
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 10, 2026
@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jont828 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 10, 2026
@mboersma mboersma changed the title from "[WIP] Add e2e test for KubeRay NativeWorkloadScheduling on self-managed k8s 1.36+" to "[WIP] Add e2e test for KubeRay NativeWorkloadScheduling" Apr 10, 2026
@codecov

codecov Bot commented Apr 10, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.01%. Comparing base (b51d84e) to head (18b8f37).
⚠️ Report is 17 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6227   +/-   ##
=======================================
  Coverage   44.01%   44.01%           
=======================================
  Files         288      288           
  Lines       25305    25305           
=======================================
  Hits        11138    11138           
  Misses      13395    13395           
  Partials      772      772           

@mboersma mboersma force-pushed the kuberay-native-scheduling-e2e branch from b956e97 to a8aa2fc Compare April 10, 2026 20:36
@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

4 similar comments
@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

Last run looked like a flake, actually.

@mboersma

Contributor Author

Test passed, and it looks like it's testing the right things. Now I'll add some more specific workload API checks.

@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

Run the test again after adding more specific checks.

@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

1 similar comment
@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

Just to ensure it's still passing after a refactoring to remove duplicated code.

@mboersma mboersma force-pushed the kuberay-native-scheduling-e2e branch from 51e99fb to f16dc8c Compare April 13, 2026 21:27
@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

1 similar comment
@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

@mboersma

mboersma commented May 4, 2026

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 13, 2026
@mboersma mboersma force-pushed the kuberay-native-scheduling-e2e branch from a94382b to 84b6340 Compare May 14, 2026 20:36
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2026
@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

mboersma added 15 commits June 12, 2026 14:16
… 1.36+

Add a new e2e test that exercises the unreleased NativeWorkloadScheduling
feature from the kuberay workload-poc branch. This feature uses the
Kubernetes-native scheduling.k8s.io/v1alpha2 API (Workload + PodGroup)
for gang scheduling of Ray pods.

Changes:
- New cluster template ci-version-native-scheduling with K8s feature
  gates GenericWorkload, GangScheduling, and runtime config for
  scheduling.k8s.io/v1alpha2
- InstallHelmChartFromPath and InstallKubeRayOperatorFromSource helpers
  for installing kuberay from a local chart with custom image
- KubeRayNativeSchedulingSpec test that creates a RayCluster with the
  opt-in annotation, verifies Workload and PodGroup resources are
  created, and confirms all pods reach Running state
- New Ginkgo test case tagged [KubeRay] [NativeScheduling]

…dential provider dependency

- Add scripts/ci-build-kuberay-operator.sh to clone marosset/kuberay@workload-poc,
  build the operator image, and push it to the local registry
- Source the build script from ci-e2e.sh when GINKGO_FOCUS matches NativeScheduling
- Remove ACR credential provider scripts and kubelet args from the
  ci-version-native-scheduling template (not needed without custom CCM)
- Remove cloud-provider-azure-chart-ci HelmChartProxy (use released CCM)
- Remove CLOUD_PROVIDER_AZURE_LABEL=azure-ci override from the test
- Add _kuberay-source/ to .gitignore

The ci-version-native-scheduling template requires the azure-ci CCM chart
variant with explicit image tags, same as other ci-version flavors. Without
it the released cloud-provider-azure chart fails to install because it
cannot auto-detect images for unreleased K8s versions.

- Restore template to full ci-version parity (ACR credential provider,
  cloud-provider-azure-chart-ci HelmChartProxy)
- Restore CLOUD_PROVIDER_AZURE_LABEL=azure-ci in the test
- Trigger CCM build in ci-e2e.sh when GINKGO_FOCUS matches NativeScheduling

The Prow e2e-kuberay job sets GINKGO_FOCUS=\[KubeRay\], but the build
trigger only checked for NativeScheduling. This caused KUBERAY_SOURCE_DIR
to never be set, failing InstallKubeRayOperatorFromSource.
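
The trigger condition described above can be sketched in Go. This is only an illustration: the real check lives in the ci-e2e.sh shell script, `needsKubeRayBuild` is a hypothetical name, and matching on both `KubeRay` and `NativeScheduling` is an assumption about how the fix widened the check.

```go
package main

import (
	"fmt"
	"strings"
)

// needsKubeRayBuild is a hypothetical stand-in for the shell check in
// ci-e2e.sh. It returns true when the focus string mentions either tag,
// so the Prow job's \[KubeRay\] focus also triggers the operator build.
func needsKubeRayBuild(ginkgoFocus string) bool {
	for _, tag := range []string{"KubeRay", "NativeScheduling"} {
		if strings.Contains(ginkgoFocus, tag) {
			return true
		}
	}
	return false
}

func main() {
	// The first value is the focus string quoted in the commit message.
	for _, focus := range []string{`\[KubeRay\]`, `NativeScheduling`, `\[Conformance\]`} {
		fmt.Printf("%-20s build=%t\n", focus, needsKubeRayBuild(focus))
	}
}
```

A check for `NativeScheduling` alone returns false for the first focus string, which is the failure mode the commit describes.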
The ci-build-kuberay-operator.sh script computed REPO_ROOT as a relative
path (e.g. ./scripts/..), which resulted in KUBERAY_SOURCE_DIR being
exported as a relative path. When Ginkgo runs the test binary, the
working directory differs from the repo root, causing the Helm chart
lookup to fail with 'no such file or directory'.

Fix by resolving REPO_ROOT to an absolute path via pwd after the cd.
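
A small Go sketch of why the relative path broke. The directories `/repo` and `/repo/test/e2e` and the `_kuberay-source` suffix are assumptions for illustration, and `resolveFrom` is a hypothetical helper that mimics how a process started in a given directory interprets a path.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveFrom mimics how a process whose working directory is base would
// interpret p: absolute paths are kept, relative ones are joined onto base.
func resolveFrom(base, p string) string {
	if filepath.IsAbs(p) {
		return filepath.Clean(p)
	}
	return filepath.Join(base, p)
}

func main() {
	// Relative value like the one the script exported before the fix.
	rel := "./scripts/../_kuberay-source"

	// The same string points at different directories depending on the cwd.
	fmt.Println(resolveFrom("/repo", rel))          // where the script ran (assumed)
	fmt.Println(resolveFrom("/repo/test/e2e", rel)) // where the test binary runs (assumed)

	// filepath.Abs is a rough Go analogue of `cd ... && pwd`: it pins the
	// path to the current working directory once, up front.
	abs, err := filepath.Abs(rel)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(resolveFrom("/repo/test/e2e", abs) == abs)
}
```

Once the value is absolute, it resolves to the same directory no matter where the consumer runs.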
The kuberay-operator Helm chart includes ~71K lines of CRDs that need
to be applied to the API server. On a small self-managed VM
(Standard_D2s_v3), processing these CRDs plus ACR credential provider
auth and image pull can exceed the 5-minute timeout. The CI run
confirmed an exact 5-minute timeout hit (context deadline exceeded).

Increase to 10 minutes to give sufficient headroom.

When InstallHelmChartFromPath fails, dump:
- Pod status, container states, and restart counts
- Pod events (scheduling, image pull, etc.)
- Pod logs (operator startup errors)
- CRD list (to check if CRDs finished processing)

Also pass --debug to helm install for verbose Helm output
showing what it's waiting on during --wait.

…values

The kuberay operator Helm chart template uses Go's %t format verb on
featureGates[N].enabled values. When --set-string is used, Helm stores
these as strings ("true") rather than booleans, causing fmt.Sprintf
to produce "%!t(string=true)" which fails strconv.ParseBool.

Using --set passes the values as booleans, matching what the chart expects.
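
The failure can be reproduced with the Go standard library alone. In this sketch `renderEnabled` is a hypothetical stand-in for the chart template's formatting step; the parameter is `interface{}` because Helm values are untyped.

```go
package main

import (
	"fmt"
	"strconv"
)

// renderEnabled mimics a template formatting a chart value with the %t verb.
func renderEnabled(v interface{}) string {
	return fmt.Sprintf("%t", v)
}

func main() {
	// true is what --set yields; "true" is what --set-string yields.
	for _, v := range []interface{}{true, "true"} {
		s := renderEnabled(v)
		_, err := strconv.ParseBool(s)
		fmt.Printf("%T -> %q (ParseBool error: %v)\n", v, s, err)
	}
}
```

With a bool the rendered text is `true` and parses cleanly. With a string, fmt emits its bad-verb marker `%!t(string=true)`, which strconv.ParseBool rejects.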
Add a parallel negative test that creates a RayCluster WITHOUT the
ray.io/native-workload-scheduling annotation and verifies that:
- No Workload resources are created
- No PodGroup resources are created
- Pods do not have schedulingGroup set

The test runs on its own cluster (vm-nonatsched) in a separate Ginkgo
Context, so it executes in parallel with the positive test.
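
A minimal sketch of the positive and negative inputs, using plain maps instead of the unstructured types the real test uses. The `ray.io/v1` apiVersion and the annotation value `"true"` are assumptions; `newRayCluster` and `hasOptIn` are hypothetical helpers, not functions from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nativeSchedulingAnnotation is the opt-in key named in the commit message.
const nativeSchedulingAnnotation = "ray.io/native-workload-scheduling"

// newRayCluster returns a minimal RayCluster-shaped object. The apiVersion
// and the annotation value "true" are assumptions made for illustration.
func newRayCluster(name string, optIn bool) map[string]interface{} {
	metadata := map[string]interface{}{"name": name}
	if optIn {
		metadata["annotations"] = map[string]interface{}{
			nativeSchedulingAnnotation: "true",
		}
	}
	return map[string]interface{}{
		"apiVersion": "ray.io/v1",
		"kind":       "RayCluster",
		"metadata":   metadata,
	}
}

// hasOptIn reports whether the object carries the opt-in annotation.
func hasOptIn(obj map[string]interface{}) bool {
	metadata, _ := obj["metadata"].(map[string]interface{})
	annotations, _ := metadata["annotations"].(map[string]interface{})
	_, ok := annotations[nativeSchedulingAnnotation]
	return ok
}

func main() {
	for _, optIn := range []bool{true, false} {
		obj := newRayCluster("demo", optIn)
		out, _ := json.Marshal(obj)
		fmt.Printf("optIn=%t hasOptIn=%t %s\n", optIn, hasOptIn(obj), out)
	}
}
```

The negative test corresponds to the `optIn=false` object: with no annotation present, no Workload or PodGroup should appear for it.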
- Unify 4 identical input structs into single KubeRaySpecInput
- Extract waitForRayClusterReady() and waitForRayPodRunning() helpers
- Simplify newRayClusterWithNativeScheduling() to build on newRayClusterUnstructured()
- Move podGVR to package level (was declared locally in 2 functions)
- Simplify Workload/PodGroup lookups from scan to direct Get by name
- Use rayClusterName variable consistently instead of hardcoded strings
- Net reduction of ~141 lines

- Use NestedFieldNoCopy+BeNumerically instead of NestedInt64 for
  minCount assertion (JSON numbers are float64, not int64)
- Strengthen gang scheduling assertion to verify all-or-nothing:
  at most 1 worker Running, >= 19 Pending (not just pendingCount > 0)
- Hardcode apiServer feature-gates consistently with controller-manager
  and scheduler (remove K8S_FEATURE_GATES override that could cause
  component feature gate mismatch)
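
The first two points can be sketched with the standard library. This is illustrative only: the flat `minCount` JSON document is an assumption about shape, `minCountOf` and `gangHeld` are hypothetical helpers, and the thresholds 1 and 19 are taken from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// minCountOf decodes raw JSON and returns the minCount field. Decoding into
// interface{} with encoding/json yields float64 for every JSON number.
func minCountOf(raw []byte) (float64, bool) {
	var obj map[string]interface{}
	if err := json.Unmarshal(raw, &obj); err != nil {
		return 0, false
	}
	n, ok := obj["minCount"].(float64)
	return n, ok
}

// gangHeld reports whether the worker pod phases look all-or-nothing:
// at most maxRunning workers Running and at least minPending Pending.
func gangHeld(phases []string, maxRunning, minPending int) bool {
	running, pending := 0, 0
	for _, phase := range phases {
		switch phase {
		case "Running":
			running++
		case "Pending":
			pending++
		}
	}
	return running <= maxRunning && pending >= minPending
}

func main() {
	raw := []byte(`{"minCount": 3}`)

	var obj map[string]interface{}
	_ = json.Unmarshal(raw, &obj)
	_, isInt64 := obj["minCount"].(int64) // false: the value is a float64
	n, isFloat64 := minCountOf(raw)
	fmt.Println(isInt64, isFloat64, n)

	// 1 Running and 19 Pending satisfies the strengthened assertion.
	phases := append([]string{"Running"}, make([]string, 0)...)
	for i := 0; i < 19; i++ {
		phases = append(phases, "Pending")
	}
	fmt.Println(gangHeld(phases, 1, 19))
}
```

A partially scheduled gang, for example 5 Running and 15 Pending, passes a bare `pendingCount > 0` check but fails `gangHeld`, which is why the stricter bounds matter.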
@mboersma mboersma force-pushed the kuberay-native-scheduling-e2e branch from 84b6340 to 46bc744 Compare June 12, 2026 20:16
@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

Upstream Kubernetes renamed the in-tree gang-scheduling API group
version from scheduling.k8s.io/v1alpha2 to v1alpha3. The ci-version
flavor floats on ci/latest, so once CI advanced past that rename the
apiserver rejected --runtime-config=scheduling.k8s.io/v1alpha2=true
("group version ... has not been registered") and the control plane
never initialized. Update the runtime-config and the test GVRs to
v1alpha3 to track current master.
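
One way to keep the runtime-config flag and the test GVRs in step is to derive both from a single version string. The sketch below assumes that approach; the `gvr` struct mirrors the shape of a GroupVersionResource, and the plural resource names `workloads` and `podgroups` are assumptions, not values confirmed by this PR.

```go
package main

import "fmt"

// gvr mirrors the shape of a Kubernetes GroupVersionResource.
type gvr struct {
	Group, Version, Resource string
}

const schedulingGroup = "scheduling.k8s.io"

// schedulingGVRs returns the Workload and PodGroup GVRs for one API version.
// The plural resource names are assumed for illustration.
func schedulingGVRs(version string) []gvr {
	return []gvr{
		{Group: schedulingGroup, Version: version, Resource: "workloads"},
		{Group: schedulingGroup, Version: version, Resource: "podgroups"},
	}
}

// runtimeConfigFlag renders the matching kube-apiserver flag.
func runtimeConfigFlag(version string) string {
	return fmt.Sprintf("--runtime-config=%s/%s=true", schedulingGroup, version)
}

func main() {
	const version = "v1alpha3" // was v1alpha2 before the upstream rename
	fmt.Println(runtimeConfigFlag(version))
	for _, r := range schedulingGVRs(version) {
		fmt.Printf("%s/%s %s\n", r.Group, r.Version, r.Resource)
	}
}
```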
@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

Upstream Kubernetes removed the GangScheduling feature gate; the gang
scheduling behavior is now activated by the GenericWorkload gate, which
auto-enables the GangScheduling scheduler plugin (see scheduler
default_plugins.go applyFeatureGates). Passing the no-longer-recognized
GangScheduling=true to kube-scheduler made it exit 1 in a crash loop,
which blocked CNI and cloud-provider scheduling, so the control plane
node never became Ready and KCP never reported Initialized. Set the
scheduler feature-gates to GenericWorkload=true only to track current
master.
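
A sketch of rendering a feature-gates flag while dropping gates that upstream has removed. The template in this PR sets the flag in cluster YAML, so `featureGatesFlag` is a hypothetical helper used only to illustrate the before and after values.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// featureGatesFlag renders --feature-gates from a gate map, skipping any
// gate listed in removed. Names are sorted so the output is deterministic.
func featureGatesFlag(gates, removed map[string]bool) string {
	names := make([]string, 0, len(gates))
	for name := range gates {
		if !removed[name] {
			names = append(names, name)
		}
	}
	sort.Strings(names)

	pairs := make([]string, 0, len(names))
	for _, name := range names {
		pairs = append(pairs, fmt.Sprintf("%s=%t", name, gates[name]))
	}
	return "--feature-gates=" + strings.Join(pairs, ",")
}

func main() {
	gates := map[string]bool{"GenericWorkload": true, "GangScheduling": true}

	// Before: both gates are passed, and kube-scheduler exits on the unknown one.
	fmt.Println(featureGatesFlag(gates, nil))

	// After: GangScheduling is dropped because upstream removed the gate.
	fmt.Println(featureGatesFlag(gates, map[string]bool{"GangScheduling": true}))
}
```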
@mboersma

Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

@k8s-ci-robot

Contributor

@mboersma: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-provider-azure-e2e-kuberay | 18b8f37 | link | false | /test pull-cluster-api-provider-azure-e2e-kuberay |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@mboersma

Contributor Author

@marosset — CAPZ side is fixed and the cluster comes up healthy. The test now fails only because the KubeRay operator crash-loops on startup:

NativeWorkloadScheduling feature gate is enabled but scheduling.k8s.io/v1alpha2 API is not available.

Kubernetes master renamed scheduling.k8s.io/v1alpha2 to v1alpha3. Action needed on the workload-poc branch: switch the operator to scheduling.k8s.io/v1alpha3 and rebuild the operator image. Once that's pushed, this PR should pass as-is.

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

Status: Todo

2 participants