Update kubeflow/trainer manifests from v2.2.0 by juliusvonkohout · Pull Request #3413 · kubeflow/manifests

juliusvonkohout · 2026-03-20T21:06:02Z

@Raakshass I think we also have to adjust our trainer_job.yaml according to what https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb produces and make sure it satisfies PSS restricted.

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

Removed patches for jobset-controller-manager and kubeflow-trainer-controller-manager deployments. Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

Copilot

Pull request overview

Updates the repository’s Kubeflow Trainer manifests to the v2.2.0 upstream revision, including refreshed third-party dependencies and new/updated TrainingRuntime examples.

Changes:

- Bump Trainer sync target to v2.2.0 and update README upstream reference.
- Update third-party controller dependencies (LeaderWorkerSet and JobSet) and manager/runtimes overlays.
- Add new runtime manifests (e.g., JAX and XGBoost) and update CRDs/RBAC to match upstream.

Reviewed changes

Copilot reviewed 33 out of 36 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| `scripts/synchronize-trainer-manifests.sh` | Bumps synced upstream revision to `v2.2.0`. |
| `applications/trainer/upstream/third-party/leaderworkerset/kustomization.yaml` | Updates LWS dependency ref to `v0.8.0` and removes now-unused patches. |
| `applications/trainer/upstream/third-party/jobset/kustomization.yaml` | Updates JobSet dependency to `v0.11.0`. |
| `applications/trainer/upstream/overlays/runtimes/kustomization.yaml` | Updates runtime image tags and adds XGBoost image pinning. |
| `applications/trainer/upstream/overlays/manager/kustomization.yaml` | Updates controller image tag and adds a public ConfigMap generator. |
| `applications/trainer/upstream/overlays/kubeflow-platform/kustomization.yaml` | Adds deployment patches for Istio and label selectors. |
| `applications/trainer/upstream/overlays/kubeflow-platform/patches/*` | Adds Istio inbound-port exclusions and explicit selectors/labels. |
| `applications/trainer/upstream/overlays/data-cache/kustomization.yaml` | Updates data-cache overlay wiring and removes JobSet selector patches. |
| `applications/trainer/upstream/overlays/data-cache/cluster_role.yaml` | Tightens cache initializer ClusterRole permissions. |
| `applications/trainer/upstream/base/runtimes/kustomization.yaml` | Registers new runtime resources. |
| `applications/trainer/upstream/base/runtimes/jax_distributed.yaml` | Adds a JAX ClusterTrainingRuntime example. |
| `applications/trainer/upstream/base/runtimes/xgboost_distributed.yaml` | Adds an XGBoost ClusterTrainingRuntime example. |
| `applications/trainer/upstream/base/runtimes/torch_distributed.yaml` | Updates torch runtime policy shape and container image tag. |
| `applications/trainer/upstream/base/runtimes/data-cache/torch_distributed_with_cache.yaml` | Updates cache image version and torch runtime policy shape. |
| `applications/trainer/upstream/base/runtimes/torchtune/**` | Switches initializer PVC handling to `volumeClaimPolicies`. |
| `applications/trainer/upstream/base/manager/manager.yaml` | Updates security context, ports, probes, and service ports (metrics/status). |
| `applications/trainer/upstream/base/manager/controller_manager_config.yaml` | Adds `statusServer` configuration. |
| `applications/trainer/upstream/base/rbac/role.yaml` | Adjusts RBAC (events API groups, JobSet delete, etc.). |
| `applications/trainer/upstream/base/rbac/public_configmap_role*.yaml` | Adds RBAC to make the public ConfigMap readable. |
| `applications/trainer/upstream/base/rbac/kustomization.yaml` | Includes new public ConfigMap RBAC resources. |
| `applications/trainer/upstream/base/crds/trainer.kubeflow.org_*trainingruntimes.yaml` | Updates CRD schemas/validations to upstream v2.2.0. |
| `README.md` | Updates documented Trainer upstream revision to `v2.2.0`. |

Comments suppressed due to low confidence (1)

applications/trainer/upstream/base/runtimes/data-cache/torch_distributed_with_cache.yaml:35

CACHE_IMAGE is hardcoded to v2.2.0-rc.0, which is inconsistent with the synced upstream revision being v2.2.0 elsewhere in this PR. Please update this to the same release version you are shipping (or centralize the version so this does not drift).

                  - name: dataset-initializer
                    image: ghcr.io/kubeflow/trainer/dataset-initializer
                    env:
                      - name: CACHE_IMAGE
                        value: "ghcr.io/kubeflow/trainer/data-cache:v2.2.0-rc.0"
                      - name: TRAIN_JOB_NAME
                        valueFrom:
                          fieldRef:
                            apiVersion: v1
                            fieldPath: metadata.labels['jobset.sigs.k8s.io/jobset-name']
      - name: node

juliusvonkohout · 2026-03-20T21:19:31Z

@andreyvelich for review

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>

….com/kubeflow/manifests into synchronize-trainer-manifests-v2.2.0

tarekabouzeid · 2026-03-20T21:29:51Z

/lgtm

- Add sidecar.istio.io/inject: false label to Trainer and JobSet controller pod templates (consistent with training-operator pattern)
- Add traffic.sidecar.istio.io/excludeInboundPorts annotation for webhook ports 9443 and 10443
- Implement caBundle readiness check before applying runtimes
- Add retry loop for runtimes apply to handle Secret volume mount propagation delay

Addresses CI failure in PR kubeflow#3413 per Julius's direction.

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
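For readers unfamiliar with these Istio settings, the first two items of the commit above might look roughly like the following patch. This is only a sketch, not the patch from the PR: the namespace is an assumption, and the Deployment name is taken from the PR description. The label and annotation keys are standard Istio ones, and both values must be quoted strings.

```yaml
# Hypothetical strategic-merge patch; not copied from the PR.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubeflow-trainer-controller-manager
  namespace: kubeflow-system  # assumption: adjust to the namespace you deploy into
spec:
  template:
    metadata:
      labels:
        # Opt the controller pod out of sidecar injection.
        sidecar.istio.io/inject: "false"
      annotations:
        # Keep webhook traffic on these ports out of the Envoy proxy.
        # Only has an effect on pods that do receive a sidecar.
        traffic.sidecar.istio.io/excludeInboundPorts: "9443,10443"
```

The same pattern would apply to the JobSet controller Deployment, with its own name substituted.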

andreyvelich · 2026-03-20T22:11:41Z

@juliusvonkohout For testing you can use the simplest TrainJob YAML:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: trainjob-test
spec:
  runtimeRef:
    name: torch-distributed

juliusvonkohout · 2026-03-20T22:51:59Z

> @juliusvonkohout For testing you can use the simplest TrainJob YAML:
>
>     apiVersion: trainer.kubeflow.org/v1alpha1
>     kind: TrainJob
>     metadata:
>       name: trainjob-test
>     spec:
>       runtimeRef:
>         name: torch-distributed

Could you provide a full example that satisfies PSS restricted?
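As background for this request, the restricted Pod Security Standard requires roughly the container settings shown below. This is a generic sketch on a plain Pod, not the example asked for in the thread: the namespace name, pod name, image, and user ID are hypothetical, and how these fields are propagated into the pods that a TrainJob generates is not covered here.

```yaml
# Namespace enforcing the restricted profile (name is hypothetical).
apiVersion: v1
kind: Namespace
metadata:
  name: pss-restricted-demo
  labels:
    pod-security.kubernetes.io/enforce: restricted
---
# A Pod that is admitted under the restricted profile.
apiVersion: v1
kind: Pod
metadata:
  name: restricted-example
  namespace: pss-restricted-demo
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000            # hypothetical non-root UID
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: main
      image: registry.example.com/app:latest  # hypothetical image
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
```

Pods that omit these fields are rejected at admission in such a namespace, which would show up as a workload that is accepted but never gets pods created.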

Raakshass · 2026-03-21T11:43:29Z

the webhook timeout fix in #3415 is passing now — the CI failure there is now the trainer_job.yaml step. the trainjob gets accepted but the controller creates 0 pods (likely PSS restricted violation since there's no securityContext on the generated pods).

should i update trainer_job.yaml + trainer_test.sh in #3415, or would you prefer that as a separate commit on this PR?

waiting on @andreyvelich's PSS restricted example before touching the job spec.

juliusvonkohout · 2026-03-21T13:17:17Z

> the webhook timeout fix in #3415 is passing now — the CI failure there is now the trainer_job.yaml step. the trainjob gets accepted but the controller creates 0 pods (likely PSS restricted violation since there's no securityContext on the generated pods).
>
> should i update trainer_job.yaml + trainer_test.sh in #3415, or would you prefer that as a separate commit on this PR?
>
> waiting on @andreyvelich's PSS restricted example before touching the job spec.

Yes, please update it in #3415.

* fix(trainer): resolve webhook timeout in v2.2.0 CI

  Root cause (proven from CI diagnostic logs):

  1. Install: Redundant runtimes re-apply triggered unreachable webhook. Fix: Remove the step; runtimes already deployed before webhook registers.
  2. Test: Istio sidecar intercepts outbound traffic from trainer controller to K8s API server on port 443. When API server calls jobset mutating webhook, the call times out because it routes through the trainer's Envoy sidecar. Fix: Add excludeOutboundPorts: 443 annotation to both trainer and jobset controller pod templates, alongside existing excludeInboundPorts: 9443.
  3. Timing: kubectl rollout restart creates new jobset pod on a different node. kube-proxy endpoint propagation delay causes transient webhook unreachability. Fix: Add 30s sleep after restart for endpoint propagation. Add retry loop around TrainJob creation in test script.

  Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

* fix(trainer): correct jobset-webhook podSelector labels

  The jobset-webhook NetworkPolicy was incorrectly targeting pods with the label 'service.istio.io/canonical-name: jobset-controller-manager'. The jobset pods actually carry the label 'app.kubernetes.io/name: jobset'. This mismatch resulted in a default-deny rule dropping webhook validation requests, causing TrainJob deployments to time out. Patched the podSelector to match the correct standard labels.

  Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

* fix: address review — remove non-essential changes

  Per reviewer feedback, stripped all symptom patches that are not strictly needed now that the root cause (jobset-webhook NetworkPolicy podSelector mismatch) is fixed:

  - Removed excludeOutboundPorts annotations from both istio patches
  - Reverted trainer_test.sh to base (no duplicated cert waits, no retry loop, no timeout bump)
  - Restored runtimes build line in trainer_install.sh

  Only the NetworkPolicy fix and install.sh webhook cert waits remain.

  Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

---------

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
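The podSelector fix described in the second commit could be sketched as the NetworkPolicy below. Only the policy name and the two labels come from the commit message; the namespace, the port, and the overall rule shape are assumptions made for illustration, so treat this as a sketch rather than the manifest that was merged.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: jobset-webhook
  namespace: kubeflow-system  # assumption: namespace is not stated in the thread
spec:
  podSelector:
    matchLabels:
      # Before the fix the selector used
      #   service.istio.io/canonical-name: jobset-controller-manager
      # which matched no jobset pods, so the allow rule never applied to them.
      app.kubernetes.io/name: jobset
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: 9443  # assumption: webhook port mentioned elsewhere in the thread
```

A quick way to confirm that a selector matches is to list pods with the same label, for example with `kubectl get pods -l app.kubernetes.io/name=jobset` in the relevant namespace.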

juliusvonkohout · 2026-03-23T14:02:07Z

/approve

google-oss-prow · 2026-03-23T14:02:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [juliusvonkohout]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kunal-511

/lgtm


…es (#3474) Restructures the Upgrading and Extending section per maintainer request in #3428. Adds general immutable field troubleshooting section and 26.03 upgrade path documenting jobset immutable selector (#3413) and KServe llmisvc rolebinding cleanup. Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

Update kubeflow/trainer manifests from v2.2.0

79e60e6

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 20, 2026 21:06

google-oss-prow Bot requested review from kimwnasptd and tarekabouzeid March 20, 2026 21:06

google-oss-prow Bot added the size/XXL label Mar 20, 2026

juliusvonkohout mentioned this pull request Mar 20, 2026

Update kubeflow/trainer manifests from v2.2.0-rc.0 #3395

Closed

Copilot started reviewing on behalf of juliusvonkohout March 20, 2026 21:06

juliusvonkohout added 2 commits March 20, 2026 22:07

Update pull request paths in trainer_test.yaml

8a870ee

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

Remove patches for deployments in kustomization.yaml

f9c0411

Removed patches for jobset-controller-manager and kubeflow-trainer-controller-manager deployments. Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

juliusvonkohout mentioned this pull request Mar 20, 2026

cleanup trainer overlay #3314

Closed

Copilot AI reviewed Mar 20, 2026


juliusvonkohout added 2 commits March 20, 2026 22:25

fixes

6d72a07

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>

Merge branch 'synchronize-trainer-manifests-v2.2.0' of https://github…

70cf580

….com/kubeflow/manifests into synchronize-trainer-manifests-v2.2.0

google-oss-prow Bot assigned tarekabouzeid Mar 20, 2026

google-oss-prow Bot added the lgtm label Mar 20, 2026

Raakshass mentioned this pull request Mar 21, 2026

fix(trainer): resolve webhook timeout in v2.2.0 CI #3415

Merged

5 tasks

google-oss-prow Bot removed the lgtm label Mar 23, 2026

google-oss-prow Bot added the approved label Mar 23, 2026

kunal-511 approved these changes Mar 23, 2026


google-oss-prow Bot assigned kunal-511 Mar 23, 2026

google-oss-prow Bot added the lgtm label Mar 23, 2026

google-oss-prow Bot merged commit 8b00e94 into master Mar 23, 2026
13 checks passed

google-oss-prow Bot deleted the synchronize-trainer-manifests-v2.2.0 branch March 23, 2026 14:28

Raakshass mentioned this pull request Mar 24, 2026

fix(tests): modernize trainer test to use Kubeflow SDK #3421

Merged

2 tasks

christian-heusel mentioned this pull request Mar 27, 2026

Upgrading error on jobset-controller-manager Deployment #3428

Closed

8 tasks

Raakshass mentioned this pull request May 22, 2026

documentation: restructure Upgrading section with version-specific upgrade notes #3474

Merged

6 participants

juliusvonkohout commented Mar 20, 2026 •

edited

Loading

juliusvonkohout commented Mar 21, 2026 •

edited

Loading