Update kubeflow/trainer manifests from v2.2.0#3413

Merged: google-oss-prow[bot] merged 6 commits into master from synchronize-trainer-manifests-v2.2.0 on Mar 23, 2026

Conversation

@juliusvonkohout
Member

@tarekabouzeid

@Raakshass I think we also have to adjust our trainer_job.yaml according to what https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb produces and make sure it satisfies PSS restricted.

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Removed patches for jobset-controller-manager and kubeflow-trainer-controller-manager deployments.

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Updates the repository’s Kubeflow Trainer manifests to the v2.2.0 upstream revision, including refreshed third-party dependencies and new/updated TrainingRuntime examples.

Changes:

  • Bump Trainer sync target to v2.2.0 and update README upstream reference.
  • Update third-party controller dependencies (LeaderWorkerSet and JobSet) and manager/runtimes overlays.
  • Add new runtime manifests (e.g., JAX and XGBoost) and update CRDs/RBAC to match upstream.

Reviewed changes

Copilot reviewed 33 out of 36 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `scripts/synchronize-trainer-manifests.sh` | Bumps synced upstream revision to v2.2.0. |
| `applications/trainer/upstream/third-party/leaderworkerset/kustomization.yaml` | Updates LWS dependency ref to v0.8.0 and removes now-unused patches. |
| `applications/trainer/upstream/third-party/jobset/kustomization.yaml` | Updates JobSet dependency to v0.11.0. |
| `applications/trainer/upstream/overlays/runtimes/kustomization.yaml` | Updates runtime image tags and adds XGBoost image pinning. |
| `applications/trainer/upstream/overlays/manager/kustomization.yaml` | Updates controller image tag and adds a public ConfigMap generator. |
| `applications/trainer/upstream/overlays/kubeflow-platform/kustomization.yaml` | Adds deployment patches for Istio and label selectors. |
| `applications/trainer/upstream/overlays/kubeflow-platform/patches/*` | Adds Istio inbound-port exclusions and explicit selectors/labels. |
| `applications/trainer/upstream/overlays/data-cache/kustomization.yaml` | Updates data-cache overlay wiring and removes JobSet selector patches. |
| `applications/trainer/upstream/overlays/data-cache/cluster_role.yaml` | Tightens cache initializer ClusterRole permissions. |
| `applications/trainer/upstream/base/runtimes/kustomization.yaml` | Registers new runtime resources. |
| `applications/trainer/upstream/base/runtimes/jax_distributed.yaml` | Adds a JAX ClusterTrainingRuntime example. |
| `applications/trainer/upstream/base/runtimes/xgboost_distributed.yaml` | Adds an XGBoost ClusterTrainingRuntime example. |
| `applications/trainer/upstream/base/runtimes/torch_distributed.yaml` | Updates torch runtime policy shape and container image tag. |
| `applications/trainer/upstream/base/runtimes/data-cache/torch_distributed_with_cache.yaml` | Updates cache image version and torch runtime policy shape. |
| `applications/trainer/upstream/base/runtimes/torchtune/**` | Switches initializer PVC handling to volumeClaimPolicies. |
| `applications/trainer/upstream/base/manager/manager.yaml` | Updates security context, ports, probes, and service ports (metrics/status). |
| `applications/trainer/upstream/base/manager/controller_manager_config.yaml` | Adds statusServer configuration. |
| `applications/trainer/upstream/base/rbac/role.yaml` | Adjusts RBAC (events API groups, JobSet delete, etc.). |
| `applications/trainer/upstream/base/rbac/public_configmap_role*.yaml` | Adds RBAC to make the public ConfigMap readable. |
| `applications/trainer/upstream/base/rbac/kustomization.yaml` | Includes new public ConfigMap RBAC resources. |
| `applications/trainer/upstream/base/crds/trainer.kubeflow.org_*trainingruntimes.yaml` | Updates CRD schemas/validations to upstream v2.2.0. |
| `README.md` | Updates documented Trainer upstream revision to v2.2.0. |

Comments suppressed due to low confidence (1)

applications/trainer/upstream/base/runtimes/data-cache/torch_distributed_with_cache.yaml:35

  • CACHE_IMAGE is hardcoded to v2.2.0-rc.0, which is inconsistent with the synced upstream revision being v2.2.0 elsewhere in this PR. Please update this to the same release version you are shipping (or centralize the version so this does not drift).
                  - name: dataset-initializer
                    image: ghcr.io/kubeflow/trainer/dataset-initializer
                    env:
                      - name: CACHE_IMAGE
                        value: "ghcr.io/kubeflow/trainer/data-cache:v2.2.0-rc.0"
                      - name: TRAIN_JOB_NAME
                        valueFrom:
                          fieldRef:
                            apiVersion: v1
                            fieldPath: metadata.labels['jobset.sigs.k8s.io/jobset-name']
      - name: node

Comment thread applications/trainer/upstream/overlays/manager/kustomization.yaml
Comment thread applications/trainer/upstream/overlays/runtimes/kustomization.yaml Outdated
Comment thread applications/trainer/upstream/base/rbac/public_configmap_role.yaml
Comment thread README.md
@juliusvonkohout
Member Author

@andreyvelich for review

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>
@tarekabouzeid
Member

/lgtm

@google-oss-prow google-oss-prow Bot added the lgtm label Mar 20, 2026
Raakshass added a commit to Raakshass/manifests that referenced this pull request Mar 20, 2026
- Add sidecar.istio.io/inject: false label to Trainer and JobSet
  controller pod templates (consistent with training-operator pattern)
- Add traffic.sidecar.istio.io/excludeInboundPorts annotation for
  webhook ports 9443 and 10443
- Implement caBundle readiness check before applying runtimes
- Add retry loop for runtimes apply to handle Secret volume mount
  propagation delay

Addresses CI failure in PR kubeflow#3413 per Julius's direction.

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
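
The first two bullets of the commit message above could be expressed as a strategic-merge patch roughly like the following. This is only a sketch: the Deployment name is taken from the commit list earlier in this PR, and the actual layout of the patch files in the referenced commit is an assumption. Note that the `excludeInboundPorts` annotation only has an effect on pods that actually receive a sidecar.

```yaml
# Sketch of a strategic-merge patch; not the actual file from the referenced commit.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubeflow-trainer-controller-manager
spec:
  template:
    metadata:
      labels:
        # Opt the controller pod out of Istio sidecar injection.
        sidecar.istio.io/inject: "false"
      annotations:
        # Keep webhook traffic on these ports out of the sidecar's inbound capture.
        traffic.sidecar.istio.io/excludeInboundPorts: "9443,10443"
```

A similar patch would be needed for the jobset-controller-manager Deployment.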
@andreyvelich
Member

@juliusvonkohout For testing you can use the simplest TrainJob YAML:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: trainjob-test
spec:
  runtimeRef:
    name: torch-distributed

@juliusvonkohout
Member Author

juliusvonkohout commented Mar 20, 2026

> @juliusvonkohout For testing you can use the simplest TrainJob YAML:
>
>     apiVersion: trainer.kubeflow.org/v1alpha1
>     kind: TrainJob
>     metadata:
>       name: trainjob-test
>     spec:
>       runtimeRef:
>         name: torch-distributed

Could you provide a full example that satisfies PSS restricted?

@Raakshass
Contributor

The webhook timeout fix in #3415 is passing now; the remaining CI failure there is the trainer_job.yaml step. The TrainJob is accepted, but the controller creates 0 pods (likely a PSS restricted violation, since there is no securityContext on the generated pods).

Should I update trainer_job.yaml + trainer_test.sh in #3415, or would you prefer that as a separate commit on this PR?

Waiting on @andreyvelich's PSS restricted example before touching the job spec.
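
For reference, the security settings that the Kubernetes Pod Security Standards "restricted" profile expects on a pod look roughly like this. The sketch uses a plain Pod rather than a TrainJob, because how these fields are propagated through the Trainer runtime or TrainJob API is not shown in this thread; the pod name and image are placeholders.

```yaml
# Generic Pod illustrating PSS restricted requirements; names and image are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: pss-restricted-example
spec:
  securityContext:
    runAsNonRoot: true            # containers must not run as root
    seccompProfile:
      type: RuntimeDefault        # RuntimeDefault or Localhost is required
  containers:
    - name: node
      image: registry.example.com/trainer-image:tag   # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]           # all capabilities must be dropped
```

Pods generated without these fields are rejected at admission in a namespace that enforces the restricted profile, which would be consistent with the "0 pods" symptom described above.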

@juliusvonkohout
Member Author

juliusvonkohout commented Mar 21, 2026

> The webhook timeout fix in #3415 is passing now; the remaining CI failure there is the trainer_job.yaml step. The TrainJob is accepted, but the controller creates 0 pods (likely a PSS restricted violation, since there is no securityContext on the generated pods).
>
> Should I update trainer_job.yaml + trainer_test.sh in #3415, or would you prefer that as a separate commit on this PR?
>
> Waiting on @andreyvelich's PSS restricted example before touching the job spec.

Yes, please update it in #3415.

* fix(trainer): resolve webhook timeout in v2.2.0 CI

Root cause (proven from CI diagnostic logs):

1. Install: Redundant runtimes re-apply triggered unreachable webhook.
   Fix: Remove the step; runtimes already deployed before webhook
   registers.

2. Test: Istio sidecar intercepts outbound traffic from trainer
   controller to K8s API server on port 443. When API server calls
   jobset mutating webhook, the call times out because it routes
   through the trainer's Envoy sidecar.
   Fix: Add excludeOutboundPorts: 443 annotation to both trainer and
   jobset controller pod templates, alongside existing
   excludeInboundPorts: 9443.

3. Timing: kubectl rollout restart creates new jobset pod on a
   different node. kube-proxy endpoint propagation delay causes
   transient webhook unreachability.
   Fix: Add 30s sleep after restart for endpoint propagation. Add
   retry loop around TrainJob creation in test script.

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

* fix(trainer): correct jobset-webhook podSelector labels

The jobset-webhook NetworkPolicy was incorrectly targeting pods with the label 'service.istio.io/canonical-name: jobset-controller-manager'. The jobset pods actually carry the label 'app.kubernetes.io/name: jobset'. This mismatch resulted in a default-deny rule dropping webhook validation requests, causing TrainJob deployments to time out. Patched the podSelector to match the correct standard labels.

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
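
A corrected policy along the lines described in this commit message might look like the sketch below. Only the policy name and the `app.kubernetes.io/name: jobset` label come from the commit message; the ingress rule and the port are assumptions for illustration.

```yaml
# Sketch of the corrected NetworkPolicy; the ingress section is an assumption.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: jobset-webhook
spec:
  podSelector:
    matchLabels:
      # Label the jobset pods actually carry, replacing
      # service.istio.io/canonical-name: jobset-controller-manager.
      app.kubernetes.io/name: jobset
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: 9443   # assumed webhook port
```

If the podSelector matches no pods, this allow rule never applies to the webhook pods, and any default-deny policy in the namespace drops the API server's webhook calls.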

* fix: address review — remove non-essential changes

Per reviewer feedback, stripped all symptom patches that are not strictly needed now that the root cause (jobset-webhook NetworkPolicy podSelector mismatch) is fixed:
- Removed excludeOutboundPorts annotations from both istio patches
- Reverted trainer_test.sh to base (no duplicated cert waits, no retry loop, no timeout bump)
- Restored runtimes build line in trainer_install.sh

Only the NetworkPolicy fix and install.sh webhook cert waits remain.

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

---------

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
@google-oss-prow google-oss-prow Bot removed the lgtm label Mar 23, 2026
@juliusvonkohout
Member Author

/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@kunal-511 kunal-511 left a comment

/lgtm

@google-oss-prow google-oss-prow Bot added the lgtm label Mar 23, 2026
@google-oss-prow google-oss-prow Bot merged commit 8b00e94 into master Mar 23, 2026
13 checks passed
@google-oss-prow google-oss-prow Bot deleted the synchronize-trainer-manifests-v2.2.0 branch March 23, 2026 14:28
Raakshass added a commit to Raakshass/manifests that referenced this pull request Mar 27, 2026
* Update kubeflow/trainer manifests from v2.2.0

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>

* Update pull request paths in trainer_test.yaml

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

* Remove patches for deployments in kustomization.yaml

Removed patches for jobset-controller-manager and kubeflow-trainer-controller-manager deployments.

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

* fixes

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>

* fix(trainer): resolve webhook timeout in v2.2.0 CI (kubeflow#3415)

---------

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
Co-authored-by: Siddhant Jain <149181251+Raakshass@users.noreply.github.com>
google-oss-prow Bot pushed a commit that referenced this pull request May 24, 2026
…es (#3474)

Restructures the Upgrading and Extending section per maintainer request
in #3428. Adds general immutable field troubleshooting section and
26.03 upgrade path documenting jobset immutable selector (#3413) and
KServe llmisvc rolebinding cleanup.

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>