Update kubeflow/trainer manifests from v2.2.0 #3413
Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Removed patches for jobset-controller-manager and kubeflow-trainer-controller-manager deployments. Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Pull request overview
Updates the repository’s Kubeflow Trainer manifests to the v2.2.0 upstream revision, including refreshed third-party dependencies and new/updated TrainingRuntime examples.
Changes:
- Bump Trainer sync target to `v2.2.0` and update README upstream reference.
- Update third-party controller dependencies (LeaderWorkerSet and JobSet) and manager/runtimes overlays.
- Add new runtime manifests (e.g., JAX and XGBoost) and update CRDs/RBAC to match upstream.
Reviewed changes
Copilot reviewed 33 out of 36 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| scripts/synchronize-trainer-manifests.sh | Bumps synced upstream revision to v2.2.0. |
| applications/trainer/upstream/third-party/leaderworkerset/kustomization.yaml | Updates LWS dependency ref to v0.8.0 and removes now-unused patches. |
| applications/trainer/upstream/third-party/jobset/kustomization.yaml | Updates JobSet dependency to v0.11.0. |
| applications/trainer/upstream/overlays/runtimes/kustomization.yaml | Updates runtime image tags and adds XGBoost image pinning. |
| applications/trainer/upstream/overlays/manager/kustomization.yaml | Updates controller image tag and adds a public ConfigMap generator. |
| applications/trainer/upstream/overlays/kubeflow-platform/kustomization.yaml | Adds deployment patches for Istio and label selectors. |
| applications/trainer/upstream/overlays/kubeflow-platform/patches/* | Adds Istio inbound-port exclusions and explicit selectors/labels. |
| applications/trainer/upstream/overlays/data-cache/kustomization.yaml | Updates data-cache overlay wiring and removes JobSet selector patches. |
| applications/trainer/upstream/overlays/data-cache/cluster_role.yaml | Tightens cache initializer ClusterRole permissions. |
| applications/trainer/upstream/base/runtimes/kustomization.yaml | Registers new runtime resources. |
| applications/trainer/upstream/base/runtimes/jax_distributed.yaml | Adds a JAX ClusterTrainingRuntime example. |
| applications/trainer/upstream/base/runtimes/xgboost_distributed.yaml | Adds an XGBoost ClusterTrainingRuntime example. |
| applications/trainer/upstream/base/runtimes/torch_distributed.yaml | Updates torch runtime policy shape and container image tag. |
| applications/trainer/upstream/base/runtimes/data-cache/torch_distributed_with_cache.yaml | Updates cache image version and torch runtime policy shape. |
| applications/trainer/upstream/base/runtimes/torchtune/** | Switches initializer PVC handling to volumeClaimPolicies. |
| applications/trainer/upstream/base/manager/manager.yaml | Updates security context, ports, probes, and service ports (metrics/status). |
| applications/trainer/upstream/base/manager/controller_manager_config.yaml | Adds statusServer configuration. |
| applications/trainer/upstream/base/rbac/role.yaml | Adjusts RBAC (events API groups, JobSet delete, etc.). |
| applications/trainer/upstream/base/rbac/public_configmap_role*.yaml | Adds RBAC to make the public ConfigMap readable. |
| applications/trainer/upstream/base/rbac/kustomization.yaml | Includes new public ConfigMap RBAC resources. |
| applications/trainer/upstream/base/crds/trainer.kubeflow.org_*trainingruntimes.yaml | Updates CRD schemas/validations to upstream v2.2.0. |
| README.md | Updates documented Trainer upstream revision to v2.2.0. |
Comments suppressed due to low confidence (1)
applications/trainer/upstream/base/runtimes/data-cache/torch_distributed_with_cache.yaml:35
`CACHE_IMAGE` is hardcoded to `v2.2.0-rc.0`, which is inconsistent with the synced upstream revision being `v2.2.0` elsewhere in this PR. Please update this to the same release version you are shipping (or centralize the version so this does not drift).
- name: dataset-initializer
  image: ghcr.io/kubeflow/trainer/dataset-initializer
  env:
    - name: CACHE_IMAGE
      value: "ghcr.io/kubeflow/trainer/data-cache:v2.2.0-rc.0"
    - name: TRAIN_JOB_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['jobset.sigs.k8s.io/jobset-name']
- name: node
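One way to catch this kind of version drift before review is a small script that scans manifest text for Trainer image references and reports tags that differ from the release being shipped. The sketch below is only an illustration: the helper name `find_tag_drift` and the regular expression are assumptions, not part of this repository's tooling, and images referenced without a tag (such as the `dataset-initializer` image above) are not reported.

```python
import re

# Assumed pattern: matches "ghcr.io/kubeflow/trainer/<name>:<tag>" references.
IMAGE_PATTERN = re.compile(r"(ghcr\.io/kubeflow/trainer/[\w.\-/]+):([\w.\-]+)")


def find_tag_drift(manifest_text: str, expected_tag: str) -> list[tuple[str, str]]:
    """Return (image, tag) pairs whose tag differs from expected_tag."""
    return [
        (image, tag)
        for image, tag in IMAGE_PATTERN.findall(manifest_text)
        if tag != expected_tag
    ]


# Manifest fragment taken from the review comment above.
manifest = '''
- name: CACHE_IMAGE
  value: "ghcr.io/kubeflow/trainer/data-cache:v2.2.0-rc.0"
'''

drift = find_tag_drift(manifest, "v2.2.0")
print(drift)
```

Running it against the fragment flags the `data-cache` image because its tag is `v2.2.0-rc.0` rather than `v2.2.0`.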
@andreyvelich for review
….com/kubeflow/manifests into synchronize-trainer-manifests-v2.2.0
/lgtm
- Add sidecar.istio.io/inject: false label to Trainer and JobSet controller pod templates (consistent with training-operator pattern)
- Add traffic.sidecar.istio.io/excludeInboundPorts annotation for webhook ports 9443 and 10443
- Implement caBundle readiness check before applying runtimes
- Add retry loop for runtimes apply to handle Secret volume mount propagation delay

Addresses CI failure in PR kubeflow#3413 per Julius's direction.

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
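The retry loop mentioned in this commit follows a common pattern: repeat an action that may fail transiently, waiting between attempts, and give up after a fixed number of tries. The following is a generic sketch of that pattern, not the code from the install script; the `retry` helper and the simulated `flaky_apply` action are hypothetical stand-ins for the real apply step.

```python
import time
from typing import Callable


def retry(
    action: Callable[[], bool],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call action until it returns True or attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            # Wait before the next try; no wait after the final failure.
            sleep(delay_seconds)
    return False


# Simulated action (assumption): fails twice, then succeeds, mimicking a
# webhook that becomes reachable only after a propagation delay.
calls = {"count": 0}


def flaky_apply() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


succeeded = retry(flaky_apply, attempts=5, delay_seconds=0.0)
print(succeeded, calls["count"])
```

In a real script the action would wrap the apply command and return whether it exited successfully.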
@juliusvonkohout For testing you can use the simplest TrainJob YAML:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: trainjob-test
spec:
  runtimeRef:
    name: torch-distributed
Could you provide a full example that satisfies PSS restricted?
The webhook timeout fix in #3415 is passing now; the CI failure there is now the trainer_job.yaml step. The TrainJob gets accepted but the controller creates 0 pods (likely a PSS restricted violation since there's no …). Should I update trainer_job.yaml + trainer_test.sh in #3415, or would you prefer that as a separate commit on this PR? Waiting on @andreyvelich's PSS restricted example before touching the job spec.
Yes, please update it in #3415.
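For context on what "PSS restricted" requires, the Kubernetes Pod Security Standards restricted profile expects, among other things, that containers disallow privilege escalation, run as non-root, drop all capabilities, and use a `RuntimeDefault` or `Localhost` seccomp profile. The sketch below checks a container spec (as a plain dictionary) against those four rules. It is a simplification under stated assumptions: the helper `restricted_violations` is hypothetical, it only inspects the container-level `securityContext` (the real profile also accepts some settings at pod level), and it does not cover every rule in the profile.

```python
def restricted_violations(container: dict) -> list[str]:
    """List simplified PSS restricted violations for one container spec."""
    security_context = container.get("securityContext") or {}
    problems = []
    if security_context.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")
    if security_context.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")
    drops = (security_context.get("capabilities") or {}).get("drop") or []
    if "ALL" not in drops:
        problems.append("capabilities.drop must include ALL")
    seccomp_type = (security_context.get("seccompProfile") or {}).get("type")
    if seccomp_type not in ("RuntimeDefault", "Localhost"):
        problems.append("seccompProfile.type must be RuntimeDefault or Localhost")
    return problems


# A container without any securityContext fails every check.
bare = {"name": "node", "image": "example.invalid/trainer:latest"}

# A container carrying the settings the restricted profile expects.
hardened = {
    "name": "node",
    "image": "example.invalid/trainer:latest",
    "securityContext": {
        "allowPrivilegeEscalation": False,
        "runAsNonRoot": True,
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"},
    },
}

print(restricted_violations(bare))
print(restricted_violations(hardened))
```

The image names here are placeholders; whether missing settings of this kind are what stops pod creation in the CI run is the hypothesis raised in the comment above, not something this sketch confirms.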
* fix(trainer): resolve webhook timeout in v2.2.0 CI

  Root cause (proven from CI diagnostic logs):

  1. Install: Redundant runtimes re-apply triggered unreachable webhook.
     Fix: Remove the step; runtimes already deployed before webhook registers.
  2. Test: Istio sidecar intercepts outbound traffic from trainer controller to K8s API server on port 443. When API server calls jobset mutating webhook, the call times out because it routes through the trainer's Envoy sidecar.
     Fix: Add excludeOutboundPorts: 443 annotation to both trainer and jobset controller pod templates, alongside existing excludeInboundPorts: 9443.
  3. Timing: kubectl rollout restart creates new jobset pod on a different node. kube-proxy endpoint propagation delay causes transient webhook unreachability.
     Fix: Add 30s sleep after restart for endpoint propagation. Add retry loop around TrainJob creation in test script.

  Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

* fix(trainer): correct jobset-webhook podSelector labels

  The jobset-webhook NetworkPolicy was incorrectly targeting pods with the label 'service.istio.io/canonical-name: jobset-controller-manager'. The jobset pods actually carry the label 'app.kubernetes.io/name: jobset'. This mismatch resulted in a default-deny rule dropping webhook validation requests, causing TrainJob deployments to time out. Patched the podSelector to match the correct standard labels.

  Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

* fix: address review - remove non-essential changes

  Per reviewer feedback, stripped all symptom patches that are not strictly needed now that the root cause (jobset-webhook NetworkPolicy podSelector mismatch) is fixed:

  - Removed excludeOutboundPorts annotations from both istio patches
  - Reverted trainer_test.sh to base (no duplicated cert waits, no retry loop, no timeout bump)
  - Restored runtimes build line in trainer_install.sh

  Only the NetworkPolicy fix and install.sh webhook cert waits remain.

  Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

---------

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
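The podSelector mismatch described in this commit comes down to how `matchLabels` works: a selector matches a pod only if every key/value pair in the selector is present in the pod's labels. The sketch below models that rule with plain dictionaries, using the two labels quoted in the commit message. The function name `selector_matches` is an illustrative assumption, and the pod label set is reduced to the single label the commit mentions.

```python
def selector_matches(match_labels: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """Return True if every matchLabels pair is present on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Label the jobset pods carry, per the commit message.
jobset_pod_labels = {"app.kubernetes.io/name": "jobset"}

# Selector used before the fix: the pods do not have this label.
old_selector = {"service.istio.io/canonical-name": "jobset-controller-manager"}

# Selector after the fix: matches the label the pods actually carry.
new_selector = {"app.kubernetes.io/name": "jobset"}

print(selector_matches(old_selector, jobset_pod_labels))
print(selector_matches(new_selector, jobset_pod_labels))
```

With the old selector the allow rule never applied to the jobset pods, which is consistent with the commit's account of webhook requests being dropped by a default-deny policy.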
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout
* Update kubeflow/trainer manifests from v2.2.0

  Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>

* Update pull request paths in trainer_test.yaml

  Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

* Remove patches for deployments in kustomization.yaml

  Removed patches for jobset-controller-manager and kubeflow-trainer-controller-manager deployments.

  Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

* fixes

  Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>

* fix(trainer): resolve webhook timeout in v2.2.0 CI (kubeflow#3415)

  * fix(trainer): resolve webhook timeout in v2.2.0 CI
  * fix(trainer): correct jobset-webhook podSelector labels
  * fix: address review - remove non-essential changes

  Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

---------

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
Co-authored-by: Siddhant Jain <149181251+Raakshass@users.noreply.github.com>
…es (#3474)

Restructures the Upgrading and Extending section per maintainer request in #3428. Adds a general immutable field troubleshooting section and a 26.03 upgrade path documenting the jobset immutable selector (#3413) and the KServe llmisvc rolebinding cleanup.

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
@tarekabouzeid
@Raakshass I think we also have to adjust our trainer_job.yaml to match what https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb produces, and make sure it satisfies PSS restricted.