Commit 43b800e

Update kubeflow/trainer manifests from v2.2.0 (kubeflow#3413)
* Update kubeflow/trainer manifests from v2.2.0

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>

* Update pull request paths in trainer_test.yaml

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

* Remove patches for deployments in kustomization.yaml

Removed patches for jobset-controller-manager and kubeflow-trainer-controller-manager deployments.

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

* fixes

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>

* fix(trainer): resolve webhook timeout in v2.2.0 CI (kubeflow#3415)

* fix(trainer): resolve webhook timeout in v2.2.0 CI

Root cause (proven from CI diagnostic logs):

1. Install: Redundant runtimes re-apply triggered unreachable webhook.
   Fix: Remove the step; runtimes already deployed before webhook registers.

2. Test: Istio sidecar intercepts outbound traffic from trainer controller to K8s API server on port 443. When API server calls jobset mutating webhook, the call times out because it routes through the trainer's Envoy sidecar.
   Fix: Add excludeOutboundPorts: 443 annotation to both trainer and jobset controller pod templates, alongside existing excludeInboundPorts: 9443.

3. Timing: kubectl rollout restart creates new jobset pod on a different node. kube-proxy endpoint propagation delay causes transient webhook unreachability.
   Fix: Add 30s sleep after restart for endpoint propagation. Add retry loop around TrainJob creation in test script.

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

* fix(trainer): correct jobset-webhook podSelector labels

The jobset-webhook NetworkPolicy was incorrectly targeting pods with the label 'service.istio.io/canonical-name: jobset-controller-manager'. The jobset pods actually carry the label 'app.kubernetes.io/name: jobset'.
This mismatch resulted in a default-deny rule dropping webhook validation requests, causing TrainJob deployments to time out. Patched the podSelector to match the correct standard labels.

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

* fix: address review — remove non-essential changes

Per reviewer feedback, stripped all symptom patches that are not strictly needed now that the root cause (jobset-webhook NetworkPolicy podSelector mismatch) is fixed:
- Removed excludeOutboundPorts annotations from both istio patches
- Reverted trainer_test.sh to base (no duplicated cert waits, no retry loop, no timeout bump)
- Restored runtimes build line in trainer_install.sh

Only the NetworkPolicy fix and install.sh webhook cert waits remain.

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

---------

Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>

---------

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
Co-authored-by: Siddhant Jain <149181251+Raakshass@users.noreply.github.com>
1 parent 9c991d1 commit 43b800e

43 files changed

Lines changed: 6631 additions & 4834 deletions
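
The podSelector fix described in the commit message can be pictured roughly as follows. This is a minimal sketch, not the manifest from the repository: only the policy name `jobset-webhook` and the two labels come from the commit message, while the namespace, the ingress rule, and port 9443 are assumptions made for illustration.

```yaml
# Hypothetical sketch of a NetworkPolicy with the corrected podSelector.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: jobset-webhook
  namespace: kubeflow-system        # assumption: namespace is not stated in the commit message
spec:
  podSelector:
    matchLabels:
      # Before the fix the selector used
      #   service.istio.io/canonical-name: jobset-controller-manager
      # which, per the commit message, the jobset pods do not carry.
      app.kubernetes.io/name: jobset
  policyTypes:
  - Ingress
  ingress:
  - ports:                          # assumption: allow the webhook port from any source
    - protocol: TCP
      port: 9443
```

A NetworkPolicy only applies to pods matched by its `podSelector`, so an allow rule whose selector matches no pods leaves those pods covered solely by whatever default-deny policy exists, which is consistent with the timeouts the commit message reports.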

.github/workflows/trainer_test.yaml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ on:
     paths:
     - tests/install_KinD_create_KinD_cluster_install_kustomize.sh
     - .github/workflows/trainer_test.yaml
-    - applications/trainer/upstream/**
+    - applications/trainer/**
     - tests/trainer*
     - tests/istio*
     - tests/oauth2-proxy_install.sh

README.md

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ This repository periodically synchronizes all official Kubeflow components from
 | Component | Local Manifests Path | Upstream Revision | CPU (millicores) | Memory (Mi) | PVC Storage (GB) |
 | - | - | - | - | - | - |
 | Training Operator | applications/training-operator/upstream | [v1.9.2](https://github.com/kubeflow/training-operator/tree/v1.9.2/manifests) | 3m | 25Mi | 0GB |
-| Trainer | applications/trainer/upstream | [v2.1.0](https://github.com/kubeflow/trainer/tree/v2.1.0/manifests) | 8m | 143Mi | 0GB |
+| Trainer | applications/trainer/upstream | [v2.2.0](https://github.com/kubeflow/trainer/tree/v2.2.0/manifests) | 8m | 143Mi | 0GB |
 | Notebook Controller | applications/jupyter/notebook-controller/upstream | [v1.10.0](https://github.com/kubeflow/kubeflow/tree/v1.10.0/components/notebook-controller/config) | 5m | 93Mi | 0GB |
 | PVC Viewer Controller | applications/pvcviewer-controller/upstream | [v1.10.0](https://github.com/kubeflow/kubeflow/tree/v1.10.0/components/pvcviewer-controller/config) | 15m | 128Mi | 0GB |
 | Tensorboard Controller | applications/tensorboard/tensorboard-controller/upstream | [v1.10.0](https://github.com/kubeflow/kubeflow/tree/v1.10.0/components/tensorboard-controller/config) | 15m | 128Mi | 0GB |

applications/trainer/overlays/kustomization.yaml

Lines changed: 0 additions & 42 deletions
@@ -4,45 +4,3 @@ namespace: kubeflow-system

 resources:
 - ../upstream/overlays/kubeflow-platform
-
-patches:
-- target:
-    group: apps
-    version: v1
-    kind: Deployment
-    name: jobset-controller-manager
-  patch: |-
-    apiVersion: apps/v1
-    kind: Deployment
-    metadata:
-      name: jobset-controller-manager
-      namespace: kubeflow-system
-    spec:
-      template:
-        metadata:
-          annotations:
-            traffic.sidecar.istio.io/excludeInboundPorts: "9443"
-        spec:
-          securityContext:
-            seccompProfile:
-              type: RuntimeDefault
-- target:
-    group: apps
-    version: v1
-    kind: Deployment
-    name: kubeflow-trainer-controller-manager
-  patch: |-
-    apiVersion: apps/v1
-    kind: Deployment
-    metadata:
-      name: kubeflow-trainer-controller-manager
-      namespace: kubeflow-system
-    spec:
-      template:
-        metadata:
-          annotations:
-            traffic.sidecar.istio.io/excludeInboundPorts: "9443"
-        spec:
-          securityContext:
-            seccompProfile:
-              type: RuntimeDefault
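
For orientation, the overlay that remains after this 42-line deletion would look roughly like the sketch below. The first two lines of the file are not shown in the diff, so the `apiVersion` and `kind` values here are assumptions (the conventional Kustomization header); the `namespace` and `resources` entries come from the hunk itself.

```yaml
# Sketch of applications/trainer/overlays/kustomization.yaml after the change.
apiVersion: kustomize.config.k8s.io/v1beta1   # assumption: not visible in the diff
kind: Kustomization                           # assumption: not visible in the diff
namespace: kubeflow-system

resources:
- ../upstream/overlays/kubeflow-platform
```

With no `patches` section left, the overlay passes the upstream `kubeflow-platform` manifests through unchanged apart from the namespace.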

applications/trainer/upstream/base/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml

Lines changed: 627 additions & 567 deletions
Large diffs are not rendered by default.

applications/trainer/upstream/base/crds/trainer.kubeflow.org_trainingruntimes.yaml

Lines changed: 627 additions & 567 deletions
Large diffs are not rendered by default.

applications/trainer/upstream/base/crds/trainer.kubeflow.org_trainjobs.yaml

Lines changed: 5098 additions & 3470 deletions
Large diffs are not rendered by default.

applications/trainer/upstream/base/manager/controller_manager_config.yaml

Lines changed: 5 additions & 0 deletions
@@ -42,3 +42,8 @@ certManagement:
 clientConnection:
   qps: 50
   burst: 100
+
+statusServer:
+  port: 10443
+  qps: 5
+  burst: 10

applications/trainer/upstream/base/manager/manager.yaml

Lines changed: 28 additions & 5 deletions
@@ -24,31 +24,50 @@ spec:
         - name: manager
           image: ghcr.io/kubeflow/trainer/trainer-controller-manager
           securityContext:
+            readOnlyRootFilesystem: true
             allowPrivilegeEscalation: false
             runAsNonRoot: true
             capabilities:
               drop:
                 - ALL
             seccompProfile:
               type: RuntimeDefault
+
+          ports:
+            - name: health
+              containerPort: 8081
+              protocol: TCP
+            - name: metrics
+              containerPort: 8443
+              protocol: TCP
+            - name: webhook
+              containerPort: 9443
+              protocol: TCP
+            - name: status-server
+              containerPort: 10443
+              protocol: TCP
+
           volumeMounts:
             - mountPath: /tmp/k8s-webhook-server/serving-certs
               name: cert
               readOnly: true
+
           livenessProbe:
             httpGet:
               path: /healthz
-              port: 8081
+              port: health
             initialDelaySeconds: 15
             periodSeconds: 20
             timeoutSeconds: 3
+
           readinessProbe:
             httpGet:
               path: /readyz
-              port: 8081
+              port: health
             initialDelaySeconds: 10
             periodSeconds: 15
             timeoutSeconds: 3
+
       serviceAccountName: kubeflow-trainer-controller-manager
       volumes:
         - name: cert
@@ -63,11 +82,15 @@ metadata:
 spec:
   ports:
     - name: monitoring-port
-      port: 8080
-      targetPort: 8080
+      port: 8443
+      targetPort: metrics
     - name: webhook-server
       port: 443
       protocol: TCP
-      targetPort: 9443
+      targetPort: webhook
+    - name: status-server
+      port: 10443
+      protocol: TCP
+      targetPort: status-server
   selector:
     app.kubernetes.io/component: manager
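
The change from numeric to named ports in this file relies on standard Kubernetes behavior: a probe's `port` and a Service's `targetPort` may each refer to a container port by name. The following is a minimal standalone sketch of that mechanism; the `demo` names, the image, and the Service port 80 are placeholders and are not taken from the trainer manifests.

```yaml
# Hypothetical example: a named container port referenced by a probe and a Service.
apiVersion: v1
kind: Pod
metadata:
  name: demo
  labels:
    app: demo
spec:
  containers:
  - name: app
    image: registry.example.com/demo:latest   # placeholder image
    ports:
    - name: health
      containerPort: 8081
      protocol: TCP
    readinessProbe:
      httpGet:
        path: /readyz
        port: health          # resolves to containerPort 8081 of this container
---
apiVersion: v1
kind: Service
metadata:
  name: demo
spec:
  selector:
    app: demo
  ports:
  - name: health
    port: 80                  # placeholder Service port
    protocol: TCP
    targetPort: health        # resolves to the port named "health" on each selected pod
```

One practical consequence of this style is that a port number only has to change in the container's `ports` list, and every probe or Service that refers to it by name follows along.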

applications/trainer/upstream/base/rbac/kustomization.yaml

Lines changed: 2 additions & 0 deletions
@@ -2,3 +2,5 @@ resources:
 - role.yaml
 - role_binding.yaml
 - service_account.yaml
+- public_configmap_role.yaml
+- public_configmap_role_binding.yaml
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: kubeflow-trainer-public
+rules:
+  - apiGroups:
+      - ""
+    resources:
+      - configmaps
+    resourceNames:
+      - kubeflow-trainer-public
+    verbs:
+      - get
+      - list
+      - watch
