Commit 43b800e
Update kubeflow/trainer manifests from v2.2.0 (kubeflow#3413)
* Update kubeflow/trainer manifests from v2.2.0
Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>
* Update pull request paths in trainer_test.yaml
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
* Remove patches for deployments in kustomization.yaml
Removed patches for jobset-controller-manager and kubeflow-trainer-controller-manager deployments.
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
* fixes
Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>
* fix(trainer): resolve webhook timeout in v2.2.0 CI (kubeflow#3415)
* fix(trainer): resolve webhook timeout in v2.2.0 CI
Root cause (proven from CI diagnostic logs):
1. Install: Redundant runtimes re-apply triggered unreachable webhook.
   Fix: Remove the step; runtimes already deployed before webhook registers.
2. Test: Istio sidecar intercepts outbound traffic from trainer controller to K8s API server on port 443. When API server calls jobset mutating webhook, the call times out because it routes through the trainer's Envoy sidecar.
   Fix: Add excludeOutboundPorts: 443 annotation to both trainer and jobset controller pod templates, alongside existing excludeInboundPorts: 9443.
3. Timing: kubectl rollout restart creates new jobset pod on a different node. kube-proxy endpoint propagation delay causes transient webhook unreachability.
   Fix: Add 30s sleep after restart for endpoint propagation. Add retry loop around TrainJob creation in test script.
Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
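For illustration, the port exclusions from item 2 would sit on a controller pod template roughly as follows. This is a sketch, not the patch from this commit: the namespace is an assumption, the Deployment name is taken from the names mentioned in this message, and only the two port values come from the text above. The excludeOutboundPorts annotation was removed again later in this series, once the NetworkPolicy root cause was fixed.

```yaml
# Hypothetical strategic-merge patch; metadata values are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubeflow-trainer-controller-manager  # name as mentioned in the commit message
  namespace: kubeflow-system                 # assumed namespace
spec:
  template:
    metadata:
      annotations:
        # Keep inbound webhook traffic on 9443 out of the Envoy sidecar.
        traffic.sidecar.istio.io/excludeInboundPorts: "9443"
        # Let outbound connections to port 443 bypass the sidecar.
        traffic.sidecar.istio.io/excludeOutboundPorts: "443"
```

Annotation values must be quoted strings; an unquoted 443 would be parsed as an integer and rejected.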
* fix(trainer): correct jobset-webhook podSelector labels
The jobset-webhook NetworkPolicy was incorrectly targeting pods with the label 'service.istio.io/canonical-name: jobset-controller-manager'. The jobset pods actually carry the label 'app.kubernetes.io/name: jobset'. This mismatch resulted in a default-deny rule dropping webhook validation requests, causing TrainJob deployments to time out. Patched the podSelector to match the correct standard labels.
Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
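The selector change described above can be sketched as the following NetworkPolicy. Only the policy name and the two labels come from the commit message; the namespace, the port, and the shape of the ingress rule are assumptions for illustration and may differ from the actual manifest.

```yaml
# Sketch of the corrected selector; not the manifest from this commit.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: jobset-webhook
  namespace: kubeflow-system   # assumed namespace
spec:
  podSelector:
    matchLabels:
      # Before: service.istio.io/canonical-name: jobset-controller-manager
      app.kubernetes.io/name: jobset
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: 9443           # assumed webhook port
```

One way to confirm that a selector matches real pods is `kubectl get pods -n <namespace> -l app.kubernetes.io/name=jobset`; an empty result means the policy selects nothing.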
* fix: address review — remove non-essential changes
Per reviewer feedback, stripped all symptom patches that are not strictly needed now that the root cause (jobset-webhook NetworkPolicy podSelector mismatch) is fixed:
- Removed excludeOutboundPorts annotations from both istio patches
- Reverted trainer_test.sh to base (no duplicated cert waits, no retry loop, no timeout bump)
- Restored runtimes build line in trainer_install.sh
Only the NetworkPolicy fix and install.sh webhook cert waits remain.
Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
---------
Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
---------
Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Siddhant Jain <siddhantjainofficial26@gmail.com>
Co-authored-by: Siddhant Jain <149181251+Raakshass@users.noreply.github.com>
1 parent 9c991d1 commit 43b800e
43 files changed
Lines changed: 6631 additions & 4834 deletions
File tree
- .github/workflows
- applications/trainer
- overlays
- upstream
- base
- crds
- manager
- rbac
- runtimes
- data-cache
- torchtune
- llama3_2
- qwen2_5
- webhook
- overlays
- data-cache
- jobset_patches
- kubeflow-platform
- patches
- manager
- runtimes
- third-party
- jobset
- leaderworkerset
- patches
- common/kubeflow-namespace/base/kubeflow-system
- scripts
- tests