
test(spark): harden webhook readiness checks#3367

Merged
google-oss-prow[bot] merged 3 commits into kubeflow:master from danish9039:spark-test-robustness
Feb 27, 2026

Conversation

@danish9039

Member

✏️ Summary of Changes

This PR makes the Spark CI install/test path more robust by strengthening webhook readiness checks in tests/spark_install.sh.

The original failure mode was not in the Spark manifests themselves. The Test Spark workflow could proceed after deploy/spark-operator-webhook became Available, but the API server still hit a transient admission failure when creating SparkApplication resources:

  • the Spark operator webhook deployment existed,
  • but the webhook endpoint was not yet fully reachable on spark-operator-webhook-svc:9443,
  • and the first SparkApplication apply failed with connection refused.

This PR hardens the test setup by:

  • increasing the controller availability wait from 60s to 180s,
  • increasing the webhook availability wait from 30s to 180s,
  • waiting for endpoints/spark-operator-webhook-svc to contain a pod-backed address before the Spark test continues.
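The three waits above can be sketched roughly as follows. This is an illustrative outline, not the actual contents of tests/spark_install.sh: the function names, the polling interval, and the controller deployment name `spark-operator-controller` are assumptions, while the `kubeflow` namespace is inferred from the webhook URL in the CI error. It relies on the fact that an Endpoints object lists only ready addresses under `addresses`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Poll the Endpoints object of a Service until at least one ready address
# is backed by a Pod (targetRef.kind == Pod), or give up after a timeout.
wait_for_pod_backed_endpoint() {
  local namespace="$1" service="$2" timeout_seconds="${3:-180}"
  local deadline=$((SECONDS + timeout_seconds))
  local kinds=""
  while (( SECONDS < deadline )); do
    # Not-ready pods appear under notReadyAddresses, so they are ignored here.
    kinds="$(kubectl get endpoints "$service" -n "$namespace" \
      -o jsonpath='{.subsets[*].addresses[*].targetRef.kind}' 2>/dev/null || true)"
    if [[ " ${kinds} " == *" Pod "* ]]; then
      return 0
    fi
    sleep "${POLL_INTERVAL:-2}"
  done
  echo "No pod-backed address on endpoints/${service} after ${timeout_seconds}s" >&2
  return 1
}

# Hypothetical wrapper combining the deployment waits with the endpoint wait.
wait_for_spark_operator() {
  local namespace="${1:-kubeflow}"
  # Deployment name below is an assumption; adjust to the real controller name.
  kubectl wait --for=condition=Available deployment/spark-operator-controller \
    -n "$namespace" --timeout=180s
  kubectl wait --for=condition=Available deployment/spark-operator-webhook \
    -n "$namespace" --timeout=180s
  wait_for_pod_backed_endpoint "$namespace" spark-operator-webhook-svc 180
}
```

Under these assumptions, calling `wait_for_spark_operator kubeflow` before the first SparkApplication apply would block until the webhook Service has a ready pod behind it.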

This keeps the fix narrowly scoped to test robustness and avoids changing the Spark component manifests.

📦 Dependencies

  • None

🐛 Related Issues

  • Follow-up to review feedback on #3366 asking to make the Spark test more robust and identify the exact failing step.

✅ Validation

Local validation was run on a fresh kind cluster using the same workflow shape as the GitHub Actions Spark job:

  • kustomize build common/kubeflow-namespace/base | kubectl apply -f -
  • ./tests/istio-cni_install.sh
  • ./tests/oauth2-proxy_install.sh
  • ./tests/cert_manager_install.sh
  • ./tests/multi_tenancy_install.sh
  • kustomize build common/user-namespace/base | kubectl apply -f -
  • cd applications/spark && ../../tests/spark_install.sh && ../../tests/spark_test.sh kubeflow-user-example-com

Observed result after this patch:

  • first SparkApplication apply succeeded,
  • no webhook connection refused,
  • Spark application reached RUNNING,
  • driver pod completed with Succeeded.
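A check like the one behind these observations could be scripted along the following lines. This is a sketch under assumptions: the function name and the application name `my-spark-app` are hypothetical, and the status path `.status.applicationState.state` is assumed from the operator's v1beta2 SparkApplication API rather than taken from this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Poll a SparkApplication until its state matches one of the accepted
# states (a "|"-separated list), or fail once the timeout expires.
wait_for_spark_state() {
  local namespace="$1" app="$2" accepted="$3" timeout_seconds="${4:-300}"
  local deadline=$((SECONDS + timeout_seconds))
  local state=""
  while (( SECONDS < deadline )); do
    # Status field path is an assumption about the SparkApplication CRD.
    state="$(kubectl get sparkapplication "$app" -n "$namespace" \
      -o jsonpath='{.status.applicationState.state}' 2>/dev/null || true)"
    if [[ "$state" =~ ^(${accepted})$ ]]; then
      echo "SparkApplication ${app} reached ${state}"
      return 0
    fi
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "Timed out waiting for ${app}; last state: ${state:-unknown}" >&2
  return 1
}
```

For example, `wait_for_spark_state kubeflow-user-example-com my-spark-app 'RUNNING|COMPLETED' 300` accepts either state, since a short job may pass through RUNNING between two polls.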

✅ Contributor Checklist

  • I have tested these changes with kustomize. See Installation Prerequisites.
  • All commits are signed-off to satisfy the DCO check.
  • I have considered adding my company to the adopters page to support Kubeflow and help the community, since I expect help from the community for my issue (see 1. and 2.).

Copilot AI review requested due to automatic review settings February 27, 2026 09:48
@github-actions


Welcome to the Kubeflow Manifests Repository

Thanks for opening your first PR. Your contribution means a lot to the Kubeflow community.

Before making more PRs:
Please ensure your PR follows our Contributing Guide.
Please also be aware that many components are synchronized from upstream via the scripts in /scripts.
So in some cases you have to fix the problem in the upstream repositories first, but you can use a PR against kubeflow/manifests to test the platform integration.

Thanks again for helping to improve Kubeflow.

Copilot AI left a comment

Contributor

Pull request overview

This PR strengthens the Spark operator test suite by addressing a race condition where the API server could attempt to call the webhook before it was fully ready. The changes add more robust webhook readiness checks following the same pattern successfully used in cert-manager tests.

Changes:

  • Increased deployment readiness timeouts from 60s/30s to 180s to align with other test scripts
  • Added endpoint-level readiness check to ensure webhook service has pod-backed addresses before proceeding

@danish9039 danish9039 marked this pull request as ready for review February 27, 2026 11:09
Copilot AI review requested due to automatic review settings February 27, 2026 11:09
@danish9039

danish9039 commented Feb 27, 2026

Member Author

@juliusvonkohout The Kustomize install is failing on checksum verification in CI. The grep pattern "linux_amd64" is too broad: it can match multiple lines in the checksums file, and sha256sum --check fails when a matched checksum does not correspond to the downloaded asset. We can tighten it to grep for the exact filename, which should resolve the checksum mismatch deterministically.

https://github.com/kubeflow/manifests/blob/27d176e1900cbd45917662ea6370f40fc9488fde/tests/install_KinD_create_KinD_cluster_install_kustomize.sh#L72

Logs:
Run ./tests/install_KinD_create_KinD_cluster_install_kustomize.sh
./tests/install_KinD_create_KinD_cluster_install_kustomize.sh
shell: /usr/bin/bash -e {0}
env:
KF_PROFILE: kubeflow-user-example-com

Install KinD...
kind-linux-amd64: OK
Creating KinD cluster ...
Creating cluster "kind" ...

• Ready after 7s 💚
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Kubernetes control plane is running at https://127.0.0.1:38055/
CoreDNS is running at https://127.0.0.1:38055/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

Install Kustomize ...

sha256sum: WARNING: 1 computed checksum did NOT match

kustomize_v5.7.1_linux_amd64.tar.gz: FAILED

Failed to verify Kustomize checksums <-------------------------------------- checksum failed

Error: Process completed with exit code 1.
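Assuming the diagnosis in this comment holds, the proposed tightening could look roughly like the sketch below. It is not the code in tests/install_KinD_create_KinD_cluster_install_kustomize.sh: the function name is hypothetical, and it uses an exact field comparison in awk instead of the grep the comment suggests, so that dots in the filename are not treated as regex wildcards.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Verify one downloaded asset against a checksums file, selecting only the
# line whose filename field equals the asset name exactly.
verify_exact_checksum() {
  local checksums_file="$1" asset_path="$2"
  local asset_dir asset_name
  asset_dir="$(dirname "$asset_path")"
  asset_name="$(basename "$asset_path")"
  # sha256sum lines look like "<hash>  <name>" or "<hash> *<name>" (binary mode).
  # If no line matches, sha256sum receives no input and exits non-zero.
  awk -v name="$asset_name" '$2 == name || $2 == ("*" name)' "$checksums_file" \
    | (cd "$asset_dir" && sha256sum --check --status)
}
```

A call such as `verify_exact_checksum checksums.txt ./kustomize_v5.7.1_linux_amd64.tar.gz` then checks only that one entry, regardless of how many other `linux_amd64` lines the file contains.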

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
@juliusvonkohout

Member

/lgtm
/approve

@juliusvonkohout

Member

/hold

please check again

+ kubectl get namespaces --selector=istio-injection=enabled
NAME                        STATUS   AGE
kubeflow                    Active   111s
kubeflow-system             Active   111s
kubeflow-user-example-com   Active   26s
+ kubectl -n kubeflow-user-example-com apply -f /home/runner/work/manifests/manifests/applications/spark/sparkapplication_example.yaml
Error from server (InternalError): error when creating "/home/runner/work/manifests/manifests/applications/spark/sparkapplication_example.yaml": Internal error occurred: failed calling webhook "mutate-sparkoperator-k8s-io-v1beta2-sparkapplication.sparkoperator.k8s.io": failed to call webhook: Post "https://spark-operator-webhook-svc.kubeflow.svc:9443/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication?timeout=10s": dial tcp 10.96.220.135:9443: connect: connection refused
Error: Process completed with exit code 1.

Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Copilot AI review requested due to automatic review settings February 27, 2026 16:03

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@danish9039

Member Author

/hold

please check again

+ kubectl get namespaces --selector=istio-injection=enabled
NAME                        STATUS   AGE
kubeflow                    Active   111s
kubeflow-system             Active   111s
kubeflow-user-example-com   Active   26s
+ kubectl -n kubeflow-user-example-com apply -f /home/runner/work/manifests/manifests/applications/spark/sparkapplication_example.yaml
Error from server (InternalError): error when creating "/home/runner/work/manifests/manifests/applications/spark/sparkapplication_example.yaml": Internal error occurred: failed calling webhook "mutate-sparkoperator-k8s-io-v1beta2-sparkapplication.sparkoperator.k8s.io": failed to call webhook: Post "https://spark-operator-webhook-svc.kubeflow.svc:9443/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication?timeout=10s": dial tcp 10.96.220.135:9443: connect: connection refused
Error: Process completed with exit code 1.

@juliusvonkohout All checks are green

@juliusvonkohout

Member

thank you
/lgtm
/approve

@google-oss-prow google-oss-prow Bot added the lgtm label Feb 27, 2026
@google-oss-prow


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@juliusvonkohout

Member

/unhold

@google-oss-prow google-oss-prow Bot merged commit c75bfe3 into kubeflow:master Feb 27, 2026
11 of 12 checks passed
Raakshass added a commit to Raakshass/manifests that referenced this pull request Mar 27, 2026
* test(spark): harden webhook readiness checks

Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>

* chore(spark): rerun CI

Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>

* test(spark): wait for webhook pod readiness

Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>

---------

Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>