test(spark): harden webhook readiness checks #3367
Welcome to the Kubeflow Manifests Repository. Thanks for opening your first PR. Your contribution means a lot to the Kubeflow community.
Thanks again for helping to improve Kubeflow.
Pull request overview
This PR strengthens the Spark operator test suite by addressing a race condition where the API server could attempt to call the webhook before it was fully ready. The changes add more robust webhook readiness checks following the same pattern successfully used in cert-manager tests.
Changes:
- Increased deployment readiness timeouts from 60s/30s to 180s to align with other test scripts
- Added endpoint-level readiness check to ensure webhook service has pod-backed addresses before proceeding
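The timeout change in the first bullet can be sketched roughly as follows, assuming the waits are done with `kubectl wait`. The helper name `wait_for_deployment`, the `READINESS_TIMEOUT` variable, and the `kubeflow` namespace in the usage line are illustrative assumptions, not the actual contents of the patch.

```shell
# Hypothetical helper, not copied from tests/spark_install.sh.
# 180s is the value the PR moves the deployment waits to.
READINESS_TIMEOUT="${READINESS_TIMEOUT:-180s}"

# wait_for_deployment <namespace> <deployment-name>
# Blocks until the deployment reports the Available condition,
# or fails once the timeout expires.
wait_for_deployment() {
  local namespace="$1" name="$2"
  kubectl wait --namespace "$namespace" \
    --for=condition=Available \
    --timeout="$READINESS_TIMEOUT" \
    "deployment/$name"
}
```

A call might look like `wait_for_deployment kubeflow spark-operator-webhook`, where the namespace is an assumption. Keeping the timeout in one variable makes it easier to keep every wait in the script aligned.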
@juliusvonkohout Kustomize install is failing on checksum verification in CI. The issue is that the grep pattern "linux_amd64" is too broad: it can match multiple lines in the checksums file, and sha256sum --check fails when a matched checksum doesn't correspond to the downloaded asset.
Logs:
Install KinD...
• Ready after 7s 💚
kubectl cluster-info --context kind-kind
Kubernetes control plane is running at https://127.0.0.1:38055/
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Install Kustomize ...
sha256sum: WARNING: 1 computed checksum did NOT match
kustomize_v5.7.1_linux_amd64.tar.gz: FAILED
Failed to verify Kustomize checksums   <-- checksum failed
Error: Process completed with exit code 1.
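One way to narrow the match described above is to select the checksum line by exact filename instead of a substring. This is a sketch only: the helper name `verify_checksum` and the checksums file name in the usage line are assumptions, and this is not necessarily how the install script was fixed.

```shell
# verify_checksum <asset-file> <checksums-file>
# Hypothetical helper: picks only the checksum line whose filename field
# equals the asset name exactly, then lets sha256sum verify that line.
verify_checksum() {
  local asset="$1" checksums="$2" name dir line
  name="$(basename "$asset")"
  dir="$(dirname "$asset")"
  # Compare the whole filename field, so other *_linux_amd64 entries
  # in the same file are ignored. "*name" is the binary-mode form.
  line="$(awk -v f="$name" '$2 == f || $2 == ("*" f)' "$checksums")"
  if [ -z "$line" ]; then
    echo "no checksum entry for $name" >&2
    return 1
  fi
  # sha256sum resolves filenames relative to the current directory,
  # so run the check from the directory that holds the asset.
  (cd "$dir" && printf '%s\n' "$line" | sha256sum -c - >/dev/null)
}
```

For example, `verify_checksum kustomize_v5.7.1_linux_amd64.tar.gz checksums.txt` would check only the entry for that one archive.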
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
15cc384 to b35ff72
/lgtm
/hold please check again
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
@juliusvonkohout All checks are green
thank you
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: juliusvonkohout
/unhold
* test(spark): harden webhook readiness checks
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
* chore(spark): rerun CI
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
* test(spark): wait for webhook pod readiness
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
---------
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
✏️ Summary of Changes
This PR makes the Spark CI install/test path more robust by strengthening webhook readiness checks in `tests/spark_install.sh`.
The original failure mode was not in the Spark manifests themselves. The `Test Spark` workflow could proceed after `deploy/spark-operator-webhook` became `Available`, but the API server still hit a transient admission failure when creating `SparkApplication` resources:
- the webhook call went to `spark-operator-webhook-svc:9443`,
- `SparkApplication` apply failed with `connection refused`.
This PR hardens the test setup by:
- raising one deployment readiness timeout from `60s` to `180s`,
- raising the other from `30s` to `180s`,
- waiting for `endpoints/spark-operator-webhook-svc` to contain a pod-backed address before the Spark test continues.
This keeps the fix narrowly scoped to test robustness and avoids changing the Spark component manifests.
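The endpoint-level wait described above could look roughly like this. It is a minimal sketch, assuming the check polls the Endpoints object with `kubectl get` and a JSONPath query on `targetRef.kind`; the function names, the polling interval, and the `kubeflow` namespace in the usage line are assumptions, and the real script may query the object differently.

```shell
# endpoint_has_pod_address <namespace> <service>
# Hypothetical helper: succeeds when the Endpoints object lists at least
# one ready address whose targetRef points at a Pod.
endpoint_has_pod_address() {
  local namespace="$1" service="$2" kinds
  kinds="$(kubectl get endpoints "$service" --namespace "$namespace" \
    -o jsonpath='{.subsets[*].addresses[*].targetRef.kind}' 2>/dev/null || true)"
  case "$kinds" in
    *Pod*) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_for_endpoint <namespace> <service> [timeout-seconds] [interval-seconds]
# Polls until a pod-backed address appears or the timeout is reached.
wait_for_endpoint() {
  local namespace="$1" service="$2" timeout="${3:-180}" interval="${4:-2}" elapsed=0
  until endpoint_has_pod_address "$namespace" "$service"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for endpoints/$service" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

A call such as `wait_for_endpoint kubeflow spark-operator-webhook-svc` would then sit between the deployment wait and the first `SparkApplication` apply. Only ready addresses appear under `addresses`, which is why this check is stricter than waiting for the deployment to become `Available`.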
📦 Dependencies
🐛 Related Issues
- #3366, asking to make the Spark test more robust and identify the exact failing step.
✅ Validation
Local validation was run on a fresh kind cluster using the same workflow shape as the GitHub Actions Spark job:
- `kustomize build common/kubeflow-namespace/base | kubectl apply -f -`
- `./tests/istio-cni_install.sh`
- `./tests/oauth2-proxy_install.sh`
- `./tests/cert_manager_install.sh`
- `./tests/multi_tenancy_install.sh`
- `kustomize build common/user-namespace/base | kubectl apply -f -`
- `cd applications/spark && ../../tests/spark_install.sh && ../../tests/spark_test.sh kubeflow-user-example-com`
Observed result after this patch:
- `SparkApplication` apply succeeded,
- no `connection refused`,
- `RUNNING` was reached,
- followed by `Succeeded`.
✅ Contributor Checklist