fix: Detect pod configuration errors early instead of timeout #9197
Conversation
@vdemeester: GitHub didn't allow me to request PR reviews from the following users: arewm. Note that only tektoncd members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
e3a61ca to 256002d Compare
/retest

2 similar comments

/retest

/retest
WRT this, users may be relying on the current behaviour in cases where a config map or secret is provisioned roughly at the same time as the …
Ah, I should have written a better description. Essentially, if they use Volumes/VolumeMounts for ConfigMaps or Secrets, the behavior doesn't change at all, as that is a recoverable issue (from the Pod's, and thus the TaskRun's, standpoint). But if they use an EnvSource (i.e. loading an environment variable value from a ConfigMap or a Secret), this is where it will fail fast, as it won't ever be recovered. See #9144 (comment). So users who were relying on the current behavior will either see a failure early instead of a timeout, or nothing changes for them: if there was a failure before, there is a failure now, just quicker; if there was no failure before, there is no failure now either (from the user's point of view). I'll update the description.
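To make the distinction concrete, here is a minimal sketch using the Kubernetes corev1 types (the API fields are real; the ConfigMap name and key are made up for the example, and this is not code from the PR):

import corev1 "k8s.io/api/core/v1"

// Loading an env var value from a ConfigMap (EnvVarSource): if "my-config" is
// missing, the kubelet reports CreateContainerConfigError and never recovers,
// so the TaskRun can now be failed within seconds instead of timing out.
var envFromConfigMap = corev1.EnvVar{
    Name: "MY_SETTING",
    ValueFrom: &corev1.EnvVarSource{
        ConfigMapKeyRef: &corev1.ConfigMapKeySelector{
            LocalObjectReference: corev1.LocalObjectReference{Name: "my-config"},
            Key:                  "setting",
        },
    },
}

// Mounting the same ConfigMap as a volume: the pod waits in ContainerCreating
// and starts as soon as the ConfigMap appears, so this recoverable case keeps
// the existing behavior.
var configVolume = corev1.Volume{
    Name: "config",
    VolumeSource: corev1.VolumeSource{
        ConfigMap: &corev1.ConfigMapVolumeSource{
            LocalObjectReference: corev1.LocalObjectReference{Name: "my-config"},
        },
    },
}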
aThorp96 left a comment
Overall this seems good to me! A couple of questions and a few optional suggestions/nits (mostly regarding patterns that were already in the code).
func isPodHitConfigError(pod *corev1.Pod) bool {

// hasContainerWaitingReason checks if any container (init or regular) is waiting with a reason
// that matches the provided predicate function
func hasContainerWaitingReason(pod *corev1.Pod, predicate func(corev1.ContainerStateWaiting) bool) bool {
Love this helper
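For readers skimming the thread, a helper with that signature would presumably walk both the init and regular container statuses; this is a sketch reconstructed from the signature and comment above, not the exact code in the PR:

// hasContainerWaitingReason checks if any container (init or regular) is
// waiting with a reason that matches the provided predicate function.
func hasContainerWaitingReason(pod *corev1.Pod, predicate func(corev1.ContainerStateWaiting) bool) bool {
    statuses := append([]corev1.ContainerStatus{}, pod.Status.InitContainerStatuses...)
    statuses = append(statuses, pod.Status.ContainerStatuses...)
    for _, status := range statuses {
        if status.State.Waiting != nil && predicate(*status.State.Waiting) {
            return true
        }
    }
    return false
}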
pkg/reconciler/taskrun/taskrun.go
Outdated
image := step.ImageID
message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, image, step.Waiting.Message)
optional nit: no need to alloc
Suggested change:

message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, step.ImageID, step.Waiting.Message)
pkg/reconciler/taskrun/taskrun.go
Outdated
image := step.ImageID
message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, image, step.Waiting.Message)
optional nit: no need to alloc
Suggested change:

message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, step.ImageID, step.Waiting.Message)
pkg/reconciler/taskrun/taskrun.go
Outdated
image := sidecar.ImageID
message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, image, sidecar.Waiting.Message)
Optional nit: no need to alloc
Suggested change:

message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, sidecar.ImageID, sidecar.Waiting.Message)
pkg/reconciler/taskrun/taskrun.go
Outdated
image := sidecar.ImageID
message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, image, sidecar.Waiting.Message)
Optional nit: no need to alloc
Suggested change:

message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, sidecar.ImageID, sidecar.Waiting.Message)
expectedReason = "PodCreationFailed"
expectedMessage = fmt.Sprintf(`the %s "unnamed-%d" in TaskRun "test-imagepull-fail" failed to start. The pod errored with the message: "%s."`, tc.failure, stepNumber, tc.message)
wantFailedEvent = fmt.Sprintf(`Warning Failed the %s "unnamed-%d" in TaskRun "test-imagepull-fail" failed to start. The pod errored with the message: "%s.`, tc.failure, stepNumber, tc.message)
default: // Image pull errors
Optional suggestion:
Better to be explicit IMO. WDYT?
Suggested change:

case "InvalidImageName", "ImagePullBackOff":
pkg/reconciler/taskrun/taskrun.go
Outdated
}

// Handle other image-related errors
if step.Waiting.Reason == "ErrImagePull" || step.Waiting.Reason == "InvalidImageName" {
Potential bug:
IIUC ErrImagePull can happen transiently and is automatically retried before ImagePullBackOff. If that's the case, would this logic cause TaskRuns to fail instead of the image pull being retried? If I am reading this right, if we include step.Waiting.Reason == "ErrImagePull" then we'll fail fast before we ever allow the retry to get to ImagePullBackOff. Is that right? Maybe that's the point of the PR, but I do worry that it could make Tekton seem "flakier" to users if TaskRuns are more prone to fast-failing on transient issues.
Ah you are right, I'll remove that one.
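With ErrImagePull dropped, the fail-fast check presumably narrows to waiting reasons the kubelet will never resolve on its own. A minimal sketch (the function name is hypothetical; the reason strings are the ones discussed in this thread):

// isUnrecoverableWaitingReason reports whether a container waiting reason can
// never resolve without user intervention. ErrImagePull is deliberately left
// out: the kubelet retries it and moves to ImagePullBackOff by itself.
func isUnrecoverableWaitingReason(reason string) bool {
    switch reason {
    case "InvalidImageName", "CreateContainerConfigError", "CreateContainerError":
        return true
    default:
        return false
    }
}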
aThorp96 left a comment
/lgtm
Two minor suggestions to utilize the new constants
pkg/reconciler/taskrun/taskrun.go
Outdated
}

// Handle CreateContainerConfigError (missing ConfigMap/Secret, invalid env vars, etc.)
if step.Waiting.Reason == "CreateContainerConfigError" {
Suggested change:

if step.Waiting.Reason == CreateContainerConfigError {
pkg/reconciler/taskrun/taskrun.go
Outdated
// Handle InvalidImageName (unrecoverable error)
// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
// will automatically retry before transitioning to ImagePullBackOff
if step.Waiting.Reason == "InvalidImageName" {
Suggested change:

if step.Waiting.Reason == InvalidImageName {
pkg/reconciler/taskrun/taskrun.go
Outdated
}

// Handle CreateContainerConfigError (missing ConfigMap/Secret, invalid env vars, etc.)
if sidecar.Waiting.Reason == "CreateContainerConfigError" {
Suggested change:

if sidecar.Waiting.Reason == CreateContainerConfigError {
pkg/reconciler/taskrun/taskrun.go
Outdated
// Handle InvalidImageName (unrecoverable error)
// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
// will automatically retry before transitioning to ImagePullBackOff
if sidecar.Waiting.Reason == "InvalidImageName" {
Suggested change:

if sidecar.Waiting.Reason == InvalidImageName {
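For reference, the constants these suggestions point at would presumably be declared along these lines (a sketch; the names come from the review comments and the values from the corresponding kubelet waiting reasons, not from the actual source):

// Container waiting reasons reported by the kubelet that the reconciler
// treats as grounds for failing a TaskRun early.
const (
    ImagePullBackOff           = "ImagePullBackOff"
    InvalidImageName           = "InvalidImageName"
    CreateContainerConfigError = "CreateContainerConfigError"
)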
pkg/reconciler/taskrun/taskrun.go
Outdated
// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
// will automatically retry before transitioning to ImagePullBackOff
🙏🏼
twoGiants left a comment
Great stuff! Thank you for this 🥇
See my comments below. I would clean up a bit, add more tests, and decompose the big test function so that the container failures are tested in their own, more concise test functions.
pkg/reconciler/taskrun/taskrun.go
Outdated
if step.Waiting != nil {
    if _, found := podFailureReasons[step.Waiting.Reason]; found {
Great opportunity to clean this up a bit. You could use the guard pattern twice here (and below for the sidecars), reduce the nesting of the ifs by two levels, and then extract the duplicated checks into a helper 😸 =>

if step.Waiting == nil {
    continue
}
if _, found := podFailureReasons[step.Waiting.Reason]; !found {
    continue
}
c.checkContainerFailure(
    ctx,
    tr,
    step.Waiting,
    step.Name,
    step.ImageID,
    "step",
)

And the waiting reason checks can go into a helper and be reused below for the sidecars as well:
func (c *Reconciler) checkContainerFailure(
    ctx context.Context,
    tr *v1.TaskRun,
    waiting *corev1.ContainerStateWaiting,
    name,
    imageID,
    containerType string,
) (bool, v1.TaskRunReason, string) {
    if waiting.Reason == ImagePullBackOff {
        imagePullBackOffTimeOut := config.FromContextOrDefaults(ctx).Defaults.DefaultImagePullBackOffTimeout
        // only attempt to recover from the imagePullBackOff if specified
        if imagePullBackOffTimeOut.Seconds() != 0 {
            p, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
            if err != nil {
                message := fmt.Sprintf(`the %s %q in TaskRun %q failed to pull the image %q. Failed to get pod with error: "%s."`, containerType, name, tr.Name, imageID, err)
                return true, v1.TaskRunReasonImagePullFailed, message
            }
            for _, condition := range p.Status.Conditions {
                // check the pod condition to get the time when the pod was ready to start containers / initialized.
                // keep trying until the pod schedule time has exceeded the specified imagePullBackOff timeout duration
                if slices.Contains(imagePullBackOffTimeoutPodConditions, string(condition.Type)) {
                    if c.Clock.Since(condition.LastTransitionTime.Time) < imagePullBackOffTimeOut {
                        return false, "", ""
                    }
                }
            }
        }
        // ImagePullBackOff timeout exceeded or not configured
        message := fmt.Sprintf(`the %s %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, containerType, name, tr.Name, imageID, waiting.Message)
        return true, v1.TaskRunReasonImagePullFailed, message
    }

    // Handle CreateContainerConfigError (missing ConfigMap/Secret, invalid env vars, etc.)
    if waiting.Reason == CreateContainerConfigError {
        message := fmt.Sprintf(`the %s %q in TaskRun %q failed to start. The pod errored with the message: "%s."`, containerType, name, tr.Name, waiting.Message)
        return true, v1.TaskRunReasonCreateContainerConfigError, message
    }

    // Handle InvalidImageName (unrecoverable error)
    // Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
    // will automatically retry before transitioning to ImagePullBackOff
    if waiting.Reason == InvalidImageName {
        message := fmt.Sprintf(`the %s %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, containerType, name, tr.Name, imageID, waiting.Message)
        return true, v1.TaskRunReasonImagePullFailed, message
    }

    // Handle CreateContainerError and other generic failures
    message := fmt.Sprintf(`the %s %q in TaskRun %q failed to start. The pod errored with the message: "%s."`, containerType, name, tr.Name, waiting.Message)
    return true, v1.TaskRunReasonPodCreationFailed, message
}

I also changed the message in the ImagePullBackOff error: `the step %q in TaskRun %q failed to pull the image %q and the pod with error` sounds wrong at the end. I used: `the %s %q in TaskRun %q failed to pull the image %q. Failed to get pod with error: "%s."`
pkg/reconciler/taskrun/taskrun.go
Outdated
}

// Handle InvalidImageName (unrecoverable error)
// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
But when ErrImagePull is the reason, the method returns true in line 304 and the reconciler fails the TaskRun. It looks like we either should not check for this transient state or, if we do, then return false with a message like "retrying pulling image".
| message: "secret \"secret-for-testing\" not found", | ||
| failure: "sidecar", | ||
| }, { | ||
| desc: "create container error step", |
The same test case for sidecar is missing.
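The missing entry would presumably mirror the step case, using the fields visible in the quoted excerpt (illustrative only; the real test case may need additional fields such as the waiting reason):

}, {
    desc:    "create container error sidecar",
    message: "secret \"secret-for-testing\" not found",
    failure: "sidecar",
},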
| message: "Invalid image \"whatever\"", | ||
| failure: "step", | ||
| imagePullBackOffTimeout: "5h", | ||
| }, { |
If the ErrImagePull case is added in the implementation, it must be added here too, for both.
stepNumber = 1
}

var expectedReason, expectedMessage, wantFailedEvent string
It took me a while to get this 🥲. But I got it! 🤣
This test is quite complex already with all its conditions down below, and the additional switch makes it even more difficult to understand.
What do you think about extracting the tests for the new container failure logic into their own table-driven test function with a simpler setup, execution, and assertions instead of extending this one?
+1
Enumerating the almost-entirely-identical message and event strings inside the test cases would certainly be verbose and tedious, but the switch statement and the various string formatting do add a lot of mental complexity. I think I'd even prefer setting/updating expectedStatus and wantEvents directly in the switch. It would be a lot more straightforward and clear, though, to either enumerate the events and strings in the cases or add an entirely separate test function for the new behavior.
It's already in TestReconcileContainerFailures, I forgot to remove that part...
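For the record, the kind of dedicated table-driven test being suggested might look roughly like this (only a sketch of the shape; the function name and fixture values are made up, and the real coverage lives in TestReconcileContainerFailures):

import "testing"

func TestContainerFailuresFailFast(t *testing.T) {
    for _, tc := range []struct {
        desc          string
        failure       string // "step" or "sidecar"
        waitingReason string
        message       string
    }{{
        desc:          "missing ConfigMap in a step env",
        failure:       "step",
        waitingReason: "CreateContainerConfigError",
        message:       `configmap "missing" not found`,
    }, {
        desc:          "invalid image name in a sidecar",
        failure:       "sidecar",
        waitingReason: "InvalidImageName",
        message:       `Invalid image "whatever"`,
    }} {
        t.Run(tc.desc, func(t *testing.T) {
            // Build a TaskRun whose pod reports tc.waitingReason on the given
            // container, reconcile once, and assert the run fails immediately
            // with an actionable message rather than waiting for its timeout.
        })
    }
}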
d078a03 to 1c64c91 Compare
- Fail fast on missing ConfigMaps/Secrets within seconds of detection
- Show actionable error messages instead of generic timeout failures
- Add early detection for CreateContainerConfigError and related failures

Signed-off-by: Vincent Demeester <[email protected]>
twoGiants left a comment
Thanks for the rework! Makes the logic easier to follow 😸 👍
/approve
/lgtm
/meow
In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aThorp96, twoGiants

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Changes
… valueFrom, within seconds of detection

/kind bug
Fixes #9144
/cc @afrittoli @aThorp96 @arewm
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep

Release Notes