fix: Detect pod configuration errors early instead of timeout #9197
Conversation
@vdemeester: GitHub didn't allow me to request PR reviews from the following users: arewm. Note that only tektoncd members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
e3a61ca to 256002d Compare
/retest

2 similar comments

/retest

/retest
WRT this, users may be relying on the current behaviour in cases where a config map or secret is provisioned roughly at the same time as the …
Ah, I should have written a better description. Essentially, if they use Volumes/VolumeMounts for ConfigMaps or Secrets, the behavior doesn't change at all, as that is a recoverable issue (from the Pod's, and thus the TaskRun's, standpoint). But if they use an EnvSource (i.e. loading an environment variable value from a ConfigMap or a Secret), this is where it will fail fast, as it won't ever be recovered. See #9144 (comment). So users who were relying on the current behavior will either see a failure early instead of a timeout, or nothing changes for them: if there was a failure before, there is a failure now, just quicker; if there was no failure before, there is no failure now either (from the user's point of view). I'll update the description.
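To make the distinction concrete, here is a minimal sketch using the Kubernetes corev1 types (the API fields are real; the ConfigMap name and key are made up for the example, and this is not code from the PR):

import corev1 "k8s.io/api/core/v1"

// Loading an env var value from a ConfigMap (EnvVarSource): if "my-config" is
// missing, the kubelet reports CreateContainerConfigError and never recovers,
// so the TaskRun can now be failed within seconds instead of timing out.
var envFromConfigMap = corev1.EnvVar{
    Name: "MY_SETTING",
    ValueFrom: &corev1.EnvVarSource{
        ConfigMapKeyRef: &corev1.ConfigMapKeySelector{
            LocalObjectReference: corev1.LocalObjectReference{Name: "my-config"},
            Key:                  "setting",
        },
    },
}

// Mounting the same ConfigMap as a volume: the pod waits in ContainerCreating
// and starts as soon as the ConfigMap appears, so this recoverable case keeps
// the existing behavior.
var configVolume = corev1.Volume{
    Name: "config",
    VolumeSource: corev1.VolumeSource{
        ConfigMap: &corev1.ConfigMapVolumeSource{
            LocalObjectReference: corev1.LocalObjectReference{Name: "my-config"},
        },
    },
}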
aThorp96 left a comment
Overall this seems good to me! A couple of questions and a few optional suggestions/nits (mostly regarding patterns that were already in the code).
func isPodHitConfigError(pod *corev1.Pod) bool {

// hasContainerWaitingReason checks if any container (init or regular) is waiting with a reason
// that matches the provided predicate function
func hasContainerWaitingReason(pod *corev1.Pod, predicate func(corev1.ContainerStateWaiting) bool) bool {
Love this helper
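For readers skimming the thread, a helper with that signature would presumably walk both the init and regular container statuses; this is a sketch reconstructed from the signature and comment above, not the exact code in the PR:

// hasContainerWaitingReason checks if any container (init or regular) is
// waiting with a reason that matches the provided predicate function.
func hasContainerWaitingReason(pod *corev1.Pod, predicate func(corev1.ContainerStateWaiting) bool) bool {
    statuses := append([]corev1.ContainerStatus{}, pod.Status.InitContainerStatuses...)
    statuses = append(statuses, pod.Status.ContainerStatuses...)
    for _, status := range statuses {
        if status.State.Waiting != nil && predicate(*status.State.Waiting) {
            return true
        }
    }
    return false
}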
pkg/reconciler/taskrun/taskrun.go
Outdated
image := step.ImageID
message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, image, step.Waiting.Message)
optional nit: no need to alloc
Suggested change:

message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, step.ImageID, step.Waiting.Message)
pkg/reconciler/taskrun/taskrun.go
Outdated
image := step.ImageID
message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, image, step.Waiting.Message)
optional nit: no need to alloc
Suggested change:

message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, step.ImageID, step.Waiting.Message)
pkg/reconciler/taskrun/taskrun.go
Outdated
image := sidecar.ImageID
message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, image, sidecar.Waiting.Message)
Optional nit: no need to alloc
Suggested change:

message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, sidecar.ImageID, sidecar.Waiting.Message)
pkg/reconciler/taskrun/taskrun.go
Outdated
image := sidecar.ImageID
message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, image, sidecar.Waiting.Message)
Optional nit: no need to alloc
Suggested change:

message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, sidecar.ImageID, sidecar.Waiting.Message)
expectedReason = "PodCreationFailed"
expectedMessage = fmt.Sprintf(`the %s "unnamed-%d" in TaskRun "test-imagepull-fail" failed to start. The pod errored with the message: "%s."`, tc.failure, stepNumber, tc.message)
wantFailedEvent = fmt.Sprintf(`Warning Failed the %s "unnamed-%d" in TaskRun "test-imagepull-fail" failed to start. The pod errored with the message: "%s.`, tc.failure, stepNumber, tc.message)
default: // Image pull errors
Optional suggestion:
Better to be explicit IMO. WDYT?
Suggested change:

case "InvalidImageName", "ImagePullBackOff":
pkg/reconciler/taskrun/taskrun.go
Outdated
}

// Handle other image-related errors
if step.Waiting.Reason == "ErrImagePull" || step.Waiting.Reason == "InvalidImageName" {
Potential bug:
IIUC ErrImagePull can happen transiently and is automatically retried before ImagePullBackOff. If that's the case, would this logic cause TaskRuns to fail instead of the image pull being retried? If I am reading this right, if we include step.Waiting.Reason == "ErrImagePull" then we'll fail fast before we ever allow the retry to get to ImagePullBackOff. Is that right? Maybe that's the point of the PR, but I do worry that it could make Tekton seem "flakier" to users if TaskRuns are more prone to fast-failing on transient issues.
Ah you are right, I'll remove that one.
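With ErrImagePull dropped, the fail-fast check presumably narrows to waiting reasons the kubelet will never resolve on its own. A minimal sketch (the function name is hypothetical; the reason strings are the ones discussed in this thread):

// isUnrecoverableWaitingReason reports whether a container waiting reason can
// never resolve without user intervention. ErrImagePull is deliberately left
// out: the kubelet retries it and moves to ImagePullBackOff by itself.
func isUnrecoverableWaitingReason(reason string) bool {
    switch reason {
    case "InvalidImageName", "CreateContainerConfigError", "CreateContainerError":
        return true
    default:
        return false
    }
}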
aThorp96 left a comment
/lgtm
Two minor suggestions to utilize the new constants
pkg/reconciler/taskrun/taskrun.go
Outdated
}

// Handle CreateContainerConfigError (missing ConfigMap/Secret, invalid env vars, etc.)
if step.Waiting.Reason == "CreateContainerConfigError" {
Suggested change:

if step.Waiting.Reason == CreateContainerConfigError {
pkg/reconciler/taskrun/taskrun.go
Outdated
// Handle InvalidImageName (unrecoverable error)
// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
// will automatically retry before transitioning to ImagePullBackOff
if step.Waiting.Reason == "InvalidImageName" {
Suggested change:

if step.Waiting.Reason == InvalidImageName {
pkg/reconciler/taskrun/taskrun.go
Outdated
}

// Handle CreateContainerConfigError (missing ConfigMap/Secret, invalid env vars, etc.)
if sidecar.Waiting.Reason == "CreateContainerConfigError" {
Suggested change:

if sidecar.Waiting.Reason == CreateContainerConfigError {
pkg/reconciler/taskrun/taskrun.go
Outdated
// Handle InvalidImageName (unrecoverable error)
// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
// will automatically retry before transitioning to ImagePullBackOff
if sidecar.Waiting.Reason == "InvalidImageName" {
Suggested change:

if sidecar.Waiting.Reason == InvalidImageName {
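For reference, the constants these suggestions point at would presumably be declared along these lines (a sketch; the names come from the review comments and the values from the corresponding kubelet waiting reasons, not from the actual source):

// Container waiting reasons reported by the kubelet that the reconciler
// treats as grounds for failing a TaskRun early.
const (
    ImagePullBackOff           = "ImagePullBackOff"
    InvalidImageName           = "InvalidImageName"
    CreateContainerConfigError = "CreateContainerConfigError"
)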
pkg/reconciler/taskrun/taskrun.go
Outdated
// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
// will automatically retry before transitioning to ImagePullBackOff
🙏🏼
twoGiants left a comment
Great stuff! Thank you for this 🥇
See my comments below. I would clean up a bit, add more tests, and decompose the big test function so that the container failures are tested in their own, more concise test functions.
pkg/reconciler/taskrun/taskrun.go
Outdated
if step.Waiting != nil {
    if _, found := podFailureReasons[step.Waiting.Reason]; found {
Great opportunity to clean this up a bit. You could use the guard pattern twice here (and below for the sidecars), reduce the nesting of the ifs by two levels, and then extract the duplicated checks into a helper 😸 =>

if step.Waiting == nil {
    continue
}
if _, found := podFailureReasons[step.Waiting.Reason]; !found {
    continue
}
c.checkContainerFailure(
    ctx,
    tr,
    step.Waiting,
    step.Name,
    step.ImageID,
    "step",
)

And the waiting reason checks can go into a helper and be reused below for the sidecars as well:
func (c *Reconciler) checkContainerFailure(
    ctx context.Context,
    tr *v1.TaskRun,
    waiting *corev1.ContainerStateWaiting,
    name,
    imageID,
    containerType string,
) (bool, v1.TaskRunReason, string) {
    if waiting.Reason == ImagePullBackOff {
        imagePullBackOffTimeOut := config.FromContextOrDefaults(ctx).Defaults.DefaultImagePullBackOffTimeout
        // only attempt to recover from the imagePullBackOff if specified
        if imagePullBackOffTimeOut.Seconds() != 0 {
            p, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
            if err != nil {
                message := fmt.Sprintf(`the %s %q in TaskRun %q failed to pull the image %q. Failed to get pod with error: "%s."`, containerType, name, tr.Name, imageID, err)
                return true, v1.TaskRunReasonImagePullFailed, message
            }
            for _, condition := range p.Status.Conditions {
                // check the pod condition to get the time when the pod was ready to start containers / initialized.
                // keep trying until the pod schedule time has exceeded the specified imagePullBackOff timeout duration
                if slices.Contains(imagePullBackOffTimeoutPodConditions, string(condition.Type)) {
                    if c.Clock.Since(condition.LastTransitionTime.Time) < imagePullBackOffTimeOut {
                        return false, "", ""
                    }
                }
            }
        }
        // ImagePullBackOff timeout exceeded or not configured
        message := fmt.Sprintf(`the %s %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, containerType, name, tr.Name, imageID, waiting.Message)
        return true, v1.TaskRunReasonImagePullFailed, message
    }

    // Handle CreateContainerConfigError (missing ConfigMap/Secret, invalid env vars, etc.)
    if waiting.Reason == CreateContainerConfigError {
        message := fmt.Sprintf(`the %s %q in TaskRun %q failed to start. The pod errored with the message: "%s."`, containerType, name, tr.Name, waiting.Message)
        return true, v1.TaskRunReasonCreateContainerConfigError, message
    }

    // Handle InvalidImageName (unrecoverable error)
    // Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
    // will automatically retry before transitioning to ImagePullBackOff
    if waiting.Reason == InvalidImageName {
        message := fmt.Sprintf(`the %s %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, containerType, name, tr.Name, imageID, waiting.Message)
        return true, v1.TaskRunReasonImagePullFailed, message
    }

    // Handle CreateContainerError and other generic failures
    message := fmt.Sprintf(`the %s %q in TaskRun %q failed to start. The pod errored with the message: "%s."`, containerType, name, tr.Name, waiting.Message)
    return true, v1.TaskRunReasonPodCreationFailed, message
}

I also changed the message in the ImagePullBackOff error: `the step %q in TaskRun %q failed to pull the image %q and the pod with error` sounds wrong at the end. I used: `the %s %q in TaskRun %q failed to pull the image %q. Failed to get pod with error: "%s."`
pkg/reconciler/taskrun/taskrun.go
Outdated
}

// Handle InvalidImageName (unrecoverable error)
// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
But when ErrImagePull is the reason, the method returns true in line 304 and the reconciler fails the TaskRun. It looks like we either should not check for this transient state or, if we do, then return false with a message like "retrying pulling image".
| message: "secret \"secret-for-testing\" not found", | ||
| failure: "sidecar", | ||
| }, { | ||
| desc: "create container error step", |
The same test case for sidecar is missing.
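The missing entry would presumably mirror the step case, using the fields visible in the quoted excerpt (illustrative only; the real test case may need additional fields such as the waiting reason):

}, {
    desc:    "create container error sidecar",
    message: "secret \"secret-for-testing\" not found",
    failure: "sidecar",
},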
| message: "Invalid image \"whatever\"", | ||
| failure: "step", | ||
| imagePullBackOffTimeout: "5h", | ||
| }, { |
If the ErrImagePull case is added in the implementation, it must be added here too, for both.
stepNumber = 1
}

var expectedReason, expectedMessage, wantFailedEvent string
It took me a while to get this 🥲. But I got it! 🤣
This test is quite complex already with all its conditions down below, and the additional switch makes it even more difficult to understand.
What do you think about extracting the tests for the new container failure logic into their own table-driven test function with a simpler setup, execution, and assertions instead of extending this one?
+1
Enumerating the almost-entirely-identical message and event strings inside the test cases would certainly be verbose and tedious, but the switch statement and the various string formatting do add a lot of mental complexity. I think I'd even prefer setting/updating expectedStatus and wantEvents directly in the switch. It would be a lot more straightforward and clear, though, to either enumerate the events and strings in the cases or add an entirely separate test function for the new behavior.
It's already in TestReconcileContainerFailures, I forgot to remove that part...
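For the record, the kind of dedicated table-driven test being suggested might look roughly like this (only a sketch of the shape; the function name and fixture values are made up, and the real coverage lives in TestReconcileContainerFailures):

import "testing"

func TestContainerFailuresFailFast(t *testing.T) {
    for _, tc := range []struct {
        desc          string
        failure       string // "step" or "sidecar"
        waitingReason string
        message       string
    }{{
        desc:          "missing ConfigMap in a step env",
        failure:       "step",
        waitingReason: "CreateContainerConfigError",
        message:       `configmap "missing" not found`,
    }, {
        desc:          "invalid image name in a sidecar",
        failure:       "sidecar",
        waitingReason: "InvalidImageName",
        message:       `Invalid image "whatever"`,
    }} {
        t.Run(tc.desc, func(t *testing.T) {
            // Build a TaskRun whose pod reports tc.waitingReason on the given
            // container, reconcile once, and assert the run fails immediately
            // with an actionable message rather than waiting for its timeout.
        })
    }
}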
d078a03 to 1c64c91 Compare
- Fail fast on missing ConfigMaps/Secrets within seconds of detection
- Show actionable error messages instead of generic timeout failures
- Add early detection for CreateContainerConfigError and related failures

Signed-off-by: Vincent Demeester <[email protected]>
twoGiants left a comment
Thanks for the rework! Makes the logic easier to follow 😸 👍
/approve
/lgtm
/meow
In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aThorp96, twoGiants

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Changes
… valueFrom, within seconds of detection

/kind bug
Fixes #9144
/cc @afrittoli @aThorp96 @arewm
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep

Release Notes