Skip to content

Conversation

@vdemeester
Copy link
Member

@vdemeester vdemeester commented Dec 2, 2025

Changes

  • Fail fast on missing ConfigMaps/Secrets used in env variables, using valueFrom, within seconds of detection
  • Show actionable error messages instead of generic timeout failures
  • Add early detection for CreateContainerConfigError and related failures

/kind bug

Fixes #9144

/cc @afrittoli @aThorp96 @arewm

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Fixed TaskRuns to fail immediately when pod containers have configuration errors (missing ConfigMap/Secret) instead of silently timing out.

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Dec 2, 2025
@tekton-robot tekton-robot requested a review from aThorp96 December 2, 2025 13:42
@tekton-robot
Copy link
Collaborator

@vdemeester: GitHub didn't allow me to request PR reviews from the following users: arewm.

Note that only tektoncd members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

Changes

  • Fail fast on missing ConfigMaps/Secrets within seconds of detection
  • Show actionable error messages instead of generic timeout failures
  • Add early detection for CreateContainerConfigError and related failures

/kind bug

Fixes #9144

/cc @afrittoli @aThorp96 @arewm

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Fixed TaskRuns to fail immediately when pod containers have configuration errors (missing ConfigMap/Secret) instead of silently timing out.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 2, 2025
@vdemeester vdemeester force-pushed the fix-9144 branch 2 times, most recently from e3a61ca to 256002d Compare December 2, 2025 13:55
@vdemeester
Copy link
Member Author

/retest

2 similar comments
@vdemeester
Copy link
Member Author

/retest

@vdemeester
Copy link
Member Author

/retest

@vdemeester vdemeester added this to the v1.8.0 milestone Dec 9, 2025
@afrittoli
Copy link
Member

  • Fail fast on missing ConfigMaps/Secrets within seconds of detection

WRT to this, users may be relying on the current behaviour in cases where a config map or secret is provisioned roughly at the same as the TaskRun or PipelineRun and may not yet be on the cluster during the first reconcile. Would it make sense to have a configurable timeout for this or at least a configurable on/off switch for the feature?

@vdemeester
Copy link
Member Author

  • Fail fast on missing ConfigMaps/Secrets within seconds of detection

WRT to this, users may be relying on the current behaviour in cases where a config map or secret is provisioned roughly at the same as the TaskRun or PipelineRun and may not yet be on the cluster during the first reconcile. Would it make sense to have a configurable timeout for this or at least a configurable on/off switch for the feature?

Ah I should have written a better description. Essentially, if they use Volumes/VolumeMounts for configmaps or secrets, the behavior doesn't change at all, as it is a recoverable issue (from the Pod and thus TaskRun stand-point). But if they use the EnvSource (aka loading an environment variable value from a ConfigMap or a Secret), this is where it will fail-fast as it won't ever be recovered. See #9144 (comment).

So users who were relying on current behavior will either see a failure early instead of a timeout (always), or it won't change a thing. If there was a failure before, there is a failure now, just quicker ; if there was no failure before, there is no failure now either (from the user point of view).

I'll update the description.

Copy link
Member

@aThorp96 aThorp96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall seems good to me! Couple questions and a few optional suggestions/nits (mostly regarding patterns which were already in the code)

func isPodHitConfigError(pod *corev1.Pod) bool {
// hasContainerWaitingReason checks if any container (init or regular) is waiting with a reason
// that matches the provided predicate function
func hasContainerWaitingReason(pod *corev1.Pod, predicate func(corev1.ContainerStateWaiting) bool) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this helper

Comment on lines 278 to 279
image := step.ImageID
message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, image, step.Waiting.Message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional nit: no need to alloc

Suggested change
image := step.ImageID
message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, image, step.Waiting.Message)
message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, step.ImageID, step.Waiting.Message)

Comment on lines 291 to 292
image := step.ImageID
message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, image, step.Waiting.Message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional nit: no need to alloc

Suggested change
image := step.ImageID
message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, image, step.Waiting.Message)
message := fmt.Sprintf(`the step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, step.ImageID, step.Waiting.Message)

Comment on lines 326 to 327
image := sidecar.ImageID
message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, image, sidecar.Waiting.Message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional nit: no need to alloc

Suggested change
image := sidecar.ImageID
message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, image, sidecar.Waiting.Message)
message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, sidecar.ImageID, sidecar.Waiting.Message)

Comment on lines 339 to 340
image := sidecar.ImageID
message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, image, sidecar.Waiting.Message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional nit: no need to alloc

Suggested change
image := sidecar.ImageID
message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, image, sidecar.Waiting.Message)
message := fmt.Sprintf(`the sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, sidecar.ImageID, sidecar.Waiting.Message)

expectedReason = "PodCreationFailed"
expectedMessage = fmt.Sprintf(`the %s "unnamed-%d" in TaskRun "test-imagepull-fail" failed to start. The pod errored with the message: "%s."`, tc.failure, stepNumber, tc.message)
wantFailedEvent = fmt.Sprintf(`Warning Failed the %s "unnamed-%d" in TaskRun "test-imagepull-fail" failed to start. The pod errored with the message: "%s.`, tc.failure, stepNumber, tc.message)
default: // Image pull errors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional suggestion:

Better to be explicit IMO. WDYT?

Suggested change
default: // Image pull errors
case "InvalidImageName", "ImagePullBackOff":

}

// Handle other image-related errors
if step.Waiting.Reason == "ErrImagePull" || step.Waiting.Reason == "InvalidImageName" {
Copy link
Member

@aThorp96 aThorp96 Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential bug:
IIUC ErrImagePull can happen transiently and is automatically retried before the ImagePullBackoff. If that's the case, would this logic cause TaskRuns to fail instead of the image pull being retried? If I am reading this right, if we include step.Waiting.Reason == "ErrImagePull" then we'll fast fail before we ever allow the retry to get to ErrImagePullBackoff. Is that right? Maybe that's the point of the PR, but I do worry that it could make Tekton seem "flakier" to users if TaskRuns are more prone to fast-fail on transient issues.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you are right, I'll remove that one.

Copy link
Member

@aThorp96 aThorp96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Two minor suggestions to utilize the new constants

}

// Handle CreateContainerConfigError (missing ConfigMap/Secret, invalid env vars, etc.)
if step.Waiting.Reason == "CreateContainerConfigError" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if step.Waiting.Reason == "CreateContainerConfigError" {
if step.Waiting.Reason == CreateContainerConfigError {

// Handle InvalidImageName (unrecoverable error)
// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
// will automatically retry before transitioning to ImagePullBackOff
if step.Waiting.Reason == "InvalidImageName" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if step.Waiting.Reason == "InvalidImageName" {
if step.Waiting.Reason == InvalidImageName {

}

// Handle CreateContainerConfigError (missing ConfigMap/Secret, invalid env vars, etc.)
if sidecar.Waiting.Reason == "CreateContainerConfigError" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if sidecar.Waiting.Reason == "CreateContainerConfigError" {
if sidecar.Waiting.Reason == CreateContainerConfigError {

// Handle InvalidImageName (unrecoverable error)
// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
// will automatically retry before transitioning to ImagePullBackOff
if sidecar.Waiting.Reason == "InvalidImageName" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if sidecar.Waiting.Reason == "InvalidImageName" {
if sidecar.Waiting.Reason == InvalidImageName {

Comment on lines 295 to 296
// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
// will automatically retry before transitioning to ImagePullBackOff
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏🏼

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 11, 2025
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 11, 2025
Copy link
Member

@twoGiants twoGiants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! Thank you for this 🥇

See my comments below. I would cleanup a bit, add more tests and decompose the big test function to have the container failures tested in its own more concise test functions.

Comment on lines 261 to 262
if step.Waiting != nil {
if _, found := podFailureReasons[step.Waiting.Reason]; found {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great opportunity to clean this up a bit. You could use the guard pattern twice here (and below for the sidecars) and reduce the nesting of the ifs by 2 and then extract the duplicated checks into a helper 😸 =>

		if step.Waiting == nil {
			continue
		}

		if _, found := podFailureReasons[step.Waiting.Reason]; !found {
			continue
		}

		c.checkContainerFailure(
			ctx,
			tr,
			step.Waiting,
			step.Name,
			step.ImageID,
			"step",
		)

And the waiting reason checks can go into a helper and reused below for the sidecars as well:

func (c *Reconciler) checkContainerFailure(
	ctx context.Context,
	tr *v1.TaskRun,
	waiting *corev1.ContainerStateWaiting,
	name,
	imageID,
	containerType string,
) (bool, v1.TaskRunReason, string) {
	if waiting.Reason == ImagePullBackOff {
		imagePullBackOffTimeOut := config.FromContextOrDefaults(ctx).Defaults.DefaultImagePullBackOffTimeout
		// only attempt to recover from the imagePullBackOff if specified
		if imagePullBackOffTimeOut.Seconds() != 0 {
			p, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
			if err != nil {
				message := fmt.Sprintf(`the %s %q in TaskRun %q failed to pull the image %q. Failed to get pod with error: "%s."`, containerType, name, tr.Name, imageID, err)
				return true, v1.TaskRunReasonImagePullFailed, message
			}
			for _, condition := range p.Status.Conditions {
				// check the pod condition to get the time when the pod was ready to start containers / initialized.
				// keep trying until the pod schedule time has exceeded the specified imagePullBackOff timeout duration
				if slices.Contains(imagePullBackOffTimeoutPodConditions, string(condition.Type)) {
					if c.Clock.Since(condition.LastTransitionTime.Time) < imagePullBackOffTimeOut {
						return false, "", ""
					}
				}
			}
		}
		// ImagePullBackOff timeout exceeded or not configured
		message := fmt.Sprintf(`the %s %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, containerType, name, tr.Name, imageID, waiting.Message)
		return true, v1.TaskRunReasonImagePullFailed, message
	}

	// Handle CreateContainerConfigError (missing ConfigMap/Secret, invalid env vars, etc.)
	if waiting.Reason == CreateContainerConfigError {
		message := fmt.Sprintf(`the %s %q in TaskRun %q failed to start. The pod errored with the message: "%s."`, containerType, name, tr.Name, waiting.Message)
		return true, v1.TaskRunReasonCreateContainerConfigError, message
	}

	// Handle InvalidImageName (unrecoverable error)
	// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
	// will automatically retry before transitioning to ImagePullBackOff
	if waiting.Reason == InvalidImageName {
		message := fmt.Sprintf(`the %s %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, containerType, name, tr.Name, imageID, waiting.Message)
		return true, v1.TaskRunReasonImagePullFailed, message
	}

	// Handle CreateContainerError and other generic failures
	message := fmt.Sprintf(`the %s %q in TaskRun %q failed to start. The pod errored with the message: "%s."`, containerType, name, tr.Name, waiting.Message)
	return true, v1.TaskRunReasonPodCreationFailed, message
}

I also changed this message in the ImagePullBackOff error => the step %q in TaskRun %q failed to pull the image %q and the pod with error, the last part sounds wrong. I used: the %s %q in TaskRun %q failed to pull the image %q. Failed to get pod with error: "%s.

}

// Handle InvalidImageName (unrecoverable error)
// Note: ErrImagePull is not handled here as it's a transient state that Kubernetes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But when ErrImagePull is the reason the method returns true in line 304 and the reconciler fails the TaskRun. It looks like we either should not check for this transient state or if we do then return false with a message like retrying pulling image.

message: "secret \"secret-for-testing\" not found",
failure: "sidecar",
}, {
desc: "create container error step",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same test case for sidecar is missing.

message: "Invalid image \"whatever\"",
failure: "step",
imagePullBackOffTimeout: "5h",
}, {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the ErrImagePull case is added in the implementation it must be added here too for both.

stepNumber = 1
}

var expectedReason, expectedMessage, wantFailedEvent string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to get this 🥲. But I got it! 🤣

This test is quite complex already with all its conditions down below. The additional switch makes it a even more difficult to understand.

What do you think about extracting the tests for the new container failure logic into it's own table driven test function with a simpler setup, execution and assertions instead of extending this one?

Copy link
Member

@aThorp96 aThorp96 Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
Enumerating the almost-entirely-identical message and event strings inside the test cases would certainly be verbose and tedious, but the switch statement and various string formatting does add a lot of mental complexity. I think I'd even prefer setting/updating expectedStatus and wantEvents directly in the switch. It would be a lot more straightforward and clear though to either enumerate the events and strings in the cases or add an entirely different test function for the new behavior

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already in TestReconcileContainerFailures, I forgot to remove that part...

@vdemeester vdemeester force-pushed the fix-9144 branch 3 times, most recently from d078a03 to 1c64c91 Compare December 18, 2025 12:30
- Fail fast on missing ConfigMaps/Secrets within seconds of detection
- Show actionable error messages instead of generic timeout failures
- Add early detection for CreateContainerConfigError and related failures

Signed-off-by: Vincent Demeester <[email protected]>
Copy link
Member

@twoGiants twoGiants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the rework! Makes the logic easier to follow 😸 👍

/approve
/lgtm
/meow

@tekton-robot
Copy link
Collaborator

@twoGiants: cat image

Details

In response to this:

Thanks for the rework! Makes the logic easier to follow 😸 👍

/approve
/lgtm
/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 22, 2025
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aThorp96, twoGiants

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 22, 2025
@tekton-robot tekton-robot merged commit 264d5fa into tektoncd:main Dec 22, 2025
24 checks passed
@vdemeester vdemeester deleted the fix-9144 branch December 22, 2025 18:56
@vdemeester vdemeester modified the milestones: v1.8.0, v1.9.0 (LTS) Jan 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Misconfigured TaskRun pods cause PipelineRuns to silently timeout

5 participants