test(e2e): move flaky retry/matrix tests to no-ci temporarily #9242

Merged
tekton-robot merged 2 commits into tektoncd:main from vdemeester:fix-matrix-timeout
Dec 17, 2025

@vdemeester (Member) commented Dec 17, 2025

Changes

Move two consistently flaky e2e tests to the no-ci/ directory to unblock CI while we investigate and fix the root cause:

  1. pipelinerun-with-matrix (examples/v1/pipelineruns/beta/pipelinerun-with-matrix.yaml)

    • Times out at ~900 seconds on k8s-oldest + alpha features
    • Multiple failures: Dec 16 (run 20265445080), Dec 15 (runs 20242616379, 20231100917)
    • Suspected issue: the matrix-with-task-retries task may have a bug in context variable substitution
  2. using-retries-and-retry-count-variables (examples/v1/pipelineruns/using-retries-and-retry-count-variables.yaml)

    • Times out at ~900 seconds on k8s-oldest + alpha features
    • Failure: Dec 16 (run 20276197820)
    • Uses similar retry logic that may be affected by the same root cause

Analysis

Both tests time out at the 15-minute mark (the global test timeout in test/wait.go:64) when running on k8s-oldest with alpha features enabled. The common pattern:

  • ✅ Pass on k8s-latest
  • ❌ Timeout on k8s-oldest + alpha
  • Both involve retry logic with context variables
  • Timeout is consistent (~900-903 seconds)

Hypothesis: The retry logic with $(context.pipelineTask.retries) or $(context.task.retry-count) may not work correctly on older Kubernetes versions with alpha features, causing infinite retry loops until timeout.
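
To make the hypothesis concrete, here is a toy model in Python. It is not Tekton code: it assumes a step that succeeds only once the retry count it sees equals the configured number of retries, and it assumes that a broken substitution leaves that count stuck at 0. The function name and the `substitution_works` flag are hypothetical.

```python
def run_attempts(retries: int, substitution_works: bool) -> tuple[bool, int]:
    """Return (succeeded, attempts_made) for a step gated on its retry count."""
    for attempt in range(retries + 1):  # the first try plus `retries` retries
        # Assumption: a broken substitution leaves the visible counter at 0.
        seen_retry_count = attempt if substitution_works else 0
        if seen_retry_count == retries:
            return True, attempt + 1
    return False, retries + 1


print(run_attempts(5, substitution_works=True))   # (True, 6)
print(run_attempts(5, substitution_works=False))  # (False, 6)
```

In this model a broken counter makes every attempt fail, but the run still ends after a bounded number of attempts. A hang until the ~900-second timeout would need something more, such as attempts that never complete, which is part of what the investigation would have to confirm.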

Next Steps

  • Reproduce locally with k8s 1.28.0 + alpha features
  • Investigate context variable substitution with retries
  • Identify and fix root cause
  • Move tests back to their original locations
  • Verify tests run reliably in CI

Related

Closes #9201
Related to #9062

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

NONE

- Add startupProbe to accommodate slow Docker daemon initialization
- Allow up to 30 seconds for daemon startup before marking as failed
- Verify daemon functionality with 'docker info' instead of cert existence
- Tries to address flaky test failures in CI with k8s native sidecar support

Signed-off-by: Vincent Demeester <[email protected]>
Move two consistently flaky e2e tests to no-ci to unblock CI while we
investigate and fix the root cause:

1. pipelinerun-with-matrix - Times out at 900s on k8s-oldest + alpha
2. using-retries-and-retry-count-variables - Times out at 900s on k8s-oldest + alpha

Both tests time out at exactly 15 minutes (global test timeout) when
running on k8s-oldest with alpha features enabled. The issue appears to
be related to retry logic with matrix expansion or context variable
substitution.

Related to tektoncd#9201, tektoncd#9062
@tekton-robot added the do-not-merge/release-note-label-needed label ("Indicates that a PR should not merge because it's missing one of the release note labels.") on Dec 17, 2025
@tekton-robot added the size/S label ("Denotes a PR that changes 10-29 lines, ignoring generated files.") on Dec 17, 2025
@tekton-robot added the release-note-none label ("Denotes a PR that doesn't merit a release note.") and removed the do-not-merge/release-note-label-needed label on Dec 17, 2025
@vdemeester (Member, Author)

/kind flake

@tekton-robot added the kind/flake label ("Categorizes issue or PR as related to a flakey test") on Dec 17, 2025
@afrittoli (Member) left a comment

/approve

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot added the approved label ("Indicates a PR has been approved by an approver from all required OWNERS files.") on Dec 17, 2025
@khrm (Contributor) left a comment

/lgtm

@tekton-robot added the lgtm label ("Indicates that a PR is ready to be merged.") on Dec 17, 2025
@tekton-robot merged commit 57927d7 into tektoncd:main on Dec 17, 2025
24 of 27 checks passed
@vdemeester deleted the fix-matrix-timeout branch on December 17, 2025 11:34
@vdemeester (Member, Author)

/cherry-pick release-v1.0.x

@tekton-robot (Collaborator)

Cherry-pick to release-v1.0.x successful!

A new pull request has been created to cherry-pick this change to release-v1.0.x.

Please review and merge the cherry-pick PR.

@vdemeester (Member, Author)

/cherry-pick release-v1.0.x

@tekton-robot (Collaborator)

ℹ️ Cherry-pick to release-v1.0.x already exists!

A pull request for this cherry-pick already exists: #9315

PR: #9315

vdemeester added a commit to vdemeester/tektoncd-pipeline that referenced this pull request Jan 29, 2026
Add startupProbe with failureThreshold of 30 to allow more time for Docker
daemon initialization. Change readinessProbe to use 'docker info' command
which verifies the daemon is actually ready to accept commands.

This is a cherry-pick of the probe changes from tektoncd#9242 on main.
tekton-robot pushed a commit that referenced this pull request Jan 30, 2026
Add startupProbe with failureThreshold of 30 to allow more time for Docker
daemon initialization. Change readinessProbe to use 'docker info' command
which verifies the daemon is actually ready to accept commands.

This is a cherry-pick of the probe changes from #9242 on main.
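
The probe commit notes above mention a 30-second startup allowance and a failureThreshold of 30. For a Kubernetes startupProbe, the container gets roughly failureThreshold × periodSeconds to start before the kubelet gives up on it. The sketch below works through that arithmetic in Python. The manifest is not shown in this thread, so a periodSeconds of 1 is an assumption inferred from the two numbers, and the function name is hypothetical.

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time a startupProbe allows before the kubelet gives up."""
    return failure_threshold * period_seconds


# failureThreshold=30 comes from the commit message; periodSeconds=1 is an
# assumption that matches the stated 30-second allowance.
print(startup_budget_seconds(30, 1))   # 30

# With the Kubernetes default periodSeconds of 10, the same threshold
# would allow a much longer startup window.
print(startup_budget_seconds(30, 10))  # 300
```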
@vdemeester (Member, Author)

/cherry-pick release-v1.0.x

@tekton-robot (Collaborator)

Cherry-pick to release-v1.0.x failed!

The automatic cherry-pick to release-v1.0.x failed.

Output:

🤖 Starting cherry-pick process...
Fetching PR #9242 information...
Found merge commit: 57927d74a611802726dcc68505eb60ac5078291d
PR title: test(e2e): move flaky retry/matrix tests to no-ci temporarily
Fetching target branch: release-v1.0.x...
From https://github.com/tektoncd/pipeline
 * branch                release-v1.0.x -> FETCH_HEAD
Checking for existing cherry-pick PR...
Creating cherry-pick branch: cherry-pick-9242-to-release-v1.0.x...
Switched to a new branch 'cherry-pick-9242-to-release-v1.0.x'
branch 'cherry-pick-9242-to-release-v1.0.x' set up to track 'origin/release-v1.0.x'.
Fetching commits from PR #9242...
Found 2 commit(s) to cherry-pick
Cherry-picking commit 1/2: 209e1b55ebddad2e6728f31f40dc95f39aad2c48...
fatal: bad object 209e1b55ebddad2e6728f31f40dc95f39aad2c48
❌ ERROR: Cherry-pick failed for commit 209e1b55ebddad2e6728f31f40dc95f39aad2c48 due to conflicts or other errors

Next steps:

  • Check the action logs for complete details
  • If the PR is not merged, merge it first and try again
  • If there are conflicts, you'll need to manually cherry-pick this PR
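
The log above ends with "conflicts or other errors", but the `fatal: bad object` line is what git prints when a commit object is not present in the local repository, which points to a commit that was never fetched rather than to a merge conflict. The following Python sketch pulls such SHAs out of a log so the two cases can be told apart. The helper name is hypothetical and it is not part of the cherry-pick bot.

```python
import re

# Matches the message git emits for an object missing from the local repo.
BAD_OBJECT = re.compile(r"^fatal: bad object ([0-9a-f]{7,40})$")


def missing_commits(log_text: str) -> list[str]:
    """Return the SHAs that git reported as 'bad object' in a log."""
    found = []
    for line in log_text.splitlines():
        match = BAD_OBJECT.match(line.strip())
        if match:
            found.append(match.group(1))
    return found


sample = """Cherry-picking commit 1/2: 209e1b55ebddad2e6728f31f40dc95f39aad2c48...
fatal: bad object 209e1b55ebddad2e6728f31f40dc95f39aad2c48
"""
print(missing_commits(sample))
```

An empty result would suggest the failure came from something else, such as a real conflict, and the action logs would be the next place to look.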

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • kind/flake: Categorizes issue or PR as related to a flakey test
  • lgtm: Indicates that a PR is ready to be merged.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/S: Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

e2e: Re-enable and fix flaky pipelinerun-with-matrix example test

4 participants