failed daemon leaves workflow in bad state #14715

@rwong2888

Description

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

possibly related to #14544

[screenshot attached to the original issue]

I have transient-error retries configured for operations such as pod deletion.

The staging app server errored out with exit status 143, and the other app server's pod was deleted; I'm not sure why. It then looks like the pod was spun back up due to the transient-error retries we have configured for pod deletions, after which it attempted two more retries of the git clone steps. The workflows are now stuck in an inconsistent state: the workflow is marked Failed, but it still has pods running.
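The inconsistent state described above (workflow phase Failed while pods are still running) can be spotted with a short diagnostic script. This is a hypothetical sketch, not part of Argo Workflows itself; it assumes `kubectl` access to the namespace and that pods created by the workflow carry the standard `workflows.argoproj.io/workflow` label.

```python
import json
import subprocess

# Argo Workflows conventions assumed here: the workflow phase lives at
# .status.phase, and pods created by a workflow carry the
# workflows.argoproj.io/workflow label.
TERMINAL_PHASES = {"Succeeded", "Failed", "Error"}
ACTIVE_POD_PHASES = {"Pending", "Running"}


def is_stuck(workflow_phase: str, pod_phases: list[str]) -> bool:
    """True if the workflow reached a terminal phase but pods
    belonging to it are still pending or running."""
    return workflow_phase in TERMINAL_PHASES and any(
        p in ACTIVE_POD_PHASES for p in pod_phases
    )


def check_workflow(namespace: str, workflow: str) -> bool:
    """Query the cluster via kubectl (requires cluster access)."""
    wf = json.loads(subprocess.check_output(
        ["kubectl", "get", "workflow", workflow,
         "-n", namespace, "-o", "json"]
    ))
    pods = json.loads(subprocess.check_output(
        ["kubectl", "get", "pods", "-n", namespace,
         "-l", f"workflows.argoproj.io/workflow={workflow}",
         "-o", "json"]
    ))
    return is_stuck(
        wf["status"]["phase"],
        [p["status"]["phase"] for p in pods["items"]],
    )
```

Run against the example in this report, `check_workflow("frontend", "playwright-simple-site-dg7wr")` would return True whenever the workflow is terminal but pods linger.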

Version(s)

3.7.0

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

Running into the character limit here, so I'm linking to the logs in a Slack snippet instead:
https://cloud-native.slack.com/archives/C01QW9QSSSK/p1753901691851319?thread_ts=1753901385.723079&cid=C01QW9QSSSK

Logs from in your workflow's wait container

I think garbage collection had already run:


kubectl logs -n frontend -c wait -l workflows.argoproj.io/workflow=playwright-simple-site-dg7wr,workflow.argoproj.io/phase!=Succeeded
No resources found in frontend namespace.
