Suggest a worker-based flow cancellation strategy #19938
base: main
Conversation
desertaxle left a comment
This looks like a solid plan! I left some comments and questions that I had while reading through.
It would be nice to have a section of potential/rejected alternatives. I think having 1 or 2 other implementations to compare against helps to give us confidence that we're choosing the correct approach.
> 2. **WebSocket Events**: Subscribe to `prefect.flow-run.Cancelling` events for real-time detection of new cancellations.
>
> No continuous polling is needed—the startup poll handles the restart case, and WebSocket events handle the steady-state case.
The Runner has a fallback polling mechanism for cancellation if there are issues with the websocket. We might want something similar for the worker.
I was considering that. I had questions around the load that polling would put on the server. Maybe this is a non-issue, or we could poll at a slower cadence since this is a fallback to the fallback.
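Roughly, the fallback could look like the sketch below. All of the `worker.*` helpers and the interval constant are hypothetical placeholders for whatever the implementation ends up exposing; the point is only that this loop can run at a much slower cadence than the Runner's, since websockets stay the primary path.

```python
import asyncio

# Assumed value for illustration; no such setting exists yet.
FALLBACK_POLL_INTERVAL_SECONDS = 120


async def fallback_cancellation_poll(worker) -> None:
    """Low-frequency safety net in case websocket cancellation events are missed."""
    while not worker.is_shutting_down():  # hypothetical helper
        # Hypothetical helper: query the API for CANCELLING runs this worker manages.
        for flow_run in await worker.query_cancelling_flow_runs():
            # Hypothetical helper: same code path the websocket handler would use.
            await worker.handle_cancellation(flow_run)
        await asyncio.sleep(FALLBACK_POLL_INTERVAL_SECONDS)
```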
> 1. Parse `infrastructure_pid` to verify it's infrastructure this worker can manage
> 2. Calculate time remaining until grace period expires based on `state.timestamp`
> 3. Schedule an async task to check again after the remaining time
If the grace period has already passed, we can kill the infrastructure immediately, right?
Also, how should the worker handle a requested shutdown when there are scheduled cancellation tasks?
That's what I was thinking. I think the async task here is just to decouple the shutdown work from the event. The proposed code looks like the async task will sleep for the remaining time, which would be 0 in the event that the grace period has already passed.
As for scheduled cancellation tasks, I would think it should wait for those. The reason for the wait is to allow for cancellation_hooks to run. We could cancel immediately if the flow does not have cancellation_hooks. This is something to consider when we implement this.
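A minimal sketch of that scheduling idea, assuming a hypothetical `worker.force_cancel` helper that re-fetches state and kills infrastructure if the run is still `CANCELLING`. The `max(0, ...)` makes the already-expired case collapse to an immediate kill, and tracking the tasks in a set would let a worker shutdown await them so cancellation hooks still get a chance to run.

```python
import asyncio
from datetime import datetime, timezone


async def _force_cancel_after_grace(worker, flow_run, grace_period_seconds: float) -> None:
    # Assumes state.timestamp is timezone-aware.
    elapsed = (datetime.now(timezone.utc) - flow_run.state.timestamp).total_seconds()
    # If the grace period has already passed, this sleeps for 0 and proceeds immediately.
    await asyncio.sleep(max(0.0, grace_period_seconds - elapsed))
    await worker.force_cancel(flow_run.id)  # hypothetical: re-fetch state, kill if still CANCELLING


def schedule_force_cancellation(worker, flow_run, grace_period_seconds: float) -> asyncio.Task:
    task = asyncio.create_task(_force_cancel_after_grace(worker, flow_run, grace_period_seconds))
    # Track the task so a worker shutdown can await outstanding cancellations before exiting.
    worker._cancellation_tasks.add(task)  # hypothetical attribute
    task.add_done_callback(worker._cancellation_tasks.discard)
    return task
```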
> 1. **Startup Poll**: On worker startup, query the API for any flow runs in `CANCELLING` state belonging to this work pool. This catches orphaned flow runs from before the worker started (e.g., after a worker restart).
>
> 2. **WebSocket Events**: Subscribe to `prefect.flow-run.Cancelling` events for real-time detection of new cancellations.
It'd be good to specify the scope of the polling/subscription. Is it for all flow runs in the work pool? Only the flow runs from the work queues the worker is polling?
I don't have much context for how the worker currently polls for work, but I would think we would only want cancellation events from any source where the worker would initiate work. So probably work queues, if we can filter down to that without server-side changes to start.
My naive thought was that there are likely fewer cancellation events, so if we had to do client-side filtering, that seems like a reasonable trade-off.
My primary concern is adding load to the server.
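For the client-side filtering, something like the sketch below is what I have in mind. The payload shape (plain dicts with related-resource labels) is an assumption rather than the actual event schema, and the queue names would come from whatever the worker is already configured to poll.

```python
def is_relevant_cancellation(event: dict, work_queue_names: set[str]) -> bool:
    """Client-side filter: only act on cancellation events for queues this worker polls."""
    # Assumed payload shape; the real related-resource labels may differ.
    for related in event.get("related", []):
        if related.get("prefect.resource.role") == "work-queue":
            return related.get("prefect.resource.name") in work_queue_names
    return False
```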
> - Set a longer grace period to give the Runner more time
> - Set grace period to `-1` to disable worker-side cancellation entirely (accepting that stuck flows may remain in `CANCELLING` indefinitely)
>
> ### Duplicate Cancellation Attempts
Is there any consideration needed for scenarios where more than one worker is running for a work pool?
I think in this case we would just duplicate the work. A leasing system would be ideal, but to keep this all client-side, I don't see harm in multiple workers trying to delete the same thing. We will just need to be diligent to ensure that deleting something that is already deleted is still marked as a successful cancellation.
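Something like this sketch, where "already gone" is treated as success. The `InfrastructureNotFound` import mirrors the exception Prefect raises today when infrastructure can't be found, but treat that and the simplified `kill_infrastructure` call as assumptions:

```python
from prefect.exceptions import InfrastructureNotFound  # assumed import path


async def kill_and_confirm(worker, flow_run, infrastructure_pid: str) -> None:
    """Kill infrastructure, treating 'already gone' as success so concurrent workers don't conflict."""
    try:
        await worker.kill_infrastructure(infrastructure_pid)  # simplified signature
    except InfrastructureNotFound:
        # Another worker (or the platform itself) already removed it;
        # that's still a successful cancellation.
        pass
    await worker.mark_flow_run_as_cancelled(flow_run)  # hypothetical helper
```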
> 2. Calculate time remaining until grace period expires based on `state.timestamp`
> 3. Schedule an async task to check again after the remaining time
> 4. When the task fires, re-fetch the flow run state
> 5. If still `CANCELLING`, call `kill_infrastructure()` and mark as `CANCELLED`
How should we handle scenarios where kill_infrastructure fails?
This would not be a new failure case. This problem currently exists for Runner cancellations, so I think the current error handling would still be relevant. We may need to mirror it into the worker code. If the worker calling this gets an exception, we should raise the exception and log it as an error while keeping the flow in the `CANCELLING` state.
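Roughly like this sketch (logger and helper names are placeholders; the key points are logging, re-raising, and leaving the run in `CANCELLING`):

```python
import logging

logger = logging.getLogger(__name__)


async def attempt_force_cancel(worker, flow_run, infrastructure_pid: str) -> None:
    try:
        await worker.kill_infrastructure(infrastructure_pid)  # simplified signature
    except Exception as exc:
        # Log and re-raise; the flow run intentionally stays in CANCELLING so the
        # failure is visible and a later attempt (or manual intervention) can retry.
        logger.error("Failed to kill infrastructure for flow run %s: %s", flow_run.id, exc)
        raise
    await worker.mark_flow_run_as_cancelled(flow_run)  # hypothetical helper
```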
> | Setting | Default | Description |
> | --- | --- | --- |
> | `PREFECT_WORKER_ENABLE_CANCELLATION` | `false` | Feature flag to enable worker-side cancellation |
> | `PREFECT_WORKER_CANCELLATION_GRACE_PERIOD_SECONDS` | `60` | Seconds to wait before force-cancelling. Set to `-1` to disable |
If users can disable worker cancellation by providing `-1` to `PREFECT_WORKER_CANCELLATION_GRACE_PERIOD_SECONDS`, then maybe we don't need `PREFECT_WORKER_ENABLE_CANCELLATION`.
I am not sure if Claude fully understood my intent here. I don't think we would ever set `-1` as an env var. My thought was that eventually we could make server-side changes to allow for a `cancellation_grace_period` for a given flow. If that flow has complex cancellation_hooks that naturally take longer than a sensible default grace period, those hooks would not complete. The user could then set an extended grace period, or disable any worker force-kill with the understanding that their flows could get stuck in `CANCELLING`. A user could also set the grace period to `0` to immediately cancel that flow.
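For whatever form the value eventually takes (a worker setting now, possibly a per-flow `cancellation_grace_period` later), the interpretation described here is simple enough to sketch (names are illustrative):

```python
from typing import Optional


def effective_delay(grace_period_seconds: float, elapsed_seconds: float) -> Optional[float]:
    """How long to wait before force-cancelling, or None to never force-cancel.

    -1 disables worker-side force cancellation entirely.
     0 force-cancels immediately.
     N waits until N seconds have passed since the run entered CANCELLING.
    """
    if grace_period_seconds == -1:
        return None
    return max(0.0, grace_period_seconds - elapsed_seconds)
```

For example, `effective_delay(60, 75)` returns `0.0` (kill now), while `effective_delay(-1, 75)` returns `None` (never force-kill).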
> This works well when the Runner is healthy and responsive. However, cancellation fails when:
>
> - The flow process is stuck or hung (infinite loop, deadlock, blocking I/O)
> - The Runner crashed before receiving the cancellation signal
Handling this case might be out of scope for this proposal, since in cases like this the flow run should be marked as crashed by the worker. If we're seeing issues here, then improving the crash detection functionality of the workers is the way to go.
Again, I think this is just Claude digging deep to show off. I don't think this is a real case. I should have looked at those cases a little closer and cleaned them up.
> This works well when the Runner is healthy and responsive. However, cancellation fails when:
>
> - The flow process is stuck or hung (infinite loop, deadlock, blocking I/O)
Even if the flow process is stuck, the Runner process is the one that handles cancellation by killing the flow process, so I'm not sure this is a valid failure scenario.
Sorry, this is just Claude being a know-it-all... I don't think all these cases are real.