Bug summary
Hi Prefect Team,
I'm running a self-hosted Prefect Server in Kubernetes, deployed with the official Helm chart (version 2025.11.24182903). The infrastructure is the following:
- AWS EKS (Auto Mode) cluster
- AWS RDS (PostgreSQL) instance
- AWS ElastiCache (Redis) instance
The running services are:
- Prefect Server 3.6.4
- Background services split out into their own workloads
- A single Prefect worker (deployed via the same Helm chart version)
All the workloads are running in the same namespace (prefect) and have the network policies needed to allow full ingress and egress access across the whole namespace. I can provide the Helm chart values if needed.
I'm using the following prefect.yaml file to deploy my flow (for reference):
# Generic metadata about this project
name: api-extractor
prefect-version: 3.6.4

# build section allows you to manage and build docker images
build:
- prefect_docker.deployments.steps.build_docker_image:
    id: build_image
    requires: prefect-docker>=0.3.1
    image_name: 0000.dkr.ecr.us-xxxx-1.amazonaws.com/extractor/api-extractor
    tag: 0.0.1
    dockerfile: Dockerfile
    platform: linux/amd64

# push section allows you to manage if and how this project is uploaded to remote locations
push:
- prefect_docker.deployments.steps.push_docker_image:
    requires: prefect-docker>=0.3.1
    image_name: '{{ build_image.image_name }}'
    tag: '{{ build_image.tag }}'

# pull section allows you to provide instructions for cloning this project in remote locations
pull:
- prefect.deployments.steps.set_working_directory:
    directory: /app

# the deployments section allows you to provide configuration for deploying flows
deployments:
- name: extractor-deployment
  version: 1.0.0
  description: This deployment orchestrates extractor
  schedule: {cron: "00 18 * * *", slug: "utc-schedule", timezone: "UTC", active: true}
  flow_name: extractor
  entrypoint: flows/extractor.py:extract_from_api
  parameters:
    sources:
      - users
      - transactions
    targets:
      sessions: "dev-users"
      events: "dev-transactions"
    target_type: "s3"
    output_format: "json"
    start_time: yesterday
    end_time: yesterday
  work_pool:
    name: extractor-work-pool
    work_queue_name: null
    job_variables:
      image: '{{ build_image.image }}'
      finished_job_ttl: 100
      memory: 16Gi
      image_pull_policy: Always
      service_account_name: extractor-account
      node_selector:
        karpenter.sh/capacity-type: on-demand
      env:
        PREFECT_RUNNER_HEARTBEAT_FREQUENCY: "30"

The flow runs smoothly in the K8s cluster as a proper Job, with all of its customizations and parameters, but if I cancel it from the UI, the flow run stays in the Cancelling state even though the Job Pod is properly stopped (and its state updated accordingly).
To reproduce the issue, do the following:
- Deploy a flow as usual (with or without a schedule)
- Run the flow (manually or on a schedule, via Quick run or Custom run)
- Wait for the flow run to provision and reach the Running state
- In the UI (or via the API), click the Cancel button on the flow run
- Check that the flow run is in the Cancelling state in the UI/API
- Check that the flow Pod (created by the K8s Job) is eventually stopped (this can take a while)
- After the Pod has stopped (gracefully), the flow run in Prefect stays in the Cancelling state forever (the sketch below shows how I confirm this via the client).
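For reference, this is a minimal sketch of how I double-check the stuck state from the client side; the flow run ID is a placeholder (I take the real one from the UI), and the rest uses the standard Prefect 3 client:

```python
# Minimal sketch (placeholder flow run ID): read the flow run back through the
# Prefect 3 client to confirm it is still CANCELLING after the Job Pod is gone.
import asyncio
from uuid import UUID

from prefect import get_client


async def check_state(flow_run_id: UUID) -> None:
    async with get_client() as client:
        flow_run = await client.read_flow_run(flow_run_id)
        # Expected: CANCELLED once the infrastructure is torn down;
        # observed: it stays CANCELLING indefinitely.
        print(flow_run.state.type, flow_run.state.name)


if __name__ == "__main__":
    asyncio.run(check_state(UUID("00000000-0000-0000-0000-000000000000")))
```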
This behavior is not fixed by the proposed Automations workaround either: an automation rule triggered on the Cancelling state never fires, while the same rule does fire for other states.
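For context, this is roughly the automation I tried, written against the Python automations API. The trigger/action class names are from memory, so treat it as a sketch of the idea (react to a flow run entering Cancelling and force the run to Cancelled) rather than an exact definition:

```python
# Rough sketch (class names from memory): an automation that reacts to a flow
# run entering Cancelling and forces the run into a Cancelled state.
from prefect.automations import Automation
from prefect.client.schemas.objects import StateType
from prefect.events.actions import ChangeFlowRunState
from prefect.events.schemas.automations import EventTrigger, Posture

automation = Automation(
    name="force-cancelled",
    trigger=EventTrigger(
        expect={"prefect.flow-run.Cancelling"},  # state-change event for Cancelling
        posture=Posture.Reactive,
        threshold=1,
    ),
    actions=[ChangeFlowRunState(state=StateType.CANCELLED)],
).create()
```

In my tests the equivalent rule fires for other state events (e.g. prefect.flow-run.Failed), but never for Cancelling.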
Am I missing something in the workflow for managing these kinds of states when using Kubernetes work pools? I'm happy to provide any additional information the team needs to troubleshoot the issue further.
Thanks in advance!
Version info
Version: 3.6.4
API version: 0.8.4
Python version: 3.12.10
Git commit: d3c3ed50
Built: Fri, Nov 21, 2025 06:04 PM
OS/Arch: darwin/arm64
Profile: dev
Server type: server
Pydantic version: 2.12.2
Server:
  Database: sqlite
  SQLite version: 3.51.0
Integrations:
  prefect-docker: 0.6.6
  prefect-kubernetes: 0.6.5
Additional context
No response