Flows stuck in Cancelling state while using Kubernetes work pool #19593

@simone201

Description

Bug summary

Hi Prefect Team,

I'm using a self-hosted Prefect Server setup in Kubernetes, deployed using the official Helm Chart (version 2025.11.24182903). The infrastructure is as follows:

  • AWS EKS (Auto Mode) cluster
  • AWS RDS (PostgreSQL) instance
  • AWS ElastiCache (Redis) instance

The running services are:

  • Prefect Server 3.6.4
  • Background Services split up
  • Single Prefect Worker (via same Helm version)

All the workloads are running in the same namespace (prefect) and have the network policies needed to allow full ingress and egress access across the whole namespace. I can provide the Helm chart values if needed.

I'm using the following prefect.yaml file to deploy my flow (for reference):

# Generic metadata about this project
name: api-extractor
prefect-version: 3.6.4

# build section allows you to manage and build docker images
build:
- prefect_docker.deployments.steps.build_docker_image:
    id: build_image
    requires: prefect-docker>=0.3.1
    image_name: 0000.dkr.ecr.us-xxxx-1.amazonaws.com/extractor/api-extractor
    tag: 0.0.1
    dockerfile: Dockerfile
    platform: linux/amd64

# push section allows you to manage if and how this project is uploaded to remote locations
push:
- prefect_docker.deployments.steps.push_docker_image:
    requires: prefect-docker>=0.3.1
    image_name: '{{ build_image.image_name }}'
    tag: '{{ build_image.tag }}'

# pull section allows you to provide instructions
pull:
- prefect.deployments.steps.set_working_directory:
    directory: /app

# the deployments section allows you to provide configuration for deploying flows
deployments:
- name: extractor-deployment
  version: 1.0.0
  description: This deployment orchestrates extractor
  schedule: {cron: "00 18 * * *", slug: "utc-schedule", timezone: "UTC", active: true}
  flow_name: extractor
  entrypoint: flows/extractor.py:extract_from_api
  parameters:
    sources: 
      - users
      - transactions
    targets:
      sessions: "dev-users"
      events: "dev-transactions"
    target_type: "s3"
    output_format: "json"
    start_time: yesterday
    end_time: yesterday
  work_pool:
    name: extractor-work-pool
    work_queue_name: null
    job_variables:
      image: '{{ build_image.image }}'
      finished_job_ttl: 100
      memory: 16Gi
      image_pull_policy: Always
      service_account_name: extractor-account
      node_selector:
        karpenter.sh/capacity-type: on-demand
      env:
        PREFECT_RUNNER_HEARTBEAT_FREQUENCY: "30"

The flow runs smoothly in the K8s cluster as a proper Job with its customizations and parameters, but if I cancel it from the UI, the flow run stays in the Cancelling state even though the Job Pod is properly stopped (and its state updated accordingly).

To reproduce the issue, do the following:

  1. Deploy a flow as usual (with or without a schedule)
  2. Run the flow (manually or on schedule, via Quick run or Custom run)
  3. Wait for the flow to be provisioned and reach the Running state
  4. In the UI (or via the API), click the Cancel flow button
  5. Check that the flow run is in the Cancelling state in the UI/API
  6. Check that the flow Pod (from the K8s Job) is eventually stopped (this sometimes takes a while)
  7. After the Pod is stopped (gracefully), observe that the flow run in Prefect stays in the Cancelling state forever
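The stuck state can also be confirmed programmatically. A stdlib-only polling sketch along these lines works against the server's `GET /api/flow_runs/{id}` endpoint (the server URL and flow run ID below are placeholders):

```python
import json
import time
import urllib.request

# Placeholder: the in-cluster Prefect Server API URL
PREFECT_API_URL = "http://prefect-server.prefect.svc:4200/api"


def flow_run_state(raw: bytes) -> str:
    """Extract the state type (e.g. 'CANCELLING') from a flow-run API response body."""
    return json.loads(raw)["state"]["type"]


def poll_flow_run(flow_run_id: str, interval: int = 30, attempts: int = 20) -> str:
    """Poll GET /flow_runs/{id} until the run leaves CANCELLING, or give up."""
    for _ in range(attempts):
        with urllib.request.urlopen(f"{PREFECT_API_URL}/flow_runs/{flow_run_id}") as resp:
            state = flow_run_state(resp.read())
        if state != "CANCELLING":
            return state
        time.sleep(interval)
    return "CANCELLING"  # still stuck after all attempts
```

In my case the loop never leaves `CANCELLING`, even long after the Pod has terminated.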

I could not work around this with the suggested Automations either, since entering the Cancelling state does not trigger the automation rule, while other state changes do.
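For context, a reactive event trigger along these lines is the kind of rule that does not fire (sketch only: the event name `prefect.flow-run.Cancelling` is assumed from Prefect's standard flow-run state event naming, and the exact trigger schema may differ):

```json
{
  "type": "event",
  "posture": "Reactive",
  "expect": ["prefect.flow-run.Cancelling"],
  "match": { "prefect.resource.id": "prefect.flow-run.*" },
  "threshold": 1,
  "within": 0
}
```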

Am I missing something relevant in the workflow about managing these kinds of states when using Kubernetes work pools? I'll be happy to provide any info the team needs to troubleshoot the issue further.

Thanks in advance!

Version info

Version:              3.6.4
API version:          0.8.4
Python version:       3.12.10
Git commit:           d3c3ed50
Built:                Fri, Nov 21, 2025 06:04 PM
OS/Arch:              darwin/arm64
Profile:              dev
Server type:          server
Pydantic version:     2.12.2
Server:
  Database:           sqlite
  SQLite version:     3.51.0
Integrations:
  prefect-docker:     0.6.6
  prefect-kubernetes: 0.6.5

Additional context

No response
