Skip to content

stuck terminating: check_if_node_is_quorum_critical false positive #1902

Open
@awoimbee

Description

@awoimbee

Describe the bug

I had a pod stuck in terminating state because of rabbitmq-diagnostics check_if_node_is_quorum_critical telling me 4 queues are critical when they seem to have 1 leader + 2 followers each.
It can also happen that check_if_node_is_quorum_critical gets stuck if a queue just doesn't have enough followers as it's not growing the queues following.
And there are no alerts sent if the process takes time or even times out.

To Reproduce

This only happens on my big cluster (>300 queues, a small amount of churn).

Expected behavior

  • No false positives
  • The preStop is not passive (grows the qeues following)
  • Alerts are sent

Version and environment information

  • RabbitMQ: 4.1.1
  • RabbitMQ Cluster Operator: 2.15.0
  • Kubernetes: 1.32.5
  • Cloud provider or hardware configuration: AWS EKS

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions