Snuba doesn't drop connection when Kafka node dies while still creating the healthcheck file #7763

@jiriks74

Environment

What version are you running? 25.9.0

Steps to Reproduce

  1. Have a Kafka Cluster
  2. Use SSL
  3. Rip out one node to test fail-over (either network or power so it's completely unreachable)
  4. Observe that Snuba doesn't drop the connection to the dead broker
  5. Snuba stays stuck in an SSL handshake error loop until it is restarted

The following ENVs are set:

  • DEFAULT_BROKERS: kafka-broker-1,kafka-broker-2,...,kafka-broker-9
  • KAFKA_SECURITY_PROTOCOL: SSL
  • KAFKA_SSL_CA_PATH: /etc/ssl/certs/my-ca.pem
  • KAFKA_SSL_CERT_PATH: client.crt
  • KAFKA_SSL_KEY_PATH: client.key
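
For readers who want to reproduce this outside Snuba, here is a minimal sketch of how the variables above could translate into a librdkafka-style client configuration. The mapping table and the `build_client_config` helper are assumptions made for illustration, not Snuba's actual settings code. The right-hand keys (`bootstrap.servers`, `security.protocol`, `ssl.ca.location`, `ssl.certificate.location`, `ssl.key.location`) are standard librdkafka properties.

```python
import os

# Hypothetical mapping from the environment variables in this report to
# librdkafka configuration keys. Snuba's real mapping may differ.
ENV_TO_LIBRDKAFKA = {
    "KAFKA_SECURITY_PROTOCOL": "security.protocol",
    "KAFKA_SSL_CA_PATH": "ssl.ca.location",
    "KAFKA_SSL_CERT_PATH": "ssl.certificate.location",
    "KAFKA_SSL_KEY_PATH": "ssl.key.location",
}


def build_client_config(env=None):
    """Build a librdkafka-style config dict from environment variables."""
    env = os.environ if env is None else env
    config = {}

    # DEFAULT_BROKERS is a comma-separated list of bootstrap brokers.
    brokers = [b.strip() for b in env.get("DEFAULT_BROKERS", "").split(",")]
    brokers = [b for b in brokers if b]
    if brokers:
        config["bootstrap.servers"] = ",".join(brokers)

    # Copy over only the SSL-related settings that are actually set.
    for env_name, key in ENV_TO_LIBRDKAFKA.items():
        value = env.get(env_name)
        if value:
            config[key] = value
    return config
```

A dict like this can be passed to a librdkafka-based client such as `confluent_kafka.Producer` to test fail-over against the same cluster without the rest of Snuba.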

Expected Result

Snuba drops the broken connection and connects to another working broker.

The health-check file is not created while the consumer is in a non-working state. The consumer is started with:

--health-check-file /tmp/health.txt

Actual Result

Snuba keeps retrying the SSL handshake with the dead broker.

%4|1771925946.628|FAIL|rdkafka#producer-1| [thrd:ssl://kafka-broker-1:9093/bootstra]: ssl://kafka-broker-1:9093/1: Connection setup timed out in state CONNECT (after 30027ms in state CONNECT, 1 identical error(s) suppressed
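
To see which brokers and which clients are affected across many consumers, lines like the one above can be tallied. This is a rough sketch that assumes every line follows the `%level|timestamp|facility|client| [thrd:...]: message` layout seen here. The regular expression and the helper names are hypothetical and have only been checked against this one sample.

```python
import re
from collections import Counter

# Pattern derived from the single sample line in this report (an assumption).
LOG_LINE = re.compile(
    r"^%(?P<level>\d)\|(?P<timestamp>\d+\.\d+)\|(?P<facility>[A-Z_]+)\|"
    r"(?P<client>[^|]+)\| \[thrd:(?P<thread>[^\]]+)\]: (?P<message>.*)$"
)


def parse_log_line(line):
    """Return the fields of a librdkafka log line, or None if it doesn't match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_failures_by_broker(lines):
    """Count FAIL lines per broker, using the prefix of the message as the key."""
    counts = Counter()
    for line in lines:
        parsed = parse_log_line(line)
        if parsed and parsed["facility"] == "FAIL":
            # The message starts with the broker name, followed by ": ".
            broker = parsed["message"].split(": ", 1)[0]
            counts[broker] += 1
    return counts
```

In the sample line the `client` field is `rdkafka#producer-1`, so the timeout is reported by a producer instance rather than by the consumer itself.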

The health-check file is still being created, so the container is never reported as unhealthy and cannot be restarted automatically.
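
For context, a container liveness probe built on `--health-check-file` usually looks something like the sketch below. It assumes the file's modification time is refreshed only while the consumer is making progress. The `is_healthy` helper and the 60-second threshold are hypothetical choices, not part of Snuba.

```python
import os
import time


def is_healthy(path, max_age_seconds=60.0, now=None):
    """Return True if the health-check file exists and was touched recently."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # No file at all: the consumer never reported itself healthy.
        return False
    # A stale file means the consumer has stopped refreshing it.
    return (now - mtime) <= max_age_seconds
```

A probe script would exit non-zero when `is_healthy("/tmp/health.txt")` returns `False`. In the situation reported here the file keeps being created, so a probe like this would still pass, which is why the container is not restarted.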

Additional information

Some Snuba consumers do drop the connection (I see about 2-5 errors in the log) and connect to a working broker, while others don't. I haven't found out why it sometimes works and sometimes doesn't.

Status

Waiting for: Product Owner