Skip to content

Long consumer recovery time when non-group coordinator is downed. #2266

@andykernahan

Description

@andykernahan

Description

Consumers can take a long time to recover when a broker is taken down that is not the coordinator of their consumer group. In some cases, it has taken ten minutes for the consumer to recover.

Is this behaviour expected? If so, is there a configuration property which influences the timeout?

How to reproduce

  1. Create a cluster of three brokers (broker[3-5] below).
  2. Create a test_topic (replication-factor=3, partitions=1).
  3. Assign a consumer to test_topic [0] offset BEGINNING, ensuring the consumer's group coordinator is not the leader.
  4. Kill Kafka on the leader.

Given the above, we observe the following excerpted logs:

11:16:37.8310 consumer-3: created: config=bootstrap.servers=broker3:9092,broker4:9092,broker5:9092,client.id=consumer-ddb3537cc0814470819d21d13a93c061,debug=all,enable.auto.commit=False,enable.partition.eof=True,fetch.error.backoff.ms=50,fetch.wait.max.ms=50,group.id=b496e32e-0ccd-498c-bf50-d01a010d9767,socket.nagle.disable=True
11:16:37.8466 consumer-3.BRKDELGT: [thrd:main]: test_topic [0]: broker broker5.local:9092/5 is now leader for partition with 0 messages (0 bytes) queued
11:16:38.8324 consumer-3.CGRPCOORD: [thrd:main]: broker5:9092/bootstrap: Group "b496e32e-0ccd-498c-bf50-d01a010d9767" coordinator is broker3.local:9092 id 3

broker5 is downed, at which point we observe the disconnection, followed by repeated leader re-queries:

11:19:09.7174 consumer-3.FAIL: [thrd:broker5:9092/bootstrap]: broker5:9092/bootstrap: Disconnected (after 151886ms in state UP)
11:19:09.7174 consumer-3.CONNECT: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Connecting to ipv4#192.168.212.97:9092 (plaintext) with socket 1520
11:19:10.2345 consumer-3.QRYLEADER: [thrd:main]: Topic test_topic [0]: leader is down: re-query
11:19:10.2345 consumer-3.METADATA: [thrd:main]: Skipping metadata refresh of 1 topic(s): no usable brokers
11:19:11.2342 consumer-3.QRYLEADER: [thrd:main]: Topic test_topic [0]: leader is down: re-query
11:19:11.2342 consumer-3.METADATA: [thrd:main]: Skipping metadata refresh of 1 topic(s): no usable brokers
11:19:11.7498 consumer-3.BROKERFAIL: [thrd:broker5.local:9092/5]: broker5.local:9092/5: failed: err: Local: Broker transport failure: (errno: Unknown error)
11:19:11.7498 consumer-3.STATE: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Broker changed state CONNECT -> DOWN
11:19:11.7498 consumer-3: error_cb: broker5.local:9092/5: Connect to ipv4#192.168.212.97:9092 failed: No connection could be made because the target machine actively refused it... (after 2020ms in state CONNECT)
11:19:11.7498 consumer-3.BROADCAST: [thrd:broker5.local:9092/5]: Broadcasting state change
11:19:12.2350 consumer-3.QRYLEADER: [thrd:main]: Topic test_topic [0]: leader is down: re-query
11:19:12.2350 consumer-3.METADATA: [thrd:main]: Skipping metadata refresh of 1 topic(s): no usable brokers
11:19:12.7663 consumer-3.STATE: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Broker changed state DOWN -> INIT
11:19:12.7663 consumer-3.BROADCAST: [thrd:broker5.local:9092/5]: Broadcasting state change
11:19:12.7663 consumer-3.STATE: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Broker changed state INIT -> TRY_CONNECT
11:19:12.7663 consumer-3.BROADCAST: [thrd:broker5.local:9092/5]: Broadcasting state change
11:19:12.7663 consumer-3.CONNECT: [thrd:broker5.local:9092/5]: broker5.local:9092/5: broker in state TRY_CONNECT connecting
11:19:12.7663 consumer-3.STATE: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Broker changed state TRY_CONNECT -> CONNECT
11:19:12.7663 consumer-3.BROADCAST: [thrd:broker5.local:9092/5]: Broadcasting state change
11:19:12.7663 consumer-3.CONNECT: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Connecting to ipv4#192.168.212.97:9092 (plaintext) with socket 2576
11:19:13.2361 consumer-3.QRYLEADER: [thrd:main]: Topic test_topic [0]: leader is down: re-query
11:19:13.2361 consumer-3.METADATA: [thrd:main]: Skipping metadata refresh of 1 topic(s): no usable brokers
11:19:14.2370 consumer-3.QRYLEADER: [thrd:main]: Topic test_topic [0]: leader is down: re-query
11:19:14.2370 consumer-3.METADATA: [thrd:main]: Skipping metadata refresh of 1 topic(s): no usable brokers
11:19:14.7682 consumer-3.BROKERFAIL: [thrd:broker5.local:9092/5]: broker5.local:9092/5: failed: err: Local: Broker transport failure: (errno: Unknown error)
11:19:14.7682 consumer-3.FAIL: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Connect to ipv4#192.168.212.97:9092 failed: No connection could be made because the target machine actively refused it... (after 2001ms in state CONNECT)
11:19:14.7682 consumer-3.STATE: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Broker changed state CONNECT -> DOWN
11:19:14.7682 consumer-3.BROADCAST: [thrd:broker5.local:9092/5]: Broadcasting state change
11:19:15.2514 consumer-3.QRYLEADER: [thrd:main]: Topic test_topic [0]: leader is down: re-query
11:19:15.2514 consumer-3.METADATA: [thrd:main]: Skipping metadata refresh of 1 topic(s): no usable brokers
11:19:15.7784 consumer-3.STATE: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Broker changed state DOWN -> INIT
11:19:15.7784 consumer-3.BROADCAST: [thrd:broker5.local:9092/5]: Broadcasting state change
11:19:15.7784 consumer-3.STATE: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Broker changed state INIT -> TRY_CONNECT
11:19:15.7784 consumer-3.BROADCAST: [thrd:broker5.local:9092/5]: Broadcasting state change
11:19:15.7784 consumer-3.CONNECT: [thrd:broker5.local:9092/5]: broker5.local:9092/5: broker in state TRY_CONNECT connecting
11:19:15.7784 consumer-3.STATE: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Broker changed state TRY_CONNECT -> CONNECT
11:19:15.7784 consumer-3.BROADCAST: [thrd:broker5.local:9092/5]: Broadcasting state change
11:19:15.7789 consumer-3.CONNECT: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Connecting to ipv4#192.168.212.97:9092 (plaintext) with socket 2312
11:19:16.2526 consumer-3.QRYLEADER: [thrd:main]: Topic test_topic [0]: leader is down: re-query
11:19:16.2526 consumer-3.METADATA: [thrd:main]: Skipping metadata refresh of 1 topic(s): no usable brokers
11:19:17.2538 consumer-3.QRYLEADER: [thrd:main]: Topic test_topic [0]: leader is down: re-query
11:19:17.2538 consumer-3.METADATA: [thrd:main]: Skipping metadata refresh of 1 topic(s): no usable brokers
11:19:17.7861 consumer-3.BROKERFAIL: [thrd:broker5.local:9092/5]: broker5.local:9092/5: failed: err: Local: Broker transport failure: (errno: Unknown error)
11:19:17.7861 consumer-3.FAIL: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Connect to ipv4#192.168.212.97:9092 failed: No connection could be made because the target machine actively refused it... (after 2007ms in state CONNECT)
11:19:17.7861 consumer-3.STATE: [thrd:broker5.local:9092/5]: broker5.local:9092/5: Broker changed state CONNECT -> DOWN

Approximately seven minutes later, the disconnection is reported by the group coordinator, at which point the consumer recovers:

11:26:39.1003 DEBUG consumer-3.FAIL: [thrd:GroupCoordinator]: GroupCoordinator: Disconnected (after 600268ms in state UP)
11:26:39.5859 DEBUG consumer-3.BRKDELGT: [thrd:main]: test_topic [0]: broker broker5.local:9092/5 no longer leader
11:26:39.5859 DEBUG consumer-3.BRKDELGT: [thrd:main]: test_topic [0]: broker broker4.local:9092/4 is now leader for partition with 0 messages (0 bytes) queued

Checklist

Please provide the following information:

  • librdkafka version:
librdkafka.redist 1.0.0-RC9 (Confluent.Kafka 1.0.0-RC1)
  • Apache Kafka version:
2.1.1-cp1 (confluent-community-5.1.2-2.11)
  • librdkafka client configuration:
bootstrap.servers=broker3:9092,broker4:9092,broker5:9092,
client.id=consumer-ddb3537cc0814470819d21d13a93c061
debug=all
enable.auto.commit=False
enable.partition.eof=True
fetch.error.backoff.ms=50
fetch.wait.max.ms=50
group.id=b496e32e-0ccd-498c-bf50-d01a010d9767
socket.nagle.disable=True
  • Operating system:
CentOS Linux 7 (Core)
  • Provide logs (with debug=.. as necessary) from librdkafka

consumer.log

  • Provide broker log excerpts
  • Critical issue

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions