
Receive: Endless loop of retried replication with capnproto and distributors #8254

@TheReal1604

Description


Thanos, Prometheus and Golang version used: 0.38.0

bitnami/thanos:0.38.0-debian-12-r3

Object Storage Provider:
Openstack-s3

What happened:
We use a thanos setup with 3-5 receivers and dedicated thanos-receive routing instances, which use capnproto as a replication protocol. The replication_factor is set to 3.

Currently we only have a static hashring configuration.

data:
  hashrings.json: |-
    [
      {
        "endpoints": [
          "thanos-receive-0.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-3.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-4.thanos-receive-headless.customer1.svc.cluster.local:19391"
        ]
      }
    ]
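For reference, a minimal sketch of how the replication protocol is typically selected on the router and receiver side. The flag names below are an assumption based on the Thanos 0.38 receive documentation (the Bitnami chart may render them differently), and the Cap'n Proto listener port matches the `:19391` endpoints in the hashring above:

```shell
# Router / distributor side (assumed flags): replicate via Cap'n Proto instead of gRPC
thanos receive \
  --receive.hashrings-file=/etc/thanos/hashrings.json \
  --receive.replication-factor=3 \
  --receive.replication-protocol=capnproto

# Receiver side (assumed flag): expose the Cap'n Proto replication listener
thanos receive \
  --receive.capnproto-address=0.0.0.0:19391
```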

Unfortunately, we can trigger what looks like an endless replication retry loop when the receive instances restart in a chaotic order (in our case triggered by Kubernetes node rollovers in our custom clusters).

When the issue occurs, the distributor pods repeatedly log errors like the following:

Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: 2 errors: forwarding request to endpoint {thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391 thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391 }: failed writing to peer: pkg/receive/writecapnp/write_request.capnp:Writer.write: rpc: bootstrap: send message: context deadline exceeded; forwarding request to endpoint {thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391 thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391 }: failed writing to peer: pkg/receive/writecapnp/write_request.capnp:Writer.write: rpc: bootstrap: send message: context deadline exceeded

The receive pods restart at the same time, but the distributors never recover from this error state. Ultimately, I have to kill the distributor pods for replication to work correctly again. The receive pods themselves keep working fine and do not need a restart.

The interesting thing: after reverting to gRPC with protobuf, we could not reproduce the issue, so it appears to be specific to the Cap'n Proto implementation.

What you expected to happen:
The distributors should recover successfully from the above error state when receive pods and router/distributor pods are restarted.

How to reproduce it (as minimally and precisely as possible):

  1. Create a Thanos setup (3-5 receive pods, 2 distributor/router pods)
  2. Set up replication with replication factor 3 and Cap'n Proto as the replication protocol
  3. Set up a static hashrings.json configmap
  4. Send test data to the distributor pods
  5. Restart the distributor and receive pods in a chaotic order
  6. The distributors get stuck in the error loop described above

Anything else we need to know:
  • We are using the Bitnami Thanos Helm chart.
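To make step 5 reproducible, the "chaotic" restarts can be approximated with a small chaos script that deletes a random subset of pods in several rounds. The namespace (`customer1`) matches the hashring endpoints above, but the label selector is an assumption about how the chart labels the pods, so adjust it to your deployment:

```shell
#!/usr/bin/env bash
# Hypothetical chaos sketch: kill receive and distributor pods in random order,
# in overlapping rounds, to mimic a Kubernetes node rollover.
# The label values are assumptions; adjust to match your Helm release.
set -euo pipefail

NAMESPACE=customer1

for round in 1 2 3 4 5; do
  # Pick two random pods across receivers and distributors and delete them.
  kubectl -n "$NAMESPACE" get pods \
    -l 'app.kubernetes.io/name in (thanos-receive, thanos-receive-distributor)' \
    -o name | shuf | head -n 2 | xargs -r kubectl -n "$NAMESPACE" delete

  # Random pause so restarts overlap unpredictably.
  sleep $((RANDOM % 20))
done
```

After a few rounds, watch the distributor logs for the `failed writing to peer ... context deadline exceeded` errors shown above; with protobuf replication the same procedure did not reproduce the loop for us.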
