Receive: Endless loop of retried replication with capnproto and distributors #8254
Description
Thanos, Prometheus and Golang version used: 0.38.0
bitnami/thanos:0.38.0-debian-12-r3
Object Storage Provider:
Openstack-s3
What happened:
We use a Thanos setup with 3-5 receivers and dedicated thanos-receive routing instances, which use capnproto as the replication protocol. The replication_factor is set to 3.
Currently we only have a static hashring configuration:
```yaml
data:
  hashrings.json: |-
    [
      {
        "endpoints": [
          "thanos-receive-0.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-3.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-4.thanos-receive-headless.customer1.svc.cluster.local:19391"
        ]
      }
    ]
```
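For context, a sketch of the relevant flags on the routing instances. Flag names are as we understand them from the Thanos receive documentation; the paths and ports are illustrative, not our exact values:

```shell
# Router/distributor instance (sketch; verify flag names against your Thanos version)
thanos receive \
  --receive.hashrings-file=/etc/thanos/hashrings.json \
  --receive.replication-factor=3 \
  --receive.replication-protocol=capnproto \
  --remote-write.address=0.0.0.0:19291
```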
Unfortunately, we can trigger something like an endless replication retry loop if the receive instances restart in a chaotic way (in our case triggered by k8s node rollovers in our custom clusters).
Once the problem starts, the distributor pods log the following error very frequently:
```
Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: 2 errors: forwarding request to endpoint {thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391 thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391 }: failed writing to peer: pkg/receive/writecapnp/write_request.capnp:Writer.write: rpc: bootstrap: send message: context deadline exceeded; forwarding request to endpoint {thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391 thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391 }: failed writing to peer: pkg/receive/writecapnp/write_request.capnp:Writer.write: rpc: bootstrap: send message: context deadline exceeded
```
The receive pods restart at the same time, but the distributor cannot recover from that error state. I ultimately have to kill the distributor pods for replication to work correctly again. The normal receive pods keep working fine (they don't need a restart or anything).
The interesting part: we reverted back to gRPC with protobuf and couldn't reproduce the issue, so it seems to be specific to the capnproto implementation.
What you expected to happen:
Recover successfully from the above error when receive pods and router/distributor pods are restarted.
How to reproduce it (as minimally and precisely as possible):
- Create a Thanos setup (3-5 receive pods, 2 distributor/router pods)
- Set up replication with replication factor 3 and use capnproto as the replication protocol
- Set up a fixed hashrings.json configmap
- Send test data to the distributor pods
- Restart the distributor and receive pods in a chaotic way
- -> The distributors get stuck in the error loop described above
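The "chaotic restart" step can be sketched with kubectl; the namespace and label selectors below are assumptions based on our setup, not verified values:

```shell
# Delete receive and router pods at the same time instead of rolling them,
# which mimics a k8s node rollover (namespace/labels are assumptions):
kubectl -n customer1 delete pod -l app.kubernetes.io/component=receive --wait=false &
kubectl -n customer1 delete pod -l app.kubernetes.io/component=receive-distributor --wait=false &
wait

# Then watch the router logs for the retry loop:
kubectl -n customer1 logs -f -l app.kubernetes.io/component=receive-distributor \
  | grep "failed writing to peer"
```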
Anything else we need to know:
- We are using the Bitnami Thanos Helm chart.