Receive: Endless loop of retried replication with capnproto and distributors #8254
Description
Thanos, Prometheus and Golang version used: 0.38.0
bitnami/thanos:0.38.0-debian-12-r3
Object Storage Provider:
Openstack-s3
What happened:
We use a Thanos setup with 3-5 receivers and dedicated thanos-receive routing instances, which use capnproto as the replication protocol. The replication_factor is set to 3.
Currently we only have a static hashring configuration:
```yaml
data:
  hashrings.json: |-
    [
      {
        "endpoints": [
          "thanos-receive-0.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-3.thanos-receive-headless.customer1.svc.cluster.local:19391",
          "thanos-receive-4.thanos-receive-headless.customer1.svc.cluster.local:19391"
        ]
      }
    ]
```
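For context, a sketch of the relevant flags on the routing instances. Flag names are as we understand them from the Thanos receive documentation; the paths and ports are illustrative, not our exact values:

```shell
# Router/distributor instance (sketch; verify flag names against your Thanos version)
thanos receive \
  --receive.hashrings-file=/etc/thanos/hashrings.json \
  --receive.replication-factor=3 \
  --receive.replication-protocol=capnproto \
  --remote-write.address=0.0.0.0:19291
```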
Unfortunately, we can trigger something like an endless replication retry loop if the receive instances restart in a chaotic way (in our case triggered by k8s node rollovers in our custom clusters).
Once the problem starts, the distributor pods log the following error very frequently:
```
Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: 2 errors: forwarding request to endpoint {thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391 thanos-receive-1.thanos-receive-headless.customer1.svc.cluster.local:19391 }: failed writing to peer: pkg/receive/writecapnp/write_request.capnp:Writer.write: rpc: bootstrap: send message: context deadline exceeded; forwarding request to endpoint {thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391 thanos-receive-2.thanos-receive-headless.customer1.svc.cluster.local:19391 }: failed writing to peer: pkg/receive/writecapnp/write_request.capnp:Writer.write: rpc: bootstrap: send message: context deadline exceeded
```
The receive pods restart at the same time, but the distributor cannot recover from that error state. I ultimately have to kill the distributor pods for replication to work correctly again. The normal receive pods keep working fine (they don't need a restart or anything).
The interesting part: we reverted back to gRPC with protobuf and couldn't reproduce the issue, so it seems to be specific to the capnproto implementation.
What you expected to happen:
Recover successfully from the above error when receive pods and router/distributor pods are restarted.
How to reproduce it (as minimally and precisely as possible):
- Create a Thanos setup (3-5 receive pods, 2 distributor/router pods)
- Set up replication with replication factor 3 and use capnproto as the replication protocol
- Set up a fixed hashrings.json configmap
- Send test data to the distributor pods
- Restart the distributor and receive pods in a chaotic way
- -> The distributors get stuck in the error loop described above
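The "chaotic restart" step can be sketched with kubectl; the namespace and label selectors below are assumptions based on our setup, not verified values:

```shell
# Delete receive and router pods at the same time instead of rolling them,
# which mimics a k8s node rollover (namespace/labels are assumptions):
kubectl -n customer1 delete pod -l app.kubernetes.io/component=receive --wait=false &
kubectl -n customer1 delete pod -l app.kubernetes.io/component=receive-distributor --wait=false &
wait

# Then watch the router logs for the retry loop:
kubectl -n customer1 logs -f -l app.kubernetes.io/component=receive-distributor \
  | grep "failed writing to peer"
```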
Anything else we need to know:
- We are using the Bitnami Thanos Helm chart.