Commit 1c44041
Make Thanos Query wait for initial endpoint discovery before becoming ready
Problem:
We observed a race condition where Thanos Query components were marking themselves as ready before discovering any endpoints. This created a timing gap that could lead to query failures:
- Query pods become ready immediately upon startup
- Endpoint discovery happens asynchronously in the background
- Queries arriving between readiness and endpoint discovery fail
Solution:
This commit modifies the Thanos Query readiness behavior to wait for the initial endpoint discovery to complete before marking the pod as ready. This ensures that when a Query pod reports ready, it has already attempted to discover and connect to available endpoints.
Changes:
1. Added synchronization to EndpointSet:
- Added firstUpdateOnce flag and firstUpdateChan channel to track first update completion
- Added WaitForFirstUpdate() method to block until initial discovery completes
2. Modified Query startup sequence:
- gRPC server now waits for WaitForFirstUpdate() before calling statusProber.Ready()
- Leverages existing runutil.Repeat behavior which runs the update function immediately
3. Timeout protection:
- Uses the store response timeout as the wait timeout, falling back to a 30-second default
- Logs a warning if the timeout occurs but still proceeds to the ready state
4. Added comprehensive tests for the new WaitForFirstUpdate functionality
Impact:
- Positive: Eliminates the race condition where queries could be routed to Query pods that haven't discovered any endpoints yet
- Negative: Slightly increases startup time as pods won't be ready until endpoint discovery completes (typically <1s in normal conditions)
Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
1 parent 8b738c5 commit 1c44041
4 files changed, +309 -40 lines changed:
- cmd/thanos
- pkg/query