Commit 1c44041

Make Thanos Query wait for initial endpoint discovery before becoming ready
Problem:
We observed a race condition where Thanos Query components were marking
themselves as ready before discovering any endpoints. This created a timing
gap that could lead to query failures:
- Query pods become ready immediately upon startup
- Endpoint discovery happens asynchronously in the background
- Queries arriving between readiness and endpoint discovery fail

Solution:
This commit modifies the Thanos Query readiness behavior to wait for the
initial endpoint discovery to complete before marking the pod as ready. This
ensures that when a Query pod reports ready, it has already attempted to
discover and connect to available endpoints.

Changes:
1. Added synchronization to EndpointSet:
   - Added firstUpdateOnce flag and firstUpdateChan channel to track first
     update completion
   - Added WaitForFirstUpdate() method to block until initial discovery
     completes
2. Modified Query startup sequence:
   - gRPC server now waits for WaitForFirstUpdate() before calling
     statusProber.Ready()
   - Leverages existing runutil.Repeat behavior, which runs the update
     function immediately
3. Timeout protection:
   - Uses store response timeout or 30 seconds as default timeout
   - Logs warning if timeout occurs but still proceeds to ready state
4. Added comprehensive tests for the new WaitForFirstUpdate functionality

Impact:
- Positive: Eliminates the race condition where queries could be routed to
  Query pods that haven't discovered any endpoints yet
- Negative: Slightly increases startup time as pods won't be ready until
  endpoint discovery completes (typically <1s in normal conditions)

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
1 parent 8b738c5 commit 1c44041

4 files changed: 309 additions, 40 deletions
