Skip to content

Fixed race condition causing queries to fail right after querier startup with the "empty ring" error #4068

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 14, 2021

Conversation

pracucci
Copy link
Contributor

@pracucci pracucci commented Apr 9, 2021

What this PR does:
We just experienced a query failing with "empty ring" right after the querier has started. This is caused by the fact that the querier worker may connect to the query-frontend (or query-scheduler) and get a job before the ring client has initialised its state (which is when the WatchKey() callback is called for the first time).

In this PR I'm proposing to fix it introducing a starting() function to the Ring client service and read the initial read state while in the starting phase. Due to how dependencies are started and given the ring is a dependency of the querier, the ring is started before the querier worker service and this guarantees that, when the querier worker will be started, the ring has already initialised.

I've also made a change to fast fail the ring if an error occurs while fetching the initial ring state (ring key not existing is still OK). I think having a Cortex service fast failing if unable to talk to the ring backend should be fine (and probably a nice to have property), but if you foresee any potential issue I can change it.

Which issue(s) this PR fixes:
N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

return true
})
return nil
}

func (r *Ring) updateRingState(ringDesc *Desc) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function just contains what previously was in the WatchKey() callback. No logic changes.

Copy link
Contributor

@stevesg stevesg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I was considering if there is any impact on the other pkg/ring users (as they might take slightly longer to start), but I would assume that this is better behavior for all of them too.

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

…tup with the "empty ring" error

Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci force-pushed the fix-initial-ring-state-while-starting branch from 3faf633 to 93f7dc1 Compare April 13, 2021 16:08
Signed-off-by: Marco Pracucci <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants