reduce FSM backpressure from blocked evals queue #27184
Conversation
The coarse-grained lock on the blocked evals queue can cause backpressure on the FSM when a large number of evals are being unblocked and a large number of scheduler goroutines are contending for that lock. The `watchCapacity` goroutine in the blocked evals queue has a large buffered channel for unblock operations, intended to avoid this backpressure, but it takes the same lock that's used by the unblock methods called from the FSM. Meanwhile, `Eval.Reblock` RPCs arriving from scheduler workers attempt to take this same lock, and we end up with a backlog waiting on this mutex.

This PR moves all the operations for the blocked evals queue onto a single goroutine that receives work from a large buffered channel. The `Eval.Reblock` RPCs and the `Unblock` methods called from the FSM push work onto this channel and immediately return. This prevents them from blocking, except during leader transitions when we flush the blocked evals queue; at that point we should not be making `Unblock` calls from the FSM anyway.

This also allows us to move stats tracking into that one goroutine, so we no longer need to copy the stats on each update. This significantly reduces memory allocation and GC pressure.

Ref: https://hashicorp.atlassian.net/browse/NMD-1045
Ref: #27184 (comment) (copy of relevant sections of the internal investigation doc)
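As a rough illustration of the new design, here is a minimal sketch of the single-owner-goroutine pattern described above. All names (`workItem`, `New`, the buffer size) are illustrative assumptions, not Nomad's actual implementation:

```go
package blockedevals

// workItem is a hypothetical envelope for one queue operation.
type workItem struct {
	kind    string // e.g. "unblock" or "reblock"
	payload any
}

// BlockedEvals owns all queue state from a single goroutine, so no
// mutex is shared between the FSM, RPC handlers, and the worker.
type BlockedEvals struct {
	workCh chan workItem
	stopCh chan struct{}
}

func New() *BlockedEvals {
	b := &BlockedEvals{
		// A large buffer absorbs bursts so producers rarely block.
		workCh: make(chan workItem, 8192),
		stopCh: make(chan struct{}),
	}
	go b.run()
	return b
}

// Unblock is called from the FSM apply path: it only enqueues work and
// returns immediately, so Raft apply never waits on queue processing.
func (b *BlockedEvals) Unblock(payload any) {
	b.workCh <- workItem{kind: "unblock", payload: payload}
}

// run serializes every operation, including stats updates, so stats can
// be mutated in place instead of copied under a lock.
func (b *BlockedEvals) run() {
	for {
		select {
		case item := <-b.workCh:
			_ = item // apply the operation to the queue state
		case <-b.stopCh:
			return
		}
	}
}
```

The key property is that producers touch only the channel; every read-modify-write of queue state happens on one goroutine.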
Investigation

This section is taken from an internal document about a customer investigation, with identifying information removed. It is presented here so that community members have access to this information.

A trace profile we received from a customer shows long stretches of 100% utilization of a single proc (aka "P", a logical processor in the Go runtime, which typically maps 1:1 to an "M", or kernel thread). This proc is running the FSM goroutine. This same proc will shift to long stretches of …
The synchronization blocking profile from this trace shows three primary areas of blocking: acquiring a mutex in the blocked evals queue, …
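For readers who want to reproduce this kind of profile: Go's standard `runtime` and `net/http/pprof` packages can produce the synchronization blocking and mutex-contention profiles referenced here. A generic sketch (not Nomad's agent tooling; the address and sampling rates are assumptions):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	// Sample every blocking event and every contended mutex; in
	// production you would use coarser rates to limit overhead.
	runtime.SetBlockProfileRate(1)
	runtime.SetMutexProfileFraction(1)

	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

Fetch the profiles with `go tool pprof http://localhost:6060/debug/pprof/block` and `go tool pprof http://localhost:6060/debug/pprof/mutex`.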
When node status updates arrive or allocations become terminal (failed or complete), the client sends one of several RPCs with the update. All of these RPCs end up writing to Raft, which passes through the FSM on the leader. Because we want to unblock evaluations that were waiting on those resources, we call one of several `Unblock` methods on the blocked evals queue from the FSM. Those methods take a short lock on the blocked evals queue but then write to a large buffered channel that wakes up the `watchCapacity` goroutine. The `watchCapacity` goroutine takes that same lock to process the unblock.
Meanwhile, when schedulers can't find room for an allocation, they write a blocked eval to Raft. If the evaluation they've processed was previously a blocked eval, they instead call the `Eval.Reblock` RPC, which takes the same lock on the blocked evals queue. The trace shows that the …

While we're waiting on the unblock method called by the FSM, if a second node or allocation update RPC wants us to call another unblock method, it has to queue on the same mutex. We'd expect that goroutine to wait roughly 144ms, which is quite a while; but by the time the next RPC gets to the FSM, another … (The locking shape that produces this queueing is sketched below.)
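To make the contention concrete, here is a simplified sketch of the pre-change locking shape. The names and structure are illustrative assumptions about the old code path, not the exact Nomad source: the FSM's unblock calls, the `watchCapacity` loop, and the `Eval.Reblock` handler all serialize on one mutex.

```go
package blockedevals

import "sync"

// Pre-change shape: one mutex guards all queue state.
type BlockedEvals struct {
	l          sync.Mutex
	capacityCh chan string // large buffer, but state is still guarded by l
}

// Called from the FSM on node/alloc updates: a short critical section,
// then a channel send to wake watchCapacity.
func (b *BlockedEvals) UnblockNode(nodeID string) {
	b.l.Lock()
	defer b.l.Unlock() // the FSM stalls here whenever anyone holds l
	b.capacityCh <- nodeID
}

// watchCapacity drains the channel but re-acquires the same mutex to
// touch queue state, so it competes with the FSM and RPC handlers.
func (b *BlockedEvals) watchCapacity() {
	for nodeID := range b.capacityCh {
		b.l.Lock()
		_ = nodeID // unblock matching evals, copy stats, etc.
		b.l.Unlock()
	}
}

// Eval.Reblock RPCs from scheduler workers contend on the same mutex.
func (b *BlockedEvals) Reblock(evalID string) {
	b.l.Lock()
	defer b.l.Unlock()
	_ = evalID // re-insert the eval into the blocked queue
}
```

With three classes of caller funneling through `l`, any slow pass through `watchCapacity` backs up both the FSM apply loop and the scheduler RPCs.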
To find evidence of this behavior, we need to revisit the trace profile. We expect to see a very short window for an RPC like …
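The execution trace itself (the per-proc timeline discussed above) can be captured with the standard `runtime/trace` package; a generic sketch, with the capture window and file name as assumptions:

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	time.Sleep(10 * time.Second) // capture a representative window
	trace.Stop()
	// Inspect with: go tool trace trace.out
	// The per-P timelines show which goroutines occupied each proc and
	// for how long, including long 100%-utilization stretches.
}
```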
pkazmierczak left a comment
LGTM. I've read this carefully and we also discussed it offline. I understand the motivations and the reasoning is sound to me. Following this code is tough, but the write-up and benchmarks convince me.
Amazing work!
Co-authored-by: Piotr Kazmierczak <[email protected]>
Testing & Reproduction steps
See the comments below.