
Conversation


@tgross tgross commented Dec 2, 2025

The coarse-grained lock on the blocked evals queue can cause backpressure on the FSM when a large number of evals are getting unblocked and there's contention on this lock from a large number of scheduler goroutines. The `watchCapacity` goroutine in the blocked evals queue has a large buffered channel for unblock operations intended to avoid this backpressure, but it takes the same lock that's used by the unblock methods called from the FSM. Meanwhile, `Eval.Reblock` RPCs arriving from scheduler workers attempt to take this same lock, and we end up with a backlog waiting on this mutex.

This PR moves all the operations for the blocked evals queue onto a single goroutine that receives work from a large buffered channel. The `Eval.Reblock` RPCs and the `Unblock` methods called from the FSM push work onto this channel and immediately return. This prevents them from blocking except during leader transitions, where we flush the blocked evals queue and should not be making `Unblock` method calls from the FSM anyway.

This also allows us to move stats tracking into that one goroutine, so we no longer need to copy the stats on each update. This reduces memory allocation and GC pressure significantly.
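
As a rough illustration of the pattern (the type, field, and method names below are simplified placeholders, not the actual implementation in this PR): callers push work onto a large buffered channel and return immediately, and the single goroutine that drains the channel owns both the blocked-evals state and the stats, so no locking or copying is required.

```go
// Minimal sketch of the single-owner-goroutine pattern; names are illustrative.
package blockedevals

type unblockRequest struct {
	node string
}

type queue struct {
	workCh chan unblockRequest
}

func newQueue() *queue {
	// Large buffer so producers (the FSM, Eval.Reblock RPC handlers) rarely block.
	q := &queue{workCh: make(chan unblockRequest, 8096)}
	go q.run()
	return q
}

// Unblock only enqueues work and returns, so a caller in the FSM never waits
// on a mutex held by the worker goroutine.
func (q *queue) Unblock(node string) {
	q.workCh <- unblockRequest{node: node}
}

// run is the single goroutine that owns all blocked-evals state and stats,
// so updating them requires no mutex and no copying.
func (q *queue) run() {
	blocked := make(map[string][]string) // node ID -> blocked eval IDs
	totalBlocked := 0

	for req := range q.workCh {
		evals := blocked[req.node]
		delete(blocked, req.node)
		totalBlocked -= len(evals)
		// ...push the now-unblocked evals into the eval broker here...
	}
}
```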

Ref: https://hashicorp.atlassian.net/browse/NMD-1045
Ref: #27184 (comment) (copy of relevant sections of internal investigation doc)

Testing & Reproduction steps

See the comments below.

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad product documentation, which is stored in the
    web-unified-docs repo. Refer to the web-unified-docs contributor guide for docs guidelines.
    Please also consider whether the change requires notes within the upgrade guide. If you would like help with the docs, tag the nomad-docs team in this PR.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@tgross tgross force-pushed the b-blocked-evals-coarse-locking branch from 06e72ac to 3ec0990 Compare December 2, 2025 21:37
@tgross tgross added theme/raft type/bug backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/ent/1.10.x+ent backport to 1.10.x+ent release line backport/1.11.x backport to 1.11.x release line theme/scheduling labels Dec 2, 2025

tgross commented Dec 3, 2025

Investigation

This section is taken from an internal document about a customer investigation, with identifying information removed. It is presented here so that community members have access to this information.

A trace profile we received from a customer shows long stretches of 100% utilization of a single proc (aka "P", a logical processor in the Go runtime, which typically maps 1:1 to an "M", or kernel thread). This proc is running the (*BlockedEvals).processBlock method for many different goroutines. Because this method takes a lock on a mutex at the start, this pattern indicates lock contention on that mutex. The Go runtime moves goroutines waiting on the same mutex onto the same proc as an optimization, which is why we see the goroutines piling up on the same proc.

trace01

This same proc will then shift to long stretches of running (*BlockedEvals).watchCapacity.

trace02

The synchronization blocking profile from this trace shows three primary areas of blocking: acquiring the mutex in the (*BlockedEvals).processBlock method, and awaiting Raft for the Node.UpdateAlloc and Job.Register RPCs.

blocked-zoomed-in01
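
For community readers who want to reproduce this kind of analysis: the screenshots here come from a Go execution trace, whose viewer includes this synchronization blocking view. A generic way to capture such a trace from a Go program (not necessarily how this particular one was collected) is the standard runtime/trace package, followed by `go tool trace trace.out` to open the viewer.

```go
// Generic sketch: record a fixed window of execution trace data to a file.
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	// Let the workload run for the window of interest, then stop recording.
	time.Sleep(30 * time.Second)
	trace.Stop()
}
```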

When a node's status updates or its allocations become terminal (failed or complete), the client sends one of several RPCs with the update. All of these RPCs end up writing to Raft, which passes through the FSM on the leader. Because we want to unblock evaluations that were waiting on those resources, we call one of several (*BlockedEvals).Unblock methods to unblock on the specific node, node class, or quotas. For example: fsm.go#L1010-L1011.

Those methods take a short lock on the blocked evals queue but then write to a large buffered channel that wakes up the (*BlockedEvals).watchCapacity goroutine (ref blocked_evals.go#L521-L536). That channel was designed with plenty of capacity (8096) to avoid backing up the FSM.

The watchCapacity goroutine takes the lock on the blocked evals queue and, because it has a large set of evaluations to unblock, ends up holding that lock for up to 15 seconds in some trace segments.

flow01
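
Putting the pieces above together, the pre-fix shape looks roughly like the sketch below (names simplified, not the actual code): the buffered channel decouples the FSM from watchCapacity, but producers and consumer still serialize on the same mutex.

```go
// Rough sketch of the pre-fix locking shape (simplified names): a buffered
// channel wakes watchCapacity, but every path still funnels through b.l.
package blockedevals

import "sync"

type blockedEvals struct {
	l        sync.Mutex
	capacity chan string // buffered with plenty of headroom
}

// Called from the FSM: the channel send is cheap, but acquiring the mutex is
// not when watchCapacity is in the middle of a long unblock pass.
func (b *blockedEvals) unblock(node string) {
	b.l.Lock()
	// ...record that this node now has free capacity...
	b.l.Unlock()
	b.capacity <- node
}

func (b *blockedEvals) watchCapacity() {
	for node := range b.capacity {
		_ = node
		b.l.Lock()
		// ...scan every blocked eval waiting on capacity and re-enqueue the
		// ones that now fit, potentially holding the lock for seconds...
		b.l.Unlock()
	}
}
```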

Meanwhile, when schedulers can't find room for an allocation, they write a blocked eval to Raft. If the evaluation they've processed was previously a blocked eval, they instead call the Eval.Reblock RPC. This RPC calls (*BlockedEvals).Reblock, which in turn takes a lock on the blocked evals queue inside processBlock (ref blocked_evals.go#L165). There is a code path in Reblock that pushes evals back into the eval broker, but it doesn't appear to be hit frequently in the trace segments we've seen. The Reblock method call is fairly slow in the trace, taking ~1ms. Note that Reblock does not write to Raft, so it doesn't have to be serialized through the FSM. While scheduler workers are waiting on Eval.Reblock they can't do anything else, so this also slows scheduling throughput.

The trace shows that the (*BlockedEvals).Reblock method is on-CPU for nearly the entire trace. As soon as the method exits, a new goroutine takes over calling the same method, which implies that we always have at least one goroutine waiting to immediately take the lock. In fact, we have roughly 144 goroutines (assuming all servers have num_schedulers = 64): because the cluster has no room, only a few jobs are getting placed at a time and the overwhelming remainder are reblocked.

While we're waiting on the unblock method called by watchCapacity, we have more opportunity for Eval.Reblock RPCs to arrive, guaranteeing that we have a pile-up of 144 of them waiting to grab the lock. (Reblock also takes a lock on the eval broker when it has to re-enqueue evals, which may contribute to FSM backpressure, but we don't see any evidence of that here.)

flow02

If a second node or allocation update RPC wants us to call Unblock, it's now waiting in line for the lock behind all 144 scheduler goroutines making Reblock RPC calls, plus the lock held by the watchCapacity goroutine. But by the time we've hit this lock contention we're already in the FSM. While that FSM goroutine is waiting for its turn, it's blocking all other goroutines from writing to Raft, including Job.Register RPCs.

We'd expect that goroutine to wait roughly 144ms (144 queued Reblock calls at ~1ms each), which is quite a while. But by the time the next RPC reaches the FSM, another Eval.Reblock RPC may have taken the lock, and as soon as one of the in-flight RPCs is one that calls a (*BlockedEvals).Unblock method it's blocking all writes again.

flow03

To find evidence of this behavior, we need to revisit the trace profile. We expect to see a very short window for an RPC like Node.UpdateAlloc on one proc shortly after the watchCapacity goroutine starts its long stretch of control on another proc. The trace does show this evidence repeatedly, although you need to zoom way in (note that in the screenshot below we've moved the scale from 200 microseconds to 50 microseconds per tick mark). For example, this event shows the (*Node).UpdateAlloc method is on-CPU for only 10000ns. At this point it's blocked on the FSM and we don't see it exit the function; the outgoing flow for that goroutine appears much later.

trace03

@tgross tgross added this to the 1.11.x milestone Dec 3, 2025

tgross commented Dec 3, 2025

Benchmarking

To test this fix, I deployed a cluster with 3 server nodes and 25 simulated nodes. The servers are AWS EC2 m5.4xlarge instances with 16 cores each. Each simulated node can allocate 5000 compute and 5000 MiB RAM, and each simulated node has a metadata field "rack".

The "before" build is a custom build forked from ENT/main of c66209d2613465acbb3c017dd59d2268c41b4f4a, with a patch that allows unbounded num_schedulers to emphasize the concurrency problem. The server configuration was updated to 64 scheduler goroutines. The "after" build is this PR, with the same patch allowing unbounded num_schedulers. The job GC and batch eval GC thresholds were set to 1h to avoid noise from GC.

I registered 5 parameterized dispatch jobs, each with group.count=4. Each group has 3 tasks, each with 100 cpu/mem, for a total of 300 cpu/mem per alloc. The cluster can support up to roughly 400 of these allocs at a time (enough to run 100 concurrent dispatches). The jobspecs have a constraint so that each job falls on one of the five "racks", to ensure we're exercising a couple of different unblock code paths. Jobs run for 5s.

For each test I dispatched 2000 copies of each job with 200 parallel API clients, for a total of 10,000 dispatches. This leaves 99% of the initial dispatches unable to find capacity until running jobs have completed and unblocked the evals. I took CPU, heap, and trace profiles at several points along the way and collected metrics.
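
One way to drive this kind of load (the Nomad address, job IDs, and request body below are placeholders, not the actual harness used here) is a small Go program that fans dispatch requests out over a worker pool against Nomad's job dispatch HTTP endpoint:

```go
// Rough load-driver sketch: 200 workers POSTing dispatch requests for 5 jobs.
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
	"sync"
)

func main() {
	const nomadAddr = "http://127.0.0.1:4646"
	jobs := []string{"batch-0", "batch-1", "batch-2", "batch-3", "batch-4"}
	const perJob = 2000
	const workers = 200

	work := make(chan string, workers)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for jobID := range work {
				url := fmt.Sprintf("%s/v1/job/%s/dispatch", nomadAddr, jobID)
				resp, err := http.Post(url, "application/json", strings.NewReader("{}"))
				if err != nil {
					log.Printf("dispatch %s: %v", jobID, err)
					continue
				}
				resp.Body.Close()
			}
		}()
	}

	// Enqueue 2000 dispatches for each parameterized job.
	for _, jobID := range jobs {
		for i := 0; i < perJob; i++ {
			work <- jobID
		}
	}
	close(work)
	wg.Wait()
}
```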

Overall the data suggests this patch is a significant improvement for the pathological case it's intended to address.

Plan apply times

The graphs below show the before and after of plan apply times, at the plan applier and at the FSM. The FSM apply times (the green line here) were reduced by 10x, with the exception of a couple of short spikes.

before-plan-apply after-plan-apply

Overall FSM apply times and Raft append entries RPCs

The graphs below show the before and after of the Raft FSM apply metrics and the Raft append-entries RPC metrics. The FSM apply times (the yellow lines here) were again reduced by 10x, with the exception of a couple of short spikes.

before-raft after-raft

Relevant RPC times

The graphs below show the before and after of the Client.UpdateAlloc and Eval.Reblock RPC times involved in the original investigation. These were reduced by 5x or more, again with the exception of a couple of short spikes.

before-relevant-rpcs after-relevant-rpcs

Eval broker process times

The graphs below show the before and after of the eval process metric that we record in the eval broker. This shows a 3-4x reduction.

before-broker-process after-broker-process

Blocking trace data

The diagrams below show blocking profile samples for the before and after for goroutines that handle RPCs. In the before, we can see a significant amount of time spent blocking on the blocked evals mutex. In the after, the Reblock RPC doesn't even show up in any of the samples.

before-trace-rpc-conn-blocking after-trace-rpc-conn-blocking

CPU profiles

The flamegraphs below show the before and after of the CPU profile of the leader. The leader is on-CPU significantly more after the fix; most of this appears to be because its own scheduler goroutines are blocking less. The heap profile shows a similar increase in heap use that can be attributed to the scheduler.

before-profile-leader-cpu after-profile-leader-cpu

@tgross tgross marked this pull request as ready for review December 3, 2025 20:42
@tgross tgross requested review from a team as code owners December 3, 2025 20:42
@tgross tgross requested review from chrisroberts, jrasell, pkazmierczak and schmichael and removed request for jrasell December 3, 2025 20:42
pkazmierczak previously approved these changes Dec 8, 2025

@pkazmierczak pkazmierczak left a comment

LGTM. I've read this carefully and we also discussed it offline. I understand the motivations and the reasoning is sound to me. Following this code is tough, but the write-up and benchmarks convince me.

Amazing work!

Co-authored-by: Piotr Kazmierczak <[email protected]>
@tgross tgross merged commit 1279a00 into main Dec 8, 2025
38 checks passed
@tgross tgross deleted the b-blocked-evals-coarse-locking branch December 8, 2025 19:08