# Forkchoice Concurrency + Stale Snapshot Analysis

Date: 2026-01-22
Branch: forkchoice-graph (local)

## Executive summary
- Forkchoice can be mutated concurrently with chain processing in the current production path, because gossip callbacks may run on the Rust bridge thread while `onInterval` runs on the xev loop thread.
- The specific “canonical view + analysis mismatch” issue existed on `main` and has now been fixed locally by computing view + analysis under a single shared lock.
- A broader risk remains: an analysis result can become stale before it is used, anywhere results are computed first and consumed later while forkchoice can still mutate concurrently.
- These concurrency/staleness risks were **not introduced** by the forkchoice visualization PR; they already exist on `main`.

## Evidence of concurrent mutation
### Rust bridge thread calls gossip handlers directly
- `pkgs/network/src/ethlibp2p.zig` spawns a Rust bridge thread (`Thread.spawn`) and uses it to deliver gossip.
- In `handleGossipFromRustBridge`, gossip is dispatched via `gossipHandler.onGossip(..., scheduleOnLoop=false)`. That path invokes handlers immediately on the Rust thread.

### Gossip handler does not schedule on the xev loop
- `pkgs/network/src/interface.zig:GenericGossipHandler.onGossip` only schedules via xev when `scheduleOnLoop=true`. The ethlibp2p path passes `false`, so handlers run synchronously on the Rust thread.

### Chain mutation from gossip vs `onInterval`
- `pkgs/node/src/node.zig:onGossip` -> `pkgs/node/src/chain.zig:onGossip` -> `forkChoice.onBlock/onAttestation` can run on the Rust bridge thread.
- `pkgs/node/src/node.zig:onInterval` -> `pkgs/node/src/chain.zig:onInterval` -> `forkChoice.onInterval` runs on the xev loop thread.

Net: forkchoice (and chain state) can be mutated from two threads at once.
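The dispatch split above can be modeled in a few lines. The sketch below is Rust with invented names (`on_gossip`, `Task`), not the actual Zig handler or bridge code; it only shows why `scheduleOnLoop=false` means the handler runs on the caller's (bridge) thread instead of the loop thread.

```rust
use std::sync::mpsc::Sender;

// Hypothetical model of the two delivery modes in
// GenericGossipHandler.onGossip (names and types simplified; the real code
// lives in pkgs/network/src/interface.zig).
type Task = Box<dyn FnOnce() + Send>;

fn on_gossip(schedule_on_loop: bool, handler: Task, loop_tx: &Sender<Task>) {
    if schedule_on_loop {
        // Deferred: the loop thread picks the handler up later.
        loop_tx.send(handler).expect("loop thread gone");
    } else {
        // Immediate: the handler runs on the caller's thread -- for the
        // ethlibp2p path, that caller is the Rust bridge thread.
        handler();
    }
}
```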

## Stale analysis: what it is
A “stale analysis” happens when:
1) canonical analysis is computed from snapshot S
2) forkchoice mutates to snapshot S'
3) the results from S are used later (pruning, DB updates, rebase)

This can lead to misclassifying, or even pruning, blocks that are now canonical.
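The three steps above can be reproduced in a toy model. Everything here is invented for illustration (a set of blocks plus a canonical head behind a reader-writer lock, mirroring the shared/exclusive locking in `forkchoice.zig`); the point is that each step is individually lock-correct, yet the combined result prunes the new canonical head.

```rust
use std::collections::HashSet;
use std::sync::RwLock;

// Toy model of the stale-analysis hazard (all names invented): forkchoice is
// a set of blocks plus a canonical head, guarded by an RwLock.
struct ForkChoice {
    blocks: HashSet<&'static str>,
    head: &'static str,
}

// Step 1: analysis is computed from snapshot S under a shared lock, and the
// lock is released when the guard drops -- before the result is consumed.
fn non_canonical(fc: &RwLock<ForkChoice>) -> Vec<&'static str> {
    let guard = fc.read().unwrap();
    guard.blocks.iter().copied().filter(|b| *b != guard.head).collect()
}
```

The test sequence is: analyze under the read lock, mutate the head under the write lock (snapshot S'), then prune using the stale result; the block that just became canonical is removed.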

## Where stale analysis can occur
### 1) Finalization processing
File: `pkgs/node/src/chain.zig`, in `processFinalizationAdvancement`
- The local change (this branch) now uses `getCanonicalViewAndAnalysis(...)` to build view + analysis under one shared lock, fixing the *view/analysis mismatch*.
- **Remaining risk:** analysis results can still go stale before the DB updates/pruning run if forkchoice mutates after the analysis completes.

### 2) Periodic pruning
File: `pkgs/node/src/chain.zig`, near line ~210
- Uses `getCanonicalityAnalysis(..., null)`, which is internally consistent (the view is built inside the analysis call).
- **Remaining risk:** analysis results can become stale before pruning if forkchoice mutates concurrently.

### 3) Observability (lower severity)
- `pkgs/node/src/chain.zig:printSlot` reads multiple forkchoice values in separate calls. These can be mutually inconsistent under concurrency, but this affects observability only.
## Specific bug fixed in this branch
### View + analysis mismatch in finalization
On `main`, `processFinalizationAdvancement` did:
1) `getCanonicalView` (shared lock)
2) `getCanonicalityAnalysis` (shared lock)

If forkchoice mutated between the two calls, the analysis could be computed against a stale view. This was present on `main` before the PR.

Local fix:
- Added `getCanonicalViewAndAnalysis(...)` in `forkchoice.zig` to compute both under one shared lock.
- Updated `processFinalizationAdvancement` to call the combined API.
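A minimal sketch of the combined-API shape, with hypothetical Rust types standing in for the real view and analysis structures: both results are built while one read-lock guard is held, so they necessarily describe the same forkchoice snapshot.

```rust
use std::sync::RwLock;

// Hypothetical stand-ins for the real forkchoice/view/analysis types.
struct ForkChoice { head: u64 }
struct View { head: u64 }
struct Analysis { head: u64 }

// Sketch of the combined-API fix: instead of taking the shared lock twice
// (once per result), take it once and derive both from the same guard.
fn get_canonical_view_and_analysis(fc: &RwLock<ForkChoice>) -> (View, Analysis) {
    let guard = fc.read().unwrap(); // one shared lock for both results
    let view = View { head: guard.head };
    let analysis = Analysis { head: guard.head }; // same snapshot as the view
    (view, analysis)
}
```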

## Other forkchoice lock issue fixed
- `computeDeltas(...)` previously took only a shared lock while mutating `self.deltas` and `self.attestations`.
- This branch updates it to take the exclusive lock.
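The shared-vs-exclusive distinction is easy to model in Rust, where a shared (`read`) guard only hands out an immutable reference, so the mutation `computeDeltas` performs forces the exclusive (`write`) lock. Names below are invented; this is a model of the fix, not the Zig code.

```rust
use std::sync::RwLock;

// Toy model of why computeDeltas needs the exclusive lock: it *writes*
// self.deltas and self.attestations, so a shared (read) lock is not enough.
struct ForkChoice {
    deltas: Vec<i64>,
    attestations: Vec<u64>,
}

fn compute_deltas(fc: &RwLock<ForkChoice>) {
    // fc.read() would only yield &ForkChoice; mutation requires write().
    let mut guard = fc.write().unwrap(); // exclusive lock, as in the branch fix
    guard.deltas.push(1);
    guard.attestations.clear();
}
```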

## Were these issues introduced by the forkchoice visualization PR?
No.
- The Rust bridge concurrency path exists on `main`.
- The view/analysis split in finalization existed on `main`.
- These are pre-existing issues; the current branch fixes a subset of them.

## Options to fully address concurrency/staleness
### Option A — Single-threaded chain (recommended)
- Route all gossip callbacks onto the xev loop thread (set `scheduleOnLoop=true` and fix the scheduling bug).
- This makes chain + forkchoice effectively single-threaded and removes the stale-analysis hazard.
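A sketch of the single-consumer shape Option A assumes (invented names, Rust channels instead of xev scheduling): producer threads post closures, and one loop thread owns the state, so mutation is serialized without any lock on the state itself.

```rust
use std::sync::mpsc::Receiver;

// Sketch of Option A (assumed shape, names invented): callbacks from any
// producer thread are posted to one loop thread, which owns all chain +
// forkchoice state, so every mutation runs on that thread in order.
type Task = Box<dyn FnOnce(&mut u64) + Send>;

fn run_loop(tasks: Receiver<Task>) -> u64 {
    let mut chain_state = 0u64; // stands in for chain + forkchoice state
    for task in tasks {
        task(&mut chain_state); // gossip and interval work, one at a time
    }
    chain_state // returned once all senders hang up
}
```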

### Option B — Chain-level mutex
- Add a `Chain` mutex and lock it at the entry of `onGossip`, `onInterval`, `onBlock`, finalization pruning, etc.
- Serializes access without reworking the network callbacks.
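Option B, sketched with hypothetical names: one mutex wraps the whole chain, and every entry point takes it, so callbacks may still arrive on any thread but the work they do is serialized.

```rust
use std::sync::Mutex;

// Sketch of Option B (names invented): a single chain-level mutex taken at
// every entry point, so gossip and interval work serialize regardless of
// which thread delivers the callback.
struct Chain { slot: u64 }

struct LockedChain { inner: Mutex<Chain> }

impl LockedChain {
    fn on_gossip(&self) {
        let mut c = self.inner.lock().unwrap(); // serialize with on_interval
        c.slot += 1;
    }
    fn on_interval(&self) {
        let mut c = self.inner.lock().unwrap(); // serialize with on_gossip
        c.slot += 1;
    }
}
```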

### Option C — Full concurrent correctness
- Introduce explicit locking around all shared chain state and make all analysis/pruning operate on snapshots.
- Higher effort, higher complexity.

## Recommendation
Keep the forkchoice visualization PR scoped:
- Accept the local fixes (combined view+analysis, `computeDeltas` lock).
- Track the broader concurrency-model decision separately (Option A or B).