
Commit 62f8e18

chetanyb and g11tech authored
add root to slot caching (#540)
* feat: add global root-to-slot cache * feat: optimize root to slot caching and add pruning * ttest: add integration test for root-to-slot cache lifecycle * fix: address comment * fix: move pruning to processFinaslizationAdvancement * fix: address comment * fix: initialize root-to-slot cache at chain startup * fix: address PR review comments for root-to-slot cache - Remove cache filling from process_block_header (done externally) - Use orelse semantics for cache allocation in process_attestations - Use block_cache directly without if check - Add cache update in chain.zig onBlock before STF - Update ArrayList API for Zig 0.15.2 - Strip debug info for zkvm targets to fix RISC-V linker overflow - Update test to match new cache lifecycle design * fix: resolve double-deinit panic in attestation processing * fix: post merge changes and comment addressal --------- Co-authored-by: g11tech <gajinder@zeam.in>
1 parent 624366e commit 62f8e18


8 files changed, +333 −35 lines


build.zig

Lines changed: 1 addition & 0 deletions
@@ -810,6 +810,7 @@ fn build_zkvm_targets(
         .root_source_file = b.path("pkgs/state-transition-runtime/src/main.zig"),
         .target = target,
         .optimize = optimize,
+        .strip = true, // Strip debug info to avoid RISC-V relocation overflow
     }),
 });
 // addImport to the root module is still required after declaring it in mod

forkchoice_concurrency_analysis.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# Forkchoice Concurrency + Stale Snapshot Analysis

Date: 2026-01-22
Branch: forkchoice-graph (local)

## Executive summary
- Forkchoice can be mutated concurrently with chain processing in the current production path because gossip callbacks may run on the Rust bridge thread while `onInterval` runs on the xev loop thread.
- The specific “canonical view + analysis mismatch” issue existed on `main` and has now been fixed locally by combining view + analysis under a single shared lock.
- A broader “analysis result can become stale before it is used” risk remains anywhere analysis results are computed and then used later while forkchoice can still mutate concurrently.
- These concurrency/staleness risks were **not introduced** by the forkchoice visualization PR; they already exist on `main`.

## Evidence of concurrent mutation

### Rust bridge thread calls gossip handlers directly
- `pkgs/network/src/ethlibp2p.zig` spawns a Rust bridge thread (`Thread.spawn`) and uses it to deliver gossip.
- In `handleGossipFromRustBridge`, gossip is dispatched via `gossipHandler.onGossip(..., scheduleOnLoop=false)`. That path invokes handlers immediately on the Rust thread.

### Gossip handler does not schedule on the xev loop
- `pkgs/network/src/interface.zig:GenericGossipHandler.onGossip` only schedules via xev when `scheduleOnLoop=true`. The ethlibp2p path passes `false`, so handlers run synchronously on the Rust thread.

### Chain mutation from gossip vs onInterval
- `pkgs/node/src/node.zig:onGossip` -> `pkgs/node/src/chain.zig:onGossip` -> `forkChoice.onBlock/onAttestation` can run on the Rust bridge thread.
- `pkgs/node/src/node.zig:onInterval` -> `pkgs/node/src/chain.zig:onInterval` -> `forkChoice.onInterval` runs on the xev loop.

Net: forkchoice (and chain state) can be mutated concurrently.

## Stale analysis: what it is
A “stale analysis” happens when:
1) canonical analysis is computed from snapshot S
2) forkchoice mutates to snapshot S'
3) the results from S are used later (pruning, DB updates, rebase)

This can lead to misclassification or pruning of blocks that are now canonical.
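The three steps above can be sketched as a hedged Zig fragment; `rw_lock`, `analyzeCanonicality`, and `pruneWith` are hypothetical names used for illustration, not the repo's actual API:

```zig
// Illustrative only: the lock covers the computation, not the later use.
const analysis = blk: {
    fork_choice.rw_lock.lockShared();
    defer fork_choice.rw_lock.unlockShared();
    break :blk try fork_choice.analyzeCanonicality(); // step 1: computed from snapshot S
};
// Lock released: the Rust bridge thread may run forkChoice.onBlock here,
// advancing forkchoice to snapshot S'. (step 2)
try pruneWith(analysis); // step 3: results from S applied against S'
```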
## Where stale analysis can occur

### 1) Finalization processing
File: `pkgs/node/src/chain.zig` in `processFinalizationAdvancement`
- The local change (this branch) now uses `getCanonicalViewAndAnalysis(...)` to build view + analysis under one shared lock (fixing the *view/analysis mismatch*).
- **Remaining risk:** analysis results can still be stale before DB updates/pruning if forkchoice mutates concurrently after analysis.

### 2) Periodic pruning
File: `pkgs/node/src/chain.zig` near line ~210
- Uses `getCanonicalityAnalysis(..., null)`, which is internally consistent (the view is built inside the analysis call).
- **Remaining risk:** analysis results can become stale before pruning if forkchoice mutates concurrently.

### 3) Observability (lower severity)
- `pkgs/node/src/chain.zig:printSlot` reads multiple forkchoice values in separate calls. These can be inconsistent under concurrency, but this is observability only.

## Specific bug fixed in this branch

### View + analysis mismatch in finalization
On `main`, `processFinalizationAdvancement` did:
1) `getCanonicalView` (shared lock)
2) `getCanonicalityAnalysis` (shared lock)

If forkchoice mutated in between, analysis could be computed against a stale view. This was present on `main` before the PR.

Local fix:
- Added `getCanonicalViewAndAnalysis(...)` in `forkchoice.zig` to compute both under one shared lock.
- Updated `processFinalizationAdvancement` to call the combined API.
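A minimal sketch of that combined API, assuming hypothetical types `CanonicalView`/`CanonicalityAnalysis` and hypothetical locked helpers; the actual signature in `forkchoice.zig` may differ:

```zig
// Sketch only: one shared-lock acquisition covers both computations, so the
// analysis is guaranteed to be built against the same snapshot as the view.
pub fn getCanonicalViewAndAnalysis(self: *ForkChoice, allocator: std.mem.Allocator) !struct {
    view: CanonicalView,
    analysis: CanonicalityAnalysis,
} {
    self.rw_lock.lockShared();
    defer self.rw_lock.unlockShared();

    const view = try self.buildCanonicalViewLocked(allocator);
    const analysis = try self.analyzeCanonicalityLocked(allocator, view);
    return .{ .view = view, .analysis = analysis };
}
```

Note the remaining risk described above still applies: the combined result can itself go stale once the shared lock is released.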
## Other forkchoice lock issue fixed
- `computeDeltas(...)` previously used a shared lock while mutating `self.deltas` and `self.attestations`.
- Updated to use an exclusive lock in this branch.

## Were these issues introduced by the forkchoice visualization PR?
No.
- The Rust bridge concurrency path exists on `main`.
- The view/analysis split in finalization existed on `main`.
- These are pre-existing issues; the current branch fixes a subset of them.

## Options to fully address concurrency/staleness

### Option A — Single-threaded chain (recommended)
- Route all gossip callbacks onto the xev loop thread (set `scheduleOnLoop=true` and fix the scheduling bug).
- This makes chain + forkchoice effectively single-threaded and removes the stale-analysis hazard.

### Option B — Chain-level mutex
- Add a `Chain` mutex and lock it at the entry of `onGossip`, `onInterval`, `onBlock`, finalization pruning, etc.
- Ensures serialized access without reworking network callbacks.
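Option B could be sketched as follows; the `Chain` fields and handler bodies here are illustrative only:

```zig
const std = @import("std");

// Sketch of a chain-level mutex (Option B): every entry point that can touch
// chain/forkchoice state takes the same lock, serializing the Rust bridge
// thread and the xev loop thread.
pub const Chain = struct {
    chain_mutex: std.Thread.Mutex = .{},

    pub fn onGossip(self: *Chain) void {
        self.chain_mutex.lock();
        defer self.chain_mutex.unlock();
        // ... forkChoice.onBlock / forkChoice.onAttestation ...
    }

    pub fn onInterval(self: *Chain) void {
        self.chain_mutex.lock();
        defer self.chain_mutex.unlock();
        // ... forkChoice.onInterval, finalization processing, pruning ...
    }
};
```

Because analysis and its later use (pruning, DB updates) then run under the same lock, this also closes the stale-analysis window, at the cost of serializing gossip handling behind interval processing.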
### Option C — Full concurrent correctness
- Introduce explicit locking around all shared chain state and make all analysis/pruning operate on snapshots.
- Higher effort, higher complexity.

## Recommendation
Keep the forkchoice visualization PR scoped:
- Accept the local fixes (combined view+analysis, computeDeltas lock).
- Track the broader concurrency model decision separately (Option A or B).

pkgs/node/src/chain.zig

Lines changed: 19 additions & 3 deletions
@@ -107,6 +107,8 @@ pub const BeamChain = struct {
     // Cache for validator public keys to avoid repeated SSZ deserialization during signature verification.
     // Significantly reduces CPU overhead when processing blocks with many attestations.
     public_key_cache: xmss.PublicKeyCache,
+    // Cache for root to slot mapping to optimize block processing performance.
+    root_to_slot_cache: types.RootToSlotCache,
 
     // Callback for pruning cached blocks after finalization advances
     prune_cached_blocks_ctx: ?*anyopaque = null,
@@ -139,7 +141,7 @@ pub const BeamChain = struct {
         try types.sszClone(allocator, types.BeamState, opts.anchorState.*, cloned_anchor_state);
         try states.put(fork_choice.head.blockRoot, cloned_anchor_state);
 
-        return Self{
+        var chain = Self{
             .nodeId = opts.nodeId,
             .config = opts.config,
             .forkChoice = fork_choice,
@@ -157,8 +159,13 @@ pub const BeamChain = struct {
             .node_registry = opts.node_registry,
             .force_block_production = opts.force_block_production,
             .public_key_cache = xmss.PublicKeyCache.init(allocator),
+            .root_to_slot_cache = types.RootToSlotCache.init(allocator),
             .pending_blocks = .empty,
         };
+        // Initialize cache with anchor block root and any post-finalized entries from state
+        try chain.root_to_slot_cache.put(fork_choice.head.blockRoot, opts.anchorState.slot);
+        try chain.anchor_state.initRootToSlotCache(&chain.root_to_slot_cache);
+        return chain;
     }
 
     pub fn setPruneCachedBlocksCallback(self: *Self, ctx: *anyopaque, func: PruneCachedBlocksFn) void {
@@ -186,6 +193,9 @@ pub const BeamChain = struct {
         // Clean up public key cache
         self.public_key_cache.deinit();
 
+        // Clean up root to slot cache
+        self.root_to_slot_cache.deinit();
+
         // Clean up any blocks that were queued waiting for the forkchoice clock
         for (self.pending_blocks.items) |*block| {
             block.deinit();
@@ -453,7 +463,7 @@ pub const BeamChain = struct {
         self.module_logger.debug("node-{d}::going for block production opts={any} raw block={s}", .{ self.nodeId, opts, block_str });
 
         // 2. apply STF to get post state & update post state root & cache it
-        try stf.apply_raw_block(self.allocator, post_state, &block, self.block_building_logger);
+        try stf.apply_raw_block(self.allocator, post_state, &block, self.block_building_logger, &self.root_to_slot_cache);
 
         const block_str_2 = try block.toJsonString(self.allocator);
         defer self.allocator.free(block_str_2);
@@ -774,13 +784,16 @@ pub const BeamChain = struct {
 
             // 3. apply state transition assuming signatures are valid (STF does not re-verify)
             try stf.apply_transition(self.allocator, cpost_state, block, .{
-                //
                 .logger = self.stf_logger,
                 .validSignatures = true,
+                .rootToSlotCache = &self.root_to_slot_cache,
             });
             break :computedstate cpost_state;
         };
 
+        // Add current block's root to cache AFTER STF (ensures cache stays in sync with historical_block_hashes)
+        try self.root_to_slot_cache.put(block_root, block.slot);
+
         var missing_roots: std.ArrayList(types.Root) = .empty;
         errdefer missing_roots.deinit(self.allocator);
 
@@ -1187,6 +1200,9 @@ pub const BeamChain = struct {
         // }
         // }
 
+        // Prune root-to-slot cache up to finalized slot
+        try self.root_to_slot_cache.prune(latestFinalized.slot);
+
         // Record successful finalization
         zeam_metrics.metrics.lean_finalizations_total.incr(.{ .result = "success" }) catch {};
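As context for the diff above, a minimal stand-in for the cache type could look like this; the real `types.RootToSlotCache` lives in `pkgs/types` and may differ, and the `Root`/`Slot` aliases here are assumptions:

```zig
const std = @import("std");

pub const Root = [32]u8; // assumed alias; the repo defines its own types.Root
pub const Slot = u64;

/// Sketch of a root-to-slot cache with the lifecycle wired up above:
/// init at chain startup, put after STF, prune at finalization advancement.
pub const RootToSlotCache = struct {
    allocator: std.mem.Allocator,
    map: std.AutoHashMap(Root, Slot),

    pub fn init(allocator: std.mem.Allocator) RootToSlotCache {
        return .{ .allocator = allocator, .map = std.AutoHashMap(Root, Slot).init(allocator) };
    }

    pub fn deinit(self: *RootToSlotCache) void {
        self.map.deinit();
    }

    pub fn put(self: *RootToSlotCache, root: Root, slot: Slot) !void {
        try self.map.put(root, slot);
    }

    /// Drop entries below the finalized slot (keeping the finalized root itself
    /// is a design choice). Keys are collected first so the map is not mutated
    /// while iterating; uses the Zig 0.15 unmanaged ArrayList API.
    pub fn prune(self: *RootToSlotCache, finalized_slot: Slot) !void {
        var stale: std.ArrayList(Root) = .empty;
        defer stale.deinit(self.allocator);
        var it = self.map.iterator();
        while (it.next()) |entry| {
            if (entry.value_ptr.* < finalized_slot) {
                try stale.append(self.allocator, entry.key_ptr.*);
            }
        }
        for (stale.items) |root| _ = self.map.remove(root);
    }
};
```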

pkgs/state-transition/src/mock.zig

Lines changed: 1 addition & 1 deletion
@@ -330,7 +330,7 @@ pub fn genMockChain(allocator: Allocator, numBlocks: usize, from_genesis: ?types
     agg_att_cleanup = false;
 
     // prepare pre state to process block for that slot, may be rename prepare_pre_state
-    try transition.apply_raw_block(allocator, &beam_state, &block, block_building_logger);
+    try transition.apply_raw_block(allocator, &beam_state, &block, block_building_logger, null);
     try zeam_utils.hashTreeRoot(types.BeamBlock, block, &block_root, allocator);
 
     // generate the signed beam block and add to block list

pkgs/state-transition/src/transition.zig

Lines changed: 4 additions & 4 deletions
@@ -1,7 +1,6 @@
 const std = @import("std");
 const json = std.json;
 const types = @import("@zeam/types");
-const utils = types.utils;
 
 const params = @import("@zeam/params");
 const zeam_utils = @import("@zeam/utils");
@@ -20,6 +19,7 @@ pub const StateTransitionOpts = struct {
     validSignatures: bool = true,
     validateResult: bool = true,
     logger: zeam_utils.ModuleLogger,
+    rootToSlotCache: ?*types.RootToSlotCache = null,
 };
 
 // pub fn process_epoch(state: types.BeamState) void {
@@ -36,7 +36,7 @@ fn process_execution_payload_header(state: *types.BeamState, block: types.BeamBl
     }
 }
 
-pub fn apply_raw_block(allocator: Allocator, state: *types.BeamState, block: *types.BeamBlock, logger: zeam_utils.ModuleLogger) !void {
+pub fn apply_raw_block(allocator: Allocator, state: *types.BeamState, block: *types.BeamBlock, logger: zeam_utils.ModuleLogger, cache: ?*types.RootToSlotCache) !void {
     // prepare pre state to process block for that slot, may be rename prepare_pre_state
     const transition_timer = zeam_metrics.lean_state_transition_time_seconds.start();
     defer _ = transition_timer.observe();
@@ -45,7 +45,7 @@ pub fn apply_raw_block(allocator: Allocator, state: *types.BeamState, block: *ty
     try state.process_slots(allocator, block.slot, logger);
 
     // process block and modify the pre state to post state
-    try state.process_block(allocator, block.*, logger);
+    try state.process_block(allocator, block.*, logger, cache);
 
     logger.debug("extracting state root\n", .{});
     // extract the post state root
@@ -205,7 +205,7 @@ pub fn apply_transition(allocator: Allocator, state: *types.BeamState, block: ty
     try state.process_slots(allocator, block.slot, opts.logger);
 
     // process the block
-    try state.process_block(allocator, block, opts.logger);
+    try state.process_block(allocator, block, opts.logger, opts.rootToSlotCache);
 
     const validateResult = opts.validateResult;
     if (validateResult) {

pkgs/types/src/lib.zig

Lines changed: 1 addition & 0 deletions
@@ -68,6 +68,7 @@ pub const GenesisSpec = utils.GenesisSpec;
 pub const ChainSpec = utils.ChainSpec;
 pub const sszClone = utils.sszClone;
 pub const IsJustifiableSlot = utils.IsJustifiableSlot;
+pub const RootToSlotCache = utils.RootToSlotCache;
 
 const zk = @import("./zk.zig");
 pub const ZkVm = zk.ZkVm;
