-
Notifications
You must be signed in to change notification settings - Fork 952
Fix bugs in proposer calculation post-Fulu #8101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bugs in proposer calculation post-Fulu #8101
Conversation
|
I looked at this with @michaelsproul today and the approach makes sense. I need a few more passes on my own to fully get the changes. |
Squashed commit of the following: commit 9b4e4eb Author: Michael Sproul <[email protected]> Date: Thu Sep 25 13:16:44 2025 +1000 Use correct decision block root in missed block test commit 6da576c Author: Michael Sproul <[email protected]> Date: Thu Sep 25 12:57:33 2025 +1000 Oops commit 83128c4 Author: Michael Sproul <[email protected]> Date: Thu Sep 25 12:56:03 2025 +1000 Document and refine proposer_shuffling_root_for_child_block commit a08fd93 Author: Michael Sproul <[email protected]> Date: Thu Sep 25 12:55:53 2025 +1000 Remove FIXME in favour of comment commit 353fdcb Author: Michael Sproul <[email protected]> Date: Thu Sep 25 11:56:20 2025 +1000 Use min_seed_lookahead in ChainSpec::proposer_shuffling_decision_slot commit 4ceccd2 Author: Michael Sproul <[email protected]> Date: Thu Sep 25 11:23:41 2025 +1000 Re-implement OnceCell optimisation commit 0ed80ff Author: Michael Sproul <[email protected]> Date: Wed Sep 24 19:02:33 2025 +1000 Remove resolved FIXME commit 645c869 Author: Michael Sproul <[email protected]> Date: Wed Sep 24 18:57:53 2025 +1000 Simplify blob verification commit 0917bfb Author: Michael Sproul <[email protected]> Date: Wed Sep 24 18:51:37 2025 +1000 More bug fixes and simplifications commit d41c236 Author: Michael Sproul <[email protected]> Date: Wed Sep 24 14:09:44 2025 +1000 Use new proposer cache method in block verif commit b34af93 Author: Michael Sproul <[email protected]> Date: Wed Sep 24 12:43:09 2025 +1000 Start using with_proposer_cache commit dd799d8 Author: Michael Sproul <[email protected]> Date: Wed Sep 24 12:18:57 2025 +1000 Laying the groundwork for safer proposer index calcs commit b1c5308 Author: Michael Sproul <[email protected]> Date: Mon Sep 22 17:54:02 2025 +1000 Add sanity checks in BeaconState::compute_proposer_indices
| // We do not run this check if this function is called from `upgrade_to_fulu`, | ||
| // which runs *after* the slot is incremented, and needs to compute the proposer | ||
| // shuffling for the epoch that was just transitioned into. | ||
| if self.fork_name_unchecked().fulu_enabled() | ||
| && epoch < current_epoch.safe_add(spec.min_seed_lookahead)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtraglia This is the exception to the assert I mentioned, which allows us to compute the proposer indices with 0 lookahead only in the case of upgrade_to_fulu (when the state itself has not yet become the Fulu variant).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, this makes sense. I think these checks (this one & the one mentioned in the comment below) might be a little too complex to add to the specs. I still prefer just removing these two helper functions sometime in the future. I'll try to implement these in the specs locally just to get a feel for what it would look like though.
| // This isn't in the spec, but we remove the footgun that is requesting the current epoch | ||
| // for a Fulu state. | ||
| if let Ok(proposer_lookahead) = self.proposer_lookahead() | ||
| && epoch >= self.current_epoch() | ||
| && epoch <= self.next_epoch()? | ||
| { | ||
| let slots_per_epoch = E::slots_per_epoch() as usize; | ||
| let start_offset = if epoch == self.current_epoch() { | ||
| 0 | ||
| } else { | ||
| slots_per_epoch | ||
| }; | ||
| return Ok(proposer_lookahead | ||
| .iter_from(start_offset)? | ||
| .take(slots_per_epoch) | ||
| .map(|x| *x as usize) | ||
| .collect()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtraglia Similarly, this block only comes into effect outside of the upgrade_to_fulu logic.
| let epoch_cache: &EpochCache = state.epoch_cache(); | ||
| let decision_block_root = state | ||
| .proposer_shuffling_decision_root(Hash256::zero()) | ||
| .epoch_cache_decision_root(Hash256::zero()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The epoch cache is keyed on the last block of the previous epoch. Prior to Fulu this was the same as the proposer shuffling decision root, however that is no longer the case. So we use a separate function.
| /// | ||
| /// This function assumes that `child_block_epoch >= self.epoch`. It is the responsibility of | ||
| /// the caller to check this condition, or else incorrect results will be produced. | ||
| pub fn proposer_shuffling_root_for_child_block( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is a bit subtle. It was extracted from the 2-3 places where we used it previously (block/blob/data column verification).
It relates the proposer shuffling to the attester shufflings. The relationship is slightly different depending on whether Fulu is enabled. Prior to Fulu everything is shifted by an epoch, and then post Fulu things line up (proposer shuffling and attester shuffling for epoch N have the same decision block root).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice refactoring, much better keeping this kind of logic in one place
| beacon_proposer_cache.lock().insert( | ||
| epoch, | ||
| duplicate_block_root, | ||
| decision_block_root, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously this test assumed that the duplicate_block_root was the decision root, but that is no longer true post Fulu.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the case for this one too?
| .proposer_shuffling_decision_root(duplicate_block_root, &harness1.chain.spec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is wrong, yeah, but doesn't actually do anything, because the block_root argument to proposer_shuffling_decision_root isn't used unless we're right on an epoch boundary (pre-Fulu), or there's a whole epoch of skipped slots (post Fulu).
I think I would prefer to completely rewrite these tests than tweaking them any more. There is an open issue for this:
| let dependent_root = new_snapshot | ||
| .beacon_state | ||
| .proposer_shuffling_decision_root(self.genesis_block_root); | ||
| .attester_shuffling_decision_root(self.genesis_block_root, RelativeEpoch::Next); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These used to be equivalent, but are no longer. This dependent root is meant to be the attestation shuffling for the next epoch, so it makes a lot more sense to use that explicitly!
| parent_block.proposer_shuffling_root_for_child_block(block_epoch, &chain.spec); | ||
|
|
||
| // We assign to a variable instead of using `if let Some` directly to ensure we drop the | ||
| // write lock before trying to acquire it again in the `else` clause. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment above is no longer relevant here, now we've moved the lock management into with_proposer_cache
| /// | ||
| /// - ProtoBlock::proposer_shuffling_root_for_child_block, and | ||
| /// - BeaconState::proposer_shuffling_decision_root{_at_epoch} | ||
| async fn proposer_shuffling_root_consistency_test(parent_slot: u64, child_slot: u64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a little strange that these tests are under store_tests.rs, i guess it's more for convenience?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just write everything under store tests haha
| Ok(()) | ||
| ); | ||
|
|
||
| let decision_block_root = _state2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be renamed to _state now it's being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests do this a lot already, I hate them 🤣
Again, I reckon we just:
|
We have this deployed to both live testnets and |
|
I think the attack on a live network might require a proposer cache miss. So potentially multiple skipped slots near an epoch boundary, plus the effective balance changes |
| } | ||
| } | ||
|
|
||
| pub fn with_proposer_cache<V, E: From<BeaconChainError> + From<BeaconStateError>>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This forces the caller to implement From for both error types, it feels a bit strange but given this works with existing code, i think it's fine to leave it as it is.
| let proposer_shuffling_root = | ||
| parent_block.proposer_shuffling_root_for_child_block(column_epoch, &chain.spec); | ||
|
|
||
| let proposer = chain.with_proposer_cache( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, we also avoid building committee caches unnecessarily with cheap_state_advance_to_obtain_committees here and use ensure_state_can_determine_proposers_for_epoch instead
| /// - It must be the case that `state.canonical_root() == state_root`, but this function will not | ||
| /// check that. | ||
| pub fn ensure_state_is_in_epoch<E: EthSpec>( | ||
| pub fn ensure_state_can_determine_proposers_for_epoch<E: EthSpec>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be worth adding instrumenting tracing for this function (we used to call cheap_state_advance_to_obtain_committees which shows up in the gossip processing traces)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite a difficult PR to review, as it covers quite a lot of code paths and edge cases, but after a few passes, i think it looks very solid! Impressed with how thorough you are with this one. I think you've handled all the code paths I can find and can't really spot any issues with correctness.
I think this is good to merge and we can address the comments next week.
|
|
||
| #[tokio::test] | ||
| async fn proposer_shuffling_root_consistency_same_epoch() { | ||
| proposer_shuffling_root_consistency_test(32, 39).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can do in another PR, but writing the numbers like will help understand them
| proposer_shuffling_root_consistency_test(32, 39).await; | |
| proposer_shuffling_root_consistency_test( | |
| E::slots_per_epoch().as_u64() * 4, | |
| E::slots_per_epoch().as_u64() * 5 - 1, | |
| ).await; |
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn proposer_shuffling_changing_with_lookahead() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice test 💯
| index: pre_deposit_state.eth1_deposit_index(), | ||
| pubkey: validator_to_topup.pubkey, | ||
| withdrawal_credentials: validator_to_topup.withdrawal_credentials, | ||
| amount: 63_000_000_000, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you chose this value to find a balance that will trigger a change in the proposer selection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The churn limit is 64 ETH, so this is just the largest value that fits under that threshold and gets applied quickly. It also happens to change the shuffling with the default seed, and I think should be likely to change it with such a small # of validators
N/A In #8101 , when we modified the logic to get the proposer index post fulu, we seem to have missed advancing the state at the fork boundaries to get the right `Fork` for signature verification. This led to lighthouse failing all gossip verification right after transitioning to fulu that was observed on the holesky shadow fork ``` Sep 26 14:24:00.088 DEBUG Rejected gossip block error: "InvalidSignature(ProposerSignature)", graffiti: "grandine-geth-super-1", slot: 640 Sep 26 14:24:00.099 WARN Could not verify block for gossip. Rejecting the block error: InvalidSignature(ProposerSignature) ``` I'm not completely sure this is the correct fix, but this fixes the issue with `InvalidProposerSignature` on the holesky shadow fork. Thanks to @eserilev for helping debug this Co-Authored-By: Pawan Dhananjay <[email protected]>
Follow-up to the bug fixed in: - #8121 This fixes the root cause of that bug, which was introduced by me in: - #8101 Lion identified the issue here: - #8101 (comment) In the methods that compute the proposer shuffling decision root, ensure we don't use lookahead for the Fulu fork epoch itself. This is accomplished by checking if Fulu is enabled at `epoch - 1`, i.e. if `epoch > fulu_fork_epoch`. I haven't updated the methods that _compute_ shufflings to use these new corrected bounds (e.g. `BeaconState::compute_proposer_indices`), although we could make this change in future. The `get_beacon_proposer_indices` method already gracefully handles the Fulu boundary case by using the `proposer_lookahead` field (if initialised). Co-Authored-By: Michael Sproul <[email protected]>
Addressing more review comments from: - #8101 I've also tweaked a few more things that I think are minor bugs. - Instrument `ensure_state_can_determine_proposers_for_epoch` - Fix `block_root` usage in `compute_proposer_duties_from_head`. This was a regression introduced in 8101 😬 . - Update the `state_advance_timer` to prime the next-epoch proposer cache post-Fulu. Co-Authored-By: Michael Sproul <[email protected]>
Addressing more review comments from: - sigp#8101 I've also tweaked a few more things that I think are minor bugs. - Instrument `ensure_state_can_determine_proposers_for_epoch` - Fix `block_root` usage in `compute_proposer_duties_from_head`. This was a regression introduced in 8101 😬 . - Update the `state_advance_timer` to prime the next-epoch proposer cache post-Fulu. Co-Authored-By: Michael Sproul <[email protected]>
As identified by a researcher during the Fusaka security competition, we were computing the proposer index incorrectly in some places by computing without lookahead. - [x] Add "low level" checks to computation functions in `consensus/types` to ensure they error cleanly - [x] Re-work the determination of proposer shuffling decision roots, which are now fork aware. - [x] Re-work and simplify the beacon proposer cache to be fork-aware. - [x] Optimise `with_proposer_cache` to use `OnceCell`. - [x] All tests passing. - [x] Resolve all remaining `FIXME(sproul)`s. - [x] Unit tests for `ProtoBlock::proposer_shuffling_root_for_child_block`. - [x] End-to-end regression test. - [x] Test on pre-Fulu network. - [x] Test on post-Fulu network. Co-Authored-By: Michael Sproul <[email protected]>
N/A In sigp#8101 , when we modified the logic to get the proposer index post fulu, we seem to have missed advancing the state at the fork boundaries to get the right `Fork` for signature verification. This led to lighthouse failing all gossip verification right after transitioning to fulu that was observed on the holesky shadow fork ``` Sep 26 14:24:00.088 DEBUG Rejected gossip block error: "InvalidSignature(ProposerSignature)", graffiti: "grandine-geth-super-1", slot: 640 Sep 26 14:24:00.099 WARN Could not verify block for gossip. Rejecting the block error: InvalidSignature(ProposerSignature) ``` I'm not completely sure this is the correct fix, but this fixes the issue with `InvalidProposerSignature` on the holesky shadow fork. Thanks to @eserilev for helping debug this Co-Authored-By: Pawan Dhananjay <[email protected]>
Follow-up to the bug fixed in: - sigp#8121 This fixes the root cause of that bug, which was introduced by me in: - sigp#8101 Lion identified the issue here: - sigp#8101 (comment) In the methods that compute the proposer shuffling decision root, ensure we don't use lookahead for the Fulu fork epoch itself. This is accomplished by checking if Fulu is enabled at `epoch - 1`, i.e. if `epoch > fulu_fork_epoch`. I haven't updated the methods that _compute_ shufflings to use these new corrected bounds (e.g. `BeaconState::compute_proposer_indices`), although we could make this change in future. The `get_beacon_proposer_indices` method already gracefully handles the Fulu boundary case by using the `proposer_lookahead` field (if initialised). Co-Authored-By: Michael Sproul <[email protected]>
Issue Addressed
As identified by a researcher during the Fusaka security competition, we were computing the proposer index incorrectly in some places by computing without lookahead.
Proposed Changes
consensus/typesto ensure they error cleanlywith_proposer_cacheto useOnceCell.FIXME(sproul)s.ProtoBlock::proposer_shuffling_root_for_child_block.Testing
ProtoBlock::proposer_shuffling_root_for_child_blockis tested by new tests added in e471315. They catch the bug that was fixed by 83128c4.