Conversation

@jimmygchen jimmygchen (Member) commented Aug 21, 2025

Issue Addressed

Closes:

  • #7865
  • #7855

Changes extracted from earlier PR #7876

This PR fixes two main things, with a few other improvements mentioned below:

  • Prevents Lighthouse from repeatedly sending DataColumnByRoot requests to an unsynced peer, which caused lookup sync to get stuck (see the sketch below)
  • Allows Lighthouse to send discovery requests when there aren't enough synced peers in the required sampling subnets. This fixes the stuck-sync scenario where there aren't enough usable peers in a sampling subnet but no discovery is attempted.
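
As a rough illustration of the lookup-side fix, here is a minimal Rust sketch, with invented `SyncStatus` and `PeerInfo` types rather than Lighthouse's actual definitions, that filters lookup candidates down to synced or advanced peers:

```rust
// Illustrative types only; Lighthouse's real PeerDB types differ.
#[derive(Clone, Copy, PartialEq)]
enum SyncStatus {
    Synced,
    Advanced,
    Behind,
    Unknown,
}

struct PeerInfo {
    sync_status: SyncStatus,
}

impl PeerInfo {
    fn is_synced_or_advanced(&self) -> bool {
        matches!(self.sync_status, SyncStatus::Synced | SyncStatus::Advanced)
    }
}

/// Keep only peers that can actually serve lookup requests, so we stop
/// re-sending DataColumnByRoot requests to unsynced peers.
fn lookup_candidates(peers: &[PeerInfo]) -> Vec<&PeerInfo> {
    peers.iter().filter(|p| p.is_synced_or_advanced()).collect()
}
```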

Proposed Changes

  • Make peer discovery queries if the custody subnet peer count drops below the minimum threshold (see the sketch after this list)
  • Update the peer pruning logic to prioritise a uniform distribution across all data column subnets, and avoid pruning sampling peers when their count is below the target threshold (2)
  • Check sync status when making discovery requests, so that we don't ignore requests when there aren't enough synced peers in the required sampling subnets
  • Optimise some of the PeerDB functions that check custody peers
  • Only send lookup requests to peers that are synced or advanced
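
A minimal sketch of the discovery trigger from the first bullet, assuming a hypothetical `TARGET_PEERS_PER_SAMPLING_SUBNET` constant and a precomputed count of synced peers per subnet; the names are illustrative, not Lighthouse's actual API:

```rust
use std::collections::HashMap;

// Hypothetical threshold matching the target count mentioned above.
const TARGET_PEERS_PER_SAMPLING_SUBNET: usize = 2;

type SubnetId = u64;

/// Return the sampling subnets whose synced-peer count is below target;
/// the caller would kick off a discovery query for each of them.
fn subnets_needing_discovery(synced_peers: &HashMap<SubnetId, usize>) -> Vec<SubnetId> {
    synced_peers
        .iter()
        .filter(|&(_, &count)| count < TARGET_PEERS_PER_SAMPLING_SUBNET)
        .map(|(&subnet, _)| subnet)
        .collect()
}
```

Counting only synced peers is what connects the first and third bullets: an unsynced peer on the right subnet no longer suppresses discovery.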

Maintain peers across all sampling subnets. Make discovery requests if custody subnet peers drop below the threshold. Optimise some peerdb functions.
@jimmygchen jimmygchen requested a review from jxs as a code owner August 21, 2025 13:06
@jimmygchen jimmygchen added the ready-for-review, das (Data Availability Sampling), and v8.0.0-rc.0 (Q3 2025 release for Fusaka on Holesky) labels Aug 21, 2025
@jimmygchen jimmygchen mentioned this pull request Aug 21, 2025
mergify bot commented Aug 21, 2025

Some required checks have failed. Could you please take a look @jimmygchen? 🙏

@mergify mergify bot added the waiting-on-author label (the reviewer has suggested changes and awaits their implementation) and removed the ready-for-review label Aug 21, 2025
@jimmygchen jimmygchen added the ready-for-review label and removed the waiting-on-author label Aug 21, 2025
jimmygchen added a commit that referenced this pull request Aug 21, 2025
Squashed commit of the following:

commit 6b1f2c8
Author: Jimmy Chen <[email protected]>
Date:   Fri Aug 22 00:53:11 2025 +1000

    Fix function behaviour

commit edf8571
Author: Jimmy Chen <[email protected]>
Date:   Fri Aug 22 00:46:12 2025 +1000

    Remove brittle and unmaintainable test

commit 232e685
Author: Jimmy Chen <[email protected]>
Date:   Fri Aug 22 00:20:03 2025 +1000

    Prioritize unsynced peers for pruning

commit 9e87e49
Author: Jimmy Chen <[email protected]>
Date:   Thu Aug 21 23:20:13 2025 +1000

    Clean ups.

commit 05baf9c
Author: Jimmy Chen <[email protected]>
Date:   Thu Aug 21 23:04:13 2025 +1000

    Maintain peers across all sampling subnets. Make discovery requests if custody subnet peers drop below the threshold. Optimise some peerdb functions.
@jimmygchen jimmygchen mentioned this pull request Aug 21, 2025
@AgeManning AgeManning (Member) left a comment

I think this is an improvement, but we might want to make the data columns first-class citizens.

The overall goal of this logic was originally to get Lighthouse to reach a steady state where it never had to do any discoveries: it would find and maintain a uniform set of subnet peers.

There are two competing factors: discoveries, which generate peers, and pruning, which removes the excess. If the pruning doesn't match our discovery targets, we might be in a perpetual state of discovering and then pruning the discovered peers.

Before data columns, we would discover peers when we needed them for attestation subnets, then prune down to maintain a uniform set of attestation subnet peers, preventing any future discoveries.

With this change, we now have a new driving requirement: has_good_peers_in_custody_subnet(). We will try to discover peers constantly until we meet this requirement.

But the pruning logic is still to maintain uniform attestation subnets; the only change is that we now don't prune peers that might help with our custody subnet requirement.

I think now that things have changed, we should prioritize a uniform distribution on the data column custody subnets and, as a second priority, manage the attestation subnets. The reason is that the attestation subnets don't have a direct maintain_custody_peers()-like function causing discoveries, so they're less of a priority.

Also, for the attestation subnets we really need, we have a min_ttl which prevents those peers from being pruned when we need them. So we can rely on that to save the crucial ones from being dropped.

So I think the pruning priorities should now be (sketched after this list):

  1. Maintain a uniform distribution of data column subnet peers
  2. a. Don't remove peers that we need for attestation subnets
     b. Don't remove peers that we need for sync committees
  3. If all of the above are satisfied, remove peers to make attestation subnets uniform.
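
A minimal sketch of that ordering with invented types; this illustrates the suggested priorities, not Lighthouse's actual pruning code:

```rust
struct Peer {
    /// Protected, e.g. by an attestation-subnet min_ttl (2a).
    needed_for_attestation_subnet: bool,
    /// Needed for a sync committee (2b).
    needed_for_sync_committee: bool,
    /// How many peers we already have on this peer's data column subnet.
    data_column_subnet_peer_count: usize,
}

/// Order prunable peers so that removals push the data column subnet
/// distribution toward uniformity first (priority 1); peers covered by
/// 2a/2b are excluded from pruning entirely.
fn prune_candidates(peers: &[Peer]) -> Vec<usize> {
    let mut candidates: Vec<usize> = peers
        .iter()
        .enumerate()
        .filter(|(_, p)| !p.needed_for_attestation_subnet && !p.needed_for_sync_committee)
        .map(|(i, _)| i)
        .collect();
    // Prune from the most over-subscribed data column subnets first.
    candidates.sort_by_key(|&i| std::cmp::Reverse(peers[i].data_column_subnet_peer_count));
    candidates
}
```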

@jimmygchen jimmygchen added the waiting-on-author label and removed the ready-for-review label Aug 25, 2025
@mergify mergify bot added the ready-for-review label and removed the waiting-on-author label Aug 25, 2025
@jimmygchen jimmygchen (Member Author) commented

Thanks @AgeManning, yeah I think your suggestion makes sense; I'll make this change.

@jimmygchen jimmygchen added the work-in-progress label and removed the ready-for-review label Aug 25, 2025
@jimmygchen jimmygchen self-assigned this Aug 25, 2025
@jimmygchen jimmygchen changed the title from "Maintain peers across all sampling subnets" to "Maintain peers across all data columnj subnets" Sep 2, 2025
@michaelsproul michaelsproul changed the title from "Maintain peers across all data columnj subnets" to "Maintain peers across all data column subnets" Sep 2, 2025
@AgeManning AgeManning (Member) left a comment

Left some comments

```rust
///
/// This creates a unified structure containing all subnet information for each peer,
/// excluding trusted peers and peers already marked for pruning.
fn build_peer_subnet_info(
```
Member

Maybe for a future PR: rather than calculating this for every peer on every heartbeat, we could just change PeerInfo to store these naturally for each peer.
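
A minimal sketch of that idea, with hypothetical field and method names rather than Lighthouse's real `PeerInfo`:

```rust
use std::collections::HashSet;

/// Cached per-peer subnet view, stored once instead of being rebuilt
/// for every peer on every heartbeat. Names are hypothetical.
struct PeerSubnetInfo {
    attestation_subnets: HashSet<u64>,
    sync_committee_subnets: HashSet<u64>,
    data_column_subnets: HashSet<u64>,
}

struct PeerInfo {
    // ...existing fields elided...
    subnet_info: PeerSubnetInfo,
}

impl PeerInfo {
    /// Refresh the cached view when the peer's metadata changes, so the
    /// heartbeat's pruning pass can read it without recomputation.
    fn on_metadata_update(&mut self, data_columns: HashSet<u64>) {
        self.subnet_info.data_column_subnets = data_columns;
    }
}
```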

Member Author

Yeah, good idea; with a higher peer count it makes even more sense. I'll raise an issue for this.

@ackintosh ackintosh (Member) left a comment

Thanks for this PR, Jimmy! I've left some comments.

@AgeManning AgeManning (Member) left a comment

This looks good to me :)

@jimmygchen jimmygchen added the ready-for-merge label and removed the ready-for-review label Sep 4, 2025
@mergify mergify bot merged commit c2a92f1 into sigp:unstable Sep 4, 2025
37 checks passed
mergify bot pushed a commit that referenced this pull request Sep 4, 2025
I just noticed that one of the tests I added in #7915 is incorrect, after it had been running flaky for a bit.
This PR fixes the scenario and ensures the outcome will always be the same.
kevaundray pushed commits to kevaundray/lighthouse that referenced this pull request Sep 13, 2025
jtraglia pushed commits to jtraglia/lighthouse that referenced this pull request Sep 16, 2025