Maintain peers across all data column subnets #7915
Conversation
…f custody subnet peers drop below the threshold. Optimise some peerdb functions.
Some required checks have failed. Could you please take a look @jimmygchen? 🙏
Squashed commit of the following:
- 6b1f2c8 (Jimmy Chen <[email protected]>, Fri Aug 22 00:53:11 2025 +1000): Fix function behaviour
- edf8571 (Jimmy Chen <[email protected]>, Fri Aug 22 00:46:12 2025 +1000): Remove brittle and unmaintainable test
- 232e685 (Jimmy Chen <[email protected]>, Fri Aug 22 00:20:03 2025 +1000): Prioritize unsynced peers for pruning
- 9e87e49 (Jimmy Chen <[email protected]>, Thu Aug 21 23:20:13 2025 +1000): Clean ups.
- 05baf9c (Jimmy Chen <[email protected]>, Thu Aug 21 23:04:13 2025 +1000): Maintain peers across all sampling subnets. Make discovery requests if custody subnet peers drop below the threshold. Optimise some peerdb functions.
AgeManning
left a comment
I think this is an improvement, but we might want to make the data columns first-class citizens.
The overall goal of this logic originally was to get Lighthouse to reach a steady-state where it never had to do any discoveries. It would find and maintain a uniform set of subnet peers.
There are two competing factors: discoveries, which generate peers, and pruning, which removes the excess. If the pruning doesn't match our discovery targets, we might end up in a perpetual state of discovering and then pruning the discovered peers.
Before data columns, we would discover peers if we needed them for attestation subnets, then prune down to maintain a uniform set of attestation subnets, to prevent any future discoveries.
With this change, we now have a new driving requirement: `has_good_peers_in_custody_subnet()`. We will keep trying to discover peers until we meet this requirement.
But the pruning logic still aims to maintain uniform attestation subnets; the only change is that we now avoid pruning peers that might help with our custody subnet requirement.
Now that things have changed, I think we should prioritize a uniform distribution across the data column custody subnets and, as a second priority, manage the attestation subnets. The reason is that the attestation subnets don't have a direct `maintain_custody_peers()`-like function driving discoveries, so they are less of a priority.
Also, peers on attestation subnets we really need have a `min_ttl` which prevents them from being pruned while we need them, so we can rely on that to save the crucial ones from being dropped.
So I think the pruning priorities should now be (a rough sketch follows the list):
- Maintain a uniform distribution of peers across the data column subnets
- a - Don't remove peers that we need for attestation subnets
- b - Don't remove peers that we need for sync committees
- If all of the above are satisfied, remove peers to make attestation subnets uniform.
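To make that order concrete, here is a minimal sketch of what a prune-selection pass with these priorities might look like. Everything in it (`PeerSubnets`, `select_peers_to_prune`, the count maps) is hypothetical and simplified, not Lighthouse's actual `PeerDB` code, and the final "even out attestation subnets" step is omitted:

```rust
use std::collections::HashMap;

struct PeerSubnets {
    /// Data column (custody) subnets this peer serves.
    data_column_subnets: Vec<u64>,
    /// Whether this peer is on an attestation subnet we still need (e.g. its `min_ttl` hasn't expired).
    needed_attestation_subnets: bool,
    /// Whether this peer is on a sync committee subnet we still need.
    needed_sync_committee_subnets: bool,
}

/// Choose peers to prune following the suggested priority order:
/// 1. Never let a data column subnet drop to or below the per-subnet target.
/// 2. Prefer pruning peers we don't need for attestation subnets.
/// 3. Then prefer pruning peers we don't need for sync committees.
fn select_peers_to_prune(
    peers: &HashMap<u64, PeerSubnets>,        // peer id -> subnet info
    column_peer_counts: &HashMap<u64, usize>, // data column subnet -> current peer count
    column_target: usize,                     // minimum peers to keep per column subnet
    mut to_remove: usize,                     // how many peers we want to prune
) -> Vec<u64> {
    let mut counts = column_peer_counts.clone();
    let mut pruned = Vec::new();

    // Order candidates so peers needed for neither attestation nor sync committee
    // subnets are considered first, and attestation-needed peers are considered last.
    let mut candidates: Vec<(&u64, &PeerSubnets)> = peers.iter().collect();
    candidates.sort_by_key(|(_, info)| {
        (info.needed_attestation_subnets, info.needed_sync_committee_subnets)
    });

    for (peer_id, info) in candidates {
        if to_remove == 0 {
            break;
        }
        // Priority 1: skip this peer if removing it would push any of its column
        // subnets down to (or below) the target.
        let safe_to_remove = info
            .data_column_subnets
            .iter()
            .all(|subnet| counts.get(subnet).copied().unwrap_or(0) > column_target);
        if !safe_to_remove {
            continue;
        }
        // Account for the removal so later candidates see updated counts.
        for subnet in &info.data_column_subnets {
            if let Some(count) = counts.get_mut(subnet) {
                *count -= 1;
            }
        }
        pruned.push(*peer_id);
        to_remove -= 1;
    }
    pruned
}
```

The relative ordering between (a) and (b) above is a judgment call; the sketch simply prunes sync-committee-needed peers before attestation-needed ones when forced to choose.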
Thanks @AgeManning, yeah I think your suggestion makes sense. I'll make this change.
AgeManning
left a comment
Left some comments
/// This creates a unified structure containing all subnet information for each peer,
/// excluding trusted peers and peers already marked for pruning.
fn build_peer_subnet_info(
Maybe for a future PR: rather than calculating this for every peer every heartbeat, we could just change `PeerInfo` to store these naturally for each peer.
Yeah, good idea; with a higher peer count it makes even more sense. I'll raise an issue for this.
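For the record, a minimal sketch of that idea, assuming a hypothetical `PeerInfo`-like struct rather than Lighthouse's real one: keep the subnet sets cached on the peer record and update them when metadata changes, so the heartbeat just reads them.

```rust
use std::collections::HashSet;

type SubnetId = u64;

/// Hypothetical per-peer record that keeps subnet membership cached, so the
/// heartbeat can read it directly instead of rebuilding a
/// `build_peer_subnet_info`-style summary for every peer on every pass.
#[derive(Default)]
struct PeerInfo {
    attestation_subnets: HashSet<SubnetId>,
    sync_committee_subnets: HashSet<SubnetId>,
    custody_subnets: HashSet<SubnetId>,
}

impl PeerInfo {
    /// Refresh the cached custody subnets when a metadata / ENR update arrives.
    fn update_custody_subnets(&mut self, subnets: impl IntoIterator<Item = SubnetId>) {
        self.custody_subnets = subnets.into_iter().collect();
    }

    /// Cheap read used by the heartbeat's pruning pass.
    fn is_custodying(&self, subnet: SubnetId) -> bool {
        self.custody_subnets.contains(&subnet)
    }
}
```

The trade-off is moving the bookkeeping to the points where peer metadata updates arrive, instead of recomputing the summary for every peer on every heartbeat.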
Co-authored-by: Age Manning <[email protected]>
ackintosh
left a comment
Thanks for this PR, Jimmy! I've left some comments.
AgeManning
left a comment
This looks good to me :)
Co-authored-by: Akihito Nakano <[email protected]>
…ouse into maintain-custody-peers
I just noticed that one of the tests I added in #7915 is incorrect, after it was running flaky for a bit. This PR fixes the scenario and ensures the outcome will always be the same.
Issue Addressed

Closes:
- sigp#7865
- sigp#7855

Changes extracted from earlier PR sigp#7876

Proposed Changes

This PR fixes two main things, with a few other improvements mentioned below:
- Prevent Lighthouse from repeatedly sending `DataColumnByRoot` requests to an unsynced peer, causing lookup sync to get stuck.
- Allow Lighthouse to send discovery requests if there aren't enough **synced** peers in the required sampling subnets. This fixes the stuck-sync scenario where there aren't enough usable peers in a sampling subnet but no discovery is attempted.
- Make peer discovery queries if the custody subnet peer count drops below the minimum threshold.
- Update the peer pruning logic to prioritise a uniform distribution across all data column subnets and avoid pruning sampling peers if the count is below the target threshold (2).
- Check sync status when making discovery requests, to make sure we don't ignore requests if there aren't enough synced peers in the required sampling subnets.
- Optimise some of the `PeerDB` functions checking custody peers.
- Only send lookup requests to peers that are synced or advanced.
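As a rough illustration of the discovery-trigger part of the change, here is a sketch that counts only synced or advanced peers towards the per-subnet threshold and reports which sampling subnets still need a discovery query. The types, `SyncStatus` variants, and constant name are assumptions for the example, not Lighthouse's real API; only the threshold value of 2 comes from the description above:

```rust
use std::collections::HashMap;

type SubnetId = u64;

enum SyncStatus {
    Synced,
    Advanced,
    Unsynced,
}

struct Peer {
    sync_status: SyncStatus,
    custody_subnets: Vec<SubnetId>,
}

/// Target number of synced peers per sampling (custody) subnet; the value 2 is
/// taken from the PR description, the constant name is made up.
const MIN_SYNCED_CUSTODY_PEERS: usize = 2;

/// Return the sampling subnets that should trigger a discovery query because they
/// have fewer than the target number of synced (or advanced) custody peers.
fn subnets_needing_discovery(peers: &[Peer], sampling_subnets: &[SubnetId]) -> Vec<SubnetId> {
    // Only count peers that are usable for lookup requests: synced or advanced.
    let mut synced_counts: HashMap<SubnetId, usize> = HashMap::new();
    for peer in peers {
        if matches!(peer.sync_status, SyncStatus::Synced | SyncStatus::Advanced) {
            for subnet in &peer.custody_subnets {
                *synced_counts.entry(*subnet).or_insert(0) += 1;
            }
        }
    }

    sampling_subnets
        .iter()
        .copied()
        .filter(|subnet| {
            synced_counts.get(subnet).copied().unwrap_or(0) < MIN_SYNCED_CUSTODY_PEERS
        })
        .collect()
}
```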