Fix stuck data column lookups by improving peer selection and retry logic #8005

jimmygchen · 2025-09-04T21:12:53Z

Issue Addressed

Fixes the issue described in #7980 where Lighthouse repeatedly sends DataColumnsByRoot requests to the same peers that return empty responses, causing sync to get stuck.

The root cause was that empty responses were not counted as failures, which led to excessive retries against unresponsive peers.

Proposed Changes

Track per-peer attempts to cap retries against any single peer (MAX_CUSTODY_PEER_ATTEMPTS = 3).
Replace random peer selection with hashing within each lookup, to avoid splitting a lookup into too many small requests and to improve request batching efficiency.
Add a single_block_lookup root span to track all lookups created, and add more debug logs (screenshot not reproduced here).
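To illustrate the first change, here is a minimal sketch of per-peer attempt tracking. It is not the code from this PR: `PeerAttempts`, its methods, and the `u64` stand-in for a peer ID are hypothetical names chosen for the example. Only `MAX_CUSTODY_PEER_ATTEMPTS = 3` comes from the description above. The point it shows is that every request sent counts as an attempt, so a peer that keeps returning empty responses eventually drops out of the candidate set.

```rust
use std::collections::HashMap;

/// Cap taken from the PR description.
const MAX_CUSTODY_PEER_ATTEMPTS: usize = 3;

/// Hypothetical stand-in for a libp2p peer ID.
type PeerId = u64;

/// Hypothetical per-lookup bookkeeping; not Lighthouse's actual type.
#[derive(Default)]
struct PeerAttempts {
    attempts: HashMap<PeerId, usize>,
}

impl PeerAttempts {
    /// Record one request sent to `peer`. This is called for every request,
    /// so an empty response costs an attempt just like an error would.
    fn record_attempt(&mut self, peer: PeerId) {
        *self.attempts.entry(peer).or_insert(0) += 1;
    }

    /// A peer may be asked again only while it is under the cap.
    fn is_eligible(&self, peer: PeerId) -> bool {
        self.attempts.get(&peer).copied().unwrap_or(0) < MAX_CUSTODY_PEER_ATTEMPTS
    }

    /// Keep only the candidates that can still be asked, preserving order.
    fn eligible_peers(&self, candidates: &[PeerId]) -> Vec<PeerId> {
        candidates
            .iter()
            .copied()
            .filter(|p| self.is_eligible(*p))
            .collect()
    }
}
```

With a tracker like this, a lookup would call `record_attempt` when sending each request and filter candidates through `eligible_peers` before choosing the next peer, so retries move on to other custodians once a peer has been tried three times.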


jimmygchen · 2025-09-08T11:26:36Z

It seems like we're splitting lookups into far more requests than needed, which doesn't really help with sync speed.

I'm going to experiment with replacing the random component in peer selection with something deterministic per block, so that we batch as much as possible without ending up selecting the same peer for every block lookup.
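A rough sketch of what "deterministic per block" could look like, under assumptions of my own rather than the PR's actual implementation: `peer_priority`, `assign_columns`, the `u64` IDs, and the use of `DefaultHasher` are all hypothetical. Each peer gets a priority derived from hashing the lookup ID together with the peer ID. Because the priority is fixed within a lookup, columns with overlapping custodians converge on the same peer and can be batched. Because the lookup ID is part of the hash input, the preferred peer can differ from one lookup to the next.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-ins for the real peer and column index types.
type PeerId = u64;
type ColumnIndex = u64;

/// Deterministic per-lookup priority for a peer; lower is preferred.
/// `DefaultHasher::new()` uses fixed keys, so the result is stable within
/// one build (unlike `RandomState`, which is seeded randomly).
fn peer_priority(lookup_id: u64, peer: PeerId) -> u64 {
    let mut hasher = DefaultHasher::new();
    lookup_id.hash(&mut hasher);
    peer.hash(&mut hasher);
    hasher.finish()
}

/// For each column, pick the custodian with the lowest priority, then group
/// columns by chosen peer so each peer receives one batched request.
fn assign_columns(
    lookup_id: u64,
    custodians: &BTreeMap<ColumnIndex, Vec<PeerId>>,
) -> BTreeMap<PeerId, Vec<ColumnIndex>> {
    let mut batches: BTreeMap<PeerId, Vec<ColumnIndex>> = BTreeMap::new();
    for (&column, peers) in custodians {
        let best = peers
            .iter()
            .copied()
            .min_by_key(|&p| peer_priority(lookup_id, p));
        // Columns with no known custodian are skipped in this sketch.
        if let Some(peer) = best {
            batches.entry(peer).or_default().push(column);
        }
    }
    batches
}
```

If three columns share the same set of custodians, this produces a single request to one peer instead of up to three requests to randomly chosen peers.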


jimmygchen · 2025-09-08T23:29:29Z

Interestingly, removing the randomness seems to have caused more issues keeping up with the chain. It seems that spreading requests out to different peers may be a better approach when peers aren't all serving by-root requests well.

jimmygchen · 2025-09-09T02:24:29Z

^ This is not true; it happened because devnet-3 went into non-finality testing mode.

So I think the changes are good. I've compared it with some healthy nodes on devnet-3 and see a similar pattern,

and with some other unhealthy nodes that likely ran into the "stuck" scenario and had to fall back to range sync.

eserilev

This looks great! I just had one thing I wanted to mention, mostly to make sure I wasn't misunderstanding things, plus a tiny nit that doesn't really matter.

beacon_node/network/src/sync/network_context/custody.rs

…ogic (sigp#8005)

Fixes the issue described in sigp#7980 where Lighthouse repeatedly sends `DataColumnsByRoot` requests to the same peers that return empty responses, causing sync to get stuck.

The root cause was we don't count empty responses as failures, leading to excessive retries to unresponsive peers.

- Track per peer attempts to limit retry attempts per peer (`MAX_CUSTODY_PEER_ATTEMPTS = 3`)
- Replaced random peer selection with hashing within each lookup to prevent splitting lookup into too many small requests and improve request batching efficiency.
- Added `single_block_lookup` root span to track all lookups created and added more debug logs:

<img width="1264" height="501" alt="image" src="https://github.com/user-attachments/assets/983629ba-b6d0-41cf-8e93-88a5b96c2f31" />

Co-Authored-By: Jimmy Chen <[email protected]>
Co-Authored-By: Jimmy Chen <[email protected]>

jimmygchen added 3 commits September 4, 2025 23:24

Add logging and spans to single block lookup.

2a73971

Track per peer DataColumnByRoot attempts and avoid retrying the same …

e98f1d9

…peer too many times. Update peer prioritization logic and removed the random factor in prioritization, so we batch as much as possible and not split into too many small requests.

Refactor

b27c4e2

jimmygchen added the work-in-progress PR is a work-in-progress label Sep 4, 2025

jimmygchen and others added 3 commits September 5, 2025 08:45

Merge branch 'unstable' into fix-stuck-lookup-2

9c7e7b2

Add missing span enter call.

65ca88f

Log cleanup

2a8697e

jimmygchen marked this pull request as ready for review September 8, 2025 08:46

jimmygchen requested a review from jxs as a code owner September 8, 2025 08:46

Remove randomness when selecting lookup column peers to reduce batch …

e49827a

…fragmentation.

jimmygchen force-pushed the fix-stuck-lookup-2 branch from 538d174 to e49827a September 8, 2025 13:50

Merge remote-tracking branch 'origin/unstable' into fix-stuck-lookup-2

d7888dc

jimmygchen changed the title ~~Lighthouse repeatedly sending DataColumnsByRoot requests to the same peers that sent us empty responses~~ Fix stuck data column lookups by improving peer selection and retry logic Sep 8, 2025

jimmygchen added ready-for-review The code is ready for review syncing v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky and removed work-in-progress PR is a work-in-progress labels Sep 8, 2025

jimmygchen requested review from dapplion and pawanjay176 September 8, 2025 14:06

jimmygchen requested a review from eserilev September 9, 2025 02:36

eserilev approved these changes Sep 9, 2025

beacon_node/network/src/sync/network_context/custody.rs (resolved)

beacon_node/network/src/sync/network_context/custody.rs (outdated, resolved)

Update comment

e0ca662

jimmygchen added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Sep 9, 2025

mergify bot merged commit ee734d1 into sigp:unstable Sep 9, 2025
37 of 38 checks passed

jimmygchen deleted the fix-stuck-lookup-2 branch September 9, 2025 06:29

jimmygchen mentioned this pull request Sep 23, 2025

Stuck block lookup on nodes running latest unstable #8104

Closed
