Skip to content

Conversation

@jimmygchen
Copy link
Member

@jimmygchen jimmygchen commented Sep 4, 2025

Issue Addressed

Fixes the issue described in #7980 where Lighthouse repeatedly sends DataColumnsByRoot requests to the same peers that return empty responses, causing sync to get stuck.

The root cause was we don't count empty responses as failures, leading to excessive retries to unresponsive peers.

Proposed Changes

  • Track per peer attempts to limit retry attempts per peer (MAX_CUSTODY_PEER_ATTEMPTS = 3)
  • Replaced random peer selection with hashing within each lookup to prevent splitting lookup into too many small requests and improve request batching efficiency.
  • Added single_block_lookup root span to track all lookups created and added more debug logs:
image

…peer too many times. Update peer prioritization logic and removed the random factor in prioritization, so we batch as much as possible and not split into too many small requests.
@jimmygchen jimmygchen added the work-in-progress PR is a work-in-progress label Sep 4, 2025
@jimmygchen jimmygchen marked this pull request as ready for review September 8, 2025 08:46
@jimmygchen jimmygchen requested a review from jxs as a code owner September 8, 2025 08:46
@jimmygchen
Copy link
Member Author

jimmygchen commented Sep 8, 2025

It seems like we're spliting into way too many requests than needed, and doesn't really help with sync speed

image

I'm going to try experiment replacing the random component when selecting peer and use something deterministic at each block, so we batch as much as possible, but without ending up selecting the same peer for every block lookup.

@jimmygchen jimmygchen changed the title Lighthouse repeatedly sending DataColumnsByRoot requests to the same peers that sent us empty responses Fix stuck data column lookups by improving peer selection and retry logic Sep 8, 2025
@jimmygchen jimmygchen added ready-for-review The code is ready for review syncing v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky and removed work-in-progress PR is a work-in-progress labels Sep 8, 2025
@jimmygchen
Copy link
Member Author

Interestingly removing the randomness seems to have caused more issues keeping up with the chain, it seems like the spreading requests out to different peers may be a better approach when peers aren't all serving by root requests well.

@jimmygchen
Copy link
Member Author

^ This is not true, it happened because devnet-3 went into non finality testing mode.

So I think the changes are good, I've compare it with some healthy nodes on devnet-3, and see a similar pattern

image

and some other unhealthy nodes, that likely ran into the "stuck" scenario and had to fallback to range sync:

image

@jimmygchen jimmygchen requested a review from eserilev September 9, 2025 02:36
Copy link
Member

@eserilev eserilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I just had one thing I wanted to mention, mostly to make sure I wasn't misunderstanding things, plus a tiny nit that doesn't really matter.

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Sep 9, 2025
@mergify mergify bot merged commit ee734d1 into sigp:unstable Sep 9, 2025
37 of 38 checks passed
@jimmygchen jimmygchen deleted the fix-stuck-lookup-2 branch September 9, 2025 06:29
PoulavBhowmick03 pushed a commit to PoulavBhowmick03/lighthouse that referenced this pull request Sep 12, 2025
…ogic (sigp#8005)

Fixes the issue described in sigp#7980 where Lighthouse repeatedly sends `DataColumnsByRoot` requests to the same peers that return empty responses, causing sync to get stuck.

The root cause was we don't count empty responses as failures, leading to excessive retries to unresponsive peers.


  - Track per peer attempts to limit retry attempts per peer (`MAX_CUSTODY_PEER_ATTEMPTS = 3`)
- Replaced random peer selection with hashing within each lookup to prevent splitting lookup into too many small requests and improve request batching efficiency.
- Added `single_block_lookup` root span to track all lookups created and added more debug logs:

<img width="1264" height="501" alt="image" src="https://github.com/user-attachments/assets/983629ba-b6d0-41cf-8e93-88a5b96c2f31" />


Co-Authored-By: Jimmy Chen <[email protected]>

Co-Authored-By: Jimmy Chen <[email protected]>
kevaundray pushed a commit to kevaundray/lighthouse that referenced this pull request Sep 13, 2025
…ogic (sigp#8005)

Fixes the issue described in sigp#7980 where Lighthouse repeatedly sends `DataColumnsByRoot` requests to the same peers that return empty responses, causing sync to get stuck.

The root cause was we don't count empty responses as failures, leading to excessive retries to unresponsive peers.


  - Track per peer attempts to limit retry attempts per peer (`MAX_CUSTODY_PEER_ATTEMPTS = 3`)
- Replaced random peer selection with hashing within each lookup to prevent splitting lookup into too many small requests and improve request batching efficiency.
- Added `single_block_lookup` root span to track all lookups created and added more debug logs:

<img width="1264" height="501" alt="image" src="https://github.com/user-attachments/assets/983629ba-b6d0-41cf-8e93-88a5b96c2f31" />


Co-Authored-By: Jimmy Chen <[email protected]>

Co-Authored-By: Jimmy Chen <[email protected]>
jtraglia pushed a commit to jtraglia/lighthouse that referenced this pull request Sep 16, 2025
…ogic (sigp#8005)

Fixes the issue described in sigp#7980 where Lighthouse repeatedly sends `DataColumnsByRoot` requests to the same peers that return empty responses, causing sync to get stuck.

The root cause was we don't count empty responses as failures, leading to excessive retries to unresponsive peers.


  - Track per peer attempts to limit retry attempts per peer (`MAX_CUSTODY_PEER_ATTEMPTS = 3`)
- Replaced random peer selection with hashing within each lookup to prevent splitting lookup into too many small requests and improve request batching efficiency.
- Added `single_block_lookup` root span to track all lookups created and added more debug logs:

<img width="1264" height="501" alt="image" src="https://github.com/user-attachments/assets/983629ba-b6d0-41cf-8e93-88a5b96c2f31" />


Co-Authored-By: Jimmy Chen <[email protected]>

Co-Authored-By: Jimmy Chen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-merge This PR is ready to merge. syncing v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants