
Conversation

@jimmygchen
Member

@jimmygchen jimmygchen commented Aug 22, 2025

Issue Addressed

Addresses #7866.

Proposed Changes

Use Rayon to speed up batch KZG verification during range / backfill sync.

While I was analysing the traces, I also discovered a bug that resulted in only the first 128 columns in a chain segment batch being verified. This PR fixes it, so we might actually observe slower range sync due to more cells being KZG verified.

I've also updated the handling of batch KZG failure to find only the first invalid KZG column when verification fails, as re-verifying every column individually gets very expensive during range/backfill sync.

Additional Info

For gossip batches, a fixed chunk size (128 cells) is used to optimise for more predictable performance (see the sketch after the list below):

  • assuming up to 32 blobs, a full node sampling 8 columns will split cells into up to 2 chunks
  • assuming up to 32 blobs, a supernode sampling all columns will split cells into up to 32 chunks
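
A minimal sketch of this fixed-size chunking, with a hypothetical `verify_chunk` closure standing in for the real batch KZG verification call (not a Lighthouse API):

use rayon::prelude::*;

/// Cells per chunk for gossip batches (the fixed size of 128 mentioned above).
const KZG_VERIFICATION_CHUNK_SIZE: usize = 128;

/// Split the prepared per-cell data into fixed-size chunks and verify each chunk
/// on the rayon pool. `T` stands in for whatever (cell, commitment, proof) tuple
/// the real code prepares.
fn verify_cells_in_chunks<T: Sync>(
    cells: &[T],
    verify_chunk: impl Fn(&[T]) -> Result<(), String> + Send + Sync,
) -> Result<(), String> {
    cells
        .par_chunks(KZG_VERIFICATION_CHUNK_SIZE)
        // try_for_each surfaces the first chunk error and lets rayon stop
        // scheduling further chunks once a failure is seen.
        .try_for_each(|chunk| verify_chunk(chunk))
}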

We also need to be careful using rayon for gossip column processing: the beacon processor can allocate up to num_cpus processing tasks, and if each task uses the global rayon pool (which also has num_cpus threads), we could oversubscribe the CPU by 2x when gossip columns arrive in a burst, although the impact of 2x oversubscription may not be significant.

For range sync batches, I think it's probably fine to use as many available threads as possible, since getting the node to sync is the highest priority task.

For backfill batches, we probably want to avoid over-allocating, as the BN may be processing other higher-priority tasks. For this we may want to implement a scoped rayon pool in the BeaconProcessor (#7719).

UPDATE: created a PR to use a scoped rayon pool; we should probably merge these two together.
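
For reference, a dedicated pool along these lines could keep backfill verification off the global rayon pool. This is only a sketch using std's `available_parallelism`; the pool name and the "half the CPUs" sizing are illustrative, not what the follow-up PR necessarily uses:

use std::sync::LazyLock;
use std::thread::available_parallelism;

use rayon::ThreadPool;

// Dedicated low-priority pool for backfill verification (illustrative sizing).
static LOW_PRIORITY_POOL: LazyLock<ThreadPool> = LazyLock::new(|| {
    let threads = available_parallelism().map_or(1, |n| n.get() / 2).max(1);
    rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .thread_name(|i| format!("low_priority_rayon_{i}"))
        .build()
        .expect("failed to build low-priority rayon pool")
});

// Running verification via `install` makes any nested rayon parallel iterators
// use this pool's threads instead of the global pool.
fn run_backfill_verification<R: Send>(verify: impl FnOnce() -> R + Send) -> R {
    LOW_PRIORITY_POOL.install(verify)
}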

Test Results

image

This is the worst case scenario, when there's at least one invalid column at the end of the epoch: the batch verification (with rayon) took 294ms, but it then found an invalid column and tried to validate all columns individually (> 4000 columns in an epoch), and the whole thing took 12.15s.
image

Note this is after I added the optimisation to short-circuit (if any invalid columns are found, stop validating) - in the worst case it's still pretty bad, like above - it might be worth just verifying columns individually using rayon, instead of aggregating all columns together and then chunking them. I'll implement this - and I think we should just short-circuit whenever we get a failure; eventually all the bad peers are going to be banned.

@jimmygchen jimmygchen added optimization Something to make Lighthouse run more efficiently. das Data Availability Sampling v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky labels Aug 22, 2025
@jimmygchen jimmygchen requested a review from jxs as a code owner August 22, 2025 14:49
@jimmygchen jimmygchen added the ready-for-review The code is ready for review label Aug 22, 2025
@michaelsproul michaelsproul added the bug Something isn't working label Aug 24, 2025
@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Aug 27, 2025
@jimmygchen
Member Author

jimmygchen commented Aug 27, 2025

I'm going to rework / optimise the verify_kzg_for_data_column_list_with_scoring function

/// Complete kzg verification for a list of `DataColumnSidecar`s.
///
/// If there's at least one invalid column, it re-verifies all columns individually to identify the
/// first column that is invalid. This is necessary to attribute fault to the specific peer that
/// sent bad data. The re-verification cost should not be significant. If a peer sends invalid data it
/// will be quickly banned.
pub fn verify_kzg_for_data_column_list_with_scoring<'a, E: EthSpec, I>(
    data_column_iter: I,
    kzg: &'a Kzg,
) -> Result<(), Vec<(ColumnIndex, KzgError)>>
where
    I: Iterator<Item = &'a Arc<DataColumnSidecar<E>>> + Clone,
{
    if verify_kzg_for_data_column_list(data_column_iter.clone(), kzg).is_ok() {
        return Ok(());
    };
    // Find all columns that are invalid and identify by index. If we hit this condition there
    // should be at least one invalid column
    let errors = data_column_iter
        .filter_map(|data_column| {
            if let Err(e) = verify_kzg_for_data_column(data_column.clone(), kzg) {
                Some((data_column.index, e))
            } else {
                None
            }
        })
        .collect::<Vec<_>>();
    Err(errors)
}

The trace screenshots above show that rayon is actually pretty effective - verifying the entire batch with rayon only took 294ms, whereas individually verifying each column took up the remaining 12 seconds.
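
For illustration, one possible shape for the single-pass rework (not the implementation that landed): verify columns individually in parallel and stop at the first failure, reusing the same types and helpers as the function above (this assumes `Kzg` is Sync so it can be shared across rayon workers):

use rayon::prelude::*;

pub fn verify_kzg_for_data_column_list_with_scoring<'a, E: EthSpec, I>(
    data_column_iter: I,
    kzg: &'a Kzg,
) -> Result<(), Vec<(ColumnIndex, KzgError)>>
where
    I: Iterator<Item = &'a Arc<DataColumnSidecar<E>>> + Send,
{
    // find_map_any lets rayon stop scheduling new work once any worker has found
    // an invalid column, giving the short-circuit behaviour discussed above. Only
    // the first failure found is reported here.
    match data_column_iter.par_bridge().find_map_any(|data_column| {
        verify_kzg_for_data_column(data_column.clone(), kzg)
            .err()
            .map(|e| (data_column.index, e))
    }) {
        None => Ok(()),
        Some(first_failure) => Err(vec![first_failure]),
    }
}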

@jimmygchen jimmygchen added work-in-progress PR is a work-in-progress and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Aug 27, 2025
// This is safe from span explosion as we have at most ~32 chunks
// (small batches: 4096/128, large batches: cells/thread_count).
let _span =
tracing::debug_span!("verify_cell_proof_chunk", cells = cell_chunk.len())
Member Author

Need to add parent here - this span is getting orphaned
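
For reference, tracing spans accept an explicit parent, which is one way to keep the chunk spans attached; `batch_span` here is a hypothetical span created outside the rayon closure:

// Hypothetical parent span created once, before fanning out over chunks.
let batch_span = tracing::debug_span!("verify_cell_proof_batch");

// Inside the per-chunk closure: spans created on rayon worker threads don't pick
// up the caller's current span, so attach the parent explicitly.
let _span = tracing::debug_span!(
    parent: &batch_span,
    "verify_cell_proof_chunk",
    cells = cell_chunk.len()
);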

@jimmygchen
Member Author

Thanks @eserilev for offering to help with this 🙏

From our testing so far, it takes about 1-3 seconds to verify each batch without rayon.
With rayon we've seen numbers around 300-400ms for 325 blobs in an epoch (in the happy case).
In unhappy cases, this goes up to 10-12s, so it'd be good to see how much the single-pass verification mentioned above can help.

@eserilev
Member

eserilev commented Aug 28, 2025

tracing

This is probably already known, but without rayon KZG verification takes around 4 seconds during range sync, which is the majority of the work `processing_chain_segment` is doing on unstable at the moment!

@eserilev
Member

tracing

Some metrics for 84c3f5a

With column-based chunking we're still seeing rayon perform pretty well. I haven't come across KZG verification failures yet, but obviously with the second-pass removal it's going to perform much better in the failure case.

@eserilev
Member

eserilev commented Aug 28, 2025

I'm not seeing any issues with the node staying synced while backfill is running. I've been running a supernode on devnet-3 with --subscribe-all-subnets, --import-all-attestations and --always-prepare-payload while backfilling for the last hour or two, and I'm not seeing any metrics that indicate the node is struggling. The beacon processor v2 dashboard looks fine, I'm not seeing any traces that would indicate a problem, and htop shows the lighthouse process averaging about 25% CPU usage and around 6 GB of RAM. I'll leave this running overnight, but so far it seems a scoped rayon pool for backfill might not be necessary.

EDIT: Backfill sync was paused due to missing custody peers, and CPU usage dropped from 25% to 10%.

@jimmygchen
Member Author

Nice!
Are you noticing your node doing any reconstruction? It might be interesting to see if there's any spike overnight, i.e. when chain segment processing and reconstruction happen at exactly the same time.

@eserilev
Member

Reconstruction didn't happen much on my node; only about 1% of total task time was spent on it. I'm going to run an experiment where I tweak some of the reconstruction delay numbers to force reconstruction to happen a bit more and then look at some metrics.

Comment on lines 1042 to 1045
column: vec![
Cell::<E>::default();
E::max_blob_commitments_per_block()
]
Member

@eserilev eserilev Aug 28, 2025

Came across a bug in our test suite that only revealed itself when parallelizing by column index (and also probably with the previous chunking impl as well). With `column: DataColumn::<E>::empty()` we're creating a sidecar with no columns.

When we prepare the columns to be verified here

for cell in &data_column.column {
    cells.push(ssz_cell_to_crypto_cell::<E>(cell).map_err(|e| (col_index, e))?);
    column_indices.push(col_index);
}

The "invalid" sidecars with no columns never enter that for loop, since the column has no cells to iterate over. We end up zipping right before KZG batch verification, so we only try verifying 128 columns instead of 256 (since the column list ends up being half the size of the commitments and proofs lists).

I've fixed the issue by adding a column of empty cells to the invalid sidecar. But I'm thinking we might want to add an additional check that `columns.len() == proofs.len() == commitments.len()` before we zip and verify?

@eserilev
Member

I've made the actual fix in this commit da24e3a

The issue was with zip. We just need to check that the lengths of the different parts of the column match before verification. I've made the check in two places: first in verify_data_columns, so that this type of error is always attributable; and second in verify_cell_proof_batch. It's the same check, but I think it's a good idea to have it there as well to prevent any footguns. The check in verify_cell_proof_batch alone isn't attributable to a peer, so we should always go through verify_data_columns. Also, the ef tests use verify_cell_proof_batch directly, so we need the check there anyway, and we don't care about attributability in those test cases.
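
A minimal sketch of the kind of check described, with a placeholder error type (the real code presumably returns its own error variants):

/// Fail loudly if the per-column cells, commitments and proofs don't line up,
/// since `zip` would otherwise silently truncate to the shortest list (which is
/// how only 128 of 256 columns ended up being verified).
fn check_batch_lengths(
    cells_len: usize,
    commitments_len: usize,
    proofs_len: usize,
) -> Result<(), String> {
    if cells_len != commitments_len || cells_len != proofs_len {
        return Err(format!(
            "length mismatch: {cells_len} cells, {commitments_len} commitments, {proofs_len} proofs"
        ));
    }
    Ok(())
}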

@eserilev eserilev added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Aug 28, 2025
Member

@pawanjay176 pawanjay176 left a comment

The changes in verify_cell_proof_batch look good to me.

I'm happy to merge this now, but we should merge the scoped rayon PR before making a release imo.

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Aug 29, 2025
@jimmygchen
Member Author

Let's go! 🚀 thanks @eserilev

@mergify mergify bot merged commit a134d43 into sigp:unstable Aug 29, 2025
37 checks passed
jtraglia pushed a commit to jtraglia/lighthouse that referenced this pull request Sep 16, 2025
Addresses sigp#7866.
mergify bot pushed a commit that referenced this pull request Sep 18, 2025
Part of #7866

- Continuation of #7921

In the above PR, we enabled rayon for batch KZG verification in chain segment processing. However, using the global rayon thread pool for backfill is likely to create resource contention with higher-priority beacon processor work.


This PR introduces a dedicated low-priority rayon thread pool `LOW_PRIORITY_RAYON_POOL` and uses it for processing backfill chain segments.

This prevents backfill KZG verification from using the global rayon thread pool and competing with high-priority beacon processor tasks for CPU resources.

However, this PR by itself doesn't prevent CPU oversubscription because other tasks could still fill up the global rayon thread pool, and having an extra thread pool could make things worse. To address this we need the beacon processor to coordinate total CPU allocation across all tasks, which is covered in:
- #7789


Co-Authored-By: Jimmy Chen <[email protected]>

Co-Authored-By: Eitan Seri- Levi <[email protected]>

Co-Authored-By: Eitan Seri-Levi <[email protected]>