
Conversation

@jimmygchen
Member

@jimmygchen jimmygchen commented Aug 22, 2025

Issue Addressed

Addresses #7866.

Proposed Changes

Use Rayon to speed up batch KZG verification during range / backfill sync.

While I was analysing the traces, I also discovered a bug that resulted in only the first 128 columns in a chain segment batch being verified. This PR fixes it, so we might actually observe slower range sync due to more cells being KZG verified.

I've also updated the handling of batch KZG failure to find only the first invalid KZG column when verification fails, as re-verifying every column individually gets very expensive during range/backfill sync.

Additional Info

For gossip batches, a fixed chunk size (128 cells) is used to optimise for more predictable performance (see the sketch after the list below):

  • assuming up to 32 blobs, a full node sampling 8 columns will split cells into up to 2 chunks
  • assuming up to 32 blobs, a supernode sampling all columns will split cells into up to 32 chunks
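
A minimal sketch of this fixed-size chunking, with a hypothetical `verify_chunk` closure standing in for the real batch KZG verification call (not a Lighthouse API):

use rayon::prelude::*;

/// Cells per chunk for gossip batches (the fixed size of 128 mentioned above).
const KZG_VERIFICATION_CHUNK_SIZE: usize = 128;

/// Split the prepared per-cell data into fixed-size chunks and verify each chunk
/// on the rayon pool. `T` stands in for whatever (cell, commitment, proof) tuple
/// the real code prepares.
fn verify_cells_in_chunks<T: Sync>(
    cells: &[T],
    verify_chunk: impl Fn(&[T]) -> Result<(), String> + Send + Sync,
) -> Result<(), String> {
    cells
        .par_chunks(KZG_VERIFICATION_CHUNK_SIZE)
        // try_for_each surfaces the first chunk error and lets rayon stop
        // scheduling further chunks once a failure is seen.
        .try_for_each(|chunk| verify_chunk(chunk))
}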

We also need to be careful using rayon for gossip column processing: the beacon processor can allocate up to num_cpus processing tasks, and if each task uses the global rayon pool (which also has num_cpus threads), we could oversubscribe the CPU by 2x when gossip columns arrive in a burst, although the impact of 2x oversubscription may not be significant.

For range sync batches, I think it's probably fine to use as many available threads as possible, since getting the node to sync is the highest priority task.

For backfill batches, we probably want to avoid over-allocating, as the BN may be processing other higher-priority tasks. For this we may want to implement a scoped rayon pool in the BeaconProcessor (#7719).

UPDATE: created a PR to use a scoped rayon pool; we should probably merge these two together.
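
For reference, a dedicated pool along these lines could keep backfill verification off the global rayon pool. This is only a sketch using std's `available_parallelism`; the pool name and the "half the CPUs" sizing are illustrative, not what the follow-up PR necessarily uses:

use std::sync::LazyLock;
use std::thread::available_parallelism;

use rayon::ThreadPool;

// Dedicated low-priority pool for backfill verification (illustrative sizing).
static LOW_PRIORITY_POOL: LazyLock<ThreadPool> = LazyLock::new(|| {
    let threads = available_parallelism().map_or(1, |n| n.get() / 2).max(1);
    rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .thread_name(|i| format!("low_priority_rayon_{i}"))
        .build()
        .expect("failed to build low-priority rayon pool")
});

// Running verification via `install` makes any nested rayon parallel iterators
// use this pool's threads instead of the global pool.
fn run_backfill_verification<R: Send>(verify: impl FnOnce() -> R + Send) -> R {
    LOW_PRIORITY_POOL.install(verify)
}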

Test Results

image

This is the worst case scenario, when there's at least one invalid column at the end of the epoch: the batch verification (with rayon) took 294ms, but it then found an invalid column and tried to validate all columns individually (> 4000 columns in an epoch), and the whole thing took 12.15s.
image

Note this is after I added the optimisation to short-circuit (if any invalid columns are found, stop validating) - in the worst case it's still pretty bad, like above - it might be worth just verifying columns individually using rayon, instead of aggregating all columns together and then chunking them. I'll implement this - and I think we should just short-circuit whenever we get a failure; eventually all the bad peers are going to be banned.

@jimmygchen jimmygchen added optimization Something to make Lighthouse run more efficiently. das Data Availability Sampling v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky labels Aug 22, 2025
@jimmygchen jimmygchen requested a review from jxs as a code owner August 22, 2025 14:49
@jimmygchen jimmygchen added the ready-for-review The code is ready for review label Aug 22, 2025
@michaelsproul michaelsproul added the bug Something isn't working label Aug 24, 2025
@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Aug 27, 2025
@jimmygchen
Member Author

jimmygchen commented Aug 27, 2025

I'm going to rework / optimise the verify_kzg_for_data_column_list_with_scoring function

/// Complete kzg verification for a list of `DataColumnSidecar`s.
///
/// If there's at least one invalid column, it re-verifies all columns individually to identify the
/// first column that is invalid. This is necessary to attribute fault to the specific peer that
/// sent bad data. The re-verification cost should not be significant. If a peer sends invalid data it
/// will be quickly banned.
pub fn verify_kzg_for_data_column_list_with_scoring<'a, E: EthSpec, I>(
    data_column_iter: I,
    kzg: &'a Kzg,
) -> Result<(), Vec<(ColumnIndex, KzgError)>>
where
    I: Iterator<Item = &'a Arc<DataColumnSidecar<E>>> + Clone,
{
    if verify_kzg_for_data_column_list(data_column_iter.clone(), kzg).is_ok() {
        return Ok(());
    };
    // Find all columns that are invalid and identify by index. If we hit this condition there
    // should be at least one invalid column
    let errors = data_column_iter
        .filter_map(|data_column| {
            if let Err(e) = verify_kzg_for_data_column(data_column.clone(), kzg) {
                Some((data_column.index, e))
            } else {
                None
            }
        })
        .collect::<Vec<_>>();
    Err(errors)
}

The trace screenshots above show that rayon is actually pretty effective - verifying the entire batch with rayon only took 294ms, whereas individually verifying each column took up the remaining 12 seconds.
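
For illustration, one possible shape for the single-pass rework (not the implementation that landed): verify columns individually in parallel and stop at the first failure, reusing the same types and helpers as the function above (this assumes `Kzg` is Sync so it can be shared across rayon workers):

use rayon::prelude::*;

pub fn verify_kzg_for_data_column_list_with_scoring<'a, E: EthSpec, I>(
    data_column_iter: I,
    kzg: &'a Kzg,
) -> Result<(), Vec<(ColumnIndex, KzgError)>>
where
    I: Iterator<Item = &'a Arc<DataColumnSidecar<E>>> + Send,
{
    // find_map_any lets rayon stop scheduling new work once any worker has found
    // an invalid column, giving the short-circuit behaviour discussed above. Only
    // the first failure found is reported here.
    match data_column_iter.par_bridge().find_map_any(|data_column| {
        verify_kzg_for_data_column(data_column.clone(), kzg)
            .err()
            .map(|e| (data_column.index, e))
    }) {
        None => Ok(()),
        Some(first_failure) => Err(vec![first_failure]),
    }
}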

@jimmygchen jimmygchen added work-in-progress PR is a work-in-progress and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Aug 27, 2025
// This is safe from span explosion as we have at most ~32 chunks
// (small batches: 4096/128, large batches: cells/thread_count).
let _span =
tracing::debug_span!("verify_cell_proof_chunk", cells = cell_chunk.len())
Member Author

Need to add parent here - this span is getting orphaned
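
For reference, tracing spans accept an explicit parent, which is one way to keep the chunk spans attached; `batch_span` here is a hypothetical span created outside the rayon closure:

// Hypothetical parent span created once, before fanning out over chunks.
let batch_span = tracing::debug_span!("verify_cell_proof_batch");

// Inside the per-chunk closure: spans created on rayon worker threads don't pick
// up the caller's current span, so attach the parent explicitly.
let _span = tracing::debug_span!(
    parent: &batch_span,
    "verify_cell_proof_chunk",
    cells = cell_chunk.len()
);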

@jimmygchen
Member Author

Thanks @eserilev for offering to help with this 🙏

From our testing so far, it takes about 1-3 seconds to verify each batch without rayon.
With rayon we've seen numbers around 300-400ms for 325 blobs in an epoch (in the happy case).
In unhappy cases, this goes up to 10-12s, so it'd be good to see how much the single-pass verification mentioned above can help.

@eserilev
Member

eserilev commented Aug 28, 2025

tracing

This is probably already known, but without rayon KZG verification takes around 4 seconds during range sync, which is the majority of the work `processing_chain_segment` is doing on unstable at the moment!

@eserilev
Member

tracing

Some metrics for 84c3f5a

With column-based chunking we're still seeing rayon perform pretty well. I haven't come across KZG verification failures yet, but obviously with the second-pass removal it's going to perform much better in the failure case.

@eserilev
Member

eserilev commented Aug 28, 2025

I'm not seeing any issues with the node staying synced while backfill is running. I've been running a supernode on devnet-3 with --subscribe-all-subnets, --import-all-attestations and --always-prepare-payload while backfilling for the last hour or two, and I'm not seeing any metrics that indicate the node is struggling. The beacon processor v2 dashboard looks fine, I'm not seeing any traces that would indicate a problem, and htop shows the lighthouse process averaging about 25% CPU usage and around 6 GB of RAM. I'll leave this running overnight, but so far it seems a scoped rayon pool for backfill might not be necessary.

EDIT: Backfill sync was paused due to missing custody peers, and CPU usage dropped from 25% to 10%.

@jimmygchen
Member Author

Nice!
Are you noticing your node doing any reconstruction? It might be interesting to see if there's any spike overnight, i.e. when chain segment processing and reconstruction happen at exactly the same time.

@eserilev
Member

Reconstruction didn't happen much on my node; only about 1% of total task time was spent on it. I'm going to run an experiment where I tweak some of the reconstruction delay numbers to force reconstruction to happen a bit more and then look at some metrics.

Comment on lines 1042 to 1045
column: vec![
Cell::<E>::default();
E::max_blob_commitments_per_block()
]
Member

@eserilev eserilev Aug 28, 2025

Came across a bug in our test suite that only revealed itself when parallelizing by column index (and also probably with the previous chunking impl as well). With `column: DataColumn::<E>::empty()` we're creating a sidecar with no columns.

When we prepare the columns to be verified here

for cell in &data_column.column {
    cells.push(ssz_cell_to_crypto_cell::<E>(cell).map_err(|e| (col_index, e))?);
    column_indices.push(col_index);
}

The "invalid" sidecars with no columns never enter that for loop, since the column has no cells to iterate over. We end up zipping right before KZG batch verification, so we only try verifying 128 columns instead of 256 (since the column list ends up being half the size of the commitments and proofs lists).

I've fixed the issue by adding a column of empty cells to the invalid sidecar. But I'm thinking we might want to add an additional check that `columns.len() == proofs.len() == commitments.len()` before we zip and verify?

@eserilev
Member

I've made the actual fix in this commit da24e3a

The issue was with zip. We just need to check that the lengths of the different parts of the column match before verification. I've made the check in two places: first in verify_data_columns, so that this type of error is always attributable; and second in verify_cell_proof_batch. It's the same check, but I think it's a good idea to have it there as well to prevent any footguns. The check in verify_cell_proof_batch alone isn't attributable to a peer, so we should always go through verify_data_columns. Also, the ef tests use verify_cell_proof_batch directly, so we need the check there anyway, and we don't care about attributability in those test cases.
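
A minimal sketch of the kind of check described, with a placeholder error type (the real code presumably returns its own error variants):

/// Fail loudly if the per-column cells, commitments and proofs don't line up,
/// since `zip` would otherwise silently truncate to the shortest list (which is
/// how only 128 of 256 columns ended up being verified).
fn check_batch_lengths(
    cells_len: usize,
    commitments_len: usize,
    proofs_len: usize,
) -> Result<(), String> {
    if cells_len != commitments_len || cells_len != proofs_len {
        return Err(format!(
            "length mismatch: {cells_len} cells, {commitments_len} commitments, {proofs_len} proofs"
        ));
    }
    Ok(())
}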

@eserilev eserilev added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Aug 28, 2025
Member

@pawanjay176 pawanjay176 left a comment

The changes in verify_cell_proof_batch look good to me.

I'm happy to merge this now, but we should merge the scoped rayon PR before making a release imo.

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Aug 29, 2025
@jimmygchen
Member Author

Let's go! 🚀 thanks @eserilev

@mergify mergify bot merged commit a134d43 into sigp:unstable Aug 29, 2025
37 checks passed
jtraglia pushed a commit to jtraglia/lighthouse that referenced this pull request Sep 16, 2025
Addresses sigp#7866.
mergify bot pushed a commit that referenced this pull request Sep 18, 2025
Part of #7866

- Continuation of #7921

In the above PR, we enabled rayon for batch KZG verification in chain segment processing. However, using the global rayon thread pool for backfill is likely to create resource contention with higher-priority beacon processor work.


This PR introduces a dedicated low-priority rayon thread pool `LOW_PRIORITY_RAYON_POOL` and uses it for processing backfill chain segments.

This prevents backfill KZG verification from using the global rayon thread pool and competing with high-priority beacon processor tasks for CPU resources.

However, this PR by itself doesn't prevent CPU oversubscription because other tasks could still fill up the global rayon thread pool, and having an extra thread pool could make things worse. To address this we need the beacon processor to coordinate total CPU allocation across all tasks, which is covered in:
- #7789


Co-Authored-By: Jimmy Chen <[email protected]>

Co-Authored-By: Eitan Seri- Levi <[email protected]>

Co-Authored-By: Eitan Seri-Levi <[email protected]>