Use rayon to speed up batch KZG verification
#7921
Conversation
… Remove unused functions.
69c1f83 to 60dd014
I'm going to rework / optimise lighthouse/beacon_node/beacon_chain/src/data_column_verification.rs lines 445 to 475 (in b4704ea).
The traces screenshot above shows that rayon is actually pretty effective - verifying the entire batch with rayon only took 294ms, whereas individually verifying each column took up the remaining 12 seconds.
crypto/kzg/src/lib.rs
Outdated
// This is safe from span explosion as we have at most ~32 chunks
// (small batches: 4096/128, large batches: cells/thread_count).
let _span =
    tracing::debug_span!("verify_cell_proof_chunk", cells = cell_chunk.len())
Need to add parent here - this span is getting orphaned
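For illustration, attaching the parent explicitly could look roughly like the sketch below, assuming the chunk closures run on rayon worker threads; the surrounding function and chunk type are placeholders, not the actual crypto/kzg code.

```rust
use rayon::prelude::*;
use tracing::Span;

fn verify_chunks(chunks: &[Vec<u8>]) {
    // Capture the caller's span before handing work to rayon; spans created on
    // rayon worker threads have no current span, so they would otherwise be orphaned.
    let parent = Span::current();

    chunks.par_iter().for_each(|cell_chunk| {
        // Attach the captured parent so the chunk span nests under the caller.
        let _span = tracing::debug_span!(
            parent: &parent,
            "verify_cell_proof_chunk",
            cells = cell_chunk.len()
        )
        .entered();

        // ... per-chunk KZG verification would go here ...
    });
}
```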
Thanks @eserilev for offering to help with this 🙏 From our testing so far, it takes about 1-3 seconds to verify each batch without rayon.
Some metrics for 84c3f5a: with column-based chunking we're still seeing rayon perform pretty well. I haven't come across KZG verification failures yet, but obviously with the second-pass removal it's going to perform much better in the failure case.
I'm not seeing any issues with a node having trouble staying synced while backfill is running. I've been running a supernode on devnet-3 with --subscribe-all-subnets, --import-all-attestations and --always-prepare-payload while backfilling for the last hour or two, and am not seeing any metrics that indicate the node is struggling. The beacon processor v2 dashboard looks fine, I'm not seeing any traces that would indicate a problem, and htop is showing that the lighthouse process is averaging about 25% CPU usage and around 6 GB of RAM. I'll leave this running overnight, but so far it seems a scoped rayon pool for backfill might not be necessary. EDIT:
Nice!
Reconstruction didn't happen much on my node, only about 1% of total task time was spent on it. I'm going to run an experiment where I tweak some of the reconstruction delay numbers to force reconstruction to happen a bit more and then look at some metrics.
column: vec![
    Cell::<E>::default();
    E::max_blob_commitments_per_block()
]
Came across a bug in our test suite that only revealed itself when parallelizing by column index (and probably with the previous chunking impl as well). With column: DataColumn::<E>::empty() we're creating a sidecar with no columns.
When we prepare the columns to be verified here
lighthouse/beacon_node/beacon_chain/src/kzg_utils.rs
Lines 64 to 67 in 2cc8715
for cell in &data_column.column {
    cells.push(ssz_cell_to_crypto_cell::<E>(cell).map_err(|e| (col_index, e))?);
    column_indices.push(col_index);
}
The "invalid" sidecars with no columns never enter that for loop, since there's no columns to iterate over. We end up zipping right before kzg batch verification so we only try verifying 128 columns instead of 256 (since the column list ends up being half the size of the commitments and proof list)
I've fixed the issue by adding a column of empty cells to the invalid sidecar. But I'm thinking we might want to add an additional check that columns.len() == proofs.len() == commitments.len() before we zip and verify?
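Something along these lines for the extra check - a sketch only, since the surrounding function and error type here are made up rather than the actual kzg_utils code:

```rust
/// Illustrative pre-flight check before zipping columns, commitments and
/// proofs for batch verification; the error type is a placeholder.
fn check_batch_lengths(
    columns: usize,
    commitments: usize,
    proofs: usize,
) -> Result<(), String> {
    if columns != commitments || columns != proofs {
        return Err(format!(
            "batch length mismatch: columns={columns}, commitments={commitments}, proofs={proofs}"
        ));
    }
    Ok(())
}
```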
I've made the actual fix in this commit: da24e3a. The issue was with …
pawanjay176
left a comment
The changes in verify_cell_proof_batch look good to me.
I'm happy to merge this now, but we should merge the scoped rayon PR before making a release imo.
Let's go! 🚀 Thanks @eserilev
Part of #7866 - Continuation of #7921
In the above PR, we enabled rayon for batch KZG verification in chain segment processing. However, using the global rayon thread pool for backfill is likely to create resource contention with higher-priority beacon processor work.
This PR introduces a dedicated low-priority rayon thread pool `LOW_PRIORITY_RAYON_POOL` and uses it for processing backfill chain segments. This prevents backfill KZG verification from using the global rayon thread pool and competing with high-priority beacon processor tasks for CPU resources.
However, this PR by itself doesn't prevent CPU oversubscription because other tasks could still fill up the global rayon thread pool, and having an extra thread pool could make things worse. To address this we need the beacon processor to coordinate total CPU allocation across all tasks, which is covered in:
- #7789
Co-Authored-By: Jimmy Chen <[email protected]>
Co-Authored-By: Eitan Seri- Levi <[email protected]>
Co-Authored-By: Eitan Seri-Levi <[email protected]>
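For illustration, a dedicated pool like that can be built with rayon's ThreadPoolBuilder. This is only a sketch: the thread count, thread naming, and the verification closure are assumptions, not the actual `LOW_PRIORITY_RAYON_POOL` definition from the PR.

```rust
use std::sync::LazyLock;

use rayon::{prelude::*, ThreadPool, ThreadPoolBuilder};

// Illustrative dedicated pool for low-priority work such as backfill KZG
// verification; the real pool may be sized and named differently.
static LOW_PRIORITY_RAYON_POOL: LazyLock<ThreadPool> = LazyLock::new(|| {
    let cpus = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(2);
    ThreadPoolBuilder::new()
        // Leave headroom for higher-priority beacon processor tasks.
        .num_threads((cpus / 2).max(1))
        .thread_name(|i| format!("low_priority_rayon_{i}"))
        .build()
        .expect("failed to build low-priority rayon pool")
});

fn verify_backfill_batch<C: Sync>(
    columns: &[C],
    verify_column: impl Fn(&C) -> bool + Sync,
) -> bool {
    // Run the parallel verification inside the dedicated pool instead of the
    // global rayon pool, so backfill work doesn't compete with other rayon users.
    LOW_PRIORITY_RAYON_POOL.install(|| columns.par_iter().all(|c| verify_column(c)))
}
```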


Issue Addressed
Addresses #7866.
Proposed Changes
Use Rayon to speed up batch KZG verification during range / backfill sync.
While I was analysing the traces, I also discovered a bug that resulted in only the first 128 columns in a chain segment batch being verified. This PR fixes it, so we might actually observe slower range sync due to more cells being KZG verified.
I've also updated the handling of batch KZG failure to only find the first invalid KZG column when verification fails as this gets very expensive during range/backfill sync.
Additional Info
For gossip batches, a fixed size chunk (128) is used to optimise for more predictable performance.
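As a rough sketch of what that fixed-size chunking looks like with rayon - the cell type and the verification callback are stand-ins, not the real crypto/kzg API:

```rust
use rayon::prelude::*;

/// Chunk size mirroring the fixed gossip-batch chunk mentioned above.
const GOSSIP_CHUNK_SIZE: usize = 128;

/// Illustrative chunked verification: split the cells into fixed-size chunks
/// and verify each chunk on the rayon pool, failing if any chunk fails.
fn verify_cells_in_chunks<C: Sync>(
    cells: &[C],
    verify_chunk: impl Fn(&[C]) -> bool + Sync,
) -> bool {
    cells
        .par_chunks(GOSSIP_CHUNK_SIZE)
        .all(|chunk| verify_chunk(chunk))
}
```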
We also need to be careful with using rayon on gossip column processing, because the beacon processor could allocate up to `num_cpus` processing tasks, and if each task uses the rayon global pool (which also has `num_cpus` threads), we could be oversubscribing by 2x when gossip columns arrive in a burst, although the impact of 2x oversubscription may not be significant.
For range sync batches, I think it's probably fine to use as many available threads as possible, since getting the node to sync is the highest priority task.
For backfill batches, we probably want to avoid over-allocating as the BN may be processing other higher-priority tasks. For this we may want to implement a scoped rayon pool in `BeaconProcessor` (#7719). UPDATE: created a PR to use a scoped rayon pool; we should probably merge these two together.
Test Results
This is the worst-case scenario when there's at least one invalid column at the end of the epoch - the batch verification (with rayon) took 294ms, but then it found an invalid column and tried to validate all columns individually (> 4000 columns in an epoch)

and the whole thing took 12.15s
Note this is after I added the optimisation to short-circuit it (if any invalid columns are found, stop validating) - in the worst case it's still pretty bad, like above - might be worth just verifying columns individually using rayon, instead of aggregating all columns together and then chunking them. I'll implement this - and I think we just short-circuit whenever we get a failure; eventually all the bad peers are going to be banned.
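A sketch of that per-column approach, with `verify_column` standing in for the real per-column KZG check rather than the actual Lighthouse function: each column is verified independently on the rayon pool, and rayon's `position_first` both short-circuits outstanding work on failure and reports the left-most invalid column.

```rust
use rayon::prelude::*;

/// Illustrative per-column verification: instead of aggregating all cells and
/// chunking them, verify each column on its own and stop early on the first
/// failure. Returns the index of the left-most invalid column, if any.
fn find_first_invalid_column<C: Sync>(
    columns: &[C],
    verify_column: impl Fn(&C) -> bool + Sync,
) -> Option<usize> {
    columns
        .par_iter()
        .position_first(|column| !verify_column(column))
}
```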