
[CB] Changes to increase max_batch_tokens#46712

Open
remi-or wants to merge 18 commits into main from cb-more-prefill

Conversation

remi-or commented Jun 17, 2026

Collaborator

Summary

This PR aims to improve the performance of continuous batching, especially for prefill batches. To that end, it makes the following changes:

  • rework the cache size estimator: max_batch_tokens now defaults to 8192 and a VRAM-based bound is applied. In effect, prefill batches can now be much larger
  • alongside this behavioral change, the cache size estimator has been rewritten to be more readable and self-contained
  • deprecate the max_cached_graph attribute, which the graph pool made useless: since all graphs share the same pool, the memory footprint of a new graph is negligible. This parameter was in fact hurting performance, because discarding and re-recording graphs led to fragmentation and higher overall memory use
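As a rough illustration of the first bullet, the sketch below combines a default of 8192 with a VRAM-derived limit. It is not the estimator from this PR: the function name, its parameters, and the assumption that the bound acts as a simple upper cap computed from a per-token byte cost are all hypothetical.

```python
DEFAULT_MAX_BATCH_TOKENS = 8192  # default value stated in the PR summary


def bound_max_batch_tokens(free_vram_bytes, bytes_per_batch_token, requested=None):
    """Hypothetical helper: pick a batch token budget that fits in free VRAM.

    Assumption (not from the PR): each batch token costs a fixed number of
    bytes, so the VRAM bound is a plain integer division.
    """
    if bytes_per_batch_token <= 0:
        raise ValueError("bytes_per_batch_token must be positive")
    # Largest token count whose footprint fits in the free memory.
    vram_bound = free_vram_bytes // bytes_per_batch_token
    # Fall back to the default when the user did not request a value.
    target = DEFAULT_MAX_BATCH_TOKENS if requested is None else requested
    # Never return less than one token, never exceed the VRAM bound.
    return max(1, min(target, vram_bound))


GIB = 2**30
MIB = 2**20
roomy = bound_max_batch_tokens(10 * GIB, MIB)  # default wins
tight = bound_max_batch_tokens(1 * GIB, MIB)   # VRAM bound wins
```

With plenty of free memory the default applies; on a constrained device the VRAM bound takes over, which is the behavior the bullet describes at a high level.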

❗ This PR depends on #46587 being merged.

Performance ✅

| label | main acc | new PR acc | main (tok/s) | new PR (tok/s) | diff |
|---|---|---|---|---|---|
| gsm8k_default | 0.825 | 0.822 | 2477.36 | 3301.09 | +33.25% |
| gsm8k_sampling | 0.790 | 0.788 | 1952.49 | 2936.00 | +50.37% |
| gsm8k_compile | 0.825 | 0.823 | 2477.76 | 3444.92 | +39.03% |
| gsm8k_no_fast_decode | 0.825 | 0.822 | 2364.45 | 2348.65 | -0.67% |
| gsm8k_bare_bones | 0.825 | 0.816 | 1886.61 | 1911.51 | +1.32% |
| ifeval_default | 0.457 | 0.445 | 8958.99 | 10163.10 | +13.44% |
| few_blocks | - | - | 666.69 | 674.44 | +1.16% |
| multi_return_seq | - | - | 1626.68 | 1764.44 | +8.47% |
| rollouts_1024 | - | - | 3545.67 | 3609.38 | +1.80% |
| rollouts_2048 | - | - | 3363.86 | 3394.07 | +0.90% |
| rollouts_4096 | - | - | 2958.96 | 2984.04 | +0.85% |
| rollouts_8192 | - | - | 2360.50 | 2362.29 | +0.08% |
| rollouts_16384 | - | - | 1549.78 | 1537.00 | -0.83% |
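For readers checking the numbers, the diff column is consistent with the relative throughput change of the new PR over main. A minimal sketch, where the function name is an illustrative choice and not part of the benchmark tooling:

```python
def relative_diff_percent(main_tok_s, new_tok_s):
    """Relative throughput change of the new PR versus main, in percent."""
    return (new_tok_s - main_tok_s) / main_tok_s * 100.0


# Values taken from the gsm8k_default and gsm8k_sampling rows.
default_diff = round(relative_diff_percent(2477.36, 3301.09), 2)
sampling_diff = round(relative_diff_percent(1952.49, 2936.00), 2)
```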

Large throughput improvements for prefill-bound workloads, up to +50.37% on gsm8k_sampling. There is a small accuracy regression on gsm8k_bare_bones (0.825 to 0.816), because the batch size affects the generated tokens.

Tests ✅

This PR adds a new test that verifies the actual memory footprint matches the expected one.
All tests pass.
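The shape of such a check can be sketched as a tolerance comparison between the estimated and the measured footprint. This is only an illustration, not the test added in this PR: the helper name and the 5% tolerance are assumptions, and a real test would take the measured value from the device (for example via `torch.cuda.max_memory_allocated()`).

```python
def footprint_matches(expected_bytes, measured_bytes, rel_tol=0.05):
    """Hypothetical check: is the measured footprint within rel_tol of the estimate?

    The 5% default tolerance is an assumption chosen for illustration.
    """
    if expected_bytes <= 0:
        raise ValueError("expected_bytes must be positive")
    deviation = abs(measured_bytes - expected_bytes) / expected_bytes
    return deviation <= rel_tol


GIB = 2**30
close_enough = footprint_matches(8 * GIB, int(8.2 * GIB))  # 2.5% over the estimate
too_far = footprint_matches(8 * GIB, 10 * GIB)             # 25% over the estimate
```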

AI Review ✅

Reviewed and addressed.

remi-or requested a review from ArthurZucker June 17, 2026 08:05
remi-or moved this from Backlog to In review in Continuous batching Jun 17, 2026
remi-or self-assigned this Jun 17, 2026