[vllm, checkpoint] feat: add rollout weight sync debug check #6666
le-czs wants to merge 4 commits
Add a debug-only check for the initial trainer-to-rollout weight sync against the source HF checkpoint. The first supported path validates loaded weights for vLLM with tensor parallel size 1, including fused qkv and gate/up projections, and supports an early check-only exit for PPO and fully async flows.
Co-authored-by: Codex <codex@openai.com>
Signed-off-by: le-czs <caozs2@lenovo.com>
Code Review
This pull request introduces a trainer-to-rollout weight synchronization check to verify that backend-loaded rollout weights match the source HuggingFace checkpoint, adding configuration options, documentation, unit tests, and early-exit support. A critical issue was identified in verl/checkpoint_engine/base.py where rollout.check_loaded_weights_equal is called without ray.get, which would cause verification failures on the workers to be silently ignored. Wrapping this call in ray.get is recommended to ensure errors are properly propagated.
Remove configurable tolerance knobs from the weight sync check and use strict tensor equality after dtype/device normalization.
Co-authored-by: Codex <codex@openai.com>
Signed-off-by: le-czs <caozs2@lenovo.com>
Luosuu left a comment
Left two inline comments on correctness issues in the new debug path.
Reject vLLM loaded-weight checks when data parallelism is enabled because collective_rpc only covers one DP shard. Keep the current supported path to TP=1 and DP=1.
Co-authored-by: Codex <codex@openai.com>
Signed-off-by: le-czs <caozs2@lenovo.com>
Include registered hybrid replicas when shutting down the fully async LLM server manager, and deduplicate active hybrid replicas that also appear in rollout_replicas.
Co-authored-by: Codex <codex@openai.com>
Signed-off-by: le-czs <caozs2@lenovo.com>
I think this PR touches too much for a weight sync debug check. It can be broken down into two parts:
- A minimal version that checks weight sync for vLLM (DP=1, TP=1)
- Another PR for check_weight_sync_only and the fully async clean shutdown
Please try to keep each PR minimal and easy to review. Thank you!

What does this PR do?
Addresses #6414.
Adds a checkpoint engine debug mode that checks whether rollout loaded the same weights sent by trainer during the initial trainer-to-rollout sync. The first supported backend path is vLLM with tensor and data parallel sizes 1.
The check compares the source Hugging Face checkpoint state dict with the rollout backend's loaded state dict. It handles vLLM fused qkv_proj and gate_up_proj weights and tied lm_head.weight, and it reports missing, unexpected, and mismatched keys. Value comparison is strict after casting the source checkpoint tensor to the loaded rollout tensor's dtype and device.
This also adds check_weight_sync_only, which exits after the sync check. That lets users validate the weight conversion/update path without running rollout sampling or PPO updates.
This is not duplicating an existing PR. I checked these searches before opening the PR:
Related open PRs are about MTP sleep, FP8 transfer, generic remote backends, or fully async transfer queue. I did not find an open PR that adds a check-only loaded-weight comparison for #6414.
AI assistance was used for implementation and PR preparation. I reviewed the changed lines before submission.
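The comparison described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in verl.checkpoint_engine.weight_sync: nested Python lists stand in for tensors, the helper names (fuse_rows, to_fused_layout, compare_state_dicts) are hypothetical, the fused layout is assumed to be a row-wise concatenation in q/k/v and gate/up order, and tied lm_head.weight handling and dtype/device casting are left out.

```python
from typing import Dict, List

Tensor = List[List[float]]  # stand-in for a 2-D weight tensor (assumption)

# Assumed mapping from fused names to the HF projections they combine.
FUSED_GROUPS = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def fuse_rows(parts: List[Tensor]) -> Tensor:
    """Concatenate weight matrices along the row (output) dimension."""
    fused: Tensor = []
    for part in parts:
        fused.extend(part)
    return fused


def to_fused_layout(hf_state: Dict[str, Tensor]) -> Dict[str, Tensor]:
    """Rewrite HF-style keys into the assumed fused key layout."""
    out: Dict[str, Tensor] = {}
    consumed = set()
    for key in hf_state:
        for fused_name, members in FUSED_GROUPS.items():
            first = members[0]
            if f".{first}." not in key:
                continue
            keys = [key.replace(f".{first}.", f".{m}.") for m in members]
            # Only fuse when every member of the group is present.
            if all(k in hf_state for k in keys):
                fused_key = key.replace(f".{first}.", f".{fused_name}.")
                out[fused_key] = fuse_rows([hf_state[k] for k in keys])
                consumed.update(keys)
    # Everything that was not fused is passed through unchanged.
    for key, value in hf_state.items():
        if key not in consumed:
            out[key] = value
    return out


def compare_state_dicts(expected: Dict[str, Tensor], loaded: Dict[str, Tensor]) -> dict:
    """Strict comparison: report missing, unexpected, and mismatched keys."""
    shared = set(expected) & set(loaded)
    return {
        "checked": len(shared),
        "missing": sorted(set(expected) - set(loaded)),
        "unexpected": sorted(set(loaded) - set(expected)),
        "mismatched": sorted(k for k in shared if expected[k] != loaded[k]),
    }
```

In this sketch, "checked" counts keys present on both sides, which is an assumption about what the reported count means. A tensor-based version would replace the list equality with an exact tensor equality test after casting the source tensor to the loaded tensor's dtype and device.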
Checklist Before Starting
[vllm, checkpoint] feat: add rollout weight sync debug check.
Test
python3 -m pytest tests/checkpoint_engine/test_weight_sync_on_cpu.py tests/experimental/fully_async_policy/test_llm_server_manager_shutdown_on_cpu.py -q
13 passed, 4 warnings
python3 -m py_compile verl/experimental/fully_async_policy/fully_async_rollouter.py tests/experimental/fully_async_policy/test_llm_server_manager_shutdown_on_cpu.py
git diff --check
pre-commit run --all-files --show-diff-on-failure --color=always
weight sync check passed: [[{'checked': 435, 'missing': 0, 'unexpected': 0, 'mismatched': 0}]]
Weight sync check completed. Exit because check_weight_sync_only=True.
API and Usage Example
Example overrides:
- check_weight_sync=True runs the loaded-weight comparison after the initial trainer-to-rollout sync.
- check_weight_sync_only=True requires check_weight_sync=True and exits after the check.
Design & Code Changes
- verl.checkpoint_engine.weight_sync for strict state dict comparison and vLLM fused-weight mapping.
- collective_rpc path only covers one DP shard.
- docs/advance/checkpoint.rst and verl/checkpoint_engine/README.md.
Checklist Before Submitting
pre-commit run --all-files --show-diff-on-failure --color=always.
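
The constraints stated in this PR (check_weight_sync_only requires check_weight_sync, and the check is only supported for vLLM with TP=1 and DP=1) could be enforced with a small validation step along these lines. This is a hypothetical sketch: the WeightSyncCheckConfig dataclass, its parallel-size field names, and validate_weight_sync_check are illustrative and are not verl's actual config schema or API.

```python
from dataclasses import dataclass


@dataclass
class WeightSyncCheckConfig:
    """Hypothetical container for the options discussed in this PR."""

    check_weight_sync: bool = False
    check_weight_sync_only: bool = False
    tensor_parallel_size: int = 1  # assumed field name
    data_parallel_size: int = 1  # assumed field name


def validate_weight_sync_check(cfg: WeightSyncCheckConfig) -> None:
    """Raise ValueError for combinations the debug check does not support."""
    if cfg.check_weight_sync_only and not cfg.check_weight_sync:
        raise ValueError("check_weight_sync_only=True requires check_weight_sync=True")
    if not cfg.check_weight_sync:
        return  # nothing else to validate when the check is disabled
    if cfg.tensor_parallel_size != 1:
        raise ValueError("weight sync check currently supports only TP=1")
    if cfg.data_parallel_size != 1:
        # Per the PR, the collective_rpc path only covers one DP shard.
        raise ValueError("weight sync check currently supports only DP=1")
```

Failing early on unsupported settings keeps the check from reporting a pass that only covers part of the model or a single DP shard.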