[vllm, checkpoint] feat: add rollout weight sync debug check by le-czs · Pull Request #6666 · verl-project/verl

le-czs · 2026-06-09T07:32:28Z

What does this PR do?

Addresses #6414.

Adds a checkpoint engine debug mode that checks whether the rollout backend loaded the same weights the trainer sent during the initial trainer-to-rollout sync. The first supported backend path is vLLM with tensor parallel size 1 and data parallel size 1.

The check compares the source Hugging Face checkpoint state dict with the rollout backend's loaded state dict. It handles vLLM fused qkv_proj and gate_up_proj weights, tied lm_head.weight, and reports missing, unexpected, and mismatched keys. Value comparison is strict after casting the source checkpoint tensor to the loaded rollout tensor dtype/device.
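The comparison described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `fuse_source_weights`, `compare_state_dicts`, and the toy key names are hypothetical, nested Python lists stand in for tensors, and the stacking order (q, k, v rows; gate then up) and the tied lm_head handling are assumptions about the rollout backend.

```python
# Hypothetical sketch: nested lists stand in for weight tensors (one inner list per row).
FUSED_GROUPS = {
    # fused rollout name -> source checkpoint names, in the assumed stacking order
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def fuse_source_weights(source):
    """Rewrite source checkpoint keys into the fused layout assumed for the rollout backend."""
    fused = {}
    consumed = set()
    for key in source:
        for fused_name, parts in FUSED_GROUPS.items():
            if f".{parts[0]}." not in key:
                continue
            part_keys = [key.replace(f".{parts[0]}.", f".{p}.") for p in parts]
            if all(k in source for k in part_keys):
                # Concatenate rows, which mirrors stacking along the output dimension.
                rows = [row for k in part_keys for row in source[k]]
                fused[key.replace(f".{parts[0]}.", f".{fused_name}.")] = rows
                consumed.update(part_keys)
    for key, value in source.items():
        if key not in consumed:
            fused[key] = value
    return fused


def compare_state_dicts(source, loaded, tied_lm_head=False):
    """Report missing, unexpected, and mismatched keys using strict equality."""
    expected = fuse_source_weights(source)
    if tied_lm_head and "lm_head.weight" not in loaded:
        # Assumption: with tied embeddings the backend exposes no separate lm_head.weight.
        expected.pop("lm_head.weight", None)
    shared = [k for k in expected if k in loaded]
    return {
        "checked": len(shared),
        "missing": sorted(k for k in expected if k not in loaded),
        "unexpected": sorted(k for k in loaded if k not in expected),
        "mismatched": sorted(k for k in shared if expected[k] != loaded[k]),
    }


source = {
    "model.embed_tokens.weight": [[0.1, 0.2]],
    "model.layers.0.self_attn.q_proj.weight": [[1.0, 2.0]],
    "model.layers.0.self_attn.k_proj.weight": [[3.0, 4.0]],
    "model.layers.0.self_attn.v_proj.weight": [[5.0, 6.0]],
    "lm_head.weight": [[0.1, 0.2]],
}
loaded = {
    "model.embed_tokens.weight": [[0.1, 0.2]],
    "model.layers.0.self_attn.qkv_proj.weight": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
}
report = compare_state_dicts(source, loaded, tied_lm_head=True)
print(report)
```

Fusing the source side first keeps the comparison itself a plain key-by-key equality test, so a mismatch can be reported under the fused name the rollout backend actually uses.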

This also adds check_weight_sync_only, which exits after the sync check. That lets users validate the weight conversion/update path without running rollout sampling or PPO updates.

This is not duplicating an existing PR. I checked these searches before opening the PR:

Related open PRs are about MTP sleep, FP8 transfer, generic remote backends, or fully async transfer queue. I did not find an open PR that adds a check-only loaded-weight comparison for #6414.

AI assistance was used for implementation and PR preparation. I reviewed the changed lines before submission.

Checklist Before Starting

I have searched for similar PRs to avoid duplicate work. Query links are included above.
I have formatted the PR title as [vllm, checkpoint] feat: add rollout weight sync debug check.

Test

python3 -m pytest tests/checkpoint_engine/test_weight_sync_on_cpu.py tests/experimental/fully_async_policy/test_llm_server_manager_shutdown_on_cpu.py -q
- Result: 13 passed, 4 warnings
python3 -m py_compile verl/experimental/fully_async_policy/fully_async_rollouter.py tests/experimental/fully_async_policy/test_llm_server_manager_shutdown_on_cpu.py
- Result: passed
git diff --check
- Result: passed
pre-commit run --all-files --show-diff-on-failure --color=always
- Result: passed
Manual GPU smoke test: fully async vLLM/NCCL check-only run on Qwen2.5-3B-Instruct, TP=1, DP=1
- Result: weight sync check passed: [[{'checked': 435, 'missing': 0, 'unexpected': 0, 'mismatched': 0}]]
- Result: exited after check with Weight sync check completed. Exit because check_weight_sync_only=True.
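For readers scripting around the smoke test, the reported result could be interpreted along these lines. This is a sketch under assumptions: `sync_check_passed` is a hypothetical helper, and the meaning of the two nesting levels (for example, one outer entry per replica and one summary per worker) is a guess based only on the shape of the output shown above.

```python
def sync_check_passed(results):
    """Return True only if every summary checked at least one key and found no problems."""
    summaries = [summary for group in results for summary in group]
    if not summaries:
        return False  # nothing was checked, so do not report success
    return all(
        s["checked"] > 0 and s["missing"] == 0 and s["unexpected"] == 0 and s["mismatched"] == 0
        for s in summaries
    )


# Result shape copied from the manual GPU smoke test above.
results = [[{"checked": 435, "missing": 0, "unexpected": 0, "mismatched": 0}]]
print(sync_check_passed(results))
```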

API and Usage Example

Example overrides:

actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.data_parallel_size=1 \
actor_rollout_ref.rollout.checkpoint_engine.backend=nccl \
actor_rollout_ref.rollout.checkpoint_engine.check_weight_sync=True \
actor_rollout_ref.rollout.checkpoint_engine.check_weight_sync_only=True

check_weight_sync=True runs the loaded-weight comparison after the initial trainer-to-rollout sync.

check_weight_sync_only=True requires check_weight_sync=True and exits after the check.
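The dependency between the two flags could be validated at config construction time, roughly as below. This is a minimal sketch: the class name `CheckpointEngineCheckConfig` and the error type are hypothetical, and only the two flags and the backend value shown in the overrides above are modeled.

```python
from dataclasses import dataclass


@dataclass
class CheckpointEngineCheckConfig:  # hypothetical name, not the PR's config class
    backend: str = "nccl"
    check_weight_sync: bool = False
    check_weight_sync_only: bool = False

    def __post_init__(self):
        # check-only mode is meaningless unless the check itself is enabled
        if self.check_weight_sync_only and not self.check_weight_sync:
            raise ValueError("check_weight_sync_only=True requires check_weight_sync=True")


valid = CheckpointEngineCheckConfig(check_weight_sync=True, check_weight_sync_only=True)
print(valid)

try:
    CheckpointEngineCheckConfig(check_weight_sync_only=True)
except ValueError as exc:
    print(f"rejected: {exc}")
```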

Design & Code Changes

Adds checkpoint engine config fields for sync checking, mismatch limit, dtype, timeout, and check-only mode.
Adds verl.checkpoint_engine.weight_sync for strict state dict comparison and vLLM fused-weight mapping.
Adds a vLLM server RPC path that returns loaded-weight comparison results from the rollout backend.
Restricts the current vLLM check to TP=1 and DP=1; DP>1 is rejected because the current collective_rpc path only covers one DP shard.
Runs the check after the initial weight sync by default, because the comparison source is the original HF checkpoint.
Adds check-only exits for PPO and fully async flows.
In fully async check-only mode, skips unrelated agent/reward worker startup and shuts down standalone plus registered hybrid rollout server services after the check.
Adds unit tests for config validation, manager check selection, direct state dict comparison, strict mismatch reporting, vLLM fused comparison, unsupported parallelism rejection, missing/unexpected keys, and fully async hybrid shutdown.
Adds docs in docs/advance/checkpoint.rst and verl/checkpoint_engine/README.md.
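The parallelism restriction and the check-only exit listed above can be sketched as follows. Both functions are hypothetical illustrations, not the PR's implementation; in particular the exception type and the string return values are assumptions chosen to keep the example small.

```python
def validate_check_parallelism(tensor_parallel_size: int, data_parallel_size: int) -> None:
    """Reject layouts the check does not cover (only TP=1 and DP=1 are supported)."""
    if tensor_parallel_size != 1:
        raise NotImplementedError(f"weight sync check requires TP=1, got TP={tensor_parallel_size}")
    if data_parallel_size != 1:
        raise NotImplementedError(f"weight sync check requires DP=1, got DP={data_parallel_size}")


def startup(sync_weights, check_weights, check_weight_sync, check_weight_sync_only):
    """Run the initial sync, optionally check it, and say whether training should proceed."""
    sync_weights()
    if check_weight_sync:
        check_weights()  # assumed to raise if the loaded weights differ
        if check_weight_sync_only:
            return "exit_after_check"
    return "continue"


calls = []
validate_check_parallelism(tensor_parallel_size=1, data_parallel_size=1)
outcome = startup(
    sync_weights=lambda: calls.append("sync"),
    check_weights=lambda: calls.append("check"),
    check_weight_sync=True,
    check_weight_sync_only=True,
)
print(outcome, calls)
```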

Checklist Before Submitting

I have read the Contribute Guide and followed the PR process.
I have run pre-commit checks: pre-commit run --all-files --show-diff-on-failure --color=always.
I have added or updated documentation.
I have added tests or explained why tests are not feasible.
I have requested CI in the required channel, if needed.
If this PR changes recipes, I have updated the recipe submodule.

Add a debug-only check for initial trainer-to-rollout weight sync against the source HF checkpoint. The first supported path validates vLLM tensor parallel size 1 loaded weights, including fused qkv and gate/up projections, and supports an early check-only exit for PPO and fully async flows. Co-authored-by: Codex <codex@openai.com> Signed-off-by: le-czs <caozs2@lenovo.com>

gemini-code-assist

Code Review

This pull request introduces a trainer-to-rollout weight synchronization check to verify that backend-loaded rollout weights match the source HuggingFace checkpoint, adding configuration options, documentation, unit tests, and early-exit support. A critical issue was identified in verl/checkpoint_engine/base.py where rollout.check_loaded_weights_equal is called without ray.get, which would cause verification failures on the workers to be silently ignored. Wrapping this call in ray.get is recommended to ensure errors are properly propagated.
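The failure mode described in this review can be illustrated with an analogy. Ray is not used below; the standard library's `concurrent.futures` stands in, with `Future.result()` playing the role that `ray.get` plays for a Ray object reference. The function body is a hypothetical stand-in for a failing worker-side check.

```python
from concurrent.futures import ThreadPoolExecutor


def check_loaded_weights_equal():
    # Stand-in for a worker-side check that detects a mismatch.
    raise RuntimeError("weight mismatch on worker")


with ThreadPoolExecutor(max_workers=1) as pool:
    # Fire-and-forget: the exception is stored on the future and never raised in the caller.
    ignored = pool.submit(check_loaded_weights_equal)

    # Waiting on the result re-raises the worker's exception in the caller.
    checked = pool.submit(check_loaded_weights_equal)
    try:
        checked.result()
        propagated = False
    except RuntimeError as exc:
        propagated = True
        print(f"propagated: {exc}")
```

In both submissions the worker fails, but only the caller that waits on the result ever observes the failure.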


le-czs · 2026-06-09T08:08:26Z

Verification Result

Remove configurable tolerance knobs from the weight sync check and use strict tensor equality after dtype/device normalization. Co-authored-by: Codex <codex@openai.com> Signed-off-by: le-czs <caozs2@lenovo.com>
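Why the cast has to happen before a strict comparison can be shown with a small standard-library analogy. This is not the PR's code: Python floats (float64) stand in for the source checkpoint dtype, `array("f", ...)` stands in for a float32 rollout tensor, and `cast_to_float32` is a hypothetical helper.

```python
from array import array


def cast_to_float32(values):
    """Round Python floats (float64) to float32 precision, analogous to a dtype cast."""
    return list(array("f", values))


source_fp64 = [0.1, 0.25, 1.0 / 3.0]
# What a float32 backend would hold after loading the same values.
loaded_fp32 = list(array("f", [0.1, 0.25, 1.0 / 3.0]))

equal_without_cast = source_fp64 == loaded_fp32
equal_after_cast = cast_to_float32(source_fp64) == loaded_fp32
print(equal_without_cast, equal_after_cast)
```

Once both sides are in the same dtype, exact equality is a meaningful test, so no tolerance parameter is needed for a correct sync.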

Luosuu

Left two inline comments on correctness issues in the new debug path.

Reject vLLM loaded-weight checks when data parallelism is enabled because collective_rpc only covers one DP shard. Keep the current supported path to TP=1 and DP=1. Co-authored-by: Codex <codex@openai.com> Signed-off-by: le-czs <caozs2@lenovo.com>

Include registered hybrid replicas when shutting down the fully async LLM server manager, and deduplicate active hybrid replicas that also appear in rollout_replicas. Co-authored-by: Codex <codex@openai.com> Signed-off-by: le-czs <caozs2@lenovo.com>
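The deduplication described in this commit could look roughly like the sketch below. The `Replica` class and `collect_replicas_for_shutdown` are hypothetical; the sketch assumes replicas are deduplicated by object identity, which may differ from the actual change.

```python
class Replica:  # hypothetical stand-in for a rollout server replica handle
    def __init__(self, name):
        self.name = name


def collect_replicas_for_shutdown(rollout_replicas, hybrid_replicas):
    """Merge both lists, keeping order and dropping replicas that appear twice."""
    seen = set()
    ordered = []
    for replica in [*rollout_replicas, *hybrid_replicas]:
        if id(replica) in seen:
            continue
        seen.add(id(replica))
        ordered.append(replica)
    return ordered


standalone = Replica("standalone-0")
hybrid = Replica("hybrid-0")
# The active hybrid replica is also listed in rollout_replicas.
to_shut_down = collect_replicas_for_shutdown([standalone, hybrid], [hybrid])
print([r.name for r in to_shut_down])
```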

Luosuu

I think this PR touches too much for a weight sync debug check. It can be broken down into two parts:

A minimal version that checks weight sync for vLLM (DP/TP=1)
Another PR for check_weight_sync_only and fully async clean shutdown

Please try to make each PR minimal and easy to review. Thank you!

le-czs requested review from ArronHZG, PeterSH6, chenhaiq, eric-haibin-lin, tongyx361, vermouth1992 and wuxibin89 as code owners June 9, 2026 07:32

gemini-code-assist Bot reviewed Jun 9, 2026

Comment thread verl/checkpoint_engine/base.py

Luosuu reviewed Jun 12, 2026

Comment thread verl/trainer/config/_generated_ppo_trainer.yaml Outdated

[vllm, checkpoint] fix: make weight sync check strict

20774c6

Luosuu reviewed Jun 12, 2026

Comment thread verl/workers/rollout/vllm_rollout/utils.py Outdated

Comment thread verl/workers/rollout/llm_server.py

le-czs added 2 commits June 12, 2026 16:31

Luosuu requested changes Jun 12, 2026

le-czs wants to merge 4 commits into verl-project:main from le-czs:feat-checkpoint-weight-sync-check

2 participants
