[vllm, checkpoint] feat: add rollout weight sync debug check #6666

Open
le-czs wants to merge 4 commits into verl-project:main from le-czs:feat-checkpoint-weight-sync-check

le-czs commented Jun 9, 2026

What does this PR do?

Addresses #6414.

Adds a checkpoint engine debug mode that checks whether the rollout backend loaded the same weights that the trainer sent during the initial trainer-to-rollout sync. The first supported backend path is vLLM with tensor-parallel and data-parallel sizes of 1.

The check compares the source Hugging Face checkpoint state dict with the state dict loaded by the rollout backend. It handles vLLM's fused qkv_proj and gate_up_proj weights and a tied lm_head.weight, and it reports missing, unexpected, and mismatched keys. Values are compared for exact equality after the source checkpoint tensor is cast to the dtype and device of the loaded rollout tensor.
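
The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: it uses plain nested lists as stand-ins for tensors so it runs without torch, the `FUSED` table, `fuse_source`, and `compare` are hypothetical names, and the tied-head handling assumes the embedding key is `model.embed_tokens.weight`. The dtype/device cast is omitted because lists carry neither.

```python
from typing import Dict, List

Matrix = List[List[float]]  # stand-in for a 2-D weight tensor (one inner list per row)

# Hypothetical fusion table: fused suffix -> source suffixes, concatenated row-wise in order.
FUSED = {
    "self_attn.qkv_proj.weight": [
        "self_attn.q_proj.weight",
        "self_attn.k_proj.weight",
        "self_attn.v_proj.weight",
    ],
    "mlp.gate_up_proj.weight": ["mlp.gate_proj.weight", "mlp.up_proj.weight"],
}


def fuse_source(source: Dict[str, Matrix]) -> Dict[str, Matrix]:
    """Rewrite an HF-style state dict into the fused layout used for comparison."""
    fused = dict(source)
    for fused_suffix, parts in FUSED.items():
        first = parts[0]
        prefixes = [k[: -len(first)] for k in source if k.endswith(first)]
        for prefix in prefixes:
            names = [prefix + p for p in parts]
            if not all(n in source for n in names):
                continue  # incomplete group: leave as-is so it shows up in the report
            rows: Matrix = []
            for n in names:
                rows.extend(source[n])
                del fused[n]
            fused[prefix + fused_suffix] = rows
    return fused


def compare(source: Dict[str, Matrix], loaded: Dict[str, Matrix], tied_lm_head: bool = False) -> dict:
    """Report missing, unexpected, and mismatched keys using strict equality."""
    expected = fuse_source(source)
    embed = expected.get("model.embed_tokens.weight")
    if tied_lm_head and embed is not None and "lm_head.weight" in loaded:
        # Assumption: a tied head should equal the embedding matrix.
        expected.setdefault("lm_head.weight", embed)
    missing = sorted(k for k in expected if k not in loaded)
    unexpected = sorted(k for k in loaded if k not in expected)
    mismatched = sorted(k for k in expected if k in loaded and expected[k] != loaded[k])
    return {
        "checked": len(expected) - len(missing),
        "missing": missing,
        "unexpected": unexpected,
        "mismatched": mismatched,
    }
```

Fusing the source side rather than splitting the loaded side keeps the report keyed by the names the rollout backend actually holds, which is one plausible design choice among several.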

This also adds check_weight_sync_only, which exits after the sync check. That lets users validate the weight conversion/update path without running rollout sampling or PPO updates.

This does not duplicate an existing PR. I ran these searches before opening it:

Related open PRs are about MTP sleep, FP8 transfer, generic remote backends, or fully async transfer queue. I did not find an open PR that adds a check-only loaded-weight comparison for #6414.

AI assistance was used for implementation and PR preparation. I reviewed the changed lines before submission.

Checklist Before Starting

  • I have searched for similar PRs to avoid duplicate work. Query links are included above.
  • I have formatted the PR title as [vllm, checkpoint] feat: add rollout weight sync debug check.

Test

  • python3 -m pytest tests/checkpoint_engine/test_weight_sync_on_cpu.py tests/experimental/fully_async_policy/test_llm_server_manager_shutdown_on_cpu.py -q
    • Result: 13 passed, 4 warnings
  • python3 -m py_compile verl/experimental/fully_async_policy/fully_async_rollouter.py tests/experimental/fully_async_policy/test_llm_server_manager_shutdown_on_cpu.py
    • Result: passed
  • git diff --check
    • Result: passed
  • pre-commit run --all-files --show-diff-on-failure --color=always
    • Result: passed
  • Manual GPU smoke test: fully async vLLM/NCCL check-only run on Qwen2.5-3B-Instruct, TP=1, DP=1
    • Result: weight sync check passed: [[{'checked': 435, 'missing': 0, 'unexpected': 0, 'mismatched': 0}]]
    • Result: exited after the check with "Weight sync check completed. Exit because check_weight_sync_only=True."

API and Usage Example

Example overrides:

actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.data_parallel_size=1 \
actor_rollout_ref.rollout.checkpoint_engine.backend=nccl \
actor_rollout_ref.rollout.checkpoint_engine.check_weight_sync=True \
actor_rollout_ref.rollout.checkpoint_engine.check_weight_sync_only=True

check_weight_sync=True runs the loaded-weight comparison after the initial trainer-to-rollout sync.

check_weight_sync_only=True requires check_weight_sync=True and exits after the check.
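
The dependency between the two flags could be enforced at config construction time. The sketch below is only an illustration under assumptions: the class name `WeightSyncCheckConfig` is hypothetical and does not come from the PR, and only the two field names are taken from the overrides above.

```python
from dataclasses import dataclass


@dataclass
class WeightSyncCheckConfig:  # hypothetical class name, for illustration only
    check_weight_sync: bool = False
    check_weight_sync_only: bool = False

    def __post_init__(self) -> None:
        # Check-only mode makes no sense unless the check itself is enabled.
        if self.check_weight_sync_only and not self.check_weight_sync:
            raise ValueError(
                "check_weight_sync_only=True requires check_weight_sync=True"
            )
```

With this shape, `WeightSyncCheckConfig(check_weight_sync=True, check_weight_sync_only=True)` constructs normally, while enabling only `check_weight_sync_only` fails fast before any worker is started.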

Design & Code Changes

  • Adds checkpoint engine config fields for sync checking, mismatch limit, dtype, timeout, and check-only mode.
  • Adds verl.checkpoint_engine.weight_sync for strict state dict comparison and vLLM fused-weight mapping.
  • Adds a vLLM server RPC path that returns loaded-weight comparison results from the rollout backend.
  • Restricts the current vLLM check to TP=1 and DP=1; DP>1 is rejected because the current collective_rpc path only covers one DP shard.
  • Runs the check after the initial weight sync by default, because the comparison source is the original HF checkpoint, which later syncs no longer match once training has updated the weights.
  • Adds check-only exits for PPO and fully async flows.
  • In fully async check-only mode, skips unrelated agent/reward worker startup and shuts down standalone plus registered hybrid rollout server services after the check.
  • Adds unit tests for config validation, manager check selection, direct state dict comparison, strict mismatch reporting, vLLM fused comparison, unsupported parallelism rejection, missing/unexpected keys, and fully async hybrid shutdown.
  • Adds docs in docs/advance/checkpoint.rst and verl/checkpoint_engine/README.md.
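
The TP=1 and DP=1 restriction listed above amounts to a guard that runs before the comparison. The following is a minimal sketch, assuming a hypothetical helper name `ensure_supported_parallelism` and a hypothetical choice of `NotImplementedError`; the PR may structure and word this differently.

```python
def ensure_supported_parallelism(tensor_parallel_size: int, data_parallel_size: int) -> None:
    """Reject configurations the loaded-weight check does not cover (hypothetical helper)."""
    if tensor_parallel_size != 1:
        raise NotImplementedError(
            f"weight sync check supports TP=1 only, got TP={tensor_parallel_size}"
        )
    if data_parallel_size != 1:
        # Per the PR description, the RPC path only reaches one DP shard,
        # so a passing result would not cover the other shards.
        raise NotImplementedError(
            f"weight sync check supports DP=1 only, got DP={data_parallel_size}"
        )
```

Rejecting unsupported layouts outright, rather than checking a subset of shards, avoids reporting a pass that only partially holds.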

Checklist Before Submitting

  • I have read the Contribute Guide and followed the PR process.
  • I have run pre-commit checks: pre-commit run --all-files --show-diff-on-failure --color=always.
  • I have added or updated documentation.
  • I have added tests or explained why tests are not feasible.
  • I have requested CI in the required channel, if needed.
  • If this PR changes recipes, I have updated the recipe submodule.

Add a debug-only check for initial trainer-to-rollout weight sync against the source HF checkpoint. The first supported path validates vLLM tensor parallel size 1 loaded weights, including fused qkv and gate/up projections, and supports an early check-only exit for PPO and fully async flows.

Co-authored-by: Codex <codex@openai.com>

Signed-off-by: le-czs <caozs2@lenovo.com>

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a trainer-to-rollout weight synchronization check to verify that backend-loaded rollout weights match the source HuggingFace checkpoint, adding configuration options, documentation, unit tests, and early-exit support. A critical issue was identified in verl/checkpoint_engine/base.py where rollout.check_loaded_weights_equal is called without ray.get, which would cause verification failures on the workers to be silently ignored. Wrapping this call in ray.get is recommended to ensure errors are properly propagated.
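
The failure mode flagged in this review can be illustrated without Ray. The sketch below is an analogy using the standard library's `concurrent.futures`, not the PR's code: `submit` stands in for a remote call that returns a handle, and `.result()` stands in for `ray.get`. The function body is a hypothetical stub.

```python
from concurrent.futures import ThreadPoolExecutor


def check_loaded_weights_equal() -> None:
    # Hypothetical stand-in for a worker-side check that fails.
    raise RuntimeError("weight mismatch on worker")


with ThreadPoolExecutor(max_workers=1) as pool:
    # Fire-and-forget: the exception stays inside the returned handle.
    ignored = pool.submit(check_loaded_weights_equal)
    silently_passed = True  # reached even though the worker raised

    # Waiting on the handle re-raises the worker's exception in the caller.
    try:
        pool.submit(check_loaded_weights_equal).result()
        propagated = False
    except RuntimeError:
        propagated = True
```

In this analogy the first call "passes" only because nobody asked for its result, which is the same reason an unawaited remote check would hide a mismatch.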

Comment thread verl/checkpoint_engine/base.py

le-czs commented Jun 9, 2026

Verification Result

[Image attached: verification result]

Comment thread verl/trainer/config/_generated_ppo_trainer.yaml Outdated

Remove configurable tolerance knobs from the weight sync check and use strict tensor equality after dtype/device normalization.

Co-authored-by: Codex <codex@openai.com>

Signed-off-by: le-czs <caozs2@lenovo.com>

Luosuu left a comment

Left two inline comments on correctness issues in the new debug path.

Comment thread verl/workers/rollout/vllm_rollout/utils.py Outdated
Comment thread verl/workers/rollout/llm_server.py
le-czs added 2 commits June 12, 2026 16:31
Reject vLLM loaded-weight checks when data parallelism is enabled because collective_rpc only covers one DP shard. Keep the current supported path to TP=1 and DP=1.

Co-authored-by: Codex <codex@openai.com>

Signed-off-by: le-czs <caozs2@lenovo.com>

Include registered hybrid replicas when shutting down the fully async LLM server manager, and deduplicate active hybrid replicas that also appear in rollout_replicas.

Co-authored-by: Codex <codex@openai.com>

Signed-off-by: le-czs <caozs2@lenovo.com>
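
The deduplication described in the last commit message could look roughly like this. It is a minimal sketch under assumptions: `replicas_to_shutdown` is a hypothetical helper name, replicas are treated as opaque objects, and deduplication is by object identity, which may differ from what the PR does.

```python
from typing import Iterable, List


def replicas_to_shutdown(rollout_replicas: Iterable[object], hybrid_replicas: Iterable[object]) -> List[object]:
    """Merge both replica collections, keeping each object once in first-seen order."""
    seen = set()
    ordered: List[object] = []
    for replica in [*rollout_replicas, *hybrid_replicas]:
        if id(replica) in seen:
            continue  # already scheduled for shutdown via the other collection
        seen.add(id(replica))
        ordered.append(replica)
    return ordered
```

Keeping first-seen order means standalone replicas are shut down before hybrid-only ones, and a replica registered in both places is shut down exactly once.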

Luosuu left a comment

I think this PR touches too much for a weight sync debug check. It can be split into two parts:

  1. A minimal version that checks weight sync for vLLM (DP=1, TP=1)
  2. Another PR for check_weight_sync_only and fully async clean shutdown

Please try to keep each PR minimal and easy to review. Thank you!
