[fully_async, trainer] fix: sync optimizer total steps before trainer initialization #6684
mikequan0425 wants to merge 2 commits into
Code Review
This pull request refactors FullyAsyncTrainer to resolve and set the total training steps in the configuration prior to worker initialization. It extracts the configuration-setting logic into _set_total_training_steps_in_config, adds a helper method _resolve_total_training_steps_before_init to compute the steps dynamically, and invokes this resolution during init_workers. There are no review comments, and no additional feedback is provided.
Luosuu left a comment
Found two issues in the pre-init total step calculation that should be addressed before this is safe.
    required_samples = (
        self.config.actor_rollout_ref.actor.ppo_mini_batch_size * self.config.async_training.require_batches
    )
optim.total_training_steps is consumed by the LR scheduler, and the scheduler is stepped on every actor update (_fit_update_actor -> update_actor), not only when parameters are synced. Dividing by trigger_parameter_sync_step makes the schedule finish trigger_parameter_sync_step times too early; if the progress bar/checkpoint version wants sync steps, please keep that separate from the optimizer scheduler steps.
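To make the effect concrete, here is a minimal sketch under stated assumptions: `linear_decay_lr` is a hypothetical linear-decay schedule written for illustration (it is not verl's scheduler), and the step counts are made-up numbers. It compares a schedule sized for every actor update with one whose total was divided by `trigger_parameter_sync_step`.

```python
def linear_decay_lr(step: int, total_steps: int, base_lr: float = 1.0) -> float:
    """Hypothetical linear decay: base_lr at step 0, 0.0 from total_steps onward."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# Illustrative numbers only, not taken from the PR.
actor_updates = 1000             # scheduler is stepped once per actor update
trigger_parameter_sync_step = 4  # actor updates between parameter syncs

correct_total = actor_updates
shrunk_total = actor_updates // trigger_parameter_sync_step  # 250

for step in (0, 125, 250, 500):
    # Columns: step, LR with the correct total, LR with the divided total.
    print(step, linear_decay_lr(step, correct_total), linear_decay_lr(step, shrunk_total))
```

With the divided total, the learning rate in this sketch reaches 0 at update 250 and stays there for the remaining 750 updates, while the correctly sized schedule is still at 0.75 at that point.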
    @@ -266,7 +271,16 @@ def set_total_train_steps(self, total_training_steps):
        except Exception as e:
            print(f"Warning: Could not set total_training_steps in config. Structure missing? Error: {e}")
This uses the raw configured rollout.total_rollout_steps, but the rollouter later computes the effective count as min(config.rollout.total_rollout_steps, len(train_dataloader) * total_epochs) and also supports None. Since the optimizer is already constructed during init_workers, the later set_total_train_steps() call cannot rebuild the scheduler, so dataset-limited or unset configs still get the wrong LR schedule here.
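A minimal sketch of the effective count described above, for illustration only. The function name and parameters are hypothetical, and treating `None` as "limited only by the dataset" is an assumption about how the unset case is handled, not a confirmed detail of the rollouter.

```python
from typing import Optional


def effective_total_rollout_steps(
    configured: Optional[int],  # stands in for config.rollout.total_rollout_steps
    dataloader_len: int,        # stands in for len(train_dataloader)
    total_epochs: int,
) -> int:
    """Hypothetical helper mirroring min(configured, len(train_dataloader) * total_epochs)."""
    dataset_limit = dataloader_len * total_epochs
    if configured is None:
        # Assumption: an unset value falls back to the dataset-limited count.
        return dataset_limit
    return min(configured, dataset_limit)


print(effective_total_rollout_steps(500, 100, 2))   # dataset-limited case
print(effective_total_rollout_steps(150, 100, 2))   # config-limited case
print(effective_total_rollout_steps(None, 100, 2))  # unset case
```

If the pre-init calculation used a value like this instead of the raw configured one, the dataset-limited and unset cases would feed the same count into the optimizer that the rollouter later uses.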
What does this PR do?
In experiments, the learning rate (lr) was always 0. The issue does not occur when the script is launched via main_ppo; it appears only in fully async mode. See #6683 for a detailed description of the problem.
In short, optim.total_training_steps is not assigned correctly when the trainer's optimizer is initialized.
Design & Code Changes
The simplest approach is to pass the value through by configuring actor_rollout_ref.actor.optim.total_training_steps directly. However, the actual number of trainer steps should be calculated as total_rollout_steps / (required_samples * trigger_parameter_sync_step). Therefore, this PR adds a method that assigns the corresponding value to optim.total_training_steps when the trainer is created.
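The formula as stated in this description can be sketched as follows. This is an illustration, not the PR's actual code: the function name is hypothetical, `required_samples` follows the product shown in the diff (ppo_mini_batch_size * require_batches), the use of floor division is an assumption, and the numbers are made up.

```python
def trainer_total_steps(
    total_rollout_steps: int,
    ppo_mini_batch_size: int,
    require_batches: int,
    trigger_parameter_sync_step: int,
) -> int:
    """Hypothetical helper: total_rollout_steps / (required_samples * trigger_parameter_sync_step)."""
    required_samples = ppo_mini_batch_size * require_batches
    # Assumption: the result is rounded down to a whole number of steps.
    return total_rollout_steps // (required_samples * trigger_parameter_sync_step)


# Illustrative values: 4096 // (32 * 2 * 4) = 4096 // 256
steps = trainer_total_steps(
    total_rollout_steps=4096,
    ppo_mini_batch_size=32,
    require_batches=2,
    trigger_parameter_sync_step=4,
)
print(steps)  # 16
```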
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
Separate multiple modules with commas, like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
For breaking changes, add [BREAKING] to the beginning of the title.
Example: [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
Run pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Request CI in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If the PR relates to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.