Hi,
I need some help.
I started two MoE model pretraining tasks with fixed random seeds and identical parameters (e.g., pipeline_parallel_size=2, expert_parallel_size=4, ...), except for virtual_pipeline_size: one task uses vpp_size=1, the other uses vpp_size=2 (interleaved pipeline schedule).
After training both for 100 iterations, I used torch.load() to inspect the optimizer state in the checkpoints.
I found that the master weight tensors (key "param") are nearly identical between the two tasks, but the moment values (exp_avg, exp_avg_sq) differ significantly.
I want to understand why changing the virtual pipeline stage count leads to this discrepancy, or whether it is caused by other runtime factors.
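For reference, this is a minimal sketch of the comparison I ran. The paths and the exact nesting of the state dict below are placeholders (my real checkpoints follow the Megatron layout); the sketch only assumes that each per-parameter entry exposes "param", "exp_avg", and "exp_avg_sq" tensors.

```python
import torch

def compare_key(state_a, state_b, key):
    """Print the max absolute difference for one optimizer key
    ('param', 'exp_avg', or 'exp_avg_sq') across matching entries."""
    max_diff = 0.0
    for idx in state_a:
        ta, tb = state_a[idx][key], state_b[idx][key]
        if ta.shape != tb.shape:
            print(f"  shape mismatch at {idx}: {tuple(ta.shape)} vs {tuple(tb.shape)}")
            continue
        max_diff = max(max_diff, (ta.float() - tb.float()).abs().max().item())
    print(f"{key}: max abs diff = {max_diff:.3e}")

# Placeholder paths and keys; the real layout depends on the checkpoint format.
ckpt_a = torch.load("vpp1/iter_0000100/optim.pt", map_location="cpu")
ckpt_b = torch.load("vpp2/iter_0000100/optim.pt", map_location="cpu")

# Assuming each checkpoint contains a dict of per-parameter state dicts.
for key in ("param", "exp_avg", "exp_avg_sq"):
    compare_key(ckpt_a["optimizer"]["state"], ckpt_b["optimizer"]["state"], key)
```

With this, "param" comes out nearly identical between the two runs, while "exp_avg" and "exp_avg_sq" show large differences.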