Hi,
I need some help.
I started two MoE model pretraining tasks with fixed random seeds and identical parameters (e.g., pipeline_parallel_size=2, expert_parallel_size=4, ...), except for virtual_pipeline_size: one task uses vpp_size=1, the other uses vpp_size=2 (interleaved pipeline schedule).
After training both for 100 iterations, I used torch.load() to inspect the optimizer state in the checkpoints.
I found that the master weight tensors (key "param") are nearly identical between the two tasks, but the moment values (exp_avg, exp_avg_sq) differ significantly.
I want to understand why changing the virtual pipeline stage count leads to this discrepancy, or whether it is caused by other runtime factors.
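For reference, this is a minimal sketch of the comparison I ran. The paths and the exact nesting of the state dict below are placeholders (my real checkpoints follow the Megatron layout); the sketch only assumes that each per-parameter entry exposes "param", "exp_avg", and "exp_avg_sq" tensors.

```python
import torch

def compare_key(state_a, state_b, key):
    """Print the max absolute difference for one optimizer key
    ('param', 'exp_avg', or 'exp_avg_sq') across matching entries."""
    max_diff = 0.0
    for idx in state_a:
        ta, tb = state_a[idx][key], state_b[idx][key]
        if ta.shape != tb.shape:
            print(f"  shape mismatch at {idx}: {tuple(ta.shape)} vs {tuple(tb.shape)}")
            continue
        max_diff = max(max_diff, (ta.float() - tb.float()).abs().max().item())
    print(f"{key}: max abs diff = {max_diff:.3e}")

# Placeholder paths and keys; the real layout depends on the checkpoint format.
ckpt_a = torch.load("vpp1/iter_0000100/optim.pt", map_location="cpu")
ckpt_b = torch.load("vpp2/iter_0000100/optim.pt", map_location="cpu")

# Assuming each checkpoint contains a dict of per-parameter state dicts.
for key in ("param", "exp_avg", "exp_avg_sq"):
    compare_key(ckpt_a["optimizer"]["state"], ckpt_b["optimizer"]["state"], key)
```

With this, "param" comes out nearly identical between the two runs, while "exp_avg" and "exp_avg_sq" show large differences.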