[MoE] DeepEP refactor and fix memory leak during training and inference by shuhuayu · Pull Request #2296 · pytorch/torchtitan

shuhuayu · 2026-01-29T06:56:05Z

1. Simplified the token permutation logic.
2. Updated the handle management so that there is no memory leak during training or inference. Related issue: [DeepEP] How is _handle_cache handled during inference? #2273

When training a 16B DeepSeek V3 model, memory usage kept growing before the fix.

(Image: memory usage before the fix.)

After the fix, memory usage stabilizes.

(Image: memory usage after the fix.)

tianyu-l · 2026-01-29T09:34:21Z

@elfiegg please take a look

elfiegg · 2026-01-29T20:20:56Z

Looks good. Is my understanding correct that this PR mainly implements the following?

- Use torch.argsort to sort the tokens, for performance reasons.
- Save the handle in dispatch's setup_context, and clear the handle cache in combine's forward in inference mode; otherwise clear it in setup_context.
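The argsort-based permutation mentioned in the first point can be sketched as follows. This is a dependency-free illustration, not the code from the PR: `stable_argsort`, `permute_by_expert`, and `unpermute` are hypothetical helper names, and plain Python lists stand in for tensors. In PyTorch, `torch.argsort(expert_ids, stable=True)` would produce the same ordering.

```python
# Toy sketch of argsort-based token permutation (hypothetical helper names).
# Tokens are grouped by the expert they are routed to, then restored afterwards.

def stable_argsort(keys):
    # Indices that sort `keys` in ascending order; ties keep their original order
    # because Python's sorted() is stable.
    return sorted(range(len(keys)), key=keys.__getitem__)

def permute_by_expert(tokens, expert_ids):
    # Reorder tokens so that tokens for the same expert are contiguous.
    order = stable_argsort(expert_ids)
    permuted = [tokens[i] for i in order]
    return permuted, order

def unpermute(permuted, order):
    # Invert the permutation: position `dst` in `permuted` came from index `src`.
    restored = [None] * len(permuted)
    for dst, src in enumerate(order):
        restored[src] = permuted[dst]
    return restored

tokens = ["t0", "t1", "t2", "t3", "t4"]
expert_ids = [2, 0, 1, 0, 2]  # expert assigned to each token (made-up values)

permuted, order = permute_by_expert(tokens, expert_ids)
restored = unpermute(permuted, order)
```

Keeping `order` around is what makes the un-permute step cheap: the same index list drives both directions, so no second sort is needed.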

elfiegg · 2026-01-29T20:22:57Z

FYI @goldhuang to unblock your work

shuhuayu · 2026-01-29T20:26:32Z

> Looks good, is my understanding correct that this PR mainly implements:
>
> - use torch.argsort for sorting the tokens for performance reasons
> - save handle in dispatch's setup_context, and clear handle cache in combine forward if inference mode else clear it in setup_context

Yes. During training, handles are saved in both dispatch_ctx and combine_ctx, and the handle in _handle_cache is cleared in combine_setup_context. During inference, no handles are saved in op_ctx, and _handle_cache is cleared during combine_forward.

elfiegg · 2026-01-29T20:40:16Z

Sounds good, the logic makes total sense to me. Good to know that setup_context won't be called in inference mode.

goldhuang · 2026-01-29T21:31:06Z

@shuhuayu With the changes in the PR, can I train without selective activation checkpointing on deepep? The extra memory cost from SAC is huge. It may not be a good idea to enable SAC in some cases.

elfiegg · 2026-01-29T22:08:48Z

@goldhuang You can set --activation_checkpoint.mode=full and --parallelism.expert_parallel_comm_backend=deepep to turn on full checkpointing with deepep as the all-to-all backend.

goldhuang · 2026-01-29T22:24:54Z

@shuhuayu @elfiegg My point is that the deepep integration in the current main branch also has a memory leak when you use it without SAC (meaning it runs in the fwd-fwd-bwd pattern). You may want to make sure your changes also cover the fwd-fwd-bwd case.
The current main branch can only do fwd-bwd without a leak. Both fwd and fwd-fwd-bwd will cause a leak. I only reported the fwd case earlier.

shuhuayu · 2026-01-29T22:28:54Z

> @shuhuayu With the changes in the PR, can I train without selective activation checkpointing on deepep? The extra memory cost from SAC is huge. It may not be a good idea to enable SAC in some cases.

@goldhuang Thanks for the question.

1. You can use full AC to save memory.
2. Currently, during training, dispatch_op.ctx and combine_op.ctx save the layout metadata (i.e., the handle) regardless of the AC configuration; this should not be large.
3. If we use selective AC with the op option, the deepep ops are included in _op_sac_save_list, so their activations get saved. We can remove them from _op_sac_save_list by commenting out these two lines:

torchtitan/torchtitan/models/llama4/infra/parallelize.py, lines 132 to 133 in cee9482:

    _op_sac_save_list.add(torch.ops.deepep.dispatch.default)
    _op_sac_save_list.add(torch.ops.deepep.combine.default)

I think option 1 saves the most memory, and option 3 saves memory specifically from the deepep communication ops.
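The effect of removing the deepep ops from the save list can be illustrated with a toy policy. This is a sketch only: op names are plain strings chosen for illustration (the real list holds torch op overloads), and `sac_decision` is a hypothetical helper, not torchtitan's policy function.

```python
# Toy selective-AC policy: ops in the save list keep their outputs,
# everything else is recomputed during backward.
# The string names below are illustrative placeholders.
op_sac_save_list = {"aten.mm", "deepep.dispatch", "deepep.combine"}

def sac_decision(op_name, save_list):
    # Hypothetical helper mirroring a save-or-recompute policy.
    return "save" if op_name in save_list else "recompute"

# Equivalent to commenting out the two .add(...) lines for the deepep ops.
without_deepep = op_sac_save_list - {"deepep.dispatch", "deepep.combine"}

before = {op: sac_decision(op, op_sac_save_list) for op in sorted(op_sac_save_list)}
after = {op: sac_decision(op, without_deepep) for op in sorted(op_sac_save_list)}
```

With the deepep entries removed, dispatch and combine are recomputed in backward, which trades extra communication for the memory their saved outputs would have used.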

In my test of a DeepSeek 16B model on 16 H100s (seqlen=4096, bsz=4, fsdp=ep=8, pp=2, attention=sdpa, compile=loss, moe_communication=deepep, no MoE force load balance), the results are:

- selective AC with op: MFU 11.46%, memory 27.5%
- selective AC with op, but recomputing deepep ops (bullet 3 above): MFU 11.5%, memory 27.38%
- full AC: MFU 12.8%, memory 24.5%

So in these tests, the savings from excluding the deepep ops from the SAC save list are not significant at small scale.

shuhuayu · 2026-01-29T22:30:46Z

> @shuhuayu @elfiegg My point is that deepep integration in current main branch also has memory leak when you use it without SAC (meaning it's running in the pattern of fwd-fwd-bwd). You may want to make sure your changes also cover the fwd-fwd-bwd case. The current main branch can only do fwd-bwd without a leak. Both fwd and fwd-fwd-bwd will cause a leak. I only reported the fwd case earlier.

@goldhuang Thanks for pointing this out.

- The fwd case should be fixed if you use torch.inference_mode().
- The fwd-fwd-bwd case, as in full AC, is now covered in this PR and was tested by running the full AC test.

shuhuayu added 8 commits January 28, 2026 16:31

cache clear for ac

69c8549

move cache clean from train to deepep

7c3f180

initial cleanup

5c5df3b

cleanup deepep integration

812eab3

fix lint

a9ced55

simplify token permute

147cccc

inference mode automatic cleanup

a87353b

consisten naming

93de771

shuhuayu requested review from fegin, tianyu-l, wconstab and wwwjn as code owners January 29, 2026 06:56

pytorch-bot Bot added the ciflow/8gpu label Jan 29, 2026

meta-cla Bot added the CLA Signed label Jan 29, 2026

fix lint

8ebdc44

tianyu-l approved these changes Jan 29, 2026

shuhuayu merged commit 808cdf7 into pytorch:main Jan 29, 2026
25 checks passed

shuhuayu mentioned this pull request Feb 4, 2026

[DeepEP] shared_experts cannot overlap with deepep.combine() #2298

Open

shuhuayu merged 9 commits into pytorch:main from shuhuayu:deepep

4 participants
