[Feat] Enable expert parallel for diffusion MoE layers#1323
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 27ac16cebd
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| world_size: int = torch.distributed.get_world_size() | ||
| backend = backend or torch.distributed.get_backend(get_world_group().device_group) | ||
|
|
||
| forward_context = get_forward_context() |
There was a problem hiding this comment.
Gate forward-context lookup to expert-parallel path
initialize_model_parallel now always requires a forward context, but existing direct callers invoke it after init_distributed_environment() without wrapping set_forward_context (for example the distributed tests call initialize_model_parallel(...) directly), so this now raises Forward context is not set before any group initialization. The context lookup should be conditional on the EP-only branch or otherwise optional to preserve prior behavior for non-EP setups.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Adds an “expert parallel” (EP) switch to diffusion inference so MoE-based diffusion transformers (e.g., HunyuanImage3.0) can initialize an EP process group and plumb the flag from user entrypoints/examples into diffusion distributed setup.
Changes:
- Plumbs
enable_expert_parallelthrough diffusion configs, async default stage config creation, worker init, and offline inference examples. - Ensures
OmniDiffusionConfig.tf_model_configis populated from HF config files and adds anis_moehelper for gating EP. - Extends diffusion distributed parallel state to optionally create/destroy an “expert” process group and updates docs to list EP support.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| vllm_omni/entrypoints/omni_diffusion.py | Populate tf_model_config from config.json fallback path. |
| vllm_omni/entrypoints/async_omni_diffusion.py | Same tf_model_config population for async diffusion entrypoint. |
| vllm_omni/entrypoints/async_omni.py | Adds enable_expert_parallel to default diffusion stage config creation. |
| vllm_omni/diffusion/worker/diffusion_worker.py | Propagates EP flag into VllmConfig and diffusion model-parallel init. |
| vllm_omni/diffusion/distributed/parallel_state.py | Adds EP group creation/destruction and introduces forward-context dependency. |
| vllm_omni/diffusion/data.py | Adds enable_expert_parallel field + is_moe property. |
| examples/offline_inference/text_to_video/text_to_video.py | Adds CLI flag and passes EP into DiffusionParallelConfig. |
| examples/offline_inference/text_to_image/text_to_image.py | Adds CLI flag and passes EP into DiffusionParallelConfig. |
| examples/offline_inference/image_to_image/image_edit.py | Adds CLI flag and passes EP into DiffusionParallelConfig. |
| docs/user_guide/diffusion/parallelism_acceleration.md | Documents EP support in the model/parallelism matrix. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if enable_expert_parallel: | ||
| assert od_config.is_moe | ||
| vllm_parallel_state._EP = init_model_parallel_group( | ||
| group_ranks=rank_generator.get_ranks("ep"), | ||
| local_rank=get_world_group().local_rank, | ||
| backend=backend, | ||
| parallel_mode="expert", | ||
| ) | ||
|
|
There was a problem hiding this comment.
enable_expert_parallel adds new behavior in initialize_model_parallel, but there are existing unit tests for parallel group construction under tests/diffusion/distributed/. Please add a test that exercises initialize_model_parallel(enable_expert_parallel=True) and asserts the EP group is created with the expected world size/ranks (and that it is not created when the flag is false).
| """Number of tensor parallel groups.""" | ||
|
|
||
| enable_expert_parallel: bool = False | ||
| """Use expert parallelism instead of tensor parallelism for MoE layers.""" |
There was a problem hiding this comment.
enable_expert_parallel is documented as “Use expert parallelism instead of tensor parallelism for MoE layers,” but this change does not disable TP; it only introduces an EP flag/group in addition to the existing TP setup. Please update the docstring to reflect the actual behavior, or adjust the implementation if EP is truly meant to replace TP for MoE layers.
| """Use expert parallelism instead of tensor parallelism for MoE layers.""" | |
| """Enable expert parallelism for MoE layers in addition to tensor parallelism.""" |
| forward_context = get_forward_context() | ||
| od_config = forward_context.omni_diffusion_config | ||
|
|
There was a problem hiding this comment.
initialize_model_parallel now unconditionally calls get_forward_context(), which asserts if no forward context has been set. There are multiple call sites (including existing unit tests) that call initialize_model_parallel() before entering any set_forward_context(...) scope, so this will break those flows even when enable_expert_parallel=False. Consider gating this lookup behind enable_expert_parallel (and/or using is_forward_context_available()), or pass the needed OmniDiffusionConfig explicitly as a parameter when EP is enabled.
| | **Stable-Diffusion3.5** | `stabilityai/stable-diffusion-3.5` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | ||
| | **FLUX.2-klein** | `black-forest-labs/FLUX.2-klein-4B` | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | | ||
| | **FLUX.1-dev** | `black-forest-labs/FLUX.1-dev` | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | | ||
| | **HunyuanImage3.0** | `tencent/HunyuanImage-3.0`, `tencent/HunyuanImage-3.0-Instruct` | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | |
There was a problem hiding this comment.
For non-MoE models, should I use 'N/A' instead of 'X' for the Expert-Parallel column? Since most models don't have MoE layers, 'X' might be misleading—users could interpret it as a lack of feature support rather than a structural irrelevance.
There was a problem hiding this comment.
@hsliuustc0106 @princepride @lishunyang12 Hi guys, any suggestion about this part?
There was a problem hiding this comment.
N/A is better here -- X suggests the feature was intentionally not supported, N/A makes it clear the dimension doesn't apply to non-MoE architectures.
There was a problem hiding this comment.
N/A is better here -- X suggests the feature was intentionally not supported, N/A makes it clear the dimension doesn't apply to non-MoE architectures.
Thank you. That's better indeed.
| vllm_parallel_state._EP = init_model_parallel_group( | ||
| group_ranks=rank_generator.get_ranks("ep"), | ||
| local_rank=get_world_group().local_rank, | ||
| backend=backend, | ||
| parallel_mode="expert", | ||
| ) |
There was a problem hiding this comment.
vllm_omni/diffusion/distributed/parallel_state.py:#L187 It seems that ep is always 1
There was a problem hiding this comment.
You're right, I misunderstood previous code. Modified.
There was a problem hiding this comment.
Thanks for fixing the EP size. Two remaining issues in the current code:
-
get_forward_context()on line 735 is still called unconditionally — it needs to be inside theif enable_expert_parallel:block (or passed as a parameter). Right now any caller that doesn't set a forward context will crash even when EP is disabled. -
Adding
"ep"to the order string andname_to_sizebreaksgenerate_masked_orthogonal_rank_groupsfor all other tokens.ordered_sizebecomes[tp, sp, pp, cfg, dp, ep]whereep = tp*sp*cfg*dp, so the product no longer equalsworld_size. Theget_ranks("ep")special-case is correct, but the mask-based path used byget_ranks("tp")/get_ranks("cfg")/ etc. will produce invalid rank groups.
Fix: don't add ep to the order string or name_to_size. The get_ranks("ep") method already computes EP groups directly without needing the mask infrastructure.
There was a problem hiding this comment.
Thanks for fixing the EP size. Two remaining issues in the current code:
get_forward_context()on line 735 is still called unconditionally — it needs to be inside theif enable_expert_parallel:block (or passed as a parameter). Right now any caller that doesn't set a forward context will crash even when EP is disabled.- Adding
"ep"to the order string andname_to_sizebreaksgenerate_masked_orthogonal_rank_groupsfor all other tokens.ordered_sizebecomes[tp, sp, pp, cfg, dp, ep]whereep = tp*sp*cfg*dp, so the product no longer equalsworld_size. Theget_ranks("ep")special-case is correct, but the mask-based path used byget_ranks("tp")/get_ranks("cfg")/ etc. will produce invalid rank groups.Fix: don't add
epto the order string orname_to_size. Theget_ranks("ep")method already computes EP groups directly without needing the mask infrastructure.
Sure, moved get_forward_context() into if enable_expert_parallel. Removed unnecessary ep code.
lishunyang12
left a comment
There was a problem hiding this comment.
The EP plumbing is there but the EP group always has world_size=1 since self.ep = 1 is hardcoded -- so this does not actually do expert parallelism yet.
| self.pp = pp | ||
| self.cfg = cfg | ||
| self.dp = dp | ||
| self.ep = 1 # no matter EP enabled, EP stride should always be 1 |
There was a problem hiding this comment.
self.ep = 1 means get_ranks("ep") will always return singleton groups. So even when EP is enabled, every rank ends up in its own EP group of size 1 and no actual expert-parallel communication happens. This is the core issue -- how is the EP world size supposed to be derived? Should it equal tp (reusing the TP ranks for EP), or should there be a separate expert_parallel_size parameter?
| world_size: int = torch.distributed.get_world_size() | ||
| backend = backend or torch.distributed.get_backend(get_world_group().device_group) | ||
|
|
||
| forward_context = get_forward_context() |
There was a problem hiding this comment.
This get_forward_context() call runs unconditionally, even when enable_expert_parallel=False. Any caller that does not set a forward context first (including existing tests) will crash here. Move this inside the if enable_expert_parallel: block below, or pass od_config as a parameter instead of pulling it from global state.
| ) | ||
|
|
||
| if enable_expert_parallel: | ||
| assert od_config.is_moe |
There was a problem hiding this comment.
Bare assert gets stripped under python -O. This should be a ValueError or RuntimeError with a message like "enable_expert_parallel requires a MoE model". Also, should EP be silently ignored for non-MoE models instead of raising?
| cfg_parallel_size, | ||
| data_parallel_size, | ||
| "tp-sp-pp-cfg-dp", | ||
| "tp-sp-pp-cfg-dp-ep", |
There was a problem hiding this comment.
Adding ep to the order string here but keeping self.ep = 1 (line 187) means the rank generator always treats EP as a trivial dimension. If EP is meant to share the TP mesh, the order string change alone will not do it -- you need to actually set self.ep to the desired EP size and adjust self.tp accordingly.
|
|
||
| @property | ||
| def is_moe(self) -> bool: | ||
| if self.tf_model_config.get("num_experts", None): |
There was a problem hiding this comment.
A few issues with is_moe: (1) num_experts can be a list in some configs (per-layer expert counts) -- calling > 0 on a list will throw TypeError. (2) num_experts=1 would return True here, but a single expert is effectively dense. Consider handling the list case and checking > 1 instead of > 0.
| tensor_parallel_size: int = 1 | ||
| """Number of tensor parallel groups.""" | ||
|
|
||
| enable_expert_parallel: bool = False |
There was a problem hiding this comment.
The docstring says "Use expert parallelism instead of tensor parallelism" but the implementation creates an EP group in addition to TP -- TP is never disabled. Which is the intended behavior? If they coexist, the docstring should say so.
There was a problem hiding this comment.
I meant to say in MoE layers, no longer use TP when EP enabled, but other linear layers will continue use TP. Does my expression make users confused?
There was a problem hiding this comment.
That's clearer. Something like "Enable expert parallelism for MoE layers (TP is still used for non-MoE layers)" would remove any ambiguity.
There was a problem hiding this comment.
That's clearer. Something like "Enable expert parallelism for MoE layers (TP is still used for non-MoE layers)" would remove any ambiguity.
Yes, your comment makes it more clearer.
| f" Parallel configuration: tensor_parallel_size={args.tensor_parallel_size}, " | ||
| f"ulysses_degree={args.ulysses_degree}, ring_degree={args.ring_degree}, cfg_parallel_size={args.cfg_parallel_size}, " | ||
| f"vae_patch_parallel_size={args.vae_patch_parallel_size}" | ||
| f"vae_patch_parallel_size={args.vae_patch_parallel_size}, enable_expert_parallel: {args.enable_expert_parallel}." |
There was a problem hiding this comment.
Nit: this uses enable_expert_parallel: (colon + trailing period), while text_to_video.py uses enable_expert_parallel= (equals, no period) and image_edit.py uses yet another variant. Would be nice to keep the format consistent across all three examples.
There was a problem hiding this comment.
Yes you're right, my bad. Modified.
| vllm_parallel_state._TP.destroy() | ||
| vllm_parallel_state._TP = None | ||
|
|
||
| if vllm_parallel_state._EP: |
There was a problem hiding this comment.
Is vllm_parallel_state._EP guaranteed to be defined if initialize_model_parallel was never called with enable_expert_parallel=True? If _EP was never set as an attribute, this will raise AttributeError. You might want getattr(vllm_parallel_state, "_EP", None) or ensure _EP is initialized to None at module level in vllm.
There was a problem hiding this comment.
Yes, since _EP is from vllm, and it's none defaultly if initialize_model_parallel was not called. So here we use if vllm_parallel_state._EP will not raise AttributeError.
There was a problem hiding this comment.
Makes sense, thanks for confirming.
|
@vllm-omni-reviewer |
7a96e93 to
b6d078c
Compare
You're right, I misunderstood previous code. Modified. |
|
Thanks for the updates. Two things still need fixing before this is ready:
See inline comments for details. |
do we expect EP will bring a perf regression? |
No, actually I think EP would bring perf improvement. But this looks weird indeed. I actually saw a warning about MoE |
please open a new issue about this perf regression and wait for rebase 0.17.0 (should be ok tomorrow morning) |
OK, got you. |
lishunyang12
left a comment
There was a problem hiding this comment.
previous concerns addressed, looks good now
| help="Number of GPUs used for VAE patch/tile parallelism (decode).", | ||
| ) | ||
| parser.add_argument( | ||
| "--enable_expert_parallel", |
There was a problem hiding this comment.
| "--enable_expert_parallel", | |
| "--enable-expert-parallel", |
apply to all please, cc @wtomin
There was a problem hiding this comment.
apply to all please, cc @wtomin
cli format change applied to all examples files with enable-expert-parallel option.
30274d2 to
b96b713
Compare
|
|
||
| 6. [HSDP](#hsdp): Hybrid Sharded Data Parallel shards model weights across GPUs using PyTorch FSDP2. This reduces per-GPU memory usage, enabling inference of large models on GPUs with limited memory. | ||
|
|
||
| The following table shows which models are currently supported by parallelism method: |
There was a problem hiding this comment.
please also add EP before this line
There was a problem hiding this comment.
please also add EP before this line
Sure, added.
|
fix precommits |
Signed-off-by: Semmer2 <semmer@live.cn>
Only HunyuanImage3.0 support EP now, choose it for testing. Signed-off-by: Semmer2 <semmer@live.cn>
Sorry, fixed. |
| ) | ||
| parser.add_argument( | ||
| "--enable-expert-parallel", | ||
| action="store_true", |
There was a problem hiding this comment.
Is there any text-to-video model having MoE layers now in vllm-omni?
There was a problem hiding this comment.
Is there any text-to-video model having MoE layers now in vllm-omni?
As far as I know, wan2.2 is a MoE video generation model. But the upstream supported version did not contain any MoE structure for now. I think it's adaption is till WIP.
hsliuustc0106
left a comment
There was a problem hiding this comment.
open a new issue for the remaining to-dos
Added new issue, and mentioned this PR: #1801 |
…1323) Signed-off-by: Semmer2 <semmer@live.cn>
…1323) Signed-off-by: Semmer2 <semmer@live.cn>
…1323) Signed-off-by: Semmer2 <semmer@live.cn>
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Support MoE layers with Expert Parallel in diffusion inference.
Test Plan
test HunyuanImage3.0 with EP and with out EP, and evaluate the output.
with EP:
python examples/offline_inference/text_to_image/text_to_image.py --model /data/HunyuanImage-3.0/ --prompt "A brown and white dog is running on the grass" --output output_with_EP.png --num_inference_steps 50 --guidance_scale 5.0 --tensor_parallel_size 8 --seed 1234 --enable_expert_parallelwithout EP:
python examples/offline_inference/text_to_image/text_to_image.py --model /data/HunyuanImage-3.0/ --prompt "A brown and white dog is running on the grass" --output output_without_EP.png --num_inference_steps 50 --guidance_scale 5.0 --tensor_parallel_size 8 --seed 1234and you can tell if the EP works by cmd output log:
Test Result
Current EP feature may cause performance degradation for now, e2e test result:
Please refer to issue