Tp param level #46290
Merged
Decompose MoE tensor/expert parallelism per review feedback: weight sharding is declared per-parameter, while the experts module entry stays forward-comm only.
- MoEParamShard: parameter-only style wrapping named expert weights as DTensor placeholders (no forward hook). grouped_gemm shards the expert dim and updates module.num_experts to the per-rank local count.
- Register grouped_gemm (Shard(0)), moe_gate_up_colwise (_StridedShard(-2)), moe_gate_up_colwise_alt (_StridedShard(-1)), moe_down_rowwise (Shard(-1)).
- EpRouterParallel (ep_router): forward-only slicing of router outputs to local experts, ported from the original RouterParallel (#39501).
- moe_experts_allreduce is now forward-comm only: strip the baked shard_plan and drop the now-dead shard_plan ctor arg / _moe_shard_plan / shard_parameters override from MoEExpertsParallel; skip _AllReduceBackward on routing weights under EP.
- verify_tp_plan: treat moe_experts_allreduce / ep_router as forward-only.
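To make the expert-dim sharding and router slicing described above concrete, here is a minimal pure-Python sketch. It is not the PR's implementation: the helper names `local_expert_range` and `slice_router_outputs` are hypothetical, and it assumes experts are split into equal contiguous blocks per rank and that slicing remaps global expert indices to local ones.

```python
def local_expert_range(num_experts: int, ep_size: int, ep_rank: int) -> range:
    """Global expert indices owned by one rank (assumes an even, contiguous split)."""
    if num_experts % ep_size != 0:
        raise ValueError("this sketch assumes num_experts is divisible by ep_size")
    per_rank = num_experts // ep_size  # plays the role of the per-rank local num_experts
    return range(ep_rank * per_rank, (ep_rank + 1) * per_rank)


def slice_router_outputs(top_k_index, top_k_weights, owned: range):
    """Keep only (expert, weight) pairs routed to local experts, using local indices."""
    sliced = []
    for experts, weights in zip(top_k_index, top_k_weights):
        sliced.append(
            [(e - owned.start, w) for e, w in zip(experts, weights) if e in owned]
        )
    return sliced


# Hypothetical setup: 8 experts over 4 ranks, viewed from rank 1 (owns experts 2 and 3).
owned = local_expert_range(num_experts=8, ep_size=4, ep_rank=1)
routed = slice_router_outputs(
    top_k_index=[[2, 5], [3, 0]],            # two tokens, top-2 routing
    top_k_weights=[[0.7, 0.3], [0.6, 0.4]],
    owned=owned,
)
print(list(owned), routed)
```

In this toy layout each token keeps only the expert hosted on rank 1; the contributions of the other experts would come from the other ranks and be combined by the forward communication on the experts module.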
Run tensor parallelism in two passes:
- Pass 1 (param-level): walk named_parameters() and, for styles in PARAM_ONLY_STYLES (grouped_gemm, moe_gate_up_colwise[_alt], moe_down_rowwise), shard the parameter directly via shard_parameters(). No forward hook.
- Pass 2 (module-level): the existing named_modules() loop for forward hooks, now skipping PARAM_ONLY_STYLES.

Param sharding runs first so module forward hooks (moe_experts_allreduce) see the already-sharded DTensor params. Also wire the EP-plan fallback so enable_expert_parallel uses model._ep_plan when no explicit plan is passed.
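The two-pass ordering can be sketched without any distributed setup. The snippet below is an illustration only: `match_style` and `apply_two_pass` are hypothetical names, glob matching via `fnmatch` stands in for whatever pattern matching the library actually uses, and the "actions" are recorded as tuples instead of sharding real tensors.

```python
from fnmatch import fnmatchcase

# Styles that act on parameters only (names taken from the commit message above).
PARAM_ONLY_STYLES = {
    "grouped_gemm",
    "moe_gate_up_colwise",
    "moe_gate_up_colwise_alt",
    "moe_down_rowwise",
}


def match_style(name, plan):
    """Return the style of the first plan pattern matching `name`, else None."""
    for pattern, style in plan.items():
        if fnmatchcase(name, pattern):
            return style
    return None


def apply_two_pass(param_names, module_names, plan):
    actions = []
    # Pass 1: parameter-level sharding, no forward hook.
    for name in param_names:
        style = match_style(name, plan)
        if style in PARAM_ONLY_STYLES:
            actions.append(("shard_param", name, style))
    # Pass 2: module-level forward hooks, skipping parameter-only styles.
    for name in module_names:
        style = match_style(name, plan)
        if style is not None and style not in PARAM_ONLY_STYLES:
            actions.append(("hook_module", name, style))
    return actions


plan = {
    "layers.*.mlp.experts.gate_up_proj": "moe_gate_up_colwise",
    "layers.*.mlp.experts.down_proj": "moe_down_rowwise",
    "layers.*.mlp.experts": "moe_experts_allreduce",
}
actions = apply_two_pass(
    param_names=["layers.0.mlp.experts.gate_up_proj", "layers.0.mlp.experts.down_proj"],
    module_names=["layers.0.mlp", "layers.0.mlp.experts"],
    plan=plan,
)
for action in actions:
    print(action)
```

Both parameter actions are emitted before the module hook, which mirrors the stated reason for the ordering: the hook should only ever see parameters that are already sharded.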
…tion
- New tests/distributed/test_moe_tensor_parallel_plan.py: plan resolution, placement expectations for grouped_gemm / moe_gate_up_colwise[_alt] / moe_down_rowwise, gloo distributed integration (EP Shard(0), TP _StridedShard(-2)+Shard(-1), ep_router slicing), and a registry guard that moe_experts_allreduce carries no baked shard plan.
- _verify_tp_sharding: add a two-sided check asserting that every parameter whose plan entry is a weight-sharding style actually comes back as a non-replicate DTensor. The prior check only validated params that happened to be sharded, so a style that gracefully degrades to replicated when unsharded (e.g. MoEExpertsParallel) could pass output-equality while silently running unparallelized.
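The gap closed by the two-sided check can be shown with a small stand-in. This is a sketch under assumptions, not the body of _verify_tp_sharding: `verify_sharding` is a hypothetical helper, sharding state is modelled as a plain boolean per parameter, and "side 1" is assumed to mean that a sharded parameter must have a weight-sharding plan entry.

```python
from __future__ import annotations

# Assumed set of weight-sharding styles for this illustration.
WEIGHT_SHARDING_STYLES = {"grouped_gemm", "moe_gate_up_colwise", "moe_down_rowwise"}


def verify_sharding(plan_styles: dict[str, str], is_sharded: dict[str, bool]) -> list[str]:
    """Return a list of error strings; an empty list means both sides agree."""
    errors = []
    for name, sharded in is_sharded.items():
        style = plan_styles.get(name)
        # Side 1 (what a one-sided check covers): sharded params need a sharding style.
        if sharded and style not in WEIGHT_SHARDING_STYLES:
            errors.append(f"{name}: sharded but plan style is {style!r}")
        # Side 2 (the added direction): a planned sharding style must actually shard.
        if style in WEIGHT_SHARDING_STYLES and not sharded:
            errors.append(f"{name}: plan says {style!r} but parameter is replicated")
    return errors


plan_styles = {
    "layers.0.mlp.experts.gate_up_proj": "moe_gate_up_colwise",
    "layers.0.mlp.experts.down_proj": "moe_down_rowwise",
}
# down_proj silently stayed replicated: outputs would still match a reference run.
observed = {
    "layers.0.mlp.experts.gate_up_proj": True,
    "layers.0.mlp.experts.down_proj": False,
}
errors = verify_sharding(plan_styles, observed)
print(errors)
```

With only side 1, the replicated down_proj would never be inspected and the run would pass; side 2 is what turns the silent fallback into a reported error.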
For every TP/SP plan that sharded experts, declare per-parameter entries:
"layers.*.mlp.experts.gate_up_proj": "moe_gate_up_colwise"
"layers.*.mlp.experts.down_proj": "moe_down_rowwise"
while keeping the forward-only "layers.*.mlp.experts": "moe_experts_allreduce".
This matches the now-empty moe_experts_allreduce shard_plan; sharding is declared in
config at parameter granularity. EP plans already used "grouped_gemm" and are unchanged.
hy_v3 and laguna previously used "packed_colwise" / "rowwise_allreduce" on the 3D expert
*parameters*; those styles are module-level and were silently no-ops on params (the bundled
shard_plan did the work). They now use the param-level moe_gate_up_colwise / moe_down_rowwise
like every other MoE model.
Edited modular files where they own the plan literal; generated configs and inherited
plans (e.g. from qwen3_moe) propagated via modular conversion.
- expert_parallelism.md: describe the param-level decomposition (grouped_gemm, ep_router, moe_experts_allreduce) instead of the removed GroupedGemmParallel class, and note the TP equivalents (moe_gate_up_colwise / moe_down_rowwise).
- weightconverter.md: note that fused expert weights are sharded at parameter granularity by the parallel plan.
Rename registry and plan entries so TP-on-expert sharding is distinct from EP (grouped_gemm) and dense packed_colwise: moe_gate_up_colwise -> moe_tp_gate_up_colwise, moe_down_rowwise -> moe_tp_down_rowwise. Drop unused moe_tp_gate_up_colwise_alt (GPT-OSS-style layouts stay EP-only).
* [distributed] Add resolve_parallel_plan merge helper
  Compose SP/TP dense recipes with an optional EP overlay and strip intra-expert moe_tp_* when expert parallelism is enabled. Add unit tests for training (SP+EP), inference (TP+EP), and TP-only paths.
* [distributed] Wire resolve_parallel_plan into apply_tensor_parallel
  Replace exclusive SP|EP|TP plan selection with merged plans when tp_plan is unset. Add distributed test for TP+EP merged expert sharding.
* [distributed] Use merged plan in tp_plan property and load path
  Expose resolve_parallel_plan via PreTrainedModel.tp_plan and set active_tp_plan during from_pretrained so checkpoint sharding matches the applied layout.
* [distributed] Drop intra-expert moe_tp_* from MoE SP plans
  Expert weight TP under sequence parallelism comes from the EP overlay (grouped_gemm) when enable_expert_parallel is set; keep moe_tp_* only in base_model_tp_plan for TP-only MoE.
* [distributed] Document SP+EP and TP+EP flag combinations
  Update expert_parallelism guide and DistributedConfig docs for merged plans. Export resolve_parallel_plan and extend resolve-plan tests for trimmed SP sources.
* refactor merging plans
* add test sp_ep and tp_ep
* extend verify_tp_plan to verify_tp_sp_ep_plan
* add ep_plan to mixtral and olmoe
* cleaning _accumulate_local_param_grad (#46394)
* remove _accumulate_local_param_grad
* comments
* linting
* fix
* clean _accumulate_local_param_grad
* linting
* cleaning
* cleaning
* fix mellun test because of bug in parsing sp_ep plan with regex
* aea
* Add select_parallel_plan and explicit combo plan config fields
  Introduce base_model_tp_ep_plan / base_model_sp_ep_plan on PreTrainedConfig, select_parallel_plan() with legacy resolve_parallel_plan fallback, and wire apply_tensor_parallel to use the selector. Model post_init tracks _tp_ep_plan and _sp_ep_plan for composite models.
* Add base_model_tp_ep_plan and base_model_sp_ep_plan for Mixtral and Qwen3-MoE
  Define complete inference TP+EP and training SP+EP plans on the pilot MoE configs. Qwen3-MoE expands per-layer entries in _update_parallel_plans. Add plan_utils and golden tests against legacy resolve_parallel_plan merge.
* Add explicit tp_ep / sp_ep plans for remaining MoE models
  Populate combo plans via init_combo_plans() at config init time for MoE configs that still use split tp/sp/ep recipes. Dynamic configs call it after _update_sp_plan(); modular sources updated for generated configuration files.
* Remove resolve_parallel_plan and use explicit combo plan selection
  Delete runtime plan merging; select_parallel_plan now requires a complete combo dict and raises when missing. apply_tensor_parallel uses DistributedConfig flags directly for SP/EP behavior. Drop model._ep_plan aggregation; load-time verification checks the active plan only. Refresh combo plans after MXFP4 quantizer patches.
* Sync modular MoE configs and update expert parallelism docs
  Propagate init_combo_plans from modular sources to generated configuration files and document select_parallel_plan combo lookup in expert_parallelism.md.
* Refactor select_parallel_plan flag lookup for readability
  Use explicit if/elif branches for the SP/EP flag matrix and derive config_attr from plan_attr instead of parallel lookup dicts.
* Write explicit combo parallel plans in MoE configs and remove plan_utils
  Define base_model_tp_ep_plan and base_model_sp_ep_plan directly in each MoE configuration (or via config-time _update_parallel_plans for dynamic models). Delete plan_utils.py and all init_combo_plans / refresh_combo_plans usage.
* Add lm_head entries to _tp_ep_plan and _sp_ep_plan on CausalLM classes
  Explicit combo plan selection no longer merges _sp_plan with _ep_plan, so head-level lm_head rules must live on _tp_ep_plan/_sp_ep_plan directly. Fixes SP+EP training loss shape mismatch under sequence parallelism.
* cleaning
* cleaning
* cleaning
* cleaning
* linting
* add verify tp and fsdp pla aeaea
* revert doc
* cleaning
* check-repository-consistency
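The explicit combo-plan selection that these commits converge on (if/elif over the SP/EP flags, config attribute derived from the plan attribute, error when the combo plan is missing) might look roughly like the sketch below. It is an assumption-laden illustration: `select_plan_attr` and `select_parallel_plan_sketch` are hypothetical names, the boolean flags stand in for the DistributedConfig fields, and the mapping of the non-combo cases to `_sp_plan` / `_tp_plan` is assumed.

```python
def select_plan_attr(sequence_parallel: bool, expert_parallel: bool) -> str:
    """Map the SP/EP flag matrix to a plan attribute name with explicit branches."""
    if sequence_parallel and expert_parallel:
        return "_sp_ep_plan"   # training: SP + EP
    elif sequence_parallel:
        return "_sp_plan"
    elif expert_parallel:
        return "_tp_ep_plan"   # inference: TP + EP
    else:
        return "_tp_plan"


def select_parallel_plan_sketch(model_plans, sequence_parallel=False, expert_parallel=False):
    """Return the complete plan for the active flags; raise if it is not defined."""
    plan_attr = select_plan_attr(sequence_parallel, expert_parallel)
    plan = model_plans.get(plan_attr)
    if not plan:
        # Derive the config-side name from the plan attribute, e.g. base_model_tp_ep_plan.
        config_attr = f"base_model{plan_attr}"
        raise ValueError(f"No {plan_attr} available; define {config_attr} on the config.")
    return plan


# Hypothetical plans: only TP and TP+EP are defined for this model.
plans = {
    "_tp_plan": {"layers.*.mlp.experts": "moe_experts_allreduce"},
    "_tp_ep_plan": {"layers.*.mlp.experts.gate_up_proj": "grouped_gemm"},
}
selected = select_parallel_plan_sketch(plans, expert_parallel=True)
print(selected)
```

Because nothing is merged at runtime in this scheme, a missing combination fails loudly at selection time, which is the trade-off the commit messages describe when they replace runtime merging with explicit per-config plans.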
[For maintainers] Suggested jobs to run (before merge) run-slow: afmoe, apertus, arcee, aria, bamba, bitnet, cohere, cohere2, cohere2_moe, csm, cwm, dbrx, deepseek_v2, deepseek_v3, deepseek_v32, deepseek_v4
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46290&sha=f00563
Done:
- handling sparse + dense in the sp_plan for models like qwen3_moe
- better tests to check ep/tp/sp path during forward/backward
- Double check MoEExpertsParallel
go from module level to param level ("layers.*.mlp.experts.gate_up_proj": "grouped_gemm", "layers.*.mlp.experts.down_proj": "grouped_gemm", "layers.*.mlp.experts": "moe_experts_allreduce")
- Double check ep_router => is it the same as main and Amine
- fixing ep_backward test
- sp + ep training / tp + ep inference