
permute fusion and padded mla attention #120

Merged
wenxie-amd merged 6 commits into main from dev/wenx/permute_fusion
Jul 14, 2025

Conversation

@wenxie-amd
Contributor

@wenxie-amd wenxie-amd commented Jul 13, 2025

  1. Support moe_permute_fusion

    • Updated the MoE layer to align with the latest Megatron implementation.
    • Added the permutation logic newly required by moe_permute_fusion.
  2. Support for fused_padded_mla_attention

    • Enables fused attention even when the QK head dim is 192 and V head dim is 128, as seen in DeepSeek-style models.
    • Pads the V head dim to match QK, allowing the use of flash-attn or TE fused attention with uniform head dimension (192).
  3. Fix TE flash-attn version compatibility

    • Updated _flash_attn_max_version from PkgVersion("2.7.3") to PkgVersion("3.0.0.post1"), ensuring compatibility with newer versions of flash-attn.
  4. Support HSA_NO_SCRATCH_RECLAIM configuration

    • HSA_NO_SCRATCH_RECLAIM can now be configured via environment variables and is properly passed into Slurm jobs for AMD ROCm tuning.
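The token permutation in item 1 can be illustrated with a minimal sketch: tokens routed to the same expert are gathered into contiguous rows before the expert MLPs run, and the inverse permutation restores the original order afterward. This is a simplified standalone illustration of the idea, not the Megatron/fused kernel code; the function names are ours.

```python
import numpy as np

def permute(tokens, expert_ids):
    # Stable-sort rows by assigned expert so each expert's tokens are
    # contiguous; the fused kernel performs this gather on-device.
    order = np.argsort(expert_ids, kind="stable")
    return tokens[order], order

def unpermute(permuted, order):
    # Scatter rows back to their original positions (inverse permutation).
    out = np.empty_like(permuted)
    out[order] = permuted
    return out

tokens = np.arange(8, dtype=np.float32).reshape(4, 2)
expert_ids = np.array([2, 0, 1, 0])
permuted, order = permute(tokens, expert_ids)
restored = unpermute(permuted, order)
assert np.array_equal(restored, tokens)
```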
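The padding trick in item 2 works because zero-padded V columns produce exactly zero in the attention output, so they can be sliced away after the kernel runs. A reference-semantics sketch in plain numpy (the real path dispatches to flash-attn/TE fused attention; this function name is ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def padded_mla_attention(q, k, v):
    # q, k: (seq, 192); v: (seq, 128) as in DeepSeek-style MLA heads.
    qk_dim, v_dim = q.shape[-1], v.shape[-1]
    # Zero-pad V up to the QK head dim so a fused kernel that requires a
    # uniform head dimension can consume (q, k, v_padded).
    pad = np.zeros((v.shape[0], qk_dim - v_dim), dtype=v.dtype)
    v_padded = np.concatenate([v, pad], axis=-1)
    scores = softmax(q @ k.T / np.sqrt(qk_dim))
    out = scores @ v_padded
    # Padded columns are exactly zero in the output; drop them.
    return out[:, :v_dim]
```

Note the softmax scale stays 1/sqrt(192) (the QK dim); only V is padded, so the result matches unpadded attention bit-for-bit up to float rounding.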
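The environment propagation in item 4 might look like the following sketch: read the tuning variable from the caller's environment, apply a default only when unset, and hand the merged environment to the Slurm launch. The helper name and default value here are illustrative, not taken from the PR.

```python
import os

def build_job_env(defaults):
    # Start from the current environment; apply tuning defaults only where
    # the user has not already set a value. HSA_NO_SCRATCH_RECLAIM is the
    # ROCm knob mentioned in the PR; this helper itself is a hypothetical
    # sketch of the pass-through.
    env = dict(os.environ)
    for key, value in defaults.items():
        env.setdefault(key, value)
    return env

job_env = build_job_env({"HSA_NO_SCRATCH_RECLAIM": "1"})
# subprocess.run(["srun", ...], env=job_env) would then forward the
# variable into the Slurm job.
```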

@wenxie-amd wenxie-amd changed the title Dev/wenx/permute fusion permute fusion and padded mla attention Jul 13, 2025
@lhzhang333
Collaborator

LGTM

@wenxie-amd wenxie-amd merged commit 1b2978b into main Jul 14, 2025
4 checks passed
zhenhuang12 pushed a commit that referenced this pull request Jul 16, 2025
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
@wenxie-amd wenxie-amd deleted the dev/wenx/permute_fusion branch December 2, 2025 09:01

3 participants