
permute fusion and padded mla attention #120

Merged
wenxie-amd merged 6 commits into main from dev/wenx/permute_fusion
Jul 14, 2025

Conversation

@wenxie-amd
Contributor

@wenxie-amd wenxie-amd commented Jul 13, 2025

  1. Support moe_permute_fusion

    • Updated the MoE layer to align with the latest Megatron implementation.
    • Added the permutation logic newly required by moe_permute_fusion.
  2. Support for fused_padded_mla_attention

    • Enables fused attention even when the QK head dim is 192 and V head dim is 128, as seen in DeepSeek-style models.
    • Pads the V head dim to match QK, allowing the use of flash-attn or TE fused attention with uniform head dimension (192).
  3. Fix TE flash-attn version compatibility

    • Updated _flash_attn_max_version from PkgVersion("2.7.3") to PkgVersion("3.0.0.post1"), ensuring compatibility with newer versions of flash-attn.
  4. Support HSA_NO_SCRATCH_RECLAIM configuration

    • HSA_NO_SCRATCH_RECLAIM can now be configured via environment variables and is properly passed into Slurm jobs for AMD ROCm tuning.
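The token permutation in item 1 can be illustrated with a minimal sketch: tokens routed to the same expert are gathered into contiguous rows before the expert MLPs run, and the inverse permutation restores the original order afterward. This is a simplified standalone illustration of the idea, not the Megatron/fused kernel code; the function names are ours.

```python
import numpy as np

def permute(tokens, expert_ids):
    # Stable-sort rows by assigned expert so each expert's tokens are
    # contiguous; the fused kernel performs this gather on-device.
    order = np.argsort(expert_ids, kind="stable")
    return tokens[order], order

def unpermute(permuted, order):
    # Scatter rows back to their original positions (inverse permutation).
    out = np.empty_like(permuted)
    out[order] = permuted
    return out

tokens = np.arange(8, dtype=np.float32).reshape(4, 2)
expert_ids = np.array([2, 0, 1, 0])
permuted, order = permute(tokens, expert_ids)
restored = unpermute(permuted, order)
assert np.array_equal(restored, tokens)
```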
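The padding trick in item 2 works because zero-padded V columns produce exactly zero in the attention output, so they can be sliced away after the kernel runs. A reference-semantics sketch in plain numpy (the real path dispatches to flash-attn/TE fused attention; this function name is ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def padded_mla_attention(q, k, v):
    # q, k: (seq, 192); v: (seq, 128) as in DeepSeek-style MLA heads.
    qk_dim, v_dim = q.shape[-1], v.shape[-1]
    # Zero-pad V up to the QK head dim so a fused kernel that requires a
    # uniform head dimension can consume (q, k, v_padded).
    pad = np.zeros((v.shape[0], qk_dim - v_dim), dtype=v.dtype)
    v_padded = np.concatenate([v, pad], axis=-1)
    scores = softmax(q @ k.T / np.sqrt(qk_dim))
    out = scores @ v_padded
    # Padded columns are exactly zero in the output; drop them.
    return out[:, :v_dim]
```

Note the softmax scale stays 1/sqrt(192) (the QK dim); only V is padded, so the result matches unpadded attention bit-for-bit up to float rounding.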
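The environment propagation in item 4 might look like the following sketch: read the tuning variable from the caller's environment, apply a default only when unset, and hand the merged environment to the Slurm launch. The helper name and default value here are illustrative, not taken from the PR.

```python
import os

def build_job_env(defaults):
    # Start from the current environment; apply tuning defaults only where
    # the user has not already set a value. HSA_NO_SCRATCH_RECLAIM is the
    # ROCm knob mentioned in the PR; this helper itself is a hypothetical
    # sketch of the pass-through.
    env = dict(os.environ)
    for key, value in defaults.items():
        env.setdefault(key, value)
    return env

job_env = build_job_env({"HSA_NO_SCRATCH_RECLAIM": "1"})
# subprocess.run(["srun", ...], env=job_env) would then forward the
# variable into the Slurm job.
```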

@wenxie-amd wenxie-amd changed the title Dev/wenx/permute fusion permute fusion and padded mla attention Jul 13, 2025
@lhzhang333
Collaborator

LGTM

@wenxie-amd wenxie-amd merged commit 1b2978b into main Jul 14, 2025
4 checks passed
zhenhuang12 pushed a commit that referenced this pull request Jul 16, 2025
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
@wenxie-amd wenxie-amd deleted the dev/wenx/permute_fusion branch December 2, 2025 09:01

3 participants