
turbo update #638

Closed
Xiaoming-AMD wants to merge 34 commits into dev/tas/moe_package_v2.0 from main

Conversation

@Xiaoming-AMD (Collaborator)

No description provided.

alexsu52 and others added 25 commits March 13, 2026 08:49
### Changes:
- Detect when ip route get returns a local route (i.e., when the node is
the master).
- Added logic to look up the actual physical interface corresponding to
the local IP using ip addr show, rather than accepting the route's
output which typically defaults to lo (loopback).
### Reason for changes:
Warning suppression for master host: 
```[Primus:Preflight] WARN: Socket IFNAME does not match route-to-master interface (may hang init_process_group)```
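The detection described above can be sketched as follows. This is an illustrative Python sketch, not the actual Primus preflight code; the function names and sample command outputs are assumptions.

```python
def parse_route_device(route_output: str) -> tuple[str, bool]:
    """Parse one line of `ip route get <ip>` output.

    Returns (device, is_local). A leading 'local' token means this node
    itself owns the address (i.e. it is the master), and the reported
    device is then typically 'lo' (loopback).
    """
    tokens = route_output.split()
    is_local = bool(tokens) and tokens[0] == "local"
    dev = tokens[tokens.index("dev") + 1] if "dev" in tokens else ""
    return dev, is_local


def iface_for_ip(addr_show_output: str, ip: str) -> str:
    """Find the physical interface owning `ip` in `ip -o addr show` output.

    Used as a fallback when the route is local, so the IFNAME check
    compares against a real NIC instead of loopback.
    """
    for line in addr_show_output.splitlines():
        parts = line.split()
        # `ip -o addr show` lines look like: "2: eth0  inet 10.0.0.5/24 ..."
        if len(parts) >= 4 and parts[3].split("/")[0] == ip:
            return parts[1]
    return ""
```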
### Changes:
- Refine network_mode detection to calculate the actual number of nodes
(nnodes) instead of relying solely on world_size.
- Support node count detection from Slurm (SLURM_NNODES), OpenMPI
(OMPI_COMM_WORLD_SIZE / OMPI_COMM_WORLD_LOCAL_SIZE), and PyTorch
(WORLD_SIZE / LOCAL_WORLD_SIZE) environment variables.
- Set network_mode="multi-node" only if nnodes > 1.

### Reason for changes:
The previous logic incorrectly classified single-node distributed
training (e.g., multi-GPU on one machine) as "multi-node" simply because
the world size was greater than 1. This change ensures that network_mode
accurately reflects whether training spans multiple physical nodes.
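The fallback chain above can be sketched like this. It is a minimal illustration of the detection order, assuming the environment variable semantics stated in the change list; the helper names are hypothetical.

```python
import os


def detect_nnodes(env=os.environ) -> int:
    """Best-effort node count: Slurm first, then OpenMPI, then PyTorch.

    For OpenMPI and PyTorch the node count is derived as
    world_size // local_world_size.
    """
    if "SLURM_NNODES" in env:
        return int(env["SLURM_NNODES"])
    if "OMPI_COMM_WORLD_SIZE" in env:
        world = int(env["OMPI_COMM_WORLD_SIZE"])
        local = int(env.get("OMPI_COMM_WORLD_LOCAL_SIZE", world))
        return max(1, world // local)
    if "WORLD_SIZE" in env:
        world = int(env["WORLD_SIZE"])
        local = int(env.get("LOCAL_WORLD_SIZE", world))
        return max(1, world // local)
    return 1


def network_mode(env=os.environ) -> str:
    # "multi-node" only when training actually spans multiple nodes.
    return "multi-node" if detect_nnodes(env) > 1 else "single-node"
```

With this logic, 8 GPUs on one machine (WORLD_SIZE=8, LOCAL_WORLD_SIZE=8) no longer counts as multi-node.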
Co-authored-by: vidushi8 <vidgoyal@amd.com>
Co-authored-by: Kailash Gogineni <gkailashnath1998@gmail.com>
Co-authored-by: Mingyu Yang <Mingyu.Yang@amd.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: HuangWei-95 <Wei.Huang4@amd.com>
Co-authored-by: HuangWei-95 <weihuan@amd.com>
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Co-authored-by: WangLingxun <linxwang@amd.com>
Co-authored-by: Anshu Raina <Anshu.Raina@amd.com>
Co-authored-by: wenxie-amd <Wen.Xie@amd.com>
(1) Refactor maxtext to the register-patch workflow.
(2) The 'core' workflow supports docker images v25.9 and v26.1; the
'legacy' workflow supports v26.1.
(3) Refactor some patches to the wrapper method.

---------

Co-authored-by: Xiaoming-AMD <xiaoming.peng@amd.com>
chore(megatron): bump version to core_v0.16.0

---------

Co-authored-by: HuangWei-95 <weihuan@amd.com>
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Co-authored-by: WangLingxun <linxwang@amd.com>
…592)

fix(cli): pass --debug to python and include file:line in error msg

Co-authored-by: HuangWei-95 <weihuan@amd.com>
Patch Megatron validate_args in the backend base trainer to support
Primus-specific argument flows for decoder_pipeline_manual_split_list
and fp4.
#617)

Fix api compatible issue after mcore upgrade to v0.16.0.
…ain and runner hook (#591)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(1) Switch the maxtext submodule from google to rocm.
(2) Remove patches that already exist in the rocm/maxtext release branch.
(3) Refactor the maxtext config structure from replicate to extract.
…-found (#621)

amd-aiter has two C++ kernel-loading paths with conflicting expectations
for the AITER_ASM_DIR env var:

  - aiter_hip_common.h appends /{gfx}/ to AITER_ASM_DIR
  - codegen.py (fmha_fwd_v3_kernel) does not

When core.py includes a gfx subdirectory in AITER_ASM_DIR, the first
path gets a double gfx prefix; when it doesn't, the second path can't
find the .co kernel files.

The new hook runs at container startup and:
1. Normalizes core.py so AITER_ASM_DIR always ends at .../hsa/
2. Creates symlinks from hsa/ into hsa/{gfx}/ so both paths resolve

Images without aiter_meta/hsa (e.g. rocm/primus:v26.1 workspace
installs) are safely skipped.
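The path normalization step can be sketched as a pure function. This is an assumption-laden illustration of the "always end at .../hsa/" rule, not the hook's actual code; the function name is hypothetical and the symlink step is omitted.

```python
def normalize_aiter_asm_dir(path: str) -> str:
    """Trim an AITER_ASM_DIR value so it always ends at .../hsa/.

    If a gfx subdirectory (e.g. .../hsa/gfx942/) was appended, drop it,
    so that the C++ path which appends /{gfx}/ itself does not end up
    with a doubled gfx segment.
    """
    parts = [p for p in path.split("/") if p]
    if "hsa" in parts:
        parts = parts[: parts.index("hsa") + 1]
    return "/" + "/".join(parts) + "/"
```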
* Update the latest primus-turbo for better fp8 grouped gemm
performance. For better performance, you need to set the env
`PRIMUS_TURBO_AUTO_TUNE=1`.
* Modify the Dockerfile: uninstall aiter and reinstall the aiter version
that Turbo uses.
* When installing aiter, precompile the attn_v3 kernel in advance to
avoid JIT.
* Removed some deprecated Turbo APIs in Primus.
* Enhance the Turbo tests in Megatron.

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…xit code (#620)

1. Shell fix (primus-cli-direct.sh):
   - Add `set +e` before `eval "$CMD"` so the script does not exit
     immediately on non-zero torchrun return code, allowing proper
     error logging before exit.

2. UT validation (tests/utils.py → run_training_script):
   - Extract common training-script execution logic into a shared
     'run_training_script()' helper, replacing duplicated code across
     test_megatron_trainer, test_torchtitan_trainer, and
     test_maxtext_trainer.
   - In the success path (exit code 0), assert that the PrimusRuntime
     'Training completed.' marker is present in the log file. This
     catches silent training failures where torchrun returns 0 but
     training did not actually finish (e.g. AITER HIP errors).
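The success-path assertion can be sketched as below. This is a simplified stand-in for the shared helper, assuming only what the description states (exit code 0 plus a `Training completed.` marker in the log); the function name is hypothetical.

```python
def assert_training_completed(log_path: str, exit_code: int) -> None:
    """On exit code 0, require the 'Training completed.' marker in the log.

    Guards against silent failures where torchrun exits 0 but the run
    never reached the runtime's completion marker.
    """
    if exit_code != 0:
        raise RuntimeError(f"training script failed with exit code {exit_code}")
    with open(log_path) as f:
        if "Training completed." not in f.read():
            raise RuntimeError(
                "exit code was 0 but 'Training completed.' marker is missing"
            )
```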
Remove the legacy light-megatron trainer implementation and clean up all
related framework aliases. This change deletes the lightmegatron trainer
modules and removes light-megatron routing from parser and hook
dispatchers (train pretrain and projection performance), so framework
resolution now follows megatron directly without a compatibility alias.

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Related doc:
https://amd.atlassian.net/wiki/spaces/~712020ea4fade82ae94a95b7c0ba1cb554d2a8/pages/1382714769/GPT-OSS+test+with+triton+sink+attention

This PR adds support for the GPT-OSS 20B and 120B models.

---------

Signed-off-by: Gene Der Su <e870252314@gmail.com>
Co-authored-by: JohnQinAMD <yanyuan.qin@amd.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: wenxie-amd <Wen.Xie@amd.com>
…port (#622)

perf-projection: add pipeline scheduler comparison and fix plotext import

- simulator.py: wrap the plotext import in try/except (optional dependency)
- projection.py: support a --pipeline-schedule-algorithm flag with options
  auto, zerobubble, zbv-formatted, zbv-greedy, megatron-ilp, and all, for
  comparing pipeline schedulers including the Megatron ILP (Sea AI Lab)
  zero-bubble scheduler
- projection.py (CLI): add the --pipeline-schedule-algorithm argument
- When zero-bubble is enabled and VPP=1, use the Megatron ILP scheduler
- When VPP>1, fall back to the Primus interleaved 1F1B scheduler
- When VPP==2 in 'all' mode, also run the ZBV Formatted and ZBV Greedy
  (min + half memory configs) schedulers for comparison
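The optional-dependency guard for plotext follows a standard pattern, sketched below. The surrounding function is an illustrative assumption, not simulator.py's actual code.

```python
# Optional-dependency import guard, as applied to plotext.
try:
    import plotext  # terminal plotting; optional
except ImportError:
    plotext = None


def plot_bubble_times(values) -> bool:
    """Plot if plotext is available; otherwise degrade to a no-op.

    Returns True when a plot was drawn, False when the caller should
    fall back to plain-text output.
    """
    if plotext is None:
        return False
    plotext.bar(list(range(len(values))), list(values))
    plotext.show()
    return True
```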
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
## Summary

- Adds a new `zero-bubble-heuristic` pipeline parallelism scheduling
algorithm that uses a graph-based heuristic to explore 8 candidate
schedules (combinations of `allow_bubble_before_first_b`,
`prioritize_b`, `no_bubble_greedy`) and selects the one with the lowest
bubble time.
- Exposes configurable parameters (`pp_max_mem`, `pp_cost_f`,
`pp_cost_b`, `pp_cost_w`) to control the memory budget and F/B/W cost
model, enabling the scheduler to produce memory-aware schedules with
realistic cost ratios.
- Enhances the PP visualization tool (`vis.py`) with per-rank F/B/W time
breakdown, correct cross-rank iteration time calculation, and detailed
console output for easier performance analysis.

## Changes

### Core Algorithm
- **`zerobubble_heuristic.py`** (new): Self-contained implementation of
the zero-bubble-heuristic scheduler, ported from the internal Megatron
ZB module into the Primus scheduler framework. Implements `_Graph`
(DAG-based scheduling), `_initial_solution` (best-of-8 heuristic
search), and `ScheduleZeroBubbleHeuristic` (the `PipelineScheduleAlgo`
subclass that generates the schedule table with proper send/recv
communication pairs).

### Integration
- **`pipeline_launcher.py`**: Registers `zero-bubble-heuristic` as a
valid algorithm, passes `max_mem`/`cost_f`/`cost_b`/`cost_w` kwargs to
the schedule factory, and adds `dump_pp_data` support via
`schedule_wrapper`.
- **`primus_turbo.py`**: Enables split W-grad operations for the new
algorithm.
- **`schedule_table_factory.py`**: Registers
`ScheduleZeroBubbleHeuristic` in the algorithm map; replaces
`@lru_cache` with a manual dict cache to support unhashable kwargs
(lists).
- **`primus_pipeline.yaml`**: Adds config entries for `pp_max_mem`,
`pp_cost_f`, `pp_cost_b`, `pp_cost_w`.
- **`megatron_pretrain_trainer.py`**: Adds post-training PP data dump
for visualization/analysis.

### Visualization & Analysis
- **`vis.py`**: Extracts `get_fbw_times()` helper; fixes `iter_time` to
use max across all ranks (not just rank-0); adds per-rank F/B/W time and
percentage breakdown in console output.
- **`pp_simulation.yaml`**: Adds two example simulation configs
(`zb-heuristic-mem8`, `zb-heuristic-mem10`).

## Algorithm Visualization

<img width="3600" height="5400" alt="image"
src="https://github.com/user-attachments/assets/0d73d80b-d8c1-45d6-918f-ee05499018ea"
/>

Co-authored-by: root <root@smc300x-ccs-aus-a16-19.prov.aus.ccs.cpe.ice.amd.com>
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Adding slurm script for DCGPU cluster rccl benchmarking. Some edits are
from Joyce for their cluster setup.

Usage example on DCGPU cluster:

`DOCKER_IMAGE=rocm/primus:v26.1 NNODES=2 sbatch -N2 -w
smci355-ccs-aus-n04-[25,29] -p Compute-DCPT ./run_slurm.sh`

---------

Co-authored-by: Joyce Zhang <joyzhang@smci355-ccs-aus-n03-25.prov.aus.ccs.cpe.ice.amd.com>
Co-authored-by: Joyce Zhang <joyzhang@amd.com>
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Tech blog on primus perf model plus add related configs.
…#611)

## Summary

Add model definitions and pretrain example configs for multiple model
scales that were previously missing from the Megatron and TorchTitan
backends.

### Megatron

- **Model configs**: Llama2 13B, Qwen2.5 3B/14B/32B, Qwen3 4B/14B/32B
- **Pretrain example configs** (MI300X & MI355X, BF16 + FP8):
  - Llama2 13B
  - Qwen2.5 3B, 14B, 32B
  - Qwen3 4B, 8B, 14B, 32B

### TorchTitan

- **Model configs**: Llama4 Scout 17Bx16E, Llama4 Maverick 17Bx128E,
DeepSeek V3 236B, Qwen3 4B/8B/14B (BF16 & FP8 variants)
- **Pretrain example configs** (MI300X & MI355X, BF16 + FP8):
  - Llama4 Scout 17Bx16E
  - Llama4 Maverick 17Bx128E
  - DeepSeek V3 236B
  - Qwen3 4B, 8B, 14B

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Add the skills file for primus projection.
…ode dual-rank filter (#605)

Only NODE_RANK=0 now streams raw hook stdout/stderr in execute_hooks
while non-primary nodes still capture output for env/extra parsing and
preserve hook exit semantics. Also update direct launcher
local-ranks-filter logic to skip adding last local rank on single-node
runs, preventing duplicated rank0/rank7-style outputs in logs.
xiaobochen-amd and others added 3 commits March 30, 2026 14:48
* Update Primus-Turbo.
* Add disable_turbo_grouped_mlp_low_precision
* Format primus_turbo.yaml.
Docs: expand projection disclaimers for directional estimates

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
…604)

### Changes:
- Add optional argument `expect_distributed = True` to
`run_preflight_info`
- Configure `preflight` with `expect_distributed=False` during initial
local-only checks.

### Reason for changes:
Warning suppression:
```[Primus:Preflight] WARN: Runtime process group not initialized```

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
RuibinCheung and others added 6 commits March 31, 2026 14:19
* Keep the input in SBHD layout to reduce extra q, k, v transposes in
attention.

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Co-authored-by: vidushi8 <vidgoyal@amd.com>
Co-authored-by: Kailash Gogineni <gkailashnath1998@gmail.com>
Co-authored-by: clairesonglee <Claire.Lee2@amd.com>
Co-authored-by: Mingyu Yang <Mingyu.Yang@amd.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: HuangWei-95 <Wei.Huang4@amd.com>
Co-authored-by: HuangWei-95 <weihuan@amd.com>
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Co-authored-by: WangLingxun <linxwang@amd.com>
Co-authored-by: Anshu Raina <Anshu.Raina@amd.com>
Co-authored-by: wenxie-amd <Wen.Xie@amd.com>
#634)

# Fix an error for launching multi-node training jobs and add other
improvements in the launching script

## Problem:
When launching multi-node training jobs with some Docker images, the
following error occurs: `ImportError:
/opt/rocm-7.2.0/lib/libamdhip64.so.7: undefined symbol:
hsa_amd_memory_get_preferred_copy_engine, version ROCR_1`. The problem is
that `/opt/rocm/lib` is not included in `LD_LIBRARY_PATH` in the Docker
images.

## Fix:
In `examples/run_pretrain.sh` and `runner/helpers/envs/base_env.sh`, set
the default value of `LD_LIBRARY_PATH` to `/opt/rocm/lib`. The order of
the library paths in `LD_LIBRARY_PATH` also matters: `/opt/rocm/lib` is
put before all other paths.
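The ordering rule can be sketched as a small helper. This is an illustrative Python sketch of the prepend logic (the actual fix lives in shell scripts); the function name is hypothetical.

```python
def with_rocm_lib_first(ld_library_path: str, rocm_lib: str = "/opt/rocm/lib") -> str:
    """Ensure rocm_lib is present and *first* in an LD_LIBRARY_PATH value.

    Order matters: /opt/rocm/lib must precede other paths so that
    libamdhip64 resolves against the matching ROCr runtime.
    """
    parts = [p for p in ld_library_path.split(":") if p and p != rocm_lib]
    return ":".join([rocm_lib] + parts)
```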

## Other changes:
- Allow users to set the `NCCL_CROSS_NIC` value; it was previously
hardcoded.
- In `examples/run_local_pretrain.sh`, fixed the `TC_RESULTS` env variable.
- In `examples/run_local_pretrain.sh`, added a name for the launching
container.
- For the ANP plugin, removed the hard failure so training can run
without the ANP plugin.
The aiter reinstall flow introduced with the Primus-Turbo docker update
still relies on `python setup.py develop`, which now fails in Docker
with `ModuleNotFoundError: vcs_versioning`. Switch to `pip install
--use-pep517 -e .` so aiter resolves build dependencies through its
`pyproject.toml`.
Update all references to the Primus base image across documentation,
configuration files, CI/CD workflows, benchmark helpers, and example
scripts to use the latest v26.2 release.

Keep existing JAX/MaxText image references unchanged.