
turbo update #638

Closed
Xiaoming-AMD wants to merge 34 commits into dev/tas/moe_package_v2.0 from main

Conversation

@Xiaoming-AMD (Collaborator)

No description provided.

alexsu52 and others added 25 commits March 13, 2026 08:49
### Changes:
- Detect when ip route get returns a local route (i.e., when the node is
the master).
- Added logic to look up the actual physical interface corresponding to
the local IP using ip addr show, rather than accepting the route's
output which typically defaults to lo (loopback).
### Reason for changes:
Warning suppression for master host: 
```[Primus:Preflight] WARN: Socket IFNAME does not match route-to-master interface (may hang init_process_group)```
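The detection described above can be sketched as follows. This is an illustrative Python sketch, not the actual Primus preflight code; the function names and sample command outputs are assumptions.

```python
def parse_route_device(route_output: str) -> tuple[str, bool]:
    """Parse one line of `ip route get <ip>` output.

    Returns (device, is_local). A leading 'local' token means this node
    itself owns the address (i.e. it is the master), and the reported
    device is then typically 'lo' (loopback).
    """
    tokens = route_output.split()
    is_local = bool(tokens) and tokens[0] == "local"
    dev = tokens[tokens.index("dev") + 1] if "dev" in tokens else ""
    return dev, is_local


def iface_for_ip(addr_show_output: str, ip: str) -> str:
    """Find the physical interface owning `ip` in `ip -o addr show` output.

    Used as a fallback when the route is local, so the IFNAME check
    compares against a real NIC instead of loopback.
    """
    for line in addr_show_output.splitlines():
        parts = line.split()
        # `ip -o addr show` lines look like: "2: eth0  inet 10.0.0.5/24 ..."
        if len(parts) >= 4 and parts[3].split("/")[0] == ip:
            return parts[1]
    return ""
```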
### Changes:
- Refine network_mode detection to calculate the actual number of nodes
(nnodes) instead of relying solely on world_size.
- Support node count detection from Slurm (SLURM_NNODES), OpenMPI
(OMPI_COMM_WORLD_SIZE / OMPI_COMM_WORLD_LOCAL_SIZE), and PyTorch
(WORLD_SIZE / LOCAL_WORLD_SIZE) environment variables.
- Set network_mode="multi-node" only if nnodes > 1.

### Reason for changes:
The previous logic incorrectly classified single-node distributed
training (e.g., multi-GPU on one machine) as "multi-node" simply because
the world size was greater than 1. This change ensures that network_mode
accurately reflects whether training spans multiple physical nodes.
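The fallback chain above can be sketched like this. It is a minimal illustration of the detection order, assuming the environment variable semantics stated in the change list; the helper names are hypothetical.

```python
import os


def detect_nnodes(env=os.environ) -> int:
    """Best-effort node count: Slurm first, then OpenMPI, then PyTorch.

    For OpenMPI and PyTorch the node count is derived as
    world_size // local_world_size.
    """
    if "SLURM_NNODES" in env:
        return int(env["SLURM_NNODES"])
    if "OMPI_COMM_WORLD_SIZE" in env:
        world = int(env["OMPI_COMM_WORLD_SIZE"])
        local = int(env.get("OMPI_COMM_WORLD_LOCAL_SIZE", world))
        return max(1, world // local)
    if "WORLD_SIZE" in env:
        world = int(env["WORLD_SIZE"])
        local = int(env.get("LOCAL_WORLD_SIZE", world))
        return max(1, world // local)
    return 1


def network_mode(env=os.environ) -> str:
    # "multi-node" only when training actually spans multiple nodes.
    return "multi-node" if detect_nnodes(env) > 1 else "single-node"
```

With this logic, 8 GPUs on one machine (WORLD_SIZE=8, LOCAL_WORLD_SIZE=8) no longer counts as multi-node.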
Co-authored-by: vidushi8 <vidgoyal@amd.com>
Co-authored-by: Kailash Gogineni <gkailashnath1998@gmail.com>
Co-authored-by: Mingyu Yang <Mingyu.Yang@amd.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: HuangWei-95 <Wei.Huang4@amd.com>
Co-authored-by: HuangWei-95 <weihuan@amd.com>
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Co-authored-by: WangLingxun <linxwang@amd.com>
Co-authored-by: Anshu Raina <Anshu.Raina@amd.com>
Co-authored-by: wenxie-amd <Wen.Xie@amd.com>
(1) Refactor maxtext to the register-patch workflow.
(2) The 'core' workflow supports docker images v25.9 and v26.1; the
'legacy' workflow supports v26.1.
(3) Refactor some patches to the wrapper method.

---------

Co-authored-by: Xiaoming-AMD <xiaoming.peng@amd.com>
chore(megatron): bump version to core_v0.16.0

---------

Co-authored-by: HuangWei-95 <weihuan@amd.com>
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Co-authored-by: WangLingxun <linxwang@amd.com>
…592)

fix(cli): pass --debug to python and include file:line in error msg

Co-authored-by: HuangWei-95 <weihuan@amd.com>
Patch Megatron validate_args in the backend base trainer to support
Primus-specific argument flows for decoder_pipeline_manual_split_list
and fp4.
#617)

Fix api compatible issue after mcore upgrade to v0.16.0.
…ain and runner hook (#591)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(1) Switch the maxtext submodule from google to rocm.
(2) Remove patches that already exist in the rocm/maxtext release branch.
(3) Refactor the maxtext config structure from replicate to extract.
…-found (#621)

amd-aiter has two C++ kernel-loading paths with conflicting expectations
for the AITER_ASM_DIR env var:

  - aiter_hip_common.h appends /{gfx}/ to AITER_ASM_DIR
  - codegen.py (fmha_fwd_v3_kernel) does not

When core.py includes a gfx subdirectory in AITER_ASM_DIR, the first
path gets a double gfx prefix; when it doesn't, the second path can't
find the .co kernel files.

The new hook runs at container startup and:
1. Normalizes core.py so AITER_ASM_DIR always ends at .../hsa/
2. Creates symlinks from hsa/ into hsa/{gfx}/ so both paths resolve

Images without aiter_meta/hsa (e.g. rocm/primus:v26.1 workspace
installs) are safely skipped.
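The path normalization step can be sketched as a pure function. This is an assumption-laden illustration of the "always end at .../hsa/" rule, not the hook's actual code; the function name is hypothetical and the symlink step is omitted.

```python
def normalize_aiter_asm_dir(path: str) -> str:
    """Trim an AITER_ASM_DIR value so it always ends at .../hsa/.

    If a gfx subdirectory (e.g. .../hsa/gfx942/) was appended, drop it,
    so that the C++ path which appends /{gfx}/ itself does not end up
    with a doubled gfx segment.
    """
    parts = [p for p in path.split("/") if p]
    if "hsa" in parts:
        parts = parts[: parts.index("hsa") + 1]
    return "/" + "/".join(parts) + "/"
```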
* Update the latest primus-turbo for better fp8 grouped gemm
performance. For better performance, you need to set the env
`PRIMUS_TURBO_AUTO_TUNE=1`.
* Modify the Dockerfile: uninstall aiter and reinstall the aiter version
that Turbo uses.
* When installing aiter, precompile the attn_v3 kernel in advance to
avoid JIT.
* Removed some deprecated Turbo APIs in Primus.
* Enhance the Turbo tests in Megatron.

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…xit code (#620)

1. Shell fix (primus-cli-direct.sh):
   - Add `set +e` before `eval "$CMD"` so the script does not exit
     immediately on non-zero torchrun return code, allowing proper
     error logging before exit.

2. UT validation (tests/utils.py → run_training_script):
   - Extract common training-script execution logic into a shared
     'run_training_script()' helper, replacing duplicated code across
     test_megatron_trainer, test_torchtitan_trainer, and
     test_maxtext_trainer.
   - In the success path (exit code 0), assert that the PrimusRuntime
     'Training completed.' marker is present in the log file. This
     catches silent training failures where torchrun returns 0 but
     training did not actually finish (e.g. AITER HIP errors).
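The success-path assertion can be sketched as below. This is a simplified stand-in for the shared helper, assuming only what the description states (exit code 0 plus a `Training completed.` marker in the log); the function name is hypothetical.

```python
def assert_training_completed(log_path: str, exit_code: int) -> None:
    """On exit code 0, require the 'Training completed.' marker in the log.

    Guards against silent failures where torchrun exits 0 but the run
    never reached the runtime's completion marker.
    """
    if exit_code != 0:
        raise RuntimeError(f"training script failed with exit code {exit_code}")
    with open(log_path) as f:
        if "Training completed." not in f.read():
            raise RuntimeError(
                "exit code was 0 but 'Training completed.' marker is missing"
            )
```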
Remove the legacy light-megatron trainer implementation and clean up all
related framework aliases. This change deletes the lightmegatron trainer
modules and removes light-megatron routing from parser and hook
dispatchers (train pretrain and projection performance), so framework
resolution now follows megatron directly without a compatibility alias.

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Related doc:
https://amd.atlassian.net/wiki/spaces/~712020ea4fade82ae94a95b7c0ba1cb554d2a8/pages/1382714769/GPT-OSS+test+with+triton+sink+attention

This PR adds support for the GPT-OSS 20B and 120B models.

---------

Signed-off-by: Gene Der Su <e870252314@gmail.com>
Co-authored-by: JohnQinAMD <yanyuan.qin@amd.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: wenxie-amd <Wen.Xie@amd.com>
…port (#622)

perf-projection: add pipeline scheduler comparison and fix plotext import

- simulator.py: wrap the plotext import in try/except (optional dependency)
- projection.py: support a --pipeline-schedule-algorithm flag with options
  auto, zerobubble, zbv-formatted, zbv-greedy, megatron-ilp, and all, for
  comparing pipeline schedulers including the Megatron ILP (Sea AI Lab)
  zero-bubble scheduler
- projection.py (CLI): add the --pipeline-schedule-algorithm argument
- When zero-bubble is enabled and VPP=1, use the Megatron ILP scheduler
- When VPP>1, fall back to the Primus interleaved 1F1B scheduler
- When VPP==2 in 'all' mode, also run the ZBV Formatted and ZBV Greedy
  (min + half memory configs) schedulers for comparison
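The optional-dependency guard for plotext follows a standard pattern, sketched below. The surrounding function is an illustrative assumption, not simulator.py's actual code.

```python
# Optional-dependency import guard, as applied to plotext.
try:
    import plotext  # terminal plotting; optional
except ImportError:
    plotext = None


def plot_bubble_times(values) -> bool:
    """Plot if plotext is available; otherwise degrade to a no-op.

    Returns True when a plot was drawn, False when the caller should
    fall back to plain-text output.
    """
    if plotext is None:
        return False
    plotext.bar(list(range(len(values))), list(values))
    plotext.show()
    return True
```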
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
## Summary

- Adds a new `zero-bubble-heuristic` pipeline parallelism scheduling
algorithm that uses a graph-based heuristic to explore 8 candidate
schedules (combinations of `allow_bubble_before_first_b`,
`prioritize_b`, `no_bubble_greedy`) and selects the one with the lowest
bubble time.
- Exposes configurable parameters (`pp_max_mem`, `pp_cost_f`,
`pp_cost_b`, `pp_cost_w`) to control the memory budget and F/B/W cost
model, enabling the scheduler to produce memory-aware schedules with
realistic cost ratios.
- Enhances the PP visualization tool (`vis.py`) with per-rank F/B/W time
breakdown, correct cross-rank iteration time calculation, and detailed
console output for easier performance analysis.

## Changes

### Core Algorithm
- **`zerobubble_heuristic.py`** (new): Self-contained implementation of
the zero-bubble-heuristic scheduler, ported from the internal Megatron
ZB module into the Primus scheduler framework. Implements `_Graph`
(DAG-based scheduling), `_initial_solution` (best-of-8 heuristic
search), and `ScheduleZeroBubbleHeuristic` (the `PipelineScheduleAlgo`
subclass that generates the schedule table with proper send/recv
communication pairs).

### Integration
- **`pipeline_launcher.py`**: Registers `zero-bubble-heuristic` as a
valid algorithm, passes `max_mem`/`cost_f`/`cost_b`/`cost_w` kwargs to
the schedule factory, and adds `dump_pp_data` support via
`schedule_wrapper`.
- **`primus_turbo.py`**: Enables split W-grad operations for the new
algorithm.
- **`schedule_table_factory.py`**: Registers
`ScheduleZeroBubbleHeuristic` in the algorithm map; replaces
`@lru_cache` with a manual dict cache to support unhashable kwargs
(lists).
- **`primus_pipeline.yaml`**: Adds config entries for `pp_max_mem`,
`pp_cost_f`, `pp_cost_b`, `pp_cost_w`.
- **`megatron_pretrain_trainer.py`**: Adds post-training PP data dump
for visualization/analysis.

### Visualization & Analysis
- **`vis.py`**: Extracts `get_fbw_times()` helper; fixes `iter_time` to
use max across all ranks (not just rank-0); adds per-rank F/B/W time and
percentage breakdown in console output.
- **`pp_simulation.yaml`**: Adds two example simulation configs
(`zb-heuristic-mem8`, `zb-heuristic-mem10`).

## Algorithm Visualization

<img width="3600" height="5400" alt="image"
src="https://github.com/user-attachments/assets/0d73d80b-d8c1-45d6-918f-ee05499018ea"
/>

Co-authored-by: root <root@smc300x-ccs-aus-a16-19.prov.aus.ccs.cpe.ice.amd.com>
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Adding slurm script for DCGPU cluster rccl benchmarking. Some edits are
from Joyce for their cluster setup.

Usage example on DCGPU cluster:

`DOCKER_IMAGE=rocm/primus:v26.1 NNODES=2 sbatch -N2 -w
smci355-ccs-aus-n04-[25,29] -p Compute-DCPT ./run_slurm.sh`

---------

Co-authored-by: Joyce Zhang <joyzhang@smci355-ccs-aus-n03-25.prov.aus.ccs.cpe.ice.amd.com>
Co-authored-by: Joyce Zhang <joyzhang@amd.com>
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Tech blog on primus perf model plus add related configs.
…#611)

## Summary

Add model definitions and pretrain example configs for multiple model
scales that were previously missing from the Megatron and TorchTitan
backends.

### Megatron

- **Model configs**: Llama2 13B, Qwen2.5 3B/14B/32B, Qwen3 4B/14B/32B
- **Pretrain example configs** (MI300X & MI355X, BF16 + FP8):
  - Llama2 13B
  - Qwen2.5 3B, 14B, 32B
  - Qwen3 4B, 8B, 14B, 32B

### TorchTitan

- **Model configs**: Llama4 Scout 17Bx16E, Llama4 Maverick 17Bx128E,
DeepSeek V3 236B, Qwen3 4B/8B/14B (BF16 & FP8 variants)
- **Pretrain example configs** (MI300X & MI355X, BF16 + FP8):
  - Llama4 Scout 17Bx16E
  - Llama4 Maverick 17Bx128E
  - DeepSeek V3 236B
  - Qwen3 4B, 8B, 14B

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Add the skills file for primus projection.
…ode dual-rank filter (#605)

Only NODE_RANK=0 now streams raw hook stdout/stderr in execute_hooks
while non-primary nodes still capture output for env/extra parsing and
preserve hook exit semantics. Also update direct launcher
local-ranks-filter logic to skip adding last local rank on single-node
runs, preventing duplicated rank0/rank7-style outputs in logs.
xiaobochen-amd and others added 3 commits March 30, 2026 14:48
* Update Primus-Turbo.
* Add disable_turbo_grouped_mlp_low_precision
* Format primus_turbo.yaml.
Docs: expand projection disclaimers for directional estimates

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
…604)

### Changes:
- Add optional argument `expect_distributed = True` to
`run_preflight_info`
- Configure `preflight` with `expect_distributed=False` during initial
local-only checks.

### Reason for changes:
Warning suppression:
```[Primus:Preflight] WARN: Runtime process group not initialized```

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
RuibinCheung and others added 6 commits March 31, 2026 14:19
* Keep the input in SBHD layout to reduce extra q, k, v transposes in
attention.

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Co-authored-by: vidushi8 <vidgoyal@amd.com>
Co-authored-by: Kailash Gogineni <gkailashnath1998@gmail.com>
Co-authored-by: clairesonglee <Claire.Lee2@amd.com>
Co-authored-by: Mingyu Yang <Mingyu.Yang@amd.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: HuangWei-95 <Wei.Huang4@amd.com>
Co-authored-by: HuangWei-95 <weihuan@amd.com>
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Co-authored-by: WangLingxun <linxwang@amd.com>
Co-authored-by: Anshu Raina <Anshu.Raina@amd.com>
Co-authored-by: wenxie-amd <Wen.Xie@amd.com>
#634)

# Fix an error for launching multi-node training jobs and add other
improvements in the launching script

## Problem:
When launching multi-node training jobs with some Docker images, the
following error occurs: `ImportError:
/opt/rocm-7.2.0/lib/libamdhip64.so.7: undefined symbol:
hsa_amd_memory_get_preferred_copy_engine, version ROCR_1`. The problem is
that `/opt/rocm/lib` is not included in `LD_LIBRARY_PATH` in the Docker
images.

## Fix:
In `examples/run_pretrain.sh` and `runner/helpers/envs/base_env.sh`, set
the default value of `LD_LIBRARY_PATH` to `/opt/rocm/lib`. The order of
the library paths in `LD_LIBRARY_PATH` also matters: `/opt/rocm/lib` is
put before all other paths.
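The ordering rule can be sketched as a small helper. This is an illustrative Python sketch of the prepend logic (the actual fix lives in shell scripts); the function name is hypothetical.

```python
def with_rocm_lib_first(ld_library_path: str, rocm_lib: str = "/opt/rocm/lib") -> str:
    """Ensure rocm_lib is present and *first* in an LD_LIBRARY_PATH value.

    Order matters: /opt/rocm/lib must precede other paths so that
    libamdhip64 resolves against the matching ROCr runtime.
    """
    parts = [p for p in ld_library_path.split(":") if p and p != rocm_lib]
    return ":".join([rocm_lib] + parts)
```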

## Other changes:
- Allow users to set the `NCCL_CROSS_NIC` value; it was previously
hardcoded.
- In `examples/run_local_pretrain.sh`, fixed the `TC_RESULTS` env variable.
- In `examples/run_local_pretrain.sh`, added a name for the launching
container.
- For the ANP plugin, removed the hard failure so training can run
without the ANP plugin.
The aiter reinstall flow introduced with the Primus-Turbo docker update
still relies on `python setup.py develop`, which now fails in Docker
with `ModuleNotFoundError: vcs_versioning`. Switch to `pip install
--use-pep517 -e .` so aiter resolves build dependencies through its
`pyproject.toml`.
Update all references to the Primus base image across documentation,
configuration files, CI/CD workflows, benchmark helpers, and example
scripts to use the latest v26.2 release.

Keep existing JAX/MaxText image references unchanged.