### Changes:
- Detect when `ip route get` returns a local route (i.e., when the node is the master).
- Added logic to look up the actual physical interface corresponding to the local IP using `ip addr show`, rather than accepting the route's output, which typically defaults to `lo` (loopback).

### Reason for changes:
Suppresses this warning on the master host:
```
[Primus:Preflight] WARN: Socket IFNAME does not match route-to-master interface (may hang init_process_group)
```
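The lookup described above can be sketched as two small parsing helpers. This is a hypothetical illustration, not the Primus code: the function names and field positions are assumptions based on standard `iproute2` output.

```python
# Hypothetical sketch of the route-to-master interface resolution.

def parse_route_dev(route_output: str):
    """Return the device from `ip route get <ip>` output, or None when
    the kernel reports a local route (this node owns the address)."""
    if route_output.startswith("local"):
        return None  # route reports `lo`; caller must find the real NIC
    tokens = route_output.split()
    return tokens[tokens.index("dev") + 1] if "dev" in tokens else None


def find_iface_for_ip(addr_show_output: str, ip: str) -> str:
    """Scan `ip -o addr show` output for the interface that owns `ip`,
    so the master node reports its physical NIC instead of loopback."""
    for line in addr_show_output.splitlines():
        fields = line.split()
        if len(fields) > 3 and fields[3].split("/")[0] == ip:
            return fields[1]
    return "lo"  # nothing matched; fall back to loopback
```

On a non-master node the first helper suffices; on the master node it returns None and the second helper recovers the physical interface.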
### Changes:
- Refine network_mode detection to calculate the actual number of nodes (nnodes) instead of relying solely on world_size.
- Support node-count detection from Slurm (`SLURM_NNODES`), OpenMPI (`OMPI_COMM_WORLD_SIZE` / `OMPI_COMM_WORLD_LOCAL_SIZE`), and PyTorch (`WORLD_SIZE` / `LOCAL_WORLD_SIZE`) environment variables.
- Set network_mode="multi-node" only if nnodes > 1.

### Reason for changes:
The previous logic incorrectly classified single-node distributed training (e.g., multi-GPU on one machine) as "multi-node" simply because the world size was greater than 1. This change ensures that network_mode accurately reflects whether training spans multiple physical nodes.
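The detection order above can be sketched as follows. This is an illustrative reconstruction; the exact precedence and helper name in Primus may differ.

```python
import os

def detect_nnodes(env=None) -> int:
    """Best-effort node count, checked in the priority order described
    above: Slurm, then OpenMPI, then PyTorch env vars."""
    env = os.environ if env is None else env
    if "SLURM_NNODES" in env:
        return int(env["SLURM_NNODES"])
    if "OMPI_COMM_WORLD_SIZE" in env and "OMPI_COMM_WORLD_LOCAL_SIZE" in env:
        world = int(env["OMPI_COMM_WORLD_SIZE"])
        local = int(env["OMPI_COMM_WORLD_LOCAL_SIZE"])
        return max(1, world // max(1, local))
    if "WORLD_SIZE" in env and "LOCAL_WORLD_SIZE" in env:
        world = int(env["WORLD_SIZE"])
        local = int(env["LOCAL_WORLD_SIZE"])
        return max(1, world // max(1, local))
    return 1

# 8 GPUs on one machine is no longer misclassified as multi-node:
network_mode = "multi-node" if detect_nnodes() > 1 else "single-node"
```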
Co-authored-by: vidushi8 <vidgoyal@amd.com> Co-authored-by: Kailash Gogineni <gkailashnath1998@gmail.com> Co-authored-by: Mingyu Yang <Mingyu.Yang@amd.com> Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com> Co-authored-by: HuangWei-95 <Wei.Huang4@amd.com> Co-authored-by: HuangWei-95 <weihuan@amd.com> Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com> Co-authored-by: WangLingxun <linxwang@amd.com> Co-authored-by: Anshu Raina <Anshu.Raina@amd.com> Co-authored-by: wenxie-amd <Wen.Xie@amd.com>
(1) Refactor maxtext to the register-patch workflow. (2) The 'core' workflow supports docker images v25.9 and v26.1; the 'legacy' workflow supports v26.1. (3) Refactor some patches to the wrapper method.

Co-authored-by: Xiaoming-AMD <xiaoming.peng@amd.com>
chore(megatron): bump version to core_v0.16.0 --------- Co-authored-by: HuangWei-95 <weihuan@amd.com> Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com> Co-authored-by: WangLingxun <linxwang@amd.com>
…592) fix(cli): pass --debug to python and include file:line in error msg Co-authored-by: HuangWei-95 <weihuan@amd.com>
Patch Megatron validate_args in the backend base trainer to support Primus-specific argument flows for decoder_pipeline_manual_split_list and fp4.
#617) Fix API compatibility issue after the mcore upgrade to v0.16.0.
…ain and runner hook (#591) Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(1) Switch the maxtext submodule from google to rocm. (2) Remove patches that already exist in the rocm/maxtext release branch. (3) Refactor the maxtext config structure from replicate to extract.
…-found (#621) amd-aiter has two C++ kernel-loading paths with conflicting expectations for the AITER_ASM_DIR env var:
- aiter_hip_common.h appends /{gfx}/ to AITER_ASM_DIR
- codegen.py (fmha_fwd_v3_kernel) does not

When core.py includes a gfx subdirectory in AITER_ASM_DIR, the first path gets a double gfx prefix; when it doesn't, the second path can't find the .co kernel files. The new hook runs at container startup and:
1. Normalizes core.py so AITER_ASM_DIR always ends at .../hsa/
2. Creates symlinks from hsa/ into hsa/{gfx}/ so both paths resolve

Images without aiter_meta/hsa (e.g. rocm/primus:v26.1 workspace installs) are safely skipped.
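The normalization rule in step 1 can be sketched as a small path helper. This is a minimal sketch, assuming the hook only needs to strip a trailing gfx component; the real hook also performs the symlinking in step 2.

```python
from pathlib import Path

def normalize_aiter_asm_dir(asm_dir: str) -> str:
    """Trim a trailing gfx component so AITER_ASM_DIR always ends at
    .../hsa/, avoiding the double-gfx prefix in aiter_hip_common.h
    while codegen.py still resolves via the hsa/{gfx} symlinks."""
    path = Path(asm_dir.rstrip("/"))
    if path.name.startswith("gfx"):
        path = path.parent  # drop the gfx subdirectory
    return str(path) + "/"
```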
* Update to the latest Primus-Turbo for better FP8 grouped GEMM performance. For best performance, set the env `PRIMUS_TURBO_AUTO_TUNE=1`.
* Modify the Dockerfile: uninstall aiter and reinstall the aiter version Turbo uses.
* When installing aiter, precompile the attn_v3 kernel in advance to avoid JIT compilation.
* Remove some deprecated Turbo APIs in Primus.
* Enhance the Turbo test in Megatron.

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…xit code (#620)
1. Shell fix (primus-cli-direct.sh):
   - Add `set +e` before `eval "$CMD"` so the script does not exit immediately on a non-zero torchrun return code, allowing proper error logging before exit.
2. UT validation (tests/utils.py → run_training_script):
   - Extract common training-script execution logic into a shared `run_training_script()` helper, replacing duplicated code across test_megatron_trainer, test_torchtitan_trainer, and test_maxtext_trainer.
   - In the success path (exit code 0), assert that the PrimusRuntime 'Training completed.' marker is present in the log file. This catches silent training failures where torchrun returns 0 but training did not actually finish (e.g. AITER HIP errors).
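The shared helper described in point 2 might look like the following. This is a hypothetical reconstruction, not the actual tests/utils.py code; the marker string and signature are assumptions.

```python
import subprocess

COMPLETION_MARKER = "Training completed."  # assumed PrimusRuntime marker

def run_training_script(cmd, log_path):
    """Run a training command, tee its output to a log file, and on
    exit code 0 require the completion marker in the log, so a
    torchrun that exits 0 without finishing training still fails."""
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    if result.returncode == 0:
        with open(log_path) as log:
            assert COMPLETION_MARKER in log.read(), (
                "exit code 0 but training did not actually finish"
            )
    return result.returncode
```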
Remove the legacy light-megatron trainer implementation and clean up all related framework aliases. This change deletes the lightmegatron trainer modules and removes light-megatron routing from parser and hook dispatchers (train pretrain and projection performance), so framework resolution now follows megatron directly without a compatibility alias. Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Related doc: https://amd.atlassian.net/wiki/spaces/~712020ea4fade82ae94a95b7c0ba1cb554d2a8/pages/1382714769/GPT-OSS+test+with+triton+sink+attention

This PR adds support for the GPT-OSS 20B and 120B models.

Signed-off-by: Gene Der Su <e870252314@gmail.com> Co-authored-by: JohnQinAMD <yanyuan.qin@amd.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: wenxie-amd <Wen.Xie@amd.com>
…port (#622) perf-projection: add pipeline scheduler comparison and fix plotext import
- simulator.py: wrap the plotext import in try/except (optional dependency)
- projection.py: support a --pipeline-schedule-algorithm flag with options auto, zerobubble, zbv-formatted, zbv-greedy, megatron-ilp, all, for comparing pipeline schedulers including the Megatron ILP (Sea AI Lab) zero-bubble scheduler
- projection.py (CLI): add the --pipeline-schedule-algorithm argument
  - When zero-bubble is enabled and VPP=1, uses the Megatron ILP scheduler
  - When VPP>1, falls back to the Primus interleaved 1F1B scheduler
  - When VPP==2 in 'all' mode, also runs the ZBV Formatted and ZBV Greedy (min + half memory configs) schedulers for comparison
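The fallback rules above can be sketched as a selection helper. This is a hypothetical sketch: the schedule names mirror the flag's options, but the helper itself and the exact 'auto' semantics are assumptions.

```python
def schedules_to_run(algorithm: str, vpp: int):
    """Pick which pipeline schedulers to simulate, following the
    VPP-dependent fallback rules described above."""
    if algorithm == "all":
        # Base choice follows the zero-bubble rule, then adds ZBV variants.
        chosen = ["megatron-ilp"] if vpp == 1 else ["interleaved-1f1b"]
        if vpp == 2:
            chosen += ["zbv-formatted", "zbv-greedy"]
        return chosen
    if algorithm in ("auto", "zerobubble"):
        # Megatron ILP only when VPP == 1; otherwise fall back to
        # the Primus interleaved 1F1B scheduler.
        return ["megatron-ilp"] if vpp == 1 else ["interleaved-1f1b"]
    return [algorithm]
```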
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
## Summary
- Adds a new `zero-bubble-heuristic` pipeline parallelism scheduling algorithm that uses a graph-based heuristic to explore 8 candidate schedules (combinations of `allow_bubble_before_first_b`, `prioritize_b`, `no_bubble_greedy`) and selects the one with the lowest bubble time.
- Exposes configurable parameters (`pp_max_mem`, `pp_cost_f`, `pp_cost_b`, `pp_cost_w`) to control the memory budget and F/B/W cost model, enabling the scheduler to produce memory-aware schedules with realistic cost ratios.
- Enhances the PP visualization tool (`vis.py`) with per-rank F/B/W time breakdown, correct cross-rank iteration time calculation, and detailed console output for easier performance analysis.

## Changes

### Core Algorithm
- **`zerobubble_heuristic.py`** (new): Self-contained implementation of the zero-bubble-heuristic scheduler, ported from the internal Megatron ZB module into the Primus scheduler framework. Implements `_Graph` (DAG-based scheduling), `_initial_solution` (best-of-8 heuristic search), and `ScheduleZeroBubbleHeuristic` (the `PipelineScheduleAlgo` subclass that generates the schedule table with proper send/recv communication pairs).

### Integration
- **`pipeline_launcher.py`**: Registers `zero-bubble-heuristic` as a valid algorithm, passes `max_mem`/`cost_f`/`cost_b`/`cost_w` kwargs to the schedule factory, and adds `dump_pp_data` support via `schedule_wrapper`.
- **`primus_turbo.py`**: Enables split W-grad operations for the new algorithm.
- **`schedule_table_factory.py`**: Registers `ScheduleZeroBubbleHeuristic` in the algorithm map; replaces `@lru_cache` with a manual dict cache to support unhashable kwargs (lists).
- **`primus_pipeline.yaml`**: Adds config entries for `pp_max_mem`, `pp_cost_f`, `pp_cost_b`, `pp_cost_w`.
- **`megatron_pretrain_trainer.py`**: Adds a post-training PP data dump for visualization/analysis.
### Visualization & Analysis
- **`vis.py`**: Extracts a `get_fbw_times()` helper; fixes `iter_time` to use the max across all ranks (not just rank-0); adds per-rank F/B/W time and percentage breakdown to the console output.
- **`pp_simulation.yaml`**: Adds two example simulation configs (`zb-heuristic-mem8`, `zb-heuristic-mem10`).

## Algorithm Visualization
<img width="3600" height="5400" alt="image" src="https://github.com/user-attachments/assets/0d73d80b-d8c1-45d6-918f-ee05499018ea" />

Co-authored-by: root <root@smc300x-ccs-aus-a16-19.prov.aus.ccs.cpe.ice.amd.com> Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
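The `@lru_cache` replacement mentioned for `schedule_table_factory.py` can be sketched like this. It is a minimal sketch, not the real factory: `_build_schedule_table` is a placeholder for the actual schedule construction, and keying on a canonical `repr` is one way to handle unhashable list kwargs.

```python
# Manual dict cache: kwargs may contain lists (unhashable under
# @lru_cache), so key on a canonical repr of sorted kwarg items.
_schedule_cache = {}

def _build_schedule_table(algo, **kwargs):
    return {"algo": algo, **kwargs}  # placeholder construction

def get_schedule_table(algo, **kwargs):
    key = repr((algo, sorted(kwargs.items())))
    if key not in _schedule_cache:
        _schedule_cache[key] = _build_schedule_table(algo, **kwargs)
    return _schedule_cache[key]
```

Sorting the kwarg items makes the key insensitive to keyword order, which `@lru_cache` only gained natively via its internal key-making and which breaks entirely once a value is a list.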
Adding slurm script for DCGPU cluster rccl benchmarking. Some edits are from Joyce for their cluster setup. Usage example on DCGPU cluster: `DOCKER_IMAGE=rocm/primus:v26.1 NNODES=2 sbatch -N2 -w smci355-ccs-aus-n04-[25,29] -p Compute-DCPT ./run_slurm.sh` --------- Co-authored-by: Joyce Zhang <joyzhang@smci355-ccs-aus-n03-25.prov.aus.ccs.cpe.ice.amd.com> Co-authored-by: Joyce Zhang <joyzhang@amd.com> Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Add a tech blog on the Primus perf model plus related configs.
…#611) ## Summary
Add model definitions and pretrain example configs for multiple model scales that were previously missing from the Megatron and TorchTitan backends.

### Megatron
- **Model configs**: Llama2 13B, Qwen2.5 3B/14B/32B, Qwen3 4B/14B/32B
- **Pretrain example configs** (MI300X & MI355X, BF16 + FP8):
  - Llama2 13B
  - Qwen2.5 3B, 14B, 32B
  - Qwen3 4B, 8B, 14B, 32B

### TorchTitan
- **Model configs**: Llama4 Scout 17Bx16E, Llama4 Maverick 17Bx128E, DeepSeek V3 236B, Qwen3 4B/8B/14B (BF16 & FP8 variants)
- **Pretrain example configs** (MI300X & MI355X, BF16 + FP8):
  - Llama4 Scout 17Bx16E
  - Llama4 Maverick 17Bx128E
  - DeepSeek V3 236B
  - Qwen3 4B, 8B, 14B

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Add the skills file for primus projection.
…ode dual-rank filter (#605) Only NODE_RANK=0 now streams raw hook stdout/stderr in execute_hooks, while non-primary nodes still capture output for env/extra parsing and preserve hook exit semantics. Also update the direct launcher's local-ranks-filter logic to skip adding the last local rank on single-node runs, preventing duplicated rank0/rank7-style outputs in logs.
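The launcher-side filter can be sketched as follows. This is a hypothetical helper mirroring the rule described above, not the actual launcher code.

```python
def streamed_local_ranks(nnodes: int, nproc_per_node: int):
    """Local ranks whose output the direct launcher streams: rank 0
    always, plus the last local rank only on multi-node runs, so a
    single-node job doesn't log duplicated rank0/rank7-style output."""
    ranks = [0]
    if nnodes > 1 and nproc_per_node > 1:
        ranks.append(nproc_per_node - 1)
    return ranks
```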
* Update Primus-Turbo. * Add disable_turbo_grouped_mlp_low_precision * Format primus_turbo.yaml.
Docs: expand projection disclaimers for directional estimates Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
…604)
### Changes:
- Add optional argument `expect_distributed=True` to `run_preflight_info`.
- Configure `preflight` with `expect_distributed=False` during initial local-only checks.

### Reason for changes:
Suppresses the warning:
```
[Primus:Preflight] WARN: Runtime process group not initialized
```

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Keep the input in SBHD layout to reduce extra q, k, v transposes in attention.

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Co-authored-by: vidushi8 <vidgoyal@amd.com> Co-authored-by: Kailash Gogineni <gkailashnath1998@gmail.com> Co-authored-by: clairesonglee <Claire.Lee2@amd.com> Co-authored-by: Mingyu Yang <Mingyu.Yang@amd.com> Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com> Co-authored-by: HuangWei-95 <Wei.Huang4@amd.com> Co-authored-by: HuangWei-95 <weihuan@amd.com> Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com> Co-authored-by: WangLingxun <linxwang@amd.com> Co-authored-by: Anshu Raina <Anshu.Raina@amd.com> Co-authored-by: wenxie-amd <Wen.Xie@amd.com>
#634) # Fix an error for launching multi-node training jobs and add other improvements in the launching script

## Problem:
When launching multi-node training jobs using some docker images, the following error occurs:
`ImportError: /opt/rocm-7.2.0/lib/libamdhip64.so.7: undefined symbol: hsa_amd_memory_get_preferred_copy_engine, version ROCR_1`
The problem is that `/opt/rocm/lib` is not added to `LD_LIBRARY_PATH` in the docker images.

## Fix:
In `examples/run_pretrain.sh` and `runner/helpers/envs/base_env.sh`, set the `LD_LIBRARY_PATH` default value to `/opt/rocm/lib`. The order of the library paths in `LD_LIBRARY_PATH` also matters: `/opt/rocm/lib` is put before all other paths.

## Other changes:
- Allow users to set the `NCCL_CROSS_NIC` value; it was hardcoded.
- In `examples/run_local_pretrain.sh`, fixed the `TC_RESULTS` env variable.
- In `examples/run_local_pretrain.sh`, added a name for the launching container.
- For the ANP plugin, removed the hard failure; allow training to run without the ANP plugin.
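The ordering requirement in the fix can be sketched as a small path-munging helper. This is an illustrative sketch, not the shell scripts themselves; the helper name is an assumption.

```python
def prepend_rocm_lib(ld_library_path, rocm_lib="/opt/rocm/lib"):
    """Put /opt/rocm/lib ahead of every other entry (deduplicated) so
    the HIP runtime resolves hsa_amd_* symbols from the matching ROCm
    install rather than a stale copy later in the path."""
    parts = [p for p in (ld_library_path or "").split(":")
             if p and p != rocm_lib]
    return ":".join([rocm_lib] + parts)
```

In the shell scripts the same effect is commonly achieved with `export LD_LIBRARY_PATH="/opt/rocm/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"`, which prepends without leaving a trailing colon when the variable is unset.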
The aiter reinstall flow introduced with the Primus-Turbo docker update still relies on `python setup.py develop`, which now fails in Docker with `ModuleNotFoundError: vcs_versioning`. Switch to `pip install --use-pep517 -e .` so aiter resolves build dependencies through its `pyproject.toml`.
Update all references to the Primus base image across documentation, configuration files, CI/CD workflows, benchmark helpers, and example scripts to use the latest v26.2 release. Keep existing JAX/MaxText image references unchanged.