Changes from all commits (22 commits)
- `43eacb6` Update torchtitan batch size and enable CE fusion (vidushi8, Feb 13, 2026)
- `33daa5a` [Docs] & [Feature]: Add Post-Training Documentation and Update Qwen3_… (kailashg26, Feb 19, 2026)
- `8db36dc` update MI355 yaml for better perf (vidushi8, Feb 26, 2026)
- `143d593` update yaml (vidushi8, Feb 26, 2026)
- `3dade71` [Primus] Fix MoE MLA issue from hybrid models branch merge (#564) (clairesonglee, Feb 26, 2026)
- `365758c` tune hybrid model mi300x configs (clairesonglee, Feb 27, 2026)
- `06d8e1e` tune hybrid model mi355x configs (clairesonglee, Feb 27, 2026)
- `fadaeb1` Expand projection.md with memory projection and performance details. … (araina-amd, Mar 2, 2026)
- `4262afd` Merge branch 'main' into release/v26.2 (wenxie-amd, Mar 4, 2026)
- `d32360e` update yamls to fix regressions and standardize (vidushi8, Mar 6, 2026)
- `bc4c861` fix(megatron): patch validate_args and add ROCM argument validation (… (WangLingxun, Mar 6, 2026)
- `c6754d9` Merge branch 'main' into release/v26.2 (wenxie-amd, Mar 13, 2026)
- `0420f1a` fix code-lint issue (wenxie-amd, Mar 13, 2026)
- `3423cec` [Megatron-LM] Update Mamba model tokenizer (#603) (clairesonglee, Mar 14, 2026)
- `b4204f6` remove redundant params in mamba config (clairesonglee, Mar 14, 2026)
- `df4c65e` update config mi355 llama3 70b (vidushi8, Mar 16, 2026)
- `697efab` fix turbo argument in mi355 dsv3 (vidushi8, Mar 16, 2026)
- `2c26dd7` sync with release/v26.2 & add patch to calculate zebra-llama flops (Mar 17, 2026)
- `1be8773` Merge branch 'main' into dev/clairlee/update-hybrid-throughput (clairesonglee, Mar 25, 2026)
- `aafa2f4` update mbs=16 (clairesonglee, Mar 25, 2026)
- `2f738e6` create MI355X config (clairesonglee, Mar 25, 2026)
- `d271241` code lint with pre-commit (clairesonglee, Mar 25, 2026)
3 changes: 1 addition & 2 deletions in docs/backends/adding-megatron-models.md

```diff
@@ -54,7 +54,7 @@ For Megatron, there are three layers of YAML config involved:
 
 At runtime:
 
-- `modules.pre_trainer.model: llama3.1_8B.yaml` is resolved as 
+- `modules.pre_trainer.model: llama3.1_8B.yaml` is resolved as
   `primus/configs/models/megatron/llama3.1_8B.yaml`.
 - `extends: - llama3_8B.yaml` further chains into
   `primus/configs/models/megatron/llama3_8B.yaml`, which in turn extends
@@ -206,4 +206,3 @@ You should see in the printed configuration and logs that:
 
 Once these steps are done, your new `tinyllama_1.1B` config behaves like any other
 Megatron model in Primus and can be used in experiments, sweeps, and CI.
-
```
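The `extends` chaining described in this file can be made concrete with a short sketch. The file names and the `extends` key come from the text above; the override fields and their values are illustrative assumptions, not copied from the repository:

```yaml
# primus/configs/models/megatron/llama3.1_8B.yaml (illustrative sketch)
# Resolved from `modules.pre_trainer.model: llama3.1_8B.yaml` at runtime.
extends:
  - llama3_8B.yaml   # chains into the base model config in the same directory

# Only the deltas vs. the base config live here (values are assumptions):
rotary_base: 500000
max_position_embeddings: 131072
```

The base file `llama3_8B.yaml` would in turn extend a generic language-model config, so each layer only declares what it changes.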
3 changes: 1 addition & 2 deletions in docs/backends/adding-torchtitan-models.md

```diff
@@ -86,7 +86,7 @@ For TorchTitan, there are three layers of YAML involved:
 
 At runtime:
-- `modules.pre_trainer.model: llama3.1_8B.yaml` is resolved as 
+- `modules.pre_trainer.model: llama3.1_8B.yaml` is resolved as
   `primus/configs/models/torchtitan/llama3.1_8B.yaml`.
 - The TorchTitan launcher reads the `job` and `model` sections and wires up
   the actual PyTorch model + training loop.
@@ -277,4 +277,3 @@ hooked up correctly.
 Once these steps are done, your new TorchTitan model config behaves like any
 other TorchTitan model in Primus and can be used in experiments, sweeps,
 and CI.
-
```
8 changes: 4 additions & 4 deletions in docs/backends/extending-backends.md

```diff
@@ -381,11 +381,11 @@ configuring backend-specific env vars at runtime), you can use **train hooks**
 under `runner/helpers/hooks`.
 
 - **Hook locations for training**:
-  - Global hooks (run for all commands): 
+  - Global hooks (run for all commands):
     `runner/helpers/hooks/*.sh|*.py`
-  - Train-specific hooks (per framework): 
-    `runner/helpers/hooks/train/pretrain/<framework>/*.sh|*.py` 
-    `runner/helpers/hooks/train/posttrain/<framework>/*.sh|*.py` 
+  - Train-specific hooks (per framework):
+    `runner/helpers/hooks/train/pretrain/<framework>/*.sh|*.py`
+    `runner/helpers/hooks/train/posttrain/<framework>/*.sh|*.py`
     where `<framework>` is `megatron`, `torchtitan`, `dummy`, etc.
 - Files in each directory are discovered with `find ... -name "*.sh" -o -name "*.py"`
   and executed in **lexicographical order** of their filenames.
```
8 changes: 4 additions & 4 deletions in docs/posttraining.md

```diff
@@ -124,7 +124,7 @@ modules:
 
   # Fine-tuning method
   peft: "none"  # Full fine-tuning
- 
+
   # Training configuration
   train_iters: 200
   global_batch_size: 8
@@ -167,7 +167,7 @@ modules:
 
   # Fine-tuning method
   peft: lora  # LoRA fine-tuning
- 
+
   # Training configuration
   train_iters: 200
   global_batch_size: 32
@@ -181,7 +181,7 @@ modules:
 
   # Precision
   precision_config: bf16_mixed
- 
+
   # Recompute configuration
   recompute_granularity: full
   recompute_method: uniform
@@ -357,7 +357,7 @@ recompute_num_layers: 1  # Number of layers to recompute
 **For SFT:**
 - Use higher `tensor_model_parallel_size` for large models (e.g., TP=8 for 70B)
 - Consider pipeline parallelism for very large models
-- Examples: 
+- Examples:
   - 32B model: TP=1-2 (MI300X: TP=2, MI355X: TP=1)
   - 70B model: TP=8
 
```
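The SFT sizing guidance in this file can be folded into a single overrides block. The sketch below is illustrative only: the module name `post_trainer` and the exact key placement are assumptions, not taken from this diff; the individual keys and values mirror the lines shown above:

```yaml
# Hypothetical sketch: full SFT of a 70B model, following the TP guidance above.
modules:
  post_trainer:                        # module name is an assumption
    overrides:
      peft: "none"                     # full fine-tuning
      tensor_model_parallel_size: 8    # TP=8 for 70B, per the sizing notes
      pipeline_model_parallel_size: 1  # raise if activations still do not fit
      train_iters: 200
      global_batch_size: 8
      precision_config: bf16_mixed
      recompute_granularity: full
      recompute_method: uniform
      recompute_num_layers: 1
```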
2 changes: 1 addition & 1 deletion in docs/projection.md

```diff
@@ -600,7 +600,7 @@ Projected Time = (Base Time / DP_scaling_factor) + Communication Overheads
 
 **What it does**: Distributes layers across pipeline stages. Each stage processes microbatches in sequence.
 
-**How it's modeled**: 
+**How it's modeled**:
 - If PP > 1 but only 1 node available for benchmarking, PP is reduced to 1 for benchmarking
 - A **pipeline scheduler simulator** (`simulator.py`) reconstructs the full pipeline schedule
 - Simulates the actual 1F1B or zero-bubble schedule with proper send/receive synchronization
```
6 changes: 3 additions & 3 deletions in docs/tech_blogs/primus_pipeline/primus_pipeline.md

```diff
@@ -40,15 +40,15 @@ The key idea of Primus-pipeline is to separate pipeline scheduling logic from tr
 
 - Implement InputGrad/WeightGrad separation ops for GeMM and GroupGemm by redefining [Primus-Turbo](https://github.com/AMD-AGI/Primus-Turbo) ops.
 
-- Provide simulation tools for each PP algorithm in both theory and practice, which clearly simulate and measure bubble rate and memory consumption under specific configs. 
+- Provide simulation tools for each PP algorithm in both theory and practice, which clearly simulate and measure bubble rate and memory consumption under specific configs.
 
 ### Schedule Design
 
 Primus-pipeline patches and substitutes Megatron-LM's `megatron.core.pipeline_parallel.get_forward_backward_func` function. The entrypoint of the schedule logic is [PrimusPipelineParallelLauncher](https://github.com/AMD-AGI/Primus/blob/dev/yc/primus-pipe-blog/primus/backends/megatron/core/pipeline_parallel/primuspipe/pipeline_launcher.py).
 
 Here are the steps to define and run a PP algorithm in Primus-pipeline.
 
-1. **Create a ScheduleTable with ScheduleNodes**: For most PP algorithms, a schedule table containing schedule nodes can be defined given PP world size, virtual pipeline chunks per rank, and minibatches. 
+1. **Create a ScheduleTable with ScheduleNodes**: For most PP algorithms, a schedule table containing schedule nodes can be defined given PP world size, virtual pipeline chunks per rank, and minibatches.
    - **ScheduleNode**: Each step of the execution can be abstracted as a ScheduleNode, including computation nodes such as FORWARD/BACKWARD/WGRAD and communication nodes such as RECV_FORWARD/SEND_FORWARD.
    - **PP Algorithms**: [pp-algorithms](https://github.com/AMD-AGI/Primus/tree/main/primus/core/pipeline_parallel/scheduler/algorithms)
 
@@ -211,4 +211,4 @@ content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS
 PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT
 IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO
 YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE
-FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.
\ No newline at end of file
+FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.
```
8 changes: 3 additions & 5 deletions in examples/hardware_configs/custom_hardware_example.yaml

```diff
@@ -6,16 +6,14 @@ hardware_config:
   # Intra-node Communication (NVLink/xGMI equivalent)
   node_bw: 896.0   # Intra-node bandwidth per GPU (GB/s bidirectional)
   node_lat: 1.0    # Intra-node latency (microseconds)
- 
+
   # Inter-node Communication (InfiniBand/RoCE)
   pod_bw: 50.0     # Inter-node bandwidth per NIC (GB/s, 400 Gbps = 50 GB/s)
   pod_lat: 2.0     # Inter-node latency (microseconds)
- 
+
   # Network Topology
   node_size: 8       # GPUs per node
   nics_per_node: 8   # Number of network interfaces per node
- 
+
   # Bandwidth efficiency factor (0.0 to 1.0)
   bw_eff: 0.9
-
-
```
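The efficiency factor composes multiplicatively with the raw link numbers. A hedged illustration, assuming `bw_eff` is applied uniformly to both intra- and inter-node links (the example file does not state this); the derived values are plain arithmetic from the fields above, not additional config keys:

```yaml
# Derived values, for intuition only - these are NOT recognized config keys.
# effective inter-node bandwidth per NIC:  pod_bw * bw_eff  = 50.0 * 0.9  = 45.0 GB/s
# effective intra-node bandwidth per GPU:  node_bw * bw_eff = 896.0 * 0.9 = 806.4 GB/s
# aggregate inter-node bandwidth per node: pod_bw * nics_per_node * bw_eff
#                                          = 50.0 * 8 * 0.9 = 360.0 GB/s
```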
```diff
@@ -88,11 +88,11 @@ modules:
   # stage 3 is completely no gpu-cpu sync in MoE, but cost more memory
   # stage 2 is recommended for better performance
   turbo_sync_free_moe_stage: 2
- 
+
   # remove once super flag is functional again
   moe_use_fused_router_with_aux_score: true
   moe_permute_fusion: true
- 
+
   # Cross entropy flags
   cross_entropy_fusion_impl: "te"
   cross_entropy_loss_fusion: true
```
8 changes: 2 additions & 6 deletions in examples/megatron/configs/MI300X/mamba_370M-pretrain.yaml

```diff
@@ -23,8 +23,8 @@ modules:
       log_avg_reset_interval: 50
 
       train_iters: 50
-      micro_batch_size: 4
-      global_batch_size: 256
+      micro_batch_size: 16
+      global_batch_size: 128
 
       seq_length: 2048
       max_position_embeddings: 2048
@@ -44,10 +44,6 @@ modules:
       # Mamba-specific: must provide spec
       spec: ['megatron.core.models.mamba.mamba_layer_specs', 'mamba_stack_spec']
 
-      # Tokenizer
-      tokenizer_type: HuggingFaceTokenizer
-      tokenizer_model: EleutherAI/gpt-neox-20b
-
       # parallel
       tensor_model_parallel_size: 1
       pipeline_model_parallel_size: 1
```
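The new batch sizes are internally consistent: the number of micro-batches per optimizer step is `global_batch_size / (micro_batch_size * DP)`. A sketch of the bookkeeping, assuming a single 8-GPU node with pure data parallelism (DP = 8 is an assumption, not stated in the diff):

```yaml
# Batch-size bookkeeping for the updated MI300X Mamba config.
# Assumption: one node of 8 GPUs with TP=PP=EP=1, so data-parallel size DP = 8.
micro_batch_size: 16     # per-GPU batch per forward/backward pass
global_batch_size: 128   # must be divisible by micro_batch_size * DP
# gradient accumulation steps = 128 / (16 * 8) = 1, i.e. no accumulation needed
```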
```diff
@@ -107,7 +107,7 @@ modules:
   eval_iters: 0
 
   cross_entropy_loss_fusion: true
- 
+
   # recompute
   recompute_granularity: full   # full, selective
   recompute_method: block       # uniform, block
@@ -120,4 +120,4 @@ modules:
   # use_turbo_grouped_mlp: false
   # enable_primus_turbo: false
   # enable_turbo_attention_float8 : false
-  # enable_turbo_gemm_float8 : false
\ No newline at end of file
+  # enable_turbo_gemm_float8 : false
```
```diff
@@ -107,7 +107,7 @@ modules:
   eval_iters: 0
 
   cross_entropy_loss_fusion: true
- 
+
   # recompute
   recompute_granularity: full   # full, selective
   recompute_method: block       # uniform, block
@@ -119,4 +119,4 @@ modules:
   # use_turbo_grouped_mlp: false
   # enable_primus_turbo: false
   # enable_turbo_attention_float8 : false
-  # enable_turbo_gemm_float8 : false
\ No newline at end of file
+  # enable_turbo_gemm_float8 : false
```
```diff
@@ -113,4 +113,4 @@ modules:
   # use_turbo_grouped_mlp: false
   # enable_primus_turbo: false
   # enable_turbo_attention_float8 : false
-  # enable_turbo_gemm_float8 : false
\ No newline at end of file
+  # enable_turbo_gemm_float8 : false
```
```diff
@@ -113,4 +113,4 @@ modules:
   # use_turbo_grouped_mlp: false
   # enable_primus_turbo: false
   # enable_turbo_attention_float8 : false
-  # enable_turbo_gemm_float8 : false
\ No newline at end of file
+  # enable_turbo_gemm_float8 : false
```
```diff
@@ -69,7 +69,7 @@ modules:
   # recompute
   recompute_granularity: full   # full, selective
   recompute_method: block       # uniform, block
-  recompute_num_layers: 80      # int
+  recompute_num_layers: 30      # int
 
   # Cross entropy flags
   cross_entropy_fusion_impl: "te"
```
81 changes: 81 additions & 0 deletions in examples/megatron/configs/MI355X/mamba_370M-pretrain.yaml (new file)

```yaml
work_group: ${PRIMUS_TEAM:amd}
user_name: ${PRIMUS_USER:root}
exp_name: ${PRIMUS_EXP_NAME:mamba_370M-pretrain}
workspace: ${PRIMUS_WORKSPACE:./output}

modules:
  pre_trainer:
    framework: megatron
    config: pre_trainer.yaml

    # model to run
    model: mamba_370M.yaml
    overrides:
      # log
      wandb_project: "Primus_Mamba_Pretrain"
      # disable_wandb: false
      # disable_tensorboard: false
      stderr_sink_level: DEBUG

      eval_iters: 0

      log_avg_skip_iterations: 2
      log_avg_reset_interval: 50

      train_iters: 50
      micro_batch_size: 32
      global_batch_size: 256

      seq_length: 2048
      max_position_embeddings: 2048

      lr: 3.0e-4
      min_lr: 0.0
      lr_warmup_iters: 50000
      lr_decay_iters: 73192188
      lr_decay_style: cosine
      weight_decay: 0.1
      adam_beta1: 0.9
      adam_beta2: 0.95
      eod_mask_loss: true
      init_method_std: 0.02
      norm_epsilon: 1.0e-5

      # Mamba-specific: must provide spec
      spec: ['megatron.core.models.mamba.mamba_layer_specs', 'mamba_stack_spec']

      # parallel
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
      overlap_grad_reduce: true
      overlap_param_gather: true
      gradient_accumulation_fusion: false

      # data
      mock_data: true
      train_data_path: null
      valid_data_path: null
      test_data_path: null

      # ckpt
      finetune: false
      auto_continue_train: false
      load: null
      no_load_optim: null
      no_load_rng: null
      save: null
      save_interval: 20000
      no_save_optim: null
      no_save_rng: null
      disable_last_saving: true
      ckpt_format: torch

      # Turbo - may need to disable for Mamba if not supported
      enable_primus_turbo: false
      use_turbo_attention: false
      use_turbo_grouped_mlp: false

      # Cross entropy flags
      # cross_entropy_fusion_impl: "native"
      # cross_entropy_loss_fusion: false
```
```diff
@@ -55,4 +55,3 @@ modules:
   recompute_granularity: full
   recompute_method: uniform
   recompute_num_layers: 1
-
```
```diff
@@ -55,4 +55,3 @@ modules:
   recompute_granularity: full
   recompute_method: uniform
   recompute_num_layers: 1
-
```