Changes from all commits (22 commits)
- `43eacb6` Update torchtitan batch size and enable CE fusion (vidushi8, Feb 13, 2026)
- `33daa5a` [Docs] & [Feature]: Add Post-Training Documentation and Update Qwen3_… (kailashg26, Feb 19, 2026)
- `8db36dc` update MI355 yaml for better perf (vidushi8, Feb 26, 2026)
- `143d593` update yaml (vidushi8, Feb 26, 2026)
- `3dade71` [Primus] Fix MoE MLA issue from hybrid models branch merge (#564) (clairesonglee, Feb 26, 2026)
- `365758c` tune hybrid model mi300x configs (clairesonglee, Feb 27, 2026)
- `06d8e1e` tune hybrid model mi355x configs (clairesonglee, Feb 27, 2026)
- `fadaeb1` Expand projection.md with memory projection and performance details. … (araina-amd, Mar 2, 2026)
- `4262afd` Merge branch 'main' into release/v26.2 (wenxie-amd, Mar 4, 2026)
- `d32360e` update yamls to fix regressions and standardize (vidushi8, Mar 6, 2026)
- `bc4c861` fix(megatron): patch validate_args and add ROCM argument validation (… (WangLingxun, Mar 6, 2026)
- `c6754d9` Merge branch 'main' into release/v26.2 (wenxie-amd, Mar 13, 2026)
- `0420f1a` fix code-lint issue (wenxie-amd, Mar 13, 2026)
- `3423cec` [Megatron-LM] Update Mamba model tokenizer (#603) (clairesonglee, Mar 14, 2026)
- `b4204f6` remove redundant params in mamba config (clairesonglee, Mar 14, 2026)
- `df4c65e` update config mi355 llama3 70b (vidushi8, Mar 16, 2026)
- `697efab` fix turbo argument in mi355 dsv3 (vidushi8, Mar 16, 2026)
- `2c26dd7` sync with release/v26.2 & add patch to calculate zebra-llama flops (Mar 17, 2026)
- `1be8773` Merge branch 'main' into dev/clairlee/update-hybrid-throughput (clairesonglee, Mar 25, 2026)
- `aafa2f4` update mbs=16 (clairesonglee, Mar 25, 2026)
- `2f738e6` create MI355X config (clairesonglee, Mar 25, 2026)
- `d271241` code lint with pre-commit (clairesonglee, Mar 25, 2026)
3 changes: 1 addition & 2 deletions in docs/backends/adding-megatron-models.md

```diff
@@ -54,7 +54,7 @@ For Megatron, there are three layers of YAML config involved:
 
 At runtime:
 
-- `modules.pre_trainer.model: llama3.1_8B.yaml` is resolved as 
+- `modules.pre_trainer.model: llama3.1_8B.yaml` is resolved as
   `primus/configs/models/megatron/llama3.1_8B.yaml`.
 - `extends: - llama3_8B.yaml` further chains into
   `primus/configs/models/megatron/llama3_8B.yaml`, which in turn extends
@@ -206,4 +206,3 @@ You should see in the printed configuration and logs that:
 
 Once these steps are done, your new `tinyllama_1.1B` config behaves like any other
 Megatron model in Primus and can be used in experiments, sweeps, and CI.
-
```
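The `extends` chaining described in this file can be made concrete with a short sketch. The file names and the `extends` key come from the text above; the override fields and their values are illustrative assumptions, not copied from the repository:

```yaml
# primus/configs/models/megatron/llama3.1_8B.yaml (illustrative sketch)
# Resolved from `modules.pre_trainer.model: llama3.1_8B.yaml` at runtime.
extends:
  - llama3_8B.yaml   # chains into the base model config in the same directory

# Only the deltas vs. the base config live here (values are assumptions):
rotary_base: 500000
max_position_embeddings: 131072
```

The base file `llama3_8B.yaml` would in turn extend a generic language-model config, so each layer only declares what it changes.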
3 changes: 1 addition & 2 deletions in docs/backends/adding-torchtitan-models.md

```diff
@@ -86,7 +86,7 @@ For TorchTitan, there are three layers of YAML involved:
 
 At runtime:
-- `modules.pre_trainer.model: llama3.1_8B.yaml` is resolved as 
+- `modules.pre_trainer.model: llama3.1_8B.yaml` is resolved as
   `primus/configs/models/torchtitan/llama3.1_8B.yaml`.
 - The TorchTitan launcher reads the `job` and `model` sections and wires up
   the actual PyTorch model + training loop.
@@ -277,4 +277,3 @@ hooked up correctly.
 Once these steps are done, your new TorchTitan model config behaves like any
 other TorchTitan model in Primus and can be used in experiments, sweeps,
 and CI.
-
```
8 changes: 4 additions & 4 deletions in docs/backends/extending-backends.md

```diff
@@ -381,11 +381,11 @@ configuring backend-specific env vars at runtime), you can use **train hooks**
 under `runner/helpers/hooks`.
 
 - **Hook locations for training**:
-  - Global hooks (run for all commands): 
+  - Global hooks (run for all commands):
     `runner/helpers/hooks/*.sh|*.py`
-  - Train-specific hooks (per framework): 
-    `runner/helpers/hooks/train/pretrain/<framework>/*.sh|*.py` 
-    `runner/helpers/hooks/train/posttrain/<framework>/*.sh|*.py` 
+  - Train-specific hooks (per framework):
+    `runner/helpers/hooks/train/pretrain/<framework>/*.sh|*.py`
+    `runner/helpers/hooks/train/posttrain/<framework>/*.sh|*.py`
     where `<framework>` is `megatron`, `torchtitan`, `dummy`, etc.
 - Files in each directory are discovered with `find ... -name "*.sh" -o -name "*.py"`
   and executed in **lexicographical order** of their filenames.
```
8 changes: 4 additions & 4 deletions in docs/posttraining.md

```diff
@@ -124,7 +124,7 @@ modules:
 
   # Fine-tuning method
   peft: "none"  # Full fine-tuning
- 
+
   # Training configuration
   train_iters: 200
   global_batch_size: 8
@@ -167,7 +167,7 @@ modules:
 
   # Fine-tuning method
   peft: lora  # LoRA fine-tuning
- 
+
   # Training configuration
   train_iters: 200
   global_batch_size: 32
@@ -181,7 +181,7 @@ modules:
 
   # Precision
   precision_config: bf16_mixed
- 
+
   # Recompute configuration
   recompute_granularity: full
   recompute_method: uniform
@@ -357,7 +357,7 @@ recompute_num_layers: 1  # Number of layers to recompute
 **For SFT:**
 - Use higher `tensor_model_parallel_size` for large models (e.g., TP=8 for 70B)
 - Consider pipeline parallelism for very large models
-- Examples: 
+- Examples:
   - 32B model: TP=1-2 (MI300X: TP=2, MI355X: TP=1)
   - 70B model: TP=8
 
```
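The SFT sizing guidance in this file can be folded into a single overrides block. The sketch below is illustrative only: the module name `post_trainer` and the exact key placement are assumptions, not taken from this diff; the individual keys and values mirror the lines shown above:

```yaml
# Hypothetical sketch: full SFT of a 70B model, following the TP guidance above.
modules:
  post_trainer:                        # module name is an assumption
    overrides:
      peft: "none"                     # full fine-tuning
      tensor_model_parallel_size: 8    # TP=8 for 70B, per the sizing notes
      pipeline_model_parallel_size: 1  # raise if activations still do not fit
      train_iters: 200
      global_batch_size: 8
      precision_config: bf16_mixed
      recompute_granularity: full
      recompute_method: uniform
      recompute_num_layers: 1
```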
2 changes: 1 addition & 1 deletion in docs/projection.md

```diff
@@ -600,7 +600,7 @@ Projected Time = (Base Time / DP_scaling_factor) + Communication Overheads
 
 **What it does**: Distributes layers across pipeline stages. Each stage processes microbatches in sequence.
 
-**How it's modeled**: 
+**How it's modeled**:
 - If PP > 1 but only 1 node available for benchmarking, PP is reduced to 1 for benchmarking
 - A **pipeline scheduler simulator** (`simulator.py`) reconstructs the full pipeline schedule
 - Simulates the actual 1F1B or zero-bubble schedule with proper send/receive synchronization
```
6 changes: 3 additions & 3 deletions in docs/tech_blogs/primus_pipeline/primus_pipeline.md

```diff
@@ -40,15 +40,15 @@ The key idea of Primus-pipeline is to separate pipeline scheduling logic from tr
 
 - Implement InputGrad/WeightGrad separation ops for GeMM and GroupGemm by redefining [Primus-Turbo](https://github.com/AMD-AGI/Primus-Turbo) ops.
 
-- Provide simulation tools for each PP algorithm in both theory and practice, which clearly simulate and measure bubble rate and memory consumption under specific configs. 
+- Provide simulation tools for each PP algorithm in both theory and practice, which clearly simulate and measure bubble rate and memory consumption under specific configs.
 
 ### Schedule Design
 
 Primus-pipeline patches and substitutes Megatron-LM's `megatron.core.pipeline_parallel.get_forward_backward_func` function. The entrypoint of the schedule logic is [PrimusPipelineParallelLauncher](https://github.com/AMD-AGI/Primus/blob/dev/yc/primus-pipe-blog/primus/backends/megatron/core/pipeline_parallel/primuspipe/pipeline_launcher.py).
 
 Here are the steps to define and run a PP algorithm in Primus-pipeline.
 
-1. **Create a ScheduleTable with ScheduleNodes**: For most PP algorithms, a schedule table containing schedule nodes can be defined given PP world size, virtual pipeline chunks per rank, and minibatches. 
+1. **Create a ScheduleTable with ScheduleNodes**: For most PP algorithms, a schedule table containing schedule nodes can be defined given PP world size, virtual pipeline chunks per rank, and minibatches.
    - **ScheduleNode**: Each step of the execution can be abstracted as a ScheduleNode, including computation nodes such as FORWARD/BACKWARD/WGRAD and communication nodes such as RECV_FORWARD/SEND_FORWARD.
    - **PP Algorithms**: [pp-algorithms](https://github.com/AMD-AGI/Primus/tree/main/primus/core/pipeline_parallel/scheduler/algorithms)
 
@@ -211,4 +211,4 @@ content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS
 PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT
 IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO
 YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE
-FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.
\ No newline at end of file
+FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.
```
8 changes: 3 additions & 5 deletions in examples/hardware_configs/custom_hardware_example.yaml

```diff
@@ -6,16 +6,14 @@ hardware_config:
   # Intra-node Communication (NVLink/xGMI equivalent)
   node_bw: 896.0   # Intra-node bandwidth per GPU (GB/s bidirectional)
   node_lat: 1.0    # Intra-node latency (microseconds)
- 
+
   # Inter-node Communication (InfiniBand/RoCE)
   pod_bw: 50.0     # Inter-node bandwidth per NIC (GB/s, 400 Gbps = 50 GB/s)
   pod_lat: 2.0     # Inter-node latency (microseconds)
- 
+
   # Network Topology
   node_size: 8       # GPUs per node
   nics_per_node: 8   # Number of network interfaces per node
- 
+
   # Bandwidth efficiency factor (0.0 to 1.0)
   bw_eff: 0.9
-
-
```
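The efficiency factor composes multiplicatively with the raw link numbers. A hedged illustration, assuming `bw_eff` is applied uniformly to both intra- and inter-node links (the example file does not state this); the derived values are plain arithmetic from the fields above, not additional config keys:

```yaml
# Derived values, for intuition only - these are NOT recognized config keys.
# effective inter-node bandwidth per NIC:  pod_bw * bw_eff  = 50.0 * 0.9  = 45.0 GB/s
# effective intra-node bandwidth per GPU:  node_bw * bw_eff = 896.0 * 0.9 = 806.4 GB/s
# aggregate inter-node bandwidth per node: pod_bw * nics_per_node * bw_eff
#                                          = 50.0 * 8 * 0.9 = 360.0 GB/s
```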
```diff
@@ -88,11 +88,11 @@ modules:
   # stage 3 is completely no gpu-cpu sync in MoE, but cost more memory
   # stage 2 is recommended for better performance
   turbo_sync_free_moe_stage: 2
- 
+
   # remove once super flag is functional again
   moe_use_fused_router_with_aux_score: true
   moe_permute_fusion: true
- 
+
   # Cross entropy flags
   cross_entropy_fusion_impl: "te"
   cross_entropy_loss_fusion: true
```
8 changes: 2 additions & 6 deletions in examples/megatron/configs/MI300X/mamba_370M-pretrain.yaml

```diff
@@ -23,8 +23,8 @@ modules:
       log_avg_reset_interval: 50
 
       train_iters: 50
-      micro_batch_size: 4
-      global_batch_size: 256
+      micro_batch_size: 16
+      global_batch_size: 128
 
       seq_length: 2048
       max_position_embeddings: 2048
@@ -44,10 +44,6 @@ modules:
       # Mamba-specific: must provide spec
       spec: ['megatron.core.models.mamba.mamba_layer_specs', 'mamba_stack_spec']
 
-      # Tokenizer
-      tokenizer_type: HuggingFaceTokenizer
-      tokenizer_model: EleutherAI/gpt-neox-20b
-
       # parallel
       tensor_model_parallel_size: 1
       pipeline_model_parallel_size: 1
```
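The new batch sizes are internally consistent: the number of micro-batches per optimizer step is `global_batch_size / (micro_batch_size * DP)`. A sketch of the bookkeeping, assuming a single 8-GPU node with pure data parallelism (DP = 8 is an assumption, not stated in the diff):

```yaml
# Batch-size bookkeeping for the updated MI300X Mamba config.
# Assumption: one node of 8 GPUs with TP=PP=EP=1, so data-parallel size DP = 8.
micro_batch_size: 16     # per-GPU batch per forward/backward pass
global_batch_size: 128   # must be divisible by micro_batch_size * DP
# gradient accumulation steps = 128 / (16 * 8) = 1, i.e. no accumulation needed
```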
```diff
@@ -107,7 +107,7 @@ modules:
   eval_iters: 0
 
   cross_entropy_loss_fusion: true
- 
+
   # recompute
   recompute_granularity: full   # full, selective
   recompute_method: block       # uniform, block
@@ -120,4 +120,4 @@ modules:
   # use_turbo_grouped_mlp: false
   # enable_primus_turbo: false
   # enable_turbo_attention_float8 : false
-  # enable_turbo_gemm_float8 : false
\ No newline at end of file
+  # enable_turbo_gemm_float8 : false
```
```diff
@@ -107,7 +107,7 @@ modules:
   eval_iters: 0
 
   cross_entropy_loss_fusion: true
- 
+
   # recompute
   recompute_granularity: full   # full, selective
   recompute_method: block       # uniform, block
@@ -119,4 +119,4 @@ modules:
   # use_turbo_grouped_mlp: false
   # enable_primus_turbo: false
   # enable_turbo_attention_float8 : false
-  # enable_turbo_gemm_float8 : false
\ No newline at end of file
+  # enable_turbo_gemm_float8 : false
```
```diff
@@ -113,4 +113,4 @@ modules:
   # use_turbo_grouped_mlp: false
   # enable_primus_turbo: false
   # enable_turbo_attention_float8 : false
-  # enable_turbo_gemm_float8 : false
\ No newline at end of file
+  # enable_turbo_gemm_float8 : false
```
```diff
@@ -113,4 +113,4 @@ modules:
   # use_turbo_grouped_mlp: false
   # enable_primus_turbo: false
   # enable_turbo_attention_float8 : false
-  # enable_turbo_gemm_float8 : false
\ No newline at end of file
+  # enable_turbo_gemm_float8 : false
```
```diff
@@ -69,7 +69,7 @@ modules:
   # recompute
   recompute_granularity: full   # full, selective
   recompute_method: block       # uniform, block
-  recompute_num_layers: 80      # int
+  recompute_num_layers: 30      # int
 
   # Cross entropy flags
   cross_entropy_fusion_impl: "te"
```
81 changes: 81 additions & 0 deletions in examples/megatron/configs/MI355X/mamba_370M-pretrain.yaml (new file)

```yaml
work_group: ${PRIMUS_TEAM:amd}
user_name: ${PRIMUS_USER:root}
exp_name: ${PRIMUS_EXP_NAME:mamba_370M-pretrain}
workspace: ${PRIMUS_WORKSPACE:./output}

modules:
  pre_trainer:
    framework: megatron
    config: pre_trainer.yaml

    # model to run
    model: mamba_370M.yaml
    overrides:
      # log
      wandb_project: "Primus_Mamba_Pretrain"
      # disable_wandb: false
      # disable_tensorboard: false
      stderr_sink_level: DEBUG

      eval_iters: 0

      log_avg_skip_iterations: 2
      log_avg_reset_interval: 50

      train_iters: 50
      micro_batch_size: 32
      global_batch_size: 256

      seq_length: 2048
      max_position_embeddings: 2048

      lr: 3.0e-4
      min_lr: 0.0
      lr_warmup_iters: 50000
      lr_decay_iters: 73192188
      lr_decay_style: cosine
      weight_decay: 0.1
      adam_beta1: 0.9
      adam_beta2: 0.95
      eod_mask_loss: true
      init_method_std: 0.02
      norm_epsilon: 1.0e-5

      # Mamba-specific: must provide spec
      spec: ['megatron.core.models.mamba.mamba_layer_specs', 'mamba_stack_spec']

      # parallel
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
      overlap_grad_reduce: true
      overlap_param_gather: true
      gradient_accumulation_fusion: false

      # data
      mock_data: true
      train_data_path: null
      valid_data_path: null
      test_data_path: null

      # ckpt
      finetune: false
      auto_continue_train: false
      load: null
      no_load_optim: null
      no_load_rng: null
      save: null
      save_interval: 20000
      no_save_optim: null
      no_save_rng: null
      disable_last_saving: true
      ckpt_format: torch

      # Turbo - may need to disable for Mamba if not supported
      enable_primus_turbo: false
      use_turbo_attention: false
      use_turbo_grouped_mlp: false

      # Cross entropy flags
      # cross_entropy_fusion_impl: "native"
      # cross_entropy_loss_fusion: false
```
```diff
@@ -55,4 +55,3 @@ modules:
   recompute_granularity: full
   recompute_method: uniform
   recompute_num_layers: 1
-
```
```diff
@@ -55,4 +55,3 @@ modules:
   recompute_granularity: full
   recompute_method: uniform
   recompute_num_layers: 1
-
```