⚠️ NOTE: This is a draft version and not the final release.

Author: AMD AI Brain - Training at Scale (TAS) Team
Published: 2025-11-10
Tags: #ROCm #LLM-Training #Primus #DevTools #AMD-GPU
Imagine this scenario:
You've just taken over a large model training project, and the project directory is littered with dozens of Bash scripts: setup_env_mi300x.sh, run_megatron_8node.sh, check_network.sh, preprocess_data_v2_final_really.sh... Each script has its own logic, and they're tightly coupled with each other.
When you want to reproduce a training run on a new GPU cluster, you need to:
- Manually detect the GPU model and find the corresponding environment configuration script
- Modify distributed parameters based on the number of nodes
- Ensure data preprocessing scripts execute before training
- Manually set dozens of environment variables
- Pray that everything runs smoothly...
This is the real problem we encountered when building the Primus platform.
In the AMD GPU large model training ecosystem, we face complexity at multiple levels:
- Environment Setup: Different GPU models (MI300X, MI250X) require different ROCm configurations
- Network Topology: RCCL/NCCL environments and InfiniBand configurations vary widely
- Framework Orchestration: Megatron, TorchTitan, JAX and other frameworks each have their own characteristics
- Execution Environment: Three scenarios: local development, containers, and Slurm clusters
- Performance Validation: GEMM benchmarks, communication performance testing
- Pre/Post Processing: Data preprocessing, environment checks, hotfix patches
The traditional approach is to use a large number of Bash scripts to handle these aspects, but this brings:
- Repeated logic that is difficult to maintain
- No unified error handling
- Experiments that are difficult to reproduce
- A high barrier to entry for newcomers
Primus CLI was born to solve these pain points.
Our core philosophy is simple: One command, from environment configuration to training launch, fully automated.
```bash
# Just this simple!
primus-cli direct -- train pretrain --config deepseek_v2.yaml
```

Primus CLI adopts a clear three-layer structure + plugin system:
```
┌──────────────────────────────────────────────────────┐
│                    Runtime Layer                     │
│              direct | container | slurm              │
│  Auto-detect env, configure GPU, manage distributed  │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────┼───────────────────────────┐
│                  Hook/Patch System                   │
│      Data preprocessing | Env checks | Hotfixes      │
│            Pluggable pre/post task logic             │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────┼───────────────────────────┐
│                 Task Execution Layer                 │
│       train | benchmark | preflight | analyze        │
│        Plugin-based business logic and tasks         │
└──────────────────────────────────────────────────────┘
```
| Goal | Implementation | User Benefits |
|---|---|---|
| Consistency | Unified CLI entry, supports Megatron/TorchTitan/JAX | No need to learn different commands for different frameworks |
| Extensibility | Auto-discovered plugins, dynamic registration of new features | Add new features without modifying core code |
| Debuggability | Rank-aware logging, detailed execution tracing | Quickly locate multi-node training issues |
| Reproducibility | Auto-export of runtime environment and config | One-click reproduction of any historical experiment |
Different scenarios require different runtime environments, but users shouldn't have to worry about these details. Primus CLI provides three seamlessly switchable runtime modes:
| Mode | Use Case | Typical Command |
|---|---|---|
| Direct | Local dev, quick validation | `primus-cli direct -- train pretrain` |
| Container | Environment isolation, dependency management | `primus-cli container --image rocm/megatron:v25.8 -- train pretrain` |
| Slurm | Multi-node production training | `primus-cli slurm srun -N 8 -- train pretrain` |
Key Highlight: These three modes share the same command syntax, differing only in runtime environment preparation:
```bash
# Local testing
primus-cli direct -- benchmark gemm -M 4096

# Container testing (ensure environment consistency)
primus-cli container -- benchmark gemm -M 4096

# Production environment (8-node cluster)
primus-cli slurm srun -N 8 -- benchmark gemm -M 4096
```

From development to production, just change the runtime mode - commands and parameters stay the same!
Training is more than just running a Python script. You might need to:
- Preprocess datasets before training
- Check the GPU and network environment
- Apply temporary hotfixes
- Collect and report metrics
Primus CLI's Hook system automates all of this:
```
# Directory structure
runner/helpers/hooks/
└── train/
    └── pretrain/
        ├── 01_check_environment.sh
        ├── 02_prepare_dataset.sh
        └── 03_setup_monitoring.sh
```

When you run `primus-cli train pretrain`, these Hooks execute automatically in order. No manual invocation, no training code modification needed.
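A hook is just an executable script that runs before (or after) the task. As a concrete illustration, a minimal `01_check_environment.sh` could look like the sketch below; the specific checks and the `PRIMUS_OUTPUT_DIR` variable are assumptions for illustration, not the actual Primus hook contents:

```shell
#!/usr/bin/env bash
# Illustrative pre-train hook: fail fast if the environment looks broken.
# The checks and PRIMUS_OUTPUT_DIR are examples, not actual Primus contents.
set -euo pipefail

# 1. A ROCm runtime should be visible on the node.
if ! command -v rocm-smi >/dev/null 2>&1; then
    echo "[hook] WARNING: rocm-smi not found; is ROCm installed?" >&2
fi

# 2. Training needs a writable output directory.
OUTPUT_DIR="${PRIMUS_OUTPUT_DIR:-./output}"
mkdir -p "$OUTPUT_DIR"
[ -w "$OUTPUT_DIR" ] || { echo "[hook] ERROR: $OUTPUT_DIR not writable" >&2; exit 1; }

echo "[hook] environment check passed"
```

Because hooks exit nonzero on failure, a broken environment stops the run before any GPU time is wasted.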
The Patch system is for more flexible scenarios:
```bash
primus-cli direct --patch fixes/workaround_rccl.sh \
  -- train pretrain --config config.yaml
```

This is especially useful when you need to quickly apply temporary fixes or make environment-specific adjustments.
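A patch script is plain Bash sourced before the task starts. A hypothetical `workaround_rccl.sh` might simply pin a few RCCL/NCCL environment variables; the values below are placeholders for illustration, not tuned recommendations:

```shell
#!/usr/bin/env bash
# Hypothetical fixes/workaround_rccl.sh: temporary environment tweaks
# applied before training. Values are placeholders, not recommendations.

# Ask RCCL/NCCL to log enough detail to debug a suspected hang.
export NCCL_DEBUG=INFO

# Pin the network interface used for bootstrap traffic (site-specific).
export NCCL_SOCKET_IFNAME=eth0

echo "[patch] RCCL workaround applied"
```

Since the patch runs in the launch environment, every rank inherits these settings without touching the training code.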
This layer is responsible for executing the specific business logic: training, testing, environment checks, and other concrete tasks. Remember we said "zero-intrusion extension"? How is this achieved?
```bash
# Training task
primus-cli direct -- train pretrain --config deepseek_v2.yaml

# Performance testing task
primus-cli direct -- benchmark gemm --dtype bf16 -M 8192

# Environment check task
primus-cli direct -- preflight check --gpu --network
```

Each task is an independent Python plugin module, auto-registered via decorators:
```python
from primus.cli.registry import register_subcommand

@register_subcommand("train")
def run_train(args, unknown_args):
    # Your training business logic
    ...
```

Want to add a new task? Just create a new plugin file in `primus/cli/subcommands/` without modifying any core code. For example, if you want to add a topology analysis task:
```bash
# Add file: primus/cli/subcommands/analyze.py
# And you can use it directly
primus-cli direct -- analyze topology --visualize
```

This plugin-based design allows Primus CLI to respond quickly to new requirements while keeping the core stable as functionality expands.
This is probably the most impressive piece of engineering in Primus CLI.
AMD's GPU family is rich: MI300X, MI250X, MI210... Each GPU has its optimal ROCm configuration and environment variable settings. The traditional approach is to let users manually select configurations, but this is both error-prone and insufficiently automated.
Step 1: Load Common Environment
```bash
# base_env.sh provides unified logging and utility functions
source "${SCRIPT_DIR}/base_env.sh"
```

Step 2: Auto-Detect GPU
```bash
# Intelligently detect the current node's GPU model
GPU_MODEL=$(bash "${SCRIPT_DIR}/detect_gpu.sh")
# Output: MI300X
```

Step 3: Load GPU-Specific Configuration
```bash
# Based on the detection result, auto-load the optimal configuration
GPU_CONFIG_FILE="${SCRIPT_DIR}/${GPU_MODEL}.sh"
source "$GPU_CONFIG_FILE"  # Load MI300X.sh
```

Now MI300X.sh can contain all the best practices for this GPU model:
- ROCm environment variables (`HSA_*`, `HIP_*`)
- RCCL communication optimization parameters
- Memory management strategies
- Performance tuning options
Users don't need to worry about these details at all - everything is automatic.
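To make this concrete, a per-GPU profile like `MI300X.sh` could look like the sketch below. Every value is illustrative, chosen only to show the shape of such a file; they are not AMD's shipped tuning, and `PRIMUS_GPU_MODEL` is a name assumed for this example:

```shell
#!/usr/bin/env bash
# Illustrative MI300X.sh: per-GPU environment profile loaded after detection.
# All values below are placeholders, not AMD's actual tuned defaults.

export PRIMUS_GPU_MODEL="MI300X"

# Example ROCm runtime knob from the HSA_* family.
export HSA_ENABLE_SDMA=1

# Example RCCL/NCCL communication setting.
export NCCL_DEBUG=WARN

echo "[env] loaded profile for ${PRIMUS_GPU_MODEL}"
```

Because the profile is just sourced shell, adding a new GPU model means adding one file; nothing else in the pipeline changes.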
```bash
# On an MI300X cluster
primus-cli direct -- train pretrain --config config.yaml
# Auto-loads: MI300X.sh → Optimized RCCL configuration

# Switch to an MI250X cluster, same command
primus-cli direct -- train pretrain --config config.yaml
# Auto-loads: MI250X.sh → Adapted configuration
```

Cross-GPU migration with zero configuration changes!
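Under the hood, a `detect_gpu.sh` can be as simple as mapping the gfx architecture reported by the ROCm stack to a marketing name. The sketch below shows one plausible mapping step (the actual Primus script may differ; the fallback value is an assumption):

```shell
#!/usr/bin/env bash
# Sketch of GPU detection: map a gfx architecture id to a model name.
# The mapping table and UNKNOWN fallback are illustrative.

gfx_to_model() {
    case "$1" in
        gfx942) echo "MI300X" ;;
        gfx90a) echo "MI250X" ;;   # also covers MI210-class parts
        *)      echo "UNKNOWN" ;;
    esac
}

# In a real script the gfx id would come from e.g. `rocminfo` output;
# here we only demonstrate the mapping step.
gfx_to_model gfx942    # prints: MI300X
```

The detected name then selects which `<MODEL>.sh` profile gets sourced.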
In machine learning research, reproducibility is crucial. But reality is harsh:
"I ran this experiment three months ago, now I want to reproduce it... Wait, where's the config file? How did I set those environment variables again?"
Does this sound familiar? Primus CLI completely solves this problem with an automated snapshot mechanism.
Every time training starts, Primus CLI automatically saves the complete runtime context:
```
output/exp_2025_11_10_134522/
├── env/
│   ├── primus_env_dump.txt   # All environment variables
│   ├── gpu_model.txt         # GPU model info
│   ├── rocm_version.txt      # ROCm version
│   └── system_info.json      # System configuration
├── config/
│   ├── primus_config.yaml    # Primus config
│   ├── model_config.yaml     # Model config
│   └── data_config.yaml      # Data config
├── logs/
│   └── launch.log            # Launch logs
└── metadata.json             # Runtime metadata
```
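The `env/` part of such a snapshot needs nothing more exotic than standard shell tools. A simplified sketch of the dump step (file names follow the layout above; the exact commands Primus uses are not shown here and this is only an approximation):

```shell
#!/usr/bin/env bash
# Simplified sketch of snapshotting the runtime context at launch time.
set -euo pipefail

EXP_DIR="output/exp_$(date +%Y_%m_%d_%H%M%S)"
mkdir -p "$EXP_DIR/env" "$EXP_DIR/config" "$EXP_DIR/logs"

# All environment variables, sorted so runs can be diffed reliably.
env | sort > "$EXP_DIR/env/primus_env_dump.txt"

# GPU info would come from the detection layer; guarded so the sketch
# also runs on machines without ROCm installed.
if command -v rocm-smi >/dev/null 2>&1; then
    rocm-smi > "$EXP_DIR/env/gpu_model.txt"
fi

echo "$EXP_DIR"
```

Sorting the dump is a small but useful choice: it turns "what changed between run A and run B?" into a plain `diff` of two text files.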
Three months later, when you want to reproduce this experiment:
```bash
# Just this simple!
primus-cli direct -- train pretrain --replay output/exp_2025_11_10_134522/
```

Primus CLI will automatically:
- Restore all environment variables
- Load original configuration files
- Verify GPU and system environment
- Start training (if environment is compatible)
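Restoring the environment is conceptually the reverse of the dump. A minimal sketch, assuming the dump file is in `env` output format (KEY=VALUE per line; multi-line values and the real Primus restore logic are not handled here):

```shell
#!/usr/bin/env bash
# Sketch: re-export every KEY=VALUE line from a primus_env_dump.txt.
# Assumes `env`-style output; multi-line values are not handled.

restore_env() {
    while IFS= read -r line; do
        # Skip anything that is not a simple KEY=VALUE assignment.
        case "$line" in
            [A-Za-z_]*=*) export "$line" ;;
        esac
    done < "$1"
}

# Demo with a tiny fake dump file instead of a real snapshot.
printf 'PRIMUS_DEMO_VAR=hello\n' > /tmp/fake_env_dump.txt
restore_env /tmp/fake_env_dump.txt
echo "$PRIMUS_DEMO_VAR"    # prints: hello
```

The same idea extends to the config files: copy them back into place, verify the GPU model matches, then launch.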
| Scenario | Traditional Approach | Using Primus CLI |
|---|---|---|
| Performance Comparison | Manually record config, easy to miss details | Auto-snapshot, precise reproduction |
| Bug Debugging | Hard to reproduce issues in a new environment | `--replay` one-click reproduction |
| A/B Testing | Manually ensure the two runs are consistent | Config consistency guaranteed automatically |
| Paper Experiments | Manually write experiment setup docs | One-click export of the complete environment |
| Version Regression | Rely on manual records | Automated CI integration |
Let's see how Primus CLI simplifies the entire workflow through a real scenario.
Step 1: Local Development & Validation
```bash
# Quickly verify the configuration on a dev machine
primus-cli direct --debug -- train pretrain \
  --config configs/deepseek_v2_debug.yaml
```

Discover config issues, data format errors, and similar problems within minutes.
Step 2: Container Environment Testing
```bash
# Ensure it runs normally in a standardized environment
primus-cli container \
  --image rocm/megatron-lm:v25.8_py310 \
  --mount /data:/data \
  -- train pretrain --config configs/deepseek_v2_debug.yaml
```

Containers ensure environment consistency, avoiding "works on my machine" problems.
Step 3: Small-Scale Cluster Validation
```bash
# 2-node test to verify distributed communication
primus-cli slurm srun -N 2 -p gpu-test \
  -- container --image rocm/megatron-lm:v25.8_py310 \
  -- train pretrain --config configs/deepseek_v2_small.yaml
```

Step 4: Production Large-Scale Training
```bash
# 64-node, 512-GPU production training
primus-cli slurm sbatch \
  -N 64 -p gpu-prod -t 72:00:00 \
  --job-name=deepseek-v2-prod \
  -o logs/train_%j.log \
  -- container --image rocm/megatron-lm:v25.8_py310 \
  -- train pretrain --config configs/deepseek_v2_prod.yaml
```

Notice? From development to production, the core command structure remains unchanged:
```
primus-cli <runtime> -- <subcommand> <args>
```

Only the runtime environment (direct → container → slurm) changes - everything else stays the same.
After the detailed introduction above, let's summarize the core value Primus CLI brings:
| Feature | Capability | Value |
|---|---|---|
| Unified Entry | One command fits all scenarios | Lower learning curve, higher development efficiency |
| Plugin-Based | Zero-intrusion extension of new features | Respond quickly to new requirements, keep the system stable |
| Intelligent Environment | Auto-detect GPU and optimize config | Zero-cost cross-platform migration |
| Hook System | Automate pre/post task processing | Decouple complex workflows, improve code reuse |
| Reproducibility | One-click snapshot and restore | Solid foundation for scientific experiments |
| Three-Layer Runtime | Direct/Container/Slurm seamless switching | Smooth path from development to production |
| Unified Logging | Rank-aware structured logging | Quickly locate multi-node issues |
Primus CLI continues to evolve, and our near-term plans include:
- Python Hook API: Support writing Hooks in Python for more flexible extension capabilities
- Intelligent Preflight: Auto-check GPU health, network topology, and InfiniBand connectivity before launch
- Configuration Template System: Built-in best-practice config templates for common models
- Performance Analysis Reports: Auto-generate training performance reports covering GEMM, communication, and I/O bottleneck analysis
- Enhanced Error Diagnostics: Intelligent error analysis and fix suggestions (e.g., RCCL hangs, OOM, and other common issues)
- Extended Framework Support: Improve support for more training frameworks such as TorchTitan and JAX/Flax
- CI/CD Integration: Provide standardized testing and validation workflows, with support for automated regression testing
- Become the standard training entry point for the ROCm ecosystem
Back to the question at the beginning: How do we make large model training go from complex to simple?
Primus CLI's answer: Through carefully designed abstraction layers, hide complexity under a unified interface.
```bash
# From this simple command...
primus-cli direct -- train pretrain --config deepseek_v2.yaml

# ...to everything behind it:
#   - Auto GPU detection and configuration
#   - Intelligent environment variable setup
#   - Data preprocessing Hooks
#   - Distributed communication optimization
#   - Log collection and analysis
#   - Experiment snapshots and reproduction
#   - Error handling and recovery
```

This is Primus CLI: making complex things simple, and simple things automatic.
- User Guide: PRIMUS-CLI-GUIDE.md
- Quick Start: `primus-cli --help`
- Issue Reporting: GitHub Issues
- ROCm Ecosystem: rocm.github.io
"The best interface is no interface."
But where interfaces cannot be eliminated, the best interface is unified, simple, and automated. That's Primus CLI.
Happy Training with AMD ROCm!