Skip to content

Latest commit

Β 

History

History
431 lines (310 loc) Β· 16.6 KB

File metadata and controls

431 lines (310 loc) Β· 16.6 KB

πŸš€ From Chaos to Order: Building a Unified Entry Point for AMD GPU LLM Training

⚠️ NOTE: This is a draft version and not the final release.

Author: AMD AI Brain - Training at Scale (TAS) Team Published: 2025-11-10 Tags: #ROCm #LLM-Training #Primus #DevTools #AMD-GPU


πŸ“– The Beginning: Pain Points in Training Workflows

Imagine this scenario:

You've just taken over a large model training project, and the project directory is littered with dozens of Bash scripts: setup_env_mi300x.sh, run_megatron_8node.sh, check_network.sh, preprocess_data_v2_final_really.sh... Each script has its own logic, and they're tightly coupled with each other.

When you want to reproduce a training run on a new GPU cluster, you need to:

  1. Manually detect the GPU model and find the corresponding environment configuration script
  2. Modify distributed parameters based on the number of nodes
  3. Ensure data preprocessing scripts execute before training
  4. Manually set dozens of environment variables
  5. Pray that everything runs smoothly...

This is the real problem we encountered when building the Primus platform.

In the AMD GPU large model training ecosystem, we face complexity at multiple levels:

  • πŸ”§ Environment Setup: Different GPU models (MI300X, MI250X) require different ROCm configurations
  • πŸ”— Network Topology: RCCL/NCCL environments and InfiniBand configurations vary widely
  • 🎯 Framework Orchestration: Megatron, TorchTitan, JAX and other frameworks each have their own characteristics
  • πŸ–₯️ Execution Environment: Three scenarios - local development, containerization, and Slurm clusters
  • πŸ“Š Performance Validation: GEMM benchmarks, communication performance testing
  • πŸ› οΈ Pre/Post Processing: Data preprocessing, environment checks, hotfix patches

The traditional approach is to use a large number of Bash scripts to handle these aspects, but this brings:

  • ❌ Repeated logic, difficult to maintain
  • ❌ Lack of unified error handling
  • ❌ Experiments difficult to reproduce
  • ❌ High barrier to entry for newcomers

Primus CLI was born to solve these pain points.


πŸ’‘ Design Philosophy: One Command, Done

Our core philosophy is simple: One command, from environment configuration to training launch, fully automated.

# Just this simple!
primus-cli direct -- train pretrain --config deepseek_v2.yaml

πŸ—οΈ Three-Layer Architecture Design

Primus CLI adopts a clear three-layer structure + plugin system:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Runtime Layer                     β”‚
β”‚            direct | container | slurm               β”‚
β”‚  Auto-detect env, configure GPU, manage distributed β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                Hook/Patch System                    β”‚
β”‚    Data preprocessing | Env checks | Hotfixes       β”‚
β”‚           Pluggable pre/post task logic             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               Task Execution Layer                  β”‚
β”‚       train | benchmark | preflight | analyze       β”‚
β”‚        Plugin-based business logic and tasks        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

🎯 Four Design Goals

Goal Implementation User Benefits
πŸ”„ Consistency Unified CLI entry, supports Megatron/TorchTitan/JAX No need to learn different commands for different frameworks
πŸ”Œ Extensibility Auto-discovery plugins, dynamic registration of new features Add new features without modifying core code
πŸ› Debuggability Rank-aware logging, detailed execution tracing Quickly locate multi-node training issues
πŸ“¦ Reproducibility Auto-export runtime environment and config One-click reproduce any historical experiment

πŸ” Deep Dive: Architecture Dissection

βš™οΈ Layer 1: Intelligent Runtime Abstraction

Different scenarios require different runtime environments, but users shouldn't have to worry about these details. Primus CLI provides three seamlessly switchable runtime modes:

🎭 Mode πŸ“ Use Case 🎯 Typical Command
πŸ–₯️ Direct Local dev, quick validation primus-cli direct -- train pretrain
🐳 Container Environment isolation, dependency management primus-cli container --image rocm/megatron:v25.8 -- train pretrain
πŸ–§ Slurm Multi-node production training primus-cli slurm srun -N 8 -- train pretrain

Key Highlight: These three modes share the same command syntax, differing only in runtime environment preparation:

# Local testing
primus-cli direct -- benchmark gemm -M 4096

# Container testing (ensure environment consistency)
primus-cli container -- benchmark gemm -M 4096

# Production environment (8-node cluster)
primus-cli slurm srun -N 8 -- benchmark gemm -M 4096

From development to production, just change the runtime mode - commands and parameters stay the same!


πŸ” Layer 2: Hook & Patch System

Training is more than just running a Python script. You might need to:

  • πŸ—‚οΈ Preprocess datasets before training
  • πŸ” Check GPU and network environment
  • 🩹 Apply temporary hotfixes
  • πŸ“Š Collect and report metrics

Primus CLI's Hook system automates all of this:

# Directory structure
runner/helpers/hooks/
└── train/
    └── pretrain/
        β”œβ”€β”€ 01_check_environment.sh
        β”œβ”€β”€ 02_prepare_dataset.sh
        └── 03_setup_monitoring.sh

When you run primus-cli train pretrain, these Hooks execute automatically in order. No manual invocation, no training code modification needed.

The Patch system is for more flexible scenarios:

primus-cli direct --patch fixes/workaround_rccl.sh \
  -- train pretrain --config config.yaml

This is especially useful when you need to quickly apply temporary fixes or make environment-specific adjustments.


🧩 Layer 3: Task Execution Layer

This layer is responsible for executing specific business logicβ€”training, testing, environment checks, and other actual tasks. Remember we said "zero-intrusion extension"? How is this achieved?

# Training task
primus-cli direct -- train pretrain --config deepseek_v2.yaml

# Performance testing task
primus-cli direct -- benchmark gemm --dtype bf16 -M 8192

# Environment check task
primus-cli direct -- preflight check --gpu --network

Each task is an independent Python plugin module, auto-registered via decorators:

from primus.cli.registry import register_subcommand

@register_subcommand("train")
def run_train(args, unknown_args):
    # Your training business logic
    ...

Want to add a new task? Just create a new plugin file in primus/cli/subcommands/ without modifying any core code. For example, if you want to add a topology analysis task:

# Add file: primus/cli/subcommands/analyze.py
# And you can use it directly
primus-cli direct -- analyze topology --visualize

This plugin-based design allows Primus CLI to quickly respond to new requirements while keeping the core stable as functionality expands.


🌐 The Magic Behind: Intelligent Environment Detection

This is probably the most "black tech" part of Primus CLI.

Problem: Different GPUs Need Different Configurations

AMD's GPU family is rich: MI300X, MI250X, MI210... Each GPU has its optimal ROCm configuration and environment variable settings. The traditional approach is to let users manually select configurations, but this is both error-prone and insufficiently automated.

Solution: Three-Step Auto-Configuration

Step 1: Load Common Environment

# base_env.sh provides unified logging and utility functions
source "${SCRIPT_DIR}/base_env.sh"

Step 2: Auto-Detect GPU

# Intelligently detect the current node's GPU model
GPU_MODEL=$(bash "${SCRIPT_DIR}/detect_gpu.sh")
# Output: MI300X

Step 3: Load GPU-Specific Configuration

# Based on detection result, auto-load optimal configuration
GPU_CONFIG_FILE="${SCRIPT_DIR}/${GPU_MODEL}.sh"
source "$GPU_CONFIG_FILE"  # Load MI300X.sh

Now, MI300X.sh can contain all best practices for this GPU model:

  • ROCm environment variables (HSA_*, HIP_*)
  • RCCL communication optimization parameters
  • Memory management strategies
  • Performance tuning options

Users don't need to worry about these details at all - everything is automatic.

Real-World Example

# On MI300X cluster
primus-cli direct -- train pretrain --config config.yaml
# Auto-loads: MI300X.sh β†’ Optimized RCCL configuration

# Switch to MI250X cluster, same command
primus-cli direct -- train pretrain --config config.yaml
# Auto-loads: MI250X.sh β†’ Adapted different configuration

Cross-GPU migration with zero configuration changes!


πŸ§ͺ Foundation of Scientific Experiments: Reproducibility

In machine learning research, reproducibility is crucial. But reality is harsh:

"I ran this experiment three months ago, now I want to reproduce it... Wait, where's the config file? How did I set those environment variables again?"

Does this sound familiar? Primus CLI completely solves this problem with an automated snapshot mechanism.

Auto-Record Everything

Every time training starts, Primus CLI automatically saves the complete runtime context:

output/exp_2025_11_10_134522/
β”œβ”€β”€ env/
β”‚   β”œβ”€β”€ primus_env_dump.txt      # All environment variables
β”‚   β”œβ”€β”€ gpu_model.txt            # GPU model info
β”‚   β”œβ”€β”€ rocm_version.txt         # ROCm version
β”‚   └── system_info.json         # System configuration
β”œβ”€β”€ config/
β”‚   β”œβ”€β”€ primus_config.yaml       # Primus config
β”‚   β”œβ”€β”€ model_config.yaml        # Model config
β”‚   └── data_config.yaml         # Data config
β”œβ”€β”€ logs/
β”‚   └── launch.log               # Launch logs
└── metadata.json                # Runtime metadata

One-Click Reproduction

Three months later, when you want to reproduce this experiment:

# Just this simple!
primus-cli direct -- train pretrain --replay output/exp_2025_11_10_134522/

Primus CLI will automatically:

  1. Restore all environment variables
  2. Load original configuration files
  3. Verify GPU and system environment
  4. Start training (if environment is compatible)

Real-World Value

Scenario Traditional Approach Using Primus CLI
πŸ” Performance Comparison Manually record config, easy to miss details Auto-snapshot, precise reproduction
πŸ› Bug Debugging Hard to reproduce issues in new environment --replay one-click reproduction
πŸ“Š A/B Testing Manually ensure two runs are consistent Auto-guarantee config consistency
πŸ† Paper Experiments Manually write experiment setup docs One-click export complete environment
πŸ”„ Version Regression Rely on manual records Automated CI integration

πŸ“Š Real-World Case: From Development to Production

Let's see how Primus CLI simplifies the entire workflow through a real scenario.

Scenario: Training DeepSeek-V2 Model

Step 1: Local Development & Validation πŸ–₯️

# Quickly verify configuration on dev machine
primus-cli direct --debug -- train pretrain \
  --config configs/deepseek_v2_debug.yaml

Discover config issues, data format errors, etc. within minutes.

Step 2: Container Environment Testing 🐳

# Ensure it runs normally in standardized environment
primus-cli container \
  --image rocm/megatron-lm:v25.8_py310 \
  --mount /data:/data \
  -- train pretrain --config configs/deepseek_v2_debug.yaml

Containers ensure environment consistency, avoiding "works on my machine" problems.

Step 3: Small-Scale Cluster Validation πŸ–§

# 2-node test, verify distributed communication
primus-cli slurm srun -N 2 -p gpu-test \
  -- container --image rocm/megatron-lm:v25.8_py310 \
  -- train pretrain --config configs/deepseek_v2_small.yaml

Step 4: Production Large-Scale Training πŸš€

# 64-node, 512-GPU production training
primus-cli slurm sbatch \
  -N 64 -p gpu-prod -t 72:00:00 \
  --job-name=deepseek-v2-prod \
  -o logs/train_%j.log \
  -- container --image rocm/megatron-lm:v25.8_py310 \
  -- train pretrain --config configs/deepseek_v2_prod.yaml

Key Insight

Notice? From development to production, the core command structure remains unchanged:

primus-cli <runtime> -- <subcommand> <args>

Only the runtime environment (direct β†’ container β†’ slurm) changes - everything else is unified.


🎯 Core Advantages Summary

After the detailed introduction above, let's summarize the core value Primus CLI brings:

🎁 Feature πŸ’ͺ Capability πŸš€ Value
Unified Entry One command, fits all scenarios Lower learning curve, higher development efficiency
Plugin-Based Zero-intrusion extension of new features Quickly respond to new requirements, keep system stable
Intelligent Environment Auto-detect GPU and optimize config Zero cost for cross-platform migration
Hook System Automate pre/post task processing Decouple complex workflows, improve code reuse
Reproducibility One-click snapshot and restore Solid foundation for scientific experiments
Three-Layer Runtime Direct/Container/Slurm seamless switching Smooth path from development to production
Unified Logging Rank-aware structured logging Quickly locate multi-node issues

πŸ›£οΈ Future Roadmap

Primus CLI continues to evolve, and our near-term plans include:

Short-Term Goals (2025)

  • 🎯 Python Hook API: Support writing Hooks in Python scripts for more flexible extension capabilities
  • 🎯 Intelligent Preflight: Auto-check GPU health, network topology, InfiniBand connectivity before launch
  • 🎯 Configuration Template System: Built-in best practice config templates for common models
  • 🎯 Performance Analysis Reports: Auto-generate training performance analysis reports including GEMM, communication, I/O bottleneck analysis
  • 🎯 Enhanced Error Diagnostics: Intelligent error analysis and fix suggestions (e.g., RCCL hang, OOM, and other common issues)
  • 🎯 Extended Framework Support: Improve support for more training frameworks like TorchTitan, JAX/Flax
  • 🎯 CI/CD Integration: Provide standardized testing and validation workflows, support automated regression testing

Long-Term Vision

  • 🌟 Become the standard training entry point for the ROCm ecosystem

πŸŽ“ Summary: The Power of One Command

Back to the question at the beginning: How do we make large model training go from complex to simple?

Primus CLI's answer: Through carefully designed abstraction layers, hide complexity under a unified interface.

# From this simple command...
primus-cli direct -- train pretrain --config deepseek_v2.yaml

# ...to everything behind it:
# βœ… Auto GPU detection and configuration
# βœ… Intelligent environment variable setup
# βœ… Data preprocessing Hooks
# βœ… Distributed communication optimization
# βœ… Log collection and analysis
# βœ… Experiment snapshots and reproduction
# βœ… Error handling and recovery

This is Primus CLI: Making complex things simple, and simple things automatic.


πŸ“š Learn More


"The best interface is no interface."

But where interfaces cannot be eliminated, the best interface is unified, simple, and automated. That's Primus CLI.

Happy Training with AMD ROCm! πŸŽ‰