Getting started with AMD ROCm

Last updated: 05/17/2026.

Author: Mingjie Lu, Xiaohong Kou, Fuwei Yang

Overview

This document is a quick-start tutorial for running VeRL on AMD ROCm. It provides a production-style bring-up flow for container startup, environment verification, and training examples.

Current software and hardware scope:

Runtime modes: fully supports Fully Async and Colocate.
Inference engine: vLLM validated; SGLang support is ongoing.
Trainer backends: FSDP, FSDP2 and Megatron.
GPU targets:
- MI300X / MI325X (gfx942)
- MI355X (gfx950)

Software Baseline

Use the following prebuilt image for tutorial and validation:

amdagi/verl-dev:rocm7.0.2_56_te2.10_vllm0.20_py312

The Docker build recipe remains unchanged:

docker/rocm/Dockerfile.rocm

Host Prerequisites

Before launching the container, ensure:

AMD ROCm 7.0.2 host driver stack is installed and healthy.
Docker has access to /dev/kfd and /dev/dri.
Dataset and model storage paths are ready.

Launch Container

NAME=verl_release
DOCKER=amdagi/verl-dev:rocm7.0.2_56_te2.10_vllm0.20_py312

docker pull $DOCKER

docker run -it --name $NAME --device /dev/kfd --device /dev/dri \
  --privileged --network=host \
  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --shm-size=2048g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -w /workspace \
  $DOCKER \
  /bin/bash

Environment Check (Inside Container)

# ROCm and visible GPU targets
rocminfo | grep -E "gfx942|gfx950" || true

# PyTorch + ROCm sanity check
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("rocm :", torch.version.hip)
print("cuda_available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu_count:", torch.cuda.device_count())
    print("device_0:", torch.cuda.get_device_name(0))
PY

Feature Support Matrix

Current support status

Category	Status	Notes
Runtime mode	Fully supported	Fully Async and Colocate are production-ready
Inference engine	vLLM validated	SGLang integration is ongoing
Trainer backend	Fully supported	FSDP, Megatron
Hardware	Fully supported	MI300X / MI325X (gfx942), MI355X (gfx950)

Example Workflow

1) Colocate mode + FSDP (GRPO, Qwen3-8B)

For Qwen3-8B FSDP training, enable both parameter and optimizer offload to avoid OOM.

# Configure these in your launch script or Hydra overrides:
# actor_rollout_ref.actor.fsdp_config.param_offload=True
# actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh

2) Colocate mode + Megatron (GRPO, Qwen3.5-35B)

bash examples/grpo_trainer/run_qwen3_5-35b-megatron.sh

3) Fully Async mode

RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES and RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES are no longer required in this release.

# For qwen2.5-math-7b, update max_position_embeddings to 32768 in config.json after model download.
bash verl/experimental/fully_async_policy/shell/dapo_7b_math_fsdp2_4_4.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Getting started with AMD ROCm

Overview

Software Baseline

Host Prerequisites

Launch Container

Environment Check (Inside Container)

Feature Support Matrix

Example Workflow

1) Colocate mode + FSDP (GRPO, Qwen3-8B)

2) Colocate mode + Megatron (GRPO, Qwen3.5-35B)

3) Fully Async mode

FilesExpand file tree

amd_quick_start.rst

Latest commit

History

amd_quick_start.rst

File metadata and controls

Getting started with AMD ROCm

Overview

Software Baseline

Host Prerequisites

Launch Container

Environment Check (Inside Container)

Feature Support Matrix

Example Workflow

1) Colocate mode + FSDP (GRPO, Qwen3-8B)

2) Colocate mode + Megatron (GRPO, Qwen3.5-35B)

3) Fully Async mode