Last updated: 05/17/2026.
Authors: Mingjie Lu, Xiaohong Kou, Fuwei Yang
This document is a quick-start tutorial for running VeRL on AMD ROCm. It provides a production-style bring-up flow for container startup, environment verification, and training examples.
Current software and hardware scope:
- Runtime modes: Fully Async and Colocate are fully supported.
- Inference engine: vLLM validated; SGLang support is ongoing.
- Trainer backends: FSDP, FSDP2 and Megatron.
- GPU targets:
  - MI300X / MI325X (gfx942)
  - MI355X (gfx950)
Use the following prebuilt image for tutorial and validation:
amdagi/verl-dev:rocm7.0.2_56_te2.10_vllm0.20_py312
The Docker build recipe remains unchanged:
Before launching the container, ensure:
- AMD ROCm 7.0.2 host driver stack is installed and healthy.
- Docker has access to /dev/kfd and /dev/dri.
- Dataset and model storage paths are ready.
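The device-node prerequisite above can be checked on the host before starting the container. The following is a minimal sketch, not part of VeRL: the helper name `missing_device_nodes` is hypothetical, and it only tests that the paths exist, not that the ROCm driver stack is healthy.

```python
import os


def missing_device_nodes(paths=("/dev/kfd", "/dev/dri")):
    """Return the subset of `paths` that does not exist on this host."""
    return [p for p in paths if not os.path.exists(p)]


if __name__ == "__main__":
    missing = missing_device_nodes()
    if missing:
        print("missing device nodes:", ", ".join(missing))
    else:
        print("all expected device nodes are present")
```

An empty result only means the nodes exist; permission problems would still surface later at `docker run`.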
NAME=verl_release
DOCKER=amdagi/verl-dev:rocm7.0.2_56_te2.10_vllm0.20_py312
docker pull $DOCKER
docker run -it --name $NAME --device /dev/kfd --device /dev/dri \
--privileged --network=host \
--group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
--shm-size=2048g \
--ulimit memlock=-1 --ulimit stack=67108864 \
-w /workspace \
$DOCKER \
/bin/bash

# ROCm and visible GPU targets
rocminfo | grep -E "gfx942|gfx950" || true
# PyTorch + ROCm sanity check
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("rocm :", torch.version.hip)
print("cuda_available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu_count:", torch.cuda.device_count())
    print("device_0:", torch.cuda.get_device_name(0))
PY

| Category | Status | Notes |
|---|---|---|
| Runtime mode | Fully supported | Fully Async and Colocate are production-ready |
| Inference engine | vLLM validated | SGLang integration is ongoing |
| Trainer backend | Fully supported | FSDP, FSDP2, Megatron |
| Hardware | Fully supported | MI300X / MI325X (gfx942), MI355X (gfx950) |
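The hardware row can also be checked programmatically against `rocminfo` output. The sketch below is illustrative only: the function name `supported_targets` and the mapping constant are hypothetical, and it takes already-captured text as input rather than invoking `rocminfo` itself.

```python
import re

# Architectures listed as supported in this document, mapped to GPU names.
SUPPORTED_TARGETS = {
    "gfx942": "MI300X / MI325X",
    "gfx950": "MI355X",
}


def supported_targets(rocminfo_text):
    """Return {arch: gpu_name} for each supported gfx target found in the text."""
    found = set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_text))
    return {
        arch: SUPPORTED_TARGETS[arch]
        for arch in sorted(found)
        if arch in SUPPORTED_TARGETS
    }
```

To use it inside the container, the text could come from `subprocess.run(["rocminfo"], capture_output=True, text=True).stdout`; an empty result suggests that no validated GPU target is visible.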
For Qwen3-8B FSDP training, enable both parameter and optimizer offload to avoid out-of-memory (OOM) errors.
# Configure these in your launch script or Hydra overrides:
# actor_rollout_ref.actor.fsdp_config.param_offload=True
# actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh

bash examples/grpo_trainer/run_qwen3_5-35b-megatron.sh

RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES and RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES are no longer required in this release.
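For the Qwen3-8B FSDP run, the two offload settings mentioned earlier can be appended as Hydra overrides when the launch command is assembled. This is a sketch under one explicit assumption: that the example script forwards trailing arguments to the trainer (for instance via `"$@"`). That is not confirmed here, so check the script before relying on it. The helper name `build_launch_command` is hypothetical.

```python
import shlex

# Offload overrides quoted from the FSDP note in this document.
OFFLOAD_OVERRIDES = [
    "actor_rollout_ref.actor.fsdp_config.param_offload=True",
    "actor_rollout_ref.actor.fsdp_config.optimizer_offload=True",
]


def build_launch_command(script, overrides=OFFLOAD_OVERRIDES):
    """Return a shell-safe command string: bash <script> <override> ..."""
    # Assumption: the script passes extra arguments through to Hydra.
    return shlex.join(["bash", script, *overrides])


print(build_launch_command("examples/grpo_trainer/run_qwen3_8b_fsdp.sh"))
```

If the script does not forward arguments, set the same two keys directly in the script instead.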
# For qwen2.5-math-7b, update max_position_embeddings to 32768 in config.json after model download.
bash verl/experimental/fully_async_policy/shell/dapo_7b_math_fsdp2_4_4.sh
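The `config.json` change noted above for qwen2.5-math-7b can be scripted rather than edited by hand. The following is a minimal sketch: the helper name `set_max_position_embeddings` is hypothetical, and the location of `config.json` depends on where the model was downloaded.

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path, value=32768):
    """Set max_position_embeddings in a model config.json; return the old value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("max_position_embeddings")
    config["max_position_embeddings"] = value
    # Rewrite the file with all other keys preserved.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous


# Example call (the path is a placeholder for your local model directory):
# set_max_position_embeddings("/path/to/model/config.json")
```

Returning the previous value makes it easy to log what was changed, or to restore it later.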