Commit 821043a

Update Primus Docker base image from v26.1 to v26.2 (#642)
Update all references to the Primus base image across documentation, configuration files, CI/CD workflows, benchmark helpers, and example scripts to use the latest v26.2 release. Keep existing JAX/MaxText image references unchanged.
1 parent 7901d1b commit 821043a
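
A sweep like this is easy to check mechanically. The sketch below is not part of the commit: it assumes a `grep` that supports `--exclude-dir` (GNU and BSD both do) and uses a hypothetical helper name, `find_stale_refs`, to list files that still mention the old tag. Searching for the full `rocm/primus:v26.1` string leaves the `rocm/jax-training:maxtext-v26.1` references alone, which matches the stated intent of keeping the JAX/MaxText images unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): list files under a directory
# that still contain a given image reference.
find_stale_refs() {
  local old_ref="$1"      # e.g. "rocm/primus:v26.1"
  local root="${2:-.}"    # directory to scan; defaults to the current one
  # -r: recurse, -l: print file names only, -F: treat the pattern literally
  # (so the dots in "v26.1" are not regex wildcards). grep exits non-zero
  # when nothing matches, so "|| true" keeps the helper safe under "set -e".
  grep -rlF --exclude-dir=.git -- "$old_ref" "$root" || true
}
```

Running `find_stale_refs "rocm/primus:v26.1" .` from the repository root should then print only references that were left on the old tag deliberately, if any.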

18 files changed: +92 -50 lines changed

.github/workflows/ci.yaml

Lines changed: 21 additions & 19 deletions
@@ -14,7 +14,7 @@ env:
  PRIMUS_TURBO_AITER_COMMIT: e83f9903c07001a0ec29e85d223f6e6cdbe00859
  ROCSHMEM_COMMIT: 17ff985c026f9f97f85068647e863ab541dd5645 # Update version to 3.2.0 for 7.2.0 rocm release (#351) (#355)
  UCCL_COMMIT: 5afb4117893c58cc0c8557d9286336141a301053 # [EP]: fix fp8 error of internode_ll on amd gfx950 arch. (#710)
- BASE_IMAGE: docker.io/rocm/primus:v26.1
+ BASE_IMAGE: docker.io/rocm/primus:v26.2
  MAXTEXT_BASE_IMAGE: docker.io/rocm/jax-training:maxtext-v26.1

  jobs:
@@ -115,24 +115,26 @@ jobs:
  docker push docker.io/tasimage/primus:${{env.IMAGE_TAG}}
  docker login -u rocmshared -p ${{ secrets.ROCM_DOCKER_HUB_TOKEN }}

- echo "> Build Docker Image with tag: ${{ env.IMAGE_TAG }}-ainic"
- start_time=$(date +%s)
- mkdir -p $GITHUB_WORKSPACE/.github/workflows/docker/ainic
- cp /apps/tas/0_public/primus_docker_ci/ainic/ainic_bundle_1.117.5-a-56.tar.gz $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ || { echo "Error: Failed to copy ainic bundle"; exit 1; }
- docker build -f $GITHUB_WORKSPACE/.github/workflows/docker/Dockerfile.ainic \
- --network=host \
- -t tasimage/primus:${{env.IMAGE_TAG}}-ainic \
- --build-arg BASE_IMAGE=docker.io/tasimage/primus:${{env.IMAGE_TAG}} \
- --build-arg AINIC_BUNDLE_PATH=ainic \
- $GITHUB_WORKSPACE/.github/workflows/docker
- end_time=$(date +%s)
- elapsed=$((end_time - start_time))
- echo "⏱️ [build primus docker-ainic] Total elapsed time: ${elapsed} seconds"
+ # # Primus v26.2 already includes AINIC under /workspace. Re-enable this
+ # # Dockerfile.ainic build only when we need to refresh the tasimage -ainic image.
+ # echo "> Build Docker Image with tag: ${{ env.IMAGE_TAG }}-ainic"
+ # start_time=$(date +%s)
+ # mkdir -p $GITHUB_WORKSPACE/.github/workflows/docker/ainic
+ # cp /apps/tas/0_public/primus_docker_ci/ainic/ainic_bundle_1.117.5-a-56.tar.gz $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ || { echo "Error: Failed to copy ainic bundle"; exit 1; }
+ # docker build -f $GITHUB_WORKSPACE/.github/workflows/docker/Dockerfile.ainic \
+ # --network=host \
+ # -t tasimage/primus:${{env.IMAGE_TAG}}-ainic \
+ # --build-arg BASE_IMAGE=docker.io/tasimage/primus:${{env.IMAGE_TAG}} \
+ # --build-arg AINIC_BUNDLE_PATH=ainic \
+ # $GITHUB_WORKSPACE/.github/workflows/docker
+ # end_time=$(date +%s)
+ # elapsed=$((end_time - start_time))
+ # echo "⏱️ [build primus docker-ainic] Total elapsed time: ${elapsed} seconds"

- docker tag tasimage/primus:${{env.IMAGE_TAG}}-ainic docker.io/tasimage/primus:${{env.IMAGE_TAG}}-ainic
- docker login -u tasimage -p ${{ secrets.PRIMUS_DOCKER_HUB_TOKEN }}
- docker push docker.io/tasimage/primus:${{env.IMAGE_TAG}}-ainic
- docker login -u rocmshared -p ${{ secrets.ROCM_DOCKER_HUB_TOKEN }}
+ # docker tag tasimage/primus:${{env.IMAGE_TAG}}-ainic docker.io/tasimage/primus:${{env.IMAGE_TAG}}-ainic
+ # docker login -u tasimage -p ${{ secrets.PRIMUS_DOCKER_HUB_TOKEN }}
+ # docker push docker.io/tasimage/primus:${{env.IMAGE_TAG}}-ainic
+ # docker login -u rocmshared -p ${{ secrets.ROCM_DOCKER_HUB_TOKEN }}

  echo "> Build Docker Image with tag: ${{ env.IMAGE_TAG }}-v25.09-ainic"
  start_time=$(date +%s)
@@ -206,7 +208,7 @@ jobs:
  # PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
  needs: [code-lint]
  # runs-on: [primus-lm-cicd-torch-j8knc]
- runs-on: [primus-lm-cicd-torch-tas8n-a16-40]
+ runs-on: [primus-lm-cicd-v26.2-tas8n-a16-40]
  steps:
  - run: echo "🎉 Begin Primus-Turbo Checkout."
  - name: Set commit hash to env
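
Commenting the `-ainic` build out works, but re-enabling it means editing the workflow again. As an alternative sketch (this is not what the commit does), the step could be guarded by a flag instead. `BUILD_AINIC_IMAGE` and `should_build_ainic` are hypothetical names, and the actual `docker build`, tag, and push commands are elided as a comment.

```shell
#!/usr/bin/env bash
# Hypothetical guard; BUILD_AINIC_IMAGE is an assumed variable, not one the
# workflow defines today.
should_build_ainic() {
  # Build only when the flag is explicitly "1"; unset or any other value skips.
  [ "${BUILD_AINIC_IMAGE:-0}" = "1" ]
}

if should_build_ainic; then
  echo "> Build Docker Image with tag: ${IMAGE_TAG:-unset}-ainic"
  # ... the Dockerfile.ainic build, tag, and push steps from the workflow
  # would go here ...
else
  echo "> Skip -ainic build (BUILD_AINIC_IMAGE is not 1)"
fi
```

The trade-off is that the disabled commands stay syntax-checked and visible in the workflow rather than drifting as comments.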

.github/workflows/docker/Dockerfile

Lines changed: 6 additions & 4 deletions
@@ -1,4 +1,4 @@
- ARG BASE_IMAGE=docker.io/rocm/primus:v26.1
+ ARG BASE_IMAGE=docker.io/rocm/primus:v26.2
  FROM ${BASE_IMAGE}

  ARG PRIMUS_TURBO_COMMIT
@@ -22,12 +22,14 @@ RUN rm -rf /var/lib/apt/lists/*
  # ---------------------------------------------------------------------------
  ENV ROCSHMEM_HOME=/opt/rocshmem
  ENV UCX_HOME=/opt/ucx
- ENV MPI_HOME=/opt/ompi
+ # ENV MPI_HOME=/opt/ompi
+ # Use the system OpenMPI prefix from the v26.2 base image.
+ ENV MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
  ENV ROCM_HOME=/opt/rocm
  ENV PRIMUS_TURBO_FRAMEWORK=${PRIMUS_TURBO_FRAMEWORK}

- ENV PATH="/opt/ompi/bin:/opt/ompi/sbin:${PATH}"
- ENV LD_LIBRARY_PATH="/opt/ompi/lib:${LD_LIBRARY_PATH}"
+ # ENV PATH="/opt/ompi/bin:/opt/ompi/sbin:${PATH}"
+ # ENV LD_LIBRARY_PATH="/opt/ompi/lib:${LD_LIBRARY_PATH}"
  ENV GPU_ARCHS="gfx942;gfx950"
  ENV PYTORCH_ROCM_ARCH="gfx942;gfx950"
  ENV HCC_AMDGPU_TARGET="gfx942,gfx950"
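
Because `MPI_HOME` now points at a path supplied by the base image rather than one this Dockerfile controls, it may be worth confirming that the directory exists before building on top of it. The sketch below is illustrative only: `check_mpi_home` is a hypothetical helper, and it only tests that the directory is present, not that the OpenMPI installation inside it is complete.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether an MPI prefix directory exists.
# Takes the path as its first argument and falls back to $MPI_HOME.
check_mpi_home() {
  local mpi_home="${1:-${MPI_HOME:-}}"
  if [ -n "$mpi_home" ] && [ -d "$mpi_home" ]; then
    echo "ok: $mpi_home"
  else
    echo "missing: ${mpi_home:-<unset>}"
    return 1
  fi
}
```

Inside a container started from the new image, `check_mpi_home /usr/lib/x86_64-linux-gnu/openmpi` would print an `ok:` line if the prefix is there and return a non-zero status otherwise.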

README.md

Lines changed: 2 additions & 2 deletions
@@ -58,7 +58,7 @@ Primus leverages AMD’s ROCm Docker images to provide a consistent, ready-to-ru
  1. **Pull the latest Docker image**

  ```bash
- docker pull docker.io/rocm/primus:v26.1
+ docker pull docker.io/rocm/primus:v26.2
  ```

  2. **Clone the repository**
@@ -74,7 +74,7 @@ Primus leverages AMD’s ROCm Docker images to provide a consistent, ready-to-ru
  # Run training in container
  # NOTE: If your config downloads weights/tokenizer from Hugging Face Hub,
  # you typically need to pass HF_TOKEN into the container.
- ./primus-cli container --image rocm/primus:v26.1 \
+ ./primus-cli container --image rocm/primus:v26.2 \
  --env HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
  -- train pretrain --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
  ```

benchmark/kernel/rccl/run_slurm.sh

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
  # Usage:
  # DOCKER_IMAGE=<image> sbatch run_slurm.sh
  # DOCKER_IMAGE=<image> NNODES=2 PARTITION=my-gpu sbatch run_slurm.sh
- # DOCKER_IMAGE=rocm/primus:v26.1 NNODES=2 sbatch -N2 -w smci355-ccs-aus-n04-[25,29] -p Compute-DCPT ./run_slurm.sh
+ # DOCKER_IMAGE=rocm/primus:v26.2 NNODES=2 sbatch -N2 -w smci355-ccs-aus-n04-[25,29] -p Compute-DCPT ./run_slurm.sh
  #
  # Environment variables (all optional except DOCKER_IMAGE):
  # DOCKER_IMAGE Docker image to use (required)

benchmark/kernel/rccl/submit_pairs.sh

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
  #!/usr/bin/env bash
  set -euo pipefail

- DOCKER_IMAGE="rocm/primus:v26.1"
+ DOCKER_IMAGE="rocm/primus:v26.2"
  #DOCKER_IMAGE="docker.gpuperf:5000/aai_2026_training/rocm/primus_megatron:v25.11_gpt_oss_sink"
  #DOCKER_IMAGE="docker.gpuperf:5000/gpuperf/primus:v26.1_sinkfa"
  NNODES=2

docs/cli/PRIMUS-CLI-GUIDE.md

Lines changed: 4 additions & 4 deletions
@@ -97,7 +97,7 @@ Primus CLI supports three execution modes, each suitable for different scenarios
  **Common Options**:
  | Option | Description | Example |
  |--------|-------------|---------|
- | `--image IMAGE` | Specify container image | `--image rocm/primus:v26.1` |
+ | `--image IMAGE` | Specify container image | `--image rocm/primus:v26.2` |
  | `--volume PATH[:PATH]` | Mount directory | `--volume /data:/data` |
  | `--cpus N` | Limit CPU cores | `--cpus 16` |
  | `--memory SIZE` | Limit memory size | `--memory 128G` |
@@ -174,7 +174,7 @@ Primus CLI supports three execution modes, each suitable for different scenarios
  ./primus-cli slurm srun -N 4 -- preflight --report-file-name preflight-report-4N

  # if you are using AINIC in your cluster, use the appropriate configuration file
- # for preflight test, set docker image to rocm/primus:v26.1 in the configuration file
+ # for preflight test, set docker image to rocm/primus:v26.2 in the configuration file
  ./primus-cli --config runner/use_ainic.yaml slurm srun -N 2 -- preflight --report-file-name preflight-report-2N
  ```

@@ -215,7 +215,7 @@ slurm:

  # Container configuration
  container:
- image: "rocm/primus:v26.1"
+ image: "rocm/primus:v26.2"
  options:
  cpus: "32"
  memory: "256G"
@@ -645,7 +645,7 @@ Step 4: primus-cli-container.sh (on each node)
  ├─ Load container.* config (image, devices, mounts, etc.)
  ├─ Parse container params: --image rocm/megatron-lm:v25.8_py310
  ├─ Merge config and CLI params
- │ Config: image=rocm/primus:v26.1
+ │ Config: image=rocm/primus:v26.2
  │ CLI: --image rocm/megatron-lm:v25.8_py310
  │ Result: image=rocm/megatron-lm:v25.8_py310
  ├─ Build container options
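
The merge step in the last hunk follows a simple rule: an `--image` value given on the command line overrides the one from the config file. A minimal sketch of that precedence, using a hypothetical `resolve_image` function rather than the actual `primus-cli-container.sh` logic:

```shell
#!/usr/bin/env bash
# Hypothetical illustration of "CLI overrides config"; not the real script.
resolve_image() {
  local config_image="$1"   # value from container.image in the config file
  local cli_image="${2:-}"  # value from --image, empty if not given
  # ${var:-fallback} expands to the fallback when var is unset or empty.
  echo "${cli_image:-$config_image}"
}

resolve_image "rocm/primus:v26.2" "rocm/megatron-lm:v25.8_py310"  # CLI value wins
resolve_image "rocm/primus:v26.2"                                 # falls back to config
```

The first call prints `rocm/megatron-lm:v25.8_py310`, matching the `Result:` line in the guide; the second prints `rocm/primus:v26.2`.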

docs/cli/README.md

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ primus-cli direct -- benchmark gemm -M 4096 -N 4096 -K 4096
  | Mode | Use Case | Command Example |
  |------|----------|-----------------|
  | **Direct** | Local development, quick validation | `primus-cli direct -- train pretrain` |
- | **Container** | Environment isolation, dependency management | `primus-cli container --image rocm/primus:v26.1 -- train pretrain` |
+ | **Container** | Environment isolation, dependency management | `primus-cli container --image rocm/primus:v26.2 -- train pretrain` |
  | **Slurm** | Multi-node distributed training | `primus-cli slurm srun -N 8 -- train pretrain` |

  ## 📖 Learn More

docs/quickstart.md

Lines changed: 4 additions & 4 deletions
@@ -21,7 +21,7 @@ rocm-smi && docker --version

  ```bash
  # Pull Docker image
- docker pull docker.io/rocm/primus:v26.1
+ docker pull docker.io/rocm/primus:v26.2

  # Clone repository
  git clone --recurse-submodules https://github.com/AMD-AIG-AIMA/Primus.git
@@ -32,7 +32,7 @@ cd Primus

  ```bash
  # Run a quick benchmark in container
- ./primus-cli container --image rocm/primus:v26.1 \
+ ./primus-cli container --image rocm/primus:v26.2 \
  -- benchmark gemm -M 4096 -N 4096 -K 4096
  ```

@@ -50,7 +50,7 @@ Use the Docker image you just pulled:

  ```bash
  # Run training in container (recommended for getting started)
- ./primus-cli container --image rocm/primus:v26.1 \
+ ./primus-cli container --image rocm/primus:v26.2 \
  -- train pretrain --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
  ```

@@ -62,7 +62,7 @@ Use the Docker image you just pulled:
  --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml

  # Slurm mode (for multi-node cluster)
- ./primus-cli slurm srun -N 8 -p gpu -- container --image rocm/primus:v26.1 \
+ ./primus-cli slurm srun -N 8 -p gpu -- container --image rocm/primus:v26.2 \
  -- train pretrain --config examples/megatron/configs/MI300X/llama2_7B-pretrain.yaml
  ```


examples/README.md

Lines changed: 3 additions & 3 deletions
@@ -49,7 +49,7 @@ We recommend using the official [rocm/megatron-lm Docker image](https://hub.dock

  ```bash
  # Pull the latest Docker image
- docker pull docker.io/rocm/primus:v26.1
+ docker pull docker.io/rocm/primus:v26.2

  ```

@@ -126,7 +126,7 @@ Multi-node training is launched via **SLURM**.
  Specify the number of nodes and the model config:

  ```bash
- export DOCKER_IMAGE="docker.io/rocm/primus:v26.1"
+ export DOCKER_IMAGE="docker.io/rocm/primus:v26.2"
  export NNODES=8

  # Example for megatron llama3.1_8B
@@ -285,7 +285,7 @@ When using the `create` command to start a new training workload, the following
  | `--gpu` | Number of GPUs | 8 |
  | `--exp` | Path to experiment (training config) file (required) ||
  | `--data_path` | Path to training data ||
- | `--image` | Docker image to use | `docker.io/rocm/primus:v26.1` |
+ | `--image` | Docker image to use | `docker.io/rocm/primus:v26.2` |
  | `--hf_token` | HuggingFace token | Read from env var `HF_TOKEN` |
  | `--workspace` | Workspace name | `primus-safe-pretrain` |
  | `--nodelist` | Comma-separated list of node hostnames to run on ||

examples/run_k8s_pretrain.sh

Lines changed: 2 additions & 2 deletions
@@ -15,7 +15,7 @@ GPU="8"
  EXP_PATH=""
  DATA_PATH=""
  BACKEND="megatron"
- IMAGE="docker.io/rocm/primus:v26.1"
+ IMAGE="docker.io/rocm/primus:v26.2"
  HF_TOKEN="${HF_TOKEN:-}"
  WORKSPACE="primus-safe-pretrain"
  NODELIST=""
@@ -38,7 +38,7 @@ Options for create:
  --backend <name> Training backend, e.g. megatron | torchtitan(default: megatron)
  --exp <exp_path> Path to EXP config (optional)
  --data_path <data_path> Data path (optional)
- --image <docker_image> Docker image to use (default: docker.io/rocm/primus:v26.1)
+ --image <docker_image> Docker image to use (default: docker.io/rocm/primus:v26.2)
  --hf_token <token> HuggingFace token (default: from env HF_TOKEN)
  --workspace <workspace> Workspace name (default: safe-cluster-dev)
  --nodelist <node1,node2> Comma-separated list of node names to run on (optional)
