Add rocm perf yml file #418
base: rocm-jaxlib-v0.5.0
Conversation
.github/workflows/rocm-perf.yml
```python
times.append(float(m.group(1)))
if times:
    summary[model] = {
        "median_step_time": round(float(np.median(times)), 3),
        "steps_counted": len(times)
    }
```
grab the parsed steps too with the summary
Suggested change:

```python
times.append(float(m.group(1)))
if times:
    step_info = [{"step": n, "time": t} for n, t in enumerate(times)]
    summary[model] = {
        "steps": step_info,
        "median_step_time": round(float(np.median(times)), 3),
        "steps_counted": len(times)
    }
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
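For reference, with the suggested change in place, one entry of the summary would look roughly like this (the model name and timings below are made-up placeholders, not values from an actual CI run):

```python
# Hypothetical shape of one summary.json entry after the suggested change.
summary = {
    "example_model": {
        "steps": [
            {"step": 0, "time": 0.512},
            {"step": 1, "time": 0.498},
        ],
        "median_step_time": 0.505,
        "steps_counted": 2,
    }
}
```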
Force-pushed from d2912f5 to 952cc4f.
This is a port from the other performance CI PR, right? Could you add a description and link to the original PR?
Are we considering grok and alphafold models? @Ruturaj4 @JehandadKhan

Yes, we are definitely planning to add alphafold; however, grok testing takes too much time to download the weights. If grok training can be done, or if there are ways to run grok faster, we are happy to add those as well!
Why did you choose to report the median step time? I don't know the rationale for that, but in general I'm not sure the median is the right metric here. It rejects outliers, and on its own it doesn't describe the distribution of values at all, yet that is exactly what is important to know:
TLDR: the mean seems like a much better metric here. For the best results, I'd report 6 values: the [0, 25, 50, 75, 100]% quantiles plus the mean (because of the last bullet point), and, God forbid, stddev.
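For illustration, a minimal sketch of this kind of reporting, assuming the per-step times for one model are already collected in a list (the function name and sample values are hypothetical):

```python
import numpy as np

def step_time_stats(times):
    """Return the [0, 25, 50, 75, 100]% quantiles plus the mean of step times."""
    q = np.percentile(times, [0, 25, 50, 75, 100])
    return {
        "min": round(float(q[0]), 3),
        "p25": round(float(q[1]), 3),
        "median": round(float(q[2]), 3),
        "p75": round(float(q[3]), 3),
        "max": round(float(q[4]), 3),
        "mean": round(float(np.mean(times)), 3),
    }

# Example: step times for one model, warmup step already excluded.
print(step_time_stats([0.52, 0.49, 0.51, 0.50, 0.98]))
```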
: ) why did you not reply instead of editing it... I didn't even notice you had replied.

@Ruturaj4 I remember we put the grok weights in a shared directory last time to avoid duplicate downloads. Not sure whether we can achieve this on CI, something like a persistent storage node for the grok weights?
Force-pushed from 952cc4f to e396bec.
Ohh, I accidentally edited your reply! Yup, we can do something like that as long as we have space somewhere that the CI nodes can access.
I decided to use the median as per the suggestion from Jehandad. But I will also add the mean time as per your suggestion.
Force-pushed from e396bec to d6c595d.
Looks good to me.
These 4 models are all LLMs; can we add one more for video? https://github.com/google-research/scenic/tree/main/scenic/projects/baselines

In addition, can we add one simulation model as well? For example https://github.com/Autodesk/XLB

Furthermore, I have found that JAX is very popular in the AI-for-science community, and there are a number of frameworks/models based on JAX. I wonder whether we could try one of them for the JAX perf CI (maybe alphafold is enough?), for example https://github.com/sail-sg/jax_xc

In general, I hope we can cover LLM, video, simulation, and AI4Science.
This PR adds a new GitHub Actions workflow that:

- Builds JAX with ROCm support inside a Docker container.
- Runs training for the following MaxText models:
- Captures `stdout` logs for each model and extracts per-step timing
- Ignores step 0 (warmup) when computing metrics
- Computes `median_step_time` per model and saves it to `summary.json`
- Uploads logs and metrics as workflow artifacts

A Python analysis script (`analyze_maxtext_logs.py`) is added under `jax/build/rocm/` to parse logs and generate the summary.
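For context, a minimal sketch of this kind of log parsing, assuming a per-step timing line can be matched with a regex. The regex pattern, log line format, and function names here are assumptions for illustration, not the actual contents of `analyze_maxtext_logs.py`:

```python
import json
import re
import numpy as np

# Assumed per-step log line format, e.g. "completed step: 3, step_time: 0.512";
# the real MaxText log format and the script's regex may differ.
STEP_RE = re.compile(r"step_time:\s*([0-9.]+)")

def summarize_log(model, log_lines, summary):
    """Extract per-step times from one model's stdout log and record stats."""
    times = []
    for line in log_lines:
        m = STEP_RE.search(line)
        if m:
            times.append(float(m.group(1)))
    times = times[1:]  # drop step 0: it includes compile/warmup time
    if times:
        summary[model] = {
            "median_step_time": round(float(np.median(times)), 3),
            "steps_counted": len(times),
        }

if __name__ == "__main__":
    # Hypothetical captured stdout; the workflow would read real log files instead.
    sample_log = [
        "completed step: 0, step_time: 2.134",
        "completed step: 1, step_time: 0.512",
        "completed step: 2, step_time: 0.498",
    ]
    summary = {}
    summarize_log("example_model", sample_log, summary)
    print(json.dumps(summary, indent=2))  # the real script writes this to summary.json
```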