Skip to content

Fix PhimoeIntegrationTest#46539

Open
ydshieh wants to merge 8 commits into
mainfrom
fix_phimoe2
Open

Fix PhimoeIntegrationTest#46539
ydshieh wants to merge 8 commits into
mainfrom
fix_phimoe2

Conversation

@ydshieh

@ydshieh ydshieh commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?

Fix PhimoeIntegrationTest. See the comments.

@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ydshieh ydshieh changed the title [DON'T MERGE YET] Fix PhimoeIntegrationTest Fix PhimoeIntegrationTest Jun 10, 2026
Comment on lines +123 to +139
import accelerate.utils

# Our CI runners with A10 GPU use instance sharing, which makes each runner report ~750 GB of CPU RAM.
# With that much CPU memory visible, `device_map="auto"` fits all layers on GPU+CPU with nothing
# spilling to disk. This produces a device map that causes GPU OOM when the model is actually run.
# We patch `accelerate_max_memory` to cap CPU at 60 GiB — the amount a dedicated A10 instance
# normally sees — so that the same layers are offloaded to disk as on a real single-tenant runner,
# avoiding the OOM.
_original_get_max_memory = accelerate.utils.get_max_memory

def _cap_cpu_memory(max_memory=None):
result = _original_get_max_memory(max_memory)
result["cpu"] = min(result.get("cpu", float("inf")), 60 * 1024**3)
return result

cls.offload_dir = tempfile.TemporaryDirectory()
with patch("transformers.integrations.accelerate.accelerate_max_memory", _cap_cpu_memory):

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add such a patching method in testing_utils because we might need to use it in several places

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, this is what we have in our CI runner, which shows CPU RAM total=747.90 GB and no offloaded to disk (without this PR).

WARNING transformers.modeling_utils:modeling_utils.py:4290 [device_map] Before _get_device_map — device_map='auto', max_memory=None, CPU RAM total=747.90 GB, available=727.44 GB, used=20.46 GB, percent=2.7%
WARNING transformers.modeling_utils:modeling_utils.py:4301 [device_map] After _get_device_map — device_map=OrderedDict([('model.embed_tokens', 0), ('model.layers.0', 0), ('model.layers.1', 0), ('model.layers.2', 0), ('model.layers.3', 0), ('model.layers.4', 0), ('model.layers.5', 0), ('model.layers.6', 0), ('model.layers.7', 1), ('model.layers.8', 1), ('model.layers.9', 1), ('model.layers.10', 1), ('model.layers.11', 1), ('model.layers.12', 1), ('model.layers.13', 1), ('model.layers.14', 1), ('model.layers.15', 1), ('model.layers.16', 'cpu'), ('model.layers.17', 'cpu'), ('model.layers.18', 'cpu'), ('model.layers.19', 'cpu'), ('model.layers.20', 'cpu'), ('model.layers.21', 'cpu'), ('model.layers.22', 'cpu'), ('model.layers.23', 'cpu'), ('model.layers.24', 'cpu'), ('model.layers.25', 'cpu'), ('model.layers.26', 'cpu'), ('model.layers.27', 'cpu'), ('model.layers.28', 'cpu'), ('model.layers.29', 'cpu'), ('model.layers.30', 'cpu'), ('model.layers.31', 'cpu'), ('model.norm', 'cpu'), ('model.rotary_emb', 'cpu'), ('lm_head', 'cpu')]), CPU RAM total=747.90 GB, available=727.37 GB, used=20.53 GB, percent=2.7%

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, that should likely really be a general utility - given this will always happen for models hitting this case

experts_implementation="eager",
dtype="auto",
device_map="auto",
offload_folder=cls.offload_dir.name,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to specify offload_folder after @Cyrilvallez work

Comment on lines +174 to +175
[-3.5625, -2.4375, -1.3672, 0.3438, -0.7539, -0.4590, 0.6133, -0.4531, 0.2188, -1.2422],
[-0.9688, 0.3633, -0.4902, 2.3281, 0.6250, 3.1094, 0.3828, 0.1670, 0.5781, -2.1094],

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes somehow. But the other 2 integration tests of generations pass without any change. The model is fine.

@vasqu

vasqu commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Imo looks good, but would really wait on @Cyrilvallez here to double check since he does/did the most work regarding offloading. Maybe we can also not use accelerate, do we maybe have some util for that?

@ydshieh

ydshieh commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator Author

Just a bit more context:

Before this PR, on our scheduled CI: device map only has some in GPU and some in CPU, but nothing not on disk. This causes GPU OOM when the model running on inputs.

device_map=OrderedDict([('model.embed_tokens', 0), ('model.layers.0', 0), ('model.layers.1', 0), ('model.layers.2', 0), ('model.layers.3', 0), ('model.layers.4', 0), ('model.layers.5', 0), ('model.layers.6', 0), ('model.layers.7', 1), ('model.layers.8', 1), ('model.layers.9', 1), ('model.layers.10', 1), ('model.layers.11', 1), ('model.layers.12', 1), ('model.layers.13', 1), ('model.layers.14', 1), ('model.layers.15', 1), ('model.layers.16', 'cpu'), ('model.layers.17', 'cpu'), ('model.layers.18', 'cpu'), ('model.layers.19', 'cpu'), ('model.layers.20', 'cpu'), ('model.layers.21', 'cpu'), ('model.layers.22', 'cpu'), ('model.layers.23', 'cpu'), ('model.layers.24', 'cpu'), ('model.layers.25', 'cpu'), ('model.layers.26', 'cpu'), ('model.layers.27', 'cpu'), ('model.layers.28', 'cpu'), ('model.layers.29', 'cpu'), ('model.layers.30', 'cpu'), ('model.layers.31', 'cpu'), ('model.norm', 'cpu'), ('model.rotary_emb', 'cpu'), ('lm_head', 'cpu')])

After this PR: device map only has some in GPU and some in CPU, and some on disk. The model could run on inputs and get outputs, no GPU OOM.

device_map=OrderedDict([('model.embed_tokens', 0), ('model.layers.0', 0), ('model.layers.1', 0), ('model.layers.2', 0), ('model.layers.3', 0), ('model.layers.4', 0), ('model.layers.5', 0), ('model.layers.6', 0), ('model.layers.7', 'cpu'), ('model.layers.8', 'cpu'), ('model.layers.9', 'cpu'), ('model.layers.10', 'cpu'), ('model.layers.11', 'cpu'), ('model.layers.12', 'cpu'), ('model.layers.13', 'cpu'), ('model.layers.14', 'cpu'), ('model.layers.15', 'cpu'), ('model.layers.16', 'cpu'), ('model.layers.17', 'cpu'), ('model.layers.18', 'cpu'), ('model.layers.19', 'cpu'), ('model.layers.20', 'cpu'), ('model.layers.21', 'cpu'), ('model.layers.22', 'cpu'), ('model.layers.23', 'cpu'), ('model.layers.24', 'cpu'), ('model.layers.25', 'cpu'), ('model.layers.26', 'cpu'), ('model.layers.27', 'cpu'), ('model.layers.28', 'disk'), ('model.layers.29', 'disk'), ('model.layers.30', 'disk'), ('model.layers.31', 'disk'), ('model.norm', 'disk'), ('model.rotary_emb', 'disk'), ('lm_head', 'disk')])

It's unclear to me why the first case will cause GPU OOM while the second case is OK.

I will wait until Friday evening. Otherwise I will simply merge (after implementing a reusable helper patching method).

@vasqu vasqu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok let's merge as you planned then (after the utils refactor), just one small question

I think more models could benefit from this but no rush, one step after the other

cls.model = PhimoeForCausalLM.from_pretrained(
"microsoft/Phi-3.5-MoE-instruct", experts_implementation="eager", dtype="auto", device_map="auto"
)
import accelerate.utils

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use something from our utils instead? Or is it always guaranteed to have accelerate atp 🤔

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean if we will always have accelerate installed ? At places like this, it's certain we have it.

But for the patch, I plan to use it in conftest.py (maybe, not 100% sure yet) so we don't need to patch every place manually. During the whole test sessions, it should not see the whole 750G memory however (it's a infra setting stuff). So yes, it's better not to rely on accelerate.

I will refine the logic.

Great point, thank you.

@ydshieh

ydshieh commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator Author

Will merge on Sunday. Don't want to have a surprisingly sad Saturday!

@github-actions

Copy link
Copy Markdown
Contributor

CI Dashboard: View test results in Grafana

@ydshieh ydshieh force-pushed the fix_phimoe2 branch 2 times, most recently from 9932607 to 09c06b9 Compare June 15, 2026 15:39
@github-actions

Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: phimoe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants