Fix `PhimoeIntegrationTest` by ydshieh · Pull Request #46539 · huggingface/transformers

ydshieh · 2026-06-10T13:20:34Z

What does this PR do?

Fix PhimoeIntegrationTest. See the comments.

HuggingFaceDocBuilderDev · 2026-06-10T13:36:06Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ydshieh · 2026-06-10T15:30:24Z

+            import accelerate.utils
+
+            # Our CI runners with A10 GPU use instance sharing, which makes each runner report ~750 GB of CPU RAM.
+            # With that much CPU memory visible, `device_map="auto"` fits all layers on GPU+CPU with nothing
+            # spilling to disk. This produces a device map that causes GPU OOM when the model is actually run.
+            # We patch `accelerate_max_memory` to cap CPU at 60 GiB — the amount a dedicated A10 instance
+            # normally sees — so that the same layers are offloaded to disk as on a real single-tenant runner,
+            # avoiding the OOM.
+            _original_get_max_memory = accelerate.utils.get_max_memory
+
+            def _cap_cpu_memory(max_memory=None):
+                result = _original_get_max_memory(max_memory)
+                result["cpu"] = min(result.get("cpu", float("inf")), 60 * 1024**3)
+                return result
+
+            cls.offload_dir = tempfile.TemporaryDirectory()
+            with patch("transformers.integrations.accelerate.accelerate_max_memory", _cap_cpu_memory):


We should probably add a patching helper like this to testing_utils, since we may need it in several places.

For the record, this is what we see on our CI runner without this PR: CPU RAM total=747.90 GB, and nothing is offloaded to disk.

WARNING transformers.modeling_utils:modeling_utils.py:4290 [device_map] Before _get_device_map — device_map='auto', max_memory=None, CPU RAM total=747.90 GB, available=727.44 GB, used=20.46 GB, percent=2.7%
WARNING transformers.modeling_utils:modeling_utils.py:4301 [device_map] After _get_device_map — device_map=OrderedDict([('model.embed_tokens', 0), ('model.layers.0', 0), ('model.layers.1', 0), ('model.layers.2', 0), ('model.layers.3', 0), ('model.layers.4', 0), ('model.layers.5', 0), ('model.layers.6', 0), ('model.layers.7', 1), ('model.layers.8', 1), ('model.layers.9', 1), ('model.layers.10', 1), ('model.layers.11', 1), ('model.layers.12', 1), ('model.layers.13', 1), ('model.layers.14', 1), ('model.layers.15', 1), ('model.layers.16', 'cpu'), ('model.layers.17', 'cpu'), ('model.layers.18', 'cpu'), ('model.layers.19', 'cpu'), ('model.layers.20', 'cpu'), ('model.layers.21', 'cpu'), ('model.layers.22', 'cpu'), ('model.layers.23', 'cpu'), ('model.layers.24', 'cpu'), ('model.layers.25', 'cpu'), ('model.layers.26', 'cpu'), ('model.layers.27', 'cpu'), ('model.layers.28', 'cpu'), ('model.layers.29', 'cpu'), ('model.layers.30', 'cpu'), ('model.layers.31', 'cpu'), ('model.norm', 'cpu'), ('model.rotary_emb', 'cpu'), ('lm_head', 'cpu')]), CPU RAM total=747.90 GB, available=727.37 GB, used=20.53 GB, percent=2.7%

Oh wow, that should likely really be a general utility - given this will always happen for models hitting this case

ydshieh · 2026-06-10T15:30:50Z

+                    experts_implementation="eager",
+                    dtype="auto",
+                    device_map="auto",
+                    offload_folder=cls.offload_dir.name,


We need to specify offload_folder after @Cyrilvallez's work.
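Since the offload folder is a class-level `tempfile.TemporaryDirectory()`, it also needs to be cleaned up once the test class finishes. A minimal sketch of one way to manage that lifetime, assuming a plain `unittest.TestCase`; the class name `OffloadDirTestCase` is hypothetical and no model is loaded here.

```python
import os
import tempfile
import unittest


class OffloadDirTestCase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls.offload_dir = tempfile.TemporaryDirectory()
        # Register cleanup immediately so the directory is removed even if
        # the rest of setUpClass raises (e.g. while loading the model).
        cls.addClassCleanup(cls.offload_dir.cleanup)
        # In the real test, `offload_folder=cls.offload_dir.name` would be
        # passed to `from_pretrained` at this point.

    def test_offload_dir_exists(self):
        self.assertTrue(os.path.isdir(self.offload_dir.name))
```

Using `addClassCleanup` rather than `tearDownClass` is a design choice here: class cleanups still run when `setUpClass` fails, whereas `tearDownClass` is skipped in that case.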

ydshieh · 2026-06-10T15:31:26Z

+                [-3.5625, -2.4375, -1.3672, 0.3438, -0.7539, -0.4590, 0.6133, -0.4531, 0.2188, -1.2422],
+                [-0.9688, 0.3633, -0.4902, 2.3281, 0.6250, 3.1094, 0.3828, 0.1670, 0.5781, -2.1094],


These expected values changed somehow, but the other 2 generation integration tests pass without any change, so the model is fine.

vasqu · 2026-06-10T17:34:02Z

IMO this looks good, but I would really wait for @Cyrilvallez to double-check, since he has done most of the work on offloading. Maybe we can also avoid using accelerate directly; do we have some util for that?

ydshieh · 2026-06-11T08:07:18Z

Just a bit more context:

Before this PR, on our scheduled CI: the device map places some modules on GPU and some on CPU, but nothing on disk. This causes GPU OOM when the model runs on inputs.

device_map=OrderedDict([('model.embed_tokens', 0), ('model.layers.0', 0), ('model.layers.1', 0), ('model.layers.2', 0), ('model.layers.3', 0), ('model.layers.4', 0), ('model.layers.5', 0), ('model.layers.6', 0), ('model.layers.7', 1), ('model.layers.8', 1), ('model.layers.9', 1), ('model.layers.10', 1), ('model.layers.11', 1), ('model.layers.12', 1), ('model.layers.13', 1), ('model.layers.14', 1), ('model.layers.15', 1), ('model.layers.16', 'cpu'), ('model.layers.17', 'cpu'), ('model.layers.18', 'cpu'), ('model.layers.19', 'cpu'), ('model.layers.20', 'cpu'), ('model.layers.21', 'cpu'), ('model.layers.22', 'cpu'), ('model.layers.23', 'cpu'), ('model.layers.24', 'cpu'), ('model.layers.25', 'cpu'), ('model.layers.26', 'cpu'), ('model.layers.27', 'cpu'), ('model.layers.28', 'cpu'), ('model.layers.29', 'cpu'), ('model.layers.30', 'cpu'), ('model.layers.31', 'cpu'), ('model.norm', 'cpu'), ('model.rotary_emb', 'cpu'), ('lm_head', 'cpu')])

After this PR: the device map places some modules on GPU, some on CPU, and some on disk. The model can run on inputs and produce outputs, with no GPU OOM.

device_map=OrderedDict([('model.embed_tokens', 0), ('model.layers.0', 0), ('model.layers.1', 0), ('model.layers.2', 0), ('model.layers.3', 0), ('model.layers.4', 0), ('model.layers.5', 0), ('model.layers.6', 0), ('model.layers.7', 'cpu'), ('model.layers.8', 'cpu'), ('model.layers.9', 'cpu'), ('model.layers.10', 'cpu'), ('model.layers.11', 'cpu'), ('model.layers.12', 'cpu'), ('model.layers.13', 'cpu'), ('model.layers.14', 'cpu'), ('model.layers.15', 'cpu'), ('model.layers.16', 'cpu'), ('model.layers.17', 'cpu'), ('model.layers.18', 'cpu'), ('model.layers.19', 'cpu'), ('model.layers.20', 'cpu'), ('model.layers.21', 'cpu'), ('model.layers.22', 'cpu'), ('model.layers.23', 'cpu'), ('model.layers.24', 'cpu'), ('model.layers.25', 'cpu'), ('model.layers.26', 'cpu'), ('model.layers.27', 'cpu'), ('model.layers.28', 'disk'), ('model.layers.29', 'disk'), ('model.layers.30', 'disk'), ('model.layers.31', 'disk'), ('model.norm', 'disk'), ('model.rotary_emb', 'disk'), ('lm_head', 'disk')])

It's unclear to me why the first case will cause GPU OOM while the second case is OK.
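To compare the two device maps at a glance, a small sketch that counts modules per device. The helper names `summarize_device_map` and `build_device_map` are hypothetical; the layer ranges are transcribed from the two maps quoted above. This only summarizes placement and does not explain the OOM.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many top-level modules are assigned to each device."""
    return Counter(device_map.values())


def build_device_map(layer_ranges, tail_device):
    """Rebuild a map from (start, stop, device) layer ranges; stop is exclusive."""
    device_map = {"model.embed_tokens": 0}
    for start, stop, device in layer_ranges:
        for i in range(start, stop):
            device_map[f"model.layers.{i}"] = device
    for name in ("model.norm", "model.rotary_emb", "lm_head"):
        device_map[name] = tail_device
    return device_map


# Before this PR: layers 0-6 on GPU 0, 7-15 on GPU 1, the rest on CPU.
before = build_device_map([(0, 7, 0), (7, 16, 1), (16, 32, "cpu")], "cpu")
# After this PR: layers 0-6 on GPU 0, 7-27 on CPU, the rest on disk.
after = build_device_map([(0, 7, 0), (7, 28, "cpu"), (28, 32, "disk")], "disk")

before_counts = summarize_device_map(before)  # {0: 8, 1: 9, "cpu": 19}
after_counts = summarize_device_map(after)    # {0: 8, "cpu": 21, "disk": 7}
```

Besides the disk offload, the summary makes one other difference visible: the second map assigns nothing to device 1.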

I will wait until Friday evening for further review. If there is none by then, I will simply merge (after implementing a reusable helper patching method).

vasqu

Ok let's merge as you planned then (after the utils refactor), just one small question

I think more models could benefit from this but no rush, one step after the other

vasqu · 2026-06-11T12:33:28Z

-            cls.model = PhimoeForCausalLM.from_pretrained(
-                "microsoft/Phi-3.5-MoE-instruct", experts_implementation="eager", dtype="auto", device_map="auto"
-            )
+            import accelerate.utils


Can we use something from our utils instead? Or is accelerate always guaranteed to be installed at this point 🤔

Do you mean whether we will always have accelerate installed? At places like this, it's certain we have it.

But for the patch, I plan to use it in conftest.py (maybe, not 100% sure yet) so we don't need to patch every place manually. However, the whole test session should never see the full 750 GB of memory (it's an infra setting). So yes, it's better not to rely on accelerate.

I will refine the logic.

Great point, thank you.

ydshieh · 2026-06-12T13:11:24Z

Will merge on Sunday. Don't want to have a surprisingly sad Saturday!

github-actions · 2026-06-12T13:21:12Z

CI Dashboard: View test results in Grafana

github-actions · 2026-06-15T18:38:59Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: phimoe

ydshieh changed the title ~~[DON'T MERGE YET] Fix PhimoeIntegrationTest~~ Fix PhimoeIntegrationTest Jun 10, 2026

ydshieh requested review from Cyrilvallez, IlyasMoutawwakil and vasqu June 10, 2026 15:31

ydshieh force-pushed the fix_phimoe2 branch from e8d6a37 to 8c3de71 Compare June 10, 2026 19:40

vasqu approved these changes Jun 11, 2026

ydshieh added 6 commits June 15, 2026 16:08

fix

7a9ad30

fix

00de5f7

update

a677c70

fix

38b7b99

fix

eff8d05

fix

63cf04e

ydshieh force-pushed the fix_phimoe2 branch 2 times, most recently from 9932607 to 09c06b9 Compare June 15, 2026 15:39

device count

eefd4dc

ydshieh force-pushed the fix_phimoe2 branch from 8a68879 to eefd4dc Compare June 15, 2026 18:35

Merge branch 'main' into fix_phimoe2

de590a8
