Fix PhimoeIntegrationTest#46539
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
PhimoeIntegrationTestPhimoeIntegrationTest
| import accelerate.utils | ||
|
|
||
| # Our CI runners with A10 GPU use instance sharing, which makes each runner report ~750 GB of CPU RAM. | ||
| # With that much CPU memory visible, `device_map="auto"` fits all layers on GPU+CPU with nothing | ||
| # spilling to disk. This produces a device map that causes GPU OOM when the model is actually run. | ||
| # We patch `accelerate_max_memory` to cap CPU at 60 GiB — the amount a dedicated A10 instance | ||
| # normally sees — so that the same layers are offloaded to disk as on a real single-tenant runner, | ||
| # avoiding the OOM. | ||
| _original_get_max_memory = accelerate.utils.get_max_memory | ||
|
|
||
| def _cap_cpu_memory(max_memory=None): | ||
| result = _original_get_max_memory(max_memory) | ||
| result["cpu"] = min(result.get("cpu", float("inf")), 60 * 1024**3) | ||
| return result | ||
|
|
||
| cls.offload_dir = tempfile.TemporaryDirectory() | ||
| with patch("transformers.integrations.accelerate.accelerate_max_memory", _cap_cpu_memory): |
There was a problem hiding this comment.
We should probably add such a patching method in testing_utils because we might need to use it in several places
There was a problem hiding this comment.
For the record, this is what we have in our CI runner, which shows CPU RAM total=747.90 GB and no offloaded to disk (without this PR).
WARNING transformers.modeling_utils:modeling_utils.py:4290 [device_map] Before _get_device_map — device_map='auto', max_memory=None, CPU RAM total=747.90 GB, available=727.44 GB, used=20.46 GB, percent=2.7%
WARNING transformers.modeling_utils:modeling_utils.py:4301 [device_map] After _get_device_map — device_map=OrderedDict([('model.embed_tokens', 0), ('model.layers.0', 0), ('model.layers.1', 0), ('model.layers.2', 0), ('model.layers.3', 0), ('model.layers.4', 0), ('model.layers.5', 0), ('model.layers.6', 0), ('model.layers.7', 1), ('model.layers.8', 1), ('model.layers.9', 1), ('model.layers.10', 1), ('model.layers.11', 1), ('model.layers.12', 1), ('model.layers.13', 1), ('model.layers.14', 1), ('model.layers.15', 1), ('model.layers.16', 'cpu'), ('model.layers.17', 'cpu'), ('model.layers.18', 'cpu'), ('model.layers.19', 'cpu'), ('model.layers.20', 'cpu'), ('model.layers.21', 'cpu'), ('model.layers.22', 'cpu'), ('model.layers.23', 'cpu'), ('model.layers.24', 'cpu'), ('model.layers.25', 'cpu'), ('model.layers.26', 'cpu'), ('model.layers.27', 'cpu'), ('model.layers.28', 'cpu'), ('model.layers.29', 'cpu'), ('model.layers.30', 'cpu'), ('model.layers.31', 'cpu'), ('model.norm', 'cpu'), ('model.rotary_emb', 'cpu'), ('lm_head', 'cpu')]), CPU RAM total=747.90 GB, available=727.37 GB, used=20.53 GB, percent=2.7%
There was a problem hiding this comment.
Oh wow, that should likely really be a general utility - given this will always happen for models hitting this case
| experts_implementation="eager", | ||
| dtype="auto", | ||
| device_map="auto", | ||
| offload_folder=cls.offload_dir.name, |
There was a problem hiding this comment.
We need to specify offload_folder after @Cyrilvallez work
| [-3.5625, -2.4375, -1.3672, 0.3438, -0.7539, -0.4590, 0.6133, -0.4531, 0.2188, -1.2422], | ||
| [-0.9688, 0.3633, -0.4902, 2.3281, 0.6250, 3.1094, 0.3828, 0.1670, 0.5781, -2.1094], |
There was a problem hiding this comment.
This changes somehow. But the other 2 integration tests of generations pass without any change. The model is fine.
|
Imo looks good, but would really wait on @Cyrilvallez here to double check since he does/did the most work regarding offloading. Maybe we can also not use accelerate, do we maybe have some util for that? |
|
Just a bit more context: Before this PR, on our scheduled CI: device map only has some in GPU and some in CPU, but nothing not on disk. This causes GPU OOM when the model running on inputs.
After this PR: device map only has some in GPU and some in CPU, and some on disk. The model could run on inputs and get outputs, no GPU OOM.
It's unclear to me why the first case will cause GPU OOM while the second case is OK. I will wait until Friday evening. Otherwise I will simply merge (after implementing a reusable helper patching method). |
vasqu
left a comment
There was a problem hiding this comment.
Ok let's merge as you planned then (after the utils refactor), just one small question
I think more models could benefit from this but no rush, one step after the other
| cls.model = PhimoeForCausalLM.from_pretrained( | ||
| "microsoft/Phi-3.5-MoE-instruct", experts_implementation="eager", dtype="auto", device_map="auto" | ||
| ) | ||
| import accelerate.utils |
There was a problem hiding this comment.
Can we use something from our utils instead? Or is it always guaranteed to have accelerate atp 🤔
There was a problem hiding this comment.
do you mean if we will always have accelerate installed ? At places like this, it's certain we have it.
But for the patch, I plan to use it in conftest.py (maybe, not 100% sure yet) so we don't need to patch every place manually. During the whole test sessions, it should not see the whole 750G memory however (it's a infra setting stuff). So yes, it's better not to rely on accelerate.
I will refine the logic.
Great point, thank you.
|
Will merge on Sunday. Don't want to have a surprisingly sad Saturday! |
|
CI Dashboard: View test results in Grafana |
9932607 to
09c06b9
Compare
|
[For maintainers] Suggested jobs to run (before merge) run-slow: phimoe |
What does this PR do?
Fix
PhimoeIntegrationTest. See the comments.