Inference support for mps device #355

Merged: pcuenca merged 43 commits into main on Sep 8, 2022

Conversation

@pcuenca (Member) commented on Sep 4, 2022

This addresses the issues identified when assessing inference on Apple Silicon, see #292 (comment) for details.

Current status

  • Stable Diffusion pipeline works.
  • Results on CPU and MPS are reproducible when using the same seeds. Generators do not work on the mps device, so some minor adjustments were needed.
  • Some incompatible ops identified by failing tests were rewritten (or we fall back to CPU); a sketch of the fallback pattern follows this list. But I have only verified test_models_unet and test_models_vae.
  • Determinism tests pass, but the solution is a workaround hack.
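
For context, the CPU-fallback pattern referenced above looks roughly like this (a minimal sketch, not the actual diffusers code; the helper name and the choice of padding as the example op are illustrative):

```python
import torch
import torch.nn.functional as F


def pad_with_cpu_fallback(x: torch.Tensor, pad, mode="constant", value=0.0):
    # Hypothetical helper: some padding ops misbehaved on mps at the time
    # (see pytorch/pytorch#84535), so run the op on CPU and move the result
    # back to the original device.
    if x.device.type == "mps":
        return F.pad(x.cpu(), pad, mode=mode, value=value).to(x.device)
    return F.pad(x, pad, mode=mode, value=value)
```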

The hack

We perform a one-time "warmup" forward pass through unet and vae because the result of the first pass differs from subsequent results. I suspect something related to randomness might be at play, but I haven't identified the root cause. We have several options here:

  • Remove the hack; it's probably overkill. We can recommend that users run a full pass through the pipeline (1 step is enough) after moving it to the device. The downside is that the determinism tests in test_models_unet and test_models_vae fail.
  • Find the cause and apply a proper solution.

I'd like to merge soon but don't like the hack. Can we remove it and live with some failing tests until we can find what's causing the issue?

Update: we now only perform the hack during testing. Users who care about reproducibility are recommended to perform an initial pass, as sketched below.
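
For reference, the user-side warmup amounts to something like the following (a sketch only; the model id, prompt, and step counts are illustrative, and the exact pipeline API may have changed since this PR):

```python
from diffusers import StableDiffusionPipeline

# After moving the pipeline to mps, run one throwaway 1-step pass so that
# subsequent results are reproducible, then generate normally.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("mps")

prompt = "a photo of an astronaut riding a horse"
_ = pipe(prompt, num_inference_steps=1)  # warmup pass, output discarded
image = pipe(prompt, num_inference_steps=50).images[0]
```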

To do:

  • Some unet and vae tests still fail. For example, test_output_pretrained in AutoencoderKLTests.
  • Translate some incompatible ops or fall back to CPU.
  • Ensure tests pass.
  • Find a more principled workaround to perform the initial "warm up" pass. I tried to use forward hooks, but couldn't, because they don't pass keyword arguments, which we use in many of our forward implementations (see the toy example after this list).
  • Create a failing test case for the first-pass issue (results differ from those of subsequent passes). Deferred; tracked in MPS: models require an initial pass for reproducibility #372.
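
To make the forward-hook limitation concrete, here is a small self-contained toy (not diffusers code): at the time, PyTorch forward (pre-)hooks only received positional arguments, so keyword arguments used in our forward implementations never reach the hook (keyword support via with_kwargs only arrived in later PyTorch releases).

```python
import torch
from torch import nn


class Toy(nn.Module):
    def forward(self, x, scale=1.0):
        return x * scale


def pre_hook(module, args):
    # Only positional inputs show up here; the `scale=2.0` keyword passed below
    # is invisible to the hook, which is what breaks the warmup-via-hooks idea.
    print("hook saw:", args)


toy = Toy()
handle = toy.register_forward_pre_hook(pre_hook)
toy(torch.ones(1), scale=2.0)  # the hook prints only the positional tensor
handle.remove()
```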

Fixes #292.

Required when classifier-free guidance is enabled.
For some reason the first run produces results different than the rest.
This is especially important when using the mps device, because
generators are not supported there. See for example
pytorch/pytorch#84288.

In addition, the other pipelines seem to use the same approach: generate
the random samples then move to the appropriate device.

After this change, generating an image in MPS produces the same result
as when using the CPU, if the same seed is used.
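
The pattern described in this commit message amounts to roughly the following (an illustrative sketch; the latent shape and seed are placeholders, not the pipeline's actual code):

```python
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

# Seeded generation happens on CPU because torch.Generator is not supported on
# the mps device (see pytorch/pytorch#84288); the latents are then moved to the
# target device, so the same seed yields the same image on CPU and mps.
generator = torch.Generator(device="cpu").manual_seed(42)
latents = torch.randn((1, 4, 64, 64), generator=generator)
latents = latents.to(device)
```
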
@HuggingFaceDocBuilderDev commented on Sep 4, 2022

The documentation is not available anymore as the PR was closed or merged.

@patrickvonplaten (Contributor) left a comment

Not a big fan of the warmup abstraction - I'd prefer to just leave it to the user to call the warmup method. Ok with adding a warmup_mps() method to the ModelMixin, but don't like that this is automatically called - wdyt @pcuenca ?

@patrickvonplaten (Contributor)

So I'd advocate for the following:

  • Add an experimental _mps_warmup(...) method to ModelMixin and PipelineMixin that the user has to call themselves.
  • With the underscore _ and with a note we make sure to tell people it's experimental and due to a hack currently.
  • We can adapt the tests with some `if device == "mps"` checks to make them pass

@pcuenca (Member, Author) commented on Sep 5, 2022

> So I'd advocate for the following:
>
>   • Add an experimental _mps_warmup(...) method to ModelMixin and PipelineMixin that the user has to call themselves.
>   • With the underscore _ and with a note we make sure to tell people it's experimental and due to a hack currently.
>   • We can adapt the tests with some `if device == "mps"` checks to make them pass

Actually I think I'd recommend users to run a pipeline pass with 1 iteration instead before using the outputs, since that's easier. We can use the underscored methods ourselves just to make tests pass.

@pcuenca (Member, Author) commented on Sep 5, 2022

> Not a big fan of the warmup abstraction - I'd prefer to just leave it to the user to call the warmup method. Ok with adding a warmup_mps() method to the ModelMixin, but don't like that this is automatically called - wdyt @pcuenca ?

Yes, you are right, it's not important enough to expose it to all users. Thanks!

@pcuenca (Member, Author) commented on Sep 5, 2022

I removed the new abstraction and the automatic warmup pass, and created a new README_mps.md file with installation instructions and the recommendation to run a 1-step inference.

@patrickvonplaten (Contributor)

Also @pcuenca - could you downgrade black to 22.3.0 so that make style works? :-) @anton-l let's upgrade black after the release on Thursday :-)

Point to the documentation instead.
@@ -38,6 +40,11 @@ def test_from_pretrained_save_pretrained(self):
        new_model.to(torch_device)

        with torch.no_grad():
            # Warmup pass when using mps (see #372)
            if torch_device == "mps" and isinstance(model, ModelMixin):
                _ = model(**self.dummy_input)
A Contributor commented on this diff:
nice!

@patrickvonplaten (Contributor) left a comment

Looks very nice! Feel free to merge whenever @pcuenca :-)

@pcuenca (Member, Author) commented on Sep 8, 2022

Thanks for the help!

@pcuenca pcuenca merged commit 5dda173 into main Sep 8, 2022
@pcuenca pcuenca deleted the mps branch September 8, 2022 11:37
PhaneeshB pushed a commit to nod-ai/diffusers that referenced this pull request Mar 1, 2023
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
* Initial support for mps in Stable Diffusion pipeline.

* Initial "warmup" implementation when using mps.

* Make some deterministic tests pass with mps.

* Disable training tests when using mps.

* SD: generate latents in CPU then move to device.

This is especially important when using the mps device, because
generators are not supported there. See for example
pytorch/pytorch#84288.

In addition, the other pipelines seem to use the same approach: generate
the random samples then move to the appropriate device.

After this change, generating an image in MPS produces the same result
as when using the CPU, if the same seed is used.

* Remove prints.

* Pass AutoencoderKL test_output_pretrained with mps.

Sampling from `posterior` must be done in CPU.

* Style

* Do not use torch.long for log op in mps device.

* Perform incompatible padding ops in CPU.

UNet tests now pass.
See pytorch/pytorch#84535

* Style: fix import order.

* Remove unused symbols.

* Remove MPSWarmupMixin, do not apply automatically.

We do apply warmup in the tests, but not during normal use.
This adopts some PR suggestions by @patrickvonplaten.

* Add comment for mps fallback to CPU step.

* Add README_mps.md for mps installation and use.

* Apply `black` to modified files.

* Restrict README_mps to SD, show measures in table.

* Make PNDM indexing compatible with mps.

Addresses huggingface#239.

* Do not use float64 when using LDMScheduler.

Fixes huggingface#358.

* Fix typo identified by @patil-suraj

Co-authored-by: Suraj Patil <[email protected]>

* Adapt example to new output style.

* Restore 1:1 results reproducibility with CompVis.

However, mps latents need to be generated in CPU because generators
don't work in the mps device.

* Move PyTorch nightly to requirements.

* Adapt `test_scheduler_outputs_equivalence` to MPS.

* mps: skip training tests instead of ignoring silently.

* Make VQModel tests pass on mps.

* mps ddim tests: warmup, increase tolerance.

* ScoreSdeVeScheduler indexing made mps compatible.

* Make ldm pipeline tests pass using warmup.

* Style

* Simplify casting as suggested in PR.

* Add Known Issues to readme.

* `isort` import order.

* Remove _mps_warmup helpers from ModelMixin.

And just make changes to the tests.

* Skip tests using unittest decorator for consistency.

* Remove temporary var.

* Remove spurious blank space.

* Remove unused symbol.

* Remove README_mps.

Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Successfully merging this pull request may close these issues: Add inference support for mps device (Apple Silicon).