
Conversation

@IlyasMoutawwakil (Member) commented Feb 4, 2025

What does this PR do?

This PR introduces upstream support for the HPU torch device/backend:

  • HPU is the device name for Intel Gaudi accelerators, powerful, energy-efficient ASICs for AI workloads.
  • Gaudi1 has been available on AWS since 2021; Gaudi2/Gaudi3 are available on Intel Dev Cloud and coming soon to IBM Cloud.
  • The documentation of the torch device is available here.
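For illustration, targeting the HPU device follows the usual PyTorch pattern: on Gaudi machines, importing the Habana bridge module registers the "hpu" backend. A minimal sketch (the helper name and CPU fallback are mine, not part of this PR):

```python
def select_device_name() -> str:
    """Return "hpu" when the Habana PyTorch bridge is importable, else "cpu".

    Hypothetical helper: importing habana_frameworks.torch.core is what
    registers the "hpu" backend with PyTorch on Gaudi machines.
    """
    try:
        import habana_frameworks.torch.core  # noqa: F401
        return "hpu"
    except ImportError:
        # No Habana stack installed; fall back to CPU.
        return "cpu"


print(select_device_name())
```

The returned string can then be passed to torch.device(...) as usual; on a machine without the Habana software stack this simply falls back to "cpu".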

This is one of three related PRs.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@muellerzr (Contributor) left a comment

Thanks a bunch! From the accelerate side this looks fine; do you want to do big model inference while you're at it? (If not, all good.)

I assume accelerate test etc went well? 🤗

@muellerzr muellerzr requested a review from SunMarc February 5, 2025 21:38
@IlyasMoutawwakil IlyasMoutawwakil marked this pull request as draft February 6, 2025 07:50
@IlyasMoutawwakil (Member, Author)

@muellerzr I forgot to mark it as a draft 😅.
I'm still debugging some issues on both the accelerate and optimum-habana sides; it should be ready by next week.

@SunMarc (Member) left a comment

Nice, thanks for the PR! +1 for big model inference as well, for a follow-up PR for example.

@muellerzr (Contributor)

No worries @IlyasMoutawwakil ! Just let us know when you're all set to go 🫡

@IlyasMoutawwakil IlyasMoutawwakil force-pushed the hpu-support branch 2 times, most recently from acc6b01 to 81a37be Compare March 10, 2025 10:15
@IlyasMoutawwakil (Member, Author)

All tests that don't require fp16/fp8 are passing on Gaudi1.
I'm gradually re-enabling tests that now pass on Gaudi2 with Synapse 1.20 + PyTorch 2.6, like pippy.
I added an explanation for every test skipped on HPU; there are mainly three reasons for skipping:

  • unsupported HPU device indexing (hpu:1).
  • unsupported empty_cache() op.
  • missing bitsandbytes (bnb) support.

One last test fails with no explanation, test_multi_device_merge_fsdp_weights; for now all I'm seeing is # Synapse detected a device critical error that requires a restart. I can investigate it later, outside the scope of this PR.
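The three skip reasons above could be encoded as unittest skip decorators. A sketch with hypothetical flag and helper names, not the decorators actually used in accelerate's test suite:

```python
import unittest

# Hypothetical capability flags mirroring the three skip reasons above.
HPU_SUPPORTS_DEVICE_INDEXING = False  # "hpu:1"-style indexing is unsupported
HPU_SUPPORTS_EMPTY_CACHE = False      # no empty_cache() op on HPU
HPU_SUPPORTS_BNB = False              # bitsandbytes is not available on HPU


def require_device_indexing(test_case):
    # Skip tests that address a specific accelerator index, e.g. "hpu:1".
    return unittest.skipUnless(
        HPU_SUPPORTS_DEVICE_INDEXING, "HPU does not support device indexing"
    )(test_case)


def require_empty_cache(test_case):
    return unittest.skipUnless(
        HPU_SUPPORTS_EMPTY_CACHE, "HPU does not support empty_cache()"
    )(test_case)


def require_bnb(test_case):
    return unittest.skipUnless(HPU_SUPPORTS_BNB, "bnb is not supported on HPU")(test_case)


class ExampleHPUTests(unittest.TestCase):
    @require_device_indexing
    def test_indexed_device(self):
        self.fail("would only run where device indexing is supported")
```

Flipping a flag to True re-enables the corresponding tests without touching each test body, which matches the "explain every skip" approach described above.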

Comment on lines 5 to 7
pull_request:
branches:
- main
@IlyasMoutawwakil (Member, Author)

This will be removed before merge; only the schedule trigger will stay.
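For context, a schedule-only trigger would replace the pull_request block above; the cron cadence here is an assumption for illustration, not the actual value in this PR:

```yaml
on:
  schedule:
    # Assumed cadence: run the HPU test suite nightly.
    - cron: "0 2 * * *"
```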

Comment on lines +234 to +236
# is_fp8_available only checks for libraries
# ideally it should check for device capability as well
fp8_is_available = is_fp8_available()
@IlyasMoutawwakil (Member, Author)

It seems that is_fp8_available() only checks for library availability, not for device capability.
check_fp8_capability is used for that, but isn't that confusing? is_fp16/bf16_available do check whether the device is capable of using those dtypes.

@IlyasMoutawwakil (Member, Author)

I didn't change the behavior of is_fp8_available() to avoid breaking backward compatibility, but it would make sense to have a single source of truth.
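A single source of truth could simply combine the two checks into one predicate. A sketch with stand-in stubs (the real accelerate helpers inspect actual libraries and hardware; the combined function name is hypothetical):

```python
def is_fp8_available() -> bool:
    # Stand-in stub: the real helper only checks that fp8 libraries
    # (e.g. transformer-engine) are importable.
    return True


def check_fp8_capability() -> bool:
    # Stand-in stub: the real helper checks whether the device itself
    # can run fp8; assumed False here for illustration.
    return False


def fp8_is_usable() -> bool:
    """Hypothetical combined predicate: libraries present AND device capable."""
    return is_fp8_available() and check_fp8_capability()


print(fp8_is_usable())
```

With such a predicate, callers no longer need to remember to pair the library check with the capability check.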

@muellerzr (Contributor)

I can see the logic, and I can see why that'd be confusing; I'll change it in a follow-up.

@muellerzr muellerzr merged commit d9e6af8 into main Mar 11, 2025
27 of 28 checks passed
@muellerzr muellerzr deleted the hpu-support branch March 11, 2025 15:16
@regisss regisss mentioned this pull request Apr 29, 2025