Inference tutorial - Part 3 of e2e series #2343
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2343
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Cancelled Jobs
As of commit ccc2932 with merge base 2898903:
NEW FAILURE - The following job has failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @jainapurva, by the way I'm adding a [image not rendered]
Force-pushed from b93b892 to ce675b8
docs/source/inference.rst
Outdated
.. note::
   For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs <https://huggingface.co/docs/transformers/main/en/quantization/torchao>`_.

Inference with vLLM
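For readers following this thread, a minimal sketch of the HF-Torchao integration that the note points to. This is not the tutorial's exact code: the model id, the `Int4WeightOnlyConfig(group_size=128)` choice, and the recent-transformers `TorchAoConfig(quant_type=...)` usage are assumptions here.

```python
# Sketch: quantizing a Hugging Face model with a TorchAO config via transformers.
# Assumes recent transformers + torchao; the config used in the tutorial may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

model_id = "Qwen/Qwen3-8B"  # placeholder model id, not necessarily the tutorial's
quant_config = TorchAoConfig(quant_type=Int4WeightOnlyConfig(group_size=128))

# Quantization happens while loading the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```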
For this section, can you replace it with https://huggingface.co/pytorch/Qwen3-8B-int4wo-hqq#inference-with-vllm?
It might be easier to use the command line compared to code.
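For reference, a rough sketch of what that section looks like if it stays in Python, using vLLM's Python API with the quantized checkpoint from the linked model card. The sampling parameters and prompt below are assumptions; the model card also documents a `vllm serve` command-line flow, which is what the comment above suggests switching to.

```python
# Sketch: running the quantized checkpoint with vLLM's Python API.
# The linked model card also shows a `vllm serve` CLI flow; this is the code equivalent.
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Qwen3-8B-int4wo-hqq")
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=128)

prompts = ["Give me a short introduction to large language models."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```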
Looks great. Overall I feel we should add some more text between code blocks so it reads more like a tutorial, and remove some duplicate code, which is distracting to readers.
docs/source/serving.rst
Outdated
Step 1: Untie Embedding Weights
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:
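For context, a minimal sketch of what the untying step might look like, following the pattern from the Phi-4 model card referenced later in this thread. The model id and the exact attribute names (`config.tie_word_embeddings`, `lm_head`) are assumptions and should be adapted to the model used in the tutorial.

```python
# Sketch: untying tied input/output embeddings so they can be quantized differently.
# Model id and attribute names are assumptions; adapt to the tutorial's model.
import torch
from transformers import AutoModelForCausalLM

model_id = "microsoft/Phi-4-mini-instruct"  # hypothetical example model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Mark the config so save/load no longer re-ties the weights.
model.config.tie_word_embeddings = False

# Give lm_head its own copy of the weight instead of sharing embed_tokens' tensor.
model.lm_head.weight = torch.nn.Parameter(model.lm_head.weight.detach().clone())
```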
Is this step actually necessary? I don't think I had to do any of this for Llama models, for example. Can you share the source for this?
I'm using the same steps as here: https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w. In case of any updates, we should update both the model card and the tutorial with the same instructions.
Last tutorial of the 3-part series on using TorchAO in the model lifecycle.