Replies: 1 comment
-
Turns out, it is a problem with HAMi, which is not constraining the memory here properly.
-
Hi, I am deploying a non-quantized Qwen3 4B with LMDeploy. When I load it with transformers on a GPU in float16, it takes around 9100 MB of VRAM.
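For reference, a minimal sketch of how that footprint can be measured with transformers (the Hugging Face model id Qwen/Qwen3-4B is an assumption; 4B parameters in float16 is roughly 8 GB of weights, which lands near 9100 MB once the CUDA context and buffers are included):

```python
# Minimal sketch (model id assumed): measure what PyTorch itself allocates
# when loading Qwen3 4B in float16 on a single GPU.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",            # assumed HF id; substitute a local path if needed
    torch_dtype=torch.float16,
).to("cuda:0")
torch.cuda.synchronize()

print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**2:.0f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**2:.0f} MiB")
```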
I was a bit shocked to discover that the same model, after the online TurboMind conversion, has a tiny footprint. I ran the method below in the lmdeploy container; the odd "HAMI" log lines are coming from HAMi.
That means this model is using only around 2663 MB of VRAM. How is that possible? What's going on here?
I don't know if it matters, but I'm using a V100 here (CUDA compute capability sm_70).
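For concreteness, here is one way such a memory check could look with LMDeploy's Python pipeline API (the model id and the cache_max_entry_count value are assumptions, not necessarily the settings used here):

```python
# Sketch (assumed model id and settings): start the model on the TurboMind
# backend, then ask NVML how much device memory is in use afterwards.
# Requires the nvidia-ml-py package for pynvml.
from lmdeploy import pipeline, TurbomindEngineConfig
import pynvml

pipe = pipeline(
    "Qwen/Qwen3-4B",
    backend_config=TurbomindEngineConfig(
        cache_max_entry_count=0.2,  # fraction of free VRAM reserved for the KV cache
    ),
)

pynvml.nvmlInit()
info = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0))
print(f"device memory used: {info.used / 1024**2:.0f} MiB")
```

Note that if a GPU-sharing layer such as HAMi intercepts NVML/CUDA calls inside the container, the number reported there may not match what nvidia-smi shows on the host, which fits the resolution in the reply above.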