Implement support for AutoGPTQ for loading GPTQ quantized models. #1668
Conversation
Awesome! I strongly believe that AutoGPTQ is the way forward for GPTQ, and it's been seeing rapid progress recently. I release a lot of GPTQs on HF and am really hoping that in the future it will be easier for users to use them. I think AutoGPTQ is the right repo for the community to gather around for future development. I will try your PR shortly with my GPTQs.
Great, it's finally working. I did the same thing 3 days ago and would get errors loading the state_dict. Has anyone benched it yet vs ooba's GPTQ?
Added support for offloading and multiple devices. However, this requires more testing and I don't think it was working properly for me. To use these features, disable triton (--autogptq-triton flag). Multiple GPUs especially would be useful to test; for that, use the --gpu-memory flag, since the UI only has one device. You will likely need
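For anyone testing the multi-GPU/offload path, here is a minimal sketch of roughly what the webui's --gpu-memory values translate to on the AutoGPTQ side. It assumes from_quantized() accepts an accelerate-style max_memory mapping; treat the exact signature as an assumption, and the model path and memory limits are placeholders.

```python
# Sketch only: split a GPTQ model across two GPUs with CPU offload via AutoGPTQ.
# Assumes from_quantized() accepts an accelerate-style max_memory dict; the path
# and the memory limits are placeholders.
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "models/my-gptq-model",   # hypothetical local model folder
    use_triton=False,         # offload/multi-GPU testing with the CUDA kernels
    use_safetensors=True,
    max_memory={0: "10GiB", 1: "10GiB", "cpu": "20GiB"},
)
```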
OK, I've done some testing. Firstly, thanks so much for getting this PR'd - it's awesome to see my models loading with AutoGPTQ! My findings so far; I'll do some more testing tomorrow. All testing with:

AutoGPTQ testing

Triton
For comparison, GPTQ-for-LLaMa Triton records 15-16 token/s on this same GPU + model.

CUDA

text-generation-webui specific
When running server.py with
https://github.com/qwopqwop200/AutoGPTQ-no-act-order |
|
Just tried loading one of my models which doesn't use
And I get:
Folder structure:
Command:
Error:
I tried creating a
Additionally, the speed difference between AutoGPTQ and GPTQ-for-LLaMa is due to fused-attn in CUDA. In the case of Triton, it comes from fused-attn and fused-mlp.
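If those optimisations land in AutoGPTQ main, the loader could expose them as toggles. The sketch below is an assumption based on the inject_fused_attention / inject_fused_mlp keyword arguments found in later AutoGPTQ releases, not on anything in this PR.

```python
# Sketch only: opting into fused kernels when loading with AutoGPTQ.
# The inject_fused_attention / inject_fused_mlp kwargs are an assumption based on
# later AutoGPTQ releases; older versions may not accept them.
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "models/my-gptq-model",       # hypothetical path
    use_triton=True,              # the Triton path benefits from fused attention and fused MLP
    inject_fused_attention=True,
    inject_fused_mlp=True,
)
```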
Thank you for your reports. Only, I didn't want to change the detection, just put this in as an alternative for when it tries to load GfL. The check for
@qwopqwop200 I think a separate fork would go against the idea of implementing AutoGPTQ as a universal solution. Would it be possible to implement those optimisations in AutoGPTQ main?
I think it will be possible. I'll try later. |
AutoGPTQ only supports |
Now checking for quantize_config.json; if it exists, wbits does not need to be set manually. I don't think it is related to anything in this commit, but certain models fail to load. Example model
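A rough sketch of the detection logic described above, not the PR's exact code: if quantize_config.json is missing, fall back to building a BaseQuantizeConfig from the user-supplied wbits/groupsize; if it is present, AutoGPTQ reads the settings itself.

```python
# Illustrative sketch of the quantize_config.json fallback, not the PR's exact code.
from pathlib import Path
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

def load_gptq(model_dir: str, wbits: int = 4, groupsize: int = 128, use_triton: bool = False):
    quantize_config = None  # None lets AutoGPTQ read quantize_config.json from the folder
    if not (Path(model_dir) / "quantize_config.json").exists():
        # No config shipped with the model: build one from the manual flags.
        quantize_config = BaseQuantizeConfig(bits=wbits, group_size=groupsize)
    return AutoGPTQForCausalLM.from_quantized(
        model_dir,
        use_triton=use_triton,
        use_safetensors=True,
        quantize_config=quantize_config,
    )
```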
Created a PR to update AutoGPTQ to provide optimizations. |
Nice. But does that mean that act-order needs to be passed? It can't be automatically checked for? |
An automatic check seems very hard to implement. |
Probably we'll just have to do a best-effort check of the filename then. Do you have any ideas about the issue with Pygmalion models?
1. Given the way AutoGPTQ currently loads models, it's not a good idea to check whether or not act-order (desc_act) is used based on the file name.
We can specify
Example
Associated

If there is going to be a difference in inference with/without desc_act, I think there probably needs to be a new command line parameter and UI GPTQ parameter in ooba to specify whether the model is desc_act or not, eg

I will soon go back through all my GPTQ models on HF and add a quantize_config.json to every one. One issue is that in my older repos I have two model files per repo - one with desc_act (act-order), and one without. So I will have to move to using separate branches: one for the desc_act/act-order model, one for the no-desc_act/no-act-order model.

@qwopqwop200 how big of an improvement is using desc_act? Maybe I should not even bother making desc_act models any more if they are always going to have problems for users on CUDA? So far I have always made two models: one with groupsize=128 + desc_act, and one with groupsize=128 and no desc_act. I thought that would give users the choice of the 'best' model or the 'compatible' model. But this is extra work and I think it adds more confusion for users.
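For illustration, a quantize_config.json along these lines could be added to an existing repo with a few lines of Python; the field names mirror AutoGPTQ's BaseQuantizeConfig (bits, group_size, desc_act), though the exact key set may differ between AutoGPTQ versions.

```python
# Illustrative only: write a minimal quantize_config.json next to an existing GPTQ model.
# Field names mirror AutoGPTQ's BaseQuantizeConfig; the exact keys may vary by version.
import json

config = {
    "bits": 4,          # quantization bit width (wbits)
    "group_size": 128,  # -1 would mean no grouping
    "desc_act": False,  # True for act-order models
}

with open("quantize_config.json", "w") as f:
    json.dump(config, f, indent=2)
```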
Oh yeah, of course! OK. I guess I can go back through my older models and rename
Or maybe I could PR a change to AutoGPTQ to also check for .pt files. I will ask PanQiWei what they think.
I agree completely. I think AutoGPTQ should support everything. Otherwise it is really confusing for users.
Yeah, I understand. Maybe it is beyond the scope of this PR, but I do think the UI should support the features I mention.
With GPTQ-for-Llama, a text-gen-ui user had to either specify
But now with AutoGPTQ, we can avoid that extra work for users because we have
But that could be done in a separate PR, after yours is merged. Or maybe ooba will do it himself.
1. It's already implemented that way.
Thank you!
But desc_act does seem to work with the current AutoGPTQ CUDA code? Using the latest AutoGPTQ in CUDA mode I can run inference on models I created with --act-order in GPTQ-for-Llama, and it does work. So is it only old versions of GPTQ-for-LLaMa CUDA that can't use desc_act/--act-order?
OK, thank you. So for Llama, maybe it would be easier not to keep making desc_act/--act-order models if it is going to cause performance or compatibility problems for some users.
Yes. And this optimization of AutoGPTQ is also obtained using the old CUDA kernel.
Group size breaks 30B models on 24GB VRAM, so I liked act-order to smarten them up slightly. You don't have to rename the models, just create a symbolic link.
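A small illustration of the symlink suggestion, with hypothetical file names (the actual names are not shown in this thread):

```python
# Illustrative only: expose an existing .pt checkpoint under the .bin name the loader
# looks for, without renaming or copying it. File names here are hypothetical.
import os

src = "models/my-gptq-model/gptq-4bit-128g.pt"   # existing checkpoint (hypothetical name)
dst = "models/my-gptq-model/gptq-4bit-128g.bin"  # name the loader searches for (assumption)

if not os.path.exists(dst):
    os.symlink(os.path.abspath(src), dst)
```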
@oobabooga now seriously, please explain why you closed this out of nowhere?
@oobabooga ? 😢
I will keep an updated fork here, so if you want to use AutoGPTQ with textgen, you can use that fork.
Thanks LaaZa. My guess is that ooba closed it by mistake. Clicked the wrong button or something. There's not been any purge of other open PRs. And he said he was definitely interested. If we've not heard soon, maybe just open a new PR. |
Something else is going on.
Based on...?
I closed this because the author decided it would be funny to mock and insult me in another PR. This kind of behavior is not accepted in this repository. Cut the drama and try to be constructive instead.
I needed code created here to load models without a
This is a quick implementation of PanQiWei/AutoGPTQ for inference.
This is an alternative to the current GPTQ-for-LLaMA, hopefully offering a more universally supported option that is not limited to one platform like Linux.
AutoGPTQ supports CUDA, Triton (on Linux) and CPU. Splitting using pre_layer is not supported. #1263 implements another alternative for GPTQ, but because it relies on Triton it is not universal; this PR should be compatible with it as an option, though.
Right now this requires a version of AutoGPTQ newer than the 0.0.5 release on PyPI, so building from source is required at the time of writing.
From my testing it appears to be slightly slower than the GPTQ-for-LLaMA Triton branch, and slower still with CUDA. I have not compared against CUDA versions of GPTQ-for-LLaMA. It is probably slower than #1263.
But AutoGPTQ is seeing rapid development and will likely gain better performance while maintaining compatibility, which I think is the main benefit of this implementation.
Please give feedback; testing is appreciated.
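For context, here is a minimal sketch of what inference through AutoGPTQ looks like outside the webui; the model path and prompt are placeholders, and the from_quantized arguments are my assumptions about AutoGPTQ's public API rather than anything specified in this PR.

```python
# Minimal AutoGPTQ inference sketch (illustrative; path and prompt are placeholders).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "models/my-gptq-model"  # hypothetical local folder containing a GPTQ checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_triton=False,      # CUDA kernels; Triton is the Linux-only alternative
    use_safetensors=True,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```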