
Conversation

@LaaZa
Contributor

@LaaZa LaaZa commented Apr 30, 2023

This is a quick implementation of PanQiWei/AutoGPTQ for inference.

This is an alternative to the current GPTQ-for-LLaMa integration, hopefully offering a more universally supported option that is not limited to a single platform such as Linux.

AutoGPTQ supports CUDA, Triton (on Linux), and CPU. Splitting with pre_layer is not supported.

#1263 implements another alternative for GPTQ, but because it relies on Triton it is not universal. This PR should still be compatible with it as an option.

Right now this requires a version of AutoGPTQ newer than the 0.0.5 release on PyPI, so building from source is required at the time of writing.

From my testing it appears to be slightly slower than the GPTQ-for-LLaMa Triton branch, and slower still with CUDA. I have not compared against the CUDA versions of GPTQ-for-LLaMa. It is probably slower than #1263 as well.

But AutoGPTQ is seeing rapid development and will likely improve in performance while maintaining compatibility, and I think this is the main benefit of this implementation.

Please give feedback; testing is appreciated.

@TheBloke
Contributor

Awesome! I strongly believe that AutoGPTQ is the way forward for GPTQ and yeah it's been seeing rapid progress recently.

I release a lot of GPTQs on HF and am really hoping that in the future it will be easier for users to use them. And I think AutoGPTQ is the right repo for the community to gather around for future development.

I will try your PR shortly with my GPTQs.

@Ph0rk0z
Contributor

Ph0rk0z commented Apr 30, 2023

Great that it's finally working. I did the same thing three days ago and kept getting errors loading the state_dict.

Anyone bench it yet vs ooba's gptq?

@LaaZa
Contributor Author

LaaZa commented Apr 30, 2023

Added support for offloading and multiple devices.
Uses --gpu-memory and --cpu-memory, and adds --autogptq-device-map to set Accelerate's device_map; if memory limits are specified, the device map defaults to 'auto' unless something else is set.

However, this needs more testing and I don't think it was working properly for me.
In my testing, VRAM was filled with the whole model regardless of what memory limits were specified, and RAM fluctuated during loading while VRAM was already at its usual level (not full, but normal usage for the model). Loading was also very slow.

To use these features, disable Triton (i.e. do not pass the --autogptq-triton flag).

Testing with multiple GPUs would be especially useful; for that, use the --gpu-memory flag, since the UI exposes only one device.
--auto-devices will use ooba's automatic memory settings, as with normal non-GPTQ model loading.

You will likely need protobuf==3.20.
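
For reference, this is roughly how such limits end up as the max_memory dict that Accelerate's device_map='auto' expects; the helper below is only a sketch with made-up names, not the PR's actual code:

# Sketch only: turning --gpu-memory / --cpu-memory style limits into the
# max_memory dict used by Accelerate when device_map="auto". The helper name
# and defaults are assumptions, not the PR's actual implementation.
def build_max_memory(gpu_memory=None, cpu_memory=None):
    max_memory = {}
    for i, limit in enumerate(gpu_memory or []):  # e.g. ["20GiB", "22GiB"]
        max_memory[i] = limit
    if cpu_memory:
        max_memory["cpu"] = cpu_memory            # e.g. "64GiB"
    return max_memory or None                     # None -> let 'auto' decide freely

# Example: two GPUs capped at 20 GiB each, with up to 64 GiB of CPU offload.
# build_max_memory(["20GiB", "20GiB"], "64GiB") -> {0: "20GiB", 1: "20GiB", "cpu": "64GiB"}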

@TheBloke
Contributor

OK I've done some testing. Firstly, thanks so much for getting this PR'd - it's awesome to see my models loading with AutoGPTQ!

My findings so far. I'll do some more testing tomorrow.

All testing with:

  • https://huggingface.co/TheBloke/wizardLM-7B-GPTQ
    • Bits = 4. Groupsize = 128. One model with desc_act, one without.
  • Ubuntu 20.04
  • NVidia 4090 24GB
  • CUDA 11.6
  • AutoGPTQ installed with pip install . as of e2c7cd4fb3765538569f903ca7f81563fce70c6e
  • LaaZa:AutoGPTQ as of 0fd4857646c56deb7b7a2b7df561e35e0a172d0b

AutoGPTQ testing

Triton

--model XYZ --wbits 4 --groupsize 128 --model_type llama --autogptq --autogptq-triton

  • act-order (desc_act = True) : WORKS. 11-12 token/s
  • no-act-order (desc_act = False) : WORKS. 11-12 token/s

For comparison, GPTQ-for-LLaMa Triton records 15-16 token/s on this same GPU + model.

CUDA

--model XYZ --wbits 4 --groupsize 128 --model_type llama --autogptq

text-generation-webui specific

When running server.py with --autogptq --autogptq-triton and no models yet downloaded

  • Download a model that has quantize_config.json defined then Refresh Model and choose model
    • UI throws error "no file named pytorch_model.bin" - it's trying to load it as a non-GPTQ model because no GPTQ params are defined. But there is a quantize_config.json so it could read this to know it's a GPTQ.
    • UI GPTQ params also should update to reflect the contents of quantize_config.json

When running server.py with --autogptq --autogptq-triton and one or more GPTQ models downloaded which have quantize_config.json

  • UI fails to load because it doesn't recognise the model as a GPTQ model:
root@cee7d387f502:~/textgen-auto# python server.py  --autogptq --autogptq-triton  --listen
Gradio HTTP request redirected to localhost :)
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda116.so
Loading act...
Traceback (most recent call last):
  File "/root/textgen-auto/server.py", line 914, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/root/textgen-auto/modules/models.py", line 84, in load_model
    model = LoaderClass.from_pretrained(Path(f"{shared.args.model_dir}/{model_name}"), low_cpu_mem_usage=True, torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16, trust_remote_code=trust_remote_code)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2405, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory models/act.

It should recognise the quantize_config.json and detect it as a GPTQ.

Otherwise the user must always specify --wbits --groupsize --model_type params in order to launch the server, even when quantize_config.json exists.
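
As a rough illustration of what I mean (hypothetical code, not the loader's actual implementation), the detection could prefer quantize_config.json and only fall back to --wbits:

# Hypothetical detection sketch: treat a folder as GPTQ if it ships a
# quantize_config.json, otherwise fall back to the user-supplied --wbits.
import json
from pathlib import Path

def detect_gptq(model_dir, wbits=0):
    cfg = Path(model_dir) / "quantize_config.json"
    if cfg.exists():
        # e.g. {"bits": 4, "group_size": 128, "desc_act": true}
        return True, json.loads(cfg.read_text())
    if wbits > 0:
        return True, {"bits": wbits}  # old behaviour: rely on CLI params
    return False, None                # load as a regular HF model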

Next steps

I'll do some more testing tomorrow, eg with GPU splitting, offloading, etc.

@qwopqwop200

https://github.com/qwopqwop200/AutoGPTQ-no-act-order
Here's a fork of auto-gptq that takes the existing old CUDA kernel and provides faster speeds.
You'll most likely get a faster speed with it.
However, this version does not support using act-order and groupsize at the same time.

@TheBloke
Contributor

TheBloke commented Apr 30, 2023

Just tried loading one of my models which doesn't use safetensors: https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g/tree/main

And I get: FileNotFoundError: No quantized model found for TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g

Folder structure:

root@cee7d387f502:/workspace/TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g# ll
total 3809388
drwxrwxrwx  2 root root    3000362 Apr 30 23:45 ./
drwxrwxrwx 34 root root    3066244 Apr 30 23:28 ../
-rw-rw-rw-  1 root root        581 Apr 28 08:00 config.json
-rw-rw-rw-  1 root root        136 Apr 26 16:21 generation_config.json
-rw-rw-rw-  1 root root        411 Apr 26 16:21 special_tokens_map.json
-rw-rw-rw-  1 root root     499723 Apr 26 16:21 tokenizer.model
-rw-rw-rw-  1 root root        700 Apr 28 10:16 tokenizer_config.json
-rw-rw-rw-  1 root root 3894242469 Apr 12 23:08 vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.pt
root@cee7d387f502:/workspace/TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g#

Command:

python server.py --model TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g  --autogptq --autogptq-triton  --listen --wbits 4 --groupsize 128 --model_type llama

Error:

Gradio HTTP request redirected to localhost :)
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda116.so
Loading TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g...
Traceback (most recent call last):
  File "/root/textgen-auto/server.py", line 914, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/root/textgen-auto/modules/models.py", line 158, in load_model
    model = load_quantized(model_name)
  File "/root/textgen-auto/modules/AutoGPTQ_loader.py", line 46, in load_quantized
    raise FileNotFoundError(f'No quantized model found for {model_name}')
FileNotFoundError: No quantized model found for TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g
root@cee7d387f502:~/textgen-auto#

I tried creating a quantize_config.json as well but that didn't help.

@qwopqwop200

Additionally, the speed difference between AutoGPTQ and GPTQ-for-LLaMa on CUDA is due to fused-attn. Likewise, in the case of Triton, it occurs because of fused-attn and fused-mlp.
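
For anyone unfamiliar, here is a toy PyTorch illustration of what 'fused' attention means; it is conceptual only, not AutoGPTQ or GPTQ-for-LLaMa code:

# Conceptual only: "fusing" attention replaces separate Q/K/V projections with
# one larger matmul, which is where much of the kernel-level speedup comes from.
import torch
import torch.nn as nn

hidden = 512
x = torch.randn(1, 8, hidden)

# Unfused: three separate matmuls per attention block.
q_proj, k_proj, v_proj = (nn.Linear(hidden, hidden, bias=False) for _ in range(3))
q, k, v = q_proj(x), k_proj(x), v_proj(x)

# Fused: one matmul producing Q, K and V together.
qkv_proj = nn.Linear(hidden, 3 * hidden, bias=False)
q, k, v = qkv_proj(x).chunk(3, dim=-1)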

@LaaZa
Contributor Author

LaaZa commented May 1, 2023

Thank you for your reports.

Only --wbits should matter for whether it knows to load a GPTQ model.
--model_type has no effect on AutoGPTQ; the model type is detected automatically and the flag is completely ignored by this loader.
quantize_config.json will override any other settings. I'll look into updating the UI.

I didn't want to change the detection, just add this as an alternative for when it tries to load GPTQ-for-LLaMa. The check for quantize_config.json would have to happen in a different place. I'll look into whether I can change the loading criteria without changing the code too much.

@qwopqwop200 I think a separate fork would go against the idea of implementing AutoGPTQ as a universal solution. Would it be possible to implement those optimisations in the AutoGPTQ main branch?

@qwopqwop200

I think it will be possible. I'll try later.

@LaaZa
Contributor Author

LaaZa commented May 1, 2023

Just tried loading one of my models which doesn't use safetensors: https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g/tree/main

And I get: FileNotFoundError: No quantized model found for TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g

AutoGPTQ only supports .safetensors and .bin; I simply cannot pass other extensions, since it only takes the name without the extension and appends either .bin or .safetensors based on use_safetensors.
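
So all the loader can do is resolve a basename and hand it over, roughly like the sketch below; the from_quantized keyword names reflect AutoGPTQ's API as I understand it and may differ between versions:

# Sketch of the constraint described above: AutoGPTQ appends ".safetensors" or
# ".bin" itself, so only a basename can be passed, never a ".pt" file.
from pathlib import Path
from auto_gptq import AutoGPTQForCausalLM

def load_quantized_sketch(model_dir, use_triton=False):
    model_dir = Path(model_dir)
    found = list(model_dir.glob("*.safetensors")) or list(model_dir.glob("*.bin"))
    if not found:
        raise FileNotFoundError(f"No quantized model found for {model_dir.name}")
    return AutoGPTQForCausalLM.from_quantized(
        str(model_dir),
        model_basename=found[0].stem,                      # name without extension
        use_safetensors=found[0].suffix == ".safetensors",
        use_triton=use_triton,
    )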

@LaaZa
Contributor Author

LaaZa commented May 1, 2023

Now checking for quantize_config.json; if it exists, wbits does not need to be set manually.
The UI is not updated. I want some input on how this should be done; checking for the config happens after the UI is normally updated on model load.

I don't think it is related to anything in this commit, but certain models fail to load.
For whatever reason it happens with Pygmalion models, including the LLaMA-based pygmalion-7B, so it isn't about GPT-J.

ValueError: QuantLinear() does not have a parameter or a buffer named bias.
coming from auto_gptq/modeling/_base.py at line 540: model = accelerate.load_checkpoint_and_dispatch(...)

Example model gozfarb/pygmalion-7b-4bit-128g-cuda

@qwopqwop200

Created a PR to update AutoGPTQ to provide optimizations.
This is enabled automatically if act-order and groupsize are not used at the same time.
https://github.com/PanQiWei/AutoGPTQ/tree/faster-cuda-no-actorder

@LaaZa
Contributor Author

LaaZa commented May 1, 2023

Created a PR to update AutoGPTQ to provide optimizations.
This is enabled automatically if act-order and groupsize are not used at the same time.
https://github.com/PanQiWei/AutoGPTQ/tree/faster-cuda-no-actorder

Nice. But does that mean that act-order needs to be passed? It can't be automatically checked for?

@qwopqwop200

An automatic check seems very hard to implement.

@LaaZa
Contributor Author

LaaZa commented May 1, 2023

An automatic check seems very hard to implement.

We'll probably just have to do a best-effort check of the filename then. Do you have any ideas about the issue with the Pygmalion models?
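
Something like this, purely as a best-effort filename heuristic (hypothetical helper; quantize_config.json or an explicit setting should still take priority):

# Best-effort guess from the checkpoint filename. Note "no-act-order" must be
# checked first, because "act-order" is a substring of it.
def guess_desc_act(filename: str):
    name = filename.lower()
    if "no-act-order" in name or "no_act_order" in name:
        return False
    if "act-order" in name or "act_order" in name:
        return True
    return None  # unknown; fall back to a setting or config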

@qwopqwop200

1. Given the way AutoGPTQ currently loads models, it's not a good idea to detect this by the name of a file.
2. It seems that the model was made with an old version of GPTQ. You will need to do a fresh conversion of the model with the latest GPTQ.

@TheBloke
Contributor

TheBloke commented May 1, 2023

Nice. But does that mean that act-order needs to be passed? It can't be automatically checked for?

We can specify desc_act = true in quantize_config.json and I think your GPTQ loader code will need to check for this.

Example BaseQuantizeConfig() call:

    return BaseQuantizeConfig(
        bits=4,  # quantize model to 4-bit
        group_size=128,  # it is recommended to set the value to 128
        desc_act=True
    )

Associated quantize_config.json:

{
  "bits": 4,
  "desc_act": true,
  "group_size": 128
}

If there is going to be a difference in inference with/without desc_act, I think there probably needs to be a new command-line parameter and UI GPTQ parameter in ooba to specify whether the model uses desc_act, e.g. --desc_act, so that older models that don't have a quantize_config.json can still be loaded.
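
For example, a hypothetical fallback (not actual ooba or AutoGPTQ loader code) built on the BaseQuantizeConfig shown above:

# Hypothetical fallback: build the quantize config from CLI flags (including a
# proposed --desc_act) only when the folder has no quantize_config.json.
from pathlib import Path
from auto_gptq import BaseQuantizeConfig

def make_quantize_config(model_dir, wbits, groupsize, desc_act=False):
    if (Path(model_dir) / "quantize_config.json").exists():
        return None  # let AutoGPTQ read the file itself
    return BaseQuantizeConfig(
        bits=wbits,                                     # e.g. 4
        group_size=groupsize if groupsize > 0 else -1,  # -1 = no groupsize
        desc_act=desc_act,                              # from the suggested --desc_act flag
    )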

I will soon go back through all my GPTQ models on HF and add a quantize_config.json to every one. One issue is that in my older repos I have two model files per repo - one with desc_act (act-order), and one without. So I will have to move to using separate branches, one for the desc_act/act-order model, one for the no-desc_act/no-act-order model.

@qwopqwop200 how big of an improvement is using desc_act? Maybe I should not even bother making desc_act models any more if they are always going to have problems for users on CUDA?

So far I have always made two models: one with groupsize=128 + desc_act, and one with groupsize=128 and no desc_act. I thought that would give users the choice of the 'best' model or the 'compatible' model. But this is extra work and I think it adds more confusion for users.

@TheBloke
Contributor

TheBloke commented May 1, 2023

AutoGPTQ only supports .safetensors and .bin; I simply cannot pass other extensions, since it only takes the name without the extension and appends either .bin or .safetensors based on use_safetensors.

Oh yeah of course! OK. I guess I can go back through my older models and rename .pt files to .bin. In my later models I only use safetensors anyway.

Or maybe I could PR a change to AutoGPTQ to also check for .pt files.. I will ask PanQiWei what they think.

@TheBloke
Contributor

TheBloke commented May 1, 2023

@qwopqwop200 I think a separate fork would go against the idea of implementing AutoGPTQ as a universal solution. Would it be possible to implement those optimisations in the AutoGPTQ main branch?

I agree completely. I think AutoGPTQ should support everything. Otherwise it is really confusing for users.

Only --wbits should matter for whether it knows to load a GPTQ model. --model_type has no effect on AutoGPTQ; the model type is detected automatically and the flag is completely ignored by this loader. quantize_config.json will override any other settings. I'll look into updating the UI.

I didn't want to change the detection, just add this as an alternative for when it tries to load GPTQ-for-LLaMa. The check for quantize_config.json would have to happen in a different place. I'll look into whether I can change the loading criteria without changing the code too much.

Yeah I understand. Maybe it is beyond the scope of this PR. But I do think the UI should support the features I mention. With GPTQ-for-Llama, a text-gen-ui user had to either specify --wbits --groupsize --model_type command line params, or else fill in the "GPTQ Params" in the UI and then use "Save settings for this model" to save config to YAML. If they didn't do that, the GPTQ model would fail to load because the UI would think it was an HF format model.

But now with AutoGPTQ, we can avoid that extra work for users because we have quantize_config.json. So I think it will be very helpful to users if text-gen-ui can make use of this file to automatically load the right settings for GPTQ models in all scenarios.

But that could be done in a separate PR, after yours is merged. Or maybe ooba will do it himself.

@qwopqwop200

1. It's already implemented that way.
2. The current old CUDA kernel is written assuming that desc_act is not enabled, so inference won't work.
3. There is a very small additional improvement of 0.01 to 0.03 when using act-order. However, for a specific model (e.g. opt-66b) we saw a significant improvement of around 0.2.

@TheBloke
Contributor

TheBloke commented May 1, 2023

1. It's already implemented that way.

Thank you!

2. The current old CUDA kernel is written assuming that desc_act is not enabled, so inference won't work.

But desc_act does seem to work with the current AutoGPTQ CUDA code? Using the latest AutoGPTQ in CUDA mode I can run inference on models I created with --act-order in GPTQ-for-Llama, and it does work.

So is it only old versions of GPTQ-for-LLaMa CUDA that can't use desc_act/--act-order?

3. There is a very small additional improvement of 0.01 to 0.03 when using act-order. However, for a specific model (e.g. opt-66b) we saw a significant improvement of around 0.2.

OK thank you. So for Llama, maybe it would be easier not to keep making desc_act/--act-order models, if it is going to cause performance or compatibility problems for some users.

@qwopqwop200

qwopqwop200 commented May 1, 2023

But desc_act does seem to work with the current AutoGPTQ CUDA code? Using the latest AutoGPTQ in CUDA mode I can run inference on models I created with --act-order in GPTQ-for-Llama, and it does work.

So is it only old versions of GPTQ-for-LLaMa CUDA that can't use desc_act/--act-order?

Yes. And this optimization in auto-gptq is also obtained using the old CUDA kernel.

@Ph0rk0z
Contributor

Ph0rk0z commented May 1, 2023

Group size breaks 30B models on 24 GB of VRAM, so I liked act-order to smarten them up slightly.

You don't have to rename models; just create a symbolic link.
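
For example (using the filename from the listing above, just as an illustration):

# Expose the existing .pt checkpoint under the .bin name AutoGPTQ looks for.
import os
os.symlink("vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.pt",
           "vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.bin")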

@LaaZa
Contributor Author

LaaZa commented May 1, 2023

1. Given the way AutoGPTQ currently loads models, it's not a good idea to detect this by the name of a file.
2. It seems that the model was made with an old version of GPTQ. You will need to do a fresh conversion of the model with the latest GPTQ.

  1. What I meant was to check for no-act-order in the filename if it is not specified elsewhere (a setting or config). Does desc_act=False trigger the faster CUDA kernel?

  2. Weird if they are all using old versions. Is there any way to load them anyway? GPTQ-for-LLaMa does.

@LaaZa
Contributor Author

LaaZa commented May 12, 2023

@oobabooga now seriously, please explain why you closed this out of nowhere?

@Malrama

Malrama commented May 14, 2023

@oobabooga ? 😢

@LaaZa
Contributor Author

LaaZa commented May 14, 2023

I will keep an updated fork here, so if you want to use AutoGPTQ with textgen, you can use that fork.

@TheBloke
Contributor

Thanks LaaZa.

My guess is that ooba closed it by mistake. Clicked the wrong button or something. There's not been any purge of other open PRs. And he said he was definitely interested.

If we've not heard soon, maybe just open a new PR.

@LaaZa
Contributor Author

LaaZa commented May 14, 2023

Thanks LaaZa.

My guess is that ooba closed it by mistake. Clicked the wrong button or something. There's not been any purge of other open PRs. And he said he was definitely interested.

If we've not heard soon, maybe just open a new PR.

Something else is going on.

@TheBloke
Contributor

Based on...?

Repository owner locked as too heated and limited conversation to collaborators May 17, 2023
Repository owner deleted a comment from LaaZa May 17, 2023
Repository owner deleted a comment from MillionthOdin16 May 17, 2023
Repository owner deleted a comment from MillionthOdin16 May 17, 2023
Repository owner deleted a comment from LaaZa May 17, 2023
Repository owner deleted a comment from MillionthOdin16 May 17, 2023
@oobabooga
Owner

I closed this because the author decided it would be funny to mock and insult me in another PR. This kind of behavior is not accepted in this repository. Cut the drama and try to be constructive instead.

@oobabooga oobabooga reopened this Jun 2, 2023
@oobabooga
Owner

I needed code created here to load models without a quantize_config.json. To give @LaaZa due credit, I am merging the PR.

@oobabooga oobabooga merged commit 9c06660 into oobabooga:main Jun 2, 2023