
Conversation

@LaaZa
Contributor

@LaaZa LaaZa commented Apr 30, 2023

This is a quick implementation of PanQiWei/AutoGPTQ for inference.

This is an alternative to the current GPTQ-for-LLaMa integration, hopefully offering a more universally supported option that is not limited to a single platform such as Linux.

AutoGPTQ supports CUDA, Triton (on Linux), and CPU. Splitting with pre_layer is not supported.

#1263 implements another alternative for GPTQ, but because it relies on Triton it is not universal. This PR should still be compatible with it as an option.

Right now this requires a version of AutoGPTQ newer than the 0.0.5 release on PyPI, so building from source is required at the time of writing.

From my testing it appears to be slightly slower than the GPTQ-for-LLaMa Triton branch, and slower still with CUDA. I have not compared against the CUDA versions of GPTQ-for-LLaMa. It is probably slower than #1263 as well.

But AutoGPTQ is seeing rapid development and will likely improve in performance while maintaining compatibility, and I think this is the main benefit of this implementation.

Please give feedback; testing is appreciated.

@TheBloke
Contributor

Awesome! I strongly believe that AutoGPTQ is the way forward for GPTQ and yeah it's been seeing rapid progress recently.

I release a lot of GPTQs on HF and am really hoping that in the future it will be easier for users to use them. And I think AutoGPTQ is the right repo for the community to gather around for future development.

I will try your PR shortly with my GPTQs.

@Ph0rk0z
Contributor

Ph0rk0z commented Apr 30, 2023

Great that it's finally working. I did the same thing three days ago and kept getting errors loading the state_dict.

Anyone bench it yet vs ooba's gptq?

@LaaZa
Contributor Author

LaaZa commented Apr 30, 2023

Added support for offloading and multiple devices.
Uses --gpu-memory and --cpu-memory, and adds --autogptq-device-map to set Accelerate's device_map; if memory limits are specified, the device map defaults to 'auto' unless something else is set.

However, this needs more testing and I don't think it was working properly for me.
In my testing, VRAM was filled with the whole model regardless of what memory limits were specified, and RAM fluctuated during loading while VRAM was already at its usual level (not full, but normal usage for the model). Loading was also very slow.

To use these features, disable Triton (i.e. do not pass the --autogptq-triton flag).

Testing with multiple GPUs would be especially useful; for that, use the --gpu-memory flag, since the UI exposes only one device.
--auto-devices will use ooba's automatic memory settings, as with normal non-GPTQ model loading.

You will likely need protobuf==3.20.
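
For reference, this is roughly how such limits end up as the max_memory dict that Accelerate's device_map='auto' expects; the helper below is only a sketch with made-up names, not the PR's actual code:

# Sketch only: turning --gpu-memory / --cpu-memory style limits into the
# max_memory dict used by Accelerate when device_map="auto". The helper name
# and defaults are assumptions, not the PR's actual implementation.
def build_max_memory(gpu_memory=None, cpu_memory=None):
    max_memory = {}
    for i, limit in enumerate(gpu_memory or []):  # e.g. ["20GiB", "22GiB"]
        max_memory[i] = limit
    if cpu_memory:
        max_memory["cpu"] = cpu_memory            # e.g. "64GiB"
    return max_memory or None                     # None -> let 'auto' decide freely

# Example: two GPUs capped at 20 GiB each, with up to 64 GiB of CPU offload.
# build_max_memory(["20GiB", "20GiB"], "64GiB") -> {0: "20GiB", 1: "20GiB", "cpu": "64GiB"}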

@TheBloke
Contributor

OK I've done some testing. Firstly, thanks so much for getting this PR'd - it's awesome to see my models loading with AutoGPTQ!

My findings so far. I'll do some more testing tomorrow.

All testing with:

  • https://huggingface.co/TheBloke/wizardLM-7B-GPTQ
    • Bits = 4. Groupsize = 128. One model with desc_act, one without.
  • Ubuntu 20.04
  • NVidia 4090 24GB
  • CUDA 11.6
  • AutoGPTQ installed with pip install . as of e2c7cd4fb3765538569f903ca7f81563fce70c6e
  • LaaZa:AutoGPTQ as of 0fd4857646c56deb7b7a2b7df561e35e0a172d0b

AutoGPTQ testing

Triton

--model XYZ --wbits 4 --groupsize 128 --model_type llama --autogptq --autogptq-triton

  • act-order (desc_act = True) : WORKS. 11-12 token/s
  • no-act-order (desc_act = False) : WORKS. 11-12 token/s

For comparison, GPTQ-for-LLaMa Triton records 15-16 token/s on this same GPU + model.

CUDA

--model XYZ --wbits 4 --groupsize 128 --model_type llama --autogptq

text-generation-webui specific

When running server.py with --autogptq --autogptq-triton and no models yet downloaded

  • Download a model that has quantize_config.json defined then Refresh Model and choose model
    • UI throws error "no file named pytorch_model.bin" - it's trying to load it as a non-GPTQ model because no GPTQ params are defined. But there is a quantize_config.json so it could read this to know it's a GPTQ.
    • UI GPTQ params also should update to reflect the contents of quantize_config.json

When running server.py with --autogptq --autogptq-triton and one or more GPTQ models downloaded which have quantize_config.json

  • UI fails to load because it doesn't recognise the model as a GPTQ model:
root@cee7d387f502:~/textgen-auto# python server.py  --autogptq --autogptq-triton  --listen
Gradio HTTP request redirected to localhost :)
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda116.so
Loading act...
Traceback (most recent call last):
  File "/root/textgen-auto/server.py", line 914, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/root/textgen-auto/modules/models.py", line 84, in load_model
    model = LoaderClass.from_pretrained(Path(f"{shared.args.model_dir}/{model_name}"), low_cpu_mem_usage=True, torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16, trust_remote_code=trust_remote_code)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2405, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory models/act.

It should recognise the quantize_config.json and detect it as a GPTQ.

Otherwise the user must always specify --wbits --groupsize --model_type params in order to launch the server, even when quantize_config.json exists.
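
As a rough illustration of what I mean (hypothetical code, not the loader's actual implementation), the detection could prefer quantize_config.json and only fall back to --wbits:

# Hypothetical detection sketch: treat a folder as GPTQ if it ships a
# quantize_config.json, otherwise fall back to the user-supplied --wbits.
import json
from pathlib import Path

def detect_gptq(model_dir, wbits=0):
    cfg = Path(model_dir) / "quantize_config.json"
    if cfg.exists():
        # e.g. {"bits": 4, "group_size": 128, "desc_act": true}
        return True, json.loads(cfg.read_text())
    if wbits > 0:
        return True, {"bits": wbits}  # old behaviour: rely on CLI params
    return False, None                # load as a regular HF model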

Next steps

I'll do some more testing tomorrow, eg with GPU splitting, offloading, etc.

@qwopqwop200

https://github.com/qwopqwop200/AutoGPTQ-no-act-order
Here's a fork of auto-gptq that takes the existing old CUDA kernel and provides faster speeds.
You'll most likely get a faster speed with it.
However, this version does not support using act-order and groupsize at the same time.

@TheBloke
Contributor

TheBloke commented Apr 30, 2023

Just tried loading one of my models which doesn't use safetensors: https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g/tree/main

And I get: FileNotFoundError: No quantized model found for TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g

Folder structure:

root@cee7d387f502:/workspace/TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g# ll
total 3809388
drwxrwxrwx  2 root root    3000362 Apr 30 23:45 ./
drwxrwxrwx 34 root root    3066244 Apr 30 23:28 ../
-rw-rw-rw-  1 root root        581 Apr 28 08:00 config.json
-rw-rw-rw-  1 root root        136 Apr 26 16:21 generation_config.json
-rw-rw-rw-  1 root root        411 Apr 26 16:21 special_tokens_map.json
-rw-rw-rw-  1 root root     499723 Apr 26 16:21 tokenizer.model
-rw-rw-rw-  1 root root        700 Apr 28 10:16 tokenizer_config.json
-rw-rw-rw-  1 root root 3894242469 Apr 12 23:08 vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.pt
root@cee7d387f502:/workspace/TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g#

Command:

python server.py --model TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g  --autogptq --autogptq-triton  --listen --wbits 4 --groupsize 128 --model_type llama

Error:

Gradio HTTP request redirected to localhost :)
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda116.so
Loading TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g...
Traceback (most recent call last):
  File "/root/textgen-auto/server.py", line 914, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/root/textgen-auto/modules/models.py", line 158, in load_model
    model = load_quantized(model_name)
  File "/root/textgen-auto/modules/AutoGPTQ_loader.py", line 46, in load_quantized
    raise FileNotFoundError(f'No quantized model found for {model_name}')
FileNotFoundError: No quantized model found for TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g
root@cee7d387f502:~/textgen-auto#

I tried creating a quantize_config.json as well but that didn't help.

@qwopqwop200

Additionally, the speed difference between AutoGPTQ and GPTQ-for-LLaMa on CUDA is due to fused-attn. Likewise, in the case of Triton, it occurs because of fused-attn and fused-mlp.
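
For anyone unfamiliar, here is a toy PyTorch illustration of what 'fused' attention means; it is conceptual only, not AutoGPTQ or GPTQ-for-LLaMa code:

# Conceptual only: "fusing" attention replaces separate Q/K/V projections with
# one larger matmul, which is where much of the kernel-level speedup comes from.
import torch
import torch.nn as nn

hidden = 512
x = torch.randn(1, 8, hidden)

# Unfused: three separate matmuls per attention block.
q_proj, k_proj, v_proj = (nn.Linear(hidden, hidden, bias=False) for _ in range(3))
q, k, v = q_proj(x), k_proj(x), v_proj(x)

# Fused: one matmul producing Q, K and V together.
qkv_proj = nn.Linear(hidden, 3 * hidden, bias=False)
q, k, v = qkv_proj(x).chunk(3, dim=-1)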

@LaaZa
Contributor Author

LaaZa commented May 1, 2023

Thank you for your reports.

Only --wbits should matter for whether it knows to load a GPTQ model.
--model_type has no effect on AutoGPTQ; the model type is detected automatically and the flag is completely ignored by this loader.
quantize_config.json will override any other settings. I'll look into updating the UI.

I didn't want to change the detection, just add this as an alternative for when it tries to load GPTQ-for-LLaMa. The check for quantize_config.json would have to happen in a different place. I'll look into whether I can change the loading criteria without changing the code too much.

@qwopqwop200 I think a separate fork would go against the idea of implementing AutoGPTQ as a universal solution. Would it be possible to implement those optimisations in the AutoGPTQ main branch?

@qwopqwop200

I think it will be possible. I'll try later.

@LaaZa
Contributor Author

LaaZa commented May 1, 2023

Just tried loading one of my models which doesn't use safetensors: https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g/tree/main

And I get: FileNotFoundError: No quantized model found for TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g

AutoGPTQ only supports .safetensors and .bin; I simply cannot pass other extensions, since it only takes the name without the extension and appends either .bin or .safetensors based on use_safetensors.
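
So all the loader can do is resolve a basename and hand it over, roughly like the sketch below; the from_quantized keyword names reflect AutoGPTQ's API as I understand it and may differ between versions:

# Sketch of the constraint described above: AutoGPTQ appends ".safetensors" or
# ".bin" itself, so only a basename can be passed, never a ".pt" file.
from pathlib import Path
from auto_gptq import AutoGPTQForCausalLM

def load_quantized_sketch(model_dir, use_triton=False):
    model_dir = Path(model_dir)
    found = list(model_dir.glob("*.safetensors")) or list(model_dir.glob("*.bin"))
    if not found:
        raise FileNotFoundError(f"No quantized model found for {model_dir.name}")
    return AutoGPTQForCausalLM.from_quantized(
        str(model_dir),
        model_basename=found[0].stem,                      # name without extension
        use_safetensors=found[0].suffix == ".safetensors",
        use_triton=use_triton,
    )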

@LaaZa
Contributor Author

LaaZa commented May 1, 2023

Now checking for quantize_config.json; if it exists, wbits does not need to be set manually.
The UI is not updated. I want some input on how this should be done; checking for the config happens after the UI is normally updated on model load.

I don't think it is related to anything in this commit, but certain models fail to load.
For whatever reason it happens with Pygmalion models, including the LLaMA-based pygmalion-7B, so it isn't about GPT-J.

ValueError: QuantLinear() does not have a parameter or a buffer named bias.
coming from auto_gptq/modeling/_base.py at line 540: model = accelerate.load_checkpoint_and_dispatch(...)

Example model gozfarb/pygmalion-7b-4bit-128g-cuda

@qwopqwop200

Created a PR to update AutoGPTQ to provide optimizations.
This is enabled automatically if act-order and groupsize are not used at the same time.
https://github.com/PanQiWei/AutoGPTQ/tree/faster-cuda-no-actorder

@LaaZa
Contributor Author

LaaZa commented May 1, 2023

Created a PR to update AutoGPTQ to provide optimizations.
This is enabled automatically if act-order and groupsize are not used at the same time.
https://github.com/PanQiWei/AutoGPTQ/tree/faster-cuda-no-actorder

Nice. But does that mean that act-order needs to be passed? It can't be automatically checked for?

@qwopqwop200

An automatic check seems very hard to implement.

@LaaZa
Contributor Author

LaaZa commented May 1, 2023

An automatic check seems very hard to implement.

We'll probably just have to do a best-effort check of the filename then. Do you have any ideas about the issue with the Pygmalion models?
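
Something like this, purely as a best-effort filename heuristic (hypothetical helper; quantize_config.json or an explicit setting should still take priority):

# Best-effort guess from the checkpoint filename. Note "no-act-order" must be
# checked first, because "act-order" is a substring of it.
def guess_desc_act(filename: str):
    name = filename.lower()
    if "no-act-order" in name or "no_act_order" in name:
        return False
    if "act-order" in name or "act_order" in name:
        return True
    return None  # unknown; fall back to a setting or config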

@qwopqwop200

1. Given the way AutoGPTQ currently loads models, it's not a good idea to detect this by the name of a file.
2. It seems that the model was made with an old version of GPTQ. You will need to do a fresh conversion of the model with the latest GPTQ.

@TheBloke
Contributor

TheBloke commented May 1, 2023

Nice. But does that mean that act-order needs to be passed? It can't be automatically checked for?

We can specify desc_act = true in quantize_config.json and I think your GPTQ loader code will need to check for this.

Example BaseQuantizeConfig() call:

    return BaseQuantizeConfig(
        bits=4,  # quantize model to 4-bit
        group_size=128,  # it is recommended to set the value to 128
        desc_act=True
    )

Associated quantize_config.json:

{
  "bits": 4,
  "desc_act": true,
  "group_size": 128
}

If there is going to be a difference in inference with/without desc_act, I think there probably needs to be a new command-line parameter and UI GPTQ parameter in ooba to specify whether the model uses desc_act, e.g. --desc_act, so that older models that don't have a quantize_config.json can still be loaded.
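
For example, a hypothetical fallback (not actual ooba or AutoGPTQ loader code) built on the BaseQuantizeConfig shown above:

# Hypothetical fallback: build the quantize config from CLI flags (including a
# proposed --desc_act) only when the folder has no quantize_config.json.
from pathlib import Path
from auto_gptq import BaseQuantizeConfig

def make_quantize_config(model_dir, wbits, groupsize, desc_act=False):
    if (Path(model_dir) / "quantize_config.json").exists():
        return None  # let AutoGPTQ read the file itself
    return BaseQuantizeConfig(
        bits=wbits,                                     # e.g. 4
        group_size=groupsize if groupsize > 0 else -1,  # -1 = no groupsize
        desc_act=desc_act,                              # from the suggested --desc_act flag
    )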

I will soon go back through all my GPTQ models on HF and add a quantize_config.json to every one. One issue is that in my older repos I have two model files per repo - one with desc_act (act-order), and one without. So I will have to move to using separate branches, one for the desc_act/act-order model, one for the no-desc_act/no-act-order model.

@qwopqwop200 how big of an improvement is using desc_act? Maybe I should not even bother making desc_act models any more if they are always going to have problems for users on CUDA?

So far I have always made two models: one with groupsize=128 + desc_act, and one with groupsize=128 and no desc_act. I thought that would give users the choice of the 'best' model or the 'compatible' model. But this is extra work and I think it adds more confusion for users.

@TheBloke
Contributor

TheBloke commented May 1, 2023

AutoGPTQ only supports .safetensors and .bin; I simply cannot pass other extensions, since it only takes the name without the extension and appends either .bin or .safetensors based on use_safetensors.

Oh yeah of course! OK. I guess I can go back through my older models and rename .pt files to .bin. In my later models I only use safetensors anyway.

Or maybe I could PR a change to AutoGPTQ to also check for .pt files.. I will ask PanQiWei what they think.

@TheBloke
Contributor

TheBloke commented May 1, 2023

@qwopqwop200 I think a separate fork would go against the idea of implementing AutoGPTQ as a universal solution. Would it be possible to implement those optimisations in the AutoGPTQ main branch?

I agree completely. I think AutoGPTQ should support everything. Otherwise it is really confusing for users.

Only --wbits should matter for whether it knows to load a GPTQ model. --model_type has no effect on AutoGPTQ; the model type is detected automatically and the flag is completely ignored by this loader. quantize_config.json will override any other settings. I'll look into updating the UI.

I didn't want to change the detection, just add this as an alternative for when it tries to load GPTQ-for-LLaMa. The check for quantize_config.json would have to happen in a different place. I'll look into whether I can change the loading criteria without changing the code too much.

Yeah I understand. Maybe it is beyond the scope of this PR. But I do think the UI should support the features I mention. With GPTQ-for-Llama, a text-gen-ui user had to either specify --wbits --groupsize --model_type command line params, or else fill in the "GPTQ Params" in the UI and then use "Save settings for this model" to save config to YAML. If they didn't do that, the GPTQ model would fail to load because the UI would think it was an HF format model.

But now with AutoGPTQ, we can avoid that extra work for users because we have quantize_config.json. So I think it will be very helpful to users if text-gen-ui can make use of this file to automatically load the right settings for GPTQ models in all scenarios.

But that could be done in a separate PR, after yours is merged. Or maybe ooba will do it himself.

@qwopqwop200

1. It's already implemented that way.
2. The current old CUDA kernel is written assuming that desc_act is not enabled, so inference won't work.
3. There is a very small additional improvement of 0.01 to 0.03 when using act-order. However, for a specific model (e.g. opt-66b) we saw a significant improvement of around 0.2.

@TheBloke
Contributor

TheBloke commented May 1, 2023

1. It's already implemented that way.

Thank you!

2. The current old CUDA kernel is written assuming that desc_act is not enabled, so inference won't work.

But desc_act does seem to work with the current AutoGPTQ CUDA code? Using the latest AutoGPTQ in CUDA mode I can run inference on models I created with --act-order in GPTQ-for-Llama, and it does work.

So is it only old versions of GPTQ-for-LLaMa CUDA that can't use desc_act/--act-order?

3. There is a very small additional improvement of 0.01 to 0.03 when using act-order. However, for a specific model (e.g. opt-66b) we saw a significant improvement of around 0.2.

OK thank you. So for Llama, maybe it would be easier not to keep making desc_act/--act-order models, if it is going to cause performance or compatibility problems for some users.

@qwopqwop200

qwopqwop200 commented May 1, 2023

But desc_act does seem to work with the current AutoGPTQ CUDA code? Using the latest AutoGPTQ in CUDA mode I can run inference on models I created with --act-order in GPTQ-for-Llama, and it does work.

So is it only old versions of GPTQ-for-LLaMa CUDA that can't use desc_act/--act-order?

Yes. And this optimization in auto-gptq is also obtained using the old CUDA kernel.

@Ph0rk0z
Contributor

Ph0rk0z commented May 1, 2023

Group size breaks 30B models on 24 GB of VRAM, so I liked act-order to smarten them up slightly.

You don't have to rename models; just create a symbolic link.
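
For example (using the filename from the listing above, just as an illustration):

# Expose the existing .pt checkpoint under the .bin name AutoGPTQ looks for.
import os
os.symlink("vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.pt",
           "vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.bin")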

@LaaZa
Contributor Author

LaaZa commented May 1, 2023

1. Given the way AutoGPTQ currently loads models, it's not a good idea to detect this by the name of a file.
2. It seems that the model was made with an old version of GPTQ. You will need to do a fresh conversion of the model with the latest GPTQ.

  1. What I meant was to check for no-act-order in the filename if it is not specified elsewhere (a setting or config). Does desc_act=False trigger the faster CUDA kernel?

  2. Weird if they are all using old versions. Is there any way to load them anyway? GPTQ-for-LLaMa does.

@LaaZa
Contributor Author

LaaZa commented May 12, 2023

@oobabooga now seriously, please explain why you closed this out of nowhere?

@Malrama

Malrama commented May 14, 2023

@oobabooga ? 😢

@LaaZa
Contributor Author

LaaZa commented May 14, 2023

I will keep an updated fork here, so if you want to use AutoGPTQ with textgen, you can use that fork.

@TheBloke
Contributor

Thanks LaaZa.

My guess is that ooba closed it by mistake. Clicked the wrong button or something. There's not been any purge of other open PRs. And he said he was definitely interested.

If we've not heard soon, maybe just open a new PR.

@LaaZa
Contributor Author

LaaZa commented May 14, 2023

Thanks LaaZa.

My guess is that ooba closed it by mistake. Clicked the wrong button or something. There's not been any purge of other open PRs. And he said he was definitely interested.

If we've not heard soon, maybe just open a new PR.

Something else is going on.

@TheBloke
Contributor

Based on...?

Repository owner locked as too heated and limited conversation to collaborators May 17, 2023
Repository owner deleted a comment from LaaZa May 17, 2023
Repository owner deleted a comment from MillionthOdin16 May 17, 2023
Repository owner deleted a comment from MillionthOdin16 May 17, 2023
Repository owner deleted a comment from LaaZa May 17, 2023
Repository owner deleted a comment from MillionthOdin16 May 17, 2023
@oobabooga
Owner

I closed this because the author decided it would be funny to mock and insult me in another PR. This kind of behavior is not accepted in this repository. Cut the drama and try to be constructive instead.

@oobabooga oobabooga reopened this Jun 2, 2023
@oobabooga
Owner

I needed code created here to load models without a quantize_config.json. To give @LaaZa due credit, I am merging the PR.

@oobabooga oobabooga merged commit 9c06660 into oobabooga:main Jun 2, 2023