Adding IQ4_KSS: 4.0 bpw quants #89
Conversation
So we can run perplexity. Sadly, the result does not look good on the bpw vs quantization error plot.
TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B. In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks. I.e., the reduced model size more than offsets the additional bit fiddling required for iq4_kss.
Did not lose performance on CUDA. Zen4 is decent, but not great: PP-512(LLaMA-3.1-8B) = 163 t/s. TG-128 is of course better than other 4-bit quants due to smaller model size. We get 14.5 t/s @ 8 threads.
PP is not too bad - just 10% slower than q4_0. But TG is 30% slower, i.e., predictably bad.
45.75 t/s -> 48.75 t/s. Still 22% slower than q4_0
Bad, but better than I expected. PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X. I.e., with 32 AVX2 threads we get the performance of 16 Zen4 threads.
48.7 t/s -> 49.3 t/s
Hey IK, congratulations and thank you. Now I'm going to try to make all of this work, because I ideally don't want to ever touch 3-bit quants again (except for attn_q.weight :P). I'll report my progress. :D Ok, I compiled it in debug thanks to @saood06 yesterday, and finally with CUDA today. Now I can work with this! I quickly tested PPL512 on Sheared Llama 2.7b, normal ftypes (pure, but Q6_K output and with imatrix). FP16 = 7.3507 |
The new IQ4_KSS quant is really SOTA imo, and thank you very much. You're rocking the place, as usual.

Now, I see that IQ3_K is at 3.43 bpw and is close to IQ3_S, which was itself a bit better in its first version than in its second one back when you launched it on the official repo. Is there room for IQ3_K to improve? I already have what I need for my own use, but would you be willing to crack an IQ3_KM at 3.65-3.75 bpw, midrange between IQ3_K and IQ4_KSS? There might be a sweet spot for your maths around there, well below the usual IQ "line".

Also, I observed how Exllama v2 quantizes. Turboderp's tool calculates something akin to what quantize stats does in order to decide, within a broad quant strategy, which tensor to quantize at which bpw, am I correct? With an IQ3_KM and an IQ3_KSS, you might be able to drop the bpw a bit (attn_q-wise and ffn_gate-wise) for the quant strategies in the 3 to 4.5 bpw bracket. Of course, the logic applies to the whole range, but that's work I'm only able to suggest, not do myself lol. Then, if you were willing to code an automatic quantization system akin to Exllama v2's, but maybe more rigorous about the skeleton "ftype" strategy employed (due to the knowledge gained in all the experimentation with FTYPES), with an automatic upscale or downscale (compared to the skeleton "ftype" strategy) of a given tensor's GGML type according to its error rate, then the process of devising quant strategies would be greatly helped, and the FTYPES could be SOTA as well, on top of your SOTA GGML_TYPES.

On my side, I'm seriously considering rebasing my KoboldCPP fork on your LlamaCPP clone, to offer the benefit of your quants to myself and others in daily use. More practically, I tested your IQ6_K quant of Nemo 12b on ST/llama-server, and it indeed feels very much like a Q8_0. |
@Nexesenex has been asking for a 4.0 bpw quantization here and in llama.cpp. Well, here it is.

It uses the same non-linear grid as `IQ4_K` and `IQ4_KS`. Compared to `IQ4_KS`, we save 0.25 bpw by enforcing that the number of set bits in a group of 4 quants is even (i.e., we need 15 bits for 4 quants, so 3.75 bpw). Combined with 7+1 bits per block of 32 weights (7 bits for the scale + 1 bit indicating if there is a grid shift), we arrive at exactly 4.0 bpw (well, there is also one float per tensor row, but that is < 0.01 bpw for 7B+ parameter models, so negligible).

The best way I was able to come up with for packing the bits is to combine the 15 bits needed for the quants with the one extra bit per group of 4, needed for the block scale/grid shift, into a 16-bit unsigned integer. If prepared appropriately, the 15 quant bits can be converted to 16 bits for easier unpacking by just using `v ^ (v >> 1)`, where `v` contains the 15 bits shifted 1 bit to the left.
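As an illustration of why this works (a minimal sketch with made-up helper names, not the actual `IQ4_KSS` code): when the 16 quant bits have an even number of set bits, their suffix-XOR transform always has a zero low bit, so only 15 bits need to be stored, and `v ^ (v >> 1)` undoes the transform.

```c
#include <assert.h>
#include <stdint.h>

// Bit budget restated from the description: 8 groups of 4 quants per block of
// 32 weights (8 * 15 = 120 bits) + 7-bit scale + 1-bit grid shift = 128 bits = 4.0 bpw.

// Encode: suffix-XOR transform of the 16 quant bits (4 quants, 4 bits each).
// With an even number of set bits, the lowest bit of the transform is always 0,
// so only the upper 15 bits need to be stored.
static uint16_t pack15(uint16_t x) {
    uint16_t v = 0, acc = 0;
    for (int i = 15; i >= 0; --i) {
        acc ^= (x >> i) & 1;      // XOR of bits 15..i of x
        v   |= acc << i;          // v_i = x_i ^ x_{i+1} ^ ... ^ x_15
    }
    assert((v & 1) == 0);         // guaranteed by the even-parity constraint
    return v >> 1;                // the 15 stored bits
}

// Decode: put the 15 bits one position to the left and apply v ^ (v >> 1).
static uint16_t unpack15(uint16_t stored) {
    uint16_t v = stored << 1;
    return v ^ (v >> 1);          // recovers the original 16 quant bits
}

int main(void) {
    uint16_t x = 0x8001;          // bits 15 and 0 set: even parity
    assert(unpack15(pack15(x)) == x);
    return 0;
}
```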
Assembling the scale from single bits stored in the `uint16_t` packed data is computationally more costly. My RTX-4080 GPU handles it gracefully, without noticeable impact on inference performance. Zen4 is also mostly OK, as one can use the `_mm512_cmpeq_epi16_mask` instruction to pack the scale/shift bits back together.
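As a sketch of the general technique (not the actual Zen4 kernel, and which bit of the packed word holds the scale/shift bit is an assumption here), a single `AVX512BW` compare can collect one bit from each of 32 16-bit lanes into a 32-bit mask:

```c
#include <immintrin.h>
#include <stdint.h>

// Collect the low bit of 32 packed uint16_t values into one 32-bit mask with a
// single AVX512BW compare (compile with -mavx512bw). Illustrative only: the
// position of the scale/shift bit inside the packed word is assumed to be bit 0.
static inline uint32_t gather_low_bits(const uint16_t *packed /* 32 values */) {
    __m512i v    = _mm512_loadu_si512((const void *)packed);
    __m512i ones = _mm512_set1_epi16(1);
    __m512i low  = _mm512_and_si512(v, ones);             // isolate bit 0 of each lane
    return (uint32_t)_mm512_cmpeq_epi16_mask(low, ones);  // one mask bit per lane
}
```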
But on `AVX2`, `ARM_NEON`, and `Metal`, performance is noticeably lower compared to, say, `IQ4_KS`.
My initial idea for implementing the quantization function was to simply first quantize to `IQ4_KS`, and then prune to `IQ4_KSS` by flipping one bit per group of 4 (if the number of set bits is odd), where the bit to be flipped is selected to minimize the difference to the original model weights. This kind of worked, but the resulting quantization error was higher than I was hoping for, so I ended up writing a dedicated `IQ4_KSS` quantization method, where enforcing an even number of set bits per group of 4 is incorporated into the block scale search. This makes quantization significantly slower than `IQ4_KS` (e.g., about 113 seconds vs 51 seconds for `IQ4_KS` to quantize a 7B parameter model on a Ryzen-7950X CPU).
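For reference, a rough sketch of that initial pruning idea (the grid values below are placeholders rather than the real `IQ4_K`/`IQ4_KS` grid, and imatrix importance weights are omitted):

```c
#include <float.h>
#include <stdint.h>

// Placeholder non-linear grid: NOT the actual IQ4 grid values.
static const float kGrid[16] = {
    -63.f, -52.f, -42.f, -33.f, -25.f, -18.f, -11.f, -5.f,
      1.f,   7.f,  13.f,  20.f,  27.f,  35.f,  44.f,  56.f
};

// If the 4 quant indices q[0..3] have an odd total number of set bits, flip the
// single bit that least increases the squared error against the original weights
// x[0..3]; d is the block scale (dequantized value = d * kGrid[q]).
static void enforce_even_parity(uint8_t q[4], const float x[4], float d) {
    int parity = 0;
    for (int i = 0; i < 4; ++i) parity ^= __builtin_popcount(q[i]) & 1;  // GCC/Clang builtin
    if (!parity) return;                                  // already even, nothing to do

    float best_delta = FLT_MAX; int best_i = 0; uint8_t best_q = q[0];
    for (int i = 0; i < 4; ++i) {
        float e0   = d * kGrid[q[i]] - x[i];
        float base = e0 * e0;                             // current error of this quant
        for (int b = 0; b < 4; ++b) {                     // 4 candidate bit flips per quant
            uint8_t cand  = q[i] ^ (uint8_t)(1 << b);
            float   e1    = d * kGrid[cand] - x[i];
            float   delta = e1 * e1 - base;               // change in total block error
            if (delta < best_delta) { best_delta = delta; best_i = i; best_q = cand; }
        }
    }
    q[best_i] = best_q;                                   // apply the cheapest flip
}
```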
In terms of quantization accuracy, these new quants mostly end up where one would expect them to be from the bpw vs quantization error curve established by other iqk-quants.

The first graph is for LLaMA-3.1-8B Instruct. As I had recently done these calculations to compare with VPTQ, the new quantization approach from the Microsoft team claiming to be SOTA, the token embedding and output tensor are left as `fp16`, and the bpw only includes the tensors from the repeating layers. I have added labels to the 4-bit quants for easier disambiguation.

In all following graphs the token embedding and output tensors are quantized, and the bpw is for the total model (i.e., total number of bits, including embedding and output tensors, divided by the total number of model parameters).