Adding IQ4_KSS: 4.0 bpw quants #89
Conversation
So we can run perplexity. Sadly, the result does not look good on the bpw vs quantization error plot.
TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B. In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks. I.e., the reduced model size more than offsets the additional bit fiddling required for iq4_kss.
Did not lose performance on CUDA. Zen4 is decent, but not great: PP-512(LLaMA-3.1-8B) = 163 t/s. TG-128 is of course better than other 4-bit quants due to smaller model size. We get 14.5 t/s @ 8 threads.
PP is not too bad - just 10% slower than q4_0. But TG is 30% slower, i.e., predictably bad.
45.75 t/s -> 48.75 t/s. Still 22% slower than q4_0
Bad, but better than I expected. PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X. I.e., with 32 AVX2 threads we get the performance of 16 Zen4 threads.
48.7 t/s -> 49.3 t/s
Hey IK, congratulations and thank you. Now I'm going to try to make all of this work, because I ideally don't want to ever touch 3-bit quants again (except for attn_q.weight :P). I'll report my progress. :D Ok, I compiled it in debug thanks to @saood06 yesterday, and finally with CUDA today. Now I can work with this! I quickly tested PPL512 on Sheared Llama 2.7b, normal ftypes (pure, but Q6_K output and with imatrix). FP16 = 7.3507 |
The new IQ4_KSS quant is really SOTA imo, and thank you very much. You're rocking the place, as usual.

Now, I see that IQ3_K is at 3.43 bpw and is close to IQ3_S, which was itself a bit better in its first version than in its second one back when you launched it on the official repo. Is there room for IQ3_K to improve? I already have what I need for my own use, but would you be willing to crack an IQ3_KM at 3.65-3.75 bpw, midrange between IQ3_K and IQ4_KSS? There might be a sweet spot for your maths around there, well below the usual IQ "line".

Also, I observed how Exllama v2 quantizes. Turboderp's tool calculates something akin to what quantize stats does in order to decide, within a broad quant strategy, which tensor to quantize at which bpw, am I correct? With an IQ3_KM and an IQ3_KSS, you might be able to drop the bpw a bit (attn_q-wise and ffn_gate-wise) for the quant strategies in the 3 to 4.5 bpw bracket. Of course, the logic applies to the whole range, but that's work I'm only able to suggest, not do myself lol. Then, if you were willing to code an automatic quantization system akin to Exllama v2's, but maybe more rigorous about the skeleton "ftype" strategy employed (due to the knowledge gained in all the experimentation with FTYPES), with an automatic upscale or downscale (compared to the skeleton "ftype" strategy) of a given tensor's GGML type according to its error rate, then the process of devising quant strategies would be greatly helped, and the FTYPES could be SOTA as well, on top of your SOTA GGML_TYPES.

On my side, I'm seriously considering rebasing my KoboldCPP fork on your LlamaCPP clone, to offer the benefit of your quants to myself and others in daily use. More practically, I tested your IQ6_K quant of Nemo 12b on ST/llama-server, and it indeed feels very much like a Q8_0. |
@Nexesenex has been asking for a 4.0 bpw quantization here and in llama.cpp. Well, here it is.

It uses the same non-linear grid as `IQ4_K` and `IQ4_KS`. Compared to `IQ4_KS`, we save 0.25 bpw by enforcing that the number of set bits in a group of 4 quants is even (i.e., we need 15 bits for 4 quants, so 3.75 bpw). Combined with 7+1 bits per block of 32 weights (7 bits for the scale + 1 bit indicating if there is a grid shift), we arrive at exactly 4.0 bpw (well, there is also one float per tensor row, but that is < 0.01 bpw for 7B+ parameter models, so negligible).

The best way I was able to come up with for packing the bits is to combine the 15 bits needed for the quants with the one extra bit per group of 4, needed for the block scale/grid shift, into a 16-bit unsigned integer. If prepared appropriately, the 15 quant bits can be converted to 16 bits for easier unpacking by just using `v ^ (v >> 1)`, where `v` contains the 15 bits shifted 1 bit to the left.
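As an illustration of why this works (a minimal sketch with made-up helper names, not the actual `IQ4_KSS` code): when the 16 quant bits have an even number of set bits, their suffix-XOR transform always has a zero low bit, so only 15 bits need to be stored, and `v ^ (v >> 1)` undoes the transform.

```c
#include <assert.h>
#include <stdint.h>

// Bit budget restated from the description: 8 groups of 4 quants per block of
// 32 weights (8 * 15 = 120 bits) + 7-bit scale + 1-bit grid shift = 128 bits = 4.0 bpw.

// Encode: suffix-XOR transform of the 16 quant bits (4 quants, 4 bits each).
// With an even number of set bits, the lowest bit of the transform is always 0,
// so only the upper 15 bits need to be stored.
static uint16_t pack15(uint16_t x) {
    uint16_t v = 0, acc = 0;
    for (int i = 15; i >= 0; --i) {
        acc ^= (x >> i) & 1;      // XOR of bits 15..i of x
        v   |= acc << i;          // v_i = x_i ^ x_{i+1} ^ ... ^ x_15
    }
    assert((v & 1) == 0);         // guaranteed by the even-parity constraint
    return v >> 1;                // the 15 stored bits
}

// Decode: put the 15 bits one position to the left and apply v ^ (v >> 1).
static uint16_t unpack15(uint16_t stored) {
    uint16_t v = stored << 1;
    return v ^ (v >> 1);          // recovers the original 16 quant bits
}

int main(void) {
    uint16_t x = 0x8001;          // bits 15 and 0 set: even parity
    assert(unpack15(pack15(x)) == x);
    return 0;
}
```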
Assembling the scale from single bits stored in the `uint16_t` packed data is computationally more costly. My RTX-4080 GPU handles it gracefully, without noticeable impact on inference performance. Zen4 is also mostly OK, as one can use the `_mm512_cmpeq_epi16_mask` instruction to pack the scale/shift bits back together.
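As a sketch of the general technique (not the actual Zen4 kernel, and which bit of the packed word holds the scale/shift bit is an assumption here), a single `AVX512BW` compare can collect one bit from each of 32 16-bit lanes into a 32-bit mask:

```c
#include <immintrin.h>
#include <stdint.h>

// Collect the low bit of 32 packed uint16_t values into one 32-bit mask with a
// single AVX512BW compare (compile with -mavx512bw). Illustrative only: the
// position of the scale/shift bit inside the packed word is assumed to be bit 0.
static inline uint32_t gather_low_bits(const uint16_t *packed /* 32 values */) {
    __m512i v    = _mm512_loadu_si512((const void *)packed);
    __m512i ones = _mm512_set1_epi16(1);
    __m512i low  = _mm512_and_si512(v, ones);             // isolate bit 0 of each lane
    return (uint32_t)_mm512_cmpeq_epi16_mask(low, ones);  // one mask bit per lane
}
```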
But on `AVX2`, `ARM_NEON`, and `Metal`, performance is noticeably lower compared to, say, `IQ4_KS`.
My initial idea for implementing the quantization function was to simply first quantize to `IQ4_KS`, and then prune to `IQ4_KSS` by flipping one bit per group of 4 (if the number of set bits is odd), where the bit to be flipped is selected to minimize the difference to the original model weights. This kind of worked, but the resulting quantization error was higher than I was hoping for, so I ended up writing a dedicated `IQ4_KSS` quantization method, where enforcing an even number of set bits per group of 4 is incorporated into the block scale search. This makes quantization significantly slower than `IQ4_KS` (e.g., about 113 seconds vs 51 seconds for `IQ4_KS` to quantize a 7B parameter model on a Ryzen-7950X CPU).
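For reference, a rough sketch of that initial pruning idea (the grid values below are placeholders rather than the real `IQ4_K`/`IQ4_KS` grid, and imatrix importance weights are omitted):

```c
#include <float.h>
#include <stdint.h>

// Placeholder non-linear grid: NOT the actual IQ4 grid values.
static const float kGrid[16] = {
    -63.f, -52.f, -42.f, -33.f, -25.f, -18.f, -11.f, -5.f,
      1.f,   7.f,  13.f,  20.f,  27.f,  35.f,  44.f,  56.f
};

// If the 4 quant indices q[0..3] have an odd total number of set bits, flip the
// single bit that least increases the squared error against the original weights
// x[0..3]; d is the block scale (dequantized value = d * kGrid[q]).
static void enforce_even_parity(uint8_t q[4], const float x[4], float d) {
    int parity = 0;
    for (int i = 0; i < 4; ++i) parity ^= __builtin_popcount(q[i]) & 1;  // GCC/Clang builtin
    if (!parity) return;                                  // already even, nothing to do

    float best_delta = FLT_MAX; int best_i = 0; uint8_t best_q = q[0];
    for (int i = 0; i < 4; ++i) {
        float e0   = d * kGrid[q[i]] - x[i];
        float base = e0 * e0;                             // current error of this quant
        for (int b = 0; b < 4; ++b) {                     // 4 candidate bit flips per quant
            uint8_t cand  = q[i] ^ (uint8_t)(1 << b);
            float   e1    = d * kGrid[cand] - x[i];
            float   delta = e1 * e1 - base;               // change in total block error
            if (delta < best_delta) { best_delta = delta; best_i = i; best_q = cand; }
        }
    }
    q[best_i] = best_q;                                   // apply the cheapest flip
}
```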
In terms of quantization accuracy, these new quants mostly end up where one would expect them to be from the bpw vs quantization error curve established by other iqk-quants.

The first graph is for LLaMA-3.1-8B Instruct. As I had recently done these calculations to compare with VPTQ, the new quantization approach from the Microsoft team claiming to be SOTA, the token embedding and output tensor are left as `fp16`, and the bpw only includes the tensors from the repeating layers. I have added labels to the 4-bit quants for easier disambiguation.

In all following graphs the token embedding and output tensors are quantized, and the bpw is for the total model (i.e., total number of bits, including embedding and output tensors, divided by the total number of model parameters).