[CUDA] Fixing quantization uint8 packing bug for NF4 and FP4 #1721
This pull request addresses a bug in the quantization logic for FP4 and NF4 during 8-bit packing in the `kQuantizeBlockwise` CUDA kernel. The expected behaviour is to pack two quantized 4-bit values into a single unsigned char (uint8) to reduce the overall memory footprint. The issue in the current implementation is that the intermediate variable used for packing was not cleared between iterations, so stale bits from previous pairs leaked into later packed bytes.
Since values are quantized per block, the accumulated error grows with block size, rendering the algorithm ineffective for block sizes larger than 512.
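To illustrate the failure mode, here is a minimal host-side sketch of the packing loop, not the actual `kQuantizeBlockwise` kernel code; `quantize4bit` and `pack_blockwise` are hypothetical stand-ins for the real codebook lookup and kernel logic:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real NF4/FP4 codebook lookup:
// maps a float to a 4-bit code.
static uint8_t quantize4bit(float v) {
    return static_cast<uint8_t>(v) & 0x0F;
}

std::vector<uint8_t> pack_blockwise(const std::vector<float>& vals) {
    std::vector<uint8_t> out;
    uint8_t packed = 0;
    for (size_t i = 0; i < vals.size(); i += 2) {
        packed = 0;  // the fix: clear between iterations; without this,
                     // stale bits from the previous pair corrupt this byte
        packed |= quantize4bit(vals[i]) << 4;        // high nibble
        if (i + 1 < vals.size())
            packed |= quantize4bit(vals[i + 1]);     // low nibble
        out.push_back(packed);
    }
    return out;
}
```

With the reset in place, packing `{1, 2, 3, 4}` yields the two bytes `0x12` and `0x34`; dropping the `packed = 0` line would OR each new pair into the bits left over from the last one.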
List of changes:
The following graph highlights the accuracy improvements: