Unsloth Dynamic 2.0 Per-Tensor Quantization Recipe for MLX #1062
Brooooooklyn
started this conversation in
Ideas
I've implemented Unsloth's Dynamic 2.0 per-tensor quantization strategy in mlx-node, targeting Qwen3.5 hybrid models (full attention + GatedDeltaNet layers). Sharing here since this approach could benefit mlx-lm users as well.
What it does
Instead of uniform N-bit quantization, the recipe assigns each weight tensor a different bit-width based on Unsloth's KLD sensitivity research (150+ benchmarks across 121 configurations):
- gate_proj, up_proj
- down_proj
- q/k/v_proj, in_proj_*
- embed_tokens
- lm_head
- o_proj, out_proj

Combined with AWQ pre-scaling (4 groups exploiting norm->projection pairs), this achieves a ~3-bit average with significantly better quality than uniform Q3.
Key findings for Qwen3.5
- linear_attn.out_proj is the most sensitive tensor (KLD ~6.0) -- it must stay bf16
- o_proj has no preceding norm layer, so AWQ pre-scaling can't help -- bf16 is the only safe option
- keeping embed_tokens / lm_head at higher precision has negligible size impact but dramatically reduces output degradation

Models & Code & References
The imatrix data comes from Unsloth's open-source GGUF repos on Hugging Face, calibrated on conversational and coding data. I'd love to see testing results in mlx-lm/mlx-vlm.
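For anyone who wants to build calibration data rather than reuse Unsloth's GGUF imatrix files, the core statistic is the per-input-channel mean squared activation of each linear layer, which is what AWQ-style scale search weights by. A minimal collector sketch (`ImatrixCollector` is a hypothetical name, not Unsloth's tooling):

```python
import numpy as np

class ImatrixCollector:
    """Accumulate per-input-channel mean squared activations for one
    linear layer across a calibration corpus. Channels with large values
    are the ones whose quantization error hurts most."""

    def __init__(self, in_features: int):
        self.sum_sq = np.zeros(in_features, dtype=np.float64)
        self.count = 0

    def update(self, x: np.ndarray) -> None:
        # x: (tokens, in_features) activations feeding this layer
        self.sum_sq += (x.astype(np.float64) ** 2).sum(axis=0)
        self.count += x.shape[0]

    def importance(self) -> np.ndarray:
        # Mean squared activation per input channel
        return self.sum_sq / max(self.count, 1)
```

One collector per linear layer, fed by forward hooks over the calibration set, yields the per-channel importance used to pick AWQ scales.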