[Bug] FP8 block-quant loader rejects artifacts whose safetensors save weight_scale rather than weight_scale_inv
Summary
A class of compressed-tensors-quantized artifacts saves FP8 block-quant scale tensors under the attribute name weight_scale (no _inv suffix), with mathematically identical content to the weight_scale_inv form vLLM's FP8 block-quant loader expects. The loader crashes with:
AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight_scale_inv'. Did you mean: 'weight_scale'?
The crash sites (vllm/model_executor/kernels/linear/scaled_mm/marlin.py:73 and marlin_utils_fp8.py:106) access layer.weight_scale_inv directly. The DeepseekV4 weight renaming mapper (vllm/models/deepseek_v4/nvidia/model.py:1511) only renames .scale → .weight_scale_inv; it does not handle artifacts that already use the longer name .weight_scale.
A defensive lookup that tries weight_scale_inv first and falls back to weight_scale, for example getattr(layer, "weight_scale_inv", None) followed by a None check, would accept both naming conventions transparently.
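One caveat when writing the fallback: Python evaluates the default argument of getattr eagerly, so getattr(layer, "weight_scale_inv", layer.weight_scale) raises AttributeError on a canonical layer that has only weight_scale_inv. Below is a minimal sketch of a lookup that avoids this. The helper name resolve_block_scale is hypothetical (not a vLLM API), and the layers are plain stand-in objects rather than real modules.

```python
from types import SimpleNamespace

_MISSING = object()


def resolve_block_scale(layer):
    """Return the block-quant scale stored under either attribute name.

    Hypothetical helper for illustration; not part of vLLM.
    """
    for name in ("weight_scale_inv", "weight_scale"):
        value = getattr(layer, name, _MISSING)
        if value is not _MISSING:
            return value
    raise AttributeError(
        f"{type(layer).__name__} has neither 'weight_scale_inv' nor 'weight_scale'"
    )


# Stand-ins for real layers; the strings stand in for scale tensors.
canonical = SimpleNamespace(weight_scale_inv="scales-A")
model_free = SimpleNamespace(weight_scale="scales-B")

print(resolve_block_scale(canonical))
print(resolve_block_scale(model_free))

# The eager-default form fails on the canonical layer, because
# `canonical.weight_scale` is evaluated before getattr is called.
try:
    getattr(canonical, "weight_scale_inv", canonical.weight_scale)
except AttributeError as exc:
    print("eager default raised:", exc)
```

The nested getattr with a None default used in the proposed patch sidesteps the same problem.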
Reproducer
Artifact: canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (DeepSeek-V4-Flash, W4A16 routed experts + FP8 block 128×128 attention + MTP draft head). The safetensors were built by llmcompressor's model_free_ptq path, which produces keys named <module>.weight_scale instead of <module>.weight_scale_inv.
from safetensors import safe_open

with safe_open(".../layers.0.attn.wkv.weight", framework="pt") as f:
    t = f.get_tensor("layers.0.attn.wkv.weight")
    print(t.dtype, t.shape)  # torch.float8_e4m3fn, (512, 4096)
    s = f.get_tensor("layers.0.attn.wkv.weight_scale")  # NOT weight_scale_inv
    print(s.dtype, s.shape)  # torch.bfloat16, (4, 32): FP8 block 128×128 scales
The mathematical content is identical to the canonical form: the block-scaled FP8 weight reconstruction is weight * scale.repeat_interleave(128, dim=0).repeat_interleave(128, dim=1), which expands the (4, 32) scale grid to the (512, 4096) weight shape. Only the attribute name differs from the loader's expectation.
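To make the reconstruction concrete, here is a dependency-free sketch of the same block expansion on a toy 4×4 matrix with 2×2 blocks instead of 128×128. The function name dequant_block and the values are illustrative only. With real tensors the FP8 weight would likely need a cast to a wider dtype before the multiply.

```python
def dequant_block(weight, scale, block):
    """Multiply each element by the scale of the block it falls in.

    Mirrors, on nested lists, the effect of
    weight * scale.repeat_interleave(block, 0).repeat_interleave(block, 1).
    """
    rows, cols = len(weight), len(weight[0])
    return [
        [weight[r][c] * scale[r // block][c // block] for c in range(cols)]
        for r in range(rows)
    ]


# Toy 4x4 "quantized" weight and a 2x2 grid of per-block scales.
weight = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
]
scale = [
    [0.5, 2.0],
    [1.0, 4.0],
]

dequant = dequant_block(weight, scale, block=2)
for row in dequant:
    print(row)
```

Each scale entry applies to one block of the weight, which is why a (4, 32) scale tensor is enough for a (512, 4096) weight at block size 128.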
Why this matters
llmcompressor's newer model_free_ptq path (which bypasses the PreTrainedModel integration step and writes safetensors directly) emits this naming. Any downstream artifact built with that path hits the loader crash, even though the math is correct and the artifact was previously usable on older vLLM builds that were tolerant of the naming.
This is a class of bug, not a one-off. We've confirmed:
- canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (post-dequant shipping fix)
  - The original (pre-dequant) FP8 attention in the same artifact also exhibits this: all 33,239 FP8/W4A16 scale tensors use weight_scale (no _inv)
- canada-quant/DeepSeek-V4-Flash-W4A16-FP8 (Card A) has the same llmcompressor-produced naming
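To check whether another artifact is affected, counting scale-key suffixes is usually enough. The sketch below works on a plain list of tensor names so it runs without any checkpoint; count_scale_suffixes is a hypothetical helper, and the second scale key in the sample list is made up for illustration. With a real shard, the names could come from safe_open(path, framework="pt").keys().

```python
from collections import Counter


def count_scale_suffixes(tensor_names):
    """Count how many tensor names end in each scale suffix."""
    counts = Counter()
    for name in tensor_names:
        # Check the longer suffix first so the two cases stay distinct.
        if name.endswith(".weight_scale_inv"):
            counts["weight_scale_inv"] += 1
        elif name.endswith(".weight_scale"):
            counts["weight_scale"] += 1
    return counts


names = [
    "layers.0.attn.wkv.weight",
    "layers.0.attn.wkv.weight_scale",      # naming seen in the affected artifact
    "layers.1.attn.wkv.weight",            # hypothetical key for illustration
    "layers.1.attn.wkv.weight_scale_inv",  # hypothetical canonical-style key
]

counts = count_scale_suffixes(names)
print(dict(counts))
```

An artifact that reports only weight_scale keys for its FP8 block-quant modules would be expected to hit the loader crash described above.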
Proposed fix
Small defensive patch:
--- a/vllm/model_executor/kernels/linear/scaled_mm/marlin.py
+++ b/vllm/model_executor/kernels/linear/scaled_mm/marlin.py
@@ -70,11 +70,13 @@ class CompressedTensorsW8A8Fp8MarlinScaledMMLinearKernel(...):
     def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
         if self.block_quant:
+            # Accept either `weight_scale_inv` (canonical) or `weight_scale`
+            # (llmcompressor model_free_ptq naming). Math is identical.
+            scale = getattr(layer, "weight_scale_inv",
+                            getattr(layer, "weight_scale", None))
             weight, weight_scale_inv = process_fp8_weight_block_strategy(
-                layer.weight, layer.weight_scale_inv
+                layer.weight, scale
             )
             # Update layer with new values
             replace_parameter(layer, "weight", weight.data)
-            replace_parameter(layer, "weight_scale_inv", weight_scale_inv.data)
+            # Always register the result under `weight_scale_inv` so downstream
+            # forward-path code (Dynamo, BlockScaledMMLinearKernel) finds it.
+            if hasattr(layer, "weight_scale_inv"):
+                replace_parameter(layer, "weight_scale_inv", weight_scale_inv.data)
+            else:
+                layer.register_parameter(
+                    "weight_scale_inv",
+                    torch.nn.Parameter(weight_scale_inv.data, requires_grad=False),
+                )
The same pattern likely applies to marlin_utils_fp8.py:106 (prepare_fp8_layer_for_marlin) and possibly to the DeepseekV4 renaming mapper in vllm/models/deepseek_v4/nvidia/model.py:1511 — we can extend the proposal to those sites in the same PR.
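If the mapper route is taken, the renaming could be as small as the sketch below. It is only an illustration: normalize_scale_name and its is_fp8_block flag are hypothetical and do not reflect the actual mapper code. The flag is there because the same artifact also stores W4A16 scales under weight_scale, and those presumably should not be renamed.

```python
def normalize_scale_name(name: str, is_fp8_block: bool) -> str:
    """Map FP8 block-quant scale keys onto the `.weight_scale_inv` suffix.

    Hypothetical helper; names for non-FP8-block modules pass through unchanged.
    """
    if not is_fp8_block or name.endswith(".weight_scale_inv"):
        return name
    for suffix in (".weight_scale", ".scale"):
        if name.endswith(suffix):
            return name[: -len(suffix)] + ".weight_scale_inv"
    return name


print(normalize_scale_name("layers.0.attn.wkv.weight_scale", True))
print(normalize_scale_name("layers.0.attn.wkv.scale", True))
print(normalize_scale_name("layers.0.attn.wkv.weight_scale_inv", True))
print(normalize_scale_name("layers.0.attn.wkv.weight_scale", False))
```

Renaming at load time keeps the kernels unchanged, at the cost of the mapper needing to know which modules are FP8 block-quantized.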
Open question for kylesayrs
Is weight_scale vs weight_scale_inv a deliberate semantic distinction (e.g., weight_scale_inv is the multiplicative inverse used during dequant fastpath, vs weight_scale for divide-by-scale dequant), or are they fully interchangeable in the FP8 block-quant path? The process_fp8_weight_block_strategy function appears to treat them identically, but if there is a subtle difference, the right fix is for the source quantization step to name the attribute consistently, not a loader fallback.
We're happy to extend this PR (or split into a child PR for the model-mapper) once that's clarified.
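Until that is answered, the convention can be probed empirically if reference (unquantized) weights are available for a few elements: whichever of multiply or divide reconstructs the reference more closely indicates how the stored scale is meant to be used. A toy sketch, with a hypothetical helper name and made-up values:

```python
def guess_scale_convention(quantized, stored_scale, reference):
    """Return "multiply" or "divide", whichever better reconstructs `reference`."""
    err_mul = sum(abs(q * stored_scale - r) for q, r in zip(quantized, reference))
    err_div = sum(abs(q / stored_scale - r) for q, r in zip(quantized, reference))
    return "multiply" if err_mul <= err_div else "divide"


# Made-up values: reference weights, their quantized codes, and the scale.
reference = [0.10, -0.25, 0.40]
quantized = [2.0, -5.0, 8.0]

# Scale stored as a multiplier (dequant = q * s).
print(guess_scale_convention(quantized, 0.05, reference))
# The same scale stored as its reciprocal (dequant = q / s).
print(guess_scale_convention(quantized, 20.0, reference))
```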
Cross-references
- canada-quant repo audit: https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp/blob/main/docs/findings/cardd_marlin_patches_built_artifact_blocker_2026_05_25.md
- #40923 comment and #36889 reopen comment: same artifact, sibling bugs in the same Marlin path on SM 12.0.
cc @kylesayrs (compressed-tensors maintainer)