[Bug] FP8 block-quant loader rejects artifacts using 'weight_scale' rather than 'weight_scale_inv' naming #43564

@pasta-paul

[Bug] FP8 block-quant loader rejects artifacts whose safetensors save weight_scale rather than weight_scale_inv

Summary

A class of compressed-tensors-quantized artifacts saves FP8 block-quant scale tensors under the attribute name weight_scale (no _inv suffix), with mathematically identical content to the weight_scale_inv form vLLM's FP8 block-quant loader expects. The loader crashes with:

AttributeError: 'MergedColumnParallelLinear' object has no attribute
'weight_scale_inv'. Did you mean: 'weight_scale'?

The crash sites (vllm/model_executor/kernels/linear/scaled_mm/marlin.py:73 and marlin_utils_fp8.py:106) access layer.weight_scale_inv directly. The DeepseekV4 weight renaming mapper (vllm/models/deepseek_v4/nvidia/model.py:1511) only renames .scale to .weight_scale_inv; it does not handle the case where the artifact already uses the longer name .weight_scale.

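A defensive lookup such as getattr(layer, "weight_scale_inv", getattr(layer, "weight_scale", None)) would accept both naming conventions transparently. The inner fallback must itself be a getattr: writing layer.weight_scale as the default would be evaluated eagerly and raise on layers that only carry weight_scale_inv.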
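The lookup described above can be sketched as a small standalone helper. This is an illustration only: resolve_block_scale is a hypothetical name, not an existing vLLM function, and the stand-in layer objects below are plain namespaces rather than real torch modules.

```python
from types import SimpleNamespace


def resolve_block_scale(layer, names=("weight_scale_inv", "weight_scale")):
    """Return (attribute_name, value) for the first scale attribute found.

    Hypothetical helper: tries the canonical name first, then the
    llmcompressor model_free_ptq name, and fails with a clear message
    instead of passing None further down the pipeline.
    """
    for name in names:
        value = getattr(layer, name, None)
        if value is not None:
            return name, value
    raise AttributeError(
        f"{type(layer).__name__} has none of the expected scale attributes: "
        + ", ".join(names)
    )


# Stand-in layers (assumption: real layers would be torch.nn.Module instances).
canonical = SimpleNamespace(weight_scale_inv="inv-scale")
model_free = SimpleNamespace(weight_scale="plain-scale")

print(resolve_block_scale(canonical))   # ('weight_scale_inv', 'inv-scale')
print(resolve_block_scale(model_free))  # ('weight_scale', 'plain-scale')
```

Raising when neither attribute exists keeps the failure at the lookup, rather than surfacing later as an error about a None scale.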

Reproducer

Artifact: canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (DeepSeek-V4-Flash, W4A16 routed experts + FP8 block 128×128 attention + MTP draft head). The safetensors were built by llmcompressor's model_free_ptq path, which produces keys named <module>.weight_scale instead of <module>.weight_scale_inv.

from safetensors import safe_open
with safe_open(".../layers.0.attn.wkv.weight", framework="pt") as f:
    t = f.get_tensor("layers.0.attn.wkv.weight")
    print(t.dtype, t.shape)         # torch.float8_e4m3fn, (512, 4096)
    s = f.get_tensor("layers.0.attn.wkv.weight_scale")  # NOT weight_scale_inv
    print(s.dtype, s.shape)         # torch.bfloat16, (4, 32) — FP8 block 128×128 scales

The mathematical content is identical to the canonical form — the block-scaled FP8 weight reconstruction is weight * scale.repeat_interleave(128, dim=0).repeat_interleave(128, dim=1). Only the attribute name differs from the loader's expectation.
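To make the block layout concrete, here is a dependency-free sketch of the same reconstruction on nested lists. The function names are illustrative, and the ceil-division rule for dimensions that are not multiples of the block size is an assumption; the 512×4096 case above divides evenly.

```python
def expected_scale_shape(weight_shape, block=(128, 128)):
    """Number of scale entries per dimension (ceil-division is assumed)."""
    rows, cols = weight_shape
    return ((rows + block[0] - 1) // block[0], (cols + block[1] - 1) // block[1])


def dequantize_blockwise(weight, scale, block=(128, 128)):
    """Multiply each weight element by the scale of the block it falls in.

    Element (i, j) uses scale[i // block_rows][j // block_cols], which is
    what the two repeat_interleave calls express in tensor form.
    """
    return [
        [w * scale[i // block[0]][j // block[1]] for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]


print(expected_scale_shape((512, 4096)))  # (4, 32)

# Tiny example with 2x2 blocks so the result is easy to check by hand.
weight = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0]]
scale = [[0.5, 2.0]]
print(dequantize_blockwise(weight, scale, block=(2, 2)))
# [[0.5, 1.0, 6.0, 8.0], [2.5, 3.0, 14.0, 16.0]]
```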

Why this matters

llmcompressor's newer model_free_ptq path (which bypasses the PreTrainedModel integration step and writes safetensors directly) emits this naming. Any downstream artifact built with that path hits the loader crash, even though the math is correct and the artifact was previously usable on older vLLM builds that were tolerant of the naming.

This is a class of bug, not a one-off. We've confirmed:

  • canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (post-dequant shipping fix)
  • The original (pre-dequant) FP8 attention in the same artifact also exhibits this — all 33,239 FP8/W4A16 scale tensors use weight_scale (no _inv)
  • canada-quant/DeepSeek-V4-Flash-W4A16-FP8 (Card A) has the same llmcompressor-produced naming
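A quick way to check which convention an artifact uses is to tally key suffixes. The sketch below works on any iterable of tensor names; in practice they could come from the keys() of a safe_open handle. The helper name and the sample keys other than layers.0.attn.wkv.* are made up for illustration.

```python
from collections import Counter


def count_scale_suffixes(keys):
    """Tally tensor names by scale-naming convention."""
    counts = Counter()
    for key in keys:
        # Check the longer suffix first so it is not mistaken for the shorter one.
        if key.endswith(".weight_scale_inv"):
            counts["weight_scale_inv"] += 1
        elif key.endswith(".weight_scale"):
            counts["weight_scale"] += 1
    return counts


sample_keys = [
    "layers.0.attn.wkv.weight",
    "layers.0.attn.wkv.weight_scale",
    "layers.1.attn.wkv.weight",         # hypothetical key
    "layers.1.attn.wkv.weight_scale",   # hypothetical key
]
counts = count_scale_suffixes(sample_keys)
print(counts["weight_scale"], counts["weight_scale_inv"])  # 2 0
```

An artifact that reports zero weight_scale_inv keys alongside FP8 block-quant weights is a candidate for this crash.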

Proposed fix

Small defensive patch:

--- a/vllm/model_executor/kernels/linear/scaled_mm/marlin.py
+++ b/vllm/model_executor/kernels/linear/scaled_mm/marlin.py
@@ -70,8 +70,20 @@ class CompressedTensorsW8A8Fp8MarlinScaledMMLinearKernel(...):
     def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
         if self.block_quant:
+            # Accept either `weight_scale_inv` (canonical) or `weight_scale`
+            # (llmcompressor model_free_ptq naming). Math is identical.
+            scale = getattr(layer, "weight_scale_inv",
+                            getattr(layer, "weight_scale", None))
             weight, weight_scale_inv = process_fp8_weight_block_strategy(
-                layer.weight, layer.weight_scale_inv
+                layer.weight, scale
             )
             # Update layer with new values
             replace_parameter(layer, "weight", weight.data)
-            replace_parameter(layer, "weight_scale_inv", weight_scale_inv.data)
+            # Always register the result under `weight_scale_inv` so downstream
+            # forward-path code (Dynamo, BlockScaledMMLinearKernel) finds it.
+            if hasattr(layer, "weight_scale_inv"):
+                replace_parameter(layer, "weight_scale_inv", weight_scale_inv.data)
+            else:
+                layer.register_parameter(
+                    "weight_scale_inv",
+                    torch.nn.Parameter(weight_scale_inv.data, requires_grad=False),
+                )

The same pattern likely applies to marlin_utils_fp8.py:106 (prepare_fp8_layer_for_marlin) and possibly to the DeepseekV4 renaming mapper in vllm/models/deepseek_v4/nvidia/model.py:1511 — we can extend the proposal to those sites in the same PR.
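For the mapper site, one option is to normalize key names before loading. The sketch below is not vLLM's mapper API; rename_fp8_block_scales and the fp8_block_modules argument are hypothetical. It assumes the caller knows which module prefixes are FP8 block-quantized, since the W4A16 tensors in the same artifact also use weight_scale and presumably should keep that name.

```python
def rename_fp8_block_scales(keys, fp8_block_modules):
    """Map each key to its load-time name.

    Only `<module>.weight_scale` keys whose module is listed in
    fp8_block_modules are renamed to `<module>.weight_scale_inv`;
    every other key maps to itself.
    """
    renamed = {}
    for key in keys:
        module, _, leaf = key.rpartition(".")
        if leaf == "weight_scale" and module in fp8_block_modules:
            renamed[key] = f"{module}.weight_scale_inv"
        else:
            renamed[key] = key
    return renamed


keys = [
    "layers.0.attn.wkv.weight",
    "layers.0.attn.wkv.weight_scale",
    "layers.0.ffn.experts.0.w1.weight_scale",  # hypothetical W4A16 key
]
mapping = rename_fp8_block_scales(keys, fp8_block_modules={"layers.0.attn.wkv"})
print(mapping["layers.0.attn.wkv.weight_scale"])
# layers.0.attn.wkv.weight_scale_inv
```

Whether renaming at the mapper or falling back in the kernel is preferable depends on the answer to the open question below.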

Open question for kylesayrs

Is weight_scale vs weight_scale_inv a deliberate semantic distinction (e.g., weight_scale_inv is the multiplicative inverse used during dequant fastpath, vs weight_scale for divide-by-scale dequant), or are they fully interchangeable in the FP8 block-quant path? The process_fp8_weight_block_strategy function appears to treat them identically, but if there's a subtle difference, the right fix is the source quantization step naming the attribute consistently, not the loader fallback.

We're happy to extend this PR (or split into a child PR for the model-mapper) once that's clarified.

Cross-references

cc @kylesayrs (compressed-tensors maintainer)
