[Bug] FP8 block-quant loader rejects artifacts using 'weight_scale' rather than 'weight_scale_inv' naming

# [Bug] FP8 block-quant loader rejects artifacts whose safetensors save `weight_scale` rather than `weight_scale_inv`

## Summary

A class of compressed-tensors-quantized artifacts saves FP8 block-quant scale tensors under the attribute name `weight_scale` (no `_inv` suffix), with mathematically identical content to the `weight_scale_inv` form vLLM's FP8 block-quant loader expects. The loader crashes with:

```
AttributeError: 'MergedColumnParallelLinear' object has no attribute
'weight_scale_inv'. Did you mean: 'weight_scale'?
```

The crash site (`vllm/model_executor/kernels/linear/scaled_mm/marlin.py:73` and `marlin_utils_fp8.py:106`) accesses `layer.weight_scale_inv` directly. The DeepseekV4 weight renaming mapper (`vllm/models/deepseek_v4/nvidia/model.py:1511`) only renames `.scale` → `.weight_scale_inv`; it does not handle the case where the artifact already uses the longer name `.weight_scale`.

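A defensive fallback that looks up `weight_scale_inv` first and falls back to `weight_scale` only when the first lookup fails would accept both naming conventions transparently. Note that the one-liner `getattr(layer, "weight_scale_inv", layer.weight_scale)` is not safe: the default argument is evaluated eagerly, so it raises `AttributeError` on layers that only have `weight_scale_inv`. The nested form `getattr(layer, "weight_scale_inv", getattr(layer, "weight_scale", None))` avoids this.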

## Reproducer

`canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP` (DeepSeek-V4-Flash, W4A16 routed experts + FP8 block 128×128 attention + MTP draft head). The safetensors were built by `llmcompressor`'s `model_free_ptq` path, which produces keys named `<module>.weight_scale` instead of `<module>.weight_scale_inv`.

```python
from safetensors import safe_open
with safe_open(".../layers.0.attn.wkv.weight", framework="pt") as f:
    t = f.get_tensor("layers.0.attn.wkv.weight")
    print(t.dtype, t.shape)         # torch.float8_e4m3fn, (512, 4096)
    s = f.get_tensor("layers.0.attn.wkv.weight_scale")  # NOT weight_scale_inv
    print(s.dtype, s.shape)         # torch.bfloat16, (4, 32) — FP8 block 128×128 scales
```

The mathematical content is identical to the canonical form. After upcasting the FP8 weight to the scale's dtype, the block-scaled weight reconstruction is `weight.to(scale.dtype) * scale.repeat_interleave(128, dim=0).repeat_interleave(128, dim=1)`. Only the attribute name differs from the loader's expectation.
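To make the reconstruction concrete, here is a minimal pure-Python sketch of the same block-scaled multiply. It is an illustration only, not vLLM code: `dequantize_block_scaled` is a hypothetical helper, the inputs are plain nested lists standing in for already-upcast tensors, and a 2×2 block stands in for the real 128×128 block.

```python
def dequantize_block_scaled(weight, scale, block=128):
    """Multiply each element by the scale of the block it falls in.

    weight: 2-D list of floats (FP8 values assumed already upcast).
    scale:  2-D list with one entry per (block x block) tile of `weight`.
    """
    rows, cols = len(weight), len(weight[0])
    return [
        # Element (i, j) belongs to tile (i // block, j // block).
        [weight[i][j] * scale[i // block][j // block] for j in range(cols)]
        for i in range(rows)
    ]


# Tiny 4x4 weight with 2x2 blocks, so `scale` is 2x2.
w = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]
s = [[0.5, 2.0],
     [1.0, 4.0]]
deq = dequantize_block_scaled(w, s, block=2)
```

The index mapping `scale[i // block][j // block]` is equivalent to the two `repeat_interleave` calls: each scale entry is stretched over a `block × block` tile before the elementwise multiply. The result does not depend on what the scale tensor is called.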

## Why this matters

`llmcompressor`'s newer `model_free_ptq` path (which bypasses the PreTrainedModel integration step and writes safetensors directly) emits this naming. Any downstream artifact built with that path hits the loader crash, even though the math is correct and the artifact was previously usable on older vLLM builds that were tolerant of the naming.

This is a class of bug, not a one-off. We've confirmed:

- `canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP` (post-dequant shipping fix)
- The original (pre-dequant) FP8 attention in the same artifact also exhibits this — all 33,239 FP8/W4A16 scale tensors use `weight_scale` (no `_inv`)
- `canada-quant/DeepSeek-V4-Flash-W4A16-FP8` (Card A) has the same llmcompressor-produced naming
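
For anyone who wants to check another artifact for the same naming, a key audit along these lines may help. This is a sketch under assumptions: `classify_scale_keys` is a hypothetical helper, and the example keys below are illustrative (only the `layers.0.attn.wkv.*` names come from the reproducer). With a real shard, the key list could come from `safe_open(...).keys()`.

```python
from collections import Counter


def classify_scale_keys(keys):
    """Count tensor keys by the scale-naming convention of their suffix."""
    counts = Counter()
    for key in keys:
        # Check the longest suffix first so each key is counted once.
        if key.endswith(".weight_scale_inv"):
            counts["weight_scale_inv"] += 1
        elif key.endswith(".weight_scale"):
            counts["weight_scale"] += 1
        elif key.endswith(".scale"):
            counts["scale"] += 1
    return counts


example_keys = [
    "layers.0.attn.wkv.weight",
    "layers.0.attn.wkv.weight_scale",
    "layers.1.attn.wkv.weight_scale",      # hypothetical key
    "layers.2.attn.wkv.weight_scale_inv",  # hypothetical key
]
counts = classify_scale_keys(example_keys)
print(dict(counts))
```

An artifact affected by this issue would show a non-zero `weight_scale` count and a zero `weight_scale_inv` count for its FP8 block-quant modules.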

## Proposed fix

A small defensive patch:

```diff
--- a/vllm/model_executor/kernels/linear/scaled_mm/marlin.py
+++ b/vllm/model_executor/kernels/linear/scaled_mm/marlin.py
@@ -70,8 +70,20 @@ class CompressedTensorsW8A8Fp8MarlinScaledMMLinearKernel(...):
     def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
         if self.block_quant:
+            # Accept either `weight_scale_inv` (canonical) or `weight_scale`
+            # (llmcompressor model_free_ptq naming). Math is identical.
+            scale = getattr(layer, "weight_scale_inv",
+                            getattr(layer, "weight_scale", None))
             weight, weight_scale_inv = process_fp8_weight_block_strategy(
-                layer.weight, layer.weight_scale_inv
+                layer.weight, scale
             )
             # Update layer with new values
             replace_parameter(layer, "weight", weight.data)
-            replace_parameter(layer, "weight_scale_inv", weight_scale_inv.data)
+            # Always register the result under `weight_scale_inv` so downstream
+            # forward-path code (Dynamo, BlockScaledMMLinearKernel) finds it.
+            if hasattr(layer, "weight_scale_inv"):
+                replace_parameter(layer, "weight_scale_inv", weight_scale_inv.data)
+            else:
+                layer.register_parameter(
+                    "weight_scale_inv",
+                    torch.nn.Parameter(weight_scale_inv.data, requires_grad=False),
+                )
```

The same pattern likely applies to `marlin_utils_fp8.py:106` (`prepare_fp8_layer_for_marlin`) and possibly to the DeepseekV4 renaming mapper in `vllm/models/deepseek_v4/nvidia/model.py:1511` — we can extend the proposal to those sites in the same PR.
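
If the fallback ends up at several sites, a shared lookup helper might keep them consistent. The sketch below is a suggestion, not existing vLLM code: `resolve_block_scale` is a hypothetical name, and `_FakeLayer` is a stand-in for the real `torch.nn.Module` layers. Unlike the nested `getattr(..., None)` in the diff, it fails loudly when neither attribute is present instead of passing `None` downstream.

```python
def resolve_block_scale(layer):
    """Return (attribute_name, scale) for the FP8 block scale on `layer`.

    Prefers the canonical `weight_scale_inv` and falls back to
    `weight_scale`; raises if the layer has neither.
    """
    for name in ("weight_scale_inv", "weight_scale"):
        scale = getattr(layer, name, None)
        if scale is not None:
            return name, scale
    raise AttributeError(
        f"{type(layer).__name__} has neither 'weight_scale_inv' "
        "nor 'weight_scale'"
    )


class _FakeLayer:
    """Stand-in for a linear layer; real layers are torch.nn.Module objects."""


canonical = _FakeLayer()
canonical.weight_scale_inv = "scale-A"  # placeholder for a tensor

alternate = _FakeLayer()
alternate.weight_scale = "scale-B"      # placeholder for a tensor

bare = _FakeLayer()                     # neither attribute set
```

Returning the attribute name alongside the value lets the caller decide whether to call `replace_parameter` or register a new `weight_scale_inv` parameter, and whether to drop the now-redundant `weight_scale` afterwards.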

## Open question for kylesayrs

Is `weight_scale` vs `weight_scale_inv` a deliberate semantic distinction (e.g., `weight_scale_inv` is the multiplicative inverse used during dequant fastpath, vs `weight_scale` for divide-by-scale dequant), or are they fully interchangeable in the FP8 block-quant path? The `process_fp8_weight_block_strategy` function appears to treat them identically, but if there's a subtle difference, the right fix is the source quantization step naming the attribute consistently, not the loader fallback.
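
To illustrate why the distinction would matter if it exists, the following sketch compares the two candidate conventions on a single element. It assumes a trusted reference value (for example, from an unquantized checkpoint) is available; `infer_scale_convention` is a hypothetical helper written for this issue, not part of vLLM or compressed-tensors.

```python
def infer_scale_convention(q, scale, reference, tol=1e-6):
    """Guess whether `reference` was produced by q * scale or q / scale."""
    mul_ok = abs(q * scale - reference) <= tol
    div_ok = abs(q / scale - reference) <= tol
    if mul_ok and div_ok:
        return "ambiguous"  # e.g. scale == 1.0, both conventions agree
    if mul_ok:
        return "multiply"
    if div_ok:
        return "divide"
    return "neither"


# Quantized value 2.0 with stored scale 0.5:
#   multiply convention reconstructs 1.0, divide convention reconstructs 4.0.
print(infer_scale_convention(2.0, 0.5, 1.0))
print(infer_scale_convention(2.0, 0.5, 4.0))
```

If both names turn out to follow the multiply convention, the loader fallback is safe; if any producer stores the reciprocal under one of the names, a name-only fallback would silently produce wrong weights.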

We're happy to extend this PR (or split into a child PR for the model-mapper) once that's clarified.

## Cross-references

- `canada-quant` repo audit: https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp/blob/main/docs/findings/cardd_marlin_patches_built_artifact_blocker_2026_05_25.md
- This sits alongside our [`#40923 comment`](https://github.com/vllm-project/vllm/pull/40923#issuecomment-4530927937) and [`#36889 reopen comment`](https://github.com/vllm-project/vllm/pull/36889#issuecomment-4531289048): same artifact, sibling bugs in the same Marlin path on SM 12.0.

cc @kylesayrs (compressed-tensors maintainer)


[Bug] FP8 block-quant loader rejects artifacts using 'weight_scale' rather than 'weight_scale_inv' naming #43564
