Nvidia ModelOpt (NVFP4) compatibility for DiffusionGemma #46772

@ppabis

Description

Feature request

Currently, when I try to run DiffusionGemma in NVFP4 through transformers, I see this warning:

[transformers] Unknown quantization type, got modelopt - supported types are: ['awq', ..., 'gemma']

After installing nvidia-modelopt[hf], transformers is downgraded:

  • nvidia-modelopt[hf] installed nvidia-modelopt 0.46.0.dev70+g93dd08f42
  • Its hf extra requires transformers<5.10,>=4.56
  • Pip resolved that to transformers 5.9.0
  • With transformers 5.9.0, this fails immediately:

ImportError: cannot import name 'DiffusionGemmaForBlockDiffusion' from 'transformers'
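To make the version conflict concrete, here is a minimal sketch of the constraint check, using plain tuple comparison instead of pip's real resolver. The helper names `parse` and `satisfies_hf_extra` are made up for illustration, and pre-release or dev suffixes are not handled.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn a plain release string such as '5.9.0' into (5, 9, 0)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_hf_extra(version: str) -> bool:
    """Check a version against the hf extra's pin: transformers<5.10,>=4.56."""
    v = parse(version)
    return parse("4.56") <= v < parse("5.10")


for candidate in ("5.9.0", "5.12.1"):
    print(candidate, satisfies_hf_extra(candidate))
# 5.9.0 True
# 5.12.1 False
```

So the pin excludes 5.12.1, and the release pip settled on (5.9.0) does not export DiffusionGemmaForBlockDiffusion.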

When I use the latest transformers==5.12.1 with nvidia-modelopt installed from Git, I get the following errors:

[transformers] This checkpoint seem corrupted. The tied weights mapping for this model specifies to tie model.decoder.layers.9.experts.down_proj to model.encoder.language_model.layers.9.experts.down_proj, but both are absent from the checkpoint, and we could not find another related tied weight for those keys
[transformers] DiffusionGemmaForBlockDiffusion LOAD REPORT from: nvidia/diffusiongemma-26B-A4B-it-NVFP4
Key                                                                      | Status     | 
-------------------------------------------------------------------------+------------+-
model.decoder.layers.{0...29}.experts.{0...127}.down_proj.weight         | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.gate_proj.weight_scale_2 | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.up_proj.weight_scale_2   | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.up_proj.input_scale      | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.up_proj.weight           | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.up_proj.weight_scale     | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.gate_proj.input_scale    | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.gate_proj.weight_scale   | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.down_proj.weight_scale   | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.gate_proj.weight         | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.down_proj.input_scale    | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.down_proj.weight_scale_2 | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.down_proj                          | MISSING    | 
model.encoder.language_model.layers.{0...29}.experts.gate_up_proj        | MISSING    | 
model.encoder.language_model.layers.{0...29}.experts.down_proj           | MISSING    | 
model.decoder.layers.{0...29}.experts.gate_up_proj                       | MISSING    | 

Notes:
- UNEXPECTED:   can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING:      those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
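The report boils down to a naming mismatch: the checkpoint stores one set of tensors per expert, while the model expects fused per-layer parameters. The sketch below only groups key names to show which fused parameter each per-expert key would have to feed. The regex and the `group_per_expert_keys` helper are hypothetical, the gate_proj/up_proj to gate_up_proj pairing is inferred from the names in the report, and no tensor data is touched.

```python
import re
from collections import defaultdict

# Matches per-expert keys such as
# model.decoder.layers.9.experts.17.down_proj.weight_scale_2
PER_EXPERT = re.compile(
    r"^(?P<prefix>.+\.experts)\.(?P<expert>\d+)\."
    r"(?P<proj>gate_proj|up_proj|down_proj)\.(?P<tensor>\w+)$"
)


def group_per_expert_keys(keys):
    """Group per-expert checkpoint keys by the fused parameter name the model expects."""
    groups = defaultdict(list)
    for key in keys:
        m = PER_EXPERT.match(key)
        if m is None:
            continue  # not a per-expert key, e.g. an already fused parameter
        fused = "gate_up_proj" if m["proj"] in ("gate_proj", "up_proj") else "down_proj"
        groups[f"{m['prefix']}.{fused}"].append(key)
    return dict(groups)


checkpoint_keys = [
    "model.decoder.layers.9.experts.0.gate_proj.weight",
    "model.decoder.layers.9.experts.0.up_proj.weight_scale",
    "model.decoder.layers.9.experts.0.down_proj.input_scale",
]
for fused_name, sources in group_per_expert_keys(checkpoint_keys).items():
    print(fused_name, "<-", len(sources), "checkpoint key(s)")
```

A real loader would also have to carry the weight_scale, weight_scale_2, and input_scale tensors alongside the packed weights rather than just matching names.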

Motivation

The suggested path for running DiffusionGemma in NVFP4 is vLLM. However, for a single-user setup or custom scripts, using the transformers library directly is much more practical.

Your contribution

I am not able to implement this myself, other than by having a model vibe-code it. Here is what GPT-5.5 suggested as a fix:

  1. Register a modelopt quantizer

    • Add quantizer_modelopt.py and map quant_method: "modelopt" in the auto-quantizer registry.
  2. Implement NVFP4 linear modules

    • Replace relevant nn.Linear layers with ModelOpt-compatible NVFP4 linear layers.
    • Keep packed 4-bit weights and both scaling tensors on GPU; never expand them to BF16/FP16.
    • Dispatch forward passes to NVIDIA’s ModelOpt/CUTLASS kernels.
  3. Map DiffusionGemma’s MoE checkpoint layout

    • Load the per-expert checkpoint keys (gate_proj, up_proj, and down_proj), including their weight_scale, weight_scale_2, and input_scale tensors.
    • Adapt the DiffusionGemma MoE implementation so it accepts the per-expert, quantized layout instead of expecting fused BF16 gate_up_proj / down_proj tensors.
  4. Keep quantization metadata through loading

    • Ensure from_pretrained() does not emit “unknown quantization type … skipping quantization.”
    • Prevent the current false-success path where missing BF16 MoE tensors are randomly initialized.
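Steps 1 and 4 could be sketched roughly as below. This is not the transformers auto-quantizer API: `QUANTIZER_REGISTRY`, `register_quantizer`, `ModelOptQuantizerSketch`, `resolve_quantizer`, and `check_load_report` are all hypothetical names, used only to show a registry lookup that fails loudly and a load check that refuses to continue with randomly initialized tensors.

```python
class UnknownQuantizationError(ValueError):
    """Raised instead of silently skipping quantization."""


# Hypothetical registry mapping quant_method strings to quantizer classes.
QUANTIZER_REGISTRY = {}


def register_quantizer(name):
    """Class decorator that records a quantizer under its quant_method name."""
    def decorator(cls):
        QUANTIZER_REGISTRY[name] = cls
        return cls
    return decorator


@register_quantizer("modelopt")
class ModelOptQuantizerSketch:
    """Placeholder only; a real quantizer would swap in NVFP4 linear modules."""

    def __init__(self, quantization_config):
        self.quantization_config = quantization_config


def resolve_quantizer(quantization_config):
    """Look up the quantizer for a config dict, raising on unknown methods."""
    method = quantization_config.get("quant_method")
    try:
        cls = QUANTIZER_REGISTRY[method]
    except KeyError:
        supported = sorted(QUANTIZER_REGISTRY)
        raise UnknownQuantizationError(
            f"Unknown quantization type {method!r}; supported types are: {supported}"
        ) from None
    return cls(quantization_config)


def check_load_report(missing_keys, allow_missing=()):
    """Fail the load if any parameter would be newly (randomly) initialized."""
    blocking = [key for key in missing_keys if key not in allow_missing]
    if blocking:
        raise RuntimeError(
            f"{len(blocking)} parameter(s) missing from the checkpoint, "
            f"e.g. {blocking[0]}; refusing to initialize them randomly"
        )


quantizer = resolve_quantizer({"quant_method": "modelopt"})
print(type(quantizer).__name__)
```

Raising in both places turns the warning-only path described in this issue into a hard error, so a quantized checkpoint cannot load as a model with untrained MoE weights.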
