Nvidia ModelOpt (NVFP4) compatibility for DiffusionGemma

### Feature request

Currently, when I try to run DiffusionGemma in NVFP4 through transformers, I see this warning:

```
[transformers] Unknown quantization type, got modelopt - supported types are: ['awq', ..., 'gemma']
```

Installing `nvidia-modelopt[hf]` downgrades transformers:
- `nvidia-modelopt[hf]` installed `nvidia-modelopt 0.46.0.dev70+g93dd08f42`.
- Its `hf` extra requires `transformers<5.10,>=4.56`.
- Pip resolved that to `transformers 5.9.0`.
- With `transformers 5.9.0`, importing the model class fails immediately:

`ImportError: cannot import name 'DiffusionGemmaForBlockDiffusion' from 'transformers'`
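
The version conflict can be checked by hand. The following is a minimal sketch in plain Python, not part of pip or transformers; `parse_version` and `satisfies_hf_extra` are hypothetical helper names, and the bounds are copied from the `hf` extra's requirement listed above. It only handles plain numeric versions such as `5.9.0`.

```python
def parse_version(version: str) -> tuple:
    """Turn a plain numeric version like '5.9.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_hf_extra(version: str) -> bool:
    """Check the range transformers>=4.56,<5.10 required by the hf extra."""
    return parse_version("4.56") <= parse_version(version) < parse_version("5.10")


# Compare the version pip selected with the version that was tried afterwards.
for candidate in ("5.9.0", "5.12.1"):
    print(candidate, satisfies_hf_extra(candidate))
```

The version pip selects (5.9.0) lacks the model class, while the version that has it (5.12.1) falls outside the range the extra allows.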

When I use the latest `transformers==5.12.1` and `nvidia-modelopt` from Git, I get the following errors:

```
[transformers] This checkpoint seem corrupted. The tied weights mapping for this model specifies to tie model.decoder.layers.9.experts.down_proj to model.encoder.language_model.layers.9.experts.down_proj, but both are absent from the checkpoint, and we could not find another related tied weight for those keys
[transformers] DiffusionGemmaForBlockDiffusion LOAD REPORT from: nvidia/diffusiongemma-26B-A4B-it-NVFP4
Key                                                                      | Status     | 
-------------------------------------------------------------------------+------------+-
model.decoder.layers.{0...29}.experts.{0...127}.down_proj.weight         | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.gate_proj.weight_scale_2 | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.up_proj.weight_scale_2   | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.up_proj.input_scale      | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.up_proj.weight           | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.up_proj.weight_scale     | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.gate_proj.input_scale    | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.gate_proj.weight_scale   | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.down_proj.weight_scale   | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.gate_proj.weight         | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.down_proj.input_scale    | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.{0...127}.down_proj.weight_scale_2 | UNEXPECTED | 
model.decoder.layers.{0...29}.experts.down_proj                          | MISSING    | 
model.encoder.language_model.layers.{0...29}.experts.gate_up_proj        | MISSING    | 
model.encoder.language_model.layers.{0...29}.experts.down_proj           | MISSING    | 
model.decoder.layers.{0...29}.experts.gate_up_proj                       | MISSING    | 

Notes:
- UNEXPECTED:   can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING:      those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
```

### Motivation

The suggested path for DiffusionGemma in NVFP4 is vLLM. However, for a single-user setup or custom scripts, using the transformers library is much more reasonable.

### Your contribution

I am not able to implement this myself, short of letting a model vibecode it. GPT-5.5 suggested the following fix:
1. Register a modelopt quantizer
   - Add `quantizer_modelopt.py` and map `quant_method: "modelopt"` in the auto-quantizer registry.

2. Implement NVFP4 linear modules
   - Replace relevant `nn.Linear` layers with ModelOpt-compatible NVFP4 linear layers.
   - Keep packed 4-bit weights and both scaling tensors on GPU; never expand them to BF16/FP16.
   - Dispatch forward passes to NVIDIA’s ModelOpt/CUTLASS kernels.

3. Map DiffusionGemma’s MoE checkpoint layout
   - Load checkpoint keys such as per-expert `gate_proj`, `up_proj`, and `down_proj`, including their `weight_scale`, `weight_scale_2`, and `input_scale`.
   - Adapt the DiffusionGemma MoE implementation so it accepts its per-expert, quantized layout instead of expecting fused BF16 `gate_up_proj` / `down_proj` tensors.

4. Keep quantization metadata through loading
   - Ensure `from_pretrained()` does not emit “unknown quantization type … skipping quantization.”
   - Prevent the current false-success path where missing BF16 MoE tensors are randomly initialized.
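
To illustrate the layout mismatch behind step 3, here is a minimal sketch in plain Python that groups the per-expert keys from the load report by projection and tensor type. `group_expert_keys` is a hypothetical helper, not a transformers or ModelOpt API, and it only inspects key names; how the grouped NVFP4 tensors should then be stacked or consumed by the MoE module is left open.

```python
import re

# Per-expert keys as they appear in the load report, e.g.
# model.decoder.layers.9.experts.17.down_proj.weight_scale
_EXPERT_KEY = re.compile(
    r"^(?P<prefix>.+\.experts)\.(?P<expert>\d+)\."
    r"(?P<proj>gate_proj|up_proj|down_proj)\.(?P<suffix>.+)$"
)


def group_expert_keys(keys):
    """Group per-expert keys by (experts prefix, projection, tensor suffix).

    Returns a dict mapping each group to {expert_index: original_key}.
    Keys without an expert index (such as the fused names) are skipped.
    """
    groups = {}
    for key in keys:
        match = _EXPERT_KEY.match(key)
        if match is None:
            continue
        group = (match["prefix"], match["proj"], match["suffix"])
        groups.setdefault(group, {})[int(match["expert"])] = key
    return groups


keys = [
    "model.decoder.layers.9.experts.0.down_proj.weight",
    "model.decoder.layers.9.experts.0.down_proj.weight_scale",
    "model.decoder.layers.9.experts.1.down_proj.weight",
    # Fused name the model expects; it has no expert index, so it is skipped.
    "model.decoder.layers.9.experts.down_proj",
]
for group, members in group_expert_keys(keys).items():
    print(group, sorted(members))
```

In the checkpoint above, each group would hold experts 0 to 127, whereas the model expects a single `experts.down_proj` or `experts.gate_up_proj` tensor per layer.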
