Extended & simplified n-to-1 kernel fusion via KernelConfig by michaelbenayoun · Pull Request #46339 · huggingface/transformers

michaelbenayoun · 2026-06-02T09:21:36Z

What does this PR do?

Extends the KernelConfig API with two orthogonal capabilities:

Module fusion: specify how Transformers modules should be fused together before a custom kernel is applied (n-to-1 replacement).
Parameter transformation: handle cases where a kernel expects weights in a different layout than the original modeling (e.g. fused linears).

Compared to the previous PR, this approach is more explicit and much simpler, and it shifts much of the burden onto kernel authors.

How it works

The kernel author needs to define two classes:

KernelName: defines the forward pass, used by the kernels library to kernelize the model
KernelNameLayout: defines the conversion_mapping as well as an __init__ method. This is used to monkey-patch the model

There are two classes because the kernels library does not allow stateful kernel classes.
While this may be less pleasing than a single large class, it separates concerns.
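
The split between a stateful layout class and a stateless forward class can be pictured with a minimal plain-Python sketch. It makes no claim about how the kernels library actually swaps the forward pass: `ScaleLayout`, `ScaleKernel`, and the manual `types.MethodType` binding are hypothetical stand-ins used only for illustration.

```python
import types


class ScaleLayout:
    """Hypothetical layout class: owns the state (parameters, hyperparameters)."""

    def __init__(self, scale: float, eps: float = 1e-6):
        self.scale = scale
        self.eps = eps

    def forward(self, x: float) -> float:
        raise NotImplementedError("placeholder, replaced before use")


class ScaleKernel:
    """Hypothetical kernel class: defines only forward and keeps no state of its own."""

    def forward(self, x: float) -> float:
        # `self` is the layout instance once the method is bound to it.
        return self.scale * x + self.eps


layout = ScaleLayout(scale=2.0, eps=0.5)
# Assumption: binding by hand stands in for whatever kernelize does internally.
layout.forward = types.MethodType(ScaleKernel.forward, layout)
result = layout.forward(3.0)
print(result)
```

The kernel class never defines `__init__`; everything it reads in `forward` comes from the instance built by the layout class.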

Script for the examples

import torch

from transformers import AutoModelForCausalLM, AutoTokenizer, KernelConfig


model_id = "michaelbenayoun/qwen3-tiny-4kv-heads-4layers-random"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# --- baseline: plain model, no fusion ---
print("=" * 60)
print("Loading baseline model (no fusion)...")
baseline = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", use_kernels=True)
baseline.eval()
inputs = {k: v.to(baseline.device) for k, v in inputs.items()}

with torch.no_grad():
    baseline_out = baseline(**inputs).logits
print("Baseline output shape:", baseline_out.shape)
# del baseline

# --- fused model ---
print("=" * 60)
print("Loading fused model...")

# kernel_repo_id = "michaelbenayoun/dummy-rmsnorm-mlp:RMSNormMLP"
# kernel_repo_id = "michaelbenayoun/dummy-rmsnorm-mlp-with-transformations:RMSNormMLP"
# kernel_repo_id = "michaelbenayoun/dummy-rmsnorm-mlp-with-transformations-and-init:RMSNormMLP"
kernel_repo_id = "michaelbenayoun/dummy-rmsnorm-kernel-with-init:CustomRMSNorm"
kernel_config = KernelConfig(
    {
        # (
        #     ("RMSNorm", "model.layers.*.post_attention_layernorm"),
        #     ("MLP",     "model.layers.*.mlp"),
        # ): kernel_repo_id,
        "RMSNorm": kernel_repo_id,
    },
)

fused_model = AutoModelForCausalLM.from_pretrained(
    model_id, use_kernels=True, kernel_config=kernel_config, device_map="cuda"
)
fused_model.eval()
print(fused_model)

with torch.no_grad():
    fused_out = fused_model(**inputs).logits
print("Fused output shape:", fused_out.shape)

# --- compare ---
print("=" * 60)
print("Max diff fused vs baseline:", (fused_out - baseline_out).abs().max().item())

Example 1: Parameter transformation, no fusion

In this case, the KernelNameLayout class's __init__ method has the same signature as the module being replaced.

import torch
import torch.nn as nn

from transformers.conversion_mapping import WeightRenaming

class CustomRMSNormLayout(nn.Module):
    conversion_mapping = [
        WeightRenaming(
            source_patterns=r"(.*(?:input_layernorm|post_attention_layernorm|norm)\.)weight",
            target_patterns=r"\1scale",
        ),
    ]

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        pass  # replaced at runtime by kernelize


class CustomRMSNorm(nn.Module):
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        print("This dummy kernel is used")
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.scale * hidden_states.to(input_dtype)


class layers:
    CustomRMSNorm = CustomRMSNorm
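
To see what the renaming pattern in `conversion_mapping` does to checkpoint keys, here is a small sketch using only the standard `re` module. It assumes the renaming behaves like a regex substitution applied to each key, which may not match the exact semantics of `WeightRenaming`; the key names are illustrative.

```python
import re

# Patterns copied from CustomRMSNormLayout.conversion_mapping above.
source_pattern = r"(.*(?:input_layernorm|post_attention_layernorm|norm)\.)weight"
target_pattern = r"\1scale"

# Illustrative key names, not read from an actual checkpoint.
checkpoint_keys = [
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.post_attention_layernorm.weight",
    "model.norm.weight",
    "model.layers.0.mlp.down_proj.weight",
]

# Assumption: the renaming acts like a regex substitution on each key.
renamed = [re.sub(source_pattern, target_pattern, key) for key in checkpoint_keys]
for old, new in zip(checkpoint_keys, renamed):
    print(f"{old} -> {new}")
```

Only keys whose last module name ends in `norm` are rewritten to `.scale`; the `down_proj` key passes through unchanged.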

Example 2: Fusion and parameter transformation

Compared to the first example, here we fuse two modules of the original model into a single module.
Because of this, the __init__ method does not have the same signature as the modules being replaced; instead, it takes the instantiated modules it is fusing.

import torch
import torch.nn as nn

from transformers import Concatenate, WeightConverter
from transformers.conversion_mapping import WeightRenaming


class RMSNormMLPLayout(nn.Module):
    conversion_mapping = [
        # norm.weight → scale (placed at post_attention_layernorm.scale)
        WeightRenaming(
            source_patterns=r"(.*post_attention_layernorm\.)weight",
            target_patterns=r"\1scale",
        ),
        # mlp.gate_proj + mlp.up_proj → post_attention_layernorm.gate_up_proj (concat)
        WeightConverter(
            ["mlp.gate_proj", "mlp.up_proj"],
            "post_attention_layernorm.gate_up_proj",
            [Concatenate(dim=0)],
        ),
        # mlp.down_proj.* → post_attention_layernorm.down_proj.*
        WeightRenaming(
            source_patterns=r"(.*\.)mlp\.(down_proj\..*)",
            target_patterns=r"\1post_attention_layernorm.\2",
        ),
    ]

    def __init__(self, norm, mlp):
        super().__init__()
        self.variance_epsilon = norm.variance_epsilon
        self.scale = nn.Parameter(torch.empty_like(norm.weight))
        self.gate_up_proj = nn.Linear(
            mlp.gate_proj.in_features,
            mlp.gate_proj.out_features + mlp.up_proj.out_features,
            bias=False,
            device=mlp.gate_proj.weight.device,
            dtype=mlp.gate_proj.weight.dtype,
        )
        self.down_proj = mlp.down_proj
        self.act_fn = mlp.act_fn

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        pass


class RMSNormMLP(nn.Module):
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        hidden_states = self.scale * hidden_states.to(input_dtype)
        gate, up = self.gate_up_proj(hidden_states).chunk(2, dim=-1)
        return self.down_proj(self.act_fn(gate) * up)


class layers:
    RMSNormMLP = RMSNormMLP
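
The reason `Concatenate(dim=0)` pairs with `.chunk(2, dim=-1)` in the forward pass can be checked with a toy sketch. Plain Python lists stand in for tensors here, and `matvec` is a hypothetical helper; weights use the `[out_features, in_features]` layout of `nn.Linear`.

```python
def matvec(weight, x):
    """Multiply a [out_features, in_features] matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Toy weights in nn.Linear layout: [out_features, in_features].
gate_weight = [[1, 0, 2], [0, 1, 1]]
up_weight = [[3, 1, 0], [2, 2, 2]]
x = [1, 2, 3]

# Concatenating along dim 0 stacks the rows (output features) of both matrices.
gate_up_weight = gate_weight + up_weight

fused_out = matvec(gate_up_weight, x)
half = len(fused_out) // 2
gate, up = fused_out[:half], fused_out[half:]  # mirrors .chunk(2, dim=-1)

# The fused projection reproduces the two separate projections.
assert gate == matvec(gate_weight, x)
assert up == matvec(up_weight, x)
print(gate, up)
```

Splitting into equal halves only recovers `gate` and `up` when both projections have the same number of output features, as in this toy case.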

HuggingFaceDocBuilderDev · 2026-06-02T09:33:15Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker

Much better!

ArthurZucker · 2026-06-04T08:44:16Z

+            if self.kernel_config is not None:
+                from kernels import use_kernel_mapping
+
+                inherit_mapping = not self.kernel_config.use_local_kernel
+                with use_kernel_mapping(self.kernel_config.kernel_mapping, inherit_mapping=inherit_mapping):
+                    kernelize(self, device=Device(type=self.device.type), mode=mode)
+            else:
+                kernelize(self, device=Device(type=self.device.type), mode=mode)


Suggested change

-            if self.kernel_config is not None:
-                from kernels import use_kernel_mapping
-
-                inherit_mapping = not self.kernel_config.use_local_kernel
-                with use_kernel_mapping(self.kernel_config.kernel_mapping, inherit_mapping=inherit_mapping):
-                    kernelize(self, device=Device(type=self.device.type), mode=mode)
-            else:
-                kernelize(self, device=Device(type=self.device.type), mode=mode)
+            kernelize(self, device=Device(type=self.device.type), mode=mode, self.kernel_config)

let's reduce surface as much as possible

kernelize is defined in kernels. I can make a PR there, but for now it cannot be changed here.

okay! we can also just create def kernelize to put in kernels utils!

ArthurZucker · 2026-06-04T08:55:41Z

+        for module in meta_model.modules():
+            module_cls = type(module)
+            if module_cls in seen:
+                continue
+            if not all(hasattr(module, name) for name in child_names):
+                continue
+            seen.add(module_cls)


I don't think we need to iterate over all the modules!
We could register explicit paths like we do for the TP plan; we like explicitness in general!

{ "layers.*.self_attn.q_proj" : XXXX}

MOST important comment IMO: if the contract is more like this, we get a lot of simplifications, no?

We already have this contract.

kernel_config = KernelConfig(
    {
        (
            ("RMSNorm", "model.layers.*.post_attention_layernorm"),
            ("MLP", "model.layers.*.mlp"),
        ): kernel_repo_id,
    },
)

I will update this loop
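
As an illustration of how explicit path patterns such as `model.layers.*.mlp` could select modules without scanning by attribute, here is a minimal sketch. It assumes `*` matches exactly one dot-free path segment, which may differ from the matching actually used in Transformers; `pattern_to_regex` and the module names are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Hypothetical helper: turn 'a.*.b' into a regex where '*' is one path segment."""
    parts = [r"[^.]+" if part == "*" else re.escape(part) for part in pattern.split(".")]
    return re.compile(r"\.".join(parts))


# Illustrative module names, as named_modules() might report them.
module_names = [
    "model.layers.0.post_attention_layernorm",
    "model.layers.0.mlp",
    "model.layers.1.post_attention_layernorm",
    "model.layers.1.mlp",
    "model.norm",
]

norm_pattern = pattern_to_regex("model.layers.*.post_attention_layernorm")
mlp_pattern = pattern_to_regex("model.layers.*.mlp")

norm_matches = [name for name in module_names if norm_pattern.fullmatch(name)]
mlp_matches = [name for name in module_names if mlp_pattern.fullmatch(name)]
print(norm_matches)
print(mlp_matches)
```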

ArthurZucker · 2026-06-04T08:55:58Z

+    kernel_config.kernel_mapping = new_mapping
+
+
+def register_kernel_fusions(


let's do both in a single func!

ArthurZucker · 2026-06-04T08:56:30Z

+def _first_str_leaf(obj) -> str | None:
+    """Recursively extract the first string leaf from a potentially nested dict (device → mode → str)."""
+    if isinstance(obj, str):
+        return obj
+    if isinstance(obj, dict):
+        for v in obj.values():
+            result = _first_str_leaf(v)
+            if result is not None:
+                return result
+    return None


ArthurZucker · 2026-06-04T08:57:19Z

        ALLOW_ALL_KERNELS = False


+def make_kernel_init_parent_class(


this is super important and needs to be documented well:

we replace the fused cls by identity

thus we have to patch some inits, etc.
also do we even have to patch inits when the proper class replaces the one that holds them?

ArthurZucker

Much much better! It's just missing a piece of doc / an update to the doc for monkey patching, and maybe some benchmarks if you have them, but that's fine for another PR!

Ty for iterating, it's quite nice now!

ArthurZucker · 2026-06-09T10:17:57Z

+    new_mapping: dict = {}
+
+    # We might need to instantiate the model on meta device.
+    # We do it lazily, only if we encounter a fused kernel.


ArthurZucker · 2026-06-09T10:18:12Z

+        else:
+            raise ValueError(f"Invalid hub repo {hub_repo!r} for layer {layer_name!r}")
+
+        repo_id, _, layer_name_in_repo = repo_str.partition(":")
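
The `partition(":")` call splits a `repo:LayerName` string at the first colon. A short standalone sketch of that behavior follows; `split_repo_string` is a hypothetical wrapper that does not exist in the PR.

```python
def split_repo_string(repo_str: str) -> tuple:
    """Hypothetical wrapper around the partition call shown in the diff."""
    repo_id, _, layer_name_in_repo = repo_str.partition(":")
    return repo_id, layer_name_in_repo


with_layer = split_repo_string("michaelbenayoun/dummy-rmsnorm-kernel-with-init:CustomRMSNorm")
without_layer = split_repo_string("michaelbenayoun/dummy-rmsnorm-kernel-with-init")
print(with_layer)
print(without_layer)
```

When the string contains no colon, `partition` returns the whole string as the first element and an empty string for the layer name, so callers would need to handle that case explicitly.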


ArthurZucker · 2026-06-09T10:19:18Z

+
+            if meta_model is None:
+                with torch.device("meta"):
+                    meta_model = cls(config)


Suggested change

-                    meta_model = cls(config)
+                    meta__modules = cls(config).named_modules()

we only need these

ah, maybe it gets updated, but that's good: you can't update twice, so it's even better in a way, no? (to not re-compute the named modules)

ArthurZucker · 2026-06-09T10:19:53Z

+            if self.kernel_config is not None:
+                from kernels import use_kernel_mapping
+
+                inherit_mapping = not self.kernel_config.use_local_kernel
+                with use_kernel_mapping(self.kernel_config.kernel_mapping, inherit_mapping=inherit_mapping):
+                    kernelize(self, device=Device(type=self.device.type), mode=mode)
+            else:
+                kernelize(self, device=Device(type=self.device.type), mode=mode)


okay! we can also just create def kernelize to put in kernels utils!

michaelbenayoun · 2026-06-09T12:49:33Z

For the kernelize refactor, it is done here: #46520.

…ace#46339)

* feat: module fusion API for kernels
* fix: improve __repr__ for fused modules
* wip: integration to KernelConfig
* wip: add temporary example
* wip: pattern matching in KernelConfig and actual kernel repo
* refactor: move relevant code to hub_kernels.py
* docs: reformat docstring
* refactor: remove comment
* refactor: update example script for testing
* wip: remove apply_fusions method
* wip: add core feature for integration with the current fusing API
* fix: move kernel mapping patching to kernelize
* wip: update example script
* wip: add transform_model method for WeightTransform
* wip: conversion_mapping in Kernel
* wip: remove things from __all__
* wip: remove imports
* fix: remove register_fusion_pattern path
* fix: remove unused attribute
* wip: update experimentation script
* refactor: add convert as abstract method
* style: reformat hub_kernels.py
* wip: transform_model API
* wip: transform_model API, WeightTransform
* wip: transform_model API, WeightConverter
* wip: transform_model API, WeightConverter
* wip: make transform_model idempotent
* refactor: infer_kernel_fusion_transforms
* style: regexs -> regexes
* refactor: register_kernel_fusions
* refactor: post transformation cleanup
* style: fix comment
* test: add TestApplyTransformsToMetaModel tests
* test: add kernels test
* test: fix hub_kernels package reload
* style: ruff
* refactor: do not create dynamic classes in test
* refactor: no dynamic class creation in tests
* refactor: test
* fix: TYPE_CHECKING imports were broken
* wip: get rid of transform_model methods
* wip: move tests
* wip: make conversion happen before fused module instantiation
* refactor
* wip: move conversion_mapping inside the init
* wip: without any transform_model
* wip: remove dead code
* wip: api imrpovement
* wip: refactor
* wip: enable __init__ support in kernels
* wip: fuse + init
* clean: remove "dead" code
* wip: use two classes in kernels
* wip: remove docstring
* test: add relevant tests
* chore: remove experiment file
* cleanup: remove helper function
* cleanup: remove helper function
* refactor: merge the two register kernel functions into one
* cleanup: use explicit regex patterns to match for monkey patching
* test: cleanup and update tests
* doc: add docstring to make_parent_class_for_kernel_fusion

michaelbenayoun added 30 commits April 10, 2026 14:52

feat: module fusion API for kernels

b387190

fix: improve __repr__ for fused modules

6bc9402

wip: integration to KernelConfig

62d4454

wip: add temporary example

4082fe1

wip: pattern matching in KernelConfig and actual kernel repo

ac4a699

refactor: move relevant code to hub_kernels.py

e13111f

docs: reformat docstring

d9d53f0

refactor: remove comment

e1c7f3f

Merge branch 'main' into fused_kernels

db0b7f0

Merge branch 'main' into fused_kernels

e21d06e

refactor: update example script for testing

bd640ae

wip: remove apply_fusions method

323b000

wip: add core feature for integration with the current fusing API

fe3002d

fix: move kernel mapping patching to kernelize

b541453

wip: update example script

b3d73a7

wip: add transform_model method for WeightTransform

0845222

wip: conversion_mapping in Kernel

ecfd97d

Merge branch 'main' into extended_kernels_api

973a616

wip: remove things from __all__

0f0a64b

wip: remove imports

91177ae

fix: remove register_fusion_pattern path

2636c06

Merge branch 'main' into extended_kernels_api

573d9f0

fix: remove unused attribute

f7c15bd

wip: update experimentation script

f9d4299

refactor: add convert as abstract method

847dbd4

style: reformat hub_kernels.py

4443c9a

wip: transform_model API

51c59c9

wip: transform_model API, WeightTransform

4c58503

wip: transform_model API, WeightConverter

a7f983f

wip: transform_model API, WeightConverter

b8d860f

michaelbenayoun added 6 commits May 26, 2026 15:43

wip: remove dead code

ad0e24e

wip: api imrpovement

3924cf3

wip: refactor

c948834

wip: enable __init__ support in kernels

e0c0366

wip: fuse + init

da0fdae

clean: remove "dead" code

06add71

michaelbenayoun added 5 commits June 2, 2026 13:51

wip: use two classes in kernels

597bb8c

wip: remove docstring

1489146

Merge branch 'main' into extended_kernel_api_easy

024fd4c

test: add relevant tests

b20cb71

chore: remove experiment file

1fb1787

michaelbenayoun requested review from ArthurZucker and Cyrilvallez June 2, 2026 14:39

ArthurZucker reviewed Jun 4, 2026

michaelbenayoun added 6 commits June 4, 2026 11:25

cleanup: remove helper function

2e87c9f

cleanup: remove helper function

11ecc56

refactor: merge the two register kernel functions into one

d98489a

cleanup: use explicit regex patterns to match for monkey patching

89adf48

test: cleanup and update tests

c841433

doc: add docstring to make_parent_class_for_kernel_fusion

edf33ab

michaelbenayoun requested a review from ArthurZucker June 8, 2026 09:36

ArthurZucker approved these changes Jun 9, 2026

michaelbenayoun added this pull request to the merge queue Jun 9, 2026

Merged via the queue into huggingface:main with commit 5047f08 Jun 9, 2026
117 of 118 checks passed

michaelbenayoun deleted the extended_kernel_api_easy branch June 9, 2026 12:21
