Lfm2: thread `seq_idx` through ShortConv for packed/varlen inputs by ChangyiYang · Pull Request #46588 · huggingface/transformers

ChangyiYang · 2026-06-12T07:03:59Z

What this PR fixes

Lfm2ShortConv.cuda_kernels_forward calls causal_conv1d_fn without
seq_idx. When the model is run on packed/varlen inputs — e.g. multiple
samples flattened into a single sequence with boundaries marked via
cu_seqlens (derived from position_ids), as in any sequence-packed
training pipeline — the 1-D conv slides across sample boundaries and the
conv state of one sample leaks into the next.

Attention is already isolated per-sample (FlashAttention / cu_seqlens),
but conv state is not. This causes a per-token logprob divergence
between the packed and unpacked forwards of the same sample, observable
as a higher k3_kl when comparing the actor's packed forward against an
inference engine's per-sample forward.
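
For readers unfamiliar with the metric, here is a minimal sketch of what a k3-style KL estimate between two sets of per-token logprobs might look like. It assumes k3_kl refers to the low-variance "k3" estimator, (r - 1) - log r; the function name `k3_estimate`, the direction of the ratio, and the mean reduction are all illustrative assumptions, not taken from this PR or from any training framework.

```python
import math


def k3_estimate(logp_actor, logp_ref):
    """Mean k3-style KL estimate over per-token logprobs.

    Assumption: r = exp(logp_ref - logp_actor) and k3 = (r - 1) - log(r).
    Both the ratio direction and the mean reduction are illustrative.
    """
    total = 0.0
    for la, lr in zip(logp_actor, logp_ref):
        log_ratio = lr - la
        total += math.exp(log_ratio) - 1.0 - log_ratio
    return total / len(logp_actor)


# Identical logprobs give exactly zero; any mismatch gives a positive value.
same = k3_estimate([-1.0, -2.0], [-1.0, -2.0])
shifted = k3_estimate([-1.0, -2.0], [-1.2, -1.7])
```

Under these assumptions, each per-token term is non-negative, so any logprob mismatch introduced by conv state leaking across sample boundaries can only raise the estimate.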

The fix follows the standard pattern already used by bamba, mamba2,
qwen3_next, qwen3_5(_moe), falcon_h1, zamba2, nemotron_h,
granitemoehybrid, …: pass a per-token int tensor seq_idx (segment
i labels its own tokens with i) to causal_conv1d_fn; the CUDA
kernel then resets conv state at every boundary.
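
The leak and the reset described above can be reproduced with a small pure-Python reference. This is a sketch of the semantics only, for a single channel and no bias or activation; `causal_conv1d_ref` is a hypothetical helper and is not the CUDA kernel or its API.

```python
def causal_conv1d_ref(x, weight, seq_idx=None):
    """Causal 1-D conv over one channel of a packed sequence.

    x: list of floats, tokens in packed order
    weight: list of K taps; weight[-1] multiplies the current token
    seq_idx: optional list of ints, one segment id per token
    """
    k = len(weight)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j in range(k):
            src = t - (k - 1) + j
            if src < 0:
                continue  # implicit zero padding on the left
            if seq_idx is not None and seq_idx[src] != seq_idx[t]:
                continue  # source token is in another sample: treat as zero
            acc += weight[j] * x[src]
        out.append(acc)
    return out


sample_a = [1.0, 2.0, 3.0]
sample_b = [10.0, 20.0]
weight = [0.5, 0.25, 1.0]

packed = sample_a + sample_b
seq_idx = [0, 0, 0, 1, 1]

# Per-sample forwards, concatenated: the behaviour packing should reproduce.
separate = causal_conv1d_ref(sample_a, weight) + causal_conv1d_ref(sample_b, weight)
# Packed forward without seq_idx: sample_b sees the tail of sample_a.
leaky = causal_conv1d_ref(packed, weight)
# Packed forward with seq_idx: the window is cut at the boundary.
isolated = causal_conv1d_ref(packed, weight, seq_idx)
```

In this toy setup only the first K - 1 tokens of the second sample differ between `leaky` and `separate`, which is consistent with the divergence being concentrated right after each packing boundary.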

Changes

- Lfm2ShortConv.cuda_kernels_forward and Lfm2ShortConv.forward accept
  a new seq_idx: torch.IntTensor | None = None kwarg and forward it to
  causal_conv1d_fn.
- Lfm2DecoderLayer.forward passes seq_idx=kwargs.get("seq_idx") to
  self.conv(...) so callers can supply it through the model forward
  kwargs (same convention as qwen3_next).
- Backward-compatible: seq_idx=None (the default) preserves the prior
  behaviour exactly; the causal_conv1d_fn kernel treats None as "no
  boundaries". No change for unpacked / single-sequence inference.
- LFM2-MoE inherits via class Lfm2MoeShortConv(Lfm2ShortConv): pass,
  so the change benefits LFM2-MoE too.
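
The kwarg threading listed above can be illustrated with toy stand-ins. The classes below are hypothetical and contain none of the real model logic; they only show how an optional `seq_idx` travels from the layer's `**kwargs` to the conv, how the `None` default keeps old callers working, and how a `pass` subclass picks up the new parameter at runtime.

```python
class ToyShortConv:
    def forward(self, x, seq_idx=None):
        # Records what it received; a real layer would hand seq_idx to the kernel.
        return {"x": x, "seq_idx": seq_idx}


class ToyMoeShortConv(ToyShortConv):
    # Empty subclass: inherits the seq_idx-aware forward unchanged.
    pass


class ToyDecoderLayer:
    def __init__(self, conv):
        self.conv = conv

    def forward(self, x, **kwargs):
        # kwargs.get returns None when the caller did not pass seq_idx,
        # so unpacked callers see the old behaviour.
        return self.conv.forward(x, seq_idx=kwargs.get("seq_idx"))


layer = ToyDecoderLayer(ToyMoeShortConv())
unpacked = layer.forward([1.0, 2.0])
packed = layer.forward([1.0, 2.0, 3.0], seq_idx=[0, 0, 1], attention_mask=None)
```

Python inheritance covers the runtime classes; the generated modeling file is a separate concern, which is what the modular conversion note addresses.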

Note on modular conversion

The change to the parent class body in modular_lfm2.py did not
propagate into the generated modeling_lfm2_moe.py when running
utils/modular_model_converter.py locally (the subclass body in
modular_lfm2_moe.py is pass). I applied the same change manually in
modeling_lfm2_moe.py to keep modeling and modular intent in sync.
Happy to instead extend modular_lfm2_moe.py with explicit method
overrides if reviewers prefer that, or to dig into the converter.

Caller usage (example)

# Build seq_idx for a flattened batch with cu_seqlens (LongTensor[N+1]):
seq_idx = torch.zeros(total_tokens, dtype=torch.int32, device=device)
for i in range(len(cu_seqlens) - 1):
    seq_idx[cu_seqlens[i]:cu_seqlens[i+1]] = i
outputs = model(input_ids=flat, position_ids=pos_ids, seq_idx=seq_idx)
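
When only position_ids are at hand, the boundaries can be recovered first. The following is a pure-Python sketch that assumes position_ids restart at 0 at the start of every packed sample (including the first); both helper names are hypothetical.

```python
def cu_seqlens_from_position_ids(position_ids):
    """Return [start_0, start_1, ..., total_tokens] for a flattened batch.

    Assumption: every packed sample starts with position id 0.
    """
    starts = [i for i, pos in enumerate(position_ids) if pos == 0]
    return starts + [len(position_ids)]


def seq_idx_from_cu_seqlens(cu_seqlens):
    """Label each token with the index of the segment it belongs to."""
    seq_idx = []
    for i in range(len(cu_seqlens) - 1):
        seq_idx.extend([i] * (cu_seqlens[i + 1] - cu_seqlens[i]))
    return seq_idx


position_ids = [0, 1, 2, 0, 1, 0]
cu_seqlens = cu_seqlens_from_position_ids(position_ids)
seq_idx = seq_idx_from_cu_seqlens(cu_seqlens)
```

With tensors, the same labelling could likely be done without a Python loop, for example via `torch.repeat_interleave` over the segment lengths. As far as I know, the causal_conv1d docstring describes seq_idx as an int32 tensor of shape (batch, seqlen), so a flattened batch may need a leading batch dimension of 1; this is worth checking against the installed version.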

🤖 Generated with Claude Code

github-actions · 2026-06-12T07:05:11Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: lfm2, lfm2_moe

`Lfm2ShortConv.cuda_kernels_forward` currently calls `causal_conv1d_fn` without `seq_idx`, so when the model is run on packed/varlen inputs (e.g. HuggingFace-trainer-style sequence packing or any caller that flattens multiple samples into a single sequence and supplies `cu_seqlens` via `position_ids`), the 1-D conv slides across sample boundaries and contaminates one sample with the trailing tokens of the previous one. Attention is already isolated per-sample via `cu_seqlens` / FlashAttention, but conv state leaks.

The fix matches the pattern used by `bamba`, `mamba2`, `qwen3_next`, `qwen3_5(_moe)`, `falcon_h1`, `zamba2`, `nemotron_h`, `granitemoehybrid`, etc.: pass a per-token int tensor `seq_idx` (segment `i` labels its own tokens with `i`) to `causal_conv1d_fn`; the CUDA kernel resets conv state at every boundary.

Changes:
- `Lfm2ShortConv.cuda_kernels_forward` / `forward` accept `seq_idx` and forward it to `causal_conv1d_fn`.
- `Lfm2DecoderLayer.forward` passes `seq_idx=kwargs.get("seq_idx")` to `self.conv(...)` so callers can supply it via the model `forward` kwargs (same pattern qwen3_next uses).
- Backward-compatible: `seq_idx=None` (the default) preserves the prior behaviour.
- Same change applied to `Lfm2MoeShortConv` / `Lfm2MoeDecoderLayer` so LFM2-MoE benefits.

NOTE: I applied this to the generated `modeling_lfm2_moe.py` manually because the modular converter does not currently re-resolve the parent class body when re-running conversion for a `class Lfm2MoeShortConv(Lfm2ShortConv): pass` subclass; happy to revisit modular_lfm2_moe.py if reviewers prefer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-12T07:10:19Z

CI Dashboard: View test results in Grafana

Rocketknight1

Yes, this looks good! It matches Qwen, and I trust the Qwen logic because it has a lot of usage

HuggingFaceDocBuilderDev · 2026-06-12T13:08:50Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…path) (huggingface#46633)

Followup to huggingface#46588. When `causal_conv1d` isn't installed (e.g. AMD ROCm containers), `Lfm2ShortConv.forward` falls back to `slow_forward`, which uses plain `nn.Conv1d` and slides across the full packed sequence, so the boundary-smear bug persists exactly as it did before huggingface#46588 on that path.

Add a `seq_idx` arg to `slow_forward`; when supplied (and there's no kv cache), slice `Bx` by segment boundaries derived from `seq_idx` and conv each packed sample independently, matching what `causal_conv1d_fn(..., seq_idx=...)` does on the fast path.

Validated end-to-end in verl (LFM2-MoE 8B RL training, AMD MI300X, ROCm, no causal_conv1d): step-1 k3_kl drops 0.171 -> 0.093 once both the fast-path (huggingface#46588) and this slow-path patch are applied (the fast-path patch alone was a no-op there).

Same change to `Lfm2MoeShortConv.slow_forward` applied manually (the modular converter currently doesn't re-resolve a parent body whose subclass is `class Lfm2MoeShortConv(Lfm2ShortConv): pass`, same caveat as huggingface#46588).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
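
The slow-path idea in the follow-up (slice by segment boundaries, then convolve each packed sample on its own) can be sketched in pure Python. This is an illustration of the approach for a single channel, not the code from huggingface#46633; all function names are hypothetical.

```python
from itertools import groupby


def causal_conv(x, weight):
    """Plain causal conv over one unpacked sample (left zero padding)."""
    k = len(weight)
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(w * v for w, v in zip(weight, padded[t:t + k]))
        for t in range(len(x))
    ]


def segment_lengths(seq_idx):
    """Lengths of consecutive runs of equal segment ids."""
    return [len(list(group)) for _, group in groupby(seq_idx)]


def packed_causal_conv(x, weight, seq_idx):
    """Convolve each packed sample independently and re-concatenate."""
    out, start = [], 0
    for n in segment_lengths(seq_idx):
        out.extend(causal_conv(x[start:start + n], weight))
        start += n
    return out


packed = [1.0, 2.0, 3.0, 10.0, 20.0]
weight = [0.5, 0.25, 1.0]
result = packed_causal_conv(packed, weight, [0, 0, 0, 1, 1])
```

Slicing trades one large convolution for several small ones, which is simple to reason about and needs no custom kernel, at the cost of a Python-level loop over samples.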

ChangyiYang force-pushed the lfm2-conv-seq-idx branch from 4357107 to cfd7003 on June 12, 2026 07:09

Rocketknight1 approved these changes Jun 12, 2026

Rocketknight1 enabled auto-merge June 12, 2026 12:57

Rocketknight1 added this pull request to the merge queue Jun 12, 2026

Merged via the queue into huggingface:main with commit 2d68208 Jun 12, 2026
25 checks passed

ChangyiYang mentioned this pull request Jun 13, 2026

Lfm2: also thread seq_idx through ShortConv.slow_forward (non-fast-path) #46633

Merged

Rocketknight1 merged 1 commit into huggingface:main from ChangyiYang:lfm2-conv-seq-idx
