Skip to content

Lfm2: thread seq_idx through ShortConv for packed/varlen inputs#46588

Merged
Rocketknight1 merged 1 commit into
huggingface:mainfrom
ChangyiYang:lfm2-conv-seq-idx
Jun 12, 2026
Merged

Lfm2: thread seq_idx through ShortConv for packed/varlen inputs#46588
Rocketknight1 merged 1 commit into
huggingface:mainfrom
ChangyiYang:lfm2-conv-seq-idx

Conversation

@ChangyiYang

Copy link
Copy Markdown
Contributor

What this PR fixes

Lfm2ShortConv.cuda_kernels_forward calls causal_conv1d_fn without
seq_idx. When the model is run on packed/varlen inputs — e.g. multiple
samples flattened into a single sequence with boundaries marked via
cu_seqlens (derived from position_ids), as in any sequence-packed
training pipeline — the 1-D conv slides across sample boundaries and the
conv state of one sample leaks into the next.

Attention is already isolated per-sample (FlashAttention / cu_seqlens),
but conv state is not. This causes a per-token logprob divergence
between packed-and-unpacked forwards of the same sample, observable as
a higher k3_kl when comparing the actor's packed forward against an
inference engine's per-sample forward.

This is the standard pattern already used by bamba, mamba2,
qwen3_next, qwen3_5(_moe), falcon_h1, zamba2, nemotron_h,
granitemoehybrid, …: pass a per-token int tensor seq_idx (segment
i labels its own tokens with i) to causal_conv1d_fn; the CUDA
kernel resets conv state at every boundary.

Changes

  • Lfm2ShortConv.cuda_kernels_forward and Lfm2ShortConv.forward accept
    a new seq_idx: torch.IntTensor | None = None kwarg and forward it to
    causal_conv1d_fn.
  • Lfm2DecoderLayer.forward passes seq_idx=kwargs.get("seq_idx") to
    self.conv(...) so callers can supply it through the model forward
    kwargs (same convention as qwen3_next).
  • Backward-compatible: seq_idx=None (the default) preserves the prior
    behaviour exactly; the causal_conv1d_fn kernel treats None as "no
    boundaries". No change for unpacked / single-sequence inference.
  • LFM2-MoE inherits via class Lfm2MoeShortConv(Lfm2ShortConv): pass,
    so the change benefits LFM2-MoE too.

Note on modular conversion

The change to the parent class body in modular_lfm2.py did not
propagate into the generated modeling_lfm2_moe.py when running
utils/modular_model_converter.py locally (the subclass body in
modular_lfm2_moe.py is pass). I applied the same change manually in
modeling_lfm2_moe.py to keep modeling and modular intent in sync.
Happy to instead extend modular_lfm2_moe.py with explicit method
overrides if reviewers prefer that, or to dig into the converter.

Caller usage (example)

# Build seq_idx for a flattened batch with cu_seqlens (LongTensor[N+1]):
seq_idx = torch.zeros(total_tokens, dtype=torch.int32, device=device)
for i in range(len(cu_seqlens) - 1):
    seq_idx[cu_seqlens[i]:cu_seqlens[i+1]] = i
outputs = model(input_ids=flat, position_ids=pos_ids, seq_idx=seq_idx)

🤖 Generated with Claude Code

@github-actions

Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: lfm2, lfm2_moe

`Lfm2ShortConv.cuda_kernels_forward` currently calls `causal_conv1d_fn`
without `seq_idx`, so when the model is run on packed/varlen inputs (e.g.
HuggingFace-trainer-style sequence packing or any caller that flattens
multiple samples into a single sequence and supplies cu_seqlens via
`position_ids`), the 1-D conv slides across sample boundaries and
contaminates one sample with the trailing tokens of the previous one.
Attention is already isolated per-sample via `cu_seqlens` / FlashAttention,
but conv state leaks.

This matches the pattern used by `bamba`, `mamba2`, `qwen3_next`,
`qwen3_5(_moe)`, `falcon_h1`, `zamba2`, `nemotron_h`, `granitemoehybrid`,
etc.: pass a per-token int tensor `seq_idx` (segment `i` labels its own
tokens with `i`) to `causal_conv1d_fn`; the CUDA kernel resets conv state
at every boundary.

Changes:
- `Lfm2ShortConv.cuda_kernels_forward` / `forward` accept `seq_idx`
  and forward it to `causal_conv1d_fn`.
- `Lfm2DecoderLayer.forward` passes `seq_idx=kwargs.get("seq_idx")` to
  `self.conv(...)` so callers can supply it via the model `forward`
  kwargs (same pattern qwen3_next uses).
- Backward-compatible: `seq_idx=None` (the default) preserves the prior
  behaviour.
- Same change applied to `Lfm2MoeShortConv` / `Lfm2MoeDecoderLayer` so
  LFM2-MoE benefits. NOTE: I applied this to the generated
  `modeling_lfm2_moe.py` manually because the modular converter does
  not currently re-resolve the parent class body when re-running
  conversion for a `class Lfm2MoeShortConv(Lfm2ShortConv): pass`
  subclass; happy to revisit modular_lfm2_moe.py if reviewers prefer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

CI Dashboard: View test results in Grafana

@Rocketknight1 Rocketknight1 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this looks good! It matches Qwen, and I trust the Qwen logic because it has a lot of usage

@Rocketknight1 Rocketknight1 enabled auto-merge June 12, 2026 12:57
@Rocketknight1 Rocketknight1 added this pull request to the merge queue Jun 12, 2026
@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Merged via the queue into huggingface:main with commit 2d68208 Jun 12, 2026
25 checks passed
lcheng321 pushed a commit to lcheng321/transformers that referenced this pull request Jun 16, 2026
…path) (huggingface#46633)

Followup to huggingface#46588. When `causal_conv1d` isn't installed (e.g. AMD ROCm
containers), `Lfm2ShortConv.forward` falls back to `slow_forward` which
uses plain `nn.Conv1d` and slides across the full packed sequence — so
the boundary-smear bug persists exactly as it did before huggingface#46588 on that
path. Add a `seq_idx` arg to `slow_forward`; when supplied (and there's
no kv cache), slice `Bx` by segment boundaries derived from `seq_idx`
and conv each packed sample independently, matching what
`causal_conv1d_fn(..., seq_idx=...)` does on the fast path.

Validated end-to-end in verl (LFM2-MoE 8B RL training, AMD MI300X,
ROCm, no causal_conv1d): step-1 k3_kl drops 0.171 -> 0.093 once both
the fast-path (huggingface#46588) and this slow-path patch are applied (the
fast-path patch alone was a no-op there).

Same change to `Lfm2MoeShortConv.slow_forward` applied manually (the
modular converter currently doesn't re-resolve a parent body whose
subclass is `class Lfm2MoeShortConv(Lfm2ShortConv): pass`, same caveat
as huggingface#46588).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants