Lfm2: thread seq_idx through ShortConv for packed/varlen inputs#46588
Merged
Conversation
Contributor
|
[For maintainers] Suggested jobs to run (before merge) run-slow: lfm2, lfm2_moe |
`Lfm2ShortConv.cuda_kernels_forward` currently calls `causal_conv1d_fn`
without `seq_idx`, so when the model is run on packed/varlen inputs (e.g.
HuggingFace-trainer-style sequence packing or any caller that flattens
multiple samples into a single sequence and supplies cu_seqlens via
`position_ids`), the 1-D conv slides across sample boundaries and
contaminates one sample with the trailing tokens of the previous one.
Attention is already isolated per-sample via `cu_seqlens` / FlashAttention,
but conv state leaks.
This matches the pattern used by `bamba`, `mamba2`, `qwen3_next`,
`qwen3_5(_moe)`, `falcon_h1`, `zamba2`, `nemotron_h`, `granitemoehybrid`,
etc.: pass a per-token int tensor `seq_idx` (segment `i` labels its own
tokens with `i`) to `causal_conv1d_fn`; the CUDA kernel resets conv state
at every boundary.
Changes:
- `Lfm2ShortConv.cuda_kernels_forward` / `forward` accept `seq_idx`
and forward it to `causal_conv1d_fn`.
- `Lfm2DecoderLayer.forward` passes `seq_idx=kwargs.get("seq_idx")` to
`self.conv(...)` so callers can supply it via the model `forward`
kwargs (same pattern qwen3_next uses).
- Backward-compatible: `seq_idx=None` (the default) preserves the prior
behaviour.
- Same change applied to `Lfm2MoeShortConv` / `Lfm2MoeDecoderLayer` so
LFM2-MoE benefits. NOTE: I applied this to the generated
`modeling_lfm2_moe.py` manually because the modular converter does
not currently re-resolve the parent class body when re-running
conversion for a `class Lfm2MoeShortConv(Lfm2ShortConv): pass`
subclass; happy to revisit modular_lfm2_moe.py if reviewers prefer.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
4357107 to
cfd7003
Compare
Contributor
|
CI Dashboard: View test results in Grafana |
Rocketknight1
approved these changes
Jun 12, 2026
Rocketknight1
left a comment
Member
There was a problem hiding this comment.
Yes, this looks good! It matches Qwen, and I trust the Qwen logic because it has a lot of usage
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
lcheng321
pushed a commit
to lcheng321/transformers
that referenced
this pull request
Jun 16, 2026
…path) (huggingface#46633) Followup to huggingface#46588. When `causal_conv1d` isn't installed (e.g. AMD ROCm containers), `Lfm2ShortConv.forward` falls back to `slow_forward` which uses plain `nn.Conv1d` and slides across the full packed sequence — so the boundary-smear bug persists exactly as it did before huggingface#46588 on that path. Add a `seq_idx` arg to `slow_forward`; when supplied (and there's no kv cache), slice `Bx` by segment boundaries derived from `seq_idx` and conv each packed sample independently, matching what `causal_conv1d_fn(..., seq_idx=...)` does on the fast path. Validated end-to-end in verl (LFM2-MoE 8B RL training, AMD MI300X, ROCm, no causal_conv1d): step-1 k3_kl drops 0.171 -> 0.093 once both the fast-path (huggingface#46588) and this slow-path patch are applied (the fast-path patch alone was a no-op there). Same change to `Lfm2MoeShortConv.slow_forward` applied manually (the modular converter currently doesn't re-resolve a parent body whose subclass is `class Lfm2MoeShortConv(Lfm2ShortConv): pass`, same caveat as huggingface#46588). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR fixes
Lfm2ShortConv.cuda_kernels_forwardcallscausal_conv1d_fnwithoutseq_idx. When the model is run on packed/varlen inputs — e.g. multiplesamples flattened into a single sequence with boundaries marked via
cu_seqlens(derived fromposition_ids), as in any sequence-packedtraining pipeline — the 1-D conv slides across sample boundaries and the
conv state of one sample leaks into the next.
Attention is already isolated per-sample (FlashAttention /
cu_seqlens),but conv state is not. This causes a per-token logprob divergence
between packed-and-unpacked forwards of the same sample, observable as
a higher k3_kl when comparing the actor's packed forward against an
inference engine's per-sample forward.
This is the standard pattern already used by
bamba,mamba2,qwen3_next,qwen3_5(_moe),falcon_h1,zamba2,nemotron_h,granitemoehybrid, …: pass a per-token int tensorseq_idx(segmentilabels its own tokens withi) tocausal_conv1d_fn; the CUDAkernel resets conv state at every boundary.
Changes
Lfm2ShortConv.cuda_kernels_forwardandLfm2ShortConv.forwardaccepta new
seq_idx: torch.IntTensor | None = Nonekwarg and forward it tocausal_conv1d_fn.Lfm2DecoderLayer.forwardpassesseq_idx=kwargs.get("seq_idx")toself.conv(...)so callers can supply it through the modelforwardkwargs (same convention as
qwen3_next).seq_idx=None(the default) preserves the priorbehaviour exactly; the
causal_conv1d_fnkernel treatsNoneas "noboundaries". No change for unpacked / single-sequence inference.
class Lfm2MoeShortConv(Lfm2ShortConv): pass,so the change benefits LFM2-MoE too.
Note on modular conversion
The change to the parent class body in
modular_lfm2.pydid notpropagate into the generated
modeling_lfm2_moe.pywhen runningutils/modular_model_converter.pylocally (the subclass body inmodular_lfm2_moe.pyispass). I applied the same change manually inmodeling_lfm2_moe.pyto keep modeling and modular intent in sync.Happy to instead extend
modular_lfm2_moe.pywith explicit methodoverrides if reviewers prefer that, or to dig into the converter.
Caller usage (example)
🤖 Generated with Claude Code