Describe the bug
Follow-up to #4562. With the talker capture fix and an up-to-date SM70 decode
kernel, the Stage-0 CUDA graph (low_latency) profile produces correct,
intelligible speech in real time — the output transcribes back to the input
prompt. However, it is not bit-identical to the eager profile: the waveform
is a different but valid rendering of the same words, with near-zero
correlation against the eager output, and it is slightly quieter.
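For anyone reproducing the comparison, the two observations above (near-zero correlation, slightly quieter) can be quantified roughly as follows. This is a minimal sketch, not the script used for this report: the function names are hypothetical, and it assumes both waveforms are already decoded to equal-length sequences of float samples at the same sample rate.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sample sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant signal")
    return sum(x * y for x, y in zip(da, db)) / denom


def rms(samples):
    """Root-mean-square level of a sample sequence."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def level_ratio_db(test, ref):
    """Level of `test` relative to `ref` in dB; negative means quieter."""
    return 20.0 * math.log10(rms(test) / rms(ref))


# Synthetic stand-ins (NOT data from this issue): two signals that are
# uncorrelated over a full period, with the second at half the amplitude.
n = 1000
ref = [math.sin(2 * math.pi * k / n) for k in range(n)]
test = [0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]
corr = pearson(test, ref)
delta_db = level_ratio_db(test, ref)
```

A correlation near zero together with a small negative `delta_db` would match the description above; the report itself gives no numeric values for either quantity.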
The divergence is isolated to the first decode step (the prefill->decode
transition). Per-step talker LM-logit means:
eager: step1 13.43 step2 0.312 step3 13.20 ...
graph: step1 13.43 step2 13.56 step3 13.59 ...
Step 1 (prefill, eager in both) and steps 3+ match eager within tolerance; only
step 2 (the first captured decode) diverges. The seed/feedback state is identical
at step 2 in both modes (count == 1, has_codes == 1), so it is not a
seed-timing difference.
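The per-step comparison can be expressed as a small check over the logit means listed above. This is a sketch only: `diverging_steps` is a hypothetical helper, and the tolerance of 0.5 is an assumption chosen for illustration, since the report does not state the tolerance used.

```python
def diverging_steps(eager, graph, tol):
    """Return 1-based step indices where the two traces differ by more than tol."""
    if len(eager) != len(graph):
        raise ValueError("traces must cover the same number of steps")
    return [
        step
        for step, (e, g) in enumerate(zip(eager, graph), start=1)
        if abs(e - g) > tol
    ]


# Per-step talker LM-logit means copied from this report (steps 1 to 3).
eager_means = [13.43, 0.312, 13.20]
graph_means = [13.43, 13.56, 13.59]

TOL = 0.5  # assumption: not a value taken from the report

bad_steps = diverging_steps(eager_means, graph_means, TOL)
print(bad_steps)  # [2]
```

With these three values, only step 2 exceeds the assumed tolerance, which matches the observation that the divergence sits at the first captured decode.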
Two candidate causes (not fully isolated):
- the audio-feedback embedding reading a stale _decode_last_codes (the BOC
seed) under capture at the first decode. Forcing has_codes = 1 does not change
the result, which is consistent with the codes themselves being stale rather
than the gate;
- the prefill->decode attention transition under FULL_DECODE_ONLY (the first
replay after an eager prefill).
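To make the first candidate concrete, here is a toy model of how a captured step can observe stale input. It is plain Python with no CUDA and is not the project's code: `CapturedStep` and the buffer contents are hypothetical. It only illustrates the general constraint that a graph replay reads from the buffers recorded at capture time, so new values must be written into those buffers in place.

```python
class CapturedStep:
    """Toy stand-in for a captured graph.

    It keeps a reference to the buffer object seen at capture time and
    always reads from that object on replay, like a fixed device address.
    """

    def __init__(self, codes_buffer):
        self._captured = codes_buffer

    def replay(self):
        # Stand-in for the embedding lookup: just reduce the codes.
        return sum(self._captured)


seed = [0, 0, 0, 0]   # hypothetical placeholder for the seed codes
step = CapturedStep(seed)

fresh = [3, 1, 4, 1]  # hypothetical codes produced after prefill

# Rebinding a name to the new list does not touch the captured buffer,
# so the replay still sees the seed values.
codes = fresh
stale_result = step.replay()

# Writing in place updates the object the captured step reads from.
seed[:] = fresh
fresh_result = step.replay()
```

If the real code path assigns a new tensor to _decode_last_codes before the first replay instead of copying into the captured one, the effect would look like `stale_result` here. Whether that is what happens is not established by this report.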
Impact
Parity / cosmetic only — the speech is correct and intelligible (verified by
transcription). Filing for awareness as a follow-up to the capture-crash fix;
not a blocker.
Environment
Tesla V100 / sm70, FLASH_ATTN_V100 backend.