[Bug]: [HiggsAudioV3] Talker CUDA graph output differs from eager at the first decode step (parity follow-up) #4564

@jajmangold

Describe the bug

Follow-up to #4562. With the talker capture fix and an up-to-date SM70 decode
kernel, the Stage-0 CUDA graph (low_latency) profile produces correct,
intelligible speech in real time: the output transcribes back to the input
prompt. However, it is not bit-identical to the eager profile. The waveform
is a different (valid) rendering of the same words, with near-zero correlation
to the eager output and a slightly lower level.
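
For anyone reproducing this, "near-zero correlation" and "slightly quieter" can be put into numbers with a Pearson correlation and an RMS level difference between the two waveforms. A minimal sketch, assuming both outputs are already decoded to float sample sequences at the same rate; the function names and the synthetic signals are illustrative only and are not the measurement used for this report:

```python
import math

def pearson(a, b):
    """Pearson correlation over the overlapping prefix of two sample sequences."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant signal has no defined correlation
    return cov / math.sqrt(var_a * var_b)

def rms(x):
    """Root-mean-square amplitude."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def level_diff_db(test, ref):
    """Level of `test` relative to `ref` in dB (negative means quieter)."""
    return 20 * math.log10(rms(test) / rms(ref))

# Synthetic stand-ins (NOT real model output): same tone, phase-shifted and attenuated.
n = 1000
eager_wave = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
graph_wave = [0.8 * math.cos(2 * math.pi * 5 * i / n) for i in range(n)]

corr = pearson(eager_wave, graph_wave)
level = level_diff_db(graph_wave, eager_wave)
print(f"correlation={corr:.3f}  level={level:.2f} dB")
```

Sample-wise correlation is very sensitive to phase and timing, so a low value here is expected for any two different renderings of the same words; it does not by itself indicate a quality problem.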

The divergence is isolated to the first decode step (the prefill->decode
transition). Per-step talker LM-logit means:

eager:  step1 13.43   step2 0.312   step3 13.20   ...
graph:  step1 13.43   step2 13.56   step3 13.59   ...

Step 1 (prefill, eager in both) and steps 3+ match eager within tolerance; only
step 2 (the first captured decode) diverges. The seed/feedback state is identical
at step 2 in both modes (count == 1, has_codes == 1), so it is not a
seed-timing difference.
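
The per-step comparison can be sketched as a small check over the logit means quoted above. The absolute tolerance of 0.5 is an assumption chosen for illustration (the tolerance actually used is not stated here), and only the three steps listed are included:

```python
# Per-step talker LM-logit means, copied from the table in this issue.
eager_means = {1: 13.43, 2: 0.312, 3: 13.20}
graph_means = {1: 13.43, 2: 13.56, 3: 13.59}

def divergent_steps(ref, test, atol=0.5):
    """Return the steps whose means differ by more than `atol` (assumed tolerance)."""
    return [
        step
        for step in sorted(ref)
        if step in test and abs(ref[step] - test[step]) > atol
    ]

bad_steps = divergent_steps(eager_means, graph_means)
print("divergent steps:", bad_steps)
```

With this assumed tolerance, only step 2 is flagged; a tighter tolerance (below roughly 0.39) would also flag step 3.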

Two candidate causes (not fully isolated):

  • the audio-feedback embedding reading a stale _decode_last_codes (the BOC seed)
    under capture at the first decode — forcing has_codes = 1 does not change the
    result, which is consistent with the codes themselves being stale rather than
    the gate;
  • the prefill->decode attention transition under FULL_DECODE_ONLY (the first
    replay after an eager prefill).
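
To make the first candidate concrete: a captured graph replays against the buffers it saw at capture time, so a value that is updated by rebinding a name (rather than writing into the captured buffer) is invisible to the replay. The following is a toy model in plain Python, not the talker code; `ToyGraph`, the seed values, and the code values are all hypothetical, and whether `_decode_last_codes` is actually updated this way is not established:

```python
class ToyGraph:
    """Stand-in for a captured graph: it keeps a reference to the buffer it saw at capture."""

    def __init__(self, static_buf):
        self.static_buf = static_buf  # analogous to a fixed address baked in at capture

    def replay(self):
        # A replay reads whatever is in the captured buffer right now.
        return list(self.static_buf)

BOC_SEED = [0, 0, 0, 0]        # hypothetical seed codes present at capture time
PREFILL_CODES = [7, 3, 9, 1]   # hypothetical codes produced by the eager prefill

# Case A: the state is rebound to a new object; the captured buffer is untouched.
state = {"last_codes": list(BOC_SEED)}
graph_a = ToyGraph(state["last_codes"])
state["last_codes"] = list(PREFILL_CODES)
seen_after_rebind = graph_a.replay()

# Case B: the new codes are copied into the captured buffer in place.
static_codes = list(BOC_SEED)
graph_b = ToyGraph(static_codes)
static_codes[:] = PREFILL_CODES
seen_after_inplace = graph_b.replay()

print("rebind  ->", seen_after_rebind)
print("inplace ->", seen_after_inplace)
```

In PyTorch terms the two cases correspond to `buf = new_codes` versus `buf.copy_(new_codes)`. In case A a gate such as `has_codes` could read correctly while the codes themselves remain the seed, which would match the observation that forcing `has_codes = 1` changes nothing.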

Impact

Parity / cosmetic only — the speech is correct and intelligible (verified by
transcription). Filing for awareness as a follow-up to the capture-crash fix;
not a blocker.

Environment

Tesla V100 / sm70, FLASH_ATTN_V100 backend.
