Describe the bug
The Higgs-Audio-v3 Stage-0 talker crashes during CUDA graph capture when the
higgs_multimodal_qwen3_low_latency profile is used (Stage-0
enforce_eager: false, cudagraph_mode: FULL_DECODE_ONLY).
HiggsAudioV3Talker._decode_request_token_positions filters its result with
boolean-mask indexing:
decode_mask = (spans == 1) & (starts >= 0) & (starts < int(num_tokens))
return req_rows[decode_mask], starts[decode_mask]
req_rows[decode_mask] produces a data-dependent output shape, which forces
a host synchronization and is illegal during CUDA graph stream capture. The
restriction is a CUDA stream-capture rule, so it does not depend on the
attention backend.
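A minimal sketch of the shape problem, run on CPU with made-up tensors (the helper name and the values are illustrative and not taken from the talker). The same filter returns a different number of rows depending on the contents of the mask, which is why a CUDA device has to sync with the host to learn the output size:

```python
import torch

def masked_select_rows(req_rows, starts, spans, num_tokens):
    # Same filter as in the issue: the output length depends on the mask values.
    decode_mask = (spans == 1) & (starts >= 0) & (starts < int(num_tokens))
    return req_rows[decode_mask], starts[decode_mask]

req_rows = torch.arange(4)
starts = torch.tensor([0, 1, 2, 3])

all_decode = torch.tensor([1, 1, 1, 1])  # uniform single-token decode
mixed = torch.tensor([1, 3, 1, 2])       # hypothetical mixed batch

rows_a, _ = masked_select_rows(req_rows, starts, all_decode, 4)
rows_b, _ = masked_select_rows(req_rows, starts, mixed, 4)

# Identical input shapes, different output shapes.
print(rows_a.shape, rows_b.shape)
```

With every span equal to 1 the filter keeps all rows, which is the case the fix relies on.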
Repro
- Deploy higgs-audio-v3-tts-4b with the Stage-0 CUDA graph profile
  (deploy/higgs_multimodal_qwen3_low_latency.yaml:
  enforce_eager: false, compilation_config.cudagraph_mode: FULL_DECODE_ONLY).
- Run any TTS request.
- Capture fails:
torch.AcceleratorError: CUDA error: operation not permitted when stream is capturing
File ".../higgs_audio_v3_talker.py", in forward
hidden_states = self._apply_audio_feedback(hidden_states, input_ids)
File ".../higgs_audio_v3_talker.py", in _apply_audio_feedback
req_rows, token_positions = self._decode_request_token_positions(num_tokens, hidden_states.device)
File ".../higgs_audio_v3_talker.py", in _decode_request_token_positions
return req_rows[decode_mask], starts[decode_mask]
Observed on Tesla V100 / sm70 (FLASH_ATTN_V100 backend), but the failure comes
from the boolean-mask indexing and does not depend on the backend.
Fix
Under the external decode CUDA graph the decode batch is always a uniform
single-token decode (every span == 1), so decode_mask is all-True and the
filtered result equals the unfiltered tensors. Returning them directly when
_use_external_decode_cudagraph is set keeps the captured shape static and fixes
the crash, with no change to the eager path. A PR follows.