[Bug]: [HiggsAudioV3] Talker crashes CUDA graph capture (low_latency profile) due to boolean-mask indexing #4562

### Describe the bug

The Higgs-Audio-v3 Stage-0 talker crashes during CUDA graph capture when the
`higgs_multimodal_qwen3_low_latency` profile is used (Stage-0
`enforce_eager: false`, `cudagraph_mode: FULL_DECODE_ONLY`).

`HiggsAudioV3Talker._decode_request_token_positions` filters its result with
boolean-mask indexing:

```python
decode_mask = (spans == 1) & (starts >= 0) & (starts < int(num_tokens))
return req_rows[decode_mask], starts[decode_mask]
```

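`req_rows[decode_mask]` produces a **data-dependent output shape**: the number
of selected rows is known only after the mask has been evaluated on the device,
so PyTorch must synchronize with the host to size the result. That
synchronization is illegal during CUDA graph stream capture. The failure is
backend-independent: it follows from a CUDA stream-capture rule, not from any
particular attention backend.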
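The shape dependence can be seen on CPU without any graph capture. The following is a minimal sketch, not the talker's actual code: `masked_positions` is a hypothetical standalone function that mirrors the two lines quoted above, and the tensors are made-up inputs.

```python
import torch


def masked_positions(req_rows, starts, spans, num_tokens):
    # Hypothetical stand-in for the filtering step quoted in this issue.
    decode_mask = (spans == 1) & (starts >= 0) & (starts < int(num_tokens))
    # Boolean-mask indexing: the output length equals the number of True
    # entries, so it depends on tensor *values*, not just tensor shapes.
    return req_rows[decode_mask], starts[decode_mask]


req_rows = torch.arange(4)
starts = torch.tensor([0, 1, 2, 3])

uniform_spans = torch.tensor([1, 1, 1, 1])  # every request decodes one token
mixed_spans = torch.tensor([1, 3, 1, 2])    # made-up batch with longer spans

rows_uniform, _ = masked_positions(req_rows, starts, uniform_spans, 4)
rows_mixed, _ = masked_positions(req_rows, starts, mixed_spans, 4)

# Same input shapes, different output shapes.
print(tuple(rows_uniform.shape), tuple(rows_mixed.shape))
```

Both calls receive tensors of identical shape, yet the results differ in length. A captured graph cannot represent that, because every kernel launch and allocation in it is recorded with fixed sizes.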

### Repro

1. Deploy `higgs-audio-v3-tts-4b` with the Stage-0 CUDA graph profile
   (`deploy/higgs_multimodal_qwen3_low_latency.yaml`:
   `enforce_eager: false`, `compilation_config.cudagraph_mode: FULL_DECODE_ONLY`).
2. Run any TTS request.
3. Capture fails:

```
torch.AcceleratorError: CUDA error: operation not permitted when stream is capturing
  File ".../higgs_audio_v3_talker.py", in forward
    hidden_states = self._apply_audio_feedback(hidden_states, input_ids)
  File ".../higgs_audio_v3_talker.py", in _apply_audio_feedback
    req_rows, token_positions = self._decode_request_token_positions(num_tokens, hidden_states.device)
  File ".../higgs_audio_v3_talker.py", in _decode_request_token_positions
    return req_rows[decode_mask], starts[decode_mask]
```

Observed on Tesla V100 / sm70 (FLASH_ATTN_V100 backend), but the boolean-mask
indexing failure does not depend on the backend.

### Fix

Under the external decode CUDA graph, the decode batch is always a uniform
single-token decode (every span == 1), so `decode_mask` is all-True and the
filtered result equals the unfiltered tensors. Returning `req_rows` and `starts`
directly when `_use_external_decode_cudagraph` is set keeps the captured shape
static and fixes the crash, with no change to the eager path. A PR follows.
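The idea can be sketched as follows. This is an illustration under assumptions, not the forthcoming patch: the function is a hypothetical free-standing version of `_decode_request_token_positions`, and the `use_external_decode_cudagraph` argument stands in for the `_use_external_decode_cudagraph` attribute.

```python
import torch


def decode_request_token_positions(
    req_rows, starts, spans, num_tokens, use_external_decode_cudagraph
):
    if use_external_decode_cudagraph:
        # Assumption taken from this issue: under the external decode graph
        # every span is 1 and every start is in range, so the mask would be
        # all-True. Skipping it keeps the output shape static.
        return req_rows, starts
    # Eager path, unchanged: data-dependent filtering is fine outside capture.
    decode_mask = (spans == 1) & (starts >= 0) & (starts < int(num_tokens))
    return req_rows[decode_mask], starts[decode_mask]


req_rows = torch.arange(3)
starts = torch.tensor([0, 1, 2])
spans = torch.ones(3, dtype=torch.long)  # uniform single-token decode

graph_rows, graph_starts = decode_request_token_positions(
    req_rows, starts, spans, 3, use_external_decode_cudagraph=True
)
eager_rows, eager_starts = decode_request_token_positions(
    req_rows, starts, spans, 3, use_external_decode_cudagraph=False
)

# On a uniform decode batch the two paths agree.
print(torch.equal(graph_rows, eager_rows), torch.equal(graph_starts, eager_starts))
```

The shortcut is only valid while the uniform-decode invariant holds. If the graph path could ever see a span other than 1, the early return would silently pass through rows that the eager path filters out.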

