[RFC]: Qwen3-Omni supports streaming input #1951

## 1. Summary

This RFC proposes adding **streaming input** support to vLLM-Omni services: clients may push input data (text, audio, video, or mixed multimodal) incrementally before the full request is ready; the server may start inference once enough data has been received, reducing time-to-first-byte, supporting real-time capture, and enabling very large payloads.

---

## 2. Motivation and Goals

### 2.1 Motivation

- **Real-time or long-form input**: When users stream audio while speaking or video while recording, waiting for the full upload before inference increases latency to first token/byte.
- **Large payloads**: Uploading long audio/video in one shot risks timeouts and high memory use; chunked upload reduces per-request body size and timeout risk.
- **Consistent experience**: Streaming input mirrors the existing streaming output; combining “streaming input + streaming output” gives low latency end to end.

### 2.2 Goals

- Support clients submitting multimodal input **in a streaming fashion** (text, audio, video, or mixed).
- Allow the server to **start** Thinker (or the first stage) **early** when **enough data** has been received, without waiting for the full input.
- Protocol and API must align with existing OpenAI-compatible and vLLM scheduling models for incremental rollout and fallback.
- Do not require all models/stages to support “start before input is complete”; allow configuration to “buffer until complete” for semantic consistency.

---

## 3. Terminology and Scope

| Term | Meaning |
|------|----------|
| **Streaming input** | The client splits input for a single logical request into ordered chunks; the server may process input before it is complete. |
| **Input chunk** | A single uploaded unit of data (e.g., a segment of audio bytes, text, or video). |
| **End of input (EOS)** | The client explicitly signals that all input for the request has been sent. |
| **Start condition** | Server policy: allow inference to start when certain conditions are met (e.g., at least N bytes received, or first complete modality). |

**Scope**: This RFC covers only input streaming from the client to the vLLM-Omni gateway/API layer. Internal model execution (Thinker/Talker/Code2Wav) continues to follow the existing logic and async_chunk.

---

## 4. Current State and Constraints

### 4.1 Current Behavior

- Requests are typically a **single HTTP body**: `messages` / `input` / multimodal payload submitted once in the `POST` body.
- The server creates a scheduled request and runs Thinker only after **fully parsing** the request body; there is no official “receive-and-compute” protocol.

### 4.2 Constraints

- Must coexist with existing **OpenAI-compatible** chat/completion and custom multimodal APIs (same service may support both “non-streaming” and “streaming input” requests).
- Scheduler and engine today work at “request” granularity (one request, one full prompt); streaming input introduces “incomplete requests” or “multiple chunks forming one request.”
- Multimodal order and boundaries: if text and audio are interleaved, chunk boundaries and types must be defined so the server can concatenate and align correctly (e.g., by timestamp or sequence number).

---

## 5. Design

### 5.1 Overview

- **Session/request identity**: A streaming input request first establishes a “streaming session” (e.g., via a **streaming session ID** or **request ID**); subsequent chunks carry this ID so the server can attribute uploads to the same logical request.
- **Two modes** (one or both may be supported):
  - **Mode A: Chunked HTTP, single connection**  
    The client sends the body with `Transfer-Encoding: chunked` in **one HTTP request**; the server parses and buffers chunks, and may create/update a scheduled request and start inference when the “start condition” is met.
  - **Mode B: Multi-request append**  
    The client first `POST`s to create a streaming session (optionally with an initial chunk), then appends chunks with follow-up requests (e.g. `PATCH` or `POST` to a per-session chunks endpoint), and finally marks the end of input with a dedicated finish request or a chunk carrying `end_of_input=true`; the server starts or continues inference when the end is received or when the start condition is met.
- **Start condition**: Configurable or policy-driven, e.g.:
  - “At least one complete modality” (e.g., first text message complete, or first audio segment meets minimum length);
  - “At least N bytes/chunks received”;
  - Or “start only on EOS” (buffer until complete; matches current behavior).
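
The start conditions listed above could be evaluated by a small policy function. The following is a minimal sketch, not a proposed implementation: the names `StartPolicy`, `SessionProgress`, and `should_start` are hypothetical, the policy values follow the `start_policy` options suggested in section 7.5, and the “at least one complete modality” condition is left out because it depends on model-specific rules.

```python
from dataclasses import dataclass
from enum import Enum


class StartPolicy(Enum):
    ON_FIRST_CHUNK = "on_first_chunk"
    ON_MIN_LENGTH = "on_min_length"
    ON_END_ONLY = "on_end_only"


@dataclass
class SessionProgress:
    """Hypothetical counters the API layer keeps per streaming session."""
    chunks_received: int = 0
    bytes_received: int = 0
    end_of_input: bool = False


def should_start(policy: StartPolicy, progress: SessionProgress, min_bytes: int = 0) -> bool:
    """Return True once the session has enough input to start inference."""
    if progress.end_of_input:
        # End of input satisfies every policy, including "buffer until complete".
        return True
    if policy is StartPolicy.ON_FIRST_CHUNK:
        return progress.chunks_received >= 1
    if policy is StartPolicy.ON_MIN_LENGTH:
        return progress.bytes_received >= min_bytes
    # ON_END_ONLY: keep buffering until the client signals end of input.
    return False
```

With `ON_END_ONLY`, the function returns `True` only after `end_of_input` is set, which matches the current one-shot behavior.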

### 5.2 Data Model (Logical View)

- **StreamingInputSession** (server-side):
  - `session_id`: Unique identifier.
  - `created_at` / `timeout`: For discarding sessions that never complete.
  - `accumulated_input`: Concatenated or structured representation of received chunks (text buffer, audio buffer, video buffer, etc.).
  - `state`: e.g. `open` (accepting chunks, inference not yet triggered) / `started` (inference triggered) / `finished` (input ended).
- **InputChunk** (single chunk):
  - `sequence_id` / `offset`: For ordering and deduplication.
  - `modality`: `text` | `audio` | `video` | `image` (optional).
  - `payload`: Raw bytes or base64; or a reference (e.g. URL) for server fetch (this RFC may leave reference format unspecified initially).
  - `end_of_input`: Whether this is the last chunk for the request.
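
To make the logical view concrete, here is one way the two records might look as Python dataclasses. This is an illustrative sketch only: the field names come from the list above, while the types, the default timeout, the per-modality `bytearray` buffers, and the `append` method are assumptions.

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class SessionState(Enum):
    OPEN = "open"
    STARTED = "started"
    FINISHED = "finished"


@dataclass(frozen=True)
class InputChunk:
    sequence_id: int
    payload: bytes                  # raw bytes; base64 is decoded before this point
    modality: Optional[str] = None  # "text", "audio", "video", or "image"
    end_of_input: bool = False


@dataclass
class StreamingInputSession:
    session_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: float = field(default_factory=time.time)
    timeout: float = 300.0          # assumed default, in seconds
    state: SessionState = SessionState.OPEN
    # Assumption: one growing byte buffer per modality.
    accumulated_input: Dict[str, bytearray] = field(default_factory=dict)

    def append(self, chunk: InputChunk) -> None:
        """Add a chunk to the matching buffer and record end of input."""
        if self.state is SessionState.FINISHED:
            raise ValueError("session already finished")
        key = chunk.modality or "unknown"
        self.accumulated_input.setdefault(key, bytearray()).extend(chunk.payload)
        if chunk.end_of_input:
            self.state = SessionState.FINISHED
```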

### 5.3 Integration with the Scheduler

- **Option 1 (buffer until start condition, then create request)**: Streaming input is accumulated at the API layer until the “start condition” is met; then a **full prompt/multimodal payload** is built and one scheduled request is created via the existing path. Later chunks only append to the request’s input buffer and **do not** change the prompt already in the scheduler (no “mid-stream prompt change”; only “early start”).
- **Option 2 (dynamic prompt extension)**: The scheduler supports “incomplete requests” and allows appending input before Thinker starts or at a specific stage; more complex; can be a follow-up extension.

This RFC recommends **Option 1 first**: streaming input only affects “when to create the request” and “how to accumulate input,” keeping the “one request, one full prompt” assumption; optional appends after start are for stats/audit or reserved for future use.
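
Option 1 can be pictured as a buffer at the API layer that creates the scheduled request exactly once. The sketch below is hypothetical: `EarlyStartBuffer` and its `submit` callback stand in for the existing request-creation path, and a simple byte threshold stands in for the start condition.

```python
from typing import Callable, Optional


class EarlyStartBuffer:
    """Buffers chunks and submits a single request once the start condition holds."""

    def __init__(self, submit: Callable[[bytes], None], min_bytes: int) -> None:
        self._submit = submit        # hypothetical hook into the existing request path
        self._min_bytes = min_bytes  # start condition: at least this many bytes
        self._buffer = bytearray()
        self.submitted_prompt: Optional[bytes] = None

    def on_chunk(self, payload: bytes, end_of_input: bool = False) -> bool:
        """Record a chunk; return True if this call created the scheduled request."""
        self._buffer.extend(payload)
        if self.submitted_prompt is not None:
            # Already started: keep the bytes, leave the submitted prompt untouched.
            return False
        if end_of_input or len(self._buffer) >= self._min_bytes:
            self.submitted_prompt = bytes(self._buffer)  # snapshot used as the prompt
            self._submit(self.submitted_prompt)
            return True
        return False
```

Chunks that arrive after submission are still accumulated, but the prompt handed to the scheduler never changes, which preserves the “one request, one full prompt” assumption.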

---

## 6. Protocol and API

### 6.1 Mode A: Chunked HTTP, Single Connection

- **Endpoint**: Same as today, e.g. `POST /v1/chat/completions` or `POST /v1/audio/speech`; distinguish streaming input via **headers**:
  - `X-Streaming-Input: true` or `Content-Type: application/x-multipart-streaming` (or another custom type).
- **Body**: `Transfer-Encoding: chunked`; each chunk is a small frame, e.g.:
  - Either a binary length prefix (4 or 8 bytes) followed by the payload, or one JSON object per line (NDJSON), e.g. `{"modality":"text","payload":"base64...","end_of_input":false}`.
- **Response**: Same as existing streaming output when `stream: true` (SSE); first event may be sent after the start condition is met and the request is created.

**Pros**: Single connection; no CORS preflight for a second request. **Cons**: Frame format, error recovery, and timeout semantics must be defined.
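
If the NDJSON framing is chosen for Mode A, the server has to reassemble frames itself, because HTTP chunk boundaries need not line up with frame boundaries. A minimal sketch of such a decoder, assuming every frame carries a base64 `payload` field as in the example above (the class name `NdjsonFrameDecoder` is hypothetical):

```python
import base64
import json
from typing import List


class NdjsonFrameDecoder:
    """Reassembles NDJSON frames from arbitrarily split byte chunks."""

    def __init__(self) -> None:
        self._pending = b""  # bytes of a frame whose newline has not arrived yet

    def feed(self, data: bytes) -> List[dict]:
        """Consume received bytes and return every frame completed by them."""
        self._pending += data
        # Everything before the last newline is complete; the tail stays pending.
        *lines, self._pending = self._pending.split(b"\n")
        frames = []
        for line in lines:
            if not line.strip():
                continue  # tolerate blank lines between frames
            frame = json.loads(line)
            frame["payload"] = base64.b64decode(frame["payload"])
            frames.append(frame)
        return frames
```

A production decoder would also need a limit on the size of `_pending` and explicit handling of malformed frames, which relates to the error-recovery question listed under Cons.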

### 6.2 Mode B: Multi-Request Append

- **Create session**: `POST /v1/streaming_input/sessions`  
  - Body (optional): Initial parameters (e.g. `model`, `voice`) and optional first chunk.  
  - Response: `{ "session_id": "uuid", "expires_in": 300 }`.
- **Append**: `POST /v1/streaming_input/sessions/{session_id}/chunks`  
  - Body: `{ "sequence_id": 0, "modality": "text", "payload": "base64...", "end_of_input": false }`.  
  - Server returns `202 Accepted` or `200 OK`; optionally `started: true` in the body to indicate inference has been triggered.
- **Finish**: Set `end_of_input: true` on the last chunk, or call `POST /v1/streaming_input/sessions/{session_id}/finish`.
- **Get result**:  
  - For streaming output, agree at session creation (e.g. `stream: true`); results delivered via SSE or WebSocket (same as existing streaming output).  
  - Or poll `GET /v1/streaming_input/sessions/{session_id}/result` (for non-streaming output).

**Pros**: Fits existing REST style; easy retries and auth. **Cons**: Multiple round-trips; slightly higher latency for the first chunk.
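
On the client side, Mode B amounts to slicing the input and sending one append request per slice. The helper below only builds the request paths and JSON bodies, so it makes no assumption about the HTTP library; the function name `build_append_requests` is hypothetical, and the endpoint and field names follow the proposal above.

```python
import base64
import json
from typing import Iterator, Tuple


def build_append_requests(
    session_id: str, data: bytes, modality: str, chunk_size: int
) -> Iterator[Tuple[str, str]]:
    """Yield (path, JSON body) pairs for POSTing `data` in ordered chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    path = f"/v1/streaming_input/sessions/{session_id}/chunks"
    # Ceiling division; empty input still yields one chunk that ends the input.
    total = max(1, -(-len(data) // chunk_size))
    for seq in range(total):
        piece = data[seq * chunk_size:(seq + 1) * chunk_size]
        body = {
            "sequence_id": seq,
            "modality": modality,
            "payload": base64.b64encode(piece).decode("ascii"),
            "end_of_input": seq == total - 1,  # the last chunk doubles as "finish"
        }
        yield path, json.dumps(body)
```

Each pair can then be sent with any HTTP client. Because every body carries its own `sequence_id`, a failed request can be retried without the server applying the chunk twice.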

### 6.3 Recommendation

- **Phase 1**: Implement **Mode B** (multi-request append) first for easier integration with routing, auth, and rate limiting, and clear session lifecycle.
- **Phase 2**: If further reduction in time-to-first-byte is needed, add Mode A (chunked single connection) or WebSocket (see below).

### 6.4 WebSocket (Optional Extension)

- Single connection, bidirectional: client sends input chunks over the same WebSocket; server pushes streaming output (e.g. token/audio chunks) on the same connection.
- Message format can reuse Mode B’s JSON structure (e.g. `{ "type": "input_chunk", "sequence_id": 0, "modality": "audio", "payload": "base64...", "end_of_input": false }`); server pushes `{ "type": "output_chunk", ... }` or existing SSE-compatible format.
- This RFC does not require WebSocket in the first release; it is left as an extension point.

---

## 7. Implementation Considerations

### 7.1 Session and Timeout

- If neither a chunk nor `end_of_input` is received within **T seconds** after session creation, the server should close the streaming session and release its resources; T is configurable (e.g. 300s).
- Repeated `finish` or `end_of_input` for the same `session_id` should be idempotent.
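
Both rules fit in a small registry. The sketch below is illustrative and makes one assumption worth flagging: it measures T from the last received chunk (an idle timeout, as `session_timeout_sec` in section 7.5 suggests) rather than strictly from session creation. The class `SessionRegistry` is hypothetical, and the clock is passed in explicitly so the logic can be tested without sleeping.

```python
from typing import Dict, List, Set


class SessionRegistry:
    """Tracks session activity, idempotent finish, and expiry."""

    def __init__(self, timeout_sec: float) -> None:
        self._timeout = timeout_sec
        self._last_seen: Dict[str, float] = {}
        self._finished: Set[str] = set()

    def touch(self, session_id: str, now: float) -> None:
        """Record session creation or a newly received chunk."""
        self._last_seen[session_id] = now

    def finish(self, session_id: str) -> bool:
        """Mark input as ended; return True only for the first call."""
        if session_id not in self._last_seen:
            raise KeyError(session_id)
        if session_id in self._finished:
            return False  # repeated finish is a harmless no-op
        self._finished.add(session_id)
        return True

    def expire(self, now: float) -> List[str]:
        """Drop unfinished sessions idle for longer than the timeout."""
        stale = [
            sid for sid, seen in self._last_seen.items()
            if sid not in self._finished and now - seen > self._timeout
        ]
        for sid in stale:
            del self._last_seen[sid]
        return stale
```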

### 7.2 Ordering and Deduplication

- The server orders chunks by `sequence_id` or by arrival order; out-of-order chunks are either buffered until the gap is filled and then merged, or rejected with a 4xx response.
- Duplicate `sequence_id` may be deduplicated (process once).
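
The “buffer until in order” variant with deduplication can be sketched as a reorder buffer. This is a hypothetical helper, assuming `sequence_id` starts at 0 and increases by 1 per chunk; the `max_pending` bound is an added assumption that keeps a missing chunk from causing unbounded buffering.

```python
from typing import Dict, List


class ReorderBuffer:
    """Releases chunk payloads in sequence order and drops duplicates."""

    def __init__(self, max_pending: int = 64) -> None:
        self._next_id = 0                   # next sequence_id to release
        self._pending: Dict[int, bytes] = {}
        self._max_pending = max_pending

    def push(self, sequence_id: int, payload: bytes) -> List[bytes]:
        """Accept one chunk; return the payloads that are now in order."""
        if sequence_id < self._next_id or sequence_id in self._pending:
            return []  # duplicate: already released or already buffered
        if sequence_id != self._next_id and len(self._pending) >= self._max_pending:
            # Too many gaps; the API layer would map this to a 4xx response.
            raise OverflowError("too many out-of-order chunks")
        self._pending[sequence_id] = payload
        ready = []
        while self._next_id in self._pending:
            ready.append(self._pending.pop(self._next_id))
            self._next_id += 1
        return ready
```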

### 7.3 Security and Rate Limiting

- Streaming input sessions count toward per-user/per-key concurrency and QPS limits.
- Total payload size per session should be capped to avoid unbounded memory; return 413 and close the session when exceeded.
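
The per-session cap could be enforced before a chunk is buffered, for example as follows. `SessionQuota` and `PayloadTooLarge` are hypothetical names; the only behavior taken from the text is “reject with 413 and close the session once the cap is exceeded.”

```python
class PayloadTooLarge(Exception):
    """Raised when a session exceeds its byte budget; maps to HTTP 413."""
    status_code = 413


class SessionQuota:
    """Tracks accumulated bytes for one streaming session."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used = 0
        self.closed = False

    def charge(self, num_bytes: int) -> int:
        """Account for a chunk; return the remaining budget or raise."""
        if self.closed or self.used + num_bytes > self.max_bytes:
            self.closed = True  # once exceeded, the session stays closed
            raise PayloadTooLarge(f"session limit of {self.max_bytes} bytes exceeded")
        self.used += num_bytes
        return self.max_bytes - self.used
```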

### 7.4 Relationship to Async Chunk

- Streaming input only determines **when** and **with what content** a **request** is created and submitted to the scheduler; once in the scheduler, inter-stage flow remains the existing async_chunk (Thinker→Talker→Code2Wav).
- If the first segment already satisfies the start condition, it can enter Thinker before the rest of the input arrives; combined with async_chunk, this can further reduce end-to-end time to first token (TTFT) and time to first audio (TTFA).

### 7.5 Suggested Configuration

- `streaming_input.enabled`: Enable or disable streaming input API.
- `streaming_input.session_timeout_sec`: Session idle timeout.
- `streaming_input.max_payload_per_session`: Maximum accumulated bytes per session.
- `streaming_input.start_policy`: `on_first_chunk` | `on_end_only` | `on_min_length`, etc., to decide when to create the scheduled request.
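
These options might be grouped into a single validated object. The sketch below is an assumption about shape, not an existing vLLM-Omni config class: the defaults (disabled, 300 s, 64 MiB, `on_end_only`) are placeholders chosen so that the default behavior matches today's one-shot path.

```python
from dataclasses import dataclass

VALID_START_POLICIES = ("on_first_chunk", "on_end_only", "on_min_length")


@dataclass(frozen=True)
class StreamingInputConfig:
    enabled: bool = False
    session_timeout_sec: float = 300.0                # placeholder default
    max_payload_per_session: int = 64 * 1024 * 1024   # placeholder default, bytes
    start_policy: str = "on_end_only"

    def __post_init__(self) -> None:
        # Fail fast on values that would make sessions unusable.
        if self.session_timeout_sec <= 0:
            raise ValueError("session_timeout_sec must be positive")
        if self.max_payload_per_session <= 0:
            raise ValueError("max_payload_per_session must be positive")
        if self.start_policy not in VALID_START_POLICIES:
            raise ValueError(f"unknown start_policy: {self.start_policy!r}")
```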

---

## 8. Alternatives and Open Questions

### 8.1 Alternatives

- **Chunked upload only (no early start)**: Keep “one request, one full prompt”; use multipart or similar only to reduce single-request body size; do not start before input is complete. Fits “fix upload timeouts only” without pursuing earlier first-byte.
- **Client-side buffering only**: Client waits until input is complete, then one-shot POST; no server changes. Does not reduce time-to-first-byte; relies on async_chunk for inter-stage optimization only.

### 8.2 Open Questions

- **Interleaved multimodal**: The alignment rules for interleaved text and audio/video chunks (by timestamp vs. by order) and the model-side interface (e.g. an incremental “text-then-audio” prompt) must be settled against concrete model implementations.
- **Replay and debugging**: Whether to record/replay streaming input for reproduction; if so, storage format and retention policy are TBD.
- **Upstream vLLM**: If the main vLLM library later adds streaming input extension points, how this design should plug in (e.g. reuse its session or chunk format).

---
