1. Summary
This RFC proposes adding streaming input support to vLLM-Omni services: clients may push input data (text, audio, video, or mixed multimodal) incrementally before the full request is ready; the server may start inference once enough data has been received, reducing time-to-first-byte, supporting real-time capture, and enabling very large payloads.
2. Motivation and Goals
2.1 Motivation
- Real-time or long-form input: When users stream audio while speaking or video while recording, waiting for the full upload before inference increases latency to first token/byte.
- Large payloads: Uploading long audio/video in one shot risks timeouts and high memory use; chunked upload reduces per-request body size and timeout risk.
- Consistent experience: Streaming input mirrors the existing streaming output; combining “streaming input + streaming output” gives low latency end to end.
2.2 Goals
- Support clients submitting multimodal input in a streaming fashion (text, audio, video, or mixed).
- Allow the server to start Thinker (or the first stage) early when enough data has been received, without waiting for the full input.
- Protocol and API must align with existing OpenAI-compatible and vLLM scheduling models for incremental rollout and fallback.
- Do not require all models/stages to support “start before input is complete”; allow configuration to “buffer until complete” for semantic consistency.
3. Terminology and Scope
| Term | Meaning |
| --- | --- |
| Streaming input | The client splits input for a single logical request into ordered chunks; the server may process input before it is complete. |
| Input chunk | A single uploaded unit of data (e.g., a segment of audio bytes, text, or video). |
| End of input (EOS) | The client explicitly signals that all input for the request has been sent. |
| Start condition | Server policy: allow inference to start when certain conditions are met (e.g., at least N bytes received, or first complete modality). |
Scope: Only “client → vLLM-Omni gateway/API layer” input streaming. Internal model execution (Thinker/Talker/Code2Wav) continues to follow existing logic and async_chunk.
4. Current State and Constraints
4.1 Current Behavior
- Requests are typically a single HTTP body: messages / input / multimodal payload submitted once in the POST body.
- The server creates a scheduled request and runs Thinker only after fully parsing the request body; there is no official “receive-and-compute” protocol.
4.2 Constraints
- Must coexist with existing OpenAI-compatible chat/completion and custom multimodal APIs (same service may support both “non-streaming” and “streaming input” requests).
- Scheduler and engine today work at “request” granularity (one request, one full prompt); streaming input introduces “incomplete requests” or “multiple chunks forming one request.”
- Multimodal order and boundaries: if text and audio are interleaved, chunk boundaries and types must be defined so the server can concatenate and align correctly (e.g., by timestamp or sequence number).
5. Design
5.1 Overview
- Session/request identity: A streaming input request first establishes a “streaming session” (e.g., via a streaming session ID or request ID); subsequent chunks carry this ID so the server can attribute uploads to the same logical request.
- Two modes (one or both may be supported):
  - Mode A: Chunked HTTP, single connection. The client sends the body with Transfer-Encoding: chunked in one HTTP request; the server parses and buffers chunks, and may create/update a scheduled request and start inference when the “start condition” is met.
  - Mode B: Multi-request append. The client first POSTs to create a streaming session (optionally with an initial chunk), then appends chunks via PATCH or POST (Section 6.2 proposes POST /v1/streaming_input/sessions/{session_id}/chunks), and finally marks the end via a finish request or a chunk with end_of_input=true; the server starts or continues inference when the end is received or when the start condition is met.
- Start condition: Configurable or policy-driven, e.g.:
  - “At least one complete modality” (e.g., first text message complete, or first audio segment meets minimum length);
  - “At least N bytes/chunks received”;
  - Or “start only on EOS” (buffer until complete; matches current behavior).
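The start condition above can be read as a small predicate evaluated each time a chunk arrives. The following is a minimal sketch, not a proposed implementation: `StartPolicy`, `should_start`, and the mode name `on_first_modality` are hypothetical; `on_end_only` and `on_min_length` borrow the names suggested in Section 7.5.

```python
from dataclasses import dataclass


@dataclass
class StartPolicy:
    """Hypothetical policy object; field names are illustrative only."""
    mode: str = "on_end_only"   # "on_end_only" | "on_min_length" | "on_first_modality"
    min_bytes: int = 0          # threshold used by "on_min_length"


def should_start(policy: StartPolicy, received_bytes: int,
                 complete_modalities: int, end_of_input: bool) -> bool:
    """Return True once inference may be started for the session."""
    if end_of_input:
        # End of input satisfies every policy, including "on_end_only".
        return True
    if policy.mode == "on_min_length":
        return received_bytes >= policy.min_bytes
    if policy.mode == "on_first_modality":
        return complete_modalities >= 1
    # "on_end_only": buffer until complete (matches current behavior).
    return False
```

With the default policy the function only returns True on end of input, so a deployment that never changes the policy keeps today's buffer-until-complete semantics.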
5.2 Data Model (Logical View)
- StreamingInputSession (server-side):
  - session_id: Unique identifier.
  - created_at / timeout: For discarding sessions that never complete.
  - accumulated_input: Concatenated or structured representation of received chunks (text buffer, audio buffer, video buffer, etc.).
  - state: open / started / finished (started: inference has been triggered; finished: input has ended).
- InputChunk (single chunk):
  - sequence_id / offset: For ordering and deduplication.
  - modality: text | audio | video | image (optional).
  - payload: Raw bytes or base64; or a reference (e.g. URL) for server fetch (this RFC may leave reference format unspecified initially).
  - end_of_input: Whether this is the last chunk for the request.
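To make the logical view concrete, the two records could be sketched as Python dataclasses. This is only an illustration under assumptions: the class layout, the per-modality `bytearray` buffers, the `"unknown"` fallback key, and the `add` helper are not specified by this RFC.

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class SessionState(Enum):
    OPEN = "open"          # accepting chunks, inference not yet triggered
    STARTED = "started"    # inference triggered
    FINISHED = "finished"  # input ended


@dataclass
class InputChunk:
    sequence_id: int
    payload: bytes
    modality: Optional[str] = None   # "text" | "audio" | "video" | "image"
    end_of_input: bool = False


@dataclass
class StreamingInputSession:
    session_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: float = field(default_factory=time.time)
    timeout: float = 300.0           # seconds; assumed default
    state: SessionState = SessionState.OPEN
    # Structured representation: one growing buffer per modality.
    accumulated_input: Dict[str, bytearray] = field(default_factory=dict)

    def add(self, chunk: InputChunk) -> None:
        """Append the chunk to its modality buffer and track end of input."""
        key = chunk.modality or "unknown"
        self.accumulated_input.setdefault(key, bytearray()).extend(chunk.payload)
        if chunk.end_of_input:
            self.state = SessionState.FINISHED
```

Keeping one buffer per modality is one possible reading of "structured representation"; a single ordered list of chunks would serve interleaved input better and is equally compatible with the fields above.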
5.3 Integration with the Scheduler
- Option 1 (buffer until start condition, then create request): Streaming input is accumulated at the API layer until the “start condition” is met; then a full prompt/multimodal payload is built and one scheduled request is created via the existing path. Later chunks only append to the request’s input buffer and do not change the prompt already in the scheduler (no “mid-stream prompt change”; only “early start”).
- Option 2 (dynamic prompt extension): The scheduler supports “incomplete requests” and allows appending input before Thinker starts or at a specific stage; more complex; can be a follow-up extension.
This RFC recommends Option 1 first: streaming input only affects “when to create the request” and “how to accumulate input,” keeping the “one request, one full prompt” assumption; optional appends after start are for stats/audit or reserved for future use.
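A minimal sketch of Option 1 follows, assuming a byte-count start condition and a `submit` callback that stands in for the existing request-creation path (both are hypothetical names). It shows the two properties the option relies on: the request is created exactly once, and later chunks extend the buffer without touching the prompt already handed to the scheduler.

```python
from typing import Callable, Optional


class EarlyStartBuffer:
    """Accumulates chunks and submits one request when the start condition holds."""

    def __init__(self, min_bytes: int, submit: Callable[[bytes], None]) -> None:
        self.min_bytes = min_bytes   # hypothetical start condition: N bytes
        self.submit = submit         # stand-in for the existing request path
        self.buffer = bytearray()
        self.submitted_prompt: Optional[bytes] = None

    def on_chunk(self, payload: bytes, end_of_input: bool = False) -> bool:
        """Append a chunk; return True if the request has been submitted."""
        self.buffer.extend(payload)
        not_yet_submitted = self.submitted_prompt is None
        if not_yet_submitted and (end_of_input or len(self.buffer) >= self.min_bytes):
            # Snapshot the prompt so later appends cannot alter it.
            self.submitted_prompt = bytes(self.buffer)
            self.submit(self.submitted_prompt)
        return self.submitted_prompt is not None
```

In this sketch, bytes that arrive after the snapshot stay in `buffer` only, which matches the "stats/audit or reserved for future use" role described above.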
6. Protocol and API
6.1 Mode A: Chunked HTTP, Single Connection
- Endpoint: Same as today, e.g. POST /v1/chat/completions or POST /v1/audio/speech. Streaming input is distinguished via a header: X-Streaming-Input: true, or Content-Type: application/x-multipart-streaming (or another custom type).
- Body: Sent with Transfer-Encoding: chunked and split into small frames, e.g.:
  - Length-prefixed binary frames (4- or 8-byte length), or JSON lines (NDJSON): {"modality":"text","payload":"base64...","end_of_input":false}.
- Response: Same as existing streaming output when stream: true (SSE); the first event may be sent once the start condition is met and the request has been created.
Pros: Single connection; no CORS preflight for a second request. Cons: Frame format, error recovery, and timeout semantics must be defined.
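One detail the frame format has to handle is that HTTP chunk boundaries need not coincide with frame boundaries. The sketch below assumes the NDJSON variant with a base64 `payload` field, as in the example frame above; `NdjsonFrameParser` is a hypothetical name, and error handling for malformed frames is omitted.

```python
import base64
import json
from typing import List


class NdjsonFrameParser:
    """Incrementally splits a byte stream into newline-delimited JSON frames."""

    def __init__(self) -> None:
        self._pending = b""   # bytes of a frame that is not complete yet

    def feed(self, data: bytes) -> List[dict]:
        """Feed raw body bytes; return the frames completed by this call."""
        self._pending += data
        # Everything before the last newline is complete; the rest is kept.
        *lines, self._pending = self._pending.split(b"\n")
        frames = []
        for line in lines:
            if not line.strip():
                continue   # tolerate blank lines between frames
            frame = json.loads(line)
            frame["payload"] = base64.b64decode(frame["payload"])
            frames.append(frame)
        return frames
```

A server loop would call `feed` for each piece of the request body it reads and pass the returned frames to the session; a frame with `end_of_input` set to true ends the input.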
6.2 Mode B: Multi-Request Append
- Create session: POST /v1/streaming_input/sessions
  - Body (optional): Initial parameters (e.g. model, voice) and an optional first chunk.
  - Response: { "session_id": "uuid", "expires_in": 300 }.
- Append: POST /v1/streaming_input/sessions/{session_id}/chunks
  - Body: { "sequence_id": 0, "modality": "text", "payload": "base64...", "end_of_input": false }.
  - Server returns 202 Accepted or 200 OK; optionally started: true in the body to indicate inference has been triggered.
- Finish: Set end_of_input: true on the last chunk, or call POST /v1/streaming_input/sessions/{session_id}/finish.
- Get result:
  - For streaming output, agree at session creation (e.g. stream: true); results are delivered via SSE or WebSocket (same as existing streaming output).
  - Or poll GET /v1/streaming_input/sessions/{session_id}/result (for non-streaming output).
Pros: Fits existing REST style; easy retries and auth. Cons: Multiple round-trips; slightly higher latency for the first chunk.
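On the client side, Mode B mostly comes down to producing correctly numbered append bodies. The sketch below builds the JSON bodies for the append endpoint from an iterable of byte segments; `build_append_bodies` is a hypothetical helper, and the HTTP calls themselves are left out so the example stays independent of any client library.

```python
import base64
from typing import Iterable, Iterator


def _body(sequence_id: int, modality: str, payload: bytes, end_of_input: bool) -> dict:
    return {
        "sequence_id": sequence_id,
        "modality": modality,
        "payload": base64.b64encode(payload).decode("ascii"),
        "end_of_input": end_of_input,
    }


def build_append_bodies(segments: Iterable[bytes], modality: str) -> Iterator[dict]:
    """Yield one append body per segment, flagging only the last one."""
    iterator = iter(segments)
    try:
        current = next(iterator)
    except StopIteration:
        return   # no segments, nothing to send
    sequence_id = 0
    # Look one segment ahead so the final body can carry end_of_input=true.
    for upcoming in iterator:
        yield _body(sequence_id, modality, current, False)
        sequence_id += 1
        current = upcoming
    yield _body(sequence_id, modality, current, True)
```

The one-segment lookahead delays each upload by one segment. A live-capture client that cannot afford that delay would send every chunk with `end_of_input` set to false and call the finish endpoint instead.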
6.3 Recommendation
- Phase 1: Implement Mode B (multi-request append) first for easier integration with routing, auth, and rate limiting, and clear session lifecycle.
- Phase 2: If further reduction in time-to-first-byte is needed, add Mode A (chunked single connection) or WebSocket (see below).
6.4 WebSocket (Optional Extension)
- Single connection, bidirectional: client sends input chunks over the same WebSocket; server pushes streaming output (e.g. token/audio chunks) on the same connection.
- Message format can reuse Mode B’s JSON structure (e.g. { "type": "input_chunk", "sequence_id": 0, "modality": "audio", "payload": "base64...", "end_of_input": false }); server pushes { "type": "output_chunk", ... } or existing SSE-compatible format.
- This RFC does not require WebSocket in the first release; it is left as an extension point.
7. Implementation Considerations
7.1 Session and Timeout
- If neither end_of_input nor any chunk is received within T seconds after session creation, the streaming session should be closed and resources released; T is configurable (e.g. 300s).
- Repeated finish or end_of_input for the same session_id should be idempotent.
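The two rules above can be sketched as a small lifecycle object. This sketch makes an assumption the text leaves open: T is measured from the last activity (the "idle timeout" reading used in Section 7.5) rather than strictly from session creation. `SessionLifecycle` and its methods are hypothetical names, and timestamps are passed in explicitly so the logic is easy to test.

```python
class SessionLifecycle:
    """Tracks idle expiry and makes finish idempotent."""

    def __init__(self, timeout_sec: float, now: float) -> None:
        self.timeout_sec = timeout_sec
        self.last_activity = now   # session creation counts as activity
        self.finished = False

    def touch(self, now: float) -> None:
        """Record that a chunk was received at time `now`."""
        self.last_activity = now

    def is_expired(self, now: float) -> bool:
        """True if the session is still open and has been idle longer than T."""
        return not self.finished and (now - self.last_activity) > self.timeout_sec

    def finish(self) -> bool:
        """Mark input as ended; return True only for the first call."""
        first_call = not self.finished
        self.finished = True
        return first_call
```

A handler could use the return value of `finish` to decide whether to trigger inference, and answer repeated calls with the same success response without any side effects.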
7.2 Ordering and Deduplication
- Server orders chunks by sequence_id or arrival order; on out-of-order chunks, buffer until in order and merge, or reject with 4xx.
- Chunks with a duplicate sequence_id may be deduplicated (processed once).
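The "buffer until in order" variant with deduplication can be sketched as a reorder buffer. It assumes sequence_id starts at 0 and increases by 1 per chunk, which this RFC does not state explicitly; `ReorderBuffer` is a hypothetical name.

```python
from typing import Dict, List


class ReorderBuffer:
    """Releases payloads in sequence order and drops duplicates."""

    def __init__(self) -> None:
        self.next_id = 0                     # next sequence_id to deliver
        self.pending: Dict[int, bytes] = {}  # chunks that arrived early

    def push(self, sequence_id: int, payload: bytes) -> List[bytes]:
        """Return the payloads that became deliverable, in sequence order."""
        if sequence_id < self.next_id or sequence_id in self.pending:
            return []   # duplicate: already delivered or already buffered
        self.pending[sequence_id] = payload
        ready = []
        while self.next_id in self.pending:
            ready.append(self.pending.pop(self.next_id))
            self.next_id += 1
        return ready
```

A production version would also bound the size of `pending`, since a missing chunk would otherwise let early arrivals accumulate until the session times out.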
7.3 Security and Rate Limiting
- Streaming input sessions count toward per-user/per-key concurrency and QPS limits.
- Total payload size per session should be capped to avoid unbounded memory; return 413 and close the session when exceeded.
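For the payload cap, the check is best done before a chunk is buffered, so that an oversized chunk never reaches memory accounting after the fact. A minimal sketch, assuming hypothetical names `SessionQuota` and `PayloadTooLarge` and leaving the mapping to an HTTP 413 response to the API layer:

```python
class PayloadTooLarge(Exception):
    """Raised when a session exceeds its cap; maps to HTTP 413."""
    status_code = 413


class SessionQuota:
    """Tracks accumulated bytes for one session against a fixed cap."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used = 0
        self.closed = False

    def charge(self, chunk_size: int) -> None:
        """Account for a chunk before buffering it; close the session on overflow."""
        if self.closed:
            raise PayloadTooLarge("session already closed")
        if self.used + chunk_size > self.max_bytes:
            self.closed = True
            raise PayloadTooLarge("payload cap exceeded")
        self.used += chunk_size
```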
7.4 Relationship to Async Chunk
- Streaming input only determines when and with what content a request is created and submitted to the scheduler; once in the scheduler, inter-stage flow remains the existing async_chunk (Thinker→Talker→Code2Wav).
- If the start condition is met on the first segment, that segment can enter Thinker earlier; combined with async_chunk, end-to-end TTFT/TTFA can be reduced further.
7.5 Suggested Configuration
- streaming_input.enabled: Enable or disable streaming input API.
- streaming_input.session_timeout_sec: Session idle timeout.
- streaming_input.max_payload_per_session: Maximum accumulated bytes per session.
- streaming_input.start_policy: on_first_chunk | on_end_only | on_min_length, etc., to decide when to create the scheduled request.
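As an illustration of how these keys might be loaded and validated, here is a sketch using a frozen dataclass. The default values (other than the 300s timeout used as an example earlier) and the `load_config` helper are assumptions, not part of this RFC.

```python
from dataclasses import dataclass
from typing import Any, Mapping

START_POLICIES = ("on_first_chunk", "on_end_only", "on_min_length")
PREFIX = "streaming_input."


@dataclass(frozen=True)
class StreamingInputConfig:
    enabled: bool = False                            # assumed default
    session_timeout_sec: int = 300                   # matches the 300s example
    max_payload_per_session: int = 64 * 1024 * 1024  # assumed default (64 MiB)
    start_policy: str = "on_end_only"                # matches current behavior

    def __post_init__(self) -> None:
        if self.start_policy not in START_POLICIES:
            raise ValueError(f"unknown start_policy: {self.start_policy!r}")
        if self.session_timeout_sec <= 0 or self.max_payload_per_session <= 0:
            raise ValueError("timeout and payload cap must be positive")


def load_config(flat: Mapping[str, Any]) -> StreamingInputConfig:
    """Build a config from flat 'streaming_input.*' keys; other keys are ignored."""
    kwargs = {k[len(PREFIX):]: v for k, v in flat.items() if k.startswith(PREFIX)}
    return StreamingInputConfig(**kwargs)
```

Defaulting start_policy to on_end_only means that enabling the API alone does not change when inference starts; early start is a separate, explicit opt-in.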
8. Alternatives and Open Questions
8.1 Alternatives
- Chunked upload only (no early start): Keep “one request, one full prompt”; use multipart or similar only to reduce single-request body size; do not start before input is complete. Fits “fix upload timeouts only” without pursuing earlier first-byte.
- Client-side buffering only: Client waits until input is complete, then one-shot POST; no server changes. Does not reduce time-to-first-byte; relies on async_chunk for inter-stage optimization only.
8.2 Open Questions
- Interleaved multimodal: The alignment rules for interleaved text and audio/video chunks (by timestamp vs. by order) and the model-side interface (e.g., an incremental “text-then-audio” prompt) must be aligned with concrete model implementations.
- Replay and debugging: Whether to record/replay streaming input for reproduction; if so, storage format and retention policy are TBD.
- Upstream vLLM: If the main vLLM library later adds streaming input extension points, how this design should plug in (e.g. reuse its session or chunk format).