You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Real-time video understanding — where a model continuously processes a live camera or video stream and responds to user queries about what it sees — is rapidly becoming a core capability of frontier AI platforms (Gemini Live, Doubao, Apple Visual Intelligence). vLLM-OMNI is uniquely positioned: it already has Qwen3-Omni's 3-stage pipeline, async chunk streaming, and audio-in-video interleaving. This RFC proposes adding streaming video input to complete the "Doubao experience": point camera, speak question, get spoken answer.
No open-source serving engine currently offers streaming video in with text+speech out. This RFC fills that gap.
Market Context
pie title AI Video Market Segments (2025 est., $B)
"Visual Inspection — $30B" : 30
"Video Analytics — $13B" : 13
"Video Surveillance — $6B" : 6
"Robotics / Embodied AI — $5B" : 5
Even the best model trails humans by 9 points. Massive room for improvement.
Use Cases
Use Case
FPS
Latency Target
Interactive video call (Doubao-style)
1-4
<1s
Robotics / embodied AI
2-8
<500ms
Manufacturing QC
1-2
<1s
Security / surveillance
0.5-2
<2s
Accessibility (scene narration)
1-2
<1s
Target Experience
sequenceDiagram
participant U as User (Camera + Mic)
participant WS as WebSocket
participant T as Thinker
participant TK as Talker
participant CW as Code2Wav
U->>WS: session.config {model, modalities}
loop Every frame (1-4 FPS)
U->>WS: video.frame {base64 JPEG}
WS->>T: Vision encode + incremental prefill
Note over T: KV cache grows as frames arrive
end
U->>WS: video.query "What do you see?"
Note over T: Most prefill already done — append query tokens only
T->>T: LLM generate (near-instant start)
T-->>WS: response.text.delta "A person is..."
T->>TK: Hidden states (async chunk)
TK->>TK: Text → speech codes
TK->>CW: Codec codes
CW->>CW: Codes → waveform
CW-->>WS: response.audio (binary)
WS-->>U: Streaming text + audio
U->>WS: video.done
Loading
Phased Implementation
gantt
title Implementation Roadmap
dateFormat YYYY-MM-DD
axisFormat %b
section Phase 1
WebSocket endpoint + Qwen3-Omni :done, p1, 2026-03-26, 7d
section Phase 2
EVS frame pruning :p2, after p1, 5d
section Phase 3
Combined video + audio input :p3, after p2, 7d
section Phase 4
Async chunk for video :p4, after p3, 5d
section Phase 5
KV cache eviction :p5, after p4, 10d
section Phase 6
Expand to other models :p6, after p4, 14d
section Phase 7
Gradio demo + benchmarks :p7, after p5, 7d
Loading
Phase 1: End-to-End Streaming Video with Qwen3-Omni
flowchart LR
A[Webcam / Video File] -->|base64 JPEG| B[WebSocket Handler]
B -->|Encode + Prefill| C[Thinker KV Cache]
C -->|Query arrives| D[Qwen3-Omni Thinker]
D -->|text tokens| E[Talker]
E -->|codec codes| F[Code2Wav]
D -.->|text deltas| G[Client]
F -.->|audio bytes| G
Loading
New WebSocket endpoint /v1/video/chat/stream. Frames arrive incrementally and are vision-encoded and prefilled into KV cache as they arrive — so by the time a query is submitted, most of the prefill is already done and TTFT is minimized. The handler appends query tokens to the existing KV cache and starts generation immediately. Responses stream back as text deltas and audio bytes.
Key design: multi-turn within a single session. Frames persist in KV cache across queries — user can send more frames, then ask another question about the accumulated context.
Upstream vLLM already ships EVS (multimodal/evs.py) which computes cosine similarity between consecutive frame embeddings and retains only the most dissimilar ones. For streaming, we add a lightweight pre-filter in the handler: compare each incoming frame against the last retained frame using pixel-level similarity. Frames above a configurable threshold (default 0.95) are dropped before reaching the vision encoder.
Expected impact: 2-5x fewer frames processed for static/slow scenes, proportional reduction in encoder compute and KV cache usage.
Phase 3: Combined Video + Audio Input
flowchart TB
subgraph "Client"
CAM[Camera<br>1-4 FPS] -->|video.frame| WS[WebSocket]
MIC[Microphone<br>16kHz PCM] -->|audio.chunk| WS
end
subgraph "Handler"
WS --> VB[Video Buffer]
WS --> AB[Audio Buffer]
VB --> MRG[Merge into<br>Chat Request]
AB --> MRG
end
subgraph "Qwen3-Omni Thinker"
MRG -->|image_url + audio_url| INT[Audio-Video<br>Token Interleaving]
INT --> |temporal alignment| LLM[LLM Decode]
end
Loading
Extend the WebSocket protocol with audio.chunk events alongside video.frame. When a query is committed, the handler merges both buffers into a single chat request with image_url + audio_url content blocks. Qwen3-Omni's existing use_audio_in_video mechanism interleaves audio and video tokens at the temporal level — splitting frames into chunks and alternating vision tokens with corresponding audio tokens.
This enables the full "Doubao experience" — user speaks a question while showing their camera, model understands both modalities simultaneously.
Phase 4: Async Chunk for Video
sequenceDiagram
participant TH as Thinker
participant TK as Talker
participant CW as Code2Wav
participant CL as Client
Note over TH: Prefill all video frames + query
TH->>TH: Generate token 1
TH->>TK: Forward token 1 (async)
TH->>TH: Generate token 2
TK->>TK: Process token 1
TH->>TK: Forward token 2
TH->>TH: Generate token 3
TK->>CW: First codec chunk (25 frames)
TK->>TK: Process token 2
CW->>CL: First audio chunk
TH->>TK: Forward token 3
Note over TH,CL: Talker starts before Thinker finishes!
Loading
The async chunk infrastructure already exists for audio-only Qwen3-Omni (see docs/design/feature/async_chunk_design.md). The key insight: it's model-input agnostic. The thinker2talker_async_chunk processor works identically whether the thinker input was text, audio, or video — it only cares about the thinker's output hidden states. Enabling async chunk for video requires only a new stage config with async_chunk: true.
Impact from existing benchmarks: ~92% reduction in time-to-first-audio (6.5s to 0.52s at concurrency=1), ~17% E2E latency improvement at concurrency=10. Source: docs/design/feature/async_chunk_design.md internal benchmarks.
Phase 5: KV Cache Eviction for Long Streams
flowchart LR
subgraph "KV Cache — fixed memory budget"
direction TB
S["Sink Tokens<br>System prompt + first frames<br>~512 tokens — never evicted"]
VW["Vision Window<br>Last N seconds of frames<br>Sliding — oldest evicted"]
TW["Text Window<br>Recent responses<br>Sliding — oldest evicted"]
end
NEW[New Frame] -->|Append| VW
VW -->|Overflow| EVICT[Evict oldest<br>vision block]
style S fill:#4CAF50,color:white
style VW fill:#2196F3,color:white
style TW fill:#FF9800,color:white
Loading
For long streaming sessions, the KV cache grows unboundedly. Three eviction strategies, in order of complexity:
Strategy A — Fixed Sliding Window: Keep KV for last N seconds only. Simple, loses history beyond the window.
Strategy B — Attention Sink + Sliding Window: Based on StreamingVLM (MIT+NVIDIA). Keep "sink" tokens permanently plus a sliding window of recent context. Proven stable for 3+ hours on H100.
Strategy C — KV Compression: Merge old frame KVs into compressed representations. MemStream achieves 16x compression via adaptive key selection; LiveVLM merges per-frame KVs into single tuples, processing 44x more frames on the same hardware.
Recommended path: start with A, iterate to B.
KV cache lifetime estimates (Qwen3-Omni 30B-A3B, A100 80GB, ~60GB available after model weights, 256 tokens/frame):
With eviction (any strategy), streaming duration becomes unlimited regardless of FPS. The "10" bar represents the axis cap — actual duration is unbounded.
The handler is model-agnostic — it builds standard chat completion requests. Each new model needs verification that image_url content blocks are processed correctly, plus a stage config for omni models with audio output.
Phase 7: Gradio Demo + Benchmarks
Gradio demo — extend the existing Qwen3-Omni demo with a "Live Camera" tab:
WebRTC or periodic snapshot capture from user's webcam
Real-time display of model responses (text + audio playback)
Motivation
Real-time video understanding — where a model continuously processes a live camera or video stream and responds to user queries about what it sees — is rapidly becoming a core capability of frontier AI platforms (Gemini Live, Doubao, Apple Visual Intelligence). vLLM-OMNI is uniquely positioned: it already has Qwen3-Omni's 3-stage pipeline, async chunk streaming, and audio-in-video interleaving. This RFC proposes adding streaming video input to complete the "Doubao experience": point camera, speak question, get spoken answer.
No open-source serving engine currently offers streaming video in with text+speech out. This RFC fills that gap.
Market Context
pie title AI Video Market Segments (2025 est., $B) "Visual Inspection — $30B" : 30 "Video Analytics — $13B" : 13 "Video Surveillance — $6B" : 6 "Robotics / Embodied AI — $5B" : 5Sources:
Benchmarks
xychart-beta title "StreamingBench Scores — higher is better (human = 91.66)" x-axis ["Human", "Seed1.5-VL", "Gemini 1.5 Pro", "MiniCPM-o 2.6", "GPT-4o", "Qwen2-VL 7B"] y-axis "Score (%)" 50 --> 95 bar [91.66, 82.80, 67.07, 66.01, 60.15, 54.14]Source: StreamingBench Leaderboard — 900 videos, 4,500 human-curated QA pairs, 18 tasks.
Even the best model trails humans by 9 points. Massive room for improvement.
Use Cases
Target Experience
sequenceDiagram participant U as User (Camera + Mic) participant WS as WebSocket participant T as Thinker participant TK as Talker participant CW as Code2Wav U->>WS: session.config {model, modalities} loop Every frame (1-4 FPS) U->>WS: video.frame {base64 JPEG} WS->>T: Vision encode + incremental prefill Note over T: KV cache grows as frames arrive end U->>WS: video.query "What do you see?" Note over T: Most prefill already done — append query tokens only T->>T: LLM generate (near-instant start) T-->>WS: response.text.delta "A person is..." T->>TK: Hidden states (async chunk) TK->>TK: Text → speech codes TK->>CW: Codec codes CW->>CW: Codes → waveform CW-->>WS: response.audio (binary) WS-->>U: Streaming text + audio U->>WS: video.donePhased Implementation
gantt title Implementation Roadmap dateFormat YYYY-MM-DD axisFormat %b section Phase 1 WebSocket endpoint + Qwen3-Omni :done, p1, 2026-03-26, 7d section Phase 2 EVS frame pruning :p2, after p1, 5d section Phase 3 Combined video + audio input :p3, after p2, 7d section Phase 4 Async chunk for video :p4, after p3, 5d section Phase 5 KV cache eviction :p5, after p4, 10d section Phase 6 Expand to other models :p6, after p4, 14d section Phase 7 Gradio demo + benchmarks :p7, after p5, 7dPhase 1: End-to-End Streaming Video with Qwen3-Omni
flowchart LR A[Webcam / Video File] -->|base64 JPEG| B[WebSocket Handler] B -->|Encode + Prefill| C[Thinker KV Cache] C -->|Query arrives| D[Qwen3-Omni Thinker] D -->|text tokens| E[Talker] E -->|codec codes| F[Code2Wav] D -.->|text deltas| G[Client] F -.->|audio bytes| GNew WebSocket endpoint
/v1/video/chat/stream. Frames arrive incrementally and are vision-encoded and prefilled into KV cache as they arrive — so by the time a query is submitted, most of the prefill is already done and TTFT is minimized. The handler appends query tokens to the existing KV cache and starts generation immediately. Responses stream back as text deltas and audio bytes.Key design: multi-turn within a single session. Frames persist in KV cache across queries — user can send more frames, then ask another question about the accumulated context.
Phase 2: EVS Frame Pruning for Streaming
flowchart LR subgraph "Without EVS" A1[Frame 1] --> B1[Encoder] A2[Frame 2<br>nearly identical] --> B1 A3[Frame 3<br>nearly identical] --> B1 A4[Frame 4<br>scene change] --> B1 B1 -->|4 frames| C1[LLM] end subgraph "With EVS" D1[Frame 1] --> E1{Similar to<br>last retained?} E1 -->|No| F1[Keep] D2[Frame 2] --> E2{Similar?} E2 -->|Yes| G1[Drop] D3[Frame 3] --> E3{Similar?} E3 -->|Yes| G2[Drop] D4[Frame 4] --> E4{Similar?} E4 -->|No| F2[Keep] F1 --> H1[Encoder] F2 --> H1 H1 -->|2 frames| I1[LLM] endUpstream vLLM already ships EVS (
multimodal/evs.py) which computes cosine similarity between consecutive frame embeddings and retains only the most dissimilar ones. For streaming, we add a lightweight pre-filter in the handler: compare each incoming frame against the last retained frame using pixel-level similarity. Frames above a configurable threshold (default 0.95) are dropped before reaching the vision encoder.Expected impact: 2-5x fewer frames processed for static/slow scenes, proportional reduction in encoder compute and KV cache usage.
Phase 3: Combined Video + Audio Input
flowchart TB subgraph "Client" CAM[Camera<br>1-4 FPS] -->|video.frame| WS[WebSocket] MIC[Microphone<br>16kHz PCM] -->|audio.chunk| WS end subgraph "Handler" WS --> VB[Video Buffer] WS --> AB[Audio Buffer] VB --> MRG[Merge into<br>Chat Request] AB --> MRG end subgraph "Qwen3-Omni Thinker" MRG -->|image_url + audio_url| INT[Audio-Video<br>Token Interleaving] INT --> |temporal alignment| LLM[LLM Decode] endExtend the WebSocket protocol with
audio.chunkevents alongsidevideo.frame. When a query is committed, the handler merges both buffers into a single chat request withimage_url+audio_urlcontent blocks. Qwen3-Omni's existinguse_audio_in_videomechanism interleaves audio and video tokens at the temporal level — splitting frames into chunks and alternating vision tokens with corresponding audio tokens.This enables the full "Doubao experience" — user speaks a question while showing their camera, model understands both modalities simultaneously.
Phase 4: Async Chunk for Video
sequenceDiagram participant TH as Thinker participant TK as Talker participant CW as Code2Wav participant CL as Client Note over TH: Prefill all video frames + query TH->>TH: Generate token 1 TH->>TK: Forward token 1 (async) TH->>TH: Generate token 2 TK->>TK: Process token 1 TH->>TK: Forward token 2 TH->>TH: Generate token 3 TK->>CW: First codec chunk (25 frames) TK->>TK: Process token 2 CW->>CL: First audio chunk TH->>TK: Forward token 3 Note over TH,CL: Talker starts before Thinker finishes!The async chunk infrastructure already exists for audio-only Qwen3-Omni (see
docs/design/feature/async_chunk_design.md). The key insight: it's model-input agnostic. Thethinker2talker_async_chunkprocessor works identically whether the thinker input was text, audio, or video — it only cares about the thinker's output hidden states. Enabling async chunk for video requires only a new stage config withasync_chunk: true.Impact from existing benchmarks: ~92% reduction in time-to-first-audio (6.5s to 0.52s at concurrency=1), ~17% E2E latency improvement at concurrency=10. Source:
docs/design/feature/async_chunk_design.mdinternal benchmarks.Phase 5: KV Cache Eviction for Long Streams
flowchart LR subgraph "KV Cache — fixed memory budget" direction TB S["Sink Tokens<br>System prompt + first frames<br>~512 tokens — never evicted"] VW["Vision Window<br>Last N seconds of frames<br>Sliding — oldest evicted"] TW["Text Window<br>Recent responses<br>Sliding — oldest evicted"] end NEW[New Frame] -->|Append| VW VW -->|Overflow| EVICT[Evict oldest<br>vision block] style S fill:#4CAF50,color:white style VW fill:#2196F3,color:white style TW fill:#FF9800,color:whiteFor long streaming sessions, the KV cache grows unboundedly. Three eviction strategies, in order of complexity:
Strategy A — Fixed Sliding Window: Keep KV for last N seconds only. Simple, loses history beyond the window.
Strategy B — Attention Sink + Sliding Window: Based on StreamingVLM (MIT+NVIDIA). Keep "sink" tokens permanently plus a sliding window of recent context. Proven stable for 3+ hours on H100.
Strategy C — KV Compression: Merge old frame KVs into compressed representations. MemStream achieves 16x compression via adaptive key selection; LiveVLM merges per-frame KVs into single tuples, processing 44x more frames on the same hardware.
Recommended path: start with A, iterate to B.
KV cache lifetime estimates (Qwen3-Omni 30B-A3B, A100 80GB, ~60GB available after model weights, 256 tokens/frame):
xychart-beta title "Streaming Duration Before KV Cache Fills" x-axis ["1 FPS", "2 FPS", "4 FPS", "4 FPS + FP8 (H100)", "Any FPS + Eviction"] y-axis "Hours" 0 --> 10 bar [4.5, 2.0, 1.0, 2.0, 10]Phase 6: Expand to Other Models
graph TD subgraph "Full Omni Pipeline — video+audio in/out" Q3O["Qwen3-Omni<br>(Phase 1)"] Q25O[Qwen2.5-Omni] MCPM[MiniCPM-o 2.6] end subgraph "Text-Only Output — video in, text out" Q3VL[Qwen3-VL] Q25VL[Qwen2.5-VL] IVL[InternVL 2.5/3] end Q3O -->|Same handler| Q25O Q3O -->|Adapter needed| MCPM Q3O -.->|Chat-only path| Q3VL Q3VL -.-> Q25VL Q3VL -.-> IVL style Q3O fill:#4CAF50,color:white style Q25O fill:#8BC34A,color:white style MCPM fill:#8BC34A,color:white style Q3VL fill:#90CAF9 style Q25VL fill:#90CAF9 style IVL fill:#90CAF9The handler is model-agnostic — it builds standard chat completion requests. Each new model needs verification that
image_urlcontent blocks are processed correctly, plus a stage config for omni models with audio output.Phase 7: Gradio Demo + Benchmarks
Gradio demo — extend the existing Qwen3-Omni demo with a "Live Camera" tab:
Benchmarks:
Related Work