[RFC] Streaming Video Input for Omni-Modal Real-Time Interaction

## Motivation

Real-time video understanding — where a model continuously processes a live camera or video stream and responds to user queries about what it sees — is rapidly becoming a core capability of frontier AI platforms (Gemini Live, Doubao, Apple Visual Intelligence). vLLM-OMNI is uniquely positioned: it already has Qwen3-Omni's 3-stage pipeline, async chunk streaming, and audio-in-video interleaving. This RFC proposes adding streaming video input to complete the "Doubao experience": point camera, speak question, get spoken answer.

**No open-source serving engine currently offers streaming video in with text+speech out.** This RFC fills that gap.

### Market Context

```mermaid
pie title AI Video Market Segments (2025 est., $B)
 "Visual Inspection — $30B" : 30
 "Video Analytics — $13B" : 13
 "Video Surveillance — $6B" : 6
 "Robotics / Embodied AI — $5B" : 5
```

Sources:
- Visual Inspection: [Mordor Intelligence, AI Visual Inspection Market 2025](https://www.mordorintelligence.com/industry-reports/ai-powered-visual-inspection-market)
- Video Analytics: [Mordor Intelligence, AI Video Analytics Market 2025](https://www.mordorintelligence.com/industry-reports/global-ai-video-analytics-market)
- Video Surveillance: [SNS Insider, AI in Video Surveillance Market 2025](https://www.globenewswire.com/news-release/2026/03/23/3260344/0/en/AI-in-Video-Surveillance-Market-Size-to-Worth-USD-49-02-Billion-by-2035-Research-by-SNS-Insider.html)
- AI video company funding $3.08B in 2025 (+94.6% YoY): [Crunchbase, AI Funding Trends 2025](https://news.crunchbase.com/ai/big-funding-trends-charts-eoy-2025/)

### Benchmarks

```mermaid
xychart-beta
 title "StreamingBench Scores — higher is better (human = 91.66)"
 x-axis ["Human", "Seed1.5-VL", "Gemini 1.5 Pro", "MiniCPM-o 2.6", "GPT-4o", "Qwen2-VL 7B"]
 y-axis "Score (%)" 50 --> 95
 bar [91.66, 82.80, 67.07, 66.01, 60.15, 54.14]
```

Source: [StreamingBench Leaderboard](https://streamingbench.github.io/) — 900 videos, 4,500 human-curated QA pairs, 18 tasks.

Even the best model trails humans by **9 points**. Massive room for improvement.

### Use Cases

| Use Case | FPS | Latency Target |
|----------|-----|----------------|
| Interactive video call (Doubao-style) | 1-4 | <1s |
| Robotics / embodied AI | 2-8 | <500ms |
| Manufacturing QC | 1-2 | <1s |
| Security / surveillance | 0.5-2 | <2s |
| Accessibility (scene narration) | 1-2 | <1s |

---

## Target Experience

```mermaid
sequenceDiagram
 participant U as User (Camera + Mic)
 participant WS as WebSocket
 participant T as Thinker
 participant TK as Talker
 participant CW as Code2Wav

 U->>WS: session.config {model, modalities}
 loop Every frame (1-4 FPS)
 U->>WS: video.frame {base64 JPEG}
 WS->>T: Vision encode + incremental prefill
 Note over T: KV cache grows as frames arrive
 end
 U->>WS: video.query "What do you see?"

 Note over T: Most prefill already done — append query tokens only
 T->>T: LLM generate (near-instant start)
 T-->>WS: response.text.delta "A person is..."
 T->>TK: Hidden states (async chunk)
 TK->>TK: Text → speech codes
 TK->>CW: Codec codes
 CW->>CW: Codes → waveform
 CW-->>WS: response.audio (binary)

 WS-->>U: Streaming text + audio
 U->>WS: video.done
```

---

## Phased Implementation

```mermaid
gantt
 title Implementation Roadmap
 dateFormat YYYY-MM-DD
 axisFormat %b

 section Phase 1
 WebSocket endpoint + Qwen3-Omni :done, p1, 2026-03-26, 7d

 section Phase 2
 EVS frame pruning :p2, after p1, 5d

 section Phase 3
 Combined video + audio input :p3, after p2, 7d

 section Phase 4
 Async chunk for video :p4, after p3, 5d

 section Phase 5
 KV cache eviction :p5, after p4, 10d

 section Phase 6
 Expand to other models :p6, after p4, 14d

 section Phase 7
 Gradio demo + benchmarks :p7, after p5, 7d
```

---

### Phase 1: End-to-End Streaming Video with Qwen3-Omni

> **Status:** PR #2202

```mermaid
flowchart LR
 A[Webcam / Video File] -->|base64 JPEG| B[WebSocket Handler]
 B -->|Encode + Prefill| C[Thinker KV Cache]
 C -->|Query arrives| D[Qwen3-Omni Thinker]
 D -->|text tokens| E[Talker]
 E -->|codec codes| F[Code2Wav]
 D -.->|text deltas| G[Client]
 F -.->|audio bytes| G
```

New WebSocket endpoint `/v1/video/chat/stream`. Frames arrive incrementally and are **vision-encoded and prefilled into KV cache as they arrive** — so by the time a query is submitted, most of the prefill is already done and TTFT is minimized. The handler appends query tokens to the existing KV cache and starts generation immediately. Responses stream back as text deltas and audio bytes.

Key design: multi-turn within a single session. Frames persist in KV cache across queries — user can send more frames, then ask another question about the accumulated context.

---

### Phase 2: EVS Frame Pruning for Streaming

```mermaid
flowchart LR
 subgraph "Without EVS"
 A1[Frame 1] --> B1[Encoder]
 A2[Frame 2 nearly identical] --> B1
 A3[Frame 3 nearly identical] --> B1
 A4[Frame 4 scene change] --> B1
 B1 -->|4 frames| C1[LLM]
 end

 subgraph "With EVS"
 D1[Frame 1] --> E1{Similar to last retained?}
 E1 -->|No| F1[Keep]
 D2[Frame 2] --> E2{Similar?}
 E2 -->|Yes| G1[Drop]
 D3[Frame 3] --> E3{Similar?}
 E3 -->|Yes| G2[Drop]
 D4[Frame 4] --> E4{Similar?}
 E4 -->|No| F2[Keep]
 F1 --> H1[Encoder]
 F2 --> H1
 H1 -->|2 frames| I1[LLM]
 end
```

Upstream vLLM already ships EVS (`multimodal/evs.py`) which computes cosine similarity between consecutive frame embeddings and retains only the most dissimilar ones. For streaming, we add a lightweight **pre-filter** in the handler: compare each incoming frame against the last retained frame using pixel-level similarity. Frames above a configurable threshold (default 0.95) are dropped before reaching the vision encoder.

Expected impact: 2-5x fewer frames processed for static/slow scenes, proportional reduction in encoder compute and KV cache usage.

---

### Phase 3: Combined Video + Audio Input

```mermaid
flowchart TB
 subgraph "Client"
 CAM[Camera 1-4 FPS] -->|video.frame| WS[WebSocket]
 MIC[Microphone 16kHz PCM] -->|audio.chunk| WS
 end

 subgraph "Handler"
 WS --> VB[Video Buffer]
 WS --> AB[Audio Buffer]
 VB --> MRG[Merge into Chat Request]
 AB --> MRG
 end

 subgraph "Qwen3-Omni Thinker"
 MRG -->|image_url + audio_url| INT[Audio-Video Token Interleaving]
 INT --> |temporal alignment| LLM[LLM Decode]
 end
```

Extend the WebSocket protocol with `audio.chunk` events alongside `video.frame`. When a query is committed, the handler merges both buffers into a single chat request with `image_url` + `audio_url` content blocks. Qwen3-Omni's existing `use_audio_in_video` mechanism interleaves audio and video tokens at the temporal level — splitting frames into chunks and alternating vision tokens with corresponding audio tokens.

This enables the full "Doubao experience" — user speaks a question while showing their camera, model understands both modalities simultaneously.

---

### Phase 4: Async Chunk for Video

```mermaid
sequenceDiagram
 participant TH as Thinker
 participant TK as Talker
 participant CW as Code2Wav
 participant CL as Client

 Note over TH: Prefill all video frames + query

 TH->>TH: Generate token 1
 TH->>TK: Forward token 1 (async)
 TH->>TH: Generate token 2
 TK->>TK: Process token 1
 TH->>TK: Forward token 2
 TH->>TH: Generate token 3
 TK->>CW: First codec chunk (25 frames)
 TK->>TK: Process token 2
 CW->>CL: First audio chunk
 TH->>TK: Forward token 3

 Note over TH,CL: Talker starts before Thinker finishes!
```

The async chunk infrastructure already exists for audio-only Qwen3-Omni (see `docs/design/feature/async_chunk_design.md`). The key insight: **it's model-input agnostic**. The `thinker2talker_async_chunk` processor works identically whether the thinker input was text, audio, or video — it only cares about the thinker's output hidden states. Enabling async chunk for video requires only a new stage config with `async_chunk: true`.

Impact from existing benchmarks: ~92% reduction in time-to-first-audio (6.5s to 0.52s at concurrency=1), ~17% E2E latency improvement at concurrency=10. Source: `docs/design/feature/async_chunk_design.md` internal benchmarks.

---

### Phase 5: KV Cache Eviction for Long Streams

```mermaid
flowchart LR
 subgraph "KV Cache — fixed memory budget"
 direction TB
 S["Sink Tokens System prompt + first frames ~512 tokens — never evicted"]
 VW["Vision Window Last N seconds of frames Sliding — oldest evicted"]
 TW["Text Window Recent responses Sliding — oldest evicted"]
 end

 NEW[New Frame] -->|Append| VW
 VW -->|Overflow| EVICT[Evict oldest vision block]

 style S fill:#4CAF50,color:white
 style VW fill:#2196F3,color:white
 style TW fill:#FF9800,color:white
```

For long streaming sessions, the KV cache grows unboundedly. Three eviction strategies, in order of complexity:

**Strategy A — Fixed Sliding Window:** Keep KV for last N seconds only. Simple, loses history beyond the window.

**Strategy B — Attention Sink + Sliding Window:** Based on [StreamingVLM](https://arxiv.org/abs/2510.09608) (MIT+NVIDIA). Keep "sink" tokens permanently plus a sliding window of recent context. Proven stable for 3+ hours on H100.

**Strategy C — KV Compression:** Merge old frame KVs into compressed representations. [MemStream](https://arxiv.org/html/2602.18434) achieves 16x compression via adaptive key selection; [LiveVLM](https://arxiv.org/abs/2505.15269) merges per-frame KVs into single tuples, processing 44x more frames on the same hardware.

Recommended path: start with A, iterate to B.

**KV cache lifetime estimates** (Qwen3-Omni 30B-A3B, A100 80GB, ~60GB available after model weights, 256 tokens/frame):

```mermaid
xychart-beta
 title "Streaming Duration Before KV Cache Fills"
 x-axis ["1 FPS", "2 FPS", "4 FPS", "4 FPS + FP8 (H100)", "Any FPS + Eviction"]
 y-axis "Hours" 0 --> 10
 bar [4.5, 2.0, 1.0, 2.0, 10]
```

> With eviction (any strategy), streaming duration becomes **unlimited** regardless of FPS. The "10" bar represents the axis cap — actual duration is unbounded.

---

### Phase 6: Expand to Other Models

```mermaid
graph TD
 subgraph "Full Omni Pipeline — video+audio in/out"
 Q3O["Qwen3-Omni (Phase 1)"]
 Q25O[Qwen2.5-Omni]
 MCPM[MiniCPM-o 2.6]
 end

 subgraph "Text-Only Output — video in, text out"
 Q3VL[Qwen3-VL]
 Q25VL[Qwen2.5-VL]
 IVL[InternVL 2.5/3]
 end

 Q3O -->|Same handler| Q25O
 Q3O -->|Adapter needed| MCPM
 Q3O -.->|Chat-only path| Q3VL
 Q3VL -.-> Q25VL
 Q3VL -.-> IVL

 style Q3O fill:#4CAF50,color:white
 style Q25O fill:#8BC34A,color:white
 style MCPM fill:#8BC34A,color:white
 style Q3VL fill:#90CAF9
 style Q25VL fill:#90CAF9
 style IVL fill:#90CAF9
```

The handler is model-agnostic — it builds standard chat completion requests. Each new model needs verification that `image_url` content blocks are processed correctly, plus a stage config for omni models with audio output.

---

### Phase 7: Gradio Demo + Benchmarks

**Gradio demo** — extend the existing Qwen3-Omni demo with a "Live Camera" tab:
- WebRTC or periodic snapshot capture from user's webcam
- Real-time display of model responses (text + audio playback)
- Controls: FPS slider, query input, modality toggles, EVS threshold

**Benchmarks:**
- [StreamingBench](https://streamingbench.github.io/): 900 videos, 4,500 QA pairs, 18 tasks — establish baseline score
- Latency: TTFT, TTFA, E2E at various FPS and concurrency levels
- Memory: KV cache usage over time at 1/2/4 FPS with and without eviction
- Throughput: max concurrent streaming sessions on A100/H100

---

## Related Work

| Paper / Project | Key Contribution | Reference |
|----------------|-----------------|-----------|
| StreamingVLM | 8 FPS on H100, 3+ hours, attention sink approach | [arXiv 2510.09608](https://arxiv.org/abs/2510.09608) |
| LiveVLM | Training-free KV compression, 44x more frames | [arXiv 2505.15269](https://arxiv.org/abs/2505.15269) |
| MemStream | 16x adaptive KV compression | [arXiv 2602.18434](https://arxiv.org/html/2602.18434) |
| StreamingBench | Evaluation benchmark, 18 tasks, human baseline 91.66% | [streamingbench.github.io](https://streamingbench.github.io/) |
| NVIDIA Live VLM WebUI | WebRTC webcam to VLM backend | [GitHub](https://github.com/NVIDIA-AI-IOT/live-vlm-webui) |
| Qwen3-Omni | 507ms video+audio latency, SOTA 32/36 AV benchmarks | [GitHub](https://github.com/QwenLM/Qwen3-Omni) |
| vLLM Streaming Input | StreamingInput API, /v1/realtime WebSocket | [vllm-project/vllm#25066](https://github.com/vllm-project/vllm/issues/25066) |

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Streaming Video Input for Omni-Modal Real-Time Interaction #2201

Motivation

Market Context

Benchmarks

Use Cases

Target Experience

Phased Implementation

Phase 1: End-to-End Streaming Video with Qwen3-Omni

Phase 2: EVS Frame Pruning for Streaming

Phase 3: Combined Video + Audio Input

Phase 4: Async Chunk for Video

Phase 5: KV Cache Eviction for Long Streams

Phase 6: Expand to Other Models

Phase 7: Gradio Demo + Benchmarks

Related Work

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Use Case	FPS	Latency Target
Interactive video call (Doubao-style)	1-4	<1s
Robotics / embodied AI	2-8	<500ms
Manufacturing QC	1-2	<1s
Security / surveillance	0.5-2	<2s
Accessibility (scene narration)	1-2	<1s

Paper / Project	Key Contribution	Reference
StreamingVLM	8 FPS on H100, 3+ hours, attention sink approach	arXiv 2510.09608
LiveVLM	Training-free KV compression, 44x more frames	arXiv 2505.15269
MemStream	16x adaptive KV compression	arXiv 2602.18434
StreamingBench	Evaluation benchmark, 18 tasks, human baseline 91.66%	streamingbench.github.io
NVIDIA Live VLM WebUI	WebRTC webcam to VLM backend	GitHub
Qwen3-Omni	507ms video+audio latency, SOTA 32/36 AV benchmarks	GitHub
vLLM Streaming Input	StreamingInput API, /v1/realtime WebSocket	vllm-project/vllm#25066

[RFC] Streaming Video Input for Omni-Modal Real-Time Interaction #2201

Description

Motivation

Market Context

Benchmarks

Use Cases

Target Experience

Phased Implementation

Phase 1: End-to-End Streaming Video with Qwen3-Omni

Phase 2: EVS Frame Pruning for Streaming

Phase 3: Combined Video + Audio Input

Phase 4: Async Chunk for Video

Phase 5: KV Cache Eviction for Long Streams

Phase 6: Expand to Other Models

Phase 7: Gradio Demo + Benchmarks

Related Work

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions