Skip to content

[RFC] Streaming Video Input for Omni-Modal Real-Time Interaction #2201

@lishunyang12

Description

@lishunyang12

Motivation

Real-time video understanding — where a model continuously processes a live camera or video stream and responds to user queries about what it sees — is rapidly becoming a core capability of frontier AI platforms (Gemini Live, Doubao, Apple Visual Intelligence). vLLM-OMNI is uniquely positioned: it already has Qwen3-Omni's 3-stage pipeline, async chunk streaming, and audio-in-video interleaving. This RFC proposes adding streaming video input to complete the "Doubao experience": point camera, speak question, get spoken answer.

No open-source serving engine currently offers streaming video in with text+speech out. This RFC fills that gap.

Market Context

pie title AI Video Market Segments (2025 est., $B)
    "Visual Inspection — $30B" : 30
    "Video Analytics — $13B" : 13
    "Video Surveillance — $6B" : 6
    "Robotics / Embodied AI — $5B" : 5
Loading

Sources:

Benchmarks

xychart-beta
    title "StreamingBench Scores — higher is better (human = 91.66)"
    x-axis ["Human", "Seed1.5-VL", "Gemini 1.5 Pro", "MiniCPM-o 2.6", "GPT-4o", "Qwen2-VL 7B"]
    y-axis "Score (%)" 50 --> 95
    bar [91.66, 82.80, 67.07, 66.01, 60.15, 54.14]
Loading

Source: StreamingBench Leaderboard — 900 videos, 4,500 human-curated QA pairs, 18 tasks.

Even the best model trails humans by 9 points. Massive room for improvement.

Use Cases

Use Case FPS Latency Target
Interactive video call (Doubao-style) 1-4 <1s
Robotics / embodied AI 2-8 <500ms
Manufacturing QC 1-2 <1s
Security / surveillance 0.5-2 <2s
Accessibility (scene narration) 1-2 <1s

Target Experience

sequenceDiagram
    participant U as User (Camera + Mic)
    participant WS as WebSocket
    participant T as Thinker
    participant TK as Talker
    participant CW as Code2Wav

    U->>WS: session.config {model, modalities}
    loop Every frame (1-4 FPS)
        U->>WS: video.frame {base64 JPEG}
        WS->>T: Vision encode + incremental prefill
        Note over T: KV cache grows as frames arrive
    end
    U->>WS: video.query "What do you see?"

    Note over T: Most prefill already done — append query tokens only
    T->>T: LLM generate (near-instant start)
    T-->>WS: response.text.delta "A person is..."
    T->>TK: Hidden states (async chunk)
    TK->>TK: Text → speech codes
    TK->>CW: Codec codes
    CW->>CW: Codes → waveform
    CW-->>WS: response.audio (binary)

    WS-->>U: Streaming text + audio
    U->>WS: video.done
Loading

Phased Implementation

gantt
    title Implementation Roadmap
    dateFormat  YYYY-MM-DD
    axisFormat %b

    section Phase 1
    WebSocket endpoint + Qwen3-Omni       :done, p1, 2026-03-26, 7d

    section Phase 2
    EVS frame pruning                      :p2, after p1, 5d

    section Phase 3
    Combined video + audio input           :p3, after p2, 7d

    section Phase 4
    Async chunk for video                  :p4, after p3, 5d

    section Phase 5
    KV cache eviction                      :p5, after p4, 10d

    section Phase 6
    Expand to other models                 :p6, after p4, 14d

    section Phase 7
    Gradio demo + benchmarks               :p7, after p5, 7d
Loading

Phase 1: End-to-End Streaming Video with Qwen3-Omni

Status: PR #2202

flowchart LR
    A[Webcam / Video File] -->|base64 JPEG| B[WebSocket Handler]
    B -->|Encode + Prefill| C[Thinker KV Cache]
    C -->|Query arrives| D[Qwen3-Omni Thinker]
    D -->|text tokens| E[Talker]
    E -->|codec codes| F[Code2Wav]
    D -.->|text deltas| G[Client]
    F -.->|audio bytes| G
Loading

New WebSocket endpoint /v1/video/chat/stream. Frames arrive incrementally and are vision-encoded and prefilled into KV cache as they arrive — so by the time a query is submitted, most of the prefill is already done and TTFT is minimized. The handler appends query tokens to the existing KV cache and starts generation immediately. Responses stream back as text deltas and audio bytes.

Key design: multi-turn within a single session. Frames persist in KV cache across queries — user can send more frames, then ask another question about the accumulated context.


Phase 2: EVS Frame Pruning for Streaming

flowchart LR
    subgraph "Without EVS"
        A1[Frame 1] --> B1[Encoder]
        A2[Frame 2<br>nearly identical] --> B1
        A3[Frame 3<br>nearly identical] --> B1
        A4[Frame 4<br>scene change] --> B1
        B1 -->|4 frames| C1[LLM]
    end

    subgraph "With EVS"
        D1[Frame 1] --> E1{Similar to<br>last retained?}
        E1 -->|No| F1[Keep]
        D2[Frame 2] --> E2{Similar?}
        E2 -->|Yes| G1[Drop]
        D3[Frame 3] --> E3{Similar?}
        E3 -->|Yes| G2[Drop]
        D4[Frame 4] --> E4{Similar?}
        E4 -->|No| F2[Keep]
        F1 --> H1[Encoder]
        F2 --> H1
        H1 -->|2 frames| I1[LLM]
    end
Loading

Upstream vLLM already ships EVS (multimodal/evs.py) which computes cosine similarity between consecutive frame embeddings and retains only the most dissimilar ones. For streaming, we add a lightweight pre-filter in the handler: compare each incoming frame against the last retained frame using pixel-level similarity. Frames above a configurable threshold (default 0.95) are dropped before reaching the vision encoder.

Expected impact: 2-5x fewer frames processed for static/slow scenes, proportional reduction in encoder compute and KV cache usage.


Phase 3: Combined Video + Audio Input

flowchart TB
    subgraph "Client"
        CAM[Camera<br>1-4 FPS] -->|video.frame| WS[WebSocket]
        MIC[Microphone<br>16kHz PCM] -->|audio.chunk| WS
    end

    subgraph "Handler"
        WS --> VB[Video Buffer]
        WS --> AB[Audio Buffer]
        VB --> MRG[Merge into<br>Chat Request]
        AB --> MRG
    end

    subgraph "Qwen3-Omni Thinker"
        MRG -->|image_url + audio_url| INT[Audio-Video<br>Token Interleaving]
        INT --> |temporal alignment| LLM[LLM Decode]
    end
Loading

Extend the WebSocket protocol with audio.chunk events alongside video.frame. When a query is committed, the handler merges both buffers into a single chat request with image_url + audio_url content blocks. Qwen3-Omni's existing use_audio_in_video mechanism interleaves audio and video tokens at the temporal level — splitting frames into chunks and alternating vision tokens with corresponding audio tokens.

This enables the full "Doubao experience" — user speaks a question while showing their camera, model understands both modalities simultaneously.


Phase 4: Async Chunk for Video

sequenceDiagram
    participant TH as Thinker
    participant TK as Talker
    participant CW as Code2Wav
    participant CL as Client

    Note over TH: Prefill all video frames + query

    TH->>TH: Generate token 1
    TH->>TK: Forward token 1 (async)
    TH->>TH: Generate token 2
    TK->>TK: Process token 1
    TH->>TK: Forward token 2
    TH->>TH: Generate token 3
    TK->>CW: First codec chunk (25 frames)
    TK->>TK: Process token 2
    CW->>CL: First audio chunk
    TH->>TK: Forward token 3

    Note over TH,CL: Talker starts before Thinker finishes!
Loading

The async chunk infrastructure already exists for audio-only Qwen3-Omni (see docs/design/feature/async_chunk_design.md). The key insight: it's model-input agnostic. The thinker2talker_async_chunk processor works identically whether the thinker input was text, audio, or video — it only cares about the thinker's output hidden states. Enabling async chunk for video requires only a new stage config with async_chunk: true.

Impact from existing benchmarks: ~92% reduction in time-to-first-audio (6.5s to 0.52s at concurrency=1), ~17% E2E latency improvement at concurrency=10. Source: docs/design/feature/async_chunk_design.md internal benchmarks.


Phase 5: KV Cache Eviction for Long Streams

flowchart LR
    subgraph "KV Cache — fixed memory budget"
        direction TB
        S["Sink Tokens<br>System prompt + first frames<br>~512 tokens — never evicted"]
        VW["Vision Window<br>Last N seconds of frames<br>Sliding — oldest evicted"]
        TW["Text Window<br>Recent responses<br>Sliding — oldest evicted"]
    end

    NEW[New Frame] -->|Append| VW
    VW -->|Overflow| EVICT[Evict oldest<br>vision block]

    style S fill:#4CAF50,color:white
    style VW fill:#2196F3,color:white
    style TW fill:#FF9800,color:white
Loading

For long streaming sessions, the KV cache grows unboundedly. Three eviction strategies, in order of complexity:

Strategy A — Fixed Sliding Window: Keep KV for last N seconds only. Simple, loses history beyond the window.

Strategy B — Attention Sink + Sliding Window: Based on StreamingVLM (MIT+NVIDIA). Keep "sink" tokens permanently plus a sliding window of recent context. Proven stable for 3+ hours on H100.

Strategy C — KV Compression: Merge old frame KVs into compressed representations. MemStream achieves 16x compression via adaptive key selection; LiveVLM merges per-frame KVs into single tuples, processing 44x more frames on the same hardware.

Recommended path: start with A, iterate to B.

KV cache lifetime estimates (Qwen3-Omni 30B-A3B, A100 80GB, ~60GB available after model weights, 256 tokens/frame):

xychart-beta
    title "Streaming Duration Before KV Cache Fills"
    x-axis ["1 FPS", "2 FPS", "4 FPS", "4 FPS + FP8 (H100)", "Any FPS + Eviction"]
    y-axis "Hours" 0 --> 10
    bar [4.5, 2.0, 1.0, 2.0, 10]
Loading

With eviction (any strategy), streaming duration becomes unlimited regardless of FPS. The "10" bar represents the axis cap — actual duration is unbounded.


Phase 6: Expand to Other Models

graph TD
    subgraph "Full Omni Pipeline — video+audio in/out"
        Q3O["Qwen3-Omni<br>(Phase 1)"]
        Q25O[Qwen2.5-Omni]
        MCPM[MiniCPM-o 2.6]
    end

    subgraph "Text-Only Output — video in, text out"
        Q3VL[Qwen3-VL]
        Q25VL[Qwen2.5-VL]
        IVL[InternVL 2.5/3]
    end

    Q3O -->|Same handler| Q25O
    Q3O -->|Adapter needed| MCPM
    Q3O -.->|Chat-only path| Q3VL
    Q3VL -.-> Q25VL
    Q3VL -.-> IVL

    style Q3O fill:#4CAF50,color:white
    style Q25O fill:#8BC34A,color:white
    style MCPM fill:#8BC34A,color:white
    style Q3VL fill:#90CAF9
    style Q25VL fill:#90CAF9
    style IVL fill:#90CAF9
Loading

The handler is model-agnostic — it builds standard chat completion requests. Each new model needs verification that image_url content blocks are processed correctly, plus a stage config for omni models with audio output.


Phase 7: Gradio Demo + Benchmarks

Gradio demo — extend the existing Qwen3-Omni demo with a "Live Camera" tab:

  • WebRTC or periodic snapshot capture from user's webcam
  • Real-time display of model responses (text + audio playback)
  • Controls: FPS slider, query input, modality toggles, EVS threshold

Benchmarks:

  • StreamingBench: 900 videos, 4,500 QA pairs, 18 tasks — establish baseline score
  • Latency: TTFT, TTFA, E2E at various FPS and concurrency levels
  • Memory: KV cache usage over time at 1/2/4 FPS with and without eviction
  • Throughput: max concurrent streaming sessions on A100/H100

Related Work

Paper / Project Key Contribution Reference
StreamingVLM 8 FPS on H100, 3+ hours, attention sink approach arXiv 2510.09608
LiveVLM Training-free KV compression, 44x more frames arXiv 2505.15269
MemStream 16x adaptive KV compression arXiv 2602.18434
StreamingBench Evaluation benchmark, 18 tasks, human baseline 91.66% streamingbench.github.io
NVIDIA Live VLM WebUI WebRTC webcam to VLM backend GitHub
Qwen3-Omni 507ms video+audio latency, SOTA 32/36 AV benchmarks GitHub
vLLM Streaming Input StreamingInput API, /v1/realtime WebSocket vllm-project/vllm#25066

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions