[RFC]: [Cosmos3 optimization] Feature Support, Quantization, Hardware Recipes, and Scenario Optimization

### Motivation.

  Cosmos3 covers multiple inference scenarios, including T2I, T2V, I2V, video generation with sound, and action-related generation. These scenarios have different latency, memory, and hardware requirements, especially between Cosmos3-Nano and Cosmos3-Super.

  The goal of this analysis is to provide an evidence-based feature matrix for Cosmos3 across cache acceleration, CPU offload, parallelism, LoRA, quantization, device support, and hardware recipes. It also identifies which features are already supported, which are explicitly unsupported, and which still require validation.

  This document follows the same direction as the continuous diffusion acceleration tracking RFC: maintain a Model x Feature support matrix and clarify Feature x Feature compatibility gaps before recommending production configurations.

### Proposed Change.

#### Features Supported Table
The following table tracks Cosmos3 feature support across cache acceleration, offload, parallelism, quantization, LoRA, and execution features. It follows the same status convention as the continuous diffusion acceleration tracking RFC.

 | Model | TeaCache | Cache-DiT | UND K/V cache | SP (Ulysses & Ring) | CFG Parallel | Tensor Parallel | Pipeline Parallel | HSDP | CPU Offload (Layerwise) | VAE Patch Parallel | Quantization | LoRA Inference | Step Execution |
  |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
  | Cosmos3-Nano | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
  | Cosmos3-Super | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
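
For scripting against this matrix (for example, to flag an unsupported feature before launching a job), the table can be mirrored as plain data. The following is a minimal sketch: the dictionary layout and the `unsupported` helper are illustrative names, not part of vLLM-Omni, and the values are copied from the table above.

```python
# Column order copied from the feature table above.
FEATURES = [
    "TeaCache", "Cache-DiT", "UND K/V cache", "SP (Ulysses & Ring)",
    "CFG Parallel", "Tensor Parallel", "Pipeline Parallel", "HSDP",
    "CPU Offload (Layerwise)", "VAE Patch Parallel", "Quantization",
    "LoRA Inference", "Step Execution",
]

# Both variants currently share the same row (True = supported, False = unsupported).
ROW = [False, True, True, True, True, True, False, True, True, True, True, False, False]

# Hypothetical in-memory form of the Model x Feature matrix.
SUPPORT = {
    model: dict(zip(FEATURES, ROW))
    for model in ("Cosmos3-Nano", "Cosmos3-Super")
}


def unsupported(model: str) -> list:
    """Return the features marked unsupported for a model, in table order."""
    return [feature for feature, ok in SUPPORT[model].items() if not ok]


print(unsupported("Cosmos3-Nano"))
```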

#### Quantization
Cosmos3 is marked as quantization-supported at the framework level: its transformer passes the diffusion `quantization_config` into quantizable linear layers. The current Cosmos3 recipes, however, include neither method-specific validation nor recommended quantized checkpoints. This section therefore tracks framework-level support separately from the validation status of individual quantization methods.

  | Model | Quantization Framework | FP8 | NVFP4 |
  |---|---|---|---|
  | Cosmos3-Nano | ✅ | ❓ | ❓ |
  | Cosmos3-Super | ✅ | ❓ | ❓ |

Beyond the selected FP8 and NVFP4 entries, we welcome contributions for additional GPU and non-GPU quantization methods. On GPUs, potential directions include INT8, ModelOpt-based pre-quantized checkpoints, GGUF-style quantized transformer weights, and other hardware-aware formats supported by the vLLM-Omni quantization framework. On Ascend NPU, the current codebase includes NPU-oriented paths such as INT8, MXFP8, MXFP4, and MXFP4-DualScale, but there is no Cosmos3-specific NPU quantization recipe or validation result yet. Any contribution should clearly document the target device, quantization method, Cosmos3 variant, memory usage, latency, and output quality comparison against the BF16 baseline.
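
To make contributions comparable, the items listed above could be collected in a small structured record. This is only a sketch of one possible shape: the class name, field names, and units are assumptions chosen for illustration, not an existing vLLM-Omni schema, and the metrics are left empty because no Cosmos3 quantization results exist yet.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class QuantValidationReport:
    """Hypothetical record of what a quantization contribution should document."""

    device: str                             # target device, e.g. "1x NVIDIA H200"
    method: str                             # quantization method, e.g. "FP8"
    variant: str                            # "Cosmos3-Nano" or "Cosmos3-Super"
    peak_memory_gib: Optional[float] = None  # measured memory usage
    latency_s: Optional[float] = None        # measured end-to-end latency
    quality_vs_bf16: Optional[str] = None    # output quality comparison against BF16

    def missing(self) -> List[str]:
        """Names of the fields that still need a measured value."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# A report that names its target but has no measurements yet.
report = QuantValidationReport(device="1x NVIDIA H200", method="FP8", variant="Cosmos3-Nano")
print(report.missing())
```

A report would count as complete only once `missing()` returns an empty list.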

  #### Scenario Optimization

  Cosmos3 supports multiple inference scenarios through the same pipeline, but each scenario has different latency and memory characteristics. This section tracks the baseline configuration and applicable optimization methods for each scenario.
  | Scenario | Cosmos3-Nano | Cosmos3-Super | Baseline | Optimization | Notes |
  |---|---|---|---|---|---|
  | Text-to-Image (T2I) | ✅ | ✅ |  |  | `/v1/images/generations`. |
  | Text-to-Video (T2V) | ✅ | ✅ |  |  | `/v1/videos/sync` without reference image/video. |
  | Image-to-Video (I2V) | ✅ | ✅ |  |  | `/v1/videos/sync` with image `input_reference` or `image_reference`. |
  | Video-to-Video (V2V) | ✅ | ✅ |  |  | `/v1/videos/sync` with video `input_reference` or `video_reference`. Use `condition_frame_indexes_vision` and `condition_video_keep`. |
  | Text-to-Video with Sound (T2VS) | ✅ | ✅ |  |  | T2V plus `generate_sound=true`; output is muxed as MP4 with AAC audio. |
  | Image-to-Video with Sound (I2VS) | ✅ | ✅ |  |  | I2V plus `generate_sound=true`; output is muxed as MP4 with AAC audio. |
  | Action Forward Dynamics | ✅ | ✅ |  |  | Input: first frame or video + action trajectory. Output: rollout video. |
  | Action Policy | ✅ | ✅ |  |  | Input: first frame or video + language instruction. Output: predicted action trajectory + rollout video. |
  | Inverse Dynamics | ✅ | ✅ |  |  | Input: video. Output: recovered action trajectory. Use async `/v1/videos` to read the top-level `action` field. |
  | Video-Conditioned Action Generation | ✅ | ✅ |  |  | Input: video under action mode. Output depends on `action_mode`; not generic V2V. |
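
The generation scenarios in the table differ mainly in the endpoint and in a few request fields. The sketch below shows that mapping for the non-action scenarios. The endpoints and the `input_reference` and `generate_sound` names come from the table; the `prompt` field, the flat dictionary layout, and the `build_request` helper are assumptions for illustration, so check the server's API reference for the actual wire format. V2V-specific fields such as `condition_frame_indexes_vision` are left out because their value formats are not specified here.

```python
from typing import Any, Dict, Optional

IMAGE_ENDPOINT = "/v1/images/generations"
VIDEO_ENDPOINT = "/v1/videos/sync"

NEEDS_REFERENCE = ("I2V", "I2VS", "V2V")
WITH_SOUND = ("T2VS", "I2VS")
VIDEO_SCENARIOS = ("T2V", "T2VS", "I2V", "I2VS", "V2V")


def build_request(scenario: str, prompt: str, reference: Optional[str] = None) -> Dict[str, Any]:
    """Return the endpoint and request fields for a scenario abbreviation."""
    if scenario == "T2I":
        return {"endpoint": IMAGE_ENDPOINT, "fields": {"prompt": prompt}}
    if scenario not in VIDEO_SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario}")

    request_fields: Dict[str, Any] = {"prompt": prompt}  # "prompt" is an assumed field name
    if scenario in NEEDS_REFERENCE:
        if reference is None:
            raise ValueError(f"{scenario} needs a reference image or video")
        request_fields["input_reference"] = reference
    if scenario in WITH_SOUND:
        request_fields["generate_sound"] = True
    return {"endpoint": VIDEO_ENDPOINT, "fields": request_fields}


print(build_request("I2VS", "a boat leaving the harbor", reference="first_frame.png"))
```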

#### Recipes on varying devices
This section tracks documented and missing Cosmos3 deployment recipes across different hardware targets. A recipe should include the target device, model variant, baseline launch command, low-memory optimization flags, and validation results such as memory usage, latency, and output quality. Current documentation covers NVIDIA data-center GPUs, while low-VRAM consumer GPUs and NPU deployments still need validation.

| Device / Hardware | Cosmos3-Nano | Cosmos3-Super | Baseline Recipe | Low-VRAM / Optimization Recipe | Notes |
  |---|---|---|---|---|---|
  | 1x NVIDIA H200 / B300 | ✅ | 🙋 | Nano: `vllm serve nvidia/Cosmos3-Nano --omni --host 0.0.0.0 --port 8000 --init-timeout 1800` | Optional: `--enable-layerwise-offload` for smaller GPUs | Nano has documented 1-GPU online serving recipe. Super requires larger multi-GPU recipe in current docs. |
  | 2x NVIDIA H200 / B300 | ❓ | ✅ | Super: `vllm serve nvidia/Cosmos3-Super --omni --host 0.0.0.0 --port 8000 --cfg-parallel-size 2 --use-hsdp --hsdp-shard-size 2 --init-timeout 1800` | Optional: `--enable-layerwise-offload` | Documented minimum Super recipe. |
  | 8x NVIDIA H200 / H100 / A100 | ❓ | ✅ | Super: `vllm serve nvidia/Cosmos3-Super --omni --host 0.0.0.0 --port 8000 --cfg-parallel-size 2 --ulysses-degree 4 --use-hsdp --hsdp-shard-size 8 --init-timeout 1800` | Uses CFG parallel, Ulysses sequence parallel, and HSDP | Documented recommended Super recipe. |
  | Low-VRAM NVIDIA GPU, e.g. RTX 5090-class | 🙋 | 🙋 |  | Candidate: Nano + `--enable-layerwise-offload`; lower resolution, frame count, and steps; quantization after validation | No documented Cosmos3 recipe or memory result found for this class of GPU. Needs validation. |
  | Ascend NPU | 🙋 | 🙋 |  | Candidate: NPU quantization paths such as INT8, MXFP8, MXFP4, or MXFP4-DualScale after Cosmos3 validation | No documented Cosmos3 NPU recipe found. |
  | AMD GPU / ROCm | 🙋 | 🙋 |  | Candidate: ROCm backend validation and ROCm-compatible quantization / attention backend tuning after Cosmos3 validation | No documented Cosmos3 AMD ROCm recipe found. |
  | Intel XPU, e.g. Intel Arc B-Series | 🙋 | 🙋 |  | Candidate: XPU backend validation, AutoRound / MXFP8 / MXFP4 quantization paths after Cosmos3 validation | No documented Cosmos3 Intel XPU recipe found. |
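
The documented NVIDIA recipes share one command shape and differ only in their parallelism flags, which makes them easy to generate from a few parameters. The sketch below reproduces the three baseline commands from the table. The `serve_command` helper is hypothetical; the flags are exactly those that appear in the recipes above, and no relationship between flag values and GPU count is assumed beyond what the table lists.

```python
from typing import Optional


def serve_command(
    model: str,
    cfg_parallel_size: Optional[int] = None,
    ulysses_degree: Optional[int] = None,
    hsdp_shard_size: Optional[int] = None,
    layerwise_offload: bool = False,
) -> str:
    """Compose a `vllm serve` command line from the flags used in the recipes table."""
    args = ["vllm", "serve", model, "--omni", "--host", "0.0.0.0", "--port", "8000"]
    if cfg_parallel_size is not None:
        args += ["--cfg-parallel-size", str(cfg_parallel_size)]
    if ulysses_degree is not None:
        args += ["--ulysses-degree", str(ulysses_degree)]
    if hsdp_shard_size is not None:
        args += ["--use-hsdp", "--hsdp-shard-size", str(hsdp_shard_size)]
    if layerwise_offload:
        args.append("--enable-layerwise-offload")  # optional low-memory flag
    args += ["--init-timeout", "1800"]
    return " ".join(args)


# The three documented baseline recipes.
print(serve_command("nvidia/Cosmos3-Nano"))
print(serve_command("nvidia/Cosmos3-Super", cfg_parallel_size=2, hsdp_shard_size=2))
print(serve_command("nvidia/Cosmos3-Super", cfg_parallel_size=2, ulysses_degree=4, hsdp_shard_size=8))
```

Passing `layerwise_offload=True` appends the optional `--enable-layerwise-offload` flag from the "Low-VRAM / Optimization Recipe" column.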

### Feedback Period.

_No response_

### CC List.

@hsliuustc0106 @Gaohan123 @david6666666 @bjf-frz 

### Any Other Things.

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://vllm-omni.readthedocs.io), which can answer lots of frequently asked questions.
