[RFC]: [Cosmos3 optimization] Feature Support, Quantization, Hardware Recipes, and Scenario Optimization #4340

@bjf-frz

Motivation.

Cosmos3 covers multiple inference scenarios, including T2I, T2V, I2V, video generation with sound, and action-related generation. These scenarios have different latency, memory, and hardware requirements, especially between Cosmos3-Nano and Cosmos3-Super.

The goal of this analysis is to provide an evidence-based feature matrix for Cosmos3 across cache acceleration, CPU offload, parallelism, LoRA, quantization, device support, and hardware recipes. It also identifies which features are already supported, which are explicitly unsupported, and which still require validation.

This document follows the same direction as the continuous diffusion acceleration tracking RFC: maintain a Model x Feature support matrix and clarify Feature x Feature compatibility gaps before recommending production configurations.

Proposed Change.

Features Supported Table

The following table tracks Cosmos3 feature support across cache acceleration, offload, parallelism, quantization, LoRA, and execution features. It follows the same status convention as the continuous diffusion acceleration tracking RFC.

Model TeaCache Cache-DiT UND K/V cache SP (Ulysses & Ring) CFG Parallel Tensor Parallel Pipeline Parallel HSDP CPU Offload (Layerwise) VAE Patch Parallel Quantization LoRA Inference Step Execution
Cosmos3-Nano
Cosmos3-Super
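As a rough illustration of how the Model x Feature matrix could be tracked in code, the sketch below keeps one status per cell and lists the cells that still need work. The `Status` names mirror the three categories in the Motivation section (supported, explicitly unsupported, requires validation). The feature list is only a subset of the columns above, and `empty_matrix` and `open_items` are hypothetical helpers, not part of vLLM-Omni. No real status is filled in, because the table does not record any here.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Status(Enum):
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"
    NEEDS_VALIDATION = "needs validation"


MODELS = ("Cosmos3-Nano", "Cosmos3-Super")

# Subset of the table columns, chosen for illustration only.
FEATURES = (
    "TeaCache",
    "Cache-DiT",
    "CFG Parallel",
    "Tensor Parallel",
    "Pipeline Parallel",
    "HSDP",
    "VAE Patch Parallel",
)

# None means "no status recorded yet"; it is not a claim about support.
Matrix = Dict[str, Dict[str, Optional[Status]]]


def empty_matrix() -> Matrix:
    """Create a matrix with every cell unrecorded."""
    return {model: {feature: None for feature in FEATURES} for model in MODELS}


def open_items(matrix: Matrix) -> List[Tuple[str, str]]:
    """Return (model, feature) cells that are unrecorded or need validation."""
    return [
        (model, feature)
        for model, row in matrix.items()
        for feature, status in row.items()
        if status is None or status is Status.NEEDS_VALIDATION
    ]
```

Keeping "unrecorded" separate from "needs validation" avoids implying that a feature was reviewed when the cell is simply empty.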

Quantization

Cosmos3 is marked as quantization-supported at the framework level, and its transformer passes the diffusion quantization_config into quantizable linear layers. The current Cosmos3 recipes, however, do not include method-specific validation or recommended quantized checkpoints. Therefore, this section tracks both the framework-level quantization support and the validation status of selected quantization methods.

Model Quantization Framework FP8 NVFP4
Cosmos3-Nano
Cosmos3-Super

Beyond the selected FP8 and NVFP4 entries, we welcome contributions for additional GPU and non-GPU quantization methods. On GPUs, potential directions include INT8, ModelOpt-based pre-quantized checkpoints, GGUF-style quantized transformer weights, and other hardware-aware formats supported by the vLLM-Omni quantization framework. On Ascend NPU, the current codebase includes NPU-oriented paths such as INT8, MXFP8, MXFP4, and MXFP4-DualScale, but there is no Cosmos3-specific NPU quantization recipe or validation result yet. Any contribution should clearly document the target device, quantization method, Cosmos3 variant, memory usage, latency, and output quality comparison against the BF16 baseline.
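One possible way to make the documentation requirement above checkable is a small record type that a contributor fills in before submitting a result. This is a minimal sketch: `QuantReport`, its field names, and the units (GiB, seconds) are assumptions made for illustration and do not come from the vLLM-Omni codebase.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class QuantReport:
    """Hypothetical record for one quantization validation run."""

    device: Optional[str] = None                  # target device
    method: Optional[str] = None                  # e.g. "FP8" or "NVFP4"
    variant: Optional[str] = None                 # e.g. "Cosmos3-Nano"
    peak_memory_gib: Optional[float] = None       # quantized run
    latency_s: Optional[float] = None             # quantized run
    bf16_peak_memory_gib: Optional[float] = None  # BF16 baseline
    bf16_latency_s: Optional[float] = None        # BF16 baseline
    quality_vs_bf16: Optional[str] = None         # how output quality was compared


def missing_fields(report: QuantReport) -> List[str]:
    """Names of the fields that have not been filled in yet."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


def speedup_vs_bf16(report: QuantReport) -> float:
    """Latency ratio baseline/quantized; values above 1.0 mean faster than BF16."""
    if report.latency_s is None or report.bf16_latency_s is None:
        raise ValueError("both latency values are required")
    if report.latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return report.bf16_latency_s / report.latency_s
```

A report would be considered complete only when `missing_fields` returns an empty list.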

Scenario Optimization

Cosmos3 supports multiple inference scenarios through the same pipeline, but each scenario has different latency and memory characteristics. This section tracks the baseline configuration and applicable optimization methods for each scenario.

Scenario Cosmos3-Nano Cosmos3-Super Baseline Optimization Notes
Text-to-Image (T2I) /v1/images/generations.
Text-to-Video (T2V) /v1/videos/sync without reference image/video.
Image-to-Video (I2V) /v1/videos/sync with image input_reference or image_reference.
Video-to-Video (V2V) /v1/videos/sync with video input_reference or video_reference. Use condition_frame_indexes_vision and condition_video_keep.
Text-to-Video with Sound (T2VS) T2V plus generate_sound=true; output is muxed as MP4 with AAC audio.
Image-to-Video with Sound (I2VS) I2V plus generate_sound=true; output is muxed as MP4 with AAC audio.
Action Forward Dynamics Input: first frame or video + action trajectory. Output: rollout video.
Action Policy Input: first frame or video + language instruction. Output: predicted action trajectory + rollout video.
Inverse Dynamics Input: video. Output: recovered action trajectory. Use async /v1/videos to read the top-level action field.
Video-Conditioned Action Generation Input: video under action mode. Output depends on action_mode; not generic V2V.
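The mapping from the generation scenarios above to endpoints and request fields can be sketched as a small lookup. The endpoint paths, `input_reference`, and `generate_sound` are taken from the table. The `prompt` field name, the `build_request` helper, and the way a reference is represented are assumptions. The sketch does not send anything and does not specify how the reference is encoded on the wire. The action scenarios and the V2V conditioning fields are left out because their request shape is not described here.

```python
from typing import Any, Dict, Optional, Tuple

# scenario -> (endpoint, required reference kind or None, generate_sound)
SCENARIOS: Dict[str, Tuple[str, Optional[str], bool]] = {
    "T2I": ("/v1/images/generations", None, False),
    "T2V": ("/v1/videos/sync", None, False),
    "I2V": ("/v1/videos/sync", "image", False),
    "V2V": ("/v1/videos/sync", "video", False),
    "T2VS": ("/v1/videos/sync", None, True),
    "I2VS": ("/v1/videos/sync", "image", True),
}


def build_request(
    scenario: str, prompt: str, reference: Optional[Any] = None
) -> Tuple[str, Dict[str, Any]]:
    """Return (endpoint, fields) for a scenario without sending the request."""
    endpoint, ref_kind, sound = SCENARIOS[scenario]
    if ref_kind is None and reference is not None:
        raise ValueError(f"{scenario} does not take a reference input")
    if ref_kind is not None and reference is None:
        raise ValueError(f"{scenario} requires a {ref_kind} reference")

    fields: Dict[str, Any] = {"prompt": prompt}  # "prompt" is an assumed field name
    if reference is not None:
        fields["input_reference"] = reference
    if sound:
        fields["generate_sound"] = True
    return endpoint, fields
```

Validating the reference up front keeps a T2V request from silently becoming I2V when a reference is passed by mistake.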

Recipes on varying devices

This section tracks documented and missing Cosmos3 deployment recipes across different hardware targets. A recipe should include the target device, model variant, baseline launch command, low-memory optimization flags, and validation results such as memory usage, latency, and output quality. Current documentation covers NVIDIA data-center GPUs, while low-VRAM consumer GPUs and NPU deployments still need validation.

Device / Hardware Cosmos3-Nano Cosmos3-Super Baseline Recipe Low-VRAM / Optimization Recipe Notes
1x NVIDIA H200 / B300 🙋 Nano: vllm serve nvidia/Cosmos3-Nano --omni --host 0.0.0.0 --port 8000 --init-timeout 1800 Optional: --enable-layerwise-offload for smaller GPUs Nano has a documented 1-GPU online serving recipe. Super requires a larger multi-GPU recipe in the current docs.
2x NVIDIA H200 / B300 Super: vllm serve nvidia/Cosmos3-Super --omni --host 0.0.0.0 --port 8000 --cfg-parallel-size 2 --use-hsdp --hsdp-shard-size 2 --init-timeout 1800 Optional: --enable-layerwise-offload Documented minimum Super recipe.
8x NVIDIA H200 / H100 / A100 Super: vllm serve nvidia/Cosmos3-Super --omni --host 0.0.0.0 --port 8000 --cfg-parallel-size 2 --ulysses-degree 4 --use-hsdp --hsdp-shard-size 8 --init-timeout 1800 Uses CFG parallel, Ulysses sequence parallel, and HSDP Documented recommended Super recipe.
Low-VRAM NVIDIA GPU, e.g. RTX 5090-class 🙋 🙋 Candidate: Nano + --enable-layerwise-offload; lower resolution, frame count, and steps; quantization after validation No documented Cosmos3 recipe or memory result found for this class of GPU. Needs validation.
Ascend NPU 🙋 🙋 Candidate: NPU quantization paths such as INT8, MXFP8, MXFP4, or MXFP4-DualScale after Cosmos3 validation No documented Cosmos3 NPU recipe found.
AMD GPU / ROCm 🙋 🙋 Candidate: ROCm backend validation and ROCm-compatible quantization / attention backend tuning after Cosmos3 validation No documented Cosmos3 AMD ROCm recipe found.
Intel XPU, e.g. Intel Arc B-Series 🙋 🙋 Candidate: XPU backend validation, AutoRound / MXFP8 / MXFP4 quantization paths after Cosmos3 validation No documented Cosmos3 Intel XPU recipe found.
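To keep the documented launch commands reproducible, a recipe could be stored as data and rendered into a command line. The sketch below is one way to do that. The `Recipe` class and its field names are hypothetical. The flags are copied from the documented rows above and nothing else is added. The `parallel_product` check reflects only an observation from the two documented Super recipes (2 = 2 x 1 and 8 = 2 x 4); it is an assumption, not a confirmed framework rule.

```python
import shlex
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Recipe:
    """Hypothetical container for one documented serving recipe."""

    device: str
    gpus: int
    model: str
    cfg_parallel_size: int = 1
    ulysses_degree: int = 1
    hsdp_shard_size: Optional[int] = None
    layerwise_offload: bool = False
    init_timeout: int = 1800

    def argv(self) -> List[str]:
        """Render the recipe as a `vllm serve` argument list."""
        args = ["vllm", "serve", self.model, "--omni",
                "--host", "0.0.0.0", "--port", "8000"]
        if self.cfg_parallel_size > 1:
            args += ["--cfg-parallel-size", str(self.cfg_parallel_size)]
        if self.ulysses_degree > 1:
            args += ["--ulysses-degree", str(self.ulysses_degree)]
        if self.hsdp_shard_size is not None:
            args += ["--use-hsdp", "--hsdp-shard-size", str(self.hsdp_shard_size)]
        if self.layerwise_offload:
            args.append("--enable-layerwise-offload")
        args += ["--init-timeout", str(self.init_timeout)]
        return args

    def parallel_product(self) -> int:
        """CFG-parallel size times Ulysses degree (assumed to match the GPU count)."""
        return self.cfg_parallel_size * self.ulysses_degree


SUPER_8GPU = Recipe(
    device="8x NVIDIA H200 / H100 / A100",
    gpus=8,
    model="nvidia/Cosmos3-Super",
    cfg_parallel_size=2,
    ulysses_degree=4,
    hsdp_shard_size=8,
)

if __name__ == "__main__":
    print(shlex.join(SUPER_8GPU.argv()))
```

Candidate rows such as the low-VRAM or NPU entries would stay out of this structure until a validated command exists for them.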

Feedback Period.

No response

CC List.

@hsliuustc0106 @Gaohan123 @david6666666 @bjf-frz

Any Other Things.

No response
