Motivation.
Cosmos3 covers multiple inference scenarios, including T2I, T2V, I2V, video generation with sound, and action-related generation. These scenarios have different latency, memory, and hardware requirements, especially between Cosmos3-Nano and Cosmos3-Super.
The goal of this analysis is to provide an evidence-based feature matrix for Cosmos3 across cache acceleration, CPU offload, parallelism, LoRA, quantization, device support, and hardware recipes. It also identifies which features are already supported, which are explicitly unsupported, and which still require validation.
This document follows the same direction as the continuous diffusion acceleration tracking RFC: maintain a Model x Feature support matrix and clarify Feature x Feature compatibility gaps before recommending production configurations.
Proposed Change.
Features Supported Table
The following table tracks Cosmos3 feature support across cache acceleration, offload, parallelism, quantization, LoRA, and execution features. It follows the same status convention as the continuous diffusion acceleration tracking RFC.
| Model | TeaCache | Cache-DiT | UND K/V cache | SP (Ulysses & Ring) | CFG Parallel | Tensor Parallel | Pipeline Parallel | HSDP | CPU Offload (Layerwise) | VAE Patch Parallel | Quantization | LoRA Inference | Step Execution |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cosmos3-Nano | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| Cosmos3-Super | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
Quantization
Cosmos3 is marked as quantization-supported at the framework level, and its transformer passes the diffusion `quantization_config` into quantizable linear layers. The current Cosmos3 recipes, however, do not include method-specific validation or recommended quantized checkpoints. Therefore, this section tracks both the framework-level quantization support and the validation status of selected quantization methods.
| Model | Quantization Framework | FP8 | NVFP4 |
| --- | --- | --- | --- |
| Cosmos3-Nano | ✅ | ❓ | ❓ |
| Cosmos3-Super | ✅ | ❓ | ❓ |
Beyond the selected FP8 and NVFP4 entries, we welcome contributions for additional GPU and non-GPU quantization methods. On GPUs, potential directions include INT8, ModelOpt-based pre-quantized checkpoints, GGUF-style quantized transformer weights, and other hardware-aware formats supported by the vLLM-Omni quantization framework. On Ascend NPU, the current codebase includes NPU-oriented paths such as INT8, MXFP8, MXFP4, and MXFP4-DualScale, but there is no Cosmos3-specific NPU quantization recipe or validation result yet. Any contribution should clearly document the target device, quantization method, Cosmos3 variant, memory usage, latency, and output quality comparison against the BF16 baseline.
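The reporting fields listed above could be captured in a small record so that contributions are comparable. The following is a minimal sketch, not an existing vLLM-Omni schema: the class name `QuantValidationReport`, its field names, and the derived metrics are all hypothetical, and the numbers in the usage example are placeholders rather than measurements.

```python
from dataclasses import dataclass

# Variant names taken from this RFC.
KNOWN_VARIANTS = {"Cosmos3-Nano", "Cosmos3-Super"}


@dataclass(frozen=True)
class QuantValidationReport:
    """One quantization validation result (hypothetical schema for illustration)."""

    device: str                  # target device, e.g. "1x NVIDIA H200"
    method: str                  # quantization method, e.g. "FP8" or "NVFP4"
    variant: str                 # Cosmos3 variant
    peak_memory_gib: float       # peak memory of the quantized run
    latency_s: float             # end-to-end latency of the quantized run
    bf16_peak_memory_gib: float  # same measurement for the BF16 baseline
    bf16_latency_s: float        # same measurement for the BF16 baseline
    quality_note: str            # how output quality was compared against BF16

    def __post_init__(self) -> None:
        if self.variant not in KNOWN_VARIANTS:
            raise ValueError(f"unknown Cosmos3 variant: {self.variant!r}")
        measurements = (
            self.peak_memory_gib,
            self.latency_s,
            self.bf16_peak_memory_gib,
            self.bf16_latency_s,
        )
        if any(value <= 0 for value in measurements):
            raise ValueError("memory and latency values must be positive")

    def memory_saving(self) -> float:
        """Fraction of BF16 peak memory saved by the quantized run."""
        return 1.0 - self.peak_memory_gib / self.bf16_peak_memory_gib

    def speedup(self) -> float:
        """BF16 latency divided by quantized latency."""
        return self.bf16_latency_s / self.latency_s


# PLACEHOLDER numbers for illustration only, not measured results.
example = QuantValidationReport(
    device="example device",
    method="FP8",
    variant="Cosmos3-Nano",
    peak_memory_gib=10.0,
    latency_s=5.0,
    bf16_peak_memory_gib=20.0,
    bf16_latency_s=10.0,
    quality_note="placeholder",
)
```

Keeping the BF16 baseline in the same record makes each entry self-contained, so a ❓ cell in the table can be flipped based on a single report.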
Scenario Optimization
Cosmos3 supports multiple inference scenarios through the same pipeline, but each scenario has different latency and memory characteristics. This section tracks the baseline configuration and applicable optimization methods for each scenario.
| Scenario | Cosmos3-Nano | Cosmos3-Super | Baseline | Optimization | Notes |
| --- | --- | --- | --- | --- | --- |
| Text-to-Image (T2I) | ✅ | ✅ | | | `/v1/images/generations`. |
| Text-to-Video (T2V) | ✅ | ✅ | | | `/v1/videos/sync` without reference image/video. |
| Image-to-Video (I2V) | ✅ | ✅ | | | `/v1/videos/sync` with image `input_reference` or `image_reference`. |
| Video-to-Video (V2V) | ✅ | ✅ | | | `/v1/videos/sync` with video `input_reference` or `video_reference`. Use `condition_frame_indexes_vision` and `condition_video_keep`. |
| Text-to-Video with Sound (T2VS) | ✅ | ✅ | | | T2V plus `generate_sound=true`; output is muxed as MP4 with AAC audio. |
| Image-to-Video with Sound (I2VS) | ✅ | ✅ | | | I2V plus `generate_sound=true`; output is muxed as MP4 with AAC audio. |
| Action Forward Dynamics | ✅ | ✅ | | | Input: first frame or video + action trajectory. Output: rollout video. |
| Action Policy | ✅ | ✅ | | | Input: first frame or video + language instruction. Output: predicted action trajectory + rollout video. |
| Inverse Dynamics | ✅ | ✅ | | | Input: video. Output: recovered action trajectory. Use async `/v1/videos` to read the top-level `action` field. |
| Video-Conditioned Action Generation | ✅ | ✅ | | | Input: video under action mode. Output depends on `action_mode`; not generic V2V. |
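The routing in the table above can be sketched as a small helper that picks the endpoint and request fields for the non-action scenarios. This is an illustration under stated assumptions, not the server's documented request schema: the endpoints and the `input_reference` and `generate_sound` fields come from the table, while the `prompt` key, the function name `build_request`, and treating the reference as an opaque string are assumptions. The V2V fields `condition_frame_indexes_vision` and `condition_video_keep` are left out because their value format is not given here.

```python
from typing import Any, Dict, Optional, Tuple

# Endpoints named in the scenario table.
IMAGES_ENDPOINT = "/v1/images/generations"
VIDEOS_SYNC_ENDPOINT = "/v1/videos/sync"

NEEDS_REFERENCE = {"I2V", "I2VS", "V2V"}
WITH_SOUND = {"T2VS", "I2VS"}
SUPPORTED = {"T2I", "T2V"} | NEEDS_REFERENCE | WITH_SOUND


def build_request(
    scenario: str, prompt: str, reference: Optional[str] = None
) -> Tuple[str, Dict[str, Any]]:
    """Return (endpoint, payload) for a scenario. Nothing is sent over the network."""
    scenario = scenario.upper()
    if scenario not in SUPPORTED:
        raise ValueError(f"unsupported scenario: {scenario}")

    # ASSUMPTION: the text prompt is passed under a "prompt" key.
    payload: Dict[str, Any] = {"prompt": prompt}

    if scenario == "T2I":
        return IMAGES_ENDPOINT, payload

    if scenario in NEEDS_REFERENCE:
        if reference is None:
            raise ValueError(f"{scenario} requires an image or video reference")
        # ASSUMPTION: the reference is an opaque string (path, URL, or handle).
        payload["input_reference"] = reference
    elif reference is not None:
        raise ValueError(f"{scenario} does not take a reference input")

    if scenario in WITH_SOUND:
        payload["generate_sound"] = True

    return VIDEOS_SYNC_ENDPOINT, payload
```

For example, `build_request("T2VS", "a cat")` yields the `/v1/videos/sync` endpoint with `generate_sound` enabled, which matches the T2VS row.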
Recipes on varying devices
This section tracks documented and missing Cosmos3 deployment recipes across different hardware targets. A recipe should include the target device, model variant, baseline launch command, low-memory optimization flags, and validation results such as memory usage, latency, and output quality. Current documentation covers NVIDIA data-center GPUs only; low-VRAM consumer GPUs, Ascend NPU, AMD ROCm, and Intel XPU deployments still need validation.
| Device / Hardware | Cosmos3-Nano | Cosmos3-Super | Baseline Recipe | Low-VRAM / Optimization Recipe | Notes |
| --- | --- | --- | --- | --- | --- |
| 1x NVIDIA H200 / B300 | ✅ | 🙋 | Nano: `vllm serve nvidia/Cosmos3-Nano --omni --host 0.0.0.0 --port 8000 --init-timeout 1800` | Optional: `--enable-layerwise-offload` for smaller GPUs | Nano has a documented 1-GPU online serving recipe. Super requires a larger multi-GPU recipe in the current docs. |
| 2x NVIDIA H200 / B300 | ❓ | ✅ | Super: `vllm serve nvidia/Cosmos3-Super --omni --host 0.0.0.0 --port 8000 --cfg-parallel-size 2 --use-hsdp --hsdp-shard-size 2 --init-timeout 1800` | Optional: `--enable-layerwise-offload` | Documented minimum Super recipe. |
| 8x NVIDIA H200 / H100 / A100 | ❓ | ✅ | Super: `vllm serve nvidia/Cosmos3-Super --omni --host 0.0.0.0 --port 8000 --cfg-parallel-size 2 --ulysses-degree 4 --use-hsdp --hsdp-shard-size 8 --init-timeout 1800` | Uses CFG parallel, Ulysses sequence parallel, and HSDP | Documented recommended Super recipe. |
| Low-VRAM NVIDIA GPU, e.g. RTX 5090-class | 🙋 | 🙋 | | Candidate: Nano + `--enable-layerwise-offload`; lower resolution, frame count, and steps; quantization after validation | No documented Cosmos3 recipe or memory result found for this class of GPU. Needs validation. |
| Ascend NPU | 🙋 | 🙋 | | Candidate: NPU quantization paths such as INT8, MXFP8, MXFP4, or MXFP4-DualScale after Cosmos3 validation | No documented Cosmos3 NPU recipe found. |
| AMD GPU / ROCm | 🙋 | 🙋 | | Candidate: ROCm backend validation and ROCm-compatible quantization / attention backend tuning after Cosmos3 validation | No documented Cosmos3 AMD ROCm recipe found. |
| Intel XPU, e.g. Intel Arc B-Series | 🙋 | 🙋 | | Candidate: XPU backend validation, AutoRound / MXFP8 / MXFP4 quantization paths after Cosmos3 validation | No documented Cosmos3 Intel XPU recipe found. |
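To keep recipes consistent when new devices are added, the documented launch commands can be generated from a small set of parameters. The sketch below only reassembles the flags that already appear in the table above. The helper name `build_serve_command` and its keyword arguments are hypothetical, and the position of `--enable-layerwise-offload` within the command is an arbitrary choice.

```python
import shlex
from typing import List, Optional


def build_serve_command(
    model: str,
    *,
    cfg_parallel_size: Optional[int] = None,
    ulysses_degree: Optional[int] = None,
    hsdp_shard_size: Optional[int] = None,
    layerwise_offload: bool = False,
    host: str = "0.0.0.0",
    port: int = 8000,
    init_timeout: int = 1800,
) -> List[str]:
    """Assemble a `vllm serve` argument list from the flags used in the recipes."""
    cmd = ["vllm", "serve", model, "--omni", "--host", host, "--port", str(port)]
    if cfg_parallel_size is not None:
        cmd += ["--cfg-parallel-size", str(cfg_parallel_size)]
    if ulysses_degree is not None:
        cmd += ["--ulysses-degree", str(ulysses_degree)]
    if hsdp_shard_size is not None:
        # The recipes always pass --use-hsdp together with a shard size.
        cmd += ["--use-hsdp", "--hsdp-shard-size", str(hsdp_shard_size)]
    if layerwise_offload:
        cmd.append("--enable-layerwise-offload")
    cmd += ["--init-timeout", str(init_timeout)]
    return cmd


# Reproduce the documented 8-GPU Super recipe as a shell string.
super_8gpu = shlex.join(
    build_serve_command(
        "nvidia/Cosmos3-Super",
        cfg_parallel_size=2,
        ulysses_degree=4,
        hsdp_shard_size=8,
    )
)
print(super_8gpu)
```

Returning a list rather than a string lets the same helper feed `subprocess.run` directly, while `shlex.join` gives a copy-pasteable form for documentation.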
Feedback Period.
No response
CC List.
@hsliuustc0106 @Gaohan123 @david6666666 @bjf-frz
Any Other Things.
No response