Three-stage disaggregated deployment (Encoder + Transformer + Decoder) #947
fuheaven wants to merge 451 commits into ModelTC:main from
Conversation
Co-authored-by: Yang Yong (雍洋) <yongyang1030@163.com>
deploy: update deployment-related environment
1. Add MLU Dockerfile 2. Add Dockerfile save directory
Tidy VAReader & OmniVAReader; tidy VARecorder & X264VARecorder; VARecorder with stream, using a buffered stream; tidy env vars WORKER_RANK, READER_RANK, RECORDER_RANK; support voice type selection
Co-authored-by: root <root@pt-de4c35727a1b4d1b9f27f422f06026ec-worker-0.pt-de4c35727a1b4d1b9f27f422f06026ec.ns-devsft-3460edd0.svc.cluster.local>
Co-authored-by: root <root@pt-9b2035a55fe647eeb007584b238e5077-worker-0.pt-9b2035a55fe647eeb007584b238e5077.ns-devsft-3460edd0.svc.cluster.local>
Add option: ulysses qkv_fusion --------- Co-authored-by: root <root@pt-72be2ccd01a14fa18a4b18c6c347f823-worker-0.pt-72be2ccd01a14fa18a4b18c6c347f823.ns-devsft-3460edd0.svc.cluster.local> Co-authored-by: root <root@pt-0699d18802514bc1b116c156f9ce2bc1-worker-0.pt-0699d18802514bc1b116c156f9ce2bc1.ns-devsft-3460edd0.svc.cluster.local>
--------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: wangshankun <wangshankun2011@hotmail.com>
# Intel Support for LightX2V
## Summary
This PR adds Intel support for LightX2V, enabling video generation and
image generation on Intel GPUs.
## End-to-End Performance
On PTL integrated GPUs (iGPUs), we achieve native-level performance by leveraging the torch_sdpa kernel.
| Models | Configuration | Time |
|-------------------|--------------------------------|---------|
| Wan2.1-T2V-1.3B | 33 frames, 480×848, 20 steps | 197.80s |
| Z-image-turbo | 16:9 ratio, 9 steps | 57s |
## Usage
### Environment Setup
Set the platform environment variable for Intel iGPUs (Windows):
```bash
set PLATFORM=intel_xpu
```
### Wan Models (Text-to-Video)
```python
"""
Wan2.1 text-to-video generation example.
This example demonstrates how to use LightX2V with Wan2.1 model for T2V generation.
"""
from lightx2v import LightX2VPipeline
# Initialize pipeline for Wan2.1 T2V task
pipe = LightX2VPipeline(
model_path=r"xxx\models\Wan2.1-T2V-1.3B",
model_cls="wan2.1",
task="t2v",
)
# Alternative: create generator from config JSON file
pipe.create_generator(
    config_json="../../configs/platforms/intel_xpu/wan_t2v_1_3.json"
)
# Create generator manually with specified parameters
pipe.create_generator(
attn_mode="torch_sdpa",
infer_steps=50,
height=480, # Can be set to 720 for higher resolution
width=832, # Can be set to 1280 for higher resolution
num_frames=33,
guidance_scale=5.0,
sample_shift=5.0,
)
seed = 42
prompt = "a cat"
negative_prompt = "镜头晃动,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
save_result_path = "./output.mp4"
pipe.generate(
seed=seed,
prompt=prompt,
negative_prompt=negative_prompt,
save_result_path=save_result_path,
)
```
### Z-image-turbo Models (Text-to-Image)
```python
"""
Z-Image text-to-image generation example.
This example demonstrates how to use LightX2V with the Z-Image-Turbo model
for T2I generation.
"""
from lightx2v import LightX2VPipeline
# Initialize pipeline for Z-Image-Turbo T2I task
pipe = LightX2VPipeline(
model_path=r"xxxx\models\Z-Image-Turbo",
model_cls="z_image",
task="t2i",
)
# Alternative: create generator from config JSON file
pipe.create_generator(
config_json="../../configs/platforms/intel_xpu/z_image_turbo_t2i.json"
)
# Create generator manually with specified parameters
pipe.create_generator(
attn_mode="torch_sdpa",
aspect_ratio="16:9",
infer_steps=9,
guidance_scale=1,
)
# Generation parameters
seed = 42
prompt = (
    'A coffee shop entrance features a chalkboard sign reading '
    '"Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying '
    '"通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and '
    'beneath the poster is written '
    '"π≈3.1415926-53589793-23846264-33832795-02384197". '
    'Ultra HD, 4K, cinematic composition.'
)
negative_prompt = ""
save_result_path = "./output.png"
# Generate image
pipe.generate(
seed=seed,
prompt=prompt,
negative_prompt=negative_prompt,
save_result_path=save_result_path,
)
```
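The Z-image example selects the output size via `aspect_ratio="16:9"` rather than explicit `height`/`width`. How that string maps to pixel dimensions is internal to LightX2V; the sketch below is a hypothetical mapping only (the target area and the snapping to multiples of 16 are assumptions, not the library's actual logic):

```python
import math


def resolve_aspect_ratio(ratio: str, base_area: int = 1024 * 1024) -> tuple:
    """Map an aspect-ratio string like "16:9" to (width, height).

    Hypothetical helper: targets roughly base_area pixels and snaps each
    dimension to a multiple of 16 (a common latent-grid constraint).
    """
    w_r, h_r = (int(p) for p in ratio.split(":"))
    scale = math.sqrt(base_area / (w_r * h_r))
    snap = lambda v: max(16, round(v / 16) * 16)
    return snap(w_r * scale), snap(h_r * scale)


print(resolve_aspect_ratio("1:1"))   # → (1024, 1024)
```

A mapping like this keeps the total pixel count roughly constant across ratios, which is why the runtime per image stays comparable between `16:9` and `1:1` runs.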
## Installation
### Prerequisites
For the Intel platform, install dependencies with the following commands:
```bash
pip install --no-cache-dir -r requirements_win.txt
pip install --no-cache-dir torch==2.9.1+xpu torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/xpu
pip install --no-cache-dir -e .
```
## Platform Detection
Verify Intel XPU availability with the following code:
```python
import torch
print(torch.xpu.is_available())  # prints True when an Intel XPU device is usable
```
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gushiqiao <975033167>
Add canvas operation mode to the serviced deployment --------- Co-authored-by: qinxinyi <qxy118045534@163.com>
# Enable Disaggregation Feature

## Summary
This PR introduces a **disaggregation architecture** to LightX2V, enabling distributed deployment of the video generation pipeline across multiple devices or machines.

## What's New
### Core Functionality
- **Service Decoupling**: Separate encoder and transformer services that can run independently
- **High-Performance Communication**: ZeroMQ- and RDMA-based messaging with the Mooncake transfer engine
- **Flexible Deployment**: Support for single-machine multi-GPU and cross-machine distributed setups

### New Components
- `lightx2v/disagg/`: Complete disaggregation package
  - `conn.py`: Data connection and management
  - `services/encoder.py`: Encoder service implementation
  - `services/transformer.py`: Transformer service implementation
  - `examples/`: Usage examples for WAN I2V and T2V models

## Key Benefits
1. **Resource Flexibility**: Distribute compute-intensive tasks across multiple devices
2. **Scalability**: Easy horizontal scaling for production deployments
3. **Memory Efficiency**: Run large models in hardware-constrained environments
4. **Service-Oriented**: Build microservice-based video generation systems

## Usage Example
```shell
python3 lightx2v/disagg/examples/wan_t2v_service.py
```
See `lightx2v/disagg/examples/` for complete working examples.

## Backward Compatibility
✅ This is an **optional feature** that doesn't affect existing functionality:
- Default mode preserves current behavior
- All existing APIs remain unchanged
- Users can opt in to disaggregation when needed

## Testing
- ✅ Tested with WAN I2V and T2V models
- ✅ Verified cross-device communication stability
- ✅ Validated accuracy matches single-machine mode

## Files Changed
- Added: `lightx2v/disagg/` package with all disaggregation modules
- Modified: None (purely additive)

## Future Enhancements
- Automatic service discovery
- Load balancing across multiple workers
- Enhanced monitoring and health checks

---
**Type**: Feature
**Breaking Changes**: None
**Documentation**: Included in `lightx2v/disagg/examples/`

---------
Co-authored-by: jasonzhang517 <yzhang298@e.ntu.edu.sg>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: helloyongyang <yongyang1030@163.com>
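To make the encoder → transformer handoff concrete, here is a minimal stdlib sketch of a checksummed payload transfer. The names `pack_phase1`/`unpack_phase1` are invented for illustration, and `pickle` over bytes stands in for the real ZeroMQ/RDMA transport via Mooncake; only the shape of the idea (serialize, checksum, verify on receipt) is taken from the PR:

```python
import hashlib
import pickle


def pack_phase1(payload: dict) -> bytes:
    """Serialize encoder outputs and prepend a SHA-256 checksum.

    Stand-in for the Mooncake transfer path; illustration only.
    """
    blob = pickle.dumps(payload)
    return hashlib.sha256(blob).digest() + blob


def unpack_phase1(message: bytes) -> dict:
    """Verify the checksum before trusting the payload."""
    digest, blob = message[:32], message[32:]
    if hashlib.sha256(blob).digest() != digest:
        raise ValueError("Phase1 payload failed hash verification")
    return pickle.loads(blob)


# Encoder side: ship context and latent metadata to the transformer node.
msg = pack_phase1({"context": [0.1, 0.2], "latent_shape": (16, 60, 104)})
# Transformer side: verify integrity before assembling denoising inputs.
inputs = unpack_phase1(msg)
```

A corrupted message raises `ValueError` instead of silently producing wrong latents, which is the behavior the PR's hash-verification step is guarding against.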
…l) support (ModelTC#918)

# Summary
Add FP8 and Flash Attention optimizations for lightx2v_intel_xpu, enabling future expansion of Intel-optimized kernels.

# E2E Perf
## Wan2.1-T2V-1.3B (33 frames, 480×848, 20 steps)
| Configuration | Time | Speedup |
| -- | -- | -- |
| Before PR (torch_sdpa) | 197s | 1.00x |
| After PR (sycl_kernels) | 170.55s | 1.13x |

# Usage Example
```python
import time

from lightx2v import LightX2VPipeline

# Initialize pipeline for Wan2.1 T2V task
pipe = LightX2VPipeline(
    model_path=r"xxxx\Wan2.1-T2V-1.3B",
    model_cls="wan2.1",
    task="t2v",
)
pipe.create_generator(
    config_json=r"xxx\LightX2V\configs\platforms\intel_xpu\wan_t2v_1_3_xpu_flash_attn.json"
)
seed = 42
prompt = "a bird"
negative_prompt = "镜头晃动,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
save_result_path = "./output.mp4"
start = time.time()
pipe.generate(
    seed=seed,
    prompt=prompt,
    negative_prompt=negative_prompt,
    save_result_path=save_result_path,
)
print("generate time", time.time() - start)
```
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: helloyongyang <yongyang1030@163.com>
…ofile) (ModelTC#909)

## Description
Replacing native PyTorch operators with TensorRT engines brings a significant performance gain to the VAE (Encoder / Decoder) of the Qwen Image model. For the two very different task types, T2I (fixed sizes) and I2I (variable sizes), a two-track TRT acceleration scheme was designed, and the underlying loading components were refactored.

### Key Features
1. **Unified TensorRT VAE loader (`vae_trt.py`)**:
   - A single `trt_engine_path` argument plus the `vae_type: "tensorrt"` config switch.
   - A complete **PyTorch fallback mechanism**: if environment probing fails, the engine file is missing, or pre-allocated GPU memory hits OOM, inference automatically falls back to the native PyTorch VAE operators, keeping the inference path safe and robust.
2. **T2I: static-shape engines + lazy loading**
   - Because T2I generates images at a small, fixed set of aspect ratios, an independent static engine is pre-built per resolution, completely eliminating dynamic-execution overhead.
   - A **lazy-load** strategy loads an engine pair (~5 GB GPU memory per pair) only on the first request at a given resolution, automatically releasing the old pair and loading the new one when the resolution changes. Compared with loading everything up front (~25 GB), this greatly reduces memory usage and fits end-to-end inference scenarios.
3. **I2I: multi-profile dynamic engine**
   - For uncontrolled, arbitrary input sizes, a single engine carries 9 classic opt shapes (including 512x512, 1024x1024, 720p, 1080p, etc.).
   - At inference time, the closest profile is matched dynamically, so TensorRT allocates the best memory layout and kernel execution path.
   - The engine stays resident in GPU memory; Encoder + Decoder together take roughly ~1.0-1.2 GB.
4. **Documentation (`QwenImageVAETensorRT.md`)**
   - New configuration and best-practice guide for the VAE TRT optimization.
   - Includes benchmark data for both standalone tests and end-to-end service mode, plus a root-cause analysis of the performance differences.

---

## Performance Benchmark
Measured on a single NVIDIA H100 (80GB).

### 1. T2I Static Shape — standalone VAE test
| Ratio | PT Enc (ms) | TRT Enc (ms) | Enc speedup | PT Dec (ms) | TRT Dec (ms) | Dec speedup |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 16:9 | 66.53 | **32.70** | **2.03x** | 103.65 | **49.66** | **2.09x** |
| 9:16 | 65.72 | **32.22** | **2.04x** | 103.02 | **50.71** | **2.03x** |
| 1:1 | 78.16 | **41.95** | **1.86x** | 121.91 | **61.52** | **1.98x** |
| 4:3 | 73.99 | **37.23** | **1.99x** | 114.45 | **54.75** | **2.09x** |
| 3:4 | 31.74 | **17.33** | **1.83x** | 50.77 | **26.86** | **1.89x** |

> **Encoder ~1.95x, Decoder ~2.02x**

### 2. T2I Static Shape — end-to-end service mode (Qwen-Image-2512, 5 steps, VAE Decoder)
> T2I has no VAE Encoder; only the Decoder is measured.

| Ratio | PT Dec (ms) | TRT Dec (ms) | Dec speedup | First load (ms) |
| :---: | :---: | :---: | :---: | :---: |
| 16:9 | 189.3 | **88.4** | **2.14x** | 343.9 |
| 9:16 | 179.6 | **85.6** | **2.10x** | 226.4 |
| 1:1 | 157.6 | **106.2** | **1.48x** | 304.1 |
| 4:3 | 148.7 | **94.7** | **1.57x** | 238.0 |
| 3:4 | 70.4 | **46.1** | **1.53x** | 178.2 |

> **Decoder average ~1.8x.** "First load" is the one-time cost paid when lazy loading switches resolution; subsequent requests at the same resolution do not incur it.

### 3. I2I Multi-Profile — standalone VAE test (average of 10 runs)
**Encoder**:
| Resolution | PT Enc (ms) | TRT Enc (ms) | Speedup |
| :---: | :---: | :---: | :---: |
| 512x512 | 11.00 | **8.53** | **1.29x** |
| 1024x1024 | 42.85 | **27.56** | **1.55x** |
| 480p 16:9 | 17.25 | **12.00** | **1.44x** |
| 720p 16:9 | 38.00 | **25.35** | **1.50x** |
| 768p 4:3 | 31.98 | **21.76** | **1.47x** |

> **Encoder average ~1.45x**

**Decoder**:
| Resolution | PT Dec (ms) | TRT Dec (ms) | Speedup |
| :---: | :---: | :---: | :---: |
| 512x512 | 17.60 | **12.78** | **1.38x** |
| 1024x1024 | 68.16 | **44.93** | **1.52x** |
| 480p 16:9 | 27.67 | **18.85** | **1.47x** |
| 720p 16:9 | 60.24 | **40.80** | **1.48x** |
| 768p 4:3 | 51.14 | **34.92** | **1.46x** |

> **Decoder average ~1.46x; overall ~1.45x**

### 4. I2I Multi-Profile — end-to-end service mode (qwen-image-edit-251130, 4 steps)
| Resolution | PT Enc → TRT Enc | Enc speedup | PT Dec → TRT Dec | Dec speedup |
| :---: | :---: | :---: | :---: | :---: |
| 512x512 | 48.5 → **28.8** | **1.68x** | 138.4 → **134.0** | **1.03x** |
| 1024x1024 | 48.2 → **28.4** | **1.70x** | 152.7 → **133.3** | **1.15x** |
| 480p 16:9 | 48.7 → **29.6** | **1.64x** | 140.4 → **134.4** | **1.04x** |
| 720p 16:9 | 48.6 → **30.1** | **1.62x** | 139.0 → **134.2** | **1.04x** |
| 768p 4:3 | 49.2 → **29.8** | **1.65x** | 152.8 → **134.8** | **1.13x** |

> **Encoder ~1.66x, Decoder ~1.08x**
>
> The Decoder speedup is lower than in the standalone test because `postprocess(output_type="pil")` adds a constant ~80-90 ms of CPU overhead (tensor → PIL conversion) that TRT cannot accelerate, mathematically diluting the ratio. For the speedup of the TRT engine kernels themselves, refer to the standalone numbers.

---

## Changes Made
- Refactored `lightx2v/models/video_encoders/trt/qwen_image/vae_trt.py`
  - Unified static / multi-profile loading logic
  - Implemented lazy loading for T2I static engines (automatic load/release per resolution)
  - PyTorch fallback mechanism
- Added T2I TRT config: `configs/qwen_image/qwen_image_t2i_2512_trt.json`
- Added I2I TRT config: `configs/qwen_image/qwen_image_i2i_2511_trt.json`
- Added shell scripts: `scripts/qwen_image/qwen_image_t2i_2512_trt.sh`, `scripts/qwen_image/qwen_image_i2i_2511_trt.sh`
- Added documentation: `examples/BeginnerGuide/ZH_CN/QwenImageVAETensorRT.md`, `examples/BeginnerGuide/EN/QwenImageVAETensorRT.md`
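The "match the closest profile" step for I2I can be sketched as a simple area-distance heuristic. The shape list and the metric below are assumptions for illustration; the real selection logic inside the TRT loader may differ:

```python
# Candidate opt shapes baked into the multi-profile engine
# (assumed subset, stored as (width, height)).
OPT_SHAPES = [
    (512, 512), (1024, 1024),
    (854, 480), (1280, 720), (1920, 1080),
    (1024, 768),
]


def nearest_profile(width: int, height: int) -> tuple:
    """Pick the opt shape whose pixel area is closest to the request."""
    area = width * height
    return min(OPT_SHAPES, key=lambda s: abs(s[0] * s[1] - area))


print(nearest_profile(1280, 720))   # → (1280, 720), an exact match
print(nearest_profile(1000, 1000))  # → (1024, 1024)
```

Matching by area rather than exact dimensions is what lets one resident engine serve arbitrary input sizes while still landing in a well-tuned memory/kernel configuration.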
Summary of Changes
This pull request introduces a three-stage disaggregated deployment mode for large generative models (such as Wan and Qwen Image) in the LightX2V framework. By splitting the inference pipeline into independent Encoder, Transformer, and Decoder services, it significantly optimizes GPU memory usage, improves system throughput, and enables flexible deployment across devices or machines. Integration of the Mooncake transfer engine and LightLLM optimizations keeps data transfer efficient and improves encoding-stage performance, providing a more stable and scalable solution for high-resolution, long-duration generation scenarios.
Code Review
This PR introduces a very important three-stage disaggregated deployment feature (Encoder + Transformer + Decoder). It is a solid piece of engineering that effectively optimizes GPU memory usage and inference throughput for large generative models in distributed environments.
The implementation is comprehensive, covering low-level communication (based on Mooncake), core logic (DisaggMixin), integration with the existing Runner, and the configs, documentation, and test scripts on top. The overall design is well considered, for example:
- DisaggMixin is used to reuse the disaggregation logic, keeping the code structure clean.
- Models are loaded on demand per role (encoder, transformer, decoder), effectively reducing GPU memory usage.
- Hash verification of transferred data guarantees data consistency.
- Detailed Chinese documentation and ready-to-use launch and test scripts greatly lower the barrier to entry.
I found a few minor issues in the documentation and script comments and have left suggestions in the individual review comments; I hope they help round out the feature. Overall, this is a high-quality submission.
```bash
python -m lightx2v.server \
    --model_cls wan2.1 \
    --task t2v \
    --model_path $model_path \
    --config_json ${lightx2v_path}/configs/wan/wan_t2v_disagg_decode.json \
    --host 0.0.0.0 \
    --port 8004
```
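The `$model_path` and `${lightx2v_path}` variables in the command above must already be set in the shell. A minimal illustration (the paths are placeholders, not the repository's defaults):

```shell
# Placeholder paths -- point these at your actual checkout and model directory.
export lightx2v_path=/opt/LightX2V
export model_path=/opt/models/Wan2.1-T2V-1.3B

# The server then resolves its config relative to the project root:
echo "${lightx2v_path}/configs/wan/wan_t2v_disagg_decode.json"
```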
Hi, this document is very detailed and really helps users understand and use the disaggregated deployment feature.
The example commands in section 3.1 (manually starting the services) use the environment variables $model_path and ${lightx2v_path}. Readers who jump straight to this section may not know how to set them.
Consider adding a short note reminding users to set these variables first, pointing to how they are defined in the script scripts/server/disagg/wan/start_wan_t2v_disagg.sh. For example:
> **Note**: The `$model_path` and `${lightx2v_path}` variables in the commands below must be set in advance. `$lightx2v_path` should point to the project root directory, and `$model_path` to the directory containing the model files.
This would improve the usability of the documentation.
```bash
# GPU_T : Transformer (port 8005)
#
# Override GPUs via environment variables:
# GPU_ENCODER=4 GPU_TRANSFORMER=5 GPU_DECODER=6 ./start_wan_i2v_disagg_all.sh
```
```bash
# GPU_T : Transformer (port 8003)
#
# Override GPUs via environment variables:
# GPU_ENCODER=4 GPU_TRANSFORMER=5 GPU_DECODER=6 ./start_wan_t2v_disagg_all.sh
```
Summary
This integrates the disaggregated deployment mode provided by Mooncake into the runner, giving LightX2V full three-stage disaggregated deployment: the inference pipeline can be split into Encoder, Transformer, and Decoder nodes, with the VAE Decoder deployed independently on the Decoder node. Wan and Qwen series models are supported.
Features
Disaggregated architecture
Via the `disagg_mode` config parameter, the inference pipeline is physically split into three independent services, and data flows through two Mooncake transfers, Phase1 (Encoder → Transformer) and Phase2 (Transformer → Decoder):
- Encoder node: loads only the Text Encoder, the Image Encoder (for I2V / I2I), and the VAE Encoder, skipping the DiT and VAE Decoder. It runs feature extraction and ships context, clip_encoder_out, vae_encoder_out, latent_shape, etc. to the Transformer node via Mooncake Phase1.
- Transformer node: loads only the DiT model, skipping the encoders and the VAE Decoder (in three-stage mode the Decoder node handles decoding). After start-up it waits for Phase1 data; on receipt it verifies the hash, assembles the inputs, and runs denoising. If decoder_engine_rank is configured, the denoised latents are sent to the Decoder node via Mooncake Phase2 rather than decoded locally.
- Decoder node: loads only the VAE Decoder, skipping the Text/Image Encoders and the DiT. After start-up it waits in the Phase2 receive state; once the latents arrive from the Transformer it runs VAE decoding and saves the output video/image. Task completion status and result files both live on the Decoder node.
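The per-role loading rules above can be summarized in a small dispatch table. The component names below are illustrative, not LightX2V's actual module identifiers:

```python
# Which components each disagg_mode role loads; everything else is skipped.
ROLE_COMPONENTS = {
    "encoder":     {"text_encoder", "image_encoder", "vae_encoder"},
    "transformer": {"dit"},
    "decoder":     {"vae_decoder"},
}


def components_to_load(disagg_mode: str) -> set:
    """Return the set of components a node with this role should load."""
    try:
        return ROLE_COMPONENTS[disagg_mode]
    except KeyError as exc:
        raise ValueError(f"unknown disagg_mode: {disagg_mode}") from exc


print(components_to_load("transformer"))  # → {'dit'}
```

Keeping the rule in one table makes the memory savings explicit: each node holds only its own column of the pipeline, which is what lets large models run on hardware-constrained nodes.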