[RFC]: HunyuanImage Model deployment optimization

> Last updated: 2025-05-10

## Overview

`HunyuanImage-3.0-Instruct` uses a two-stage AR (autoregressive) + DIT (Diffusion Transformer) architecture: the AR stage handles token generation and the DIT stage handles image denoising. This document is the unified tracker for feature implementation, optimization, and maintenance of this model in vllm-omni.
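
As a rough mental model of this two-stage flow, the sketch below chains a toy AR stage into a toy DIT denoising loop. All names (`ARResult`, `run_ar_stage`, `run_dit_stage`) are hypothetical and the "model" math is a stand-in. It only illustrates the hand-off (tokens plus a KV cache from AR, then conditioned iterative denoising in DIT), not vllm-omni's actual stage interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ARResult:
    # Hypothetical container for what the AR stage hands to the DIT stage.
    prompt_tokens: List[int]
    generated_tokens: List[int]
    kv_cache: Dict[int, Tuple[str, str]] = field(default_factory=dict)


def run_ar_stage(prompt_tokens: List[int], max_new_tokens: int) -> ARResult:
    """Toy AR stage: produces one new token per step from the previous token."""
    tokens = list(prompt_tokens)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        next_token = (tokens[-1] * 31 + 7) % 1000  # stand-in for model sampling
        tokens.append(next_token)
        generated.append(next_token)
    # Stand-in KV cache: one entry per processed position.
    kv_cache = {pos: ("k", "v") for pos in range(len(tokens))}
    return ARResult(list(prompt_tokens), generated, kv_cache)


def run_dit_stage(cond: ARResult, num_steps: int, latent: List[float]) -> List[float]:
    """Toy DIT stage: iteratively denoises a latent conditioned on the AR output."""
    # Stand-in "clean" target derived from the AR tokens.
    target = (sum(cond.generated_tokens) % 100) / 100.0
    x = list(latent)
    for step in range(num_steps):
        # Move part of the way toward the target each step; the last step
        # uses alpha = 1, so the loop ends on the target (up to rounding).
        alpha = 1.0 / (num_steps - step)
        x = [xi - alpha * (xi - target) for xi in x]
    return x
```

For example, `run_dit_stage(run_ar_stage([1, 2, 3], 4), 10, [0.9, -0.5])` returns a latent whose values all sit at the AR-derived target. The point of the split is that the two stages can be scheduled, parallelized, and scaled independently.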

---

## 1. AR Module

### 1.1 Functional Support

| Task | PR | Author | Priority | Deadline |
|------|----|--------|----------|----------|
| HunyuanImage-3.0 AR | [#759](https://github.com/vllm-project/vllm-omni/pull/759) | @usberkeley   | - | ✅ |
| HunyuanImage-3.0-Instruct AR | [#2713](https://github.com/vllm-project/vllm-omni/pull/2713) | @TaffyOfficial   | - | ✅ |
| AR accuracy bench | [#3332](https://github.com/vllm-project/vllm-omni/pull/3332) | @TaffyOfficial   | P0 | ✅ |
| Multi-image input support | [#3444](https://github.com/vllm-project/vllm-omni/pull/3444) | @TaffyOfficial   | P0 | ✅ |
| Config refactor | [#3172](https://github.com/vllm-project/vllm-omni/pull/3172) | @Fishermanykx  | P0 | ✅ |

### 1.2 Performance Features

| Task | PR | Author | Priority | Deadline |
|------|----|--------|----------|----------|
| Performance analysis | - | @TaffyOfficial   | P1 | 2026/5/19 |

---

## 2. DIT Module

### 2.1 Functional Support

| Task | PR | Author | Priority | Deadline |
|------|----|--------|----------|----------|
| HunyuanImage-3.0 DIT | [#1085](https://github.com/vllm-project/vllm-omni/pull/1085) | @ElleElleWu  | P0 | ✅ |

### 2.2 Performance Features

| Task | PR | Author | Priority | Deadline |
|------|----|--------|----------|----------|
| TP/EP | - | - | - | ✅ |
| SP (Sequence Parallelism) | [#2163](https://github.com/vllm-project/vllm-omni/pull/2163) | @Bounty-hunter | P1 | ✅ |
| Timing tool | [#1757](https://github.com/vllm-project/vllm-omni/pull/1757) | @Bounty-hunter | P0 | ✅ |
| CFG Parallel | [#1751](https://github.com/vllm-project/vllm-omni/pull/1751) | @nussejzz | P1 | ✅ |
| TeaCache | [#1927](https://github.com/vllm-project/vllm-omni/pull/1927) | @nussejzz | P1 | ✅ |
| VAE Parallel | [#3091](https://github.com/vllm-project/vllm-omni/pull/3091) | @Fishermanykx  | P1 | - |
| Flash Attention | [#2981](https://github.com/vllm-project/vllm-omni/pull/2981) | @Bounty-hunter  | P1 | ✅ |
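
For the CFG Parallel row: classifier-free guidance needs two DIT forward passes per denoising step (conditional and unconditional), combined as `eps = eps_uncond + s * (eps_cond - eps_uncond)`. CFG parallelism runs the two branches on different devices at the same time instead of back-to-back. The sketch below uses threads as stand-ins for ranks and a hypothetical `toy_model`; it only shows that the parallel and sequential paths produce the same guided prediction, not how PR #1751 implements it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Vector = List[float]
Model = Callable[[Vector, int, float], Vector]


def cfg_combine(eps_cond: Vector, eps_uncond: Vector, guidance_scale: float) -> Vector:
    """Classifier-free guidance: eps_uncond + s * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def cfg_step_sequential(model: Model, x: Vector, t: int, cond: float,
                        uncond: float, scale: float) -> Vector:
    # Baseline: both branches run one after the other on the same device.
    eps_cond = model(x, t, cond)
    eps_uncond = model(x, t, uncond)
    return cfg_combine(eps_cond, eps_uncond, scale)


def cfg_step_parallel(model: Model, x: Vector, t: int, cond: float,
                      uncond: float, scale: float, pool: ThreadPoolExecutor) -> Vector:
    # CFG parallel: each branch goes to its own worker (a stand-in for a
    # separate rank); the results are then gathered and combined.
    f_cond = pool.submit(model, x, t, cond)
    f_uncond = pool.submit(model, x, t, uncond)
    return cfg_combine(f_cond.result(), f_uncond.result(), scale)


def toy_model(x: Vector, t: int, context: float) -> Vector:
    # Hypothetical noise predictor; a real DIT is a transformer forward pass.
    return [0.5 * xi + 0.1 * context + 0.01 * t for xi in x]
```

With two ranks the per-step latency is roughly one branch instead of two, at the cost of a gather per step and twice the devices.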

### 2.3 Quantization

| Task | PR | Author | Priority | Deadline |
|------|----|--------|----------|----------|
| NPU offline quantization | [#2979](https://github.com/vllm-project/vllm-omni/pull/2979) | @jiangmengyu18  | P0 | ✅ |

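Offline quantization converts weights to a low-bit format once, before serving, and stores them together with their scales. As a generic illustration only (the NPU scheme in PR #2979 may differ, for example in granularity or in whether activations are also quantized), the sketch below performs symmetric per-output-channel INT8 weight quantization in pure Python. The function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_channel_int8(weight: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric per-output-channel INT8 quantization of a weight matrix.

    Each row (output channel) gets its own scale = max|w| / 127, so its
    values map into the integer range [-127, 127].
    """
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # An all-zero row gets scale 1.0 to avoid division by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> Matrix:
    # Recover approximate float weights: w ~= q * scale.
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Per-channel scales keep one large-magnitude output channel from crushing the resolution of the others, and the round-trip error per weight is bounded by half of its row's scale.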

---

## 3. AR + DIT Joint Inference

| Task | PR | Author | Priority | Deadline |
|------|----|--------|----------|----------|
| AR + DIT with KV recompute | [#3107](https://github.com/vllm-project/vllm-omni/pull/3107) | @skf-1999  | P0 | ✅ |
| AR + DIT with KV reuse | [#3346](https://github.com/vllm-project/vllm-omni/pull/3346) | @Bounty-hunter  | P0 | ✅ |
| Online mode adaptation | [#3410](https://github.com/vllm-project/vllm-omni/issues/3410) | @skf-1999  | P0 | ✅ |
| Offline/online accuracy alignment check and fix | - | @skf-1999  |  P0 | 2026/5/13 |
| YR connector (NPU) | [#3180](https://github.com/vllm-project/vllm-omni/pull/3180) | @yangsonglin13  | P1 |  |
| Skip re-encoding, in the DIT stage, the parts already encoded during the AR stage (e.g. the system prompt and image tokens) | - | - | P1 | Closed (no obvious benefit) |

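The KV recompute and KV reuse modes above differ in how much work the DIT stage repeats. Assuming that "KV reuse" means the DIT stage starts from the KV entries the AR stage already computed for the shared token prefix (while "KV recompute" re-encodes everything), the toy sketch below counts how many positions DIT encodes in each mode. `ToyKVCache` and `dit_prefill` are hypothetical; real KV caches hold per-layer key/value tensors, not tuples.

```python
from typing import Dict, List, Optional, Tuple


class ToyKVCache:
    """Hypothetical per-request KV store: one (K, V) stand-in per position."""

    def __init__(self) -> None:
        self.entries: Dict[int, Tuple[int, int]] = {}

    def encode(self, tokens: List[int], start: int = 0) -> int:
        """Compute entries for positions >= start; return how many were computed."""
        for pos in range(start, len(tokens)):
            self.entries[pos] = (tokens[pos], pos)  # stand-in for the real K/V
        return max(0, len(tokens) - start)


def dit_prefill(ar_tokens: List[int], dit_tokens: List[int],
                ar_cache: Optional[ToyKVCache]) -> int:
    """Return how many positions the DIT stage has to encode itself."""
    full_sequence = ar_tokens + dit_tokens
    cache = ToyKVCache()
    if ar_cache is None:
        # KV recompute: rebuild the whole cache from scratch in the DIT stage.
        return cache.encode(full_sequence)
    # KV reuse: copy the AR stage's entries and encode only the new suffix.
    cache.entries = dict(ar_cache.entries)
    return cache.encode(full_sequence, start=len(ar_tokens))
```

Under this assumption the saving scales with the length of the shared prefix; the measured end-to-end benefit depends on how large that prefix is relative to the rest of the DIT workload.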
### 3.1 Large-scale Deployment
Production readiness for large-scale deployment, focusing on multi-replica and high-concurrency scenarios.

| Task | PR | Author | Priority | Deadline |
|------|----|--------|----------|----------|
| Single-node, multi-replica, uniform AR/DIT config (e.g. both TP2) | - | - | P0 | - |
| Multi-node, multi-replica, uniform AR/DIT config (e.g. both TP2) | - | - | P0 | - |
| Single-node, multi-replica, heterogeneous AR/DIT config | - | - | P0 | - |
| Multi-node, multi-replica, heterogeneous AR/DIT config | - | - | P0| - |
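
A small capacity calculation helps when reasoning about these replica layouts. The sketch below assumes (hypothetically) that each replica places its AR and DIT stages on disjoint devices, so one replica needs `ar_tp + dit_tp` devices, and that a replica never spans nodes. `ReplicaConfig` and the helpers are illustrative names, not vllm-omni configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaConfig:
    ar_tp: int   # tensor-parallel size of the AR stage
    dit_tp: int  # tensor-parallel size of the DIT stage

    @property
    def devices_per_replica(self) -> int:
        # Assumption: the AR and DIT stages of one replica do not share devices.
        return self.ar_tp + self.dit_tp


def replicas_per_node(cfg: ReplicaConfig, devices_per_node: int) -> int:
    # Pack as many whole replicas as fit on one node.
    return devices_per_node // cfg.devices_per_replica


def idle_devices_per_node(cfg: ReplicaConfig, devices_per_node: int) -> int:
    # Devices left over after packing whole replicas.
    used = replicas_per_node(cfg, devices_per_node) * cfg.devices_per_replica
    return devices_per_node - used


def total_replicas(cfg: ReplicaConfig, devices_per_node: int, num_nodes: int) -> int:
    # Assumption: every replica is placed entirely within one node.
    return replicas_per_node(cfg, devices_per_node) * num_nodes
```

For example, on an 8-device node a uniform TP2/TP2 layout packs two replicas with no idle devices, while a heterogeneous AR TP1 / DIT TP4 layout fits only one replica and leaves three devices idle, which illustrates how heterogeneous layouts can strand devices under this simple packing rule.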

### 3.2 Baseline

| Task | PR | Author | Priority | Deadline |
|------|----|--------|----------|----------|
| Baseline evaluation and analysis | - | @Bounty-hunter @fake0fan  | P0 | - |

---

## 4. Cross-cutting / Maintenance

### 4.1 Known bugs/issues

| Task | PR | Author | Priority | Deadline |
|------|----|--------|----------|----------|
| Rebase 0.20.0 accuracy issue | [#3373](https://github.com/vllm-project/vllm-omni/pull/3373) | @Bounty-hunter  | P0 | ✅ |
| [#3477](https://github.com/vllm-project/vllm-omni/issues/3477) | | | | |
| [#3499](https://github.com/vllm-project/vllm-omni/issues/3499) | [#3500](https://github.com/vllm-project/vllm-omni/pull/3500) | @Bounty-hunter  | P0 |  |
| [#3503](https://github.com/vllm-project/vllm-omni/issues/3503) | | @Fishermanykx  | P0 | |
| DIT-only accuracy issue | |   | P0 | |
| AR + DIT + multi-image input + KV reuse accuracy issue | |  | P0 | |
| AR + DIT startup issue | |  | P0 | |
| Recaption issue | |  | P0 | |

### 4.2 CI & Quality

| Category | Test case | Test file | Covered scenarios |
|----------|-----------|-----------|-------------------|
| Accuracy | AR + DIT accuracy CI | `tests/e2e/accuracy/test_hunyuan_image3.py` | `kv_reuse` `AR accuracy` `online/offline` |
| Accuracy | DIT accuracy CI | - | `DIT accuracy` |
| Performance | DIT performance CI | `run_diffusion_benchmark.py` | `TP` `SP` `CFG parallel` |
| Performance | End-to-end performance CI | - | `end2end` `AR performance` |
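
Accuracy CI for an image model typically compares generated images against stored references within a tolerance. The metric used by `test_hunyuan_image3.py` is not stated here; as one hedged example, the sketch below computes PSNR over flattened 8-bit pixel values and asserts a minimum threshold. The function names and the 30 dB default are hypothetical.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], candidate: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("images must be non-empty and have the same number of pixels")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def assert_close_to_reference(reference: Sequence[float], candidate: Sequence[float],
                              min_psnr_db: float = 30.0) -> None:
    # Fail the check when the generated image drifts too far from the reference.
    value = psnr(reference, candidate)
    assert value >= min_psnr_db, f"PSNR {value:.2f} dB is below {min_psnr_db} dB"
```

A threshold-based check like this tolerates small numerical drift (for example across kernels or parallel layouts) while still catching real regressions; the threshold would need to be calibrated per scenario.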

| Task | PR | Author | Priority | Deadline |
|------|----|--------|----------|----------|
| AR + DIT + kv_reuse + offline accuracy CI | [#3655](https://github.com/vllm-project/vllm-omni/pull/3655) | @Bounty-hunter   | P0 | ✅ |
| AR + DIT + kv_reuse + online accuracy CI | - | @BLANKETusers  | P0 | |
| DIT performance CI | [#2495](https://github.com/vllm-project/vllm-omni/pull/2495) | @Bounty-hunter   | P0 | ✅ |
| Add DIT performance CI to local tests | - | @TaffyOfficial    | P0 |  |
| DIT accuracy CI | - | @BLANKETusers    | P0 |  |
| End-to-end performance CI | - (currently depends on the benchmark) |  | P0 | |

---

## 5. Appendix

### 5.1 Performance Data (L20x)

```
performance data here
```