VeRL-Omni is a general RL training framework focused on multimodal generative models, built on top of verl.
It originated from the multimodal generation RL effort in verl and now has a dedicated home so it can evolve in a more focused way.
- [2026-06] DiffusionNFT and Diffusion DPO are integrated with verified recipes on Qwen-Image/SD3.5. Wan2.2 is now supported for video generation tasks.
Multimodal generative RL training differs from text-only LLM RL not only in model structure, but also in I/O patterns, compute characteristics, and runtime bottlenecks. As this space grows, it deserves a dedicated training repository that can evolve quickly around its own constraints.
VeRL-Omni targets RL post-training for three families of generative models:
- Diffusion generative models for image, video, and audio — e.g., Qwen-Image, Wan2.2.
- Unified multimodal understanding + generation models — e.g., BAGEL, HunyuanImage-3.0.
- Omni-modality models that jointly handle text, image, audio, and video — e.g., Qwen3-Omni.
- Optimized rollout: vLLM-Omni as a rollout backend for high-throughput multimodal generation.
- Flexible and async multi-reward serving: Support for multi-reward serving (HPSv3, GenRM-OCR, UnifiedReward, etc.), an HTTP scorer, and asynchronous reward computation that overlaps with the rollout phase.
- Modular training backends: Selectable VeOmni and FSDP2 backends with combinable parallelism (USP/TP/DP) for distributed training.
- Stability tools: Improved diffusion RL stability with rollout correction and deterministic rollout/reward/trainer.
- End-to-end examples and benchmarks: Validated recipes for co-located sync and fully-async RL on the model families above.
- High training throughput: On our reference Qwen-Image FlowGRPO setup, VeRL-Omni achieves ~25% higher end-to-end throughput than the diffusers-based `flow_grpo` implementation, driven by vLLM-Omni rollout, the FSDP2 trainer, overlapped (asynchronous) reward computation, etc.
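
The overlap between rollout and reward computation mentioned above can be illustrated with a minimal `asyncio` sketch. This is not VeRL-Omni's API: `generate_batch`, `score_quality`, `score_text`, `score_batch`, `run_overlapped`, and the reward weights are all hypothetical stand-ins, and the sleeps only simulate latency. The idea shown is that the reward task for batch *k* is launched in the background and keeps running while the rollout for batch *k+1* is awaited.

```python
import asyncio

# All names below are hypothetical placeholders, not VeRL-Omni functions.
BATCH_SIZE = 4
WEIGHTS = {"quality": 0.7, "text": 0.3}  # assumed reward weights, sum to 1


async def generate_batch(step: int) -> list[str]:
    """Stand-in rollout: returns placeholder sample IDs after a short delay."""
    await asyncio.sleep(0.05)
    return [f"step{step}-sample{i}" for i in range(BATCH_SIZE)]


async def score_quality(sample: str) -> float:
    """Stand-in for a remote quality scorer; returns a value in [0, 1]."""
    await asyncio.sleep(0.02)
    return (len(sample) % 5) / 4.0


async def score_text(sample: str) -> float:
    """Stand-in for a second scorer (e.g. OCR-style); returns a value in [0, 1]."""
    await asyncio.sleep(0.02)
    return 1.0 if sample.endswith("0") else 0.5


async def score_batch(samples: list[str]) -> list[float]:
    """Run both scorers on every sample concurrently and combine by weight."""
    quality, text = await asyncio.gather(
        asyncio.gather(*(score_quality(s) for s in samples)),
        asyncio.gather(*(score_text(s) for s in samples)),
    )
    return [
        WEIGHTS["quality"] * q + WEIGHTS["text"] * t
        for q, t in zip(quality, text)
    ]


async def run_overlapped(num_steps: int) -> list[list[float]]:
    """Score batch k in the background while batch k+1 is being generated."""
    rewards: list[list[float]] = []
    pending = None  # reward task for the previous batch, if any
    for step in range(num_steps):
        # While this rollout is awaited, the pending reward task makes progress.
        samples = await generate_batch(step)
        if pending is not None:
            rewards.append(await pending)
        pending = asyncio.create_task(score_batch(samples))
    if pending is not None:
        rewards.append(await pending)
    return rewards


if __name__ == "__main__":
    for step, batch_rewards in enumerate(asyncio.run(run_overlapped(3))):
        print(step, batch_rewards)
```

In this sketch the rewards for a batch become available one rollout later than in a strictly sequential loop; a real trainer would have to account for that delay when consuming them.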
Visit our documentation to learn more.
| Model | Category | Modality | Algorithm | Status |
|---|---|---|---|---|
| Qwen-Image | Diffusion generator | Text → Image | FlowGRPO (+ CPS/SDE) | ✅ |
| | | | MixGRPO | ✅ |
| | | | GRPO-Guard | ✅ |
| | | | DiffusionNFT | ✅ |
| | | | DPO | ✅ |
| Wan2.2 | Diffusion generator | Text → Video | DanceGRPO | ✅ |
| LTX2.3 | Diffusion generator | Text → Video + Audio | FlowGRPO | WIP |
| BAGEL | Unified understand + gen | Text + Image | FlowGRPO | ✅ |
| HunyuanImage-3.0 | Unified understand + gen | Text + Image | MixGRPO | Planned |
| | | | SRPO | Planned |
| Qwen3-Omni-Thinker | Omni-modality | Text / Image / Video / Audio | GSPO | WIP |
| SD3.5 | Diffusion generator | Text → Image | DPO | ✅ |
VeRL-Omni now supports Ascend NPU. For instructions on how to install and get started with FlowGRPO training on Ascend NPU, please refer to our Ascend NPU Quickstart Guide.
Future work is tracked here:
Contributions are welcome.
See the contribution guide.
VeRL-Omni builds on the engineering foundations developed in verl and is closely aligned with multimodal inference systems such as vLLM-Omni.
If you find the project helpful, please cite:
@misc{verlomni_github,
  title = {{VeRL-Omni: Easy, Fast, and Stable RL Training for Diffusion and Omni-Modality Models}},
  author = {Yongxiang Huang and Cheung Kawai and Jingan Zhou and Yingshu Chen and {openYuanrong Team} and Xibin Wu},
  year = {2026},
  howpublished = {\url{https://github.com/verl-project/verl-omni}},
  urldate = {2026-04-28}
}