Unified Reward Model for Multimodal Understanding and Generation

UnifiedReward Series Works

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning: We propose Pref-GRPO and UniGenbench, the first preference reward-based GRPO method for stable T2I reinforcement learning, and a unified T2I generation benchmark for fine-grained semantic consistency evaluation.

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning: We propose UnifiedReward-Think, the first unified multimodal CoT reward model.

Unified Reward Model for Multimodal Understanding and Generation: We release the UnifiedReward, the first unified reward model for multimodal understanding and generation assessment, enabling both pairwise ranking and pointwise scoring.

Unified Reward Model for Multimodal Understanding and Generation

😊 We are actively gathering feedback from the community to improve our models. We welcome your input and encourage you to stay updated through our repository!!

🔥🔥🔥 We release UnifiedReward-2.0-qwen-[3b/7b/32b/72b].

This version introduces several new capabilities:

Pairwise scoring for image and video generation assessment on Alignment, Coherence, Style dimensions.

Pointwise scoring for image and video generation assessment on Alignment, Coherence/Physics, Style dimensions.

Welcome to try the latest version, and the inference code is available at inference_qwen/UnifiedReward-2.0-inference directory.

🔥 We release SGLang and vLLM inference code in sglang_llava and vllm_qwen directories!

✨ Awesome Works using UnifiedReward

😊 Meta, Transition Matching: Scalable and Flexible Generative Modeling.

😊 Tencent Hunyuan X, X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again.

😊 Kuaishou&Tsinghua&CUHK, Flow-GRPO: Training Flow Matching Models via Online RL.

😊 CUHK MMLab, Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO.

Method	HPS	ImageReward	UnifiedReward
Janus-Pro + DPO	77.3	77.7	80.0
Janus-Pro + GRPO	79.2	79.3	81.0
Janus-Pro + Best-of-4	82.1	82.4	84.5

😊 We appreciate the mradermacher team for providing the GGUF version of our models, and the Tencent Hunyuan team for providing the evaluation results on several T2I models using UnifiedReward-qwen-7b!! The evaluation was conducted on 400 prompts sourced from here.

click for evaluation results on several T2I models

Model	Alignment	Coherence	Style
Flux-pro-ultra	3.6453	3.8193	3.4971
Imagen-4.0	3.6792	3.8049	3.4756
Recraft-v3	3.6611	3.8409	3.5158
OpenAI-GPT-image-1	3.6890	3.8448	3.4960
Imagen-3.0	3.6733	3.8027	3.4674
Seedream-3.0	3.6927	3.8218	3.4887

🔥🔥🔥 UnifiedReward-Think

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

We release UnifiedReward-Think -- the first unified multimodal CoT reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks.

Please refer to the project page for details.

🔥🔥 We release UnifiedReward-Think-qwen-7b, a more powerful unified multimodal CoT reward model built upon UnifiedReward-qwen-7b!!!!

🔥🔥 We released Gradio for UnifiedReward-Think!

🏁 Compared with Current Reward Models

Reward Model	Method	Image Generation	Image Understanding	Video Generation	Video Understanding
PickScore	Point	√
HPS	Point	√
ImageReward	Point	√
LLaVA-Critic	Pair/Point		√
IXC-2.5-Reward	Pair/Point		√		√
VideoScore	Point			√
LiFT	Point			√
VisionReward	Point	√		√
VideoReward	Point			√
UnifiedReward (Ours)	Pair/Point	√	√	√	√

🔧 Environment Set Up

Clone this repository and navigate to the UnifiedReward folder:

git clone https://github.com/CodeGoat24/UnifiedReward.git
cd UnifiedReward

Install the inference package:

conda create -n unifiedreward python=3.10 -y
conda activate unifiedreward
pip install --upgrade pip  
pip install -e ".[train]"
pip install flash_attn==2.5.8 --no-build-isolation

🚀 Inference

For Qwen2.5-VL based UnifiedReward models, you should first install the inference packages as follows:

pip install git+https://github.com/huggingface/transformers accelerate qwen-vl-utils[decord]==0.0.8

We provide reference pair ranking and point score inference code for each task in the ./inference and ./inference_qwen directories.

inference
├── image_generation                  
    ├── pair_rank_image_generation.py            
    └── point_score_image_generation.py         
├── video_understanding                 
    ├── pair_rank_video_understanding.py            
    └── point_score_video_understanding.py
...

Note that our model is not constrained to a fixed input prompt style. You can flexibly adjust inputs based on your requirements.

1. vLLM Inference

We provide vLLM inference code for UnifiedReward-qwen in vllm_qwen directory.

Install vLLM

pip install vllm==0.9.0.1 transformers==4.52.4

Deploy vLLM Server

bash vllm_qwen/vllm_server.sh

Inference Request to vLLM Server

python vllm_qwen/vllm_inference.py

2. SGLang Inference

We provide SGLang inference code for UnifiedReward-llava in sglang_llava directory.

Install SGLang

pip install "sglang[all]"

Deploy SGLang Server

bash sglang_llava/sglang_server.sh

Inference Request to SGLang Server

python sglang_llava/sglang_inference.py

💻 Training UnifiedReward

1. Training based on Qwen2.5-VL-Instruct (Recommended)

We use LLaMA-Factory to train the SFT model.

Clone the LLaMA-Factory repository and install the dependencies.

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

Follow this README (Multimodal Image Dataset) to prepare our released datasets.

Run the following command to train the SFT model.

llamafactory-cli train examples/train_full/qwen2_5vl_full_sft.yaml

2. Training based on LLaVA-Onevision

2.1 Unified Preference Training Dataset Preparation

Please download our constructed unified preference dataset from Huggingface and put it in ./dataset/.

dataset
├── EvalMuse                  
    ├── pairwise            
    └── pointwise
    └── ...            
└── HPD                   
└── LiFT-HRA
└── LLaVA-Critic 
    ├── pairwise            
    └── pointwise
    └── ...
└── OIP
└── ShareGPTVideo
    ├── pairwise            
    └── pointwise
    └── ...      
└── VideoDPO 
└── VideoFeedback
└── train_data.yaml

2.2 Training based on LLaVA-Onevision

bash train.sh

✨ Direct Preference Optimization

🎨 Image and Video Understanding DPO

1. Construct Preference data

The data for preference data construction should adhere to the following structure:

[
    {
    "prompt": "",
    "image": "",
    },
    ...
]

Then

# image understanding 
cd preference_data_construction/image_understanding
python infer+sift.py # you need to fill the 'image_folder' and 'data_path' in this file

# video understanding 
cd preference_data_construction/video_understanding
python infer+sift.py # you need to fill the 'image_folder' and 'data_path' in this file

2. Training

The training data format in data.json should adhere to the following structure:

[
    {
    "id": "",
    "image": "",
    "prompt": "",
    "chosen": "",
    "rejected": ""
    },
    ...
]

Then start training:

# image understanding 
bash dpo_image_understand_ov7b.sh 

# video understanding 
bash dpo_video_understand_llava_video_7b.sh

🖼️ Image Generation DPO

0. Prepare Environments

cd DiffusionDPO
conda create -n diffdpo python=3.10 -y
conda activate diffdpo
pip install -r requirements.txt

1. Construct Preference data

Image Generation

The data for preference data construction should adhere to the following structure:

[
    {
    "prompt": "",
    },
    ...
]

Then

python data_generation.py # you need to fill the 'data_path' in this file

Preference Pair Data Construction

python sift_dpo_data.py

2. Training

The training data format in data.json should adhere to the following structure:

[
    {
        "id": "",
        "caption": "",
        "jpg_0": "", #chosen image path
        "jpg_1": "", #rejected image path
        "label_0": 1,
    },
    ...
]

Then start training:

bash launchers/turbo_dpo.sh

🎬 Video Generation DPO

0. Prepare Environments

cd VideoDPO
conda create -n videodpo python=3.10 -y
conda activate videodpo
pip install -r requirements.txt

Run following instruction to download VideoCrafter checkpoints.

mkdir -p checkpoints/vc2
wget -P checkpoints/vc2 https://huggingface.co/VideoCrafter/VideoCrafter2/resolve/main/model.ckpt

Please download our constructed T2V-Turbo model and its reference model from Huggingface and put it in ./checkpoints/t2v-turbo.

1. Construct Preference data

Video Generation

The data for preference data construction should adhere to the following structure:

[
    {
    "prompt": "",
    },
    ...
]

Then

bash data_generation.sh # you need to fill '--prompts_file' in this file

Preference Pair Data Construction

python sift_dpo_data.py

2. Training

The training data format in data.json should adhere to the following structure:

[
    {
        "id": "",
        "caption": "",
        "chosen": "", # chosen video path
        "rejected": "", # rejected video path
    },
    ...
]

Then start training:

bash run.sh

🚀 Evaluation

We provide several evaluation code in ./benchmark_evaluation directory.

Reward model

We provide evaluation code for GenAI-Bench-Video, GenAI-Bench-Image, VideoGen-RewardBench and VL-RewardBench benchmarks.

Video Understanding

We provide evaluation code for MSRVTT, MSVD, and TGIF benchmarks while using the VLMEvalKit toolkit for evaluating LongVideoBench, MLVU, and Video-MME benchmarks with 64 input frames.

Image Understanding

We use LMMs-Eval toolkit to evaluate LLaVABench, WildVision, LLaVABench-Wilder, LiveBench, and MMHal benchmarks.

Image Generation

We utilize the image reward model, i.e., PickScore, HPS and ImageReward for quality assessment.

Video Generation

VBench is used for video generation assessment.

📧 Contact

If you have any comments or questions, please open a new issue or feel free to contact Yibin Wang.

🤗 Acknowledgments

In this work, reward model and image/video understanding DPO code is based on LLaVA-Next, while image and video generation DPO is based on DiffusionDPO and VideoDPO.

We also utilize LMMs-Eval and VLMEvalKit toolkits for evaluation.

Thanks to all the contributors!

⭐ Citation

@article{unifiedreward-think,
  title={Unified multimodal chain-of-thought reward model through reinforcement fine-tuning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2505.03318},
  year={2025}
}

@article{unifiedreward,
  title={Unified reward model for multimodal understanding and generation},
  author={Wang, Yibin and Zang, Yuhang and Li, Hao and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2503.05236},
  year={2025}
}

Name		Name	Last commit message	Last commit date
Latest commit History 260 Commits
DiffusionDPO		DiffusionDPO
UnifiedReward-Think		UnifiedReward-Think
VideoDPO		VideoDPO
benchmark_evaluation		benchmark_evaluation
dataset		dataset
docs		docs
inference		inference
inference_qwen		inference_qwen
llava		llava
modules		modules
playground		playground
preference_data_construction		preference_data_construction
scripts		scripts
sglang_llava		sglang_llava
trl		trl
vllm_qwen		vllm_qwen
LICENSE		LICENSE
README.md		README.md
dpo_image_understand_ov7b.sh		dpo_image_understand_ov7b.sh
dpo_video_understand_llava_video_7b.sh		dpo_video_understand_llava_video_7b.sh
pyproject.toml		pyproject.toml
train.sh		train.sh

License

CodeGoat24/UnifiedReward

Folders and files

Latest commit

History

Repository files navigation

UnifiedReward Series Works

Unified Reward Model for Multimodal Understanding and Generation

✨ Awesome Works using UnifiedReward

🔥🔥🔥 UnifiedReward-Think

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

🏁 Compared with Current Reward Models

🔧 Environment Set Up

🚀 Inference

1. vLLM Inference

2. SGLang Inference

💻 Training UnifiedReward

1. Training based on Qwen2.5-VL-Instruct (Recommended)

2. Training based on LLaVA-Onevision

2.1 Unified Preference Training Dataset Preparation

2.2 Training based on LLaVA-Onevision

✨ Direct Preference Optimization

1. Construct Preference data

2. Training

0. Prepare Environments

1. Construct Preference data

2. Training

0. Prepare Environments

1. Construct Preference data

2. Training

🚀 Evaluation

Reward model

Video Understanding

Image Understanding

Image Generation

Video Generation

📧 Contact

🤗 Acknowledgments

⭐ Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages