A collection of awesome papers on the alignment of visual generation models (including autoregressive and diffusion models).
We also use this repo as a reference.
We welcome community contributions!
- CS285 from UC Berkeley. (This course focuses mainly on robotics control, but it is essential if you want to become an expert in RL.)
- Introduction to Reinforcement Learning
- Hands-on Reinforcement Learning (动手学强化学习, in Chinese)
- Reinforcement Learning column on Zhihu (强化学习知乎专栏, in Chinese)
Traditional Policy Gradient
- DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models, [pdf], 2023.05
- Training Diffusion Models with Reinforcement Learning, [pdf], 2023.05
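DPOK and DDPO both cast the denoising chain as a multi-step MDP whose reward arrives only at the final image, then optimize it with REINFORCE/PPO-style updates. Below is a minimal sketch of the vanilla policy-gradient loss, assuming per-step log-probabilities of the sampler are available; the function and variable names are illustrative, not taken from either paper.

```python
import torch

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style loss over a batch of denoising trajectories.

    log_probs: (batch, T) per-step log pi(x_{t-1} | x_t, c) of the sampler
    rewards:   (batch,)  scalar reward of each final image (e.g. ImageReward)
    """
    # The terminal reward is the return for every step of the trajectory.
    returns = rewards.unsqueeze(1).expand_as(log_probs)
    # Ascend on E[R * sum_t log pi], i.e. descend on the negated term.
    return -(returns * log_probs).sum(dim=1).mean()

# Toy usage with random stand-ins for real trajectories.
log_probs = torch.randn(4, 50, requires_grad=True)  # 4 images, 50 steps
rewards = torch.randn(4)
policy_gradient_loss(log_probs, rewards).backward()
```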
GRPO
- DanceGRPO: Unleashing GRPO on Visual Generation, [pdf], 2025.05
- Flow-GRPO: Training Flow Matching Models via Online RL, [pdf], 2025.05
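GRPO methods sample a group of rollouts per prompt and standardize their rewards within the group, which removes the need for a learned value baseline. A hedged sketch of the group-relative advantage combined with a PPO-style clipped ratio (names and the clip value are illustrative, not from these papers):

```python
import torch

def grpo_loss(log_probs, old_log_probs, rewards, clip_eps=0.2):
    """Group-relative policy optimization for one prompt.

    log_probs, old_log_probs: (group, T) per-step log-probs under the
    current and behavior policies; rewards: (group,) per-image rewards.
    """
    # Group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(1)                        # broadcast over timesteps
    ratio = torch.exp(log_probs - old_log_probs)  # per-step importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.minimum(ratio * adv, clipped * adv).mean()

group, T = 8, 20
old = torch.randn(group, T)
new = (old + 0.01 * torch.randn(group, T)).requires_grad_(True)
grpo_loss(new, old, torch.randn(group)).backward()
```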
DPO
- Diffusion Model Alignment Using Direct Preference Optimization, [pdf], 2023.11
- Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model, [pdf], 2023.11
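Both works adapt direct preference optimization to diffusion: no reward model is trained, and the policy instead learns to rank the preferred image above the dispreferred one relative to a frozen reference model. A minimal sketch, assuming the four log-likelihood surrogates (in practice per-timestep denoising-error terms) are already computed; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=5000.0):
    """DPO loss on preference pairs (winner w, loser l).

    Each argument is a (batch,) log-likelihood surrogate under the
    trainable model (logp_*) or the frozen reference (ref_logp_*).
    beta is an inverse temperature; the diffusion variant uses values
    far larger than the ~0.1 common in LLM DPO.
    """
    # How much the model shifts mass toward the winner vs. the reference.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

logp_w = torch.randn(4, requires_grad=True)
logp_l = torch.randn(4, requires_grad=True)
diffusion_dpo_loss(logp_w, logp_l, torch.randn(4), torch.randn(4)).backward()
```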
Reward Feedback Learning
- ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, [pdf], 2023.04
- Directly Fine-Tuning Diffusion Models on Differentiable Rewards, [pdf], 2023.09
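Reward feedback learning skips policy gradients entirely: generate with a differentiable sampler, score with a differentiable reward model, and backpropagate through the whole chain (in practice the gradient is often truncated to the last few denoising steps to save memory). A toy sketch with stand-in modules; nothing below comes from the papers' code.

```python
import torch

# Stand-ins for a real differentiable sampler and reward model.
generator = torch.nn.Linear(64, 3 * 8 * 8)     # "sampler": latent -> image
reward_model = torch.nn.Sequential(            # "reward": image -> scalar
    torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 1))
reward_model.requires_grad_(False)             # freeze the reward model

opt = torch.optim.AdamW(generator.parameters(), lr=1e-4)
for _ in range(10):
    z = torch.randn(4, 64)                     # sampled noise / latent
    images = generator(z)                      # differentiable generation
    loss = -reward_model(images).mean()        # maximize the reward directly
    opt.zero_grad()
    loss.backward()                            # grads flow through generation
    opt.step()
```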
Alignment on AR models
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step, [pdf], 2025.01
- Autoregressive Image Generation Guided by Chains of Thought, [pdf], 2025.02
Reward Models
- ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, [pdf], 2023.04
- Human Preference Score: Better Aligning Text-to-Image Models with Human Preference, [pdf], 2023.03
RL-based
- DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models, [pdf], 2023.05
- Training Diffusion Models with Reinforcement Learning, [pdf], 2023.05
- Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning, [pdf], 2025.02
- Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards, [pdf], 2025.01
- Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation, [pdf], 2024.12
- DanceGRPO: Unleashing GRPO on Visual Generation, [pdf], 2025.05
- Flow-GRPO: Training Flow Matching Models via Online RL, [pdf], 2025.05
- MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE, [pdf], 2025.07
- TempFlow-GRPO: When Timing Matters for GRPO in Flow Models, [pdf], 2025.08
- Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning, [pdf], 2025.08
- DiffusionNFT: Online Diffusion Reinforcement with Forward Process, [pdf], 2025.09
- BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models, [pdf], 2025.09
- Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models, [pdf], 2025.09
- PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models, [pdf], 2025.09
- Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching, [pdf], 2025.09
- Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling, [pdf], 2025.09
- MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning, [pdf], 2025.09
- G^2RPO: Granular GRPO for Precise Reward in Flow Models, [pdf], 2025.10
- Reinforcing Diffusion Models by Direct Group Preference Optimization, [pdf], 2025.10
- Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models, [pdf], 2025.10
- Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning, [pdf], 2025.10
- Self-Forcing++: Towards Minute-Scale High-Quality Video Generation, [pdf], 2025.10
DPO-based
- Diffusion Model Alignment Using Direct Preference Optimization, [pdf], 2023.11
- Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model, [pdf], 2023.11
- A Dense Reward View on Aligning Text-to-Image Diffusion with Preference, [pdf], 2024.02
- Aligning Diffusion Models by Optimizing Human Utility, [pdf], 2024.04
- Boost Your Human Image Generation Model via Direct Preference Optimization, [pdf], 2024.05
- Curriculum Direct Preference Optimization for Diffusion and Consistency Models, [pdf], 2024.05
- Margin-aware Preference Optimization for Aligning Diffusion Models without Reference, [pdf], 2024.06
- DSPO: Direct Score Preference Optimization for Diffusion Model Alignment, [pdf], 2024.09
- Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models, [pdf], 2024.09
- Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization, [pdf], 2024.10
- Scalable Ranked Preference Optimization for Text-to-Image Generation, [pdf], 2024.10
- Prioritize Denoising Steps on Diffusion Model Preference Alignment via Explicit Denoised Distribution Estimation, [pdf], 2024.11
- PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation, [pdf], 2024.12
- SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation, [pdf], 2024.12
- OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization, [pdf], 2024.12
- VideoDPO: Omni-Preference Alignment for Video Diffusion Generation, [pdf], 2024.12
- Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization, [pdf], 2024.06
- Personalized Preference Fine-tuning of Diffusion Models, [pdf], 2025.01
- Calibrated Multi-Preference Optimization for Aligning Diffusion Models, [pdf], 2025.02
- Direct Distributional Optimization for Provable Alignment of Diffusion Models, [pdf], 2025.02
- InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment, [pdf], 2025.03
- CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation, [pdf], 2025.02
- Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking, [pdf], 2025.02
- Aligning Text to Image in Diffusion Models is Easier Than You Think, [pdf], 2025.03
- DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization, [pdf], 2025.02
- Flow-DPO: Improving Video Generation with Human Feedback, [pdf], 2025.01
- HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment, [pdf], 2025.02
- Fine-Tuning Diffusion Generative Models via Rich Preference Optimization, [pdf], 2025.03
- D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples, [pdf], 2025.05
- Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences, [pdf], 2025.06
- Follow-Your-Preference: Towards Preference-Aligned Image Inpainting, [pdf], 2025.09
- Towards Better Optimization For Listwise Preference in Diffusion Models, [pdf], 2025.10
Reward Feedback Learning
- ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, [pdf], 2023.04
- Aligning Text-to-Image Diffusion Models with Reward Backpropagation, [pdf], 2023.10
- Directly Fine-Tuning Diffusion Models on Differentiable Rewards, [pdf], 2023.09
- Feedback Efficient Online Fine-Tuning of Diffusion Models, [pdf], 2024.02
- Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models, [pdf], 2024.05
- Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward, [pdf], 2024.11
- InstructVideo: Instructing Video Diffusion Models with Human Feedback, [pdf], 2023.12
- IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation, [pdf], 2024.10
- Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference, [pdf], 2025.09
- Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization, [pdf], 2025.10
Technical Reports
We only list technical reports that use alignment methods:
- Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model, [pdf], 2025.02
- Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model, [pdf], 2025.03
- Seedream 3.0 Technical Report, [pdf], 2025.04
- Seedance 1.0: Exploring the Boundaries of Video Generation Models, [pdf], 2025.06
- Seedream 4.0: Toward Next-generation Multimodal Image Generation, [pdf], 2025.09
- Qwen-Image Technical Report, [pdf], 2025.08
- Skywork-UniPic2, [pdf], 2025.09
- BLIP3o-NEXT: Next Frontier of Native Image Generation, [pdf], 2025.10
Alignment on AR models
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step, [pdf], 2025.01
- Autoregressive Image Generation Guided by Chains of Thought, [pdf], 2025.02
- LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization, [pdf], 2025.03
- SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL, [pdf], 2025.04
- UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation, [pdf], 2025.05
- UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning, [pdf], 2025.05
- ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL, [pdf], 2025.05
- Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation, [pdf], 2025.06
- Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO, [pdf], 2025.05
- CoT-lized Diffusion: Let’s Reinforce T2I Generation Step-by-step, [pdf], 2025.07
- X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again, [pdf], 2025.07
- T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT, [pdf], 2025.05
- AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning, [pdf], 2025.08
- Group Critical-token Policy Optimization for Autoregressive Image Generation, [pdf], 2025.09
- STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation, [pdf], 2025.09
- Layout-Conditioned Autoregressive Text-to-Image Generation via Structured Masking, [pdf], 2025.09
Benchmarks
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, [pdf], 2022.02
- Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark, [pdf], 2022.11
- LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation, [pdf], 2023.05
- VPGen & VPEval: Visual Programming for Text-to-Image Generation and Evaluation, [pdf], 2023.05
- T2I-CompBench: An Enhanced and Comprehensive Benchmark for Compositional Text-to-Image Generation, [pdf], 2023.07
- GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment, [pdf], 2023.10
- Holistic Evaluation of Text-to-Image Models, [pdf], 2023.11
- Social Reward: Evaluating and Enhancing Generative AI through Million-User Feedback from an Online Creative Community, [pdf], 2024.02
- Rich Human Feedback for Text to Image Generation, [pdf], 2023.12
- Learning Multi-Dimensional Human Preference for Text-to-Image Generation, [pdf], 2024.05
- Evaluating Text-to-Visual Generation with Image-to-Text Generation, [pdf], 2024.04
- Multimodal Large Language Models Make Text-to-Image Generative Models Align Better, [pdf], 2024.04
- Measuring Style Similarity in Diffusion Models, [pdf], 2024.04
- DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation, [pdf], 2024.06
- PAL: Sample-Efficient Personalized Reward Modeling for Pluralistic Alignment, [pdf], 2024.06
- Video-Bench: Human-Aligned Video Generation Benchmark, [pdf], 2025.04
- ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning, [pdf], 2025.10
Reward Models
- Human Preference Score: Better Aligning Text-to-Image Models with Human Preference, [pdf], 2023.03
- ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, [pdf], 2023.04
- Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation, [pdf], 2023.05
- Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis, [pdf], 2023.06
- Improving Video Generation with Human Feedback, [pdf], 2025.01
- VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation, [pdf], 2024.06
- VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation, [pdf], 2024.12
- Unified Reward Model for Multimodal Understanding and Generation, [pdf], 2025.03
- LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment, [pdf], 2024.12
- HPSv3: Towards Wide-Spectrum Human Preference Score, [pdf], 2025.08
- RewardDance: Reward Scaling in Visual Generation, [pdf], 2025.09
- Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization, [pdf], 2025.09
- EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling, [pdf], 2025.09
- VideoScore2: Think before You Score in Generative Video Evaluation, [pdf], 2025.09
- EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing, [pdf], 2025.09
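Most reward models above are trained on pairwise human preference data with a Bradley-Terry objective: the scorer should assign the chosen image a higher score than the rejected one. A minimal sketch with a stand-in scorer (real models such as ImageReward or HPS build on CLIP/BLIP-style backbones):

```python
import torch
import torch.nn.functional as F

# Stand-in scorer over tiny 3x8x8 "images"; illustrative only.
scorer = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 1))

def bradley_terry_loss(img_win, img_lose):
    """Train the scorer so preferred images receive higher scores."""
    r_w, r_l = scorer(img_win), scorer(img_lose)
    # P(win beats lose) = sigmoid(r_w - r_l); maximize its log-likelihood.
    return -F.logsigmoid(r_w - r_l).mean()

bradley_terry_loss(torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8)).backward()
```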
Prompt Optimization
- Optimizing Prompts for Text-to-Image Generation, [pdf], 2022.12
- RePrompt: Automatic Prompt Editing to Refine AI-Generative Art Towards Precise Expressions, [pdf], 2023.02
- Improving Text-to-Image Consistency via Automatic Prompt Optimization, [pdf], 2024.03
- Dynamic Prompt Optimizing for Text-to-Image Generation, [pdf], 2024.04
- T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT, [pdf], 2025.05
- RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning, [pdf], 2025.05