
GAIR-NLP/thinking-with-generated-images


💡 Thinking with Generated Images



👋 Overview

We introduce Thinking with Generated Images, a paradigm in which a single LMM (Large Multimodal Model) spontaneously generates intermediate visual thoughts and reasons with them through a native long-multimodal thought process.

[Figure: framework overview]

💭 Thinking Evolution

We trace the evolution from passively seeing with images (ingesting a single image), to thinking with images (applying multi-step transformations to the input image), and finally to thinking with generated images, where the model itself generates multimodal tokens that support its reasoning on more complex tasks. Below, we show a few example scenarios for each concept.

[Figure: example scenarios for seeing with images, thinking with images, and thinking with generated images]

Within the thinking-with-generated-images paradigm, we further distinguish agentic, module-heavy approaches from our unified single-model approach, which interleaves visual and textual tokens in one autoregressive pass and thereby naturally enables test-time scaling.

[Figure: flow chart]
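As a rough illustration of what "one autoregressive pass" means, the toy sketch below decodes a single token stream in which the model can open an image span, emit image tokens, close the span, and continue with text. Everything here is an assumption for illustration: the delimiter IDs `BOI`/`EOI`/`EOS`, the fixed `IMAGE_TOKENS` length, and the `next_token` callable standing in for the LMM are hypothetical and are not this repository's API.

```python
from typing import Callable, List

# Hypothetical special token IDs and image length, for illustration only.
BOI, EOI, EOS = 1, 2, 0   # begin-of-image, end-of-image, end-of-sequence
IMAGE_TOKENS = 4          # toy value; real image spans are far longer


def generate_interleaved(
    next_token: Callable[[List[int], bool], int],
    prompt: List[int],
    max_len: int = 64,
) -> List[int]:
    """Decode one sequence in which text and image tokens share a single pass.

    `next_token(seq, in_image)` stands in for the model: it sees the whole
    prefix, text and image tokens alike, plus whether an image span is open.
    """
    seq = list(prompt)
    in_image, image_count = False, 0
    while len(seq) < max_len:
        if in_image and image_count == IMAGE_TOKENS:
            # The image span has reached its fixed length: close it, back to text.
            seq.append(EOI)
            in_image = False
            continue
        tok = next_token(seq, in_image)
        seq.append(tok)
        if in_image:
            image_count += 1  # every token inside the span counts as an image token
        elif tok == BOI:
            in_image, image_count = True, 0
        elif tok == EOS:
            break
    return seq


def scripted_model(script: List[int]) -> Callable[[List[int], bool], int]:
    """A stand-in 'model' that replays a fixed script and ignores its inputs."""
    it = iter(script)
    return lambda seq, in_image: next(it)


out = generate_interleaved(scripted_model([10, BOI, 100, 101, 102, 103, 11, EOS]), prompt=[5])
print(out)  # [5, 10, 1, 100, 101, 102, 103, 2, 11, 0]
```

Because text and image tokens live in one sequence, later text can, in principle, condition directly on image tokens the model generated earlier, without handing off to a separate image module.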

📊 Examples

Anole-7b vs. Thinking with Generated Images on GenEval

[Figure: GenEval comparison]

Anole-7b vs. Thinking with Generated Images on DPG-Bench

[Figure: DPG-Bench comparison]

🔍 Methodology

We implement the Thinking with Generated Images paradigm by applying supervised fine-tuning to unified autoregressive LMMs (e.g., Anole-7b) on a curated dataset of interleaved text–vision reasoning chains. The fine-tuning objective is a composite loss that combines standard cross-entropy on multimodal tokens with a visual feature reconstruction term, aiming for both semantic coherence and high-fidelity image outputs. Because text and vision tokens are interleaved, the model natively performs visual sub-goal decomposition and self-critique, and it leverages test-time scaling to significantly improve vision generation quality.
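The shape of such a composite objective can be sketched without any framework. This is a minimal illustration under stated assumptions: the reconstruction term is taken to be a mean squared error between predicted and reference visual feature vectors, and `recon_weight` is a hypothetical weighting hyperparameter. How the visual features are extracted, and the exact form and weighting used in training, are not shown here.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mse(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean squared error between predicted and reference visual features (assumed form)."""
    if len(pred) != len(ref):
        raise ValueError("feature vectors must have the same length")
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)


def composite_loss(
    token_logits: List[Sequence[float]],
    token_targets: List[int],
    pred_features: Sequence[float],
    ref_features: Sequence[float],
    recon_weight: float = 1.0,  # hypothetical weighting hyperparameter
) -> float:
    """Mean cross-entropy over all (text and image) tokens plus a weighted reconstruction term."""
    ce = sum(
        cross_entropy(row, tgt) for row, tgt in zip(token_logits, token_targets)
    ) / len(token_targets)
    return ce + recon_weight * mse(pred_features, ref_features)


# Two tokens with uniform logits over a 4-way vocabulary, plus a small feature mismatch.
loss = composite_loss([[0.0] * 4, [0.0] * 4], [0, 3], [1.0, 2.0], [1.0, 0.0], recon_weight=0.5)
print(round(loss, 4))  # log(4) + 0.5 * 2.0 ≈ 2.3863
```

Averaging the cross-entropy over tokens keeps its scale independent of sequence length, so a single weight can balance it against the reconstruction term.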

🚀 Get started

Installation

1. Download the model: twgi-subgoal-anole-7b or twgi-critique-anole-7b.

```bash
huggingface-cli download --resume-download GAIR/twgi-critique-anole-7b --local-dir twgi-critique-anole-7b --local-dir-use-symlinks False
huggingface-cli download --resume-download GAIR/twgi-subgoal-anole-7b --local-dir twgi-subgoal-anole-7b --local-dir-use-symlinks False
```

2. Install the requirements and the transformers library from the chameleon branch (already included in this repo). This transformers library is modified from leloykun's implementation.

```bash
bash install.sh
```

Inference

The inference code supports vision generation with intermediate visual sub-goals and vision generation with self-critique. We also support general multimodal generation on the original Anole-7b. Remember to download the corresponding model (twgi-subgoal-anole-7b, twgi-critique-anole-7b, Anole-7b) and specify the model path in ./inference/inference.sh.

```bash
cd inference
bash inference.sh
bash detokenization.sh
```
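Inference produces token IDs, and detokenization turns the image tokens back into pixels. A first stage of any such step is separating the text and image spans of the interleaved output and arranging each image span as a grid of codebook indices. The sketch below assumes hypothetical delimiter IDs `BOI`/`EOI`, square token grids, and raster (row-major) order; it does not reproduce what `detokenization.sh` runs internally, and the actual VQ decoding into pixels is omitted.

```python
import math
from typing import List, Tuple

# Hypothetical image delimiter IDs; real values depend on the tokenizer.
BOI, EOI = 1, 2


def split_segments(ids: List[int]) -> List[Tuple[str, List[int]]]:
    """Split interleaved output into ("text", ids) and ("image", ids) runs."""
    segments: List[Tuple[str, List[int]]] = []
    current: List[int] = []
    in_image = False
    for tok in ids:
        if tok == BOI and not in_image:
            if current:
                segments.append(("text", current))
            current, in_image = [], True
        elif tok == EOI and in_image:
            segments.append(("image", current))
            current, in_image = [], False
        else:
            current.append(tok)
    if current:
        # An image span cut off by the length limit is flagged rather than decoded.
        segments.append(("image_incomplete" if in_image else "text", current))
    return segments


def to_grid(image_tokens: List[int]) -> List[List[int]]:
    """Reshape a flat image span into a square grid, assuming row-major order."""
    side = math.isqrt(len(image_tokens))
    if side * side != len(image_tokens):
        raise ValueError("image span is not a square number of tokens")
    return [image_tokens[r * side:(r + 1) * side] for r in range(side)]


segs = split_segments([5, 10, BOI, 100, 101, 102, 103, EOI, 11])
print(segs)                 # [('text', [5, 10]), ('image', [100, 101, 102, 103]), ('text', [11])]
print(to_grid(segs[1][1]))  # [[100, 101], [102, 103]]
```

Flagging truncated image spans instead of decoding them avoids passing a partial grid to the image decoder.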

Training

We have open-sourced our training code and provided a minimal dataset for testing the training pipeline. Remember to specify the initial and trained model paths in ./training/train.sh.

```bash
cd training
bash train.sh
```

We also provide example code for data tokenization in ./training/tokenization.py.
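For orientation, the sketch below shows one common way to flatten an interleaved text/image example into `input_ids` and `labels` for supervised fine-tuning. The segment format, the `BOI`/`EOI` IDs, and the choice to mask prompt segments are assumptions for illustration, not the format used by `./training/tokenization.py`. The value `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so masked positions contribute nothing to the loss.

```python
from typing import Dict, List, Tuple

BOI, EOI = 1, 2       # hypothetical image delimiter IDs
IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(segments: List[Tuple[str, List[int], bool]]) -> Dict[str, List[int]]:
    """Flatten (kind, token_ids, supervised) segments into input_ids and labels.

    `kind` is "text" or "image"; image segments are wrapped in BOI/EOI.
    Segments with supervised=False (e.g. the user prompt) get IGNORE_INDEX labels.
    """
    input_ids: List[int] = []
    labels: List[int] = []
    for kind, ids, supervised in segments:
        if kind == "image":
            ids = [BOI, *ids, EOI]
        elif kind != "text":
            raise ValueError(f"unknown segment kind: {kind!r}")
        input_ids.extend(ids)
        labels.extend(ids if supervised else [IGNORE_INDEX] * len(ids))
    return {"input_ids": input_ids, "labels": labels}


ex = build_example([("text", [5, 6], False), ("image", [100, 101], True), ("text", [7], True)])
print(ex["input_ids"])  # [5, 6, 1, 100, 101, 2, 7]
print(ex["labels"])     # [-100, -100, 1, 100, 101, 2, 7]
```

Labels stay aligned with `input_ids` here because Hugging Face causal language models shift labels internally when computing the loss.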

🛠️ Models

| Model Name | HF Checkpoints | License |
|---|---|---|
| twgi-subgoal-anole-7b | 🤗 7B | Chameleon License |
| twgi-critique-anole-7b | 🤗 7B | Chameleon License |

📝 Usage and License Notices

The trained models based on Anole follow the same license as Chameleon.

Citation

Please cite our paper if you find the repository helpful.

```bibtex
@article{chern2025thinkingwithgeneratedimages,
  title={Thinking with Generated Images},
  author={Chern, Ethan and Hu, Zhulin and Chern, Steffi and Kou, Siqi and Su, Jiadi and Ma, Yan and Deng, Zhijie and Liu, Pengfei},
  journal={arXiv preprint arXiv:2505.22525},
  year={2025}
}
```

About

Doodling our way to AGI ✏️ 🖼️ 🧠
