📤 Get Started | 📄 Preprint | 🤗 Hugging Face (Subgoal) | 🤗 Hugging Face (Critique)
We introduce Thinking with Generated Images, where we enable a single LMM (Large Multimodal Model) to spontaneously generate and reason with intermediate visual thoughts via a native long-multimodal thought process.
We demonstrate the evolution from passively seeing images (single-image ingestion), to thinking with images (multi-step transformations of an input image), and finally to thinking with generated images, where the model itself generates multimodal tokens to support its reasoning and solve more complex tasks. Below, we showcase a few example scenarios under each concept.
Under the thinking with generated images paradigm, we also distinguish our unified, single-model approach from agentic, module-heavy approaches: a single model interleaves visual and textual tokens in one autoregressive pass, which naturally enables test-time scaling.
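To make the contrast concrete, the loop below sketches what single-pass interleaved decoding looks like. This is an illustrative sketch only; `next_token`, `IMAGE_TOKEN_RANGE`, and the token-id layout are hypothetical names, not the repository's code:

```python
def generate_interleaved(model, prompt_ids, max_new_tokens=4096):
    """Illustrative single-pass decode loop: text and image tokens are
    sampled from one shared vocabulary, so visual thoughts appear inline
    with textual reasoning instead of being produced by separate modules."""
    IMAGE_TOKEN_RANGE = range(4, 8196)  # hypothetical ids reserved for VQ image codes
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = model.next_token(ids)  # hypothetical helper: sample the next multimodal token
        ids.append(tok)
        if tok == model.eos_token_id:
            break
    # Runs of ids inside IMAGE_TOKEN_RANGE are later detokenized into pixels;
    # everything else is decoded as text.
    return ids
```

Because everything happens in one pass, spending more compute at inference (e.g., generating a visual draft, critiquing it, then regenerating) is simply more tokens, which is why test-time scaling falls out naturally.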
Anole-7b vs. Thinking with Generated Images on GenEval
Anole-7b vs. Thinking with Generated Images on DPG-Bench
We implement our Thinking with Generated Images paradigm by supervised fine-tuning unified autoregressive LMMs (e.g., Anole-7b) on a curated dataset of interleaved text–vision reasoning chains. This fine-tuning optimizes a composite loss that combines standard cross-entropy on multimodal tokens with a visual feature reconstruction term to ensure both semantic coherence and high-fidelity image outputs. Furthermore, this approach interleaves text and vision tokens to natively perform visual sub-goal decomposition and self-critique, and leverages test-time scaling to significantly improve vision generation quality.
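As a rough illustration of that objective, the sketch below combines token-level cross-entropy with a visual feature reconstruction term. The function signature, tensor names, and the simple additive weighting are our own assumptions, not the repository's exact implementation:

```python
import torch.nn.functional as F

def composite_loss(logits, targets, pred_vis_feats, ref_vis_feats, vis_weight=1.0):
    """Sketch of a composite SFT objective: next-token cross-entropy over
    interleaved multimodal tokens, plus a reconstruction term that pulls the
    visual features of generated image tokens toward those of the reference."""
    # Standard next-token cross-entropy over the interleaved token stream.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # Feature-space reconstruction term (MSE used here as a stand-in).
    recon = F.mse_loss(pred_vis_feats, ref_vis_feats)
    return ce + vis_weight * recon
```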
- Download the model: twgi-subgoal-anole-7b or twgi-critique-anole-7b
```bash
huggingface-cli download --resume-download GAIR/twgi-critique-anole-7b --local-dir twgi-critique-anole-7b --local-dir-use-symlinks False
huggingface-cli download --resume-download GAIR/twgi-subgoal-anole-7b --local-dir twgi-subgoal-anole-7b --local-dir-use-symlinks False
```
- Install requirements and `transformers` from the `chameleon` branch (already included in this repo). This `transformers` library is modified from leloykun's implementation.

```bash
bash install.sh
```
The inference code supports vision generation with intermediate visual sub-goals and vision generation with self-critique. We also support general multimodal generation on the original Anole-7b. Remember to download the corresponding model (twgi-subgoal-anole-7b, twgi-critique-anole-7b, or Anole-7b) and specify the model path in `./inference/inference.sh`.
```bash
cd inference
bash inference.sh
bash detokenization.sh
```
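For orientation, the sketch below shows roughly what the scripts drive under the hood, using the upstream Chameleon classes in `transformers`. Whether the bundled fork exposes identical entry points, and how it surfaces image-token outputs, are assumptions on our part, so treat this as a sketch and prefer `inference.sh` + `detokenization.sh`:

```python
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

model_path = "./twgi-critique-anole-7b"  # or ./twgi-subgoal-anole-7b / Anole-7b

# Load the checkpoint with the (modified) transformers bundled in this repo.
processor = ChameleonProcessor.from_pretrained(model_path)
model = ChameleonForConditionalGeneration.from_pretrained(model_path, device_map="auto")

inputs = processor(text="Draw a red cube on a glass table.", return_tensors="pt").to(model.device)
# One autoregressive pass emits interleaved text and image tokens; the image
# tokens are turned back into pixels by the separate detokenization step.
out = model.generate(**inputs, max_new_tokens=4096)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```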
We have open-sourced our training code and provided a minimal dataset for testing the training pipeline. Remember to specify the initial and trained model paths in `./training/train.sh`.
```bash
cd training
bash train.sh
```
We also provide example data tokenization code in `./training/tokenization.py`.
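To give a sense of what such tokenization produces, the sketch below splices discrete image-token ids into the text-token stream between begin/end-of-image markers. The marker ids, the per-image code count, and the tokenizer interfaces are illustrative assumptions, not the contents of `./training/tokenization.py`:

```python
from typing import List

BOI_ID, EOI_ID = 8197, 8196  # hypothetical begin/end-of-image marker ids

def tokenize_interleaved(segments: List[dict], text_tokenizer, image_tokenizer) -> List[int]:
    """Flatten interleaved text/image segments into one multimodal id stream.
    Each image becomes a fixed-length run of discrete VQ codebook ids."""
    ids: List[int] = []
    for seg in segments:
        if seg["type"] == "text":
            ids.extend(text_tokenizer.encode(seg["value"]))
        else:  # seg["type"] == "image"
            ids.append(BOI_ID)
            ids.extend(image_tokenizer.encode(seg["value"]))  # e.g., 1024 VQ ids per image
            ids.append(EOI_ID)
    return ids
```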
| Model Name | HF Checkpoints | License |
|---|---|---|
| twgi-subgoal-anole-7b | 🤗 7B | Chameleon License |
| twgi-critique-anole-7b | 🤗 7B | Chameleon License |
The trained models, which are based on Anole, follow the same license as Chameleon.
Please cite our paper if you find the repository helpful.
```bibtex
@article{chern2025thinkingwithgeneratedimages,
  title={Thinking with Generated Images},
  author={Chern, Ethan and Hu, Zhulin and Chern, Steffi and Kou, Siqi and Su, Jiadi and Ma, Yan and Deng, Zhijie and Liu, Pengfei},
  journal={arXiv preprint arXiv:2505.22525},
  year={2025}
}
```