📤 Get Started | 📄 Preprint | 🤗 Hugging Face (Subgoal) | 🤗 Hugging Face (Critique)
We introduce Thinking with Generated Images, where we enable a single LMM (Large Multimodal Model) to spontaneously generate and reason with intermediate visual thoughts via a native long-multimodal thought process.
We demonstrate the evolution from passively seeing images (single-image ingestion), to thinking with images (multi-step transformations of an input image), and finally to thinking with generated images, where the model itself generates multimodal tokens to support its reasoning and solve more complex tasks. Below, we showcase a few example scenarios under each concept.
Under the thinking with generated images paradigm, we also distinguish our unified, single-model approach from agentic, module-heavy approaches: a single model interleaves visual and textual tokens in one autoregressive pass, which naturally enables test-time scaling.
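To make the contrast concrete, the loop below sketches what single-pass interleaved decoding looks like. This is an illustrative sketch only; `next_token`, `IMAGE_TOKEN_RANGE`, and the token-id layout are hypothetical names, not the repository's code:

```python
def generate_interleaved(model, prompt_ids, max_new_tokens=4096):
    """Illustrative single-pass decode loop: text and image tokens are
    sampled from one shared vocabulary, so visual thoughts appear inline
    with textual reasoning instead of being produced by separate modules."""
    IMAGE_TOKEN_RANGE = range(4, 8196)  # hypothetical ids reserved for VQ image codes
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = model.next_token(ids)  # hypothetical helper: sample the next multimodal token
        ids.append(tok)
        if tok == model.eos_token_id:
            break
    # Runs of ids inside IMAGE_TOKEN_RANGE are later detokenized into pixels;
    # everything else is decoded as text.
    return ids
```

Because everything happens in one pass, spending more compute at inference (e.g., generating a visual draft, critiquing it, then regenerating) is simply more tokens, which is why test-time scaling falls out naturally.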
Anole-7b vs. Thinking with Generated Images on GenEval
Anole-7b vs. Thinking with Generated Images on DPG-Bench
We implement our Thinking with Generated Images paradigm by supervised fine-tuning unified autoregressive LMMs (e.g., Anole-7b) on a curated dataset of interleaved text–vision reasoning chains. This fine-tuning optimizes a composite loss that combines standard cross-entropy on multimodal tokens with a visual feature reconstruction term to ensure both semantic coherence and high-fidelity image outputs. Furthermore, this approach interleaves text and vision tokens to natively perform visual sub-goal decomposition and self-critique, and leverages test-time scaling to significantly improve vision generation quality.
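As a rough illustration of that objective, the sketch below combines token-level cross-entropy with a visual feature reconstruction term. The function signature, tensor names, and the simple additive weighting are our own assumptions, not the repository's exact implementation:

```python
import torch.nn.functional as F

def composite_loss(logits, targets, pred_vis_feats, ref_vis_feats, vis_weight=1.0):
    """Sketch of a composite SFT objective: next-token cross-entropy over
    interleaved multimodal tokens, plus a reconstruction term that pulls the
    visual features of generated image tokens toward those of the reference."""
    # Standard next-token cross-entropy over the interleaved token stream.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # Feature-space reconstruction term (MSE used here as a stand-in).
    recon = F.mse_loss(pred_vis_feats, ref_vis_feats)
    return ce + vis_weight * recon
```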
- Download the model: twgi-subgoal-anole-7b or twgi-critique-anole-7b
```bash
huggingface-cli download --resume-download GAIR/twgi-critique-anole-7b --local-dir twgi-critique-anole-7b --local-dir-use-symlinks False
huggingface-cli download --resume-download GAIR/twgi-subgoal-anole-7b --local-dir twgi-subgoal-anole-7b --local-dir-use-symlinks False
```
- Install requirements and `transformers` from the `chameleon` branch (already included in this repo). This `transformers` library is modified from leloykun's implementation.

```bash
bash install.sh
```
The inference code supports vision generation with intermediate visual sub-goals and vision generation with self-critique. We also support general multimodal generation on the original Anole-7b. Remember to download the corresponding model (twgi-subgoal-anole-7b, twgi-critique-anole-7b, or Anole-7b) and specify the model path in `./inference/inference.sh`.
```bash
cd inference
bash inference.sh
bash detokenization.sh
```
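For orientation, the sketch below shows roughly what the scripts drive under the hood, using the upstream Chameleon classes in `transformers`. Whether the bundled fork exposes identical entry points, and how it surfaces image-token outputs, are assumptions on our part, so treat this as a sketch and prefer `inference.sh` + `detokenization.sh`:

```python
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

model_path = "./twgi-critique-anole-7b"  # or ./twgi-subgoal-anole-7b / Anole-7b

# Load the checkpoint with the (modified) transformers bundled in this repo.
processor = ChameleonProcessor.from_pretrained(model_path)
model = ChameleonForConditionalGeneration.from_pretrained(model_path, device_map="auto")

inputs = processor(text="Draw a red cube on a glass table.", return_tensors="pt").to(model.device)
# One autoregressive pass emits interleaved text and image tokens; the image
# tokens are turned back into pixels by the separate detokenization step.
out = model.generate(**inputs, max_new_tokens=4096)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```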
We have open-sourced our training code and provided a minimal dataset for testing the training pipeline. Remember to specify the initial and trained model paths in `./training/train.sh`.
```bash
cd training
bash train.sh
```
We also provide example data tokenization code in `./training/tokenization.py`.
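To give a sense of what such tokenization produces, the sketch below splices discrete image-token ids into the text-token stream between begin/end-of-image markers. The marker ids, the per-image code count, and the tokenizer interfaces are illustrative assumptions, not the contents of `./training/tokenization.py`:

```python
from typing import List

BOI_ID, EOI_ID = 8197, 8196  # hypothetical begin/end-of-image marker ids

def tokenize_interleaved(segments: List[dict], text_tokenizer, image_tokenizer) -> List[int]:
    """Flatten interleaved text/image segments into one multimodal id stream.
    Each image becomes a fixed-length run of discrete VQ codebook ids."""
    ids: List[int] = []
    for seg in segments:
        if seg["type"] == "text":
            ids.extend(text_tokenizer.encode(seg["value"]))
        else:  # seg["type"] == "image"
            ids.append(BOI_ID)
            ids.extend(image_tokenizer.encode(seg["value"]))  # e.g., 1024 VQ ids per image
            ids.append(EOI_ID)
    return ids
```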
| Model Name | HF Checkpoints | License |
|---|---|---|
| twgi-subgoal-anole-7b | 🤗 7B | Chameleon License |
| twgi-critique-anole-7b | 🤗 7B | Chameleon License |
The trained models, which are based on Anole, follow the same license as Chameleon.
Please cite our paper if you find the repository helpful.
```bibtex
@article{chern2025thinkingwithgeneratedimages,
  title={Thinking with Generated Images},
  author={Chern, Ethan and Hu, Zhulin and Chern, Steffi and Kou, Siqi and Su, Jiadi and Ma, Yan and Deng, Zhijie and Liu, Pengfei},
  journal={arXiv preprint arXiv:2505.22525},
  year={2025}
}
```