"The eye sees only what the mind is prepared to comprehend." – Robertson Davies
Humans don't just passively observe; we actively engage with visual information, sketching, highlighting, and manipulating it to understand. OpenThinkIMG aims to bring this interactive visual cognition to AI, enabling agents that can genuinely "think with images."
- [2025/06/01] 🐳 We have released an official Docker image of the tool server.
- [2025/05/17] Our work is reported by Qubit (量子位).
- [2025/05/14] Our work is reported by both Deep Learning and NLP (深度学习自然语言处理) and Machine Learning and NLP (机器学习算法与自然语言处理).
- [2025/05/13] The models and datasets are released on HuggingFace.
- [2025/05/13] OpenThinkIMG codebase is released along with evaluation scripts. Try it out!
- [2025/05/13] OpenThinkIMG paper available on arXiv.
OpenThinkIMG is an end-to-end open-source framework that empowers Large Vision-Language Models (LVLMs) to think with images. It features:
- Flexible vision tool management and easy integration of new tools.
- Efficient dynamic inference with distributed tool deployment.
- A streamlined SFT (Supervised Fine-Tuning) and Agent-RL (Reinforcement Learning) training pipeline, including our novel V-ToolRL method.
Our goal is to enable AI agents to interactively use visual tools to decompose, analyze, and solve complex visual problems, moving beyond passive observation towards active visual cognition.
Current LVLMs excel at many tasks but often struggle when:
- Deep, iterative visual reasoning is required, not just single-pass description.
- Precise interaction with visual content (e.g., reading specific chart values, identifying exact locations) is crucial.
- Dynamic generalization of learned tool-use to new scenarios is needed.
OpenThinkIMG addresses these challenges by:
- Bridging the Gap to Human-like Visual Cognition: We enable LVLMs to "think with images" by actively using a suite of visual tools, much like humans use sketches or highlights to understand complex scenes.
- Standardizing a Fragmented Landscape: The current ecosystem for vision tools lacks unification. OpenThinkIMG provides:
- Unified Tool Interfaces: A standardized way to define and interact with diverse visual tools.
- Modular, Distributed Deployment: Tools run as independent services, enhancing scalability, fault isolation, and resource management.
- Moving Beyond Static SFT Limitations: Supervised Fine-Tuning (SFT) on fixed trajectories often leads to poor generalization and lacks adaptability. We introduce:
- V-ToolRL for Adaptive Policies: Our novel reinforcement learning framework allows agents to autonomously discover optimal tool-usage strategies by directly optimizing for task success through interaction and feedback. This leads to significantly better performance and adaptability compared to SFT-only approaches.
- Driving Reproducible Research: By open-sourcing the entire framework, we aim to provide a common platform for the community to build upon, experiment with, and advance the field of tool-augmented visual reasoning.
This framework comprises three main components: the foundational tool service provider tool_server, the inference and evaluation framework TF EVAL, and the RL training component R1-V-TOOL. Each component has its own environment requirements. The tool_server serves as the foundation and must be successfully launched before performing any inference or training.

You can either run our tool server using the provided Docker image or launch the tool_server locally, depending on your environment preferences.

We recommend trying our tool_server Docker image first. You can either pull our prebuilt tool_server image or build it yourself!
📌 Note:
- It's recommended to use the -v /path/to/your/logdir:/log option to mount a host directory to the container's /log directory, which allows you to view runtime logs and read the controller_addr output.
- The controller address is saved to /path/to/your/logdir/controller_addr.json rather than the default location; make sure to provide this path to tool_manager when using it (see the sketch below).
- By default, the molmoPoint worker is configured to run in 4-bit mode to minimize VRAM usage. To customize GPU behavior or access advanced settings, you can log into the container and edit /app/OpenThinkIMG/tool_server/tool_workers/scripts/launch_scripts/config/service_apptainer.yaml.
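For example, once the container is up you can read the controller address from the mounted log directory and check that the controller is reachable before running inference. This is only a minimal sketch: the exact JSON layout of controller_addr.json is an assumption here, so adjust the parsing to whatever the file on your machine actually contains.

```python
# Minimal sketch: read the controller address from the mounted log directory and
# verify the controller port is reachable. The JSON layout is an assumption --
# inspect controller_addr.json on your machine and adjust accordingly.
import json
import socket
from urllib.parse import urlparse

LOGDIR = "/path/to/your/logdir"  # the host directory you mounted to /log

with open(f"{LOGDIR}/controller_addr.json") as f:
    data = json.load(f)

# The address may be stored directly or under a key (assumption); handle both.
controller_addr = data if isinstance(data, str) else next(iter(data.values()))
parsed = urlparse(controller_addr)

with socket.create_connection((parsed.hostname, parsed.port), timeout=5):
    print(f"Controller reachable at {controller_addr}")
```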
We have released a Docker image for tool_server as well as a slim version, tool_server_slim.

The tool_server_slim image is a lightweight version of the tool_server image that removes the model weights to reduce the image size. To use it, manually prepare the model weights and place them under the /weights folder of the container, as described in the Build Docker Image by Yourself section. You can pull either image from Aliyun or Docker Hub.
| Image | Aliyun | Docker Hub |
|---|---|---|
| tool_server | crpi-fs6w5qkjtxy37mko.cn-shanghai.personal.cr.aliyuncs.com/hitsmy/tool_server:v0.1 | hitsmy/tool_server:v0.1 |
| tool_server_slim | crpi-fs6w5qkjtxy37mko.cn-shanghai.personal.cr.aliyuncs.com/hitsmy/tool_server_slim:v0.1 | hitsmy/tool_server_slim:v0.1 |
# Pull the docker image and run
docker pull crpi-fs6w5qkjtxy37mko.cn-shanghai.personal.cr.aliyuncs.com/hitsmy/tool_server:v0.1
docker run -it \
--gpus all \
--name tool_server \
-v /path/to/your/logdir:/log \
-w /app/OpenThinkIMG \
--network host \
crpi-fs6w5qkjtxy37mko.cn-shanghai.personal.cr.aliyuncs.com/hitsmy/tool_server:v0.1 \
bash -c \
"python /app/OpenThinkIMG/tool_server/tool_workers/scripts/launch_scripts/start_server_local.py \
--config /app/OpenThinkIMG/tool_server/tool_workers/scripts/launch_scripts/config/service_apptainer.yaml"
# Test the server
python OpenThinkIMG/tool_server/tool_workers/online_workers/test_cases/worker_tests/test_all.py
We have provided a Dockerfile at OpenThinkIMG/Dockerfile; you can build the Docker image from it.
Sub-step 1: Prepare the weights

Some tools require specific pretrained weights. Please ensure these model weights are prepared and placed in the appropriate paths before building the image; the build process will automatically copy them into the /weights folder of the container.
The directory structure is organized as follows:
project-root/
├── weights/
│   ├── Molmo-7B-D-0924/                 # allenai/Molmo-7B-D-0924
│ ├── sam2-hiera-large/ # facebook/sam2-hiera-large
│ ├── groundingdino_swint_ogc.pth # https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
│ └── GroundingDINO_SwinT_OGC.py # https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
├── OpenThinkIMG/
│ └── Dockerfile
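If you still need to fetch these weights, the sketch below is one way to do it with huggingface_hub plus plain HTTP downloads. The repo IDs and checkpoint URL come from the directory tree above (the GroundingDINO config is fetched via its raw GitHub URL); double-check them against the tools you actually plan to run.

```python
# Sketch: download the tool weights into ./weights/ so the Docker build can copy them.
# Assumes `pip install huggingface_hub` and sufficient disk space.
from pathlib import Path
from urllib.request import urlretrieve
from huggingface_hub import snapshot_download

weights = Path("weights")
weights.mkdir(exist_ok=True)

# Model repos listed in the directory tree above
snapshot_download("allenai/Molmo-7B-D-0924", local_dir=weights / "Molmo-7B-D-0924")
snapshot_download("facebook/sam2-hiera-large", local_dir=weights / "sam2-hiera-large")

# GroundingDINO checkpoint and config (raw counterpart of the config link above)
urlretrieve(
    "https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth",
    weights / "groundingdino_swint_ogc.pth",
)
urlretrieve(
    "https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    weights / "GroundingDINO_SwinT_OGC.py",
)
```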
Sub-step 2: Start the building procedure!
git clone https://github.com/OpenThinkIMG/OpenThinkIMG.git
cd OpenThinkIMG
docker build -f Dockerfile -t tool_server:v0.1 ..  # Build context is the parent dir so weights/ is included; this might take a while ...
Sub-step 3: Run the image and test!
docker run -it \
--gpus all \
--name tool_server \
-v /path/to/your/logdir:/log \
-w /app/OpenThinkIMG/ \
--network host \
tool_server:v0.1

# Test the server (run inside the container)
python tool_server/tool_workers/online_workers/test_cases/worker_tests/test_all.py
You can choose to start the tool_server through SLURM or run it on a local machine.
First, prepare a PyTorch-based environment:
- torch==2.0.1+cu118
# [Optional] Create a clean Conda environment
conda create -n tool-server python=3.10
conda activate tool-server
# Install PyTorch and dependencies (make sure CUDA version matches)
pip install -e git+https://github.com/facebookresearch/sam2.git
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
# Install this project
git clone https://github.com/OpenThinkIMG/OpenThinkIMG.git
pip install -e OpenThinkIMG
pip install -r OpenThinkIMG/requirements/requirements.txt # Tool Server Requirements
We deliberately selected minimal dependencies in this project to reduce the risk of conflicts. As a result, you may need to manually install any missing packages based on your environment.
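Before launching any services, it can help to confirm the environment is sane, i.e. that PyTorch imports cleanly and sees your GPUs. A quick check:

```python
# Quick sanity check of the tool-server environment before launching any services.
import torch

print("torch version :", torch.__version__)        # expected: 2.0.1+cu118 for this setup
print("CUDA available:", torch.cuda.is_available())
print("GPU count     :", torch.cuda.device_count())
```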
It's recommended to start the tool server through SLURM because it's more flexible.
## First, modify the config to adapt to your own environment
## OpenThinkIMG/tool_server/tool_workers/scripts/launch_scripts/config/all_service_example.yaml
## Start all services
cd OpenThinkIMG/tool_server/tool_workers/scripts/launch_scripts
python start_server_config.py --config ./config/all_service_example.yaml
## Press Ctrl + C to shutdown all services automatically.
We made a slight modification to start_server_config.py to create start_server_local.py, primarily by removing the SLURM job-detection logic and adapting it for local execution.
## First, modify the config to adapt to your own environment
## OpenThinkIMG/tool_server/tool_workers/scripts/launch_scripts/config/all_service_example_local.yaml
## Start all services
cd OpenThinkIMG/tool_server/tool_workers/scripts/launch_scripts
python start_server_local.py --config ./config/all_service_example_local.yaml
## Press Ctrl + C to shutdown all services automatically.
You can then inspect the log files to diagnose and resolve any potential issues. Due to the complexity of this project, we cannot guarantee that it will run without errors on every machine.
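For example, a small helper like the one below prints the tail of the most recently modified log file; the ./logs location is a placeholder, so point it at wherever your service config actually writes logs.

```python
# Sketch: print the tail of the most recently modified log file.
# The log directory is a placeholder -- use the path configured in your service YAML.
from pathlib import Path

log_dir = Path("./logs")  # placeholder; adjust to your configuration
latest = max(log_dir.glob("**/*.log"), key=lambda p: p.stat().st_mtime)
print(f"--- {latest} ---")
print("\n".join(latest.read_text(errors="ignore").splitlines()[-50:]))
```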
First, prepare a PyTorch + vLLM environment:
- vllm>=0.7.3
- torch==2.5.1+cu121
- transformers>=4.49.0
- flash_attn>=2.7.3
# [Optional] Create a clean Conda environment
conda create -n vllm python=3.10
conda activate vllm
# Install PyTorch and dependencies (make sure CUDA version matches)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# Install this project
git clone https://github.com/OpenThinkIMG/OpenThinkIMG.git
pip install -e OpenThinkIMG
pip install -r OpenThinkIMG/requirements/inference_requirements.txt # Inference requirements
accelerate launch --config_file ${accelerate_config} \
-m tool_server.tf_eval \
--model qwen2vl \
--model_args pretrained=Qwen/Qwen2-VL-7B-Instruct \
--task_name chartgemma \
--verbosity INFO \
--output_path ./tool_server/tf_eval/scripts/logs/results/chartgemma/qwen2vl.jsonl \
--batch_size 2 \
--max_rounds 3
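Here ${accelerate_config} is a standard 🤗 Accelerate configuration file. If you do not have one yet, you can either run accelerate config interactively or generate a basic single-node config programmatically, as sketched below (write_basic_config is part of the accelerate package; adjust the mixed-precision setting to your hardware).

```python
# Sketch: generate a basic single-node Accelerate config to use as ${accelerate_config}.
from accelerate.utils import write_basic_config

write_basic_config(mixed_precision="bf16", save_location="accelerate_config.json")
# Then launch with: accelerate launch --config_file accelerate_config.json -m tool_server.tf_eval ...
```

Alternatively, you can collect all evaluation arguments in a single config file and launch as follows: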
accelerate launch --config_file ${accelerate_config} \
-m tool_server.tf_eval \
--config ${config_file}
- model_args:
    model: qwen2vl
    model_args: pretrained=Qwen/Qwen2-VL-7B-Instruct
    batch_size: 2
    max_rounds: 3
    stop_token: <stop>
  task_args:
    task_name: chartgemma
    resume_from_ckpt:
      chartgemma: ./tool_server/tf_eval/scripts/logs/ckpt/chartgemma/qwen2vl.jsonl
    save_to_ckpt:
      chartgemma: ./tool_server/tf_eval/scripts/logs/ckpt/chartgemma/qwen2vl.jsonl
  script_args:
    verbosity: INFO
    output_path: ./tool_server/tf_eval/scripts/logs/results/chartgemma/qwen2vl.jsonl
For detailed information and configuration settings, please refer to our documentation.
Once the vision tools are properly deployed, we provide a flexible training pipeline to teach models how to plan and invoke tools effectively through SFT and our proposed V-ToolRL methods.
Our training pipeline builds on the solid foundation of OpenR1, integrating visual tools as external reasoning capabilities.
To run the training code, make sure to install the additional required packages:
pip install -r requirements/requirements_train.txt
We provide a customized implementation of V-ToolRL for training models to leverage vision tools dynamically in complex tasks.
torchrun --nproc_per_node=${nproc_per_node} \
--nnodes="1" \
--node_rank="0" \
--master_addr="127.0.0.1" \
--master_port=${master_port} \
src/open_r1/tool_grpo.py --use_vllm True \
--output_dir ${output_dir} \
--model_name_or_path ${model_path} \
--dataset_name ${data_path} \
--max_prompt_length 16000 \
--max_completion_length 2048 \
--temperature 1.0 \
--seed 42 \
--learning_rate 1e-6 \
--num_generations 8 \
--lr_scheduler_type "constant" \
--vllm_gpu_memory_utilization 0.8 \
--deepspeed ${DS_CONFIG} \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 12 \
--logging_steps 1 \
--bf16 true \
--report_to wandb \
--gradient_checkpointing true \
--attn_implementation flash_attention_2 \
--max_pixels 200000 \
--num_train_epochs 1 \
--run_name $RUN_NAME \
--save_steps 100 \
--save_only_model true \
--controller_addr http://SH-IDCA1404-10-140-54-15:20001 \
--use_tool true
📈 This helps the model learn dynamic planning and tool invocation from environment feedback. Make sure --controller_addr points to the controller address of your own running tool server (e.g., the one recorded in controller_addr.json).
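As a purely illustrative example of the idea of directly optimizing for task success, an outcome-based reward for a batch of rollouts could look like the sketch below; this is not the exact reward implemented in src/open_r1/tool_grpo.py, and the answer-tag convention is an assumption.

```python
# Illustrative outcome reward for tool-augmented rollouts (NOT the exact reward in
# src/open_r1/tool_grpo.py). The <answer>...</answer> tag convention is an assumption.
import re

def extract_answer(completion: str) -> str:
    """Pull the final answer out of a rollout."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return (match.group(1) if match else completion).strip()

def accuracy_reward(completions: list[str], ground_truths: list[str]) -> list[float]:
    """1.0 when the predicted answer matches the reference, else 0.0."""
    return [
        1.0 if extract_answer(c).lower() == gt.strip().lower() else 0.0
        for c, gt in zip(completions, ground_truths)
    ]
```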
We also support supervised fine-tuning for training models on curated tool usage demonstrations. Modify the config according to your use case:
accelerate launch --num_machines 1 --num_processes 6 --main_process_port 29502 --multi_gpu \
src/open_r1/sft.py \
--output_dir ${output_dir} \
--model_name_or_path ${model_path} \
--dataset_name ${data_path} \
--seed 42 \
--learning_rate 2e-5 \
--max_seq_length 4096 \
--deepspeed config/deepspeed/ds_z3_offload_config.json \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--logging_steps 1 \
--report_to wandb \
--gradient_checkpointing true \
--attn_implementation flash_attention_2 \
--bf16 \
--num_train_epochs 2 \
--run_name $RUN_NAME \
--save_steps 100 \
--warmup_ratio 0.1 \
--save_only_model true
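For orientation, a curated tool-use demonstration conceptually pairs an image and question with a short tool-call trajectory and a final answer. The field names below are hypothetical placeholders, not the actual dataset schema; please refer to the released HuggingFace datasets for the real format.

```python
# Hypothetical shape of a single SFT demonstration -- field names are illustrative only;
# see the released datasets on HuggingFace for the actual schema.
example = {
    "image": "charts/sample_0001.png",
    "question": "In which year does the revenue of the blue series peak?",
    "trajectory": [
        {"tool": "OCR", "arguments": {}},
        {"tool": "DrawVerticalLineByX", "arguments": {"x": 315}},
    ],
    "answer": "2019",
}
```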
OpenThinkIMG is currently an alpha release but is actively being developed. The core end-to-end system, including tool integration, trajectory generation, SFT (Cold-Start), and V-ToolRL training, is functional and can be used to replicate the results in our paper.
The project team is actively working on the following key milestones:
- 🥇 Release of Pre-trained Models: Providing readily usable SFT-initialized and V-ToolRL-trained agent models (e.g., based on Qwen2-VL-2B).
- 🛠️ Expanding the Vision Toolset: Integrating more diverse and powerful vision tools (e.g., advanced image editing, 3D analysis tools).
- 🤖 Broader LVLM Backbone Support: Adding easy integration for more open-source LVLMs (e.g., LLaVA series, MiniGPT-4).
- 📊 More Benchmarks & Evaluation Suites: Extending evaluation to a wider range of visual reasoning tasks beyond chart reasoning.
- 🌐 Community Building: Fostering an active community through GitHub discussions, contributions, and collaborations.
We welcome contributions and feedback to help us achieve these goals!
Tool | Input | Output | Description |
---|---|---|---|
GroundingDINO | image + text query | bounding boxes | Object detection producing boxes for any target |
SAM | image + bounding box | segmentation mask | Generates precise segmentation masks based on provided regions |
OCR | image | text strings + bounding boxes | Optical character recognition for extracting text from images |
Crop | image + region coordinates | cropped image | Extracts a sub-region of the image for focused analysis |
Point | image + target description | point coordinates | Uses a model to predict the location of a specified object |
DrawHorizontalLineByY | image + Y-coordinate | annotated image | Draws a horizontal line at the given Y-coordinate |
DrawVerticalLineByX | image + X-coordinate | annotated image | Draws a vertical line at the given X-coordinate |
ZoominSubplot | image + description (title/pos) | subplot images | Zooms in on subplot(s) matching the description |
SegmentRegionAroundPoint | image + point coordinate | localized mask | Refines segmentation around a specified point |
💡 More vision tools are coming soon!
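To make a couple of these contracts concrete, here is a standalone Pillow sketch that mirrors the Crop and DrawHorizontalLineByY rows of the table; it illustrates the input/output behavior only and is not the framework's own implementation.

```python
# Standalone illustration of two tool contracts from the table above
# (not the framework's implementation): Crop and DrawHorizontalLineByY.
from PIL import Image, ImageDraw

def crop(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop: image + region coordinates (left, top, right, bottom) -> cropped image."""
    return image.crop(box)

def draw_horizontal_line_by_y(image: Image.Image, y: int) -> Image.Image:
    """DrawHorizontalLineByY: image + Y-coordinate -> annotated image."""
    annotated = image.copy()
    ImageDraw.Draw(annotated).line([(0, y), (annotated.width, y)], fill="red", width=3)
    return annotated

if __name__ == "__main__":
    img = Image.open("chart.png")  # any local test image
    crop(img, (0, 0, 200, 200)).save("crop.png")
    draw_horizontal_line_by_y(img, 120).save("line.png")
```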
Our V-ToolRL approach significantly boosts performance:
Model | Method | Accuracy (%) |
---|---|---|
GPT-4.1 | Zero-shot | 50.71 |
Gemini-2.0-flash-exp | Zero-shot | 68.20 |
--- | --- | --- |
CogCom | SFT (CoM) | 15.07 |
TACO | SFT (CoTA) | 30.50 |
--- | --- | --- |
Qwen2-vl-2B | Zero-shot | 29.56 |
Qwen2-vl-2B-SFT | SFT | 45.67 |
Text-based RL | RL (No Vis) | 51.63 |
V-ToolRL | V-ToolRL | 59.39 |
V-ToolRL not only enhances our base model by +29.83 points but also outperforms other open-source tool-augmented agents and even strong closed-source models like GPT-4.1.
We welcome contributions of all kinds! In our Documentation you’ll find detailed guides for:
- Importing custom models
- Defining and integrating new vision tools
- Extending the training pipeline
To contribute:
- Fork the repository and create a feature branch (e.g., feature/new-vision-tool).
- Implement your changes, adding or updating tests under tests/.
- Submit a pull request referencing the relevant issue, with clear descriptions and code snippets.
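For instance, a tool-style helper can be covered with a small, self-contained pytest like the one below; in a real contribution you would import the tool from the package instead of defining it inline.

```python
# tests/test_draw_horizontal_line.py -- minimal self-contained example of a tool test.
# In a real contribution, import your tool from the package rather than defining it here.
from PIL import Image, ImageDraw

def draw_horizontal_line_by_y(image, y, color=(255, 0, 0)):
    annotated = image.copy()
    ImageDraw.Draw(annotated).line([(0, y), (annotated.width, y)], fill=color, width=3)
    return annotated

def test_line_is_drawn_at_requested_row():
    img = Image.new("RGB", (32, 32), "white")
    out = draw_horizontal_line_by_y(img, 10)
    assert out.size == img.size
    assert out.getpixel((16, 10)) == (255, 0, 0)  # the requested row is colored
```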
We thank the Visual Sketchpad and TACO teams for inspiring our vision-driven reasoning paradigm.
Please cite the following if you find OpenThinkIMG helpful:
@article{su2025openthinkimg,
title={OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning},
author={Su, Zhaochen and Li, Linjie and Song, Mingyang and Hao, Yunzhuo and Yang, Zhengyuan and Zhang, Jun and Chen, Guanjie and Gu, Jiawei and Li, Juntao and Qu, Xiaoye and others},
journal={arXiv preprint arXiv:2505.08617},
year={2025}
}