From Global Semantics to Region-aligned Guidance
RePlan is an instruction-based image editing framework designed to tackle Instruction-Visual Complexity (IV-Complexity). By coupling a Multimodal Large Language Model (MLLM) planner with a diffusion-based editor, RePlan shifts the paradigm from coarse global semantics to Region-aligned Guidance, achieving significant improvements in fine-grained visual reasoning and background consistency without requiring massive paired training data.
- [Dec 26th, 2025] We updated the Gradio demo with custom attention control and optimized inference settings when using Qwen-Image-Edit as the backbone.
- [Dec 19th, 2025] We released the paper, model, and data of RePlan!
- Overview
- Introduction
- Environment Setup
- Inference
- Evaluation on IV-Edit Benchmark
- Train Your Own Planner
- Citation
Current instruction-based editing models struggle when intricate instructions meet cluttered, realistic scenes, a challenge we define as Instruction-Visual Complexity (IV-Complexity). In these scenarios, high-level global context is insufficient to distinguish specific targets from semantically similar objects (e.g., distinguishing a "used cup" from a clean glass on a messy desk).
Existing methods, including unified VLM-diffusion architectures, predominantly rely on Global Semantic Guidance: they compress instructions into global feature vectors that lack spatial grounding. Consequently, edits often "spill over" into unrelated areas or modify the wrong targets, failing to preserve background consistency.
RePlan introduces a Plan-then-Execute framework that explicitly links text to pixels. Our key contributions include:
- Reasoning-Guided Planning: A VLM planner performs Chain-of-Thought (CoT) reasoning to decompose complex instructions into structured, region-specific guidance (Bounding Boxes + Local Hints); see the sketch after this list.
- Training-Free Attention Injection: We introduce a mechanism tailored for Multimodal DiT (MMDiT) that executes edits via region-constrained attention, enabling precise, multi-region parallel edits in a single pass while preserving the background, without requiring any training of the DiT backbone.
- Efficient GRPO Training: We enhance the planner's reasoning capabilities using Group Relative Policy Optimization (GRPO). Remarkably, we achieve strong planning performance using only ~1k instruction-only samples, bypassing the need for large-scale paired image datasets.
- Interactive & Flexible Editing: RePlan's intermediate region guidance is fully editable, enabling user-in-the-loop intervention. Users can adjust bounding boxes or hints directly to refine results, and our attention mechanism supports regional negative prompts to prevent bleeding effects.
- IV-Edit Benchmark: To foster future research, we establish IV-Edit, the first benchmark specifically designed to evaluate IV-Complex editing, filling the gap left by current subject-dominated evaluation sets.
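To make the plan format and the region constraint concrete, here is a minimal Python sketch. The field names (`global_prompt`, `regions`, `bbox`, `hint`), the 16-pixel patch size, and the masking logic are illustrative assumptions, not RePlan's actual schema or implementation:

```python
import torch

# Hypothetical planner output: a global prompt plus region-specific guidance
# (bounding boxes in pixel coordinates + local hints). Field names are assumed.
plan = {
    "global_prompt": "Replace the used cup on the desk with a small potted plant",
    "regions": [
        {"bbox": [412, 306, 588, 472], "hint": "a small potted plant in a ceramic pot"},
    ],
}

def bbox_to_patch_mask(bbox, image_size, patch_size=16):
    """Flag the latent patches that fall inside a bounding box (rough sketch)."""
    x0, y0, x1, y1 = bbox
    h, w = image_size
    mask = torch.zeros(h // patch_size, w // patch_size, dtype=torch.bool)
    mask[y0 // patch_size : -(-y1 // patch_size),
         x0 // patch_size : -(-x1 // patch_size)] = True
    return mask.flatten()  # one flag per noise/image patch token

# A region-constrained attention rule would then let each local hint's text
# tokens attend only to the patch tokens flagged by its region mask, keeping
# background patches decoupled from the edited region.
region_mask = bbox_to_patch_mask(plan["regions"][0]["bbox"], image_size=(1024, 1024))
print(int(region_mask.sum()), "of", region_mask.numel(), "patch tokens lie inside the region")
```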
First, clone the repository:
git clone https://github.com/taintaintainu/RePlan-IVEdit-page.git
cd RePlan-IVEdit-page
If you only want to run demos or local inference, set up a lightweight environment:
conda create -n replan_infer python=3.10
conda activate replan_infer
pip install -e .
pip install flash-attn --no-build-isolation
For full training or evaluation on the IV-Edit benchmark:
conda create -n replan python=3.10
conda activate replan
pip install -e .[full]
pip install flash-attn --no-build-isolation
We provide both a Web Interface for interactive visualization/editing and a Command Line Interface.
Launch a local web server to interactively edit images. This interface supports visualizing and manually modifying the region guidance (Bounding Boxes & Hints) generated by the planner before execution.
# use Flux.1 Kontext dev as the backbone
python replan/inference/app.py --server_port 8080 --pipeline_type "flux"
# use Qwen-Image-Edit as the backbone
python replan/inference/app.py --server_port 8080 --pipeline_type "qwen"demo.mp4
- Upload & Select: Click the top-left area to upload an image, or select any provided example.
- Baseline Method: Copy your instruction directly into the Global Prompt text box, skip the AI Plan step, and click Run Editing. The result will be identical to using the baseline model (Flux.1 Kontext dev or Qwen-Image-Edit).
- RePlan Method: Enter your instruction and click AI Plan. After a few seconds, the model generates the Global Prompt and Region Guidance, and the BBoxes (Bounding Boxes) are visualized in the interactive interface. Click Run Editing to generate the RePlan result.
- Manual Correction: If you are unsatisfied with the AI-generated guidance, you can modify it:
  - Global Prompt: Edit the text in the prompt box below the interface.
  - Move/Resize BBoxes: Select a BBox in the interactive window to drag it or resize it by dragging its edges.
  - Add/Delete BBoxes: Drag on any empty area in the interface to create a new BBox. To delete one, select it and click the Delete button in the bottom-right.
  - Edit Hints: After selecting a BBox, modify its specific Local Hint in the bottom-right text box.
  - Click Run Editing to apply your manual changes.
- Manual Design: You can skip the AI Plan entirely. Manually enter a Global Prompt, draw your own BBoxes/Hints using the tools above, and click Run Editing.
Example of the advanced control panel. "Main": global prompt. "Hi": Hint i. "N.Bi": Noise of bbox i. "I.Bi": Image of bbox i. "N.BG": Noise of background. "I.BG": Image of background.
- Interactive Attention Matrix: Defines the fine-grained attention mask between specific components to control information flow. The Noise/Image components distinguish Latent Noise Patches from Input Image Patches.
- Rule Switch Ratio: Determines the phase transition point (`0.0`–`1.0`) from the Custom Rules (defined in the Interactive Attention Matrix above) to the Default Rules (depicted in the paper). The cutoff step is calculated as Ratio × Total Steps (e.g., a Ratio of `0.7` with `50` steps means the Custom Rules apply for steps 0–35); see the sketch after this list.
- Locality Control Strategy: The demo's predefined custom rule disables Noise Background → Noise BBox attention to isolate the background from the edit target. The Rule Switch Ratio controls the duration (number of steps) of this isolation:
  - Higher Ratio: Enforces stronger locality. Increase it to fix semantic spillover (edits leaking into the background).
  - Lower Ratio: Allows more global interaction. Decrease it to fix boundary artifacts (unnatural, sharp edges).
- Expand Value: Expands the effective attention mask relative to the bounding box size (e.g., `0.15` expands the mask by 15% of the bbox size).
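As a rough illustration of the arithmetic behind the Rule Switch Ratio and Expand Value (a sketch only; the demo may round or clamp differently, and whether the expansion is applied per side or in total is an assumption here):

```python
def custom_rule_cutoff(rule_switch_ratio: float, num_steps: int) -> int:
    """Step index up to which the Custom Rules apply before switching to the Default Rules."""
    return int(rule_switch_ratio * num_steps)

def expand_bbox(bbox, expand_value: float, image_size):
    """Grow a bbox by a fraction of its own width/height, clamped to the image bounds."""
    x0, y0, x1, y1 = bbox
    dx, dy = expand_value * (x1 - x0), expand_value * (y1 - y0)
    img_w, img_h = image_size
    return (max(0, x0 - dx), max(0, y0 - dy), min(img_w, x1 + dx), min(img_h, y1 + dy))

print(custom_rule_cutoff(0.7, 50))                            # 35 -> Custom Rules for steps 0-35
print(expand_bbox((100, 100, 300, 200), 0.15, (1024, 1024)))  # bbox grown by 15% of its size
```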
Launch a session to load the model once and perform multiple inference rounds:
python replan/inference/run_replan.py --pipeline_type "flux" # use Flux.1 Kontext dev as RePlan backbone
python replan/inference/run_replan.py --pipeline_type "qwen" # use Qwen-Image-Edit as RePlan backbone
Run inference directly by providing arguments (the program ends after one editing round):
python replan/inference/run_replan.py \
--image "assets/cup.png" \
--instruction "Replace the cup that has been used and left on the desk with a small potted plant" \
--output_dir "output/inference" \
--pipeline_type "flux"
- `--image`: Path to the input image (relative or absolute).
- `--instruction`: Editing instruction text.
- `--output_dir`: Directory to save results (default: `./output/inference`).
- `--only_save_image`: If set, only saves the edited image (omits the VLM response and visualization).
- `--vlm_ckpt_path`: Path to the Planner VLM checkpoint (default: `TainU/RePlan-Qwen2.5-VL-7B`).
- `--pipeline_type`: Diffusion pipeline type. Supports `flux` (Flux.1 Kontext dev) and `qwen` (Qwen-Image-Edit).
- `--lora_path`: (Experimental) Path to LoRA weights.
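For batch processing, you can wrap the single-round CLI in a small driver script. A sketch (the image paths and instructions below are placeholders):

```python
import subprocess
from pathlib import Path

# Placeholder batch of image -> instruction pairs.
jobs = {
    "assets/cup.png": "Replace the cup that has been used and left on the desk with a small potted plant",
}

for image_path, instruction in jobs.items():
    out_dir = Path("output/inference") / Path(image_path).stem
    subprocess.run(
        [
            "python", "replan/inference/run_replan.py",
            "--image", image_path,
            "--instruction", instruction,
            "--output_dir", str(out_dir),
            "--pipeline_type", "flux",
        ],
        check=True,  # stop on the first failed edit
    )
```

Note that each call reloads the models; for many edits, the interactive session mode above is more efficient.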
We propose the IV-Edit Benchmark to evaluate performance on IV-Complex editing scenarios.
- Dataset: ~800 manually verified instruction-image pairs.
- Capabilities: Tests 7 referring types (e.g., Spatial, Knowledge) and 16 task types (e.g., Attribute Modification, Physics Reasoning). Please refer to the paper appendix for specific categories.
- Metrics: We use Gemini-2.5-Pro to evaluate Target, Consistency, Quality, and Effect.
Examples from IV-Edit spanning a wide range of real-world scenarios and fine-grained instruction intents, including spatial, structural, and reasoning-intensive edits. Each instruction is decomposed into a referring expression and a task type, reflecting the need for both grounded understanding and visual transformation.
Evaluation consists of two steps: generating edited images and scoring them using Gemini.
Generate edited images using the desired backbone. Configuration files control model parameters and output paths.
# RePlan with Flux backbone
bash replan/eval/scripts/gen_replan.sh replan/eval/config/replan_flux.yaml
# RePlan with Qwen-Image-Edit
bash replan/eval/scripts/gen_replan.sh replan/eval/config/replan_qwen_image.yaml
# Original Flux.1 Kontext Dev
bash replan/eval/scripts/gen_kontext.sh replan/eval/config/flux_kontext.yaml
# Original Qwen-Image-Edit
bash replan/eval/scripts/gen_qwen_image.sh replan/eval/config/qwen_image.yaml
# Flux Kontext with Global Instruction Rephrasing (using Qwen)
bash replan/eval/scripts/gen_kontext_rephrase.sh replan/eval/config/qwen_kontext_rephrase.yaml
We use Gemini-2.5-Pro for evaluation.
- Configure API: Update `replan/eval/api_config.yaml` with your credentials. We support:
  - Vertex AI (credential or API key)
  - Google GenAI (API key)
  - OpenAI-compatible 3rd-party endpoints
- Run Evaluation:

  # --edited_images_dir must match the output dir from Step 1
  # --output_json is where the aggregated results are saved
  python evaluate.py \
      --edited_images_dir ./output/replan_flux \
      --output_json ./output/replan_flux.json
Quantitative comparison of open-source and proprietary image editing models on four evaluation dimensions. We also report Overall and Weighted scores. For open-source models, the highest score in each column is marked in Bold, while the second highest is indicated with Underline. RePlan achieves the best consistency and overall score among open-source models.
The RePlan VLM planner is optimized using Group Relative Policy Optimization (GRPO) in two stages, requiring only ~1k samples.
Ensure full dependencies are installed:
pip install -e .[full]
Stage 1 optimizes for valid JSON structure and reasoning-chain adequacy using format-related rewards.
bash replan/train/scripts/grpo_stage1.sh
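For intuition, a Stage-1-style format reward might check that the planner's response parses as JSON and contains a sufficiently detailed reasoning chain. The sketch below is illustrative; the field names (`reasoning`, `regions`, `bbox`, `hint`) and the score weights are assumptions, not the exact reward used in training:

```python
import json

def format_reward(response: str) -> float:
    """Toy format reward: valid JSON plan + non-trivial reasoning + complete regions."""
    try:
        plan = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    reward = 0.5  # response is valid JSON
    reasoning = plan.get("reasoning", "")
    if isinstance(reasoning, str) and len(reasoning.split()) >= 20:
        reward += 0.25  # reasoning chain is reasonably detailed
    regions = plan.get("regions", [])
    if regions and all(isinstance(r, dict) and "bbox" in r and "hint" in r for r in regions):
        reward += 0.25  # region guidance is structurally complete
    return reward
```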
Convert Checkpoint: You may need to convert the saved checkpoint to HuggingFace format:
python -m replan.train.model_merger --local_dir ./ckpt/easy_r1/stage1/global_step_30/actor
Stage 2 uses a large VLM (e.g., Qwen2.5-VL-72B) as a Reward Model to evaluate the execution quality (Target, Effect, Consistency) of the generated plans.
We use a reward-server paradigm for reward calculation, which requires deploying:
1. A VLLM server for the VLM reward model.
2. An editing model server (Kontext).
1. Start Reward Model Server (VLLM)
Navigate to the reward function directory:
cd replan/train/reward_function
Launch the VLLM server (example using Qwen2.5-VL-72B on 8 GPUs):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 4096 \
--limit-mm-per-prompt '{"image":2}' \
--disable-mm-preprocessor-cache \
--gpu-memory-utilization 0.35 \
--enforce-eager \
--port 8000
Note: You can use other VLM models or OpenAI-compatible APIs. Smaller models are supported but may degrade reward quality.
2. Start Editing Model Server (Kontext)
Deploy the image editing model. Each specified GPU will host a separate instance for data-parallel inference.
# Usage: bash start_kontext_server.sh <gpu_ids> <port>
bash start_kontext_server.sh 0,1,2,3,4,5,6,7 8001
Note: Since the reward and editing servers are accessed sequentially, they can share GPUs if memory permits.
3. Configure and Run Training
Update the server URLs in replan/train/scripts/grpo_stage2.sh:
export KONTEXT_SERVER_URL="http://localhost:8001/v1/images/generations" # Your Kontext server URL
export OPENAI_API_BASE="http://localhost:8000/v1" # Your VLLM server URL
(If servers are on different nodes, update localhost to the corresponding IP addresses.)
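Once both servers are running, you can sanity-check the VLLM endpoint with any OpenAI-compatible client. The prompt, scoring template, and image URLs below are placeholders for illustration, not RePlan's actual reward prompt:

```python
from openai import OpenAI

# Points at the VLLM server launched above; the model name must match `vllm serve`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Rate the edited image against the instruction "
                     "'replace the used cup with a small potted plant' on "
                     "Target, Effect, and Consistency (0-10 each)."},
            {"type": "image_url", "image_url": {"url": "https://example.com/original.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/edited.png"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```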
Launch training:
bash replan/train/scripts/grpo_stage2.sh
4. Convert Final Checkpoint
python -m replan.train.model_merger --local_dir ./ckpt/easy_r1/stage2/global_step_40/actor
@article{qu2025replan,
title={RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing},
author={Tianyuan Qu and Lei Ke and Xiaohang Zhan and Longxiang Tang and Yuqi Liu and Bohao Peng and Bei Yu and Dong Yu and Jiaya Jia},
journal={arXiv preprint arXiv:2512.16864},
year={2025}
}