
Commit b1b71af

Merge pull request #12 from asFeng/asfeng
uploading dedit
2 parents 5e2d10d + d5ac5a2 commit b1b71af

File tree

9 files changed (+112 -8 lines)


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -33,3 +33,6 @@ yarn-error.log*
 # typescript
 *.tsbuildinfo
 next-env.d.ts
+
+
+.vscode

.vscode/settings.json

Lines changed: 0 additions & 8 deletions
This file was deleted.

app/projects/dedit/assets/arch.png

650 KB (new binary file)

Four additional image assets were added (3.51 MB, 3.81 MB, 3.53 MB, and 4.49 MB).

app/projects/dedit/page.mdx

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
import { Authors, Badges } from '@/components/utils'

# An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control

<Authors
  authors="Aosong Feng, Yale University; Weikang Qiu, Yale University; Jinbin Bai, Collov AI; Xiao Zhang, Collov AI; Zhen Dong, Collov AI; Kaicheng Zhou, Collov AI; Rex Ying, Yale University; Leandros Tassiulas, Yale University"
/>

<Badges
  venue="AAAI 2025"
  github="https://github.com/collovlabs/d-edit"
  arxiv="https://arxiv.org/abs/2403.04880"
  pdf="https://arxiv.org/pdf/2403.04880"
/>

## Introduction
Recent advances in text-to-image diffusion models have transformed image editing, enabling sophisticated control over tasks such as inpainting, text-guided editing, and item removal. Despite this progress, it remains challenging to preserve the integrity of the original image while achieving precise semantic alignment with the requested modifications. To address these challenges, we introduce D-Edit, a versatile framework that disentangles item-prompt interactions using grouped cross-attention and unique item prompts. D-Edit supports text-based, image-based, and mask-based editing, as well as item removal, within a single unified system, offering flexible and precise control for creative and practical editing applications.
## Method

#### Item-Prompt Association
The original LDM performs text-image interaction between every token in $c$ and every pixel in $z_t$ through the cross-attention matrix $A$.
In fact, such token-pixel interactions have been shown to be disentangled in nature: the attention matrix $A\in\mathbb{R}^{Z\times W}$ is usually sparse, in the sense that each column (token) attends to only a few rows (pixels) with non-zero weight.
For example, during image generation, the word "bear" has higher attention scores with pixels in the bear region than with the rest of the image.

Inspired by this natural disentanglement, we propose to segment the given image $I$ into $N$ non-overlapping items $\{I_i\}_{i=1}^{N}$ using a segmentation model (the same segmentation is applied to $z_t$ because of emergent correspondence).
A set of prompts $\{P_i\}_{i=1}^{N}$ replaces the original text prompt $P$.
We force each item $I_i$ to be controlled by a distinct prompt $P_i$ by masking out the other items, so that any change to $P_i$ does not influence the remaining items during the cross-attention control flow, which is exactly the property desired for image editing.
This results in a group of disentangled cross-attentions. For each item-prompt pair ($I_i$, $P_i$), the cross-attention can be written as
$$
q_i = w_q z^t_i \in \mathbb{R}^{Z_i\times D}, \quad
k_i = w_k c_i \in \mathbb{R}^{W_i\times D}, \quad
v_i = w_v c_i \in \mathbb{R}^{W_i\times D}, \\
A_i = \text{softmax}(q_i k_i^T) \in \mathbb{R}^{Z_i\times W_i}, \quad
\text{out}_i(c_i, z^t_i) = A_i\cdot v_i, \quad
\text{out}(\{c_i\}, \{z^t_i\}) = \sum_{i=1}^{N} \text{out}_i(c_i, z^t_i)
$$
It should be noted that such disentangled cross-attention cannot be directly applied to pretrained LDMs; further finetuning is therefore necessary to enable the model to comprehend item prompts and grouped cross-attention.

![Comparison of conventional full cross-attention and grouped cross-attention. Query, key, and value are shown as one-dimensional vectors. For grouped cross-attention, each item (corresponding to certain pixels/patches) only attends to the text prompt (two tokens) assigned to it.|scale=0.7](./assets/arch.png)
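For concreteness, the following is a minimal PyTorch sketch of the grouped cross-attention above, written from the equation rather than taken from the released code. It assumes single-head attention and disjoint boolean item masks; the function name, tensor shapes, and the $1/\sqrt{D}$ scaling (standard in attention but omitted from the equation) are illustrative choices.

```python
import torch
import torch.nn.functional as F

def grouped_cross_attention(z_t, prompt_embeds, item_masks, w_q, w_k, w_v):
    """Single-head grouped cross-attention (illustrative sketch).

    z_t           : (Z, D_pix)  latent pixel features at timestep t
    prompt_embeds : list of N tensors, item i has shape (W_i, D_txt)
    item_masks    : (N, Z) boolean masks from the segmentation model;
                    item_masks[i, p] is True if pixel p belongs to item i
    w_q, w_k, w_v : projections of shape (D_pix, D), (D_txt, D), (D_txt, D)
    """
    Z, _ = z_t.shape
    D = w_q.shape[1]
    out = torch.zeros(Z, D, dtype=z_t.dtype)

    for i, c_i in enumerate(prompt_embeds):
        pix = item_masks[i]                                # pixels of item i
        q_i = z_t[pix] @ w_q                               # (Z_i, D)
        k_i = c_i @ w_k                                    # (W_i, D)
        v_i = c_i @ w_v                                    # (W_i, D)
        A_i = F.softmax(q_i @ k_i.T / D ** 0.5, dim=-1)    # (Z_i, W_i)
        out[pix] = A_i @ v_i                               # item i only sees its own prompt

    # Because the item masks are disjoint, scattering the per-item outputs is
    # equivalent to the sum over out_i in the equation above.
    return out


# Toy usage: 2 items, 16 pixels, 2 tokens per item prompt
Z, D_pix, D_txt, D = 16, 8, 8, 8
z_t = torch.randn(Z, D_pix)
masks = torch.zeros(2, Z, dtype=torch.bool)
masks[0, :10], masks[1, 10:] = True, True
prompts = [torch.randn(2, D_txt), torch.randn(2, D_txt)]
w_q, w_k, w_v = torch.randn(D_pix, D), torch.randn(D_txt, D), torch.randn(D_txt, D)
print(grouped_cross_attention(z_t, prompts, masks, w_q, w_k, w_v).shape)  # torch.Size([16, 8])
```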
#### Linking Prompt to Item
We link prompts to items in two sequential steps. We first introduce the item prompts, each consisting of several special tokens with randomly initialized embeddings.
Then we finetune the model to build the item-prompt association.
##### Prompt Injection
We propose to represent each item in an image with several new tokens that are inserted into the existing vocabulary of the text encoder(s).
Specifically, we use 2 tokens to represent each item and initialize the newly added embedding entries from a Gaussian distribution whose mean and standard deviation are derived from the existing vocabulary.
For comparison, DreamBooth represents the image using rare tokens; ideally such rare tokens should not interfere with the existing vocabulary, but tokens with this property are hard to find.
Textual Inversion and Imagic insert new tokens into the vocabulary and semantically initialize the corresponding embeddings from given word embeddings that describe the image, which adds the extra burden of captioning the original image.
We found that randomly initialized new tokens are sufficient as item prompts, and such tokens have minimal impact on the existing vocabulary.
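As a rough illustration of this prompt-injection step, the sketch below adds two placeholder tokens per item to a Hugging Face CLIP tokenizer/text encoder and initializes their embedding rows from a Gaussian fitted to the existing vocabulary. The token naming scheme, the per-dimension statistics, and the `openai/clip-vit-large-patch14` checkpoint are assumptions for illustration, not the authors' exact setup.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

def inject_item_prompts(tokenizer, text_encoder, num_items, tokens_per_item=2):
    """Add randomly initialized item tokens to the text encoder vocabulary (sketch)."""
    new_tokens = [f"<item{i}_{j}>" for i in range(num_items) for j in range(tokens_per_item)]
    tokenizer.add_tokens(new_tokens)
    text_encoder.resize_token_embeddings(len(tokenizer))

    emb = text_encoder.get_input_embeddings().weight        # (vocab_size, D_emb)
    with torch.no_grad():
        old = emb[: -len(new_tokens)]
        # Gaussian init whose mean/std are derived from the existing vocabulary
        mean, std = old.mean(dim=0), old.std(dim=0)
        emb[-len(new_tokens):] = mean + std * torch.randn(len(new_tokens), emb.shape[1])
    return new_tokens

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
item_prompts = inject_item_prompts(tokenizer, text_encoder, num_items=3)
print(item_prompts[:2])  # ['<item0_0>', '<item0_1>']
```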
To associate items with prompts, the inserted embedding entries are then optimized to reconstruct the corresponding image to be edited using
$$
\min_e \mathbb{E}_{t,\epsilon}\left[ \| \epsilon - f_\theta (z_t, t, g_\Phi(P)) \|^2 \right],
$$
where $e\in\mathbb{R}^{NM\times D_{\text{emb}}}$ represents the embedding rows corresponding to the $N$ items, each with $M$ tokens.
##### Model Finetuning
Optimization in the first stage injects the image concept into the text encoder(s), but it cannot achieve a perfect reconstruction of the original item given the corresponding prompt.
Therefore, in the second stage, we optimize the UNet parameters with the same objective function as in the equation above.
We found that updating the parameters solely within the cross-attention layers is adequate, as we only disentangle the forward process of these layers rather than the entire model.
It should be noted that the optimizations above are run on only one image, or on two images (the target and the reference) if image-based editing is needed.
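The following is a condensed sketch of the two optimization stages, assuming a Stable-Diffusion-style setup with a diffusers `unet`, `text_encoder`, and `noise_scheduler` already loaded, along with `latents`, `prompt_ids`, and `num_new_tokens` from the steps above; the step counts and learning rates are placeholders, and selecting only the new embedding rows (stage 1) or the cross-attention layers (stage 2) is shown schematically rather than as the released training script.

```python
import torch
import torch.nn.functional as F

# Assumed to be in scope (hypothetical setup): unet, text_encoder, noise_scheduler,
# latents (VAE-encoded image), prompt_ids (token ids of the item prompts),
# and num_new_tokens (= N items x M tokens inserted during prompt injection).

def denoising_loss():
    """One Monte-Carlo sample of E_{t,eps}[ || eps - f_theta(z_t, t, g_phi(P)) ||^2 ]."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
    z_t = noise_scheduler.add_noise(latents, noise, t)
    cond = text_encoder(prompt_ids).last_hidden_state          # g_phi(P) with item prompts
    eps_pred = unet(z_t, t, encoder_hidden_states=cond).sample
    return F.mse_loss(eps_pred, noise)

# ---- Stage 1: optimize only the newly inserted embedding rows (e) ----
unet.requires_grad_(False)
emb = text_encoder.get_input_embeddings().weight
orig_rows = emb.detach().clone()
opt1 = torch.optim.AdamW([emb], lr=5e-4)
for _ in range(500):
    opt1.zero_grad(); denoising_loss().backward(); opt1.step()
    with torch.no_grad():                                       # keep the original vocabulary frozen
        emb[:-num_new_tokens] = orig_rows[:-num_new_tokens]

# ---- Stage 2: finetune only the cross-attention layers of the UNet ----
text_encoder.requires_grad_(False)
xattn_params = [p for n, p in unet.named_parameters() if "attn2" in n]
for p in xattn_params:
    p.requires_grad_(True)
opt2 = torch.optim.AdamW(xattn_params, lr=1e-5)
for _ in range(500):
    opt2.zero_grad(); denoising_loss().backward(); opt2.step()
```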
##### Editing with Item-Prompt Freestyle
After the two-step optimization, the model can exactly reconstruct the original image from the set of prompts corresponding to each item, given an appropriate classifier-free guidance scale.
We then achieve various disentangled image edits by changing the prompt associated with an item, the mask of an item-prompt pair, or the mapping between items and prompts.
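To make this "freestyle" interface concrete, here is a schematic sketch (not the released API) of how the three kinds of edits can be expressed as operations on the learned item-prompt pairs; the `ItemPrompt` container, the item names, and the field names are invented for illustration.

```python
from dataclasses import dataclass, replace
import torch

@dataclass
class ItemPrompt:
    mask: torch.Tensor        # boolean pixel mask of the item
    tokens: list[str]         # learned item tokens, optionally mixed with words

# Hypothetical learned state: one ItemPrompt per segmented item.
items = {
    "sofa": ItemPrompt(mask=torch.zeros(64, 64, dtype=torch.bool), tokens=["<item0_0>", "<item0_1>"]),
    "lamp": ItemPrompt(mask=torch.zeros(64, 64, dtype=torch.bool), tokens=["<item1_0>", "<item1_1>"]),
}

# (a) Text-based editing: augment one item prompt with words, keep the rest unchanged.
items["sofa"] = replace(items["sofa"], tokens=items["sofa"].tokens + ["red", "leather"])

# (b) Mask-based editing: move/reshape/resize an item by editing its mask.
moved_mask = torch.roll(items["lamp"].mask, shifts=(0, 10), dims=(0, 1))
items["lamp"] = replace(items["lamp"], mask=moved_mask)

# (c) Remapping or removal: assign an item a prompt learned from a reference image,
#     or drop the pair entirely so that the region is regenerated without it.
items.pop("lamp")

# The edited item-prompt pairs are then fed back through the grouped cross-attention
# (see the sketch above) to regenerate the image.
```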
## Experiments
#### Text-based Editing
![The learned prompt (denoted as [v]) can be combined with words to achieve refinement/editing of the target item. (a) Augment an item prompt with words while keeping other prompts unchanged for editing. (b) Generate the entire image with certain item prompt(s) augmented with text words for personalization.|scale=0.7](./assets/img3_text2.jpg)

#### Image-based Editing
![Qualitative comparison of image-guided editing. D-Edit is compared with AnyDoor, Paint-by-Example, and TF-ICON on item replacement and face swapping.|scale=0.7](./assets/img4_image2.jpg)

#### Mask-based Editing
![Different types of mask-based editing: (a) moving/swapping items; (b) reshaping an item; (c) resizing an item.|scale=0.7](./assets/img5_mask.jpg)

#### Item Removal
![Removing items one by one from the image.|scale=0.7](./assets/img7_remove.jpg)

config/publications.ts

Lines changed: 14 additions & 0 deletions
@@ -29,6 +29,20 @@ export const publications: Publication[] = [
     impact: "Beyond theoretical guarantees, we demonstrate the improvements achieved by LResNet in building hyperbolic deep learning models, where we conduct extensive experiments to show its superior performance in graph and image modalities across CNNs, GNNs, and graph Transformers.",
     tags: [Tag.MultiModalFoundationModel],
   },
+  {
+    title: "D-Edit: An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control",
+    authors:
+      "Aosong Feng, Weikang Qiu, Jinbin Bai, Xiao Zhang, Zhen Dong, Kaicheng Zhou, Rex Ying, and Leandros Tassiulas",
+    venue: "AAAI 2025",
+    page: "dedit",
+    paper: "https://arxiv.org/abs/2403.04880",
+    code: "https://github.com/collovlabs/d-edit",
+    tags: [Tag.Applications],
+    abstract:
+      "D-Edit is a diffusion-based image editing framework that disentangles the image-prompt interaction into item-prompt associations, enabling precise and harmonious edits across the image and achieving state-of-the-art results in a unified, versatile approach.",
+    impact:
+      "The proposed method is a unified editing framework that supports image-based, text-based, mask-based editing, and item removal within a single cohesive system.",
+  },
   {
     title: "Protein-Nucleic Acid Complex Modeling with Frame Averaging Transformer",
     authors: "Tinglin Huang, Zhenqiao Song, Rex Ying, Wengong Jin",
