Chroma Follow Up #11725

Merged
merged 26 commits into main on Jun 18, 2025

Conversation

DN6
Collaborator

@DN6 DN6 commented Jun 16, 2025

What does this PR do?

Follow up to #11698

This PR

  1. Adds an Img2Img pipeline for Chroma (see the usage sketch below)
  2. Fixes an issue where the modified attention mask was not passed to the transformer model, leading to the quality issues reported in Chroma fails to mask attention in the transformer #11724
  3. Cleans up docstrings

Fixes #11724
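
As an illustration of item 1, a minimal usage sketch of the new img2img pipeline (assuming the class is exposed as ChromaImg2ImgPipeline; the model id, input image, and parameter values are placeholders, not taken from this PR):

import torch
from diffusers import ChromaImg2ImgPipeline
from diffusers.utils import load_image

# placeholder checkpoint and input image
pipe = ChromaImg2ImgPipeline.from_pretrained("<chroma-checkpoint>", torch_dtype=torch.bfloat16).to("cuda")
init_image = load_image("input.png")

image = pipe(
    prompt="a portrait photo",
    image=init_image,
    strength=0.6,
    num_inference_steps=28,
).images[0]
image.save("output.png")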

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@DN6
Collaborator Author

DN6 commented Jun 16, 2025

@AmericanPresidentJimmyCarter do these changes help with the quality issue you're seeing?

cc: @Ednaordinary

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@nitinmukesh

I was also getting poor results and had given up on this. Let me try again.

asian_model_portrait_1

asian_model_portrait_2

@AmericanPresidentJimmyCarter
Contributor

test_chroma0

This one is looking much better, thank you.

See the issue for another potential problem (it may need a new issue): I think there may also be issues with T5 quantization because of the way the original authors chose to train it.

@nitinmukesh

After installing
pip install git+https://github.com/huggingface/diffusers.git@refs/pull/11725/head

Better now. I used this config:

import torch
# imports added so the snippet is self-contained; the import path of
# PipelineQuantizationConfig may differ across diffusers versions
from diffusers.quantizers import PipelineQuantizationConfig

dtype = torch.bfloat16  # compute dtype used below (assumed)

pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": dtype,
        "llm_int8_skip_modules": ["distilled_guidance_layer"],
    },
    components_to_quantize=["transformer", "text_encoder"],
)
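
For context, a config like this is passed to the pipeline at load time. A rough sketch, where the model id is a placeholder and pipeline_quant_config comes from the snippet above:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "<chroma-checkpoint>",                      # placeholder model id
    quantization_config=pipeline_quant_config,  # config defined above
    torch_dtype=torch.bfloat16,
).to("cuda")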

asian_model_portrait_0

@AmericanPresidentJimmyCarter
Contributor

AmericanPresidentJimmyCarter commented Jun 16, 2025

test_chroma0

Tested it again with a GGUF 8-bit T5 text encoder (see the issue). It seems that to get full quality, we're going to have to figure out what the forked transformer the reference implementation uses does to the text embeddings.

@AmericanPresidentJimmyCarter
Contributor

AmericanPresidentJimmyCarter commented Jun 16, 2025

Another comparison for this branch.

T5 8-bit quanto:

combined_int8

T5 bfloat16:

combined_bf16

T5 8-bit GGUF:

combined_gguf

It's crazy that there's such a world of difference between them, and neither quant seems quite right. BF16 appears to give the best quality images.

@DN6 DN6 requested a review from yiyixuxu June 17, 2025 03:55
Contributor

@Ednaordinary Ednaordinary left a comment

Thanks for this! Apologies I couldn't be of more help; I was busy all day. I did start on my own version, which I'll compare to this one if I can get it to work well.

@@ -256,6 +256,8 @@ def encode_prompt(
num_images_per_prompt: int = 1,
prompt_embeds: Optional[torch.FloatTensor] = None,
negative_prompt_embeds: Optional[torch.FloatTensor] = None,
prompt_attention_mask: Optional[torch.Tensor] = None,
negative_prompt_attention_mask: Optional[torch.Tensor] = None,
do_classifier_free_guidance: bool = True,
max_sequence_length: int = 512,
lora_scale: Optional[float] = None,
Contributor

We should include negative_prompt_embeds (and prompt_attention_mask, negative_prompt_attention_mask) in the docs. I missed it in the pipeline PR.

@Ednaordinary
Contributor

Ednaordinary commented Jun 17, 2025

Also, now that prompt_embeds and negative_prompt_embeds consistently stay the same size, we should batch cond and uncond (also simplifying attention_mask + negative_attention_mask into just attention_mask).

Update: made a PR for this in #11729

@Ednaordinary
Contributor

Ednaordinary commented Jun 17, 2025

Hm, the attention mask lines up with what lodestone shows, so I'm a bit confused why the picture quality isn't on par with how it was before the batch refactor / how it is in ComfyUI (actually, after testing, I can't get a good image in either; maybe the prompt is just bad).
image

image

@DN6 DN6 changed the title Chroma: Pass modified attention mask to Transformer Chroma Follow Up Jun 17, 2025
Collaborator

@yiyixuxu yiyixuxu left a comment

thanks!

do we already have ip-adapter for chroma?

self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor * 2)
self.default_sample_size = 128

def _get_t5_prompt_embeds(
Collaborator

Copied from?

Collaborator Author

Here we also create a custom attention mask for Chroma that unmasks a single pad token, so the implementation differs from the existing methods.
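
A minimal, self-contained sketch of that idea with toy tensors (not the pipeline's actual code):

import torch

# toy attention mask for a batch of 2 prompts, max length 8 (1 = real token, 0 = pad)
attention_mask = torch.tensor([[1, 1, 1, 0, 0, 0, 0, 0],
                               [1, 1, 1, 1, 1, 0, 0, 0]])
seq_lengths = attention_mask.sum(dim=1)                # tokens per prompt: [3, 5]
batch_indices = torch.arange(attention_mask.shape[0])
# unmask the first pad position after each prompt (clamped so it stays in range)
attention_mask[batch_indices, seq_lengths.clamp(max=attention_mask.shape[1] - 1)] = 1
print(attention_mask)  # each row now has one extra unmasked position right after the prompt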


return image_latents

def encode_prompt(
Collaborator

copied from?

Collaborator Author

Returns attention masks, so I think it would be different, no?

Contributor

@Ednaordinary Ednaordinary Jun 17, 2025

Also has negative_prompt and negative_prompt_embeds, which Flux doesn't have.

image_embeds = image_embeds.repeat_interleave(num_images_per_prompt, dim=0)
return image_embeds

def prepare_ip_adapter_image_embeds(
Collaborator

oh we already have ip adapter?

Collaborator Author

So the FluxIPAdapter does work with Chroma (you can load the adapter and run inference). It's just that the quality is not very good because the popular ones are trained for Flux Dev, while Chroma is based on Schnell weights.

I'm fine with removing it, but the consensus here was to leave it in case an IP-Adapter for Chroma gets trained.

f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
)
if prompt_attention_mask is not None and negative_prompt_attention_mask is None:
Collaborator

I think if negative_prompt is None, negative_prompt_attention_mask can be None too, no?

raise ValueError(
"Cannot provide `negative_prompt_attention_mask` without also providing `prompt_attention_mask`"
)
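
One way to read that suggestion, as a sketch (an assumption about the intended check, not the final code): only require the negative mask when a negative prompt path is actually in use.

if (
    prompt_attention_mask is not None
    and negative_prompt_attention_mask is None
    and (negative_prompt is not None or negative_prompt_embeds is not None)
):
    raise ValueError(
        "`negative_prompt_attention_mask` is required when `prompt_attention_mask` is provided together with a negative prompt"
    )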

Collaborator

I think we have to pass the mask if we pass the embeddings directly, no?

if prompt_embeds is not None and prompt_attention_mask is None:
    raise ValueError(...)

@Trojaner

Trojaner commented Jun 18, 2025

With this PR in combination with #11729, I now need to pad prompt_embeds and negative_prompt_embeds to the same size when I pass them manually. Is this intentional and consistent with other pipelines? It wasn't needed before these changes.

The following error occurs now, whereas it does not on the main branch.

23:54:30-094459 ERROR    Processing: step=base args={'prompt_embeds': 'cuda:0:torch.bfloat16:torch.Size([1, 81, 4096])', 'negative_prompt_embeds': 'cuda:0:torch.bfloat16:torch.Size([1, 4, 4096])', 'guidance_scale': 4.7, 'generator':   
                         [<torch._C.Generator object at 0x7e096db5d9d0>], 'callback_on_step_end': <function diffusers_callback at 0x7e090f82c040>, 'callback_on_step_end_tensor_inputs': ['latents', 'prompt_embeds', 'noise_pred'],       
                         'num_inference_steps': 26, 'output_type': 'latent', 'width': 1024, 'height': 1024} Sizes of tensors must match except in dimension 0. Expected size 4 but got size 81 for tensor number 1 in the list.            
23:54:30-107754 ERROR    Processing: RuntimeError                                                                                                                                                                                          
╭─────────────────────────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────────────────────────────────────────────────────────────────╮
│ /home/ml/sdnext/modules/processing_diffusers.py:105 in process_base                                                                                                                                                                     │
│                                                                                                                                                                                                                                         │
│   104 │   │   else:                                                                                                                                                                                                                     │
│ ❱ 105 │   │   │   output = shared.sd_model(**base_args)                                                                                                                                                                                 │
│   106 │   │   if isinstance(output, dict):                                                                                                                                                                                              │
│                                                                                                                                                                                                                                         │
│ /home/ml/sdnext/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py:116 in decorate_context                                                                                                                                    │
│                                                                                                                                                                                                                                         │
│   115 │   │   with ctx_factory():                                                                                                                                                                                                       │
│ ❱ 116 │   │   │   return func(*args, **kwargs)                                                                                                                                                                                          │
│   117                                                                                                                                                                                                                                   │
│                                                                                                                                                                                                                                         │
│ /home/ml/diffusers/src/diffusers/pipelines/chroma/pipeline_chroma.py:762 in __call__                                                                                                                                                    │
│                                                                                                                                                                                                                                         │
│   761 │   │   if self.do_classifier_free_guidance:                                                                                                                                                                                      │
│ ❱ 762 │   │   │   prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)                                                                                                                                             │
│   763                                                                                                                                                                                                                                   │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 4 but got size 81 for tensor number 1 in the list.

This is my workaround:

import torch.nn.functional as F
from diffusers import ChromaPipeline  # imports added so the snippet is self-contained


def get_weighted_prompt_embeddings_chroma(
    pipe: ChromaPipeline,
    prompt: str = "",
    negative_prompt: str = "",
    device=None
):
    # ... do some stuff ...

    # before the two PRs, this would work:
    # return prompt_embeds, negative_prompt_embeds

    # now this is needed:
    return _pad_prompt_embeds_to_same_size(prompt_embeds, negative_prompt_embeds)


def _pad_prompt_embeds_to_same_size(prompt_embeds_a, prompt_embeds_b):
    # zero-pad the shorter sequence (dim 1) so both embeds end up the same length
    size_a = prompt_embeds_a.size(1)
    size_b = prompt_embeds_b.size(1)

    if size_a < size_b:
        pad_size = size_b - size_a
        prompt_embeds_a = F.pad(prompt_embeds_a, (0, 0, 0, pad_size))
    elif size_b < size_a:
        pad_size = size_a - size_b
        prompt_embeds_b = F.pad(prompt_embeds_b, (0, 0, 0, pad_size))

    return prompt_embeds_a, prompt_embeds_b

@Ednaordinary
Contributor

#11729 was made yesterday, before a bunch of new commits. @DN6, can we keep them padded to the same size, or is a different size preferable for batching?

@DN6
Collaborator Author

DN6 commented Jun 18, 2025

@Trojaner Padding is consistent with other pipelines in Diffusers. We need it if we want to support batch_size > 1.

#11729 introduces batched CFG, so prompt_embeds and negative_prompt_embeds need to be concatenated and hence padded to the same size.
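
A minimal sketch of the constraint, using the shapes from the error above (the random tensors are just stand-ins for real embeds):

import torch
import torch.nn.functional as F

prompt_embeds = torch.randn(1, 81, 4096)           # cond embeds: (batch, seq_len, dim)
negative_prompt_embeds = torch.randn(1, 4, 4096)   # uncond embeds with a shorter sequence

# torch.cat along dim=0 requires every other dimension to match,
# so the shorter sequence has to be zero-padded first
pad = prompt_embeds.shape[1] - negative_prompt_embeds.shape[1]
negative_prompt_embeds = F.pad(negative_prompt_embeds, (0, 0, 0, pad))
batched = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)  # shape (2, 81, 4096)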

@DN6 DN6 merged commit 66394bf into main Jun 18, 2025
16 checks passed
@Trojaner

Trojaner commented Jun 18, 2025

@Trojaner Padding is consistent with other pipelines in Diffusers. We need it if we want to support batch_size > 1.

My question was more about the user having to pad the embeds manually instead of the pipeline doing it automatically; I was not questioning the need for padding itself. I'm fairly sure the user doesn't have to do this with (at least some) other pipelines.

@hameerabbasi
Contributor

My question was more about the user having to pad the embeds manually instead of the pipeline doing it automatically.

If one doesn't pass in the attention mask when going from prompts to embeds, then yes, it is automatic. However, I expect SDnext to actually pass the attention mask in, due to e.g. weighted parts of prompts.
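
In other words, when prompt_embeds/negative_prompt_embeds are passed directly, the corresponding masks should be passed along too. A rough fragment of what that call could look like, continuing from the workaround snippet above (argument names are taken from the encode_prompt diff earlier in this thread; treat the call signature as an assumption):

# pipe, prompt_embeds, negative_prompt_embeds and the two masks are assumed to come
# from the caller's own prompt-weighting/encoding code (see the workaround above)
image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    prompt_attention_mask=prompt_attention_mask,
    negative_prompt_attention_mask=negative_prompt_attention_mask,
    guidance_scale=4.7,
    num_inference_steps=26,
).images[0]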
