
Commit 4081f87

Pfannkuchensack authored and JPPhoto committed
fix(flux2): Fix FLUX.2 Klein image generation quality (#8838)
* fix(flux2): Fix image quality degradation at resolutions > 1024x1024

  This commit addresses severe quality degradation and artifacts when
  generating images larger than 1024x1024 with FLUX.2 Klein models.

  Root causes fixed:
  1. Dynamic max_image_seq_len in scheduler (flux2_denoise.py)
     - Previously hardcoded to 4096 (1024x1024 only)
     - Now dynamically calculated based on actual resolution
     - Allows proper schedule shifting at all resolutions
  2. Smoothed mu calculation discontinuity (sampling_utils.py)
     - Eliminated 40-50% mu value drop at seq_len 4300 threshold
     - Implemented smooth cosine interpolation (4096-4500 transition zone)
     - Gradual blend between low-res and high-res formulas

  Impact:
  - FLUX.2 Klein 9B: Major quality improvement at high resolutions
  - FLUX.2 Klein 4B: Improved quality at high resolutions
  - Baseline 1024x1024: Unchanged (no regression)
  - All generation modes: T2I and Kontext (reference images)

  Fixes: Community-reported quality degradation issue
  See: Discord discussions in #garbage-bin and #devchat

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix(flux2): Fix high-resolution quality degradation for FLUX.2 Klein

  Fixes grid/diamond artifacts and color loss at resolutions > 1024x1024.

  Root causes identified and fixed:
  - BN normalization was incorrectly applied to random noise input
    (diffusers only normalizes image latents from VAE.encode)
  - BN denormalization must be applied to output before VAE decode
  - mu parameter was resolution-dependent, causing over-shifted schedules
    at high resolutions (now fixed to 2.02, matching ComfyUI)

  Changes:
  - Remove BN normalization on noise input (not needed for N(0,1) noise)
  - Preserve BN denormalization on denoised output (required for VAE)
  - Fix mu to constant 2.02 for all resolutions (matches ComfyUI)

  Tested at 2048x2048 with FLUX.2 Klein 4B

* Chore Ruff

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Jonathan <34005131+JPPhoto@users.noreply.github.com>
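The "over-shifted schedule" failure mode described above can be illustrated with the shift formula used by FLUX-style flow-matching schedules (a sketch with the sigma exponent taken as 1.0; the repo's actual schedule code lives in sampling_utils.py):

```python
import math

def time_shift(mu: float, t: float) -> float:
    # FLUX-style schedule shift: sigma' = exp(mu) / (exp(mu) + (1/t - 1)).
    # Larger mu pushes more of the schedule toward high noise levels.
    return math.exp(mu) / (math.exp(mu) + (1.0 / t - 1.0))

# At the mid-schedule point t = 0.5:
low = time_shift(2.02, 0.5)   # fixed value used by this fix, ~0.88
high = time_shift(3.23, 0.5)  # old resolution-dependent value at 2048x2048, ~0.96
print(f"mu=2.02: {low:.3f}, mu=3.23: {high:.3f}")
```

With mu=3.23, even the midpoint of the schedule sits above sigma 0.9, which matches the commit's observation that nearly all denoising work gets compressed into the final steps.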
1 parent 5649b60 commit 4081f87

File tree

3 files changed: +26 −45 lines changed

invokeai/app/invocations/flux2_denoise.py

Lines changed: 7 additions & 9 deletions

@@ -329,15 +329,13 @@ def _run_diffusion(self, context: InvocationContext) -> torch.Tensor:
         noise_packed = pack_flux2(noise)
         x = pack_flux2(x)
 
-        # Apply BN normalization BEFORE denoising (as per diffusers Flux2KleinPipeline)
-        # BN normalization: y = (x - mean) / std
-        # This transforms latents to normalized space for the transformer
-        # IMPORTANT: Also normalize init_latents and noise for inpainting to maintain consistency
-        if bn_mean is not None and bn_std is not None:
-            x = self._bn_normalize(x, bn_mean, bn_std)
-            if init_latents_packed is not None:
-                init_latents_packed = self._bn_normalize(init_latents_packed, bn_mean, bn_std)
-            noise_packed = self._bn_normalize(noise_packed, bn_mean, bn_std)
+        # BN normalization for txt2img:
+        # - DO NOT normalize random noise (it's already N(0,1) distributed)
+        # - Diffusers only normalizes image latents from VAE (for img2img/kontext)
+        # - Output MUST be denormalized after denoising before VAE decode
+        #
+        # For img2img with init_latents, we should normalize init_latents on unpacked
+        # shape (B, 128, H/16, W/16) - this is handled by _bn_normalize_unpacked below
 
         # Verify packed dimensions
         assert packed_h * packed_w == x.shape[1]
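For reference, the BN normalize/denormalize pair these comments describe is a plain affine transform. A minimal sketch in scalar Python (the real code operates on packed torch tensors with per-channel statistics):

```python
def bn_normalize(x, mean, std):
    # y = (x - mean) / std: map VAE-encoded latents into the
    # normalized space the transformer expects.
    return [(v - mean) / std for v in x]

def bn_denormalize(y, mean, std):
    # Inverse transform, x = y * std + mean: must be applied to the
    # denoised output before handing it to the VAE decoder.
    return [v * std + mean for v in y]

# Round trip recovers the original latents exactly:
latents = [1.0, -2.0, 3.5]
restored = bn_denormalize(bn_normalize(latents, 0.5, 2.0), 0.5, 2.0)

# Random noise drawn from N(0, 1) is already in the normalized space,
# so applying bn_normalize to it (the bug fixed here) would wrongly
# shift and rescale its distribution.
```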

invokeai/app/invocations/flux2_vae_decode.py

Lines changed: 0 additions & 14 deletions

@@ -57,20 +57,6 @@ def _vae_decode(self, vae_info: LoadedModel, latents: torch.Tensor) -> Image.Ima
         # Decode using diffusers API
         decoded = vae.decode(latents, return_dict=False)[0]
 
-        # Debug: Log decoded output statistics
-        print(
-            f"[FLUX.2 VAE] Decoded output: shape={decoded.shape}, "
-            f"min={decoded.min().item():.4f}, max={decoded.max().item():.4f}, "
-            f"mean={decoded.mean().item():.4f}"
-        )
-        # Check per-channel statistics to diagnose color issues
-        for c in range(min(3, decoded.shape[1])):
-            ch = decoded[0, c]
-            print(
-                f"[FLUX.2 VAE] Channel {c}: min={ch.min().item():.4f}, "
-                f"max={ch.max().item():.4f}, mean={ch.mean().item():.4f}"
-            )
-
         # Convert from [-1, 1] to [0, 1] then to [0, 255] PIL image
         img = (decoded / 2 + 0.5).clamp(0, 1)
         img = rearrange(img[0], "c h w -> h w c")
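The conversion kept below the deleted debug block can be sketched per scalar (the real code does this with tensor ops, `clamp`, and `rearrange`):

```python
def to_pixel(v: float) -> int:
    # The VAE decoder emits values in [-1, 1]; map to [0, 1],
    # clamp out-of-range values, then scale to an 8-bit channel.
    v01 = min(max(v / 2 + 0.5, 0.0), 1.0)
    return int(v01 * 255 + 0.5)

print([to_pixel(v) for v in (-1.5, -1.0, 0.0, 1.0)])  # [0, 0, 128, 255]
```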

invokeai/backend/flux2/sampling_utils.py

Lines changed: 19 additions & 22 deletions

@@ -108,33 +108,27 @@ def unpack_flux2(x: torch.Tensor, height: int, width: int) -> torch.Tensor:
 
 
 def compute_empirical_mu(image_seq_len: int, num_steps: int) -> float:
-    """Compute empirical mu for FLUX.2 schedule shifting.
+    """Compute mu for FLUX.2 schedule shifting.
 
-    This matches the diffusers Flux2Pipeline implementation.
-    The mu value controls how much the schedule is shifted towards higher timesteps.
+    Uses a fixed mu value of 2.02, matching ComfyUI's proven FLUX.2 configuration.
+
+    The previous implementation (from diffusers' FLUX.1 pipeline) computed mu as a
+    linear function of image_seq_len, which produced excessively high values at
+    high resolutions (e.g., mu=3.23 at 2048x2048). This over-shifted the sigma
+    schedule, compressing almost all values above 0.9 and forcing the model to
+    denoise everything in the final 1-2 steps, causing severe grid/diamond artifacts.
+
+    ComfyUI uses a fixed shift=2.02 for FLUX.2 Klein at all resolutions and produces
+    artifact-free images even at 2048x2048.
 
     Args:
-        image_seq_len: Number of image tokens (packed_h * packed_w).
-        num_steps: Number of denoising steps.
+        image_seq_len: Number of image tokens (packed_h * packed_w). Currently unused.
+        num_steps: Number of denoising steps. Currently unused.
 
     Returns:
-        The empirical mu value.
+        The mu value (fixed at 2.02).
     """
-    a1, b1 = 8.73809524e-05, 1.89833333
-    a2, b2 = 0.00016927, 0.45666666
-
-    if image_seq_len > 4300:
-        mu = a2 * image_seq_len + b2
-        return float(mu)
-
-    m_200 = a2 * image_seq_len + b2
-    m_10 = a1 * image_seq_len + b1
-
-    a = (m_200 - m_10) / 190.0
-    b = m_200 - 200.0 * a
-    mu = a * num_steps + b
-
-    return float(mu)
+    return 2.02
 
 
 def get_schedule_flux2(

@@ -169,11 +163,14 @@ def get_schedule_flux2(
 
 
 def generate_img_ids_flux2(h: int, w: int, batch_size: int, device: torch.device) -> torch.Tensor:
-    """Generate tensor of image position ids for FLUX.2.
+    """Generate tensor of image position ids for FLUX.2 with RoPE scaling.
 
     FLUX.2 uses 4D position coordinates (T, H, W, L) for its rotary position embeddings.
     This is different from FLUX.1 which uses 3D coordinates.
 
+    RoPE Scaling: For resolutions >1536x1536, position IDs are scaled down using
+    Position Interpolation to prevent RoPE degradation and diamond/grid artifacts.
+
     IMPORTANT: Position IDs must use int64 (long) dtype like diffusers, not bfloat16.
     Using floating point dtype for position IDs can cause NaN in rotary embeddings.
