Congratulations on your great work!
I have some questions about the architecture of HiDream-E1 and would be grateful if you could provide some insights.
- Why are the latent image and the condition image concatenated along the width dimension before patchify, rather than along the token (sequence) dimension after patchify?
- The position embedding seems to treat the latents and the condition image as a single union image. Why not use separate position embeddings (e.g., as in OminiControl) for the latents and the condition image?
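
To make the two questions concrete, here is a minimal NumPy sketch of what I mean. It is my own toy example and is not taken from the HiDream-E1 or OminiControl code: `patchify`, `grid_ids`, the shapes, and the "separate" id scheme are all hypothetical, and a real implementation may differ (for instance by adding an offset or an extra index for the condition image).

```python
import numpy as np

def patchify(x, p):
    """Split a (C, H, W) array into (H//p * W//p, C*p*p) tokens, row-major over patches."""
    c, h, w = x.shape
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((h // p) * (w // p), c * p * p)

def grid_ids(rows, cols):
    """Return a (rows*cols, 2) array of (row, col) position ids in row-major order."""
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    return np.stack([r.ravel(), c.ravel()], axis=-1)

# Toy shapes (assumed for illustration only).
C, H, W, P = 4, 8, 8, 2
gh, gw = H // P, W // P
rng = np.random.default_rng(0)
latent = rng.standard_normal((C, H, W))
cond = rng.standard_normal((C, H, W))

# Option A: concatenate along width, then patchify the union image.
union_tokens = patchify(np.concatenate([latent, cond], axis=-1), P)
union_ids = grid_ids(gh, 2 * gw)  # col ids span both images

# Option B: patchify each image, then concatenate along the token axis.
seq_tokens = np.concatenate([patchify(latent, P), patchify(cond, P)], axis=0)
# Hypothetical "separate" scheme: each image gets its own (row, col) grid.
sep_ids = np.concatenate([grid_ids(gh, gw), grid_ids(gh, gw)], axis=0)

# Both options hold the same tokens; only their order differs.
r, c = union_ids[:, 0], union_ids[:, 1]
perm = np.where(c < gw, r * gw + c, gh * gw + r * gw + (c - gw))
assert np.array_equal(union_tokens, seq_tokens[perm])
```

If this sketch is right, the token contents are identical in both options and only the ordering and the position ids differ: in the union scheme the condition tokens get column ids shifted by `gw`, while in the separate scheme they reuse the same `(row, col)` ids as the latent tokens. So my questions are mainly about why the union-style position ids were chosen.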