EGGen: Image Generation with Multi-entity Prior Learning through Entity Guidance
Date
2024
Authors
Sun, Z.
Wang, J.
Tan, Z.
Dong, D.
Ma, H.
Li, H.
Gong, D.
Type:
Conference paper
Citation
Proceedings of the 32nd ACM International Conference on Multimedia (MM'24), 2024, pp. 6637-6645
Statement of Responsibility
Zhenhong Sun, Junyan Wang, Zhiyu Tan, Daoyi Dong∗, Hailan Ma, Hao Li∗, Dong Gong
Conference Name
32nd ACM International Conference on Multimedia (MM) (28 Oct 2024 - 1 Nov 2024 : Melbourne, VIC, Australia)
Abstract
Diffusion models have shown remarkable prowess in text-to-image synthesis and editing, yet they often stumble when interpreting complex prompts that describe multiple entities with specific attributes and interrelations. The generated images often exhibit inconsistent multi-entity representation (IMR), reflected as inaccurate renderings of the multiple entities and their attributes. Although providing spatial layout guidance improves multi-entity generation quality in existing works, it remains challenging to prevent attribute leakage and avoid unnatural characteristics. To address the IMR challenge, we first conduct in-depth analyses of the diffusion process and attention operation, revealing that IMR issues largely stem from the cross-attention mechanism. Guided by these analyses, we introduce an entity guidance generation mechanism, which preserves the integrity of the original diffusion model parameters by integrating plugin networks. Our work advances the stable diffusion model by segmenting comprehensive prompts into distinct entity-specific prompts with bounding boxes, enabling a transition from multi-entity to single-entity generation in cross-attention layers. More importantly, we introduce entity-centric cross-attention layers that focus on individual entities to preserve their uniqueness and accuracy, alongside global entity alignment layers that refine cross-attention maps using multi-entity priors for precise positioning and attribute accuracy. Additionally, a linear attenuation module progressively reduces the influence of these layers during inference, preventing oversaturation and preserving generation fidelity. Comprehensive experiments demonstrate that this entity guidance generation enhances existing text-to-image models in generating detailed, multi-entity images. Code is available at https://github.com/chaos-sun/eggen.git.
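The abstract's two key ideas, restricting each entity's cross-attention to its bounding box and linearly attenuating that guidance over inference steps, can be illustrated with a minimal sketch. All function names, the blending scheme, and the schedule below are illustrative assumptions, not the paper's actual implementation; see the linked repository for the real code.

```python
import numpy as np

def linear_attenuation(step, total_steps):
    """Assumed schedule: guidance weight decays linearly from 1 to 0
    as denoising progresses, to avoid oversaturation late in sampling."""
    return max(0.0, 1.0 - step / total_steps)

def entity_mask(h, w, bbox):
    """Binary mask confining an entity's attention to its bounding
    box (x0, y0, x1, y1) on an h-by-w attention map."""
    m = np.zeros((h, w))
    x0, y0, x1, y1 = bbox
    m[y0:y1, x0:x1] = 1.0
    return m

def entity_guided_attention(attn_map, bbox, step, total_steps):
    """Blend the raw cross-attention map with its box-masked version;
    early steps follow the mask strongly, later steps revert to the
    unmodified map as the attenuation weight shrinks."""
    weight = linear_attenuation(step, total_steps)
    masked = attn_map * entity_mask(*attn_map.shape, bbox)
    return (1.0 - weight) * attn_map + weight * masked
```

At step 0 the output is fully box-masked (attention outside the bbox is zeroed); by the final step the guidance has faded and the original attention map is returned unchanged.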
Rights
© 2024 Copyright held by the owner/author(s). This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.