EGGen: Image Generation with Multi-entity Prior Learning through Entity Guidance

Files

hdl_148204.pdf (33.65 MB)
  (Published version)

Date

2024

Authors

Sun, Z.
Wang, J.
Tan, Z.
Dong, D.
Ma, H.
Li, H.
Gong, D.

Type

Conference paper

Citation

Proceedings of the 32nd ACM International Conference on Multimedia (MM'24), 2024, pp.6637-6645

Statement of Responsibility

Zhenhong Sun, Junyan Wang, Zhiyu Tan, Daoyi Dong∗, Hailan Ma, Hao Li∗, Dong Gong

Conference Name

32nd ACM International Conference on Multimedia (MM) (28 Oct 2024 - 1 Nov 2024 : Melbourne, VIC, Australia)

Abstract

Diffusion models have shown remarkable prowess in text-to-image synthesis and editing, yet they often stumble when tasked with interpreting complex prompts that describe multiple entities with specific attributes and interrelations. The generated images often exhibit inconsistent multi-entity representation (IMR), reflected in inaccurate renderings of the multiple entities and their attributes. Although providing spatial layout guidance improves multi-entity generation quality in existing works, it remains challenging to handle attribute leakage and to avoid unnatural characteristics. To address the IMR challenge, we first conduct in-depth analyses of the diffusion process and attention operation, revealing that the IMR challenges largely stem from the cross-attention mechanism. Guided by these analyses, we introduce an entity guidance generation mechanism, which maintains the integrity of the original diffusion model parameters by integrating plugin networks. Our work advances the stable diffusion model by segmenting comprehensive prompts into distinct entity-specific prompts with bounding boxes, enabling a transition from multi-entity to single-entity generation in cross-attention layers. More importantly, we introduce entity-centric cross-attention layers that focus on individual entities to preserve their uniqueness and accuracy, alongside global entity alignment layers that refine cross-attention maps using multi-entity priors for precise positioning and attribute accuracy. Additionally, a linear attenuation module is integrated to progressively reduce the influence of these layers during inference, preventing oversaturation and preserving generation fidelity. Our comprehensive experiments demonstrate that this entity guidance generation enhances existing text-to-image models in generating detailed, multi-entity images. Code is available at https://github.com/chaos-sun/eggen.git.
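The abstract's two key ingredients — entity-centric cross-attention confined by bounding boxes, and a linear attenuation schedule over inference steps — can be sketched roughly as follows. This is a minimal illustration under assumptions, not the paper's implementation: the function names, the output-gating formulation of the spatial mask, and the exact decay schedule are all hypothetical; see the linked repository for the actual code.

```python
import numpy as np

def box_mask(h: int, w: int, box) -> np.ndarray:
    """Binary spatial mask for one entity's bounding box, given as
    (x0, y0, x1, y1) in normalized [0, 1] coordinates; flattened to
    align with the h*w image-latent positions used as attention queries."""
    x0, y0, x1, y1 = box
    m = np.zeros((h, w))
    m[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return m.reshape(-1)

def entity_cross_attention(q, k, v, spatial_mask):
    """Cross-attention for a single entity-specific prompt.
    q: (HW, d) image-latent queries; k, v: (T, d) keys/values from the
    entity's prompt tokens. The entity's contribution is gated to its
    bounding box, so its attributes cannot leak outside it
    (a hypothetical gating rule, assumed for illustration)."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ v                       # (HW, d) per-entity features
    return out * spatial_mask[:, None]   # zero outside the entity's box

def attenuation_weight(step: int, total_steps: int, w0: float = 1.0) -> float:
    """Linear attenuation: the plugin layers' influence decays from w0
    to 0 across the denoising steps, which is one simple reading of the
    abstract's 'linear attenuation module'."""
    return w0 * max(0.0, 1.0 - step / total_steps)
```

In use, each entity's gated output would be scaled by `attenuation_weight(step, total_steps)` and added to the frozen base model's cross-attention output, so early steps place entities while late steps revert to the unmodified model to preserve fidelity.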

Rights

© 2024 Copyright held by the owner/author(s). This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
