Adding Conditional Control to Text-to-Image Diffusion Models
See the full version of this blog on Notion
Introduction
- Motivation
  - Text-to-image models offer limited control over the spatial composition of generated images; text prompts alone struggle to express complex layouts, poses, shapes, and forms.
  - We want to enable finer-grained spatial control by letting users provide additional images that directly specify their desired image composition.
- Challenge
  - Training data for a specific condition may be significantly smaller than the general dataset used to train the base model.
  - Fine-tuning on such limited data may cause overfitting or catastrophic forgetting.
  - Deeper or more customized neural architectures might be necessary to handle in-the-wild conditioning images with complex shapes and diverse high-level semantics.