Adding Conditional Control to Text-to-Image Diffusion Models

See full version of this blog on Notion

Introduction

  • Motivation
      • Text-to-image models offer limited control over the spatial composition of the generated image; it is difficult for a text prompt alone to express complex layouts, poses, shapes, and forms.
      • The goal is finer-grained spatial control: let users provide additional images (e.g., edge maps, depth maps, or human-pose skeletons) that directly specify their desired image composition.

  • Challenge
      • The training data available for a specific condition may be significantly smaller than the general dataset used to train the base text-to-image model.
      • Directly fine-tuning on such limited data may cause overfitting or catastrophic forgetting of the pretrained model's capabilities.
      • Deeper or more customized neural architectures may be necessary to handle in-the-wild conditioning images with complex shapes and diverse high-level semantics; a minimal sketch of the paper's approach follows below.
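To make the last point concrete, here is a minimal PyTorch sketch of the core ControlNet idea rather than the paper's exact architecture: the pretrained block stays frozen, a trainable copy of it processes the extra condition, and zero-initialized 1x1 convolutions connect the two so the wrapped model starts out identical to the pretrained one. The channel-preserving block and a conditioning feature map with the same shape as the input are simplifying assumptions made for illustration.

```python
import copy
import torch
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution whose weight and bias start at zero, so it
    # contributes nothing until training moves it away from zero.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class ControlledBlock(nn.Module):
    # Wraps one pretrained block: the original is frozen, a trainable
    # copy receives the conditioning features, and zero convolutions
    # bridge the two, so at initialization the wrapped block computes
    # exactly what the pretrained block did (no forgetting up front).
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        # Copy first, then freeze the original, so the copy stays trainable.
        self.trainable_copy = copy.deepcopy(pretrained_block)
        self.frozen = pretrained_block
        for p in self.frozen.parameters():
            p.requires_grad_(False)  # lock the pretrained weights
        self.zero_in = zero_conv(channels)   # injects conditioning features
        self.zero_out = zero_conv(channels)  # gates the copy's output back in

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        y = self.frozen(x)
        y_ctrl = self.trainable_copy(x + self.zero_in(cond))
        # Both zero convs output 0 at init, so this reduces to y.
        return y + self.zero_out(y_ctrl)


# Usage: at initialization the output matches the frozen block exactly.
block = ControlledBlock(nn.Conv2d(64, 64, 3, padding=1), channels=64)
x = torch.randn(1, 64, 32, 32)
cond = torch.randn(1, 64, 32, 32)  # assumed pre-encoded conditioning features
out = block(x, cond)
```

Because only the copy and the zero convolutions receive gradients, the design addresses both challenges at once: the small condition-specific dataset never overwrites the pretrained weights, and the trainable branch can be made as deep as the in-the-wild conditioning images require.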