2023 | OriginalPaper | Chapter

Diffusion-Adapter: Text Guided Image Manipulation with Frozen Diffusion Models

Authors: Rongting Wei, Chunxiao Fan, Yuexin Wu

Published in: Artificial Neural Networks and Machine Learning – ICANN 2023

Publisher: Springer Nature Switzerland

Abstract

Research on vision-language models has developed rapidly, enabling natural-language-driven image generation and manipulation. Existing text-driven image manipulation is typically implemented via GAN inversion or by fine-tuning diffusion models. The former is limited by the inversion capability of GANs, which often fails to reconstruct images with novel poses and perspectives. The latter requires expensive optimization for each input, and fine-tuning remains a complex process. To mitigate these problems, we propose a novel approach, dubbed Diffusion-Adapter, which performs text-driven image manipulation using frozen pre-trained diffusion models. In this work, we design an Adapter architecture that modifies the target attributes without fine-tuning the pre-trained models. Our approach can be applied to diffusion models in any domain, and only a few examples are needed to train an Adapter that can successfully edit images from unseen data. Compared with previous work, Diffusion-Adapter preserves a maximal amount of detail from the original image without unintended changes to the input content. Extensive experiments demonstrate the advantages of our approach over competing baselines, marking a novel attempt at text-driven image manipulation with frozen diffusion models.
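The core idea described above, keeping the pre-trained diffusion model frozen and training only a lightweight adapter on a few examples, can be sketched as follows. This is an illustrative NumPy sketch, not the paper's actual architecture: the class names, dimensions, the bottleneck shape, and the additive residual design are assumptions for demonstration, with the frozen network stood in by a fixed linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

class FrozenDenoiser:
    """Stand-in for a pre-trained diffusion denoiser; its weights are never updated."""
    def __init__(self, dim):
        self.W = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, x):
        return x @ self.W

class Adapter:
    """Small trainable bottleneck module applied to the frozen model's features."""
    def __init__(self, dim, bottleneck=4):
        self.down = rng.standard_normal((dim, bottleneck)) / np.sqrt(dim)
        self.up = np.zeros((bottleneck, dim))  # zero-init: adapter starts as identity

    def __call__(self, h):
        # Residual connection: frozen features plus a learned correction.
        return h + np.maximum(h @ self.down, 0.0) @ self.up

dim = 8
frozen = FrozenDenoiser(dim)
adapter = Adapter(dim)

x = rng.standard_normal((2, dim))        # a "few examples" of inputs
target = rng.standard_normal((2, dim))   # e.g. text-conditioned editing targets

W_before = frozen.W.copy()
lr, losses = 0.05, []
for _ in range(200):
    h = frozen(x)                              # frozen forward pass
    z = np.maximum(h @ adapter.down, 0.0)      # ReLU bottleneck activation
    out = h + z @ adapter.up
    losses.append(float(((out - target) ** 2).mean()))
    # Backpropagate through the adapter only; the frozen weights get no gradient.
    grad_out = 2.0 * (out - target) / out.size
    grad_up = z.T @ grad_out
    grad_z = (grad_out @ adapter.up.T) * (z > 0.0)
    grad_down = h.T @ grad_z
    adapter.up -= lr * grad_up
    adapter.down -= lr * grad_down

assert np.array_equal(W_before, frozen.W)  # the diffusion model stayed frozen
```

Because only the small `down`/`up` matrices are optimized, the expensive pre-trained weights are shared across all editing tasks, which is the property the abstract contrasts with per-input fine-tuning.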

Metadata
Title
Diffusion-Adapter: Text Guided Image Manipulation with Frozen Diffusion Models
Authors
Rongting Wei
Chunxiao Fan
Yuexin Wu
Copyright Year
2023
DOI
https://doi.org/10.1007/978-3-031-44210-0_18
