
2025 | OriginalPaper | Chapter

FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

Authors : Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, Hongsheng Li

Published in: Computer Vision – ECCV 2024

Publisher: Springer Nature Switzerland


Abstract

In this study, we delve into generating high-resolution images from pre-trained diffusion models, addressing the persistent challenges, such as repetitive patterns and structural distortions, that emerge when models are applied beyond their trained resolutions. To address this issue, we introduce FouriScale, an innovative training-free approach grounded in frequency-domain analysis. We replace the original convolutional layers in pre-trained diffusion models with a combination of a dilation technique and a low-pass operation, aiming to achieve structural consistency and scale consistency across resolutions, respectively. Further enhanced by a padding-then-crop strategy, our method can flexibly handle text-to-image generation of various aspect ratios. Using FouriScale as guidance, our method successfully balances the structural integrity and fidelity of generated images, achieving arbitrary-size, high-resolution, and high-quality generation. With its simplicity and compatibility, our method can provide valuable insights for future explorations into the synthesis of ultra-high-resolution images. The source code is available at https://github.com/LeonHLJ/FouriScale.
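To make the two ingredients of the abstract concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): an FFT-based low-pass filter applied to a 2-D feature map, followed by a naive dilated cross-correlation. The cutoff `keep_ratio`, the kernel, and the dilation value are all illustrative assumptions.

```python
import numpy as np

def low_pass_filter(x, keep_ratio=0.5):
    """Zero out high-frequency components of a 2-D map via FFT.
    `keep_ratio` is a hypothetical cutoff, not a value from the paper."""
    H, W = x.shape
    F = np.fft.fftshift(np.fft.fft2(x))        # center the spectrum
    mask = np.zeros_like(F)
    h, w = int(H * keep_ratio / 2), int(W * keep_ratio / 2)
    cy, cx = H // 2, W // 2
    mask[cy - h:cy + h, cx - w:cx + w] = 1.0   # keep only low frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

def dilated_conv2d(x, kernel, dilation=2):
    """Naive 'same'-padded 2-D cross-correlation with a dilated kernel."""
    kH, kW = kernel.shape
    pH, pW = dilation * (kH // 2), dilation * (kW // 2)
    xp = np.pad(x, ((pH, pH), (pW, pW)))
    H, W = x.shape
    out = np.zeros_like(x)
    for i in range(kH):
        for j in range(kW):
            out += kernel[i, j] * xp[i * dilation:i * dilation + H,
                                     j * dilation:j * dilation + W]
    return out

# Toy usage: low-pass a random "feature map", then apply a dilated kernel.
rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 16))
smoothed = low_pass_filter(feat, keep_ratio=0.5)
out = dilated_conv2d(smoothed, np.ones((3, 3)) / 9.0, dilation=2)
```

In the paper, operations of this kind stand in for the original convolutions of the pre-trained U-Net, targeting structural consistency (dilation) and scale consistency (low-pass) across resolutions.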


Footnotes
1
For simplicity, we assume equal down-sampling scales for height and width. Our method can also accommodate different down-sampling scales in this context through our padding-then-crop strategy (Sect. 3.4).
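The footnote's idea can be sketched as a shape computation: pad the target latent up to a size where both axes share one down-sampling scale, and record the box to crop back afterwards. The helper below is hypothetical (the alignment `base` and centering are assumptions, not details from the paper).

```python
def pad_then_crop_shapes(target_h, target_w, base=64):
    """Hypothetical helper: choose a square padded size that is a multiple
    of `base` so height and width share one down-sampling scale, and return
    (padded_size, crop_box) where crop_box = (top, left, bottom, right)."""
    s = max(target_h, target_w)
    s = ((s + base - 1) // base) * base          # round up to a multiple of base
    top = (s - target_h) // 2                    # center the target in the pad
    left = (s - target_w) // 2
    return (s, s), (top, left, top + target_h, left + target_w)

# Toy usage: a 96x128 target is generated inside a 128x128 canvas, then cropped.
pad, crop = pad_then_crop_shapes(96, 128)
```

Generation then runs at the padded (equal-scale) size, and the crop box recovers the requested aspect ratio.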
 
Metadata
Copyright Year
2025
DOI
https://doi.org/10.1007/978-3-031-73254-6_12
