
02.04.2024

MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis

Authors: Jianbin Zheng, Daqing Liu, Chaoyue Wang, Minghui Hu, Zuopeng Yang, Changxing Ding, Dacheng Tao

Published in: International Journal of Computer Vision


Abstract

Existing multimodal conditional image synthesis (MCIS) methods generate images conditioned on combinations of various modalities, but they require all of the conditions to be conformed to exactly, which hinders synthesis controllability and leaves the potential of cross-modality complementarity under-exploited. To this end, we propose to generate images conditioned on compositions of multimodal control signals in which the modalities are imperfectly complementary, i.e., composed multimodal conditional image synthesis (CMCIS). Specifically, we observe two challenging issues in the proposed CMCIS task: the modality coordination problem and the modality imbalance problem. To tackle these issues, we introduce a Mixture-of-Modality-Tokens Transformer (MMoT) that adaptively fuses fine-grained multimodal control signals, a multimodal balanced training loss that stabilizes the optimization of each modality, and a multimodal sampling guidance that balances the strength of each modality's control signal. Comprehensive experimental results demonstrate that MMoT achieves superior performance on both unimodal conditional image synthesis and MCIS tasks, producing high-quality and faithful images under complex multimodal conditions. The project website is available at https://jabir-zheng.github.io/MMoT.
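The abstract names three components (the Mixture-of-Modality-Tokens fusion, the balanced training loss, and the multimodal sampling guidance) without giving their formulas. As a rough, hedged illustration of the last component only, the sketch below shows one way a multimodal sampling guidance could balance the strength of several control signals over a token distribution, in the spirit of classifier-free guidance extended to multiple conditions. The function name, tensor shapes, and guidance weights are illustrative assumptions and are not taken from the paper.

```python
import torch

def multimodal_sampling_guidance(logits_uncond, logits_per_modality, weights):
    """Hypothetical sketch: combine unconditional logits with per-modality
    conditional logits, scaling each modality's contribution by its weight.

    logits_uncond:       (B, V) logits predicted with all conditions dropped.
    logits_per_modality: dict mapping a modality name to (B, V) logits
                         conditioned on that modality alone.
    weights:             dict mapping a modality name to its guidance scale.
    """
    guided = logits_uncond.clone()
    for name, logits_cond in logits_per_modality.items():
        # Push the distribution toward each modality's conditional prediction;
        # a larger weight strengthens that modality's control signal.
        guided = guided + weights[name] * (logits_cond - logits_uncond)
    return guided

# Hypothetical usage: emphasize the text condition over sketch and segmentation.
batch, vocab = 4, 1024
logits_uncond = torch.randn(batch, vocab)
logits_per_modality = {
    "text": torch.randn(batch, vocab),
    "sketch": torch.randn(batch, vocab),
    "segmentation": torch.randn(batch, vocab),
}
weights = {"text": 3.0, "sketch": 1.5, "segmentation": 1.0}
guided = multimodal_sampling_guidance(logits_uncond, logits_per_modality, weights)
probs = torch.softmax(guided, dim=-1)  # sampling distribution over image tokens
```

Under this reading, raising a modality's weight strengthens its influence on the sampled image tokens, which is the balancing behavior the abstract describes.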


Metadata
Title
MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis
Authors
Jianbin Zheng
Daqing Liu
Chaoyue Wang
Minghui Hu
Zuopeng Yang
Changxing Ding
Dacheng Tao
Publication date
02.04.2024
Publisher
Springer US
Published in
International Journal of Computer Vision
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-024-02044-4
