
02.04.2024

MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis

Authors: Jianbin Zheng, Daqing Liu, Chaoyue Wang, Minghui Hu, Zuopeng Yang, Changxing Ding, Dacheng Tao

Published in: International Journal of Computer Vision


Abstract

Existing multimodal conditional image synthesis (MCIS) methods generate images conditioned on combinations of various modalities, but they require all of the conditions to be conformed to exactly, which hinders synthesis controllability and leaves the potential of cross-modality complementarity under-exploited. To this end, we propose to generate images conditioned on compositions of multimodal control signals in which the modalities are imperfectly complementary, i.e., composed multimodal conditional image synthesis (CMCIS). Specifically, we observe two challenging issues in the proposed CMCIS task: the modality coordination problem and the modality imbalance problem. To tackle these issues, we introduce a Mixture-of-Modality-Tokens Transformer (MMoT) that adaptively fuses fine-grained multimodal control signals, a multimodal balanced training loss that stabilizes the optimization of each modality, and a multimodal sampling guidance that balances the strength of each modality's control signal. Comprehensive experimental results demonstrate that MMoT achieves superior performance on both unimodal conditional image synthesis and MCIS tasks, producing high-quality and faithful images under complex multimodal conditions. The project website is available at https://jabir-zheng.github.io/MMoT.
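The abstract names three components (the Mixture-of-Modality-Tokens fusion, the balanced training loss, and the multimodal sampling guidance) without giving their formulas. As a rough, hedged illustration of the last component only, the sketch below shows one way a multimodal sampling guidance could balance the strength of several control signals over a token distribution, in the spirit of classifier-free guidance extended to multiple conditions. The function name, tensor shapes, and guidance weights are illustrative assumptions and are not taken from the paper.

```python
import torch

def multimodal_sampling_guidance(logits_uncond, logits_per_modality, weights):
    """Hypothetical sketch: combine unconditional logits with per-modality
    conditional logits, scaling each modality's contribution by its weight.

    logits_uncond:       (B, V) logits predicted with all conditions dropped.
    logits_per_modality: dict mapping a modality name to (B, V) logits
                         conditioned on that modality alone.
    weights:             dict mapping a modality name to its guidance scale.
    """
    guided = logits_uncond.clone()
    for name, logits_cond in logits_per_modality.items():
        # Push the distribution toward each modality's conditional prediction;
        # a larger weight strengthens that modality's control signal.
        guided = guided + weights[name] * (logits_cond - logits_uncond)
    return guided

# Hypothetical usage: emphasize the text condition over sketch and segmentation.
batch, vocab = 4, 1024
logits_uncond = torch.randn(batch, vocab)
logits_per_modality = {
    "text": torch.randn(batch, vocab),
    "sketch": torch.randn(batch, vocab),
    "segmentation": torch.randn(batch, vocab),
}
weights = {"text": 3.0, "sketch": 1.5, "segmentation": 1.0}
guided = multimodal_sampling_guidance(logits_uncond, logits_per_modality, weights)
probs = torch.softmax(guided, dim=-1)  # sampling distribution over image tokens
```

Under this reading, raising a modality's weight strengthens its influence on the sampled image tokens, which is the balancing behavior the abstract describes.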


Metadata
Title
MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis
Authors
Jianbin Zheng
Daqing Liu
Chaoyue Wang
Minghui Hu
Zuopeng Yang
Changxing Ding
Dacheng Tao
Publication date
02.04.2024
Publisher
Springer US
Published in
International Journal of Computer Vision
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-024-02044-4
