2024 | Original Paper | Book Chapter

Are Mixture-of-Modality-Experts Transformers Robust to Missing Modality During Training and Inferring?

Authors: Yan Gao, Tong Xu, Enhong Chen

Published in: Intelligent Information Processing XII

Publisher: Springer Nature Switzerland

Abstract

Imperfect multi-modal data with missing modalities commonly appears in real-world application scenarios, breaking the completeness assumption on which multi-modal analysis usually relies. The multi-modal learning community has therefore devoted considerable effort to robust solutions for modality-missing data. Recently, pre-trained models based on Mixture-of-Modality-Experts (MoME) Transformers have been proposed; by routing single-modal and multi-modal inputs to different feed-forward-network experts, they achieve competitive performance on various downstream tasks. A natural question arises: are Mixture-of-Modality-Experts Transformers robust to missing modality? To answer it, this paper conducts a deep investigation of the MoME Transformer under the missing-modality problem. Specifically, we propose a novel multi-task learning strategy that uses a single uniform model to handle missing modalities during both training and inference, thereby empowering the MoME Transformer with robustness to missing modality. To validate the effectiveness of the proposed method, we conduct extensive experiments on three popular datasets; the results show that our method outperforms state-of-the-art (SOTA) methods by a large margin.
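For readers unfamiliar with the MoME design the abstract refers to, below is a minimal, hypothetical PyTorch sketch of one MoME Transformer block: self-attention is shared across modalities, while vision-only, language-only, and fused inputs are routed to separate feed-forward experts. The class name, dimensions, expert names, and routing rule are illustrative assumptions, not the authors' exact configuration; the usage stub at the bottom only mimics the multi-task idea of feeding complete, image-only, and text-only views through the same uniform model.

    # Hypothetical sketch of a Mixture-of-Modality-Experts (MoME) block.
    # Assumptions: ViT-Base-like width (768), one FFN "expert" per input
    # type, and routing chosen by an explicit `modality` argument.
    import torch
    import torch.nn as nn

    class MoMEBlock(nn.Module):
        def __init__(self, dim: int = 768, heads: int = 12):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            # Self-attention is shared by all modalities.
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            # One feed-forward expert per input type:
            # vision-only, language-only, and fused vision-language.
            self.experts = nn.ModuleDict({
                name: nn.Sequential(
                    nn.Linear(dim, 4 * dim), nn.GELU(),
                    nn.Linear(4 * dim, dim))
                for name in ("vision", "language", "fusion")
            })

        def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            # Route tokens to the expert matching the (possibly
            # incomplete) input.
            return x + self.experts[modality](self.norm2(x))

    if __name__ == "__main__":
        # Multi-task style usage: the same uniform block sees complete,
        # text-missing, and image-missing views of a batch; in training,
        # the per-view task losses would be summed.
        block = MoMEBlock()
        img = torch.randn(2, 196, 768)  # image patch tokens
        txt = torch.randn(2, 32, 768)   # text tokens
        full = block(torch.cat([img, txt], dim=1), "fusion")  # complete pair
        img_only = block(img, "vision")                       # text missing
        txt_only = block(txt, "language")                     # image missing
        print(full.shape, img_only.shape, txt_only.shape)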


Metadata
Title
Are Mixture-of-Modality-Experts Transformers Robust to Missing Modality During Training and Inferring?
Authors
Yan Gao
Tong Xu
Enhong Chen
Copyright Year
2024
DOI
https://doi.org/10.1007/978-3-031-57808-3_12