Published: 14.11.2022

TEDT: Transformer-Based Encoding–Decoding Translation Network for Multimodal Sentiment Analysis

Authors: Fan Wang, Shengwei Tian, Long Yu, Jing Liu, Junwen Wang, Kun Li, Yongtao Wang

Published in: Cognitive Computation | Issue 1/2023

Abstract

Multimodal sentiment analysis is a popular and challenging research topic in natural language processing, but the individual modalities in a video can affect the sentiment analysis result to different degrees. Along the temporal dimension, the sentiment expressed in natural language is influenced by nonnatural language cues, which may enhance or weaken the sentiment of the current utterance. In addition, nonnatural language features are generally of poor quality, which fundamentally limits the effectiveness of multimodal fusion. To address these issues, we propose a transformer-based multimodal encoding–decoding translation network that adopts a joint encoding–decoding scheme with text as the primary information and sound and image as the secondary information. To reduce the negative impact of nonnatural language data on natural language data, we propose a modality reinforcement cross-attention module that converts nonnatural language features into natural language features, improving their quality and enabling better integration of the multimodal features. Moreover, a dynamic filtering mechanism removes erroneous information generated during cross-modal interaction to further improve the final output. We evaluated the proposed method on two multimodal sentiment analysis benchmark datasets (MOSI and MOSEI), achieving accuracies of 89.3% and 85.9%, respectively, and outperforming current state-of-the-art methods. Our model substantially improves the effect of multimodal fusion and analyzes human sentiment more accurately.
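
To illustrate the two mechanisms described in the abstract, below is a minimal PyTorch sketch of a text-queried cross-attention block that re-expresses a nonnatural language (audio or visual) sequence in the text feature space, followed by a learned gate that filters the cross-modal output before fusion. The module names, feature dimensions, and the sigmoid gating formulation are illustrative assumptions, not the authors' exact TEDT design.

```python
# Sketch of (1) modality-reinforcement cross-attention and (2) a dynamic
# filtering gate, as assumed from the abstract. All hyperparameters and
# layer choices here are hypothetical.
import torch
import torch.nn as nn


class ModalityReinforcementAttention(nn.Module):
    """Text queries attend over an audio/visual sequence, producing a
    text-aligned version of the nonnatural language features."""

    def __init__(self, d_text: int, d_other: int, n_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(d_other, d_text)   # map the other modality into the text space
        self.attn = nn.MultiheadAttention(d_text, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_text)

    def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        other = self.proj(other)                  # (B, T_other, d_text)
        reinforced, _ = self.attn(query=text, key=other, value=other)
        return self.norm(text + reinforced)       # residual keeps text as the primary signal


class DynamicFilterGate(nn.Module):
    """Element-wise gate that suppresses unreliable cross-modal features
    before they are fused with the original text representation."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, text: torch.Tensor, cross: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([text, cross], dim=-1))  # per-dimension reliability in [0, 1]
        return text + g * cross                          # filtered cross-modal contribution


if __name__ == "__main__":
    B, T_text, T_audio = 2, 20, 50
    text = torch.randn(B, T_text, 128)            # e.g., BERT token features
    audio = torch.randn(B, T_audio, 74)           # e.g., COVAREP acoustic features
    mra = ModalityReinforcementAttention(d_text=128, d_other=74)
    gate = DynamicFilterGate(d_model=128)
    fused = gate(text, mra(text, audio))
    print(fused.shape)                            # torch.Size([2, 20, 128])
```

The same two modules would be applied to the visual stream, and the filtered outputs fed to the downstream sentiment predictor; in this sketch the gate simply learns how much of each cross-modal dimension to admit.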

Metadata
Title
TEDT: Transformer-Based Encoding–Decoding Translation Network for Multimodal Sentiment Analysis
Authors
Fan Wang
Shengwei Tian
Long Yu
Jing Liu
Junwen Wang
Kun Li
Yongtao Wang
Publication date
14.11.2022
Publisher
Springer US
Published in
Cognitive Computation / Issue 1/2023
Print ISSN: 1866-9956
Electronic ISSN: 1866-9964
DOI
https://doi.org/10.1007/s12559-022-10073-9