
2022 | OriginalPaper | Chapter

Non-Uniform Attention Network for Multi-modal Sentiment Analysis

Authors: Binqiang Wang, Gang Dong, Yaqian Zhao, Rengang Li, Qichun Cao, Yinyin Chao

Published in: MultiMedia Modeling

Publisher: Springer International Publishing


Abstract

Remarkable success has been achieved in the multi-modal sentiment analysis community thanks to the availability of annotated multi-modal data sets. However, the fact that the data come from three different modalities, text, sound, and vision, creates significant barriers to effective feature fusion. In this paper, we introduce "NUAN", a non-uniform attention network for multi-modal feature fusion. NUAN is designed around an attention mechanism that considers the three modalities simultaneously, but not uniformly: the text is treated as a determinate representation, and the acoustic and visual representations are leveraged to inject effective information into a solid representation, termed the tripartite interaction representation. A novel non-uniform attention module (NUAM) is inserted between adjacent time steps of an LSTM (Long Short-Term Memory) network and processes information recurrently. The final outputs of the LSTM and the NUAM are concatenated into a vector, which is fed into a linear embedding layer to produce the sentiment analysis result. Experimental analysis on two datasets demonstrates the effectiveness of the proposed method.
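The abstract only outlines the architecture, so the sketch below is a rough, assumption-laden illustration of the fusion idea rather than the paper's actual NUAM: the text representation drives the query, acoustic and visual features are attended to through separate (hence non-uniform) attention layers, and the fused vector is passed to a linear head. All layer names, feature dimensions (CMU-MOSI-like), and the use of PyTorch's MultiheadAttention are assumptions introduced here; in the paper the module is inserted between adjacent LSTM time steps and applied recurrently, whereas this sketch simplifies to a single attention step over the final LSTM state.

```python
# Minimal PyTorch sketch of the non-uniform fusion idea described in the abstract.
# All names, dimensions, and the exact attention formulation are assumptions for
# illustration only; the paper's recurrent NUAM is not reproduced here.
import torch
import torch.nn as nn


class NonUniformFusion(nn.Module):
    """Text is treated as the determinate (query) representation; acoustic and
    visual features are attended to separately and injected into it."""

    def __init__(self, d_text=300, d_audio=74, d_visual=35, d_hidden=128):
        super().__init__()
        self.text_lstm = nn.LSTM(d_text, d_hidden, batch_first=True)
        # Project each auxiliary modality into the shared hidden space (assumed).
        self.audio_proj = nn.Linear(d_audio, d_hidden)
        self.visual_proj = nn.Linear(d_visual, d_hidden)
        # Separate attention weights per modality -> "non-uniform" treatment.
        self.audio_attn = nn.MultiheadAttention(d_hidden, num_heads=4, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_hidden, num_heads=4, batch_first=True)
        # Regression head on [text state ; acoustic context ; visual context].
        self.out = nn.Linear(3 * d_hidden, 1)

    def forward(self, text, audio, visual):
        # text: (B, T, d_text), audio: (B, Ta, d_audio), visual: (B, Tv, d_visual)
        h_text, _ = self.text_lstm(text)              # (B, T, d_hidden)
        q = h_text[:, -1:, :]                         # last text state as query
        a = self.audio_proj(audio)
        v = self.visual_proj(visual)
        a_ctx, _ = self.audio_attn(q, a, a)           # inject acoustic information
        v_ctx, _ = self.visual_attn(q, v, v)          # inject visual information
        fused = torch.cat([q, a_ctx, v_ctx], dim=-1).squeeze(1)
        return self.out(fused)                        # sentiment score


# Example usage with random features (CMU-MOSI-like dimensions assumed).
model = NonUniformFusion()
score = model(torch.randn(2, 20, 300), torch.randn(2, 20, 74), torch.randn(2, 20, 35))
print(score.shape)  # torch.Size([2, 1])
```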


Metadata
Title: Non-Uniform Attention Network for Multi-modal Sentiment Analysis
Authors: Binqiang Wang, Gang Dong, Yaqian Zhao, Rengang Li, Qichun Cao, Yinyin Chao
Copyright Year: 2022
DOI: https://doi.org/10.1007/978-3-030-98358-1_48
