
2022 | Original Paper | Book Chapter

Non-Uniform Attention Network for Multi-modal Sentiment Analysis

Authors: Binqiang Wang, Gang Dong, Yaqian Zhao, Rengang Li, Qichun Cao, Yinyin Chao

Published in: MultiMedia Modeling

Publisher: Springer International Publishing


Abstract

Remarkable success has been achieved in the multi-modal sentiment analysis community thanks to the availability of annotated multi-modal data sets. However, features drawn from three different modalities (text, audio, and vision) create significant barriers to effective feature fusion. In this paper, we introduce "NUAN", a non-uniform attention network for multi-modal feature fusion. NUAN is built on an attention mechanism that considers the three modalities simultaneously but not uniformly: the text is treated as a determinate representation, and by attending to the acoustic and visual representations we inject their effective information into a solid representation, termed the tripartite interaction representation. A novel non-uniform attention module (NUAM) is inserted between adjacent time steps of an LSTM (Long Short-Term Memory) network and processes information recurrently. The final outputs of the LSTM and the NUAM are concatenated into a vector, which is fed into a linear embedding layer to produce the sentiment analysis result. Experimental analysis on two data sets demonstrates the effectiveness of the proposed method.
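The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of how the described wiring might look: an LSTM cell runs over the text, the text hidden state queries the concatenated acoustic and visual features, the resulting tripartite interaction representation is carried into the next time step, and the final LSTM and attention outputs are concatenated and passed through a linear layer. All class names, dimensions, and the exact attention form are assumptions made for illustration, not the authors' implementation.

# Minimal sketch of the fusion idea described in the abstract (assumed design,
# not the authors' released code). Module names and dimensions are hypothetical.
import torch
import torch.nn as nn

class NonUniformAttention(nn.Module):
    # The text hidden state acts as the sole query; acoustic and visual features
    # supply keys/values whose weighted sum is injected back into the text state.
    def __init__(self, text_dim, audio_dim, visual_dim, hidden_dim):
        super().__init__()
        self.q = nn.Linear(text_dim, hidden_dim)
        self.k_a = nn.Linear(audio_dim, hidden_dim)
        self.k_v = nn.Linear(visual_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, text_dim)

    def forward(self, h_text, x_audio, x_visual):
        # h_text: (batch, text_dim); x_audio: (batch, len_a, audio_dim); x_visual: (batch, len_v, visual_dim)
        q = self.q(h_text).unsqueeze(1)                                  # (batch, 1, hidden)
        kv = torch.cat([self.k_a(x_audio), self.k_v(x_visual)], dim=1)  # (batch, len_a+len_v, hidden)
        attn = torch.softmax((q * kv).sum(-1) / kv.size(-1) ** 0.5, dim=-1)
        fused = (attn.unsqueeze(-1) * kv).sum(dim=1)                     # (batch, hidden)
        return h_text + self.out(fused)   # "tripartite interaction representation"

class NUANSketch(nn.Module):
    # An LSTM over the text whose hidden state is refreshed by the attention
    # module between adjacent time steps, as the abstract describes.
    def __init__(self, text_dim, audio_dim, visual_dim, hidden_dim, num_outputs=1):
        super().__init__()
        self.cell = nn.LSTMCell(text_dim, hidden_dim)
        self.nuam = NonUniformAttention(hidden_dim, audio_dim, visual_dim, hidden_dim)
        self.fc = nn.Linear(2 * hidden_dim, num_outputs)  # concatenated LSTM and NUAM outputs

    def forward(self, text_seq, audio, visual):
        # text_seq: (batch, steps, text_dim)
        batch, steps, _ = text_seq.shape
        h = text_seq.new_zeros(batch, self.cell.hidden_size)
        c = text_seq.new_zeros(batch, self.cell.hidden_size)
        for t in range(steps):
            h_lstm, c = self.cell(text_seq[:, t], (h, c))
            fused = self.nuam(h_lstm, audio, visual)  # attention inserted between adjacent steps
            h = fused                                 # enriched state feeds the next step
        return self.fc(torch.cat([h_lstm, fused], dim=-1))

# Toy forward pass with arbitrary batch, sequence, and feature sizes:
model = NUANSketch(text_dim=300, audio_dim=74, visual_dim=47, hidden_dim=128)
score = model(torch.randn(8, 20, 300), torch.randn(8, 20, 74), torch.randn(8, 20, 47))
print(score.shape)  # torch.Size([8, 1])

Treating the text state as the only query is what makes the attention non-uniform in this sketch: the acoustic and visual streams are attended over but never query each other or the text directly.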


Metadata
Title
Non-Uniform Attention Network for Multi-modal Sentiment Analysis
Authors
Binqiang Wang
Gang Dong
Yaqian Zhao
Rengang Li
Qichun Cao
Yinyin Chao
Copyright Year
2022
DOI
https://doi.org/10.1007/978-3-030-98358-1_48
