DOI: 10.1145/3123266.3123413
Research article · Public Access

Temporally Selective Attention Model for Social and Affective State Recognition in Multimedia Content

Published: 23 October 2017

ABSTRACT

The sheer volume of human-centric multimedia content has spurred research on human behavior understanding. Most existing methods model behavioral sequences without considering temporal saliency. This work is motivated by the psychological observation that temporally selective attention enables the human perceptual system to process the most relevant information. In this paper, we introduce a new approach, the Temporally Selective Attention Model (TSAM), designed to selectively attend to salient parts of human-centric video sequences. TSAM learns to recognize affective and social states using a new loss function called the speaker-distribution loss. Extensive experiments show that our model achieves state-of-the-art performance on rapport detection and multimodal sentiment analysis. We also show that the speaker-distribution loss generalizes to other computational models, improving the prediction performance of a deep averaging network and a Long Short-Term Memory (LSTM) network.
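The paper's full formulation is not reproduced on this page, but the core idea named in the abstract, temporally selective attention, amounts to learning a salience score per time step and pooling the sequence by the resulting weights. The sketch below illustrates that mechanism only; the function and parameter names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def temporal_attention_pool(features, w, b=0.0):
    """Pool a (T, D) sequence of per-frame features into one (D,) vector.

    Each time step gets a scalar salience score; a softmax over time turns
    the scores into attention weights; the output is the attention-weighted
    average of the frames, so salient frames dominate the summary.
    """
    scores = features @ w + b      # (T,) salience score per frame
    alpha = softmax(scores)        # (T,) attention weights, sum to 1
    return alpha @ features        # (D,) attention-weighted summary


# Toy example: a clip of 5 frames with 3-D features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 3))
w = rng.normal(size=3)
pooled = temporal_attention_pool(feats, w)
```

In a full model, the pooled vector would feed a classifier for the affective or social state, and `w` would be trained jointly with the rest of the network.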


Published in

MM '17: Proceedings of the 25th ACM International Conference on Multimedia
October 2017, 2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266
Copyright © 2017 ACM
Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

MM '17 paper acceptance rate: 189 of 684 submissions (28%). Overall acceptance rate: 995 of 4,171 submissions (24%).
