ABSTRACT
The sheer volume of human-centric multimedia content has spurred research on human behavior understanding. Most existing methods model behavioral sequences without considering temporal saliency. This work is motivated by the psychological observation that temporally selective attention enables the human perceptual system to process the most relevant information. In this paper, we introduce a new approach, the Temporally Selective Attention Model (TSAM), designed to selectively attend to salient parts of human-centric video sequences. Our TSAM learns to recognize affective and social states using a new loss function called the speaker-distribution loss. Extensive experiments show that our model achieves state-of-the-art performance on rapport detection and multimodal sentiment analysis. We also show that the speaker-distribution loss generalizes to other computational models, improving the prediction performance of deep averaging networks and Long Short-Term Memory (LSTM) networks.
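The core idea of temporally selective attention — weighting the salient timesteps of a behavioral sequence more heavily than the rest — can be illustrated with a minimal sketch. This is not the paper's TSAM implementation; it is a generic softmax attention pooling over frame-level features, with the scoring vector `w` standing in for learned parameters:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def temporal_attention_pool(features, w):
    """Pool a (T, D) sequence of per-frame features into one D-dim vector.

    features: (T, D) array of frame-level features (e.g., visual or acoustic).
    w:        (D,) scoring vector, a stand-in for learned attention parameters.
    """
    scores = features @ w       # (T,) relevance score for each timestep
    alpha = softmax(scores)     # attention weights over time, summing to 1
    return alpha @ features     # (D,) attention-weighted temporal average

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))   # toy sequence: 8 frames, 4-dim features
w = rng.normal(size=4)
pooled = temporal_attention_pool(feats, w)
print(pooled.shape)  # (4,)
```

Frames whose features score highly against `w` dominate the pooled representation, so downstream classifiers see mostly the salient portion of the sequence rather than a uniform average.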
Index Terms
- Temporally Selective Attention Model for Social and Affective State Recognition in Multimedia Content