ABSTRACT
The sheer volume of human-centric multimedia content has spurred research on human behavior understanding. Most existing methods model behavioral sequences without considering temporal saliency. This work is motivated by the psychological observation that temporally selective attention enables the human perceptual system to process the most relevant information. In this paper, we introduce a new approach, the Temporally Selective Attention Model (TSAM), designed to selectively attend to salient parts of human-centric video sequences. Our TSAM learns to recognize affective and social states using a new loss function called the speaker-distribution loss. Extensive experiments show that our model achieves state-of-the-art performance on rapport detection and multimodal sentiment analysis. We also show that the speaker-distribution loss generalizes to other computational models, improving the prediction performance of deep averaging networks and Long Short-Term Memory (LSTM) networks.
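The core idea of temporally selective attention — weighting the salient timesteps of a behavioral sequence more heavily than the rest — can be illustrated with a minimal sketch. This is not the paper's TSAM implementation; it is a generic softmax attention pooling over frame-level features, with the scoring vector `w` standing in for learned parameters:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def temporal_attention_pool(features, w):
    """Pool a (T, D) sequence of per-frame features into one D-dim vector.

    features: (T, D) array of frame-level features (e.g., visual or acoustic).
    w:        (D,) scoring vector, a stand-in for learned attention parameters.
    """
    scores = features @ w       # (T,) relevance score for each timestep
    alpha = softmax(scores)     # attention weights over time, summing to 1
    return alpha @ features     # (D,) attention-weighted temporal average

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))   # toy sequence: 8 frames, 4-dim features
w = rng.normal(size=4)
pooled = temporal_attention_pool(feats, w)
print(pooled.shape)  # (4,)
```

Frames whose features score highly against `w` dominate the pooled representation, so downstream classifiers see mostly the salient portion of the sequence rather than a uniform average.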
Index Terms
- Temporally Selective Attention Model for Social and Affective State Recognition in Multimedia Content