ABSTRACT
Emotion recognition is a core research area at the intersection of artificial intelligence and human communication analysis. It poses a significant technical challenge, since humans display their emotions through complex, idiosyncratic combinations of the language, visual, and acoustic modalities. In contrast to traditional multimodal fusion techniques, we approach emotion recognition from both a direct, person-independent perspective and a relative, person-dependent perspective. The direct person-independent perspective follows the conventional emotion recognition approach, which infers absolute emotion labels directly from observed multimodal features. The relative person-dependent perspective approaches emotion recognition in a relative manner, comparing partial video segments to determine whether emotional intensity increased or decreased. Our proposed model integrates these direct and relative prediction perspectives by dividing the emotion recognition task into three easier subtasks. The first subtask performs a multimodal local ranking of relative emotion intensities between two short segments of a video. The second subtask uses these local rankings to infer global relative emotion ranks with a Bayesian ranking algorithm. The third subtask combines the direct predictions from observed multimodal behaviors with the relative emotion ranks from the local-global rankings for the final emotion prediction. Our approach displays excellent performance on an audio-visual emotion recognition benchmark and improves over other algorithms for multimodal fusion.
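The local-to-global ranking step described above can be illustrated with a minimal sketch. Here we assume an Elo-style rating update as a simpler stand-in for the Bayesian ranking algorithm the abstract refers to; all function names and parameter values below are hypothetical and chosen only to show how pairwise "more/less intense" judgments between segments can be aggregated into global relative ranks.

```python
def elo_update(r_winner, r_loser, k=32.0):
    """Elo-style update: the segment judged more emotionally intense 'wins'.
    (A stand-in for the paper's Bayesian ranking step; not the actual method.)"""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

def global_ranks(num_segments, comparisons):
    """Aggregate local pairwise judgments into global relative ranks.

    comparisons: list of (i, j) pairs, each meaning segment i was judged
    more emotionally intense than segment j by the local ranker.
    """
    ratings = [1000.0] * num_segments
    for i, j in comparisons:
        ratings[i], ratings[j] = elo_update(ratings[i], ratings[j])
    # Rank each segment by its final rating: 0 = least intense.
    order = sorted(range(num_segments), key=lambda s: ratings[s])
    rank = {seg: pos for pos, seg in enumerate(order)}
    return ratings, rank

# Three segments; the local ranker judged 1 > 0, 2 > 1, and 2 > 0.
ratings, rank = global_ranks(3, [(1, 0), (2, 1), (2, 0)])
```

With these comparisons, segment 2 receives the highest global rank and segment 0 the lowest, matching the transitive order implied by the local judgments.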
Multimodal Local-Global Ranking Fusion for Emotion Recognition