ABSTRACT
Deep learning-based approaches to facial and video analysis have recently demonstrated high performance on a variety of key tasks such as face recognition, emotion recognition, and activity recognition. In the case of video, information must often be aggregated across a variable-length sequence of frames to produce a classification result. Prior work using convolutional neural networks (CNNs) for emotion recognition in video has relied on temporal averaging and pooling operations reminiscent of widely used approaches for the spatial aggregation of information. Recurrent neural networks (RNNs) have seen an explosion of recent interest, as they yield state-of-the-art performance on a variety of sequence analysis tasks. RNNs provide an attractive framework for propagating information over a sequence using a continuous-valued hidden layer representation. In this work we present a complete system for the 2015 Emotion Recognition in the Wild (EmotiW) Challenge. We focus our presentation and experimental analysis on a hybrid CNN-RNN architecture for facial expression analysis that can outperform a previously applied CNN approach using temporal averaging for aggregation.
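The contrast the abstract draws between temporal averaging and recurrent aggregation can be illustrated with a small sketch. The snippet below is not the authors' Theano implementation; it is an illustrative PyTorch sketch in which a shared per-frame CNN feeds either (a) a mean over the frame features or (b) a simple RNN whose final hidden state is classified. All layer sizes, the input resolution, and the seven-class output are assumptions made for the example, not values from the paper.

```python
# Minimal sketch (assumed architecture, not the paper's): per-frame CNN features
# aggregated either by temporal averaging or by an RNN hidden state.
import torch
import torch.nn as nn


class FrameCNN(nn.Module):
    """Small per-frame feature extractor standing in for the face CNN."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(32 * 4 * 4, feat_dim)

    def forward(self, x):                      # x: (batch, 1, H, W)
        return self.fc(self.conv(x).flatten(1))


class CnnAveraging(nn.Module):
    """Baseline: average per-frame features over time, then classify."""
    def __init__(self, feat_dim=128, n_classes=7):
        super().__init__()
        self.cnn = FrameCNN(feat_dim)
        self.out = nn.Linear(feat_dim, n_classes)

    def forward(self, clip):                   # clip: (batch, T, 1, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        return self.out(feats.mean(dim=1))     # temporal averaging


class CnnRnn(nn.Module):
    """Hybrid: an RNN carries a hidden state across the frame features."""
    def __init__(self, feat_dim=128, hidden=128, n_classes=7):
        super().__init__()
        self.cnn = FrameCNN(feat_dim)
        self.rnn = nn.RNN(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, clip):                   # clip: (batch, T, 1, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        _, h_last = self.rnn(feats)            # h_last: (1, batch, hidden)
        return self.out(h_last.squeeze(0))     # classify final hidden state


if __name__ == "__main__":
    clip = torch.randn(2, 10, 1, 48, 48)       # 2 clips of 10 grayscale frames
    print(CnnAveraging()(clip).shape)          # torch.Size([2, 7])
    print(CnnRnn()(clip).shape)                # torch.Size([2, 7])
```

In both variants the per-frame CNN is shared across time; the only difference is whether frame features are averaged or propagated through a recurrent hidden state, which is the distinction the abstract's comparison turns on.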