ABSTRACT
Deep learning-based approaches to facial and video analysis have recently demonstrated high performance on a variety of key tasks such as face recognition, emotion recognition, and activity recognition. In the case of video, information must often be aggregated across a variable-length sequence of frames to produce a classification result. Prior work using convolutional neural networks (CNNs) for emotion recognition in video has relied on temporal averaging and pooling operations reminiscent of widely used approaches for the spatial aggregation of information. Recurrent neural networks (RNNs) have seen an explosion of recent interest, as they yield state-of-the-art performance on a variety of sequence analysis tasks. RNNs provide an attractive framework for propagating information over a sequence using a continuous-valued hidden layer representation. In this work we present a complete system for the 2015 Emotion Recognition in the Wild (EmotiW) Challenge. We focus our presentation and experimental analysis on a hybrid CNN-RNN architecture for facial expression analysis that can outperform a previously applied CNN approach using temporal averaging for aggregation.
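The contrast the abstract draws between temporal averaging and recurrent aggregation can be illustrated with a small sketch. The snippet below is not the authors' Theano implementation; it is an illustrative PyTorch sketch in which a shared per-frame CNN feeds either (a) a mean over the frame features or (b) a simple RNN whose final hidden state is classified. All layer sizes, the input resolution, and the seven-class output are assumptions made for the example, not values from the paper.

```python
# Minimal sketch (assumed architecture, not the paper's): per-frame CNN features
# aggregated either by temporal averaging or by an RNN hidden state.
import torch
import torch.nn as nn


class FrameCNN(nn.Module):
    """Small per-frame feature extractor standing in for the face CNN."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(32 * 4 * 4, feat_dim)

    def forward(self, x):                      # x: (batch, 1, H, W)
        return self.fc(self.conv(x).flatten(1))


class CnnAveraging(nn.Module):
    """Baseline: average per-frame features over time, then classify."""
    def __init__(self, feat_dim=128, n_classes=7):
        super().__init__()
        self.cnn = FrameCNN(feat_dim)
        self.out = nn.Linear(feat_dim, n_classes)

    def forward(self, clip):                   # clip: (batch, T, 1, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        return self.out(feats.mean(dim=1))     # temporal averaging


class CnnRnn(nn.Module):
    """Hybrid: an RNN carries a hidden state across the frame features."""
    def __init__(self, feat_dim=128, hidden=128, n_classes=7):
        super().__init__()
        self.cnn = FrameCNN(feat_dim)
        self.rnn = nn.RNN(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, clip):                   # clip: (batch, T, 1, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        _, h_last = self.rnn(feats)            # h_last: (1, batch, hidden)
        return self.out(h_last.squeeze(0))     # classify final hidden state


if __name__ == "__main__":
    clip = torch.randn(2, 10, 1, 48, 48)       # 2 clips of 10 grayscale frames
    print(CnnAveraging()(clip).shape)          # torch.Size([2, 7])
    print(CnnRnn()(clip).shape)                # torch.Size([2, 7])
```

In both variants the per-frame CNN is shared across time; the only difference is whether frame features are averaged or propagated through a recurrent hidden state, which is the distinction the abstract's comparison turns on.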