Published in: Journal on Multimodal User Interfaces 2/2016

01.06.2016 | Original Paper

EmoNets: Multimodal deep learning approaches for emotion recognition in video

Authors: Samira Ebrahimi Kahou, Xavier Bouthillier, Pascal Lamblin, Caglar Gulcehre, Vincent Michalski, Kishore Konda, Sébastien Jean, Pierre Froumenty, Yann Dauphin, Nicolas Boulanger-Lewandowski, Raul Chandias Ferrari, Mehdi Mirza, David Warde-Farley, Aaron Courville, Pascal Vincent, Roland Memisevic, Christopher Pal, Yoshua Bengio


Abstract

The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood-style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which combine features from multiple modalities for label assignment. In this paper we present our approach of learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces; a deep belief net, focusing on the representation of the audio stream; a K-Means-based "bag-of-mouths" model, which extracts visual features around the mouth region; and a relational autoencoder, which addresses spatio-temporal aspects of videos. We explore multiple methods for combining the cues from these modalities into one common classifier, which achieves considerably greater accuracy than predictions from our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test-set accuracy of 47.67 % on the 2014 dataset.
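The fusion idea described in the abstract, combining per-modality predictions into one common classifier, can be illustrated with a minimal late-fusion sketch. This is not the paper's actual fusion method; the probability vectors and modality weights below are purely illustrative assumptions, and the seven-class layout only mirrors the EmotiW label set.

```python
import numpy as np

# Hypothetical per-modality class probabilities for a single clip over
# seven emotion classes (e.g. angry, disgust, fear, happy, sad,
# surprise, neutral). Values are illustrative, not taken from the paper.
p_face  = np.array([0.05, 0.05, 0.10, 0.55, 0.10, 0.05, 0.10])  # CNN on faces
p_audio = np.array([0.10, 0.05, 0.15, 0.40, 0.15, 0.05, 0.10])  # deep belief net
p_mouth = np.array([0.10, 0.10, 0.10, 0.45, 0.10, 0.05, 0.10])  # bag-of-mouths

def fuse(probs, weights):
    """Weighted average of per-modality probability vectors,
    renormalized so the fused vector sums to one."""
    probs = np.asarray(probs)                 # shape (n_modalities, n_classes)
    weights = np.asarray(weights, dtype=float)
    fused = weights @ probs / weights.sum()   # convex combination of rows
    return fused / fused.sum()

weights = [0.5, 0.3, 0.2]  # illustrative modality weights, not the paper's
fused = fuse([p_face, p_audio, p_mouth], weights)
predicted_class = int(fused.argmax())
print(predicted_class)
```

In this toy example all three modalities favor the same class, so the fused prediction agrees with each individual model; the interesting cases are clips where modalities disagree and the weighting decides the label.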


Footnotes
1
Yaafe: audio features extraction toolbox: http://yaafe.sourceforge.net/.
 
Metadata
Title
EmoNets: Multimodal deep learning approaches for emotion recognition in video
Authors
Samira Ebrahimi Kahou
Xavier Bouthillier
Pascal Lamblin
Caglar Gulcehre
Vincent Michalski
Kishore Konda
Sébastien Jean
Pierre Froumenty
Yann Dauphin
Nicolas Boulanger-Lewandowski
Raul Chandias Ferrari
Mehdi Mirza
David Warde-Farley
Aaron Courville
Pascal Vincent
Roland Memisevic
Christopher Pal
Yoshua Bengio
Publication date
01.06.2016
Publisher
Springer International Publishing
Published in
Journal on Multimodal User Interfaces / Issue 2/2016
Print ISSN: 1783-7677
Electronic ISSN: 1783-8738
DOI
https://doi.org/10.1007/s12193-015-0195-2
