Published in: Journal on Multimodal User Interfaces 4/2016

01.12.2016 | Original Paper

Audio-visual emotion recognition using multi-directional regression and Ridgelet transform

Authors: M. Shamim Hossain, Ghulam Muhammad

Abstract

In this paper, we propose an audio-visual emotion recognition system using multi-directional regression (MDR) audio features and ridgelet transform based face image features. MDR features capture directional derivative information in the spectro-temporal domain of speech and are therefore suitable for encoding different levels of increasing or decreasing pitch and formant frequencies. For video inputs, interest points in a time frame are detected using spectro-temporal filters, and the ridgelet transform is applied to cuboids around the interest points. Two separate extreme learning machine classifiers are used, one for the speech modality and the other for the face modality. The scores of these two classifiers are fused using a Bayesian sum rule to make the final decision. Experimental results on the eNTERFACE database show that the proposed method achieves an accuracy of 85.06 % using bimodal inputs, 64.04 % using speech only, and 58.38 % using face only; these accuracies exceed those reported by other state-of-the-art systems on the same database.
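The decision step described in the abstract — fusing per-modality classifier scores with a sum rule — can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the label set follows the six eNTERFACE emotion classes, the score vectors in the example are hypothetical, and each modality's scores are simply normalized to sum to one before addition.

```python
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def sum_rule_fusion(audio_scores, video_scores):
    """Fuse per-class scores from two classifiers with a sum rule.

    Each score list is normalized to sum to 1, so neither modality
    dominates by scale; the fused decision is the argmax of the
    summed (pseudo-posterior) scores.
    """
    a_total = sum(audio_scores)
    v_total = sum(video_scores)
    fused = [a / a_total + v / v_total
             for a, v in zip(audio_scores, video_scores)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best], fused

# Hypothetical scores for one clip: speech strongly favors "anger",
# face weakly favors "fear"; the sum rule decides jointly.
label, fused = sum_rule_fusion([0.5, 0.1, 0.2, 0.1, 0.05, 0.05],
                               [0.2, 0.1, 0.3, 0.2, 0.1, 0.1])
```

In this sketch the fused score for "anger" (0.7) outweighs "fear" (0.5), so the bimodal decision follows the speech modality; a prior-weighted Bayesian variant would scale each normalized score by a class prior before summing.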


References
1.
Schuller B, Rigoll G, Lang M (2004) Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine—belief network architecture. In: Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp I-577–580
2.
Zhou Y, Sun Y, Zhang J, Yan Y (2009) Speech emotion recognition using both spectral and prosodic features. In: Proceedings of International Conference on Information Engineering and Computer Science (ICIECS), pp 1–4
3.
Devillers L, Vidrascu V (2006) Real-life emotion detection with lexical and paralinguistic cues on human-human call center dialogs. In: Proceedings of Interspeech 2006, Pittsburgh
4.
Gharavian D, Sheikhan M, Nazerieh AR, Garoucy S (2012) Speech emotion recognition using FCBF feature selection method and GA-optimized fuzzy ARTMAP neural network. Neural Comput Appl 21(8):2115–2126. doi:10.1007/s00521-011-0643-1
5.
Albornoz EM, Milone DH, Rufiner HL (2011) Spoken emotion recognition using hierarchical classifiers. Comput Speech Lang 25(3):556–570
6.
Burkhardt F, Paeschke A, Rolfes M, Sendlmeier W, Weiss B (2005) A database of German emotional speech. In: Proceedings of Interspeech 2005, Lisbon
7.
Bettadapura V (2012) Face expression recognition and analysis: the state of the art. College of Computing, Georgia Institute of Technology. arXiv:1203.6722v1
8.
Senechal T, Rapp V, Salam H, Seguier R, Bailly K, Prevost L (2012) Facial action recognition combining heterogeneous features via multikernel learning. IEEE Trans Syst Man Cybern B 42(4):993–1005
9.
Agrawal S, Khatri P (2015) Facial expression detection techniques: based on Viola and Jones algorithm and principal component analysis. In: Proceedings of 2015 Fifth International Conference on Advanced Computing & Communication Technologies (ACCT), pp 108–112
10.
Majumder A, Behera L, Subramanian VK (2014) Emotion recognition from geometric facial features using self-organizing map. Pattern Recogn 47(3):1282–1293
11.
Pantic M, Valstar MF, Rademaker R, Maat L (2005) Web-based database for facial expression analysis. In: Proceedings of 13th ACM International Conference on Multimedia '05, pp 317–321. Database available at http://www.mmifacedb.com/
12.
Bejani M, Gharavian D, Charkari NM (2014) Audiovisual emotion recognition using ANOVA feature selection method and multi-classifier neural networks. Neural Comput Appl 24(2):399–412
13.
Martin O, Kotsia I, Macq B, Pitas I (2006) The eNTERFACE'05 audiovisual emotion database. In: Proceedings of ICDEW 2006, p 8, Atlanta, April 3–8
14.
Kachele M, Glodek M, Zharkov D, Meudt S, Schwenker F (2014) Fusion of audio-visual features using hierarchical classifier systems for the recognition of affective states and the state of depression. In: Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM), pp 671–678
15.
Jeremie N, Vincent R, Kevin B, Lionel P, Mohamed C (2014) Audio-visual emotion recognition: a dynamic, multimodal approach. In: Proceedings of 26th French Conference on Human–Machine Interaction (IHM '14), Lille
16.
Lin J-C, Wu C-H, Wei W-L (2012) Error weighted semi-coupled hidden Markov model for audio-visual emotion recognition. IEEE Trans Multimed 14(1):142–156
17.
Kim Y, Lee H, Provost EM (2013) Deep learning for robust feature generation in audiovisual emotion recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 3687–3691, 26–31 May 2013
18.
Metallinou A, Wollmer M, Katsamanis A, Eyben F, Schuller B, Narayanan S (2012) Context-sensitive learning for enhanced audiovisual emotion classification. IEEE Trans Affect Comput 3(2):184–198
19.
Mesgarani N, David S, Fritz J, Shamma S (2008) Phoneme representation and classification in primary cortex. J Acoust Soc Am 123:899–909
20.
Muhammad G, Mesallam T, Almalki K, Farahat M, Mahmood A, Alsulaiman M (2012) Multi directional regression (MDR) based features for automatic voice disorder detection. J Voice 26(6):817.e19–817.e27
21.
22.
Huang G-B, Zhu Q-Y, Siew C-K (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1–3):489–501
23.
Dollar P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: Proceedings of IEEE VS-PETS 2005, pp 65–72, Beijing, 15–16 Oct 2005
24.
25.
Huang G-B, Zhou H, Ding X, Zhang R (2012) Extreme learning machine for regression and multiclass classification. IEEE Trans Syst Man Cybern B 42(2):513–529
26.
Huang W, Li N, Lin Z, Huang G-B, Zong W, Zhou J, Duan Y (2013) Liver tumor detection and segmentation using kernel-based extreme learning machine. In: Proceedings of 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '13), pp 3662–3665, Osaka
27.
Savojardo C, Fariselli P, Casadio R (2013) BETAWARE: a machine-learning tool to detect and predict transmembrane beta-barrel proteins in prokaryotes. Bioinformatics 29(4):504–505
28.
Yin XX, Hadjiloucas S, Zhang Y (2014) Complex extreme learning machine applications in terahertz pulsed signals feature sets. Comput Methods Programs Biomed 117(2):387–403
29.
Hossain MS, Muhammad G, Song B, Hassan M, Alelaiwi A, Alamri A (2015) Audio-visual emotion-aware cloud gaming framework. IEEE Trans Circuits Syst Video Technol. doi:10.1109/TCSVT.2015.2444731
30.
Kanade T, Cohn J, Tian Y (2000) Comprehensive database for facial expression analysis. In: Proceedings of IEEE International Conference on Face and Gesture Recognition (AFGR '00), pp 46–53
31.
Mansoorizadeh M, Charkari NM (2010) Multimodal information fusion application to human emotion recognition from face and speech. Multimed Tools Appl 49(2):277–297
32.
Jiang D, Cui Y, Zhang X, Fan P, Ganzalez I, Sahli H (2011) Audio visual emotion recognition based on triple-stream dynamic Bayesian network models. In: D'Mello S et al (eds) ACII 2011, Part I, LNCS 6974, pp 609–618
33.
Paleari M, Huet B (2008) Toward emotion indexing of multimedia excerpts. In: Proceedings of International Workshop on Content-Based Multimedia Indexing (CBMI), pp 425–432, London, June 2008
34.
Muhammad G, Masud M, Alelaiwi A, Rahman MA, Karime A, Alamri A, Hossain MS (2015) Spectro-temporal directional derivative based automatic speech recognition for a serious game scenario. Multimed Tools Appl 74(14):5313–5327. doi:10.1007/s11042-014-1973-7
35.
Jin Q, Li C, Chen S, Wu H (2015) Speech emotion recognition with acoustic and lexical features. In: Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 4749–4753, 19–24 Apr 2015
Metadata
Title
Audio-visual emotion recognition using multi-directional regression and Ridgelet transform
Authors
M. Shamim Hossain
Ghulam Muhammad
Publication date
01.12.2016
Publisher
Springer International Publishing
Published in
Journal on Multimodal User Interfaces / Issue 4/2016
Print ISSN: 1783-7677
Electronic ISSN: 1783-8738
DOI
https://doi.org/10.1007/s12193-015-0207-2
