20-12-2019 | Research Article - Computer Engineering and Computer Science

An Efficient Language-Independent Acoustic Emotion Classification System

Authors: Rajwinder Singh, Harshita Puri, Naveen Aggarwal, Varun Gupta

Published in: Arabian Journal for Science and Engineering | Issue 4/2020

Abstract

Emotion recognition from human speech is essential for understanding the complexity of human nature. For a machine to accurately decipher the intended message in speech, it must understand the emotion behind the spoken words. Emotions control the modulations in speech, and these modulations may even change the context. In this paper, we propose a system that can efficiently detect emotions from speech. Emotion recognition from speech is a complex problem because emotions occupy highly overlapping regions, and it is sometimes very difficult to distinguish between two emotions on the basis of voice alone. Such ambiguity in label assignment is responsible for the low classification accuracy of existing systems. In the proposed system, we have worked on finding both a suitable feature set and a suitable classifier. The proposed system achieved a 29.74% increase in classification accuracy over the baseline human accuracy on the primary dataset, 'CREMA-D'. We further validated the system on other standard datasets: 'EmoDB', 'RAVDESS', and 'SAVEE'. 'EmoDB' is a German-language dataset, while the other two are English-language datasets, which is in line with the language-independent nature of our system. Compared with the current state of the art on these datasets, the proposed system gives better accuracies in most cases and, in the remaining cases, accuracies comparable to baseline models or existing published work.
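The pipeline summarized above reduces to two design choices: an acoustic feature set and a classifier. The abstract does not reproduce the paper's actual configuration, so the following is only a minimal sketch of such a pipeline, assuming MFCC summary statistics as the feature set and a support vector machine as the classifier (both illustrative choices, not the authors' method); feature extraction uses librosa [39].

# Minimal sketch of an acoustic emotion classification pipeline.
# The feature set (MFCC statistics) and classifier (SVM) are
# illustrative assumptions, not the configuration reported in the paper.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def extract_features(path, n_mfcc=40):
    # Load the utterance at its native sampling rate.
    y, sr = librosa.load(path, sr=None)
    # Frame-level MFCCs: shape (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Mean and standard deviation over time yield one fixed-length
    # utterance-level descriptor regardless of clip duration.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_and_evaluate(files, labels):
    # `files` and `labels` are placeholders for lists of WAV paths and
    # emotion labels (e.g. from CREMA-D); supply your own data.
    X = np.stack([extract_features(f) for f in files])
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=0)
    clf = SVC(kernel="rbf")
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

Summarizing frame-level MFCCs by their mean and standard deviation gives every utterance the same feature dimensionality, which lets a fixed-input classifier handle clips of varying length.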


Literature
1.
Vlasenko, B.; Schuller, B.; Wendemuth, A.; Rigoll, G.: On the influence of phonetic content variation for acoustic emotion recognition. In: International Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, pp. 217–220. Springer, Berlin, Heidelberg (2008)
2.
Scherer, K.R.: Vocal communication of emotion: a review of research paradigms. Speech Commun. 40(1–2), 227–256 (2003)
3.
Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R.: CREMA-D: crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 5(4), 377–390 (2014)
4.
Barsoum, E.; Zhang, C.; Ferrer, C.C.; Zhang, Z.: Training deep networks for facial expression recognition with crowd-sourced label distribution. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 279–283. ACM (2016)
5.
Abdelwahab, M.; Busso, C.: Study of dense network approaches for speech emotion recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5084–5088. IEEE (2018)
6.
Burmania, A.; Busso, C.: A stepwise analysis of aggregated crowdsourced labels describing multimodal emotional behaviors. In: INTERSPEECH, pp. 152–156 (2017)
7.
Arora, P.; Chaspari, T.: Exploring Siamese neural network architectures for preserving speaker identity in speech emotion classification. In: Proceedings of the 4th International Workshop on Multimodal Analyses Enabling Artificial Agents in Human–Machine Interaction, pp. 15–18. ACM (2018)
8.
Oudeyer, P.Y.: Novel useful features and algorithms for the recognition of emotions in human speech. In: Speech Prosody 2002, International Conference (2002)
9.
Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B.: A database of German emotional speech. In: Ninth European Conference on Speech Communication and Technology (2005)
10.
Livingstone, S.R.; Russo, F.A.: The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5), e0196391 (2018)
11.
Jackson, P.; Haq, S.: Surrey Audio-Visual Expressed Emotion (SAVEE) Database. University of Surrey, Guildford (2014)
12.
Neiberg, D.; Elenius, K.; Karlsson, I.; Laskowski, K.: Emotion recognition in spontaneous speech. In: Proceedings of Fonetik, pp. 101–104 (2006)
13.
Blouin, C.; Maffiolo, V.: A study on the automatic detection and characterization of emotion in a voice service context. In: Ninth European Conference on Speech Communication and Technology (2005)
14.
Cummings, K.E.; Clements, M.A.: Analysis of the glottal excitation of emotionally styled and stressed speech. J. Acoust. Soc. Am. 98(1), 88–98 (1995)
15.
Sauter, D.A.; Eisner, F.; Ekman, P.; Scott, S.K.: Cross-cultural recognition of basic emotions through nonverbal emotional vocalizations. Proc. Natl. Acad. Sci. 107(6), 2408–2412 (2010)
16.
Fayek, H.M.; Lech, M.; Cavedon, L.: Evaluating deep learning architectures for speech emotion recognition. Neural Netw. 92, 60–68 (2017)
17.
Huang, C.-W.; Narayanan, S.S.: Attention assisted discovery of sub-utterance structure in speech emotion recognition. In: Proceedings of Interspeech, pp. 1387–1391 (2016)
18.
Lee, J.; Tashev, I.: High-level feature representation using recurrent neural network for speech emotion recognition. In: INTERSPEECH, pp. 1537–1540 (2015)
19.
Singh, R.; Rana, R.; Singh, S.K.: Performance evaluation of VGG models in detection of wheat rust. Asian J. Comput. Sci. Technol. 7(3), 76–81 (2018)
20.
21.
22.
Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; et al.: Deep Speech: Scaling Up End-to-End Speech Recognition (2014). CoRR, arXiv:1412.5567
23.
Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; et al.: Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016). arXiv:1609.08144 [cs]
24.
Lakomkin, E.; Zamani, M.A.; Weber, C.; Magg, S.; Wermter, S.: EmoRL: continuous acoustic emotion classification using deep reinforcement learning. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–6. IEEE (2018)
25.
Eyben, F.; Weninger, F.; Gross, F.; Schuller, B.: Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In: Proceedings of the 21st ACM International Conference on Multimedia, MM'13, pp. 835–838. ACM, New York (2013)
26.
Bahdanau, D.; Cho, K.; Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR (2015)
27.
Wang, Z.Q.; Tashev, I.: Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5150–5154. IEEE (2017)
28.
Bothe, C.; Magg, S.; Weber, C.; Wermter, S.: Conversational analysis using utterance-level attention-based bidirectional recurrent neural networks (2018). arXiv preprint arXiv:1805.06242
29.
Erdem, E.S.; Sert, M.: Efficient recognition of human emotional states from audio signals. In: 2014 IEEE International Symposium on Multimedia, pp. 139–142. IEEE (2014)
31.
Kodukula, S.R.M.: Significance of excitation source information for speech analysis. Ph.D. thesis, Dept. of Computer Science, IIT Madras (2009)
32.
Yegnanarayana, B.; Murthy, P.S.; Avendaño, C.; Hermansky, H.: Enhancement of reverberant speech using LP residual. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP'98 (Cat. No. 98CH36181), vol. 1, pp. 405–408. IEEE (1998)
33.
Yegnanarayana, B.; Prasanna, S.M.; Rao, K.S.: Speech enhancement using excitation source information. In: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. I–541. IEEE (2002)
34.
Ravindran, G.; Shenbagadevi, S.; Selvam, V.S.: Cepstral and linear prediction techniques for improving intelligibility and audibility of impaired speech. J. Biomed. Sci. Eng. 3(01), 85 (2010)
35.
Ververidis, D.; Kotropoulos, C.: Emotional speech recognition: resources, features, and methods. Speech Commun. 48(9), 1162–1181 (2006)
36.
Bänziger, T.; Scherer, K.R.: The role of intonation in emotional expressions. Speech Commun. 46(3–4), 252–267 (2005)
37.
Cowie, R.; Cornelius, R.R.: Describing the emotional states that are expressed in speech. Speech Commun. 40(1–2), 5–32 (2003)
38.
Jannat, R.; Tynes, I.; Lime, L.L.; Adorno, J.; Canavan, S.: Ubiquitous emotion recognition using audio and video data. In: Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, pp. 956–959. ACM (2018)
39.
McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O.: librosa: audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference, pp. 18–25 (2015)
40.
Graves, A.: Supervised sequence labelling with recurrent neural networks. Ph.D. thesis, Technische Universität München (2008)
41.
Gao, M.; Dong, J.; Zhou, D.; Zhang, Q.; Yang, D.: End-to-end speech emotion recognition based on one-dimensional convolutional neural network. In: Proceedings of the 2019 3rd International Conference on Innovation in Artificial Intelligence, pp. 78–82. ACM (2019)
42.
Anjum, M.: Emotion recognition from speech for an interactive robot agent. In: 2019 IEEE/SICE International Symposium on System Integration (SII), pp. 363–368. IEEE (2019)
44.
Jannat, R.; Tynes, I.; Lime, L.L.; Adorno, J.; Canavan, S.: Ubiquitous emotion recognition using audio and video data. In: Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, pp. 956–959. ACM (2018)
45.
Fagerland, M.W.; Lydersen, S.; Laake, P.: Statistical Analysis of Contingency Tables. Taylor & Francis/CRC, Boca Raton (2017)
46.
Chow, S.C.; Shao, J.; Wang, H.; Lokhnygina, Y.: Sample Size Calculations in Clinical Research, 3rd edn. Taylor & Francis/CRC, Boca Raton (2018)
Metadata
Title
An Efficient Language-Independent Acoustic Emotion Classification System
Authors
Rajwinder Singh
Harshita Puri
Naveen Aggarwal
Varun Gupta
Publication date
20-12-2019
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 4/2020
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-019-04293-9
