20-12-2019 | Research Article - Computer Engineering and Computer Science

An Efficient Language-Independent Acoustic Emotion Classification System

Authors: Rajwinder Singh, Harshita Puri, Naveen Aggarwal, Varun Gupta

Published in: Arabian Journal for Science and Engineering | Issue 4/2020

Abstract

Emotion recognition from human speech is essential for understanding the complexity of human nature. For a machine to accurately decipher the intended message in speech, it must understand the emotion behind the spoken words. Emotions control the modulations in speech, and these modulations may even change the context. In this paper, we propose a system that can efficiently detect emotions from speech. Emotion recognition from speech is a complex problem because emotions occupy highly overlapping regions, and it is sometimes very difficult to distinguish between two emotions on the basis of voice alone. Such ambiguity in label assignment is responsible for the low classification accuracy of existing systems. In the proposed system, we have worked on finding both a suitable feature set and a suitable classifier. The proposed system achieved a 29.74% increase in classification accuracy over the baseline human accuracy on the primary dataset, 'CREMA-D'. We further validated the system on other standard datasets: 'EmoDB', 'RAVDESS', and 'SAVEE'. 'EmoDB' is a German-language dataset, while the other two are English-language datasets, which is in line with the language-independent nature of our system. Compared with the current state of the art on these datasets, the proposed system gives better accuracies in most cases and, in the remaining cases, accuracies comparable to baseline models or existing published work.
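The pipeline summarized above reduces to two design choices: an acoustic feature set and a classifier. The abstract does not reproduce the paper's actual configuration, so the following is only a minimal sketch of such a pipeline, assuming MFCC summary statistics as the feature set and a support vector machine as the classifier (both illustrative choices, not the authors' method); feature extraction uses librosa [39].

# Minimal sketch of an acoustic emotion classification pipeline.
# The feature set (MFCC statistics) and classifier (SVM) are
# illustrative assumptions, not the configuration reported in the paper.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def extract_features(path, n_mfcc=40):
    # Load the utterance at its native sampling rate.
    y, sr = librosa.load(path, sr=None)
    # Frame-level MFCCs: shape (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Mean and standard deviation over time yield one fixed-length
    # utterance-level descriptor regardless of clip duration.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_and_evaluate(files, labels):
    # `files` and `labels` are placeholders for lists of WAV paths and
    # emotion labels (e.g. from CREMA-D); supply your own data.
    X = np.stack([extract_features(f) for f in files])
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=0)
    clf = SVC(kernel="rbf")
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

Summarizing frame-level MFCCs by their mean and standard deviation gives every utterance the same feature dimensionality, which lets a fixed-input classifier handle clips of varying length.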


Literature
1.
Vlasenko, B.; Schuller, B.; Wendemuth, A.; Rigoll, G.: On the influence of phonetic content variation for acoustic emotion recognition. In: International Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, pp. 217–220. Springer, Berlin, Heidelberg (2008)
2.
Scherer, K.R.: Vocal communication of emotion: a review of research paradigms. Speech Commun. 40(1–2), 227–256 (2003)
3.
Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R.: CREMA-D: crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 5(4), 377–390 (2014)
4.
Barsoum, E.; Zhang, C.; Ferrer, C.C.; Zhang, Z.: Training deep networks for facial expression recognition with crowd-sourced label distribution. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 279–283. ACM (2016)
5.
Abdelwahab, M.; Busso, C.: Study of dense network approaches for speech emotion recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5084–5088. IEEE (2018)
6.
Burmania, A.; Busso, C.: A stepwise analysis of aggregated crowdsourced labels describing multimodal emotional behaviors. In: INTERSPEECH, pp. 152–156 (2017)
7.
Arora, P.; Chaspari, T.: Exploring Siamese neural network architectures for preserving speaker identity in speech emotion classification. In: Proceedings of the 4th International Workshop on Multimodal Analyses Enabling Artificial Agents in Human–Machine Interaction, pp. 15–18. ACM (2018)
8.
Oudeyer, P.Y.: Novel useful features and algorithms for the recognition of emotions in human speech. In: Speech Prosody 2002, International Conference (2002)
9.
Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B.: A database of German emotional speech. In: Ninth European Conference on Speech Communication and Technology (2005)
10.
Livingstone, S.R.; Russo, F.A.: The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5), e0196391 (2018)
11.
Jackson, P.; Haq, S.: Surrey Audio-Visual Expressed Emotion (SAVEE) Database. University of Surrey, Guildford (2014)
12.
Neiberg, D.; Elenius, K.; Karlsson, I.; Laskowski, K.: Emotion recognition in spontaneous speech. In: Proceedings of Fonetik, pp. 101–104 (2006)
13.
Blouin, C.; Maffiolo, V.: A study on the automatic detection and characterization of emotion in a voice service context. In: Ninth European Conference on Speech Communication and Technology (2005)
14.
Cummings, K.E.; Clements, M.A.: Analysis of the glottal excitation of emotionally styled and stressed speech. J. Acoust. Soc. Am. 98(1), 88–98 (1995)
15.
Sauter, D.A.; Eisner, F.; Ekman, P.; Scott, S.K.: Cross-cultural recognition of basic emotions through nonverbal emotional vocalizations. Proc. Natl. Acad. Sci. 107(6), 2408–2412 (2010)
16.
Fayek, H.M.; Lech, M.; Cavedon, L.: Evaluating deep learning architectures for speech emotion recognition. Neural Netw. 92, 60–68 (2017)
17.
Huang, C.-W.; Narayanan, S.S.: Attention assisted discovery of sub-utterance structure in speech emotion recognition. In: Proceedings of Interspeech, pp. 1387–1391 (2016)
18.
Lee, J.; Tashev, I.: High-level feature representation using recurrent neural network for speech emotion recognition. In: INTERSPEECH, pp. 1537–1540 (2015)
19.
Singh, R.; Rana, R.; Singh, S.K.: Performance evaluation of VGG models in detection of wheat rust. Asian J. Comput. Sci. Technol. 7(3), 76–81 (2018)
20.
21.
22.
Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; et al.: Deep Speech: Scaling Up End-to-End Speech Recognition (2014). CoRR, arXiv:1412.5567
23.
Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; et al.: Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016). arXiv:1609.08144 [cs]
24.
Lakomkin, E.; Zamani, M.A.; Weber, C.; Magg, S.; Wermter, S.: EmoRL: continuous acoustic emotion classification using deep reinforcement learning. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–6. IEEE (2018)
25.
Eyben, F.; Weninger, F.; Gross, F.; Schuller, B.: Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In: Proceedings of the 21st ACM International Conference on Multimedia, MM'13, pp. 835–838. ACM, New York (2013)
26.
Bahdanau, D.; Cho, K.; Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR (2015)
27.
Wang, Z.Q.; Tashev, I.: Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5150–5154. IEEE (2017)
28.
Bothe, C.; Magg, S.; Weber, C.; Wermter, S.: Conversational analysis using utterance-level attention-based bidirectional recurrent neural networks (2018). arXiv preprint arXiv:1805.06242
29.
Erdem, E.S.; Sert, M.: Efficient recognition of human emotional states from audio signals. In: 2014 IEEE International Symposium on Multimedia, pp. 139–142. IEEE (2014)
31.
Kodukula, S.R.M.: Significance of excitation source information for speech analysis. Ph.D. thesis, Dept. of Computer Science, IIT Madras (2009)
32.
Yegnanarayana, B.; Murthy, P.S.; Avendaño, C.; Hermansky, H.: Enhancement of reverberant speech using LP residual. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP'98 (Cat. No. 98CH36181), vol. 1, pp. 405–408. IEEE (1998)
33.
Yegnanarayana, B.; Prasanna, S.M.; Rao, K.S.: Speech enhancement using excitation source information. In: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. I–541. IEEE (2002)
34.
Ravindran, G.; Shenbagadevi, S.; Selvam, V.S.: Cepstral and linear prediction techniques for improving intelligibility and audibility of impaired speech. J. Biomed. Sci. Eng. 3(01), 85 (2010)
35.
Ververidis, D.; Kotropoulos, C.: Emotional speech recognition: resources, features, and methods. Speech Commun. 48(9), 1162–1181 (2006)
36.
Bänziger, T.; Scherer, K.R.: The role of intonation in emotional expressions. Speech Commun. 46(3–4), 252–267 (2005)
37.
Cowie, R.; Cornelius, R.R.: Describing the emotional states that are expressed in speech. Speech Commun. 40(1–2), 5–32 (2003)
38.
Jannat, R.; Tynes, I.; Lime, L.L.; Adorno, J.; Canavan, S.: Ubiquitous emotion recognition using audio and video data. In: Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, pp. 956–959. ACM (2018)
39.
McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O.: librosa: audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference, pp. 18–25 (2015)
40.
Graves, A.: Supervised sequence labelling with recurrent neural networks. Ph.D. thesis, Technische Universität München (2008)
41.
Gao, M.; Dong, J.; Zhou, D.; Zhang, Q.; Yang, D.: End-to-end speech emotion recognition based on one-dimensional convolutional neural network. In: Proceedings of the 2019 3rd International Conference on Innovation in Artificial Intelligence, pp. 78–82. ACM (2019)
42.
Anjum, M.: Emotion recognition from speech for an interactive robot agent. In: 2019 IEEE/SICE International Symposium on System Integration (SII), pp. 363–368. IEEE (2019)
44.
Jannat, R.; Tynes, I.; Lime, L.L.; Adorno, J.; Canavan, S.: Ubiquitous emotion recognition using audio and video data. In: Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, pp. 956–959. ACM (2018)
45.
Fagerland, M.W.; Lydersen, S.; Laake, P.: Statistical Analysis of Contingency Tables. Taylor & Francis/CRC, Boca Raton (2017)
46.
Chow, S.C.; Shao, J.; Wang, H.; Lokhnygina, Y.: Sample Size Calculations in Clinical Research, 3rd edn. Taylor & Francis/CRC, Boca Raton (2018)
Metadata
Title
An Efficient Language-Independent Acoustic Emotion Classification System
Authors
Rajwinder Singh
Harshita Puri
Naveen Aggarwal
Varun Gupta
Publication date
20-12-2019
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 4/2020
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-019-04293-9
