Published in: International Journal of Speech Technology 3/2022

08.07.2022

Machine learning techniques for speech emotion recognition using paralinguistic acoustic features

Authors: Tulika Jha, Ramisetty Kavya, Jabez Christopher, Vasan Arunachalam

Abstract

Speech emotion recognition is one of the fastest growing areas of interest in the field of affective computing. Emotion detection aids human–computer interaction and finds application in a wide gamut of sectors, ranging from healthcare to retail to education. The present work strives to provide a speech emotion recognition framework that is both reliable and efficient enough to work in real-time environments. Speech emotion recognition can be performed using linguistic as well as paralinguistic aspects of speech; this work focuses on the latter, using non-lexical or paralinguistic attributes of speech such as pitch, intensity and mel-frequency cepstral coefficients (MFCCs) to train supervised machine learning models for emotion recognition. A combination of prosodic and spectral features is used for experimental analysis, and classification is performed with Gaussian Naïve Bayes, Random Forest, k-Nearest Neighbours, Support Vector Machine (SVM) and Multilayer Perceptron (MLP) classifiers. These models were chosen for the speed with which they can be trained, making them well suited to real-time applications. Comparative analysis reveals SVM and MLP to be the best performers, with accuracies of 77.86% and 79.62% respectively. The performance of these classifiers is compared with benchmark results in the literature, showing a significant improvement over state-of-the-art models. The observations and findings of this work can be applied to build real-time emotion recognition frameworks for applications and technologies across various domains.
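
As a hedged illustration of the pipeline summarised in the abstract, the sketch below shows one way such a system could be assembled: per-utterance paralinguistic features (pitch, intensity and MFCCs) are reduced to summary statistics and used to train SVM and MLP classifiers. This is not the authors' implementation; it assumes the librosa and scikit-learn libraries, the hyperparameters are illustrative, and the `files`/`labels` variables are hypothetical placeholders for an emotional-speech corpus.

```python
# Minimal sketch (not the paper's exact pipeline):
# paralinguistic feature extraction followed by SVM and MLP classification.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def extract_features(path, n_mfcc=13):
    """Summarise one utterance as a fixed-length feature vector:
    mean/std of the pitch contour, RMS intensity and each MFCC coefficient."""
    y, sr = librosa.load(path, sr=None)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    pitch = np.nan_to_num(f0)                # unvoiced frames -> 0 Hz
    rms = librosa.feature.rms(y=y)[0]        # frame-wise intensity proxy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    stats = lambda a: [np.mean(a), np.std(a)]
    return np.hstack([stats(pitch), stats(rms),
                      mfcc.mean(axis=1), mfcc.std(axis=1)])

# Placeholders: point these at an emotional-speech corpus before running.
files = []    # e.g. paths to .wav utterances
labels = []   # e.g. "angry", "happy", "sad", ... (one label per file)

X = np.vstack([extract_features(f) for f in files])
y = np.array(labels)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scaler = StandardScaler().fit(X_tr)
classifiers = {
    "SVM": SVC(kernel="rbf"),
    "MLP": MLPClassifier(hidden_layer_sizes=(128,), max_iter=500),
}
for name, clf in classifiers.items():
    clf.fit(scaler.transform(X_tr), y_tr)
    acc = accuracy_score(y_te, clf.predict(scaler.transform(X_te)))
    print(f"{name} accuracy: {acc:.4f}")
```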

Metadata
Title: Machine learning techniques for speech emotion recognition using paralinguistic acoustic features
Authors: Tulika Jha, Ramisetty Kavya, Jabez Christopher, Vasan Arunachalam
Publication date: 08.07.2022
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 3/2022
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-022-09985-6
