
2017 | Original Paper | Book Chapter

Improving Speech-Based Emotion Recognition by Using Psychoacoustic Modeling and Analysis-by-Synthesis

Authors: Ingo Siegert, Alicia Flores Lotz, Olga Egorow, Andreas Wendemuth

Published in: Speech and Computer

Publisher: Springer International Publishing


Abstract

Most technical communication systems use speech compression codecs to save transmission bandwidth. Considerable development effort has gone into guaranteeing high speech intelligibility, resulting in different compression techniques: Analysis-by-Synthesis, psychoacoustic modeling, and a hybrid mode combining both. Our first assumption is that the hybrid mode improves speech intelligibility. However, enabling a natural spoken conversation also requires that the affective, namely emotional, information contained in spoken language be intelligibly transmitted. Compression methods are usually avoided for emotion recognition problems, as it is feared that compression degrades the acoustic characteristics needed for accurate recognition [1]. By contrast, our second assumption states that the combination of psychoacoustic modeling and Analysis-by-Synthesis codecs could actually improve speech-based emotion recognition by removing certain parts of the acoustic signal that are considered “unnecessary”, while still retaining the full emotional information. To test both assumptions, we conducted an ITU-recommended POLQA measurement as well as several emotion recognition experiments on two different datasets to verify the generality of this assumption. We compared the results for the hybrid mode with those for Analysis-by-Synthesis-only and psychoacoustic-modeling-only codecs. The hybrid mode does not show remarkable differences in speech intelligibility, but it outperforms all other compression settings in the multi-class emotion recognition experiments and even achieves an approximately 3.3% absolute higher performance than the uncompressed samples.
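Multi-class results of the kind compared in the abstract are commonly reported in speech emotion recognition as unweighted average recall (UAR), the mean of the per-class recalls, which is robust to class imbalance. The abstract does not name the metric, so this is an assumption; the following pure-Python sketch with hypothetical labels is only illustrative and is not the authors' evaluation code.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls (UAR); each class contributes equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical three-class example (anger / neutral / sadness):
y_true = ["ang", "ang", "neu", "neu", "neu", "sad"]
y_pred = ["ang", "neu", "neu", "neu", "sad", "sad"]
uar = unweighted_average_recall(y_true, y_pred)
print(round(uar, 4))  # → 0.7222 (mean of 1/2, 2/3, and 1/1)
```

An "absolute" improvement of 3.3% then simply means the UAR percentage of the hybrid-mode condition exceeds that of the uncompressed condition by 3.3 percentage points.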


Footnotes
1
We upsampled emoDB from 16 kHz to 20 kHz. Hence, the bitrate also increased from 256 kbit/s to 320 kbit/s. This does not add missing information in the high frequency range, but it forces Opus to use the hybrid mode, as this mode is only available for super-wideband and fullband signals.
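The resampling step described in this footnote can be sketched as follows. Upsampling from 16 kHz to 20 kHz is a rational rate change by 5/4, and for 16-bit mono PCM the bitrate scales proportionally (16,000 × 16 = 256 kbit/s → 20,000 × 16 = 320 kbit/s). This is a minimal illustration assuming NumPy and SciPy, not the authors' actual preprocessing code; the test signal is a hypothetical stand-in for an emoDB utterance.

```python
import numpy as np
from scipy.signal import resample_poly

# Hypothetical one-second 16 kHz mono signal standing in for an emoDB utterance.
fs_in, fs_out = 16_000, 20_000
x = np.sin(2 * np.pi * 440 * np.arange(fs_in) / fs_in)

# Rational resampling by 20000/16000 = 5/4; polyphase filtering interpolates
# but adds no new content above the original 8 kHz Nyquist frequency.
y = resample_poly(x, up=5, down=4)

# For 16-bit mono PCM the bitrate scales with the sampling rate:
bitrate_in = fs_in * 16    # 256 kbit/s
bitrate_out = fs_out * 16  # 320 kbit/s
print(len(y), bitrate_in, bitrate_out)  # → 20000 256000 320000
```

The higher nominal rate is what pushes Opus past the wideband limit and into its super-wideband/fullband hybrid (SILK + CELT) operating range.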
 
References
1.
Albahri, A., Lech, M., Cheng, E.: Effect of speech compression on the automatic recognition of emotions. IJSPS 4(1), 55–61 (2016)
2.
Biundo, S., Wendemuth, A.: Companion-technology for cognitive technical systems. KI - Künstliche Intelligenz 30(1), 71–75 (2016)
3.
Brandenburg, K.: MP3 and AAC explained. In: 17th AES International Conference: High-Quality Audio Coding, Florence, Italy, September 1999
4.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech. In: Proceedings of the INTERSPEECH-2005, pp. 1517–1520, Lisbon, Portugal (2005)
5.
Byrne, C., Foulkes, P.: The ‘mobile phone effect’ on vowel formants. Int. J. Speech Lang. Law 11(1), 83–102 (2004)
6.
Dhall, A., Goecke, R., Gedeon, T., Sebe, N.: Emotion recognition in the wild. J. Multimodal User Interfaces 10, 95–97 (2016)
7.
Engberg, I.S., Hansen, A.V.: Documentation of the Danish Emotional Speech database (DES). Tech. rep., Aalborg University, Denmark (1996)
8.
Eyben, F., Wöllmer, M., Schuller, B.: openSMILE - the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the ACM MM-2010, Firenze, Italy (2010)
9.
García, N., Vásquez-Correa, J.C., Arias-Londoño, J.D., Várgas-Bonilla, J.F., Orozco-Arroyave, J.R.: Automatic emotion recognition in compressed speech using acoustic and non-linear features. In: Proceedings of STSIVA 2016, pp. 1–7 (2015)
10.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
11.
Hoene, C., Valin, J.M., Vos, K., Skoglund, J.: Summary of Opus listening test results draft-valin-codec-results-03. Internet-draft, IETF (2013)
12.
IBM Corporation and Microsoft Corporation: Multimedia programming interface and data specifications 1.0. Tech. rep., August 1991
16.
Jokisch, O., Maruschke, M., Meszaros, M., Iaroshenko, V.: Audio and speech quality survey of the Opus codec in web real-time communication. In: Elektronische Sprachsignalverarbeitung 2016, vol. 81, Leipzig, Germany, pp. 254–262 (2016)
17.
Lotz, A.F., Siegert, I., Maruschke, M., Wendemuth, A.: Audio compression and its impact on emotion recognition in affective computing. In: Elektronische Sprachsignalverarbeitung 2017, vol. 86, Saarbrücken, Germany, pp. 1–8 (2017)
18.
Paulsen, S.: QoS/QoE-Modelle für den Dienst Voice over IP (VoIP). Ph.D. thesis, Universität Hamburg (2015)
19.
Pfister, T., Robinson, P.: Speech emotion classification and public speaking skill assessment. In: Salah, A.A., Gevers, T., Sebe, N., Vinciarelli, A. (eds.) HBU 2010. LNCS, vol. 6219, pp. 151–162. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14715-9_15
20.
Rämö, A., Toukomaa, H.: Voice quality characterization of IETF Opus codec. In: Proceedings of the INTERSPEECH-2011, pp. 2541–2544, Florence, Italy (2011)
21.
Schuller, B., Vlasenko, B., Eyben, F., Rigoll, G., Wendemuth, A.: Acoustic emotion recognition: a benchmark comparison of performances. In: Proceedings of the IEEE ASRU-2009, Merano, Italy, pp. 552–557 (2009)
22.
Siegert, I., Lotz, A.F., Duong, L., Wendemuth, A.: Measuring the impact of audio compression on the spectral quality of speech data. In: Elektronische Sprachsignalverarbeitung 2016, vol. 81, pp. 229–236. Leipzig, Germany (2016)
23.
Siegert, I., Lotz, A.F., Maruschke, M., Jokisch, O., Wendemuth, A.: Emotion intelligibility within codec-compressed and reduced bandwidth speech. In: ITG-Fb. 267: Speech Communication: 12. ITG-Fachtagung Sprachkommunikation, 5–7 October 2016, Paderborn, pp. 215–219. VDE Verlag (2016)
24.
Steininger, S., Schiel, F., Dioubina, O., Raubold, S.: Development of user-state conventions for the multimodal corpus in SmartKom. In: Workshop on Multimodal Resources and Multimodal Systems Evaluation, Las Palmas, pp. 33–37 (2002)
25.
Tickle, A., Raghu, S., Elshaw, M.: Emotional recognition from the speech signal for a virtual education agent. J. Phys.: Conf. Ser., vol. 450, p. 012053 (2013)
27.
Valin, J.M., Maxwell, G., Terriberry, T.B., Vos, K.: The Opus codec. In: 135th AES International Convention, New York, USA, October 2013
28.
Ververidis, D., Kotropoulos, C.: Emotional speech recognition: resources, features, and methods. Speech Commun. 48, 1162–1181 (2006)
29.
Vásquez-Correa, J.C., García, N., Vargas-Bonilla, J.F., Orozco-Arroyave, J.R., Arias-Londoño, J.D., Quintero, M.O.L.: Evaluation of wavelet measures on automatic detection of emotion in noisy and telephony speech signals. In: International Carnahan Conference on Security Technology, pp. 1–6 (2014)
30.
Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 31, 39–58 (2009)
31.
Zhang, Z., Weninger, F., Wöllmer, M., Schuller, B.: Unsupervised learning in cross-corpus acoustic emotion recognition. In: Proceedings of the IEEE ASRU-2011, Waikoloa, USA, pp. 523–528 (2011)
Metadata
Title
Improving Speech-Based Emotion Recognition by Using Psychoacoustic Modeling and Analysis-by-Synthesis
Authors
Ingo Siegert
Alicia Flores Lotz
Olga Egorow
Andreas Wendemuth
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-66429-3_44