
21.02.2017

Novel Sub-band Spectral Centroid Weighted Wavelet Packet Features with Importance-Weighted Support Vector Machines for Robust Speech Emotion Recognition

Authors: Yongming Huang, Wu Ao, Guobao Zhang

Published in: Wireless Personal Communications | Issue 3/2017

Abstract

In this paper, we propose novel sub-band spectral centroid weighted wavelet packet cepstral coefficients (W-WPCC) for robust speech emotion recognition. The wavelet packet transform (WPT), an effective tool for non-stationary signal analysis, is applied to speech using a wavelet packet filterbank structure based on human auditory perception. For each sub-band, the spectral centroid, which has been shown to be noise-robust, is calculated. On this basis, the W-WPCC feature is computed by combining the sub-band energies with the sub-band spectral centroids via a weighting scheme, yielding noise-robust acoustic features. An importance-weighted support vector machine (IW-SVM) is proposed to improve the robustness of the classifier to noise, where the importance weights compensate for the covariate shift between the test and training datasets. Experimental results show that the proposed W-WPCC feature performs comparably with conventional features in clean speech environments while demonstrating better noise-robustness in noisy environments, and that the IW-SVM improves robustness to white Gaussian noise in speech emotion recognition compared with conventional classifiers.
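The abstract compresses the feature pipeline into a few sentences; a minimal Python sketch (using NumPy, SciPy, and PyWavelets) may make the sequence of steps concrete. The db4 wavelet, the uniform depth-4 packet tree (a stand-in for the perceptually designed filterbank), the bin-index spectral centroid, and the blending formula below are all illustrative assumptions, not the authors' exact design:

```python
import numpy as np
import pywt
from scipy.fft import dct

def w_wpcc(frame, wavelet="db4", level=4, lam=0.5, n_ceps=13):
    """Illustrative W-WPCC-style feature for one windowed speech frame.

    Hypothetical weighting: each sub-band energy is blended with a term
    derived from that sub-band's spectral centroid. The paper's
    perceptual tree and exact weighting scheme may differ.
    """
    # Full wavelet packet decomposition of the frame into 2**level sub-bands.
    wp = pywt.WaveletPacket(frame, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order="freq")  # sub-bands, low to high

    energies, centroids = [], []
    for node in nodes:
        coeffs = np.asarray(node.data)
        energies.append(np.sum(coeffs ** 2))
        # Spectral centroid of the sub-band's power spectrum (bin index units).
        spec = np.abs(np.fft.rfft(coeffs)) ** 2
        bins = np.arange(spec.size)
        centroids.append(np.sum(bins * spec) / (np.sum(spec) + 1e-12))

    E = np.asarray(energies)
    C = np.asarray(centroids)
    # Blend sub-band energies with centroid-derived weights (assumed form).
    weighted = (1.0 - lam) * E + lam * (C / (C.max() + 1e-12)) * E
    # Log-compress and decorrelate with a DCT, as for cepstral features.
    return dct(np.log(weighted + 1e-12), norm="ortho")[:n_ceps]
```

For a 512-sample windowed frame this returns a 13-dimensional cepstral-style vector; frame-level vectors would then typically be aggregated into utterance-level statistics before classification.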
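The IW-SVM side reduces to two steps: estimate importance weights w(x) ≈ p_test(x)/p_train(x), then train an SVM whose per-sample loss is rescaled by those weights. The sketch below is a stand-in under stated assumptions: it approximates the density ratio with a probabilistic classifier that discriminates training from test inputs, rather than the dedicated density-ratio estimator the authors build on, and it relies on scikit-learn's sample_weight support in SVC:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def importance_weights(X_train, X_test):
    """Crude estimate of p_test(x) / p_train(x) for each training sample.

    A classifier separating train from test inputs gives P(test | x);
    Bayes' rule turns that into a density ratio. This is a stand-in for
    the paper's density-ratio estimator.
    """
    X = np.vstack([X_train, X_test])
    y = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(X_train)[:, 1]  # P(test | x) on training inputs
    ratio = (p / (1.0 - p)) * (len(X_train) / len(X_test))
    return np.clip(ratio, 0.0, 10.0)      # clip extremes for stability

def fit_iw_svm(X_train, y_train, X_test):
    """Weighted SVM: importance weights rescale each sample's loss."""
    w = importance_weights(X_train, X_test)
    svm = SVC(kernel="rbf", C=1.0)
    svm.fit(X_train, y_train, sample_weight=w)
    return svm
```

The weights up-weight training samples that resemble test-domain (e.g., noisy) inputs, which is how the classifier compensates for covariate shift between the two datasets.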

Metadata
Title
Novel Sub-band Spectral Centroid Weighted Wavelet Packet Features with Importance-Weighted Support Vector Machines for Robust Speech Emotion Recognition
Authors
Yongming Huang
Wu Ao
Guobao Zhang
Publication date
21.02.2017
Publisher
Springer US
Published in
Wireless Personal Communications / Issue 3/2017
Print ISSN: 0929-6212
Electronic ISSN: 1572-834X
DOI
https://doi.org/10.1007/s11277-017-4052-3
