2017 | Original Paper | Book Chapter

Recognizing Emotionally Coloured Dialogue Speech Using Speaker-Adapted DNN-CNN Bottleneck Features

Authored by: Kohei Mukaihara, Sakriani Sakti, Satoshi Nakamura

Published in: Speech and Computer

Publisher: Springer International Publishing

Abstract

Emotionally coloured speech recognition is a key technology for achieving human-like spoken dialogue systems. However, despite rapid progress in automatic speech recognition (ASR) and emotion research, much less work has examined ASR systems that recognize the verbal content of emotionally coloured speech. Existing approaches to emotional speech recognition mostly adapt standard ASR models to incorporate information about prosody and emotion. In this study, instead of adapting a model to handle emotional speech, we focus on feature transformation methods that resolve the mismatch between standard acoustic features and emotionally coloured speech, thereby improving ASR performance. In this way, we can train the model on emotionally coloured speech without any explicit emotional annotation. We investigate two deep bottleneck network structures: deep neural networks (DNNs) and convolutional neural networks (CNNs). We hypothesize that the trained bottleneck features may be able to extract essential information that represents the verbal content while abstracting away from superficial differences caused by emotional variation. We also try various combinations of these two bottleneck features with feature-space speaker adaptation. Experiments on Japanese and English emotional speech data reveal that both varieties of bottleneck features and feature-space speaker adaptation successfully improve emotional speech recognition performance.
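
To make the feature-transformation idea concrete, the following is a minimal sketch of a DNN bottleneck feature extractor: a narrow hidden layer is trained inside a frame-level phone-state classifier, and its activations are then used as the transformed features for the ASR back-end. The abstract does not specify the paper's toolkit or configuration, so the framework (PyTorch), layer sizes, input splicing, and output-state count below are all illustrative assumptions.

  import torch
  import torch.nn as nn

  class BottleneckDNN(nn.Module):
      """DNN with a narrow bottleneck layer (all dimensions are assumptions)."""
      def __init__(self, input_dim=440, hidden_dim=1024,
                   bottleneck_dim=40, num_states=3000):
          super().__init__()
          # Hidden stack ending in the narrow linear bottleneck layer.
          self.encoder = nn.Sequential(
              nn.Linear(input_dim, hidden_dim), nn.Sigmoid(),
              nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
              nn.Linear(hidden_dim, bottleneck_dim),
          )
          # Layers above the bottleneck; only needed during training.
          self.classifier = nn.Sequential(
              nn.Sigmoid(),
              nn.Linear(bottleneck_dim, hidden_dim), nn.Sigmoid(),
              nn.Linear(hidden_dim, num_states),
          )

      def forward(self, x):
          # Training objective: classify each frame into a phone state.
          return self.classifier(self.encoder(x))

      def extract(self, x):
          # After training, keep only the encoder: the bottleneck
          # activations become the transformed features for ASR.
          with torch.no_grad():
              return self.encoder(x)

  model = BottleneckDNN()
  frames = torch.randn(8, 440)        # 8 spliced input frames (dummy data)
  features = model.extract(frames)    # shape (8, 40): bottleneck features

A CNN variant would replace the first fully connected layers with convolutions over the time-frequency input, and feature-space speaker adaptation (for example an fMLLR-style per-speaker affine transform, a common choice, though the abstract does not name the exact method) would be applied to the input frames before extraction.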

Footnotes
1
This framework was originally called a time-delay neural network [22] in speech recognition.
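
(A time-delay layer amounts to a 1-D convolution over the time axis; a minimal sketch in PyTorch, with arbitrary, assumed dimensions:

  import torch
  import torch.nn as nn

  # A time-delay layer is a 1-D convolution across frames: each output
  # frame depends on a fixed window of neighbouring input frames.
  tdnn = nn.Conv1d(in_channels=40, out_channels=256, kernel_size=5)
  x = torch.randn(1, 40, 100)   # (batch, feature dim, 100 frames)
  y = tdnn(x)                   # each output sees a 5-frame context window
)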
 
References
1.
Arimoto, Y., Kawatsu, H., Ohno, S., Iida, H.: Naturalistic emotional speech collection paradigm with online game and its psychological and acoustical assessment. Acoust. Sci. Technol. 33(6), 359–369 (2012)
2.
Athanaselis, T., Bakamidis, S., Dologlou, I., Cowie, R., Douglas-Cowie, E., Cox, C.: ASR for emotional speech: clarifying the issues and enhancing performance. Neural Netw. 18(4), 437–444 (2005)
3.
Athanaselis, T., Bakamidis, S., Dologlou, I.: Recognizing verbal content of emotionally coloured speech. In: Proceedings of EUSIPCO, Florence, Italy (2006)
4.
Gales, M.: Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 12(2), 75–98 (1998)
5.
Gales, M.: Semi-tied covariance matrices for hidden Markov models. IEEE Trans. Speech Audio Process. 7(3), 272–281 (1999)
6.
Gopinath, R.: Maximum likelihood modeling with Gaussian distributions for classification. In: Proceedings of ICASSP, pp. 661–664 (1998)
7.
Maekawa, K., Koiso, H., Furui, S., Isahara, H.: Spontaneous speech corpus of Japanese. In: Proceedings of LREC, Athens, Greece, pp. 947–952 (2000)
8.
McKeown, G., Valstar, M., Cowie, R., Pantic, M., Schröder, M.: The SEMAINE database: annotated multimodal records of emotionally coloured conversations between a person and a limited agent. IEEE Trans. Affect. Comput. 3(1), 5–17 (2012)
10.
Murray, I.R., Arnott, J.L.: Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. J. Acoust. Soc. Am. 93(2), 1097–1108 (1993)
11.
Paul, D., Baker, J.: The design for the Wall Street Journal-based CSR corpus. In: Proceedings of DARPA Speech and Language Workshop, San Mateo, USA (1992)
12.
13.
Plutchik, R.: A general psychoevolutionary theory of emotion. In: Theories of Emotion. Academic Press (1980)
14.
Polzin, T.S., Waibel, A.: Pronunciation variations in emotional speech. In: Proceedings of ESCA, pp. 103–108 (1998)
15.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K.: The Kaldi speech recognition toolkit. In: Proceedings of ASRU, Hawaii, USA (2011)
16.
Schuller, B., Stadermann, J., Rigoll, G.: Affect-robust speech recognition by dynamic emotional adaptation. In: Proceedings of Speech Prosody (2006)
17.
Schuller, B., Steidl, S., Batliner, A.: The INTERSPEECH 2009 emotion challenge. In: Proceedings of INTERSPEECH, Brighton, United Kingdom, pp. 312–315 (2009)
18.
Schuller, B., Steidl, S., Burkhardt, F., Devillers, L., Müller, C., Narayanan, S.: The INTERSPEECH 2010 paralinguistic challenge. In: Proceedings of INTERSPEECH, Makuhari, Japan, pp. 2794–2797 (2010)
19.
Schuller, B., Valstar, M., Eyben, F., McKeown, G., Cowie, R., Pantic, M.: AVEC 2011 - the first international audio/visual emotion challenge. In: Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII), Memphis, Tennessee, pp. 415–424 (2011)
20.
Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proceedings of ICSLP, Denver, USA, pp. 901–904 (2002)
21.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)
22.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37(3), 328–339 (1989)
23.
Williams, C., Stevens, K.: Emotion and speech: some acoustical correlates. J. Acoust. Soc. Am. 52, 1238–1250 (1972)
Metadata
Title
Recognizing Emotionally Coloured Dialogue Speech Using Speaker-Adapted DNN-CNN Bottleneck Features
Authored by
Kohei Mukaihara
Sakriani Sakti
Satoshi Nakamura
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-66429-3_63