2017 | Original Paper | Book Chapter

Recognizing Emotionally Coloured Dialogue Speech Using Speaker-Adapted DNN-CNN Bottleneck Features

Authored by: Kohei Mukaihara, Sakriani Sakti, Satoshi Nakamura

Published in: Speech and Computer

Publisher: Springer International Publishing

Abstract

Emotionally coloured speech recognition is a key technology for achieving human-like spoken dialogue systems. However, despite rapid progress in automatic speech recognition (ASR) and emotion research, much less work has examined ASR systems that recognize the verbal content of emotionally coloured speech. Existing approaches to emotional speech recognition mostly adapt standard ASR models to incorporate information about prosody and emotion. In this study, instead of adapting a model to handle emotional speech, we focus on feature transformation methods that resolve the mismatch between standard acoustic features and emotionally coloured speech, thereby improving ASR performance. In this way, we can train the model on emotionally coloured speech without any explicit emotional annotation. We investigate two deep bottleneck network structures: deep neural networks (DNNs) and convolutional neural networks (CNNs). We hypothesize that the trained bottleneck features may be able to extract essential information that represents the verbal content while abstracting away from superficial differences caused by emotional variation. We also try various combinations of these two bottleneck features with feature-space speaker adaptation. Experiments on Japanese and English emotional speech data reveal that both varieties of bottleneck features and feature-space speaker adaptation successfully improve emotional speech recognition performance.
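
To make the feature-transformation idea concrete, the following is a minimal sketch of a DNN bottleneck feature extractor: a narrow hidden layer is trained inside a frame-level phone-state classifier, and its activations are then used as the transformed features for the ASR back-end. The abstract does not specify the paper's toolkit or configuration, so the framework (PyTorch), layer sizes, input splicing, and output-state count below are all illustrative assumptions.

  import torch
  import torch.nn as nn

  class BottleneckDNN(nn.Module):
      """DNN with a narrow bottleneck layer (all dimensions are assumptions)."""
      def __init__(self, input_dim=440, hidden_dim=1024,
                   bottleneck_dim=40, num_states=3000):
          super().__init__()
          # Hidden stack ending in the narrow linear bottleneck layer.
          self.encoder = nn.Sequential(
              nn.Linear(input_dim, hidden_dim), nn.Sigmoid(),
              nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
              nn.Linear(hidden_dim, bottleneck_dim),
          )
          # Layers above the bottleneck; only needed during training.
          self.classifier = nn.Sequential(
              nn.Sigmoid(),
              nn.Linear(bottleneck_dim, hidden_dim), nn.Sigmoid(),
              nn.Linear(hidden_dim, num_states),
          )

      def forward(self, x):
          # Training objective: classify each frame into a phone state.
          return self.classifier(self.encoder(x))

      def extract(self, x):
          # After training, keep only the encoder: the bottleneck
          # activations become the transformed features for ASR.
          with torch.no_grad():
              return self.encoder(x)

  model = BottleneckDNN()
  frames = torch.randn(8, 440)        # 8 spliced input frames (dummy data)
  features = model.extract(frames)    # shape (8, 40): bottleneck features

A CNN variant would replace the first fully connected layers with convolutions over the time-frequency input, and feature-space speaker adaptation (for example an fMLLR-style per-speaker affine transform, a common choice, though the abstract does not name the exact method) would be applied to the input frames before extraction.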

Footnotes
1
This framework was originally called a time-delay neural network [22] in speech recognition.
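
(A time-delay layer amounts to a 1-D convolution over the time axis; a minimal sketch in PyTorch, with arbitrary, assumed dimensions:

  import torch
  import torch.nn as nn

  # A time-delay layer is a 1-D convolution across frames: each output
  # frame depends on a fixed window of neighbouring input frames.
  tdnn = nn.Conv1d(in_channels=40, out_channels=256, kernel_size=5)
  x = torch.randn(1, 40, 100)   # (batch, feature dim, 100 frames)
  y = tdnn(x)                   # each output sees a 5-frame context window
)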
 
References
1.
Arimoto, Y., Kawatsu, H., Ohno, S., Iida, H.: Naturalistic emotional speech collection paradigm with online game and its psychological and acoustical assessment. Acoust. Sci. Technol. 33(6), 359–369 (2012)
2.
Athanaselis, T., Bakamidis, S., Dologlou, I., Cowie, R., Douglas-Cowie, E., Cox, C.: ASR for emotional speech: clarifying the issues and enhancing performance. Neural Netw. 18(4), 437–444 (2005)
3.
Athanaselis, T., Bakamidis, S., Dologlou, I.: Recognizing verbal content of emotionally coloured speech. In: Proceedings of EUSIPCO, Florence, Italy (2006)
4.
Gales, M.: Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 12(2), 75–98 (1998)
5.
Gales, M.: Semi-tied covariance matrices for hidden Markov models. IEEE Trans. Speech Audio Process. 7(3), 272–281 (1999)
6.
Gopinath, R.: Maximum likelihood modeling with Gaussian distributions for classification. In: Proceedings of ICASSP, pp. 661–664 (1998)
7.
Maekawa, K., Koiso, H., Furui, S., Isahara, H.: Spontaneous speech corpus of Japanese. In: Proceedings of LREC, Athens, Greece, pp. 947–952 (2000)
8.
McKeown, G., Valstar, M., Cowie, R., Pantic, M., Schröder, M.: The SEMAINE database: annotated multimodal records of emotionally coloured conversations between a person and a limited agent. IEEE Trans. Affect. Comput. 3(1), 5–17 (2012)
10.
Murray, I.R., Arnott, J.L.: Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. J. Acoust. Soc. Am. 93(2), 1097–1108 (1993)
11.
Paul, D., Baker, J.: The design for the Wall Street Journal-based CSR corpus. In: Proceedings of DARPA Speech and Language Workshop, San Mateo, USA (1992)
12.
13.
Plutchik, R.: A general psychoevolutionary theory of emotion. In: Theories of Emotion. Academic Press (1980)
14.
Polzin, T.S., Waibel, A.: Pronunciation variations in emotional speech. In: Proceedings of ESCA, pp. 103–108 (1998)
15.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K.: The Kaldi speech recognition toolkit. In: Proceedings of ASRU, Hawaii, USA (2011)
16.
Schuller, B., Stadermann, J., Rigoll, G.: Affect-robust speech recognition by dynamic emotional adaptation. In: Proceedings of Speech Prosody (2006)
17.
Schuller, B., Steidl, S., Batliner, A.: The INTERSPEECH 2009 emotion challenge. In: Proceedings of INTERSPEECH, Brighton, United Kingdom, pp. 312–315 (2009)
18.
Schuller, B., Steidl, S., Burkhardt, F., Devillers, L., Müller, C., Narayanan, S.: The INTERSPEECH 2010 paralinguistic challenge. In: Proceedings of INTERSPEECH, Makuhari, Japan, pp. 2794–2797 (2010)
19.
Schuller, B., Valstar, M., Eyben, F., McKeown, G., Cowie, R., Pantic, M.: AVEC 2011 - the first international audio/visual emotion challenge. In: Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII), Memphis, Tennessee, pp. 415–424 (2011)
20.
Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proceedings of ICSLP, Denver, USA, pp. 901–904 (2002)
21.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)
22.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37(3), 328–339 (1989)
23.
Williams, C., Stevens, K.: Emotion and speech: some acoustical correlates. J. Acoust. Soc. Am. 52, 1238–1250 (1972)
Metadata
Title
Recognizing Emotionally Coloured Dialogue Speech Using Speaker-Adapted DNN-CNN Bottleneck Features
Authored by
Kohei Mukaihara
Sakriani Sakti
Satoshi Nakamura
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-66429-3_63