
20.09.2021

Handling emotional speech: a prosody based data augmentation technique for improving neutral speech trained ASR systems

Authors: Pavan Raju Kammili, B. H. V. S. Ramakrishnam Raju, A. Sri Krishna

Published in: International Journal of Speech Technology | Issue 1/2022


Abstract

In this paper, the effect of emotional speech on the performance of ASR systems trained on neutral speech is studied. Prosody-modification-based data augmentation is explored to compensate for the ASR performance degradation caused by emotional speech. The primary motive is to develop a Telugu ASR system that is least affected by these emotion-based intrinsic speaker-related acoustic variations. The two factors contributing to intrinsic speaker-related variability on which this research focuses are the fundamental frequency [\((F_0)\), or pitch] and the speaking rate. To simulate the ASR task, we trained our ASR system on neutral speech and tested it on data from both emotional and neutral speech. Compared with the performance metrics for neutral speech at the testing stage, those for emotional speech are severely degraded. This degradation is attributed to the differences in the prosody and speaking-rate parameters of neutral and emotional speech. To overcome it, the prosody and speaking-rate parameters are varied to create augmented versions of the training data. The original and augmented versions of the training data are pooled together and the system is re-trained in order to capture greater emotion-specific variation. For the Telugu ASR experiments, we used the Microsoft speech corpus for Indian languages (MSC-IL) for training on neutral speech and the Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus (IITKGP-SESC) for evaluating emotional speech. The basic emotions of anger, happiness, and sadness are considered for evaluation along with neutral speech.
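The augmentation pipeline the abstract describes, creating pitch- and speaking-rate-modified copies of the neutral training data and pooling them with the originals before re-training, can be sketched as below. This is a minimal illustration, not the authors' implementation: the paper builds on epoch-based prosody modification (Prasanna et al., 2010), while this sketch substitutes librosa's generic pitch-shift and time-stretch; the directory names and modification factors are assumed for the example.

```python
# A minimal sketch of prosody-based data augmentation, assuming librosa and
# soundfile are installed. The semitone shifts and rate factors below are
# illustrative placeholders, not the values used in the paper.
import librosa
import soundfile as sf
from pathlib import Path

PITCH_STEPS = [-2, 2]      # semitone shifts simulating F0 (pitch) variation
SPEED_RATES = [0.9, 1.1]   # speaking-rate factors (<1 slower, >1 faster)

def augment_utterance(wav_path: Path, out_dir: Path) -> None:
    """Write pitch- and speaking-rate-modified copies of one utterance."""
    y, sr = librosa.load(wav_path, sr=None)  # keep the original sampling rate

    for steps in PITCH_STEPS:                # F0 modification
        y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
        sf.write(out_dir / f"{wav_path.stem}_pitch{steps:+d}.wav", y_pitch, sr)

    for rate in SPEED_RATES:                 # speaking-rate modification
        y_rate = librosa.effects.time_stretch(y, rate=rate)
        sf.write(out_dir / f"{wav_path.stem}_rate{rate}.wav", y_rate, sr)

if __name__ == "__main__":
    # Pooling: originals and augmented copies end up in one training set,
    # on which the acoustic model is then re-trained.
    out = Path("train_augmented")
    out.mkdir(exist_ok=True)
    for wav in Path("train_neutral").glob("*.wav"):
        augment_utterance(wav, out)
```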


References
Dhananjaya, N., & Yegnanarayana, B. (2010). Voiced/nonvoiced detection based on robustness of voiced epochs. IEEE Signal Processing Letters, 17(3), 273–276.
Gangamohan, P., Mittal, V., & Yegnanarayana, B. (2012). Relative importance of different components of speech contributing to perception of emotion. In Speech Prosody.
Gerosa, M., Giuliani, D., Narayanan, S., & Potamianos, A. (2009). A review of ASR technologies for children's speech. In Proceedings of the 2nd workshop on child, computer and interaction (pp. 1–8).
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-R., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82–97.
Hsiao, R., Ma, J., Hartmann, W., Karafiát, M., Grézl, F., Burget, L., Szöke, I., Černocký, J. H., Watanabe, S., Chen, Z., et al. (2015). Robust speech recognition in unknown reverberant and noisy conditions. In Workshop on automatic speech recognition and understanding (ASRU) (pp. 533–538). IEEE.
Ko, T., Peddinti, V., Povey, D., Seltzer, M. L., & Khudanpur, S. (2017). A study on data augmentation of reverberant speech for robust speech recognition. In International conference on acoustics, speech and signal processing (ICASSP) (pp. 5220–5224). IEEE.
Koolagudi, S. G., Maity, S., Kumar, V. A., Chakrabarti, S., & Rao, K. S. (2009). IITKGP-SESC: Speech database for emotion analysis. In International conference on contemporary computing (pp. 485–492). Springer.
Lippmann, R., Martin, E., & Paul, D. (1987). Multi-style training for robust isolated-word speech recognition. In International conference on acoustics, speech, and signal processing (ICASSP) (Vol. 12, pp. 705–708). IEEE.
Murty, K. S. R., & Yegnanarayana, B. (2008). Epoch extraction from speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 16(8), 1602–1613.
Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). LibriSpeech: An ASR corpus based on public domain audio books. In International conference on acoustics, speech and signal processing (ICASSP) (pp. 5206–5210). IEEE.
Peddinti, V., Chen, G., Manohar, V., Ko, T., Povey, D., & Khudanpur, S. (2015). JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs. In Workshop on automatic speech recognition and understanding (ASRU) (pp. 539–546).
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society.
Prasanna, S., Govind, D., Rao, K. S., & Yegnanarayana, B. (2010). Fast prosody modification using instants of significant excitation. In Proceedings of the fifth international conference on Speech Prosody.
Raju, V. V., Gangamohan, P., Gangashetty, S. V., & Vuppala, A. K. (2016). Application of prosody modification for speech recognition in different emotion conditions. In IEEE Region 10 Conference (TENCON) (pp. 951–954). IEEE.
Raju, V. V., Vydana, H. K., Gangashetty, S. V., & Vuppala, A. K. (2017). Importance of non-uniform prosody modification for speech recognition in emotion conditions. In 2017 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC) (pp. 573–576). IEEE.
Rao, K. S., Prasanna, S. M., & Yegnanarayana, B. (2007). Determination of instants of significant excitation in speech using Hilbert envelope and group delay function. IEEE Signal Processing Letters, 14(10), 762–765.
Rousseau, A., Deléglise, P., & Estève, Y. (2012). TED-LIUM: An automatic speech recognition dedicated corpus. In LREC (pp. 125–129).
Russell, M., & D'Arcy, S. (2007). Challenges for computer recognition of children's speech. In Workshop on speech and language technology in education.
Schalkwyk, J., Beeferman, D., Beaufays, F., Byrne, B., Chelba, C., Cohen, M., Kamvar, M., & Strope, B. (2010). "Your word is my command": Google search by voice: A case study. In Advances in speech recognition (pp. 61–90). Springer.
Shahnawazuddin, S., Adiga, N., Kathania, H. K., & Sai, B. T. (2020). Creating speaker independent ASR system through prosody modification based data augmentation. Pattern Recognition Letters, 131, 213–218.
Smits, R., & Yegnanarayana, B. (1995). Determination of instants of significant excitation in speech using group delay function. IEEE Transactions on Speech and Audio Processing, 3(5), 325–333.
Srivastava, B. M. L., Sitaram, S., Mehta, R. K., Mohan, K. D., Matani, P., Satpal, S., Bali, K., Srikanth, R., & Nayak, N. (2018). Interspeech 2018 low resource automatic speech recognition challenge for Indian languages. In SLTU (pp. 11–14).
Vegesna, V. V. R., Gurugubelli, K., & Vuppala, A. K. (2018). Prosody modification for speech recognition in emotionally mismatched conditions. International Journal of Speech Technology, 21(3), 521–532.
Vegesna, V. V. R., Gurugubelli, K., & Vuppala, A. K. (2019). Application of emotion recognition and modification for emotional Telugu speech recognition. Mobile Networks and Applications, 24(1), 193–201.
Yu, D., & Deng, L. (2016). Automatic speech recognition. Springer.
Metadata
Title
Handling emotional speech: a prosody based data augmentation technique for improving neutral speech trained ASR systems
Authors
Pavan Raju Kammili
B. H. V. S. Ramakrishnam Raju
A. Sri Krishna
Publication date
20.09.2021
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2022
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-021-09897-x
