
Prosodic Mapping Using Neural Networks for Emotion Conversion in Hindi Language

Published in: Circuits, Systems, and Signal Processing

Abstract

Emotion comprises several components, such as physiological changes in the body, subjective feelings, and expressive behaviours. In speech, these changes are observed mainly in prosodic parameters such as pitch, duration, and energy. Hindi is largely syllabic in nature, and syllables are the most suitable basic units for the analysis and synthesis of speech; therefore, a vowel onset point detection method is used to segment each utterance into syllable-like units. In this work, prosodic parameters are modified using instants of significant excitation (epochs), which are detected with a zero-frequency filtering-based method. In voiced speech, epoch locations correspond to instants of glottal closure; in unvoiced regions, they correspond to random instants of significant excitation. Anger, happiness, and sadness are considered as target emotions in the proposed emotion conversion framework. Feedforward neural network models are explored for mapping the prosodic parameters between neutral and target emotions. The predicted prosodic parameters of the target emotion are incorporated into neutral speech at the syllable level to produce the desired emotional speech. After the emotion-specific prosody is incorporated, the perceptual quality of the transformed speech is evaluated through subjective listening tests.
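As a rough illustration of the zero-frequency-filtering idea mentioned in the abstract, the sketch below finds epoch-like instants in a synthetic impulse-train signal: the signal is passed twice through a resonator with both poles at z = 1, and the slowly growing trend is removed by repeated local-mean subtraction. The window length, number of trend-removal passes, and test signal are assumptions for this sketch, not the paper's settings.

```python
import numpy as np

def zero_freq_resonator(x):
    """One pass through y[n] = x[n] + 2*y[n-1] - y[n-2] (double pole at z = 1)."""
    y = np.zeros(len(x))
    y1 = y2 = 0.0
    for n in range(len(x)):
        y[n] = x[n] + 2.0 * y1 - y2
        y2, y1 = y1, y[n]
    return y

def zff_epochs(s, fs, win_ms=12.0, passes=3):
    """Return candidate epoch locations (sample indices) via zero-frequency filtering."""
    x = np.diff(s, prepend=s[0])            # difference to remove DC / low-frequency bias
    y = zero_freq_resonator(zero_freq_resonator(x))
    w = int(win_ms * 1e-3 * fs) | 1         # odd window, roughly one pitch period
    for _ in range(passes):                 # repeated local-mean (trend) removal
        y = y - np.convolve(y, np.ones(w) / w, mode="same")
    # Epochs: negative-to-positive zero crossings of the trend-removed output.
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0] + 1

# Synthetic "voiced" excitation: impulses every 100 samples (80 Hz at fs = 8 kHz).
fs, period = 8000, 100
s = np.zeros(fs)
s[period::period] = 1.0
epochs = zff_epochs(s, fs)
intervals = np.diff(epochs)
```

On this clean impulse train the detected inter-epoch intervals should cluster around the true 100-sample period; on real speech the trend-removal window is usually tied to the average pitch period of the speaker.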
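The neutral-to-target prosody mapping can be pictured as a small regression network. The sketch below trains a one-hidden-layer feedforward network (plain NumPy, full-batch gradient descent) to map per-syllable [pitch, duration, energy] vectors of neutral speech to a hypothetical "anger"-like target; the scale factors (1.3, 0.8, 1.2), network size, and training data are invented for illustration and are not the paper's corpus or learned mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-syllable prosody vectors: [pitch (Hz), duration (s), energy].
neutral = np.column_stack([
    rng.uniform(100, 200, 512),
    rng.uniform(0.1, 0.4, 512),
    rng.uniform(0.5, 1.0, 512),
])
target = neutral * np.array([1.3, 0.8, 1.2])   # assumed "anger"-like prosody shifts

# Z-score both sides; the network learns the mapping in normalised space.
mx, sx = neutral.mean(0), neutral.std(0)
my, sy = target.mean(0), target.std(0)
X, Y = (neutral - mx) / sx, (target - my) / sy

# One hidden tanh layer, linear output, full-batch gradient descent on MSE.
W1 = rng.normal(0, 0.3, (3, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.3, (16, 3)); b2 = np.zeros(3)
lr = 0.05
for _ in range(5000):
    H = np.tanh(X @ W1 + b1)
    P = H @ W2 + b2
    G = 2 * (P - Y) / len(X)            # dLoss/dP for mean-squared error
    gW2, gb2 = H.T @ G, G.sum(0)
    GA = (G @ W2.T) * (1 - H ** 2)      # back-propagate through tanh
    gW1, gb1 = X.T @ GA, GA.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

def predict(syllables):
    """Map neutral prosody vectors to predicted target-emotion prosody."""
    H = np.tanh((syllables - mx) / sx @ W1 + b1)
    return (H @ W2 + b2) * sy + my

pred = predict(np.array([[150.0, 0.25, 0.8]]))
```

In the paper's framework the predicted parameters would then drive epoch-based prosody modification of the neutral waveform; here the network simply outputs the target prosody vector for each syllable.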



Correspondence to Jainath Yadav.

Cite this article

Yadav, J., Rao, K.S. Prosodic Mapping Using Neural Networks for Emotion Conversion in Hindi Language. Circuits Syst Signal Process 35, 139–162 (2016). https://doi.org/10.1007/s00034-015-0051-3
