
Prosodic Mapping Using Neural Networks for Emotion Conversion in Hindi Language

Published in: Circuits, Systems, and Signal Processing

Abstract

Emotion comprises several components, such as physiological changes in the body, subjective feelings, and expressive behaviours. In speech, these changes are observed mainly in prosodic parameters such as pitch, duration, and energy. Hindi is largely syllabic in nature, and syllables are the most suitable basic units for the analysis and synthesis of speech; therefore, a vowel onset point detection method is used to segment each utterance into syllable-like units. In this work, prosodic parameters are modified using instants of significant excitation (epochs), which are detected with a zero-frequency filtering-based method. In voiced speech, epoch locations correspond to instants of glottal closure; in unvoiced regions, they correspond to random instants of significant excitation. Anger, happiness, and sadness are considered as target emotions in the proposed emotion conversion framework. Feedforward neural network models are explored for mapping the prosodic parameters between neutral and target emotions. The predicted prosodic parameters of the target emotion are incorporated into neutral speech at the syllable level to produce the desired emotional speech. After the emotion-specific prosody is incorporated, the perceptual quality of the transformed speech is evaluated through subjective listening tests.
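As a rough illustration of the zero-frequency-filtering idea mentioned in the abstract, the sketch below finds epoch-like instants in a synthetic impulse-train signal: the signal is passed twice through a resonator with both poles at z = 1, and the slowly growing trend is removed by repeated local-mean subtraction. The window length, number of trend-removal passes, and test signal are assumptions for this sketch, not the paper's settings.

```python
import numpy as np

def zero_freq_resonator(x):
    """One pass through y[n] = x[n] + 2*y[n-1] - y[n-2] (double pole at z = 1)."""
    y = np.zeros(len(x))
    y1 = y2 = 0.0
    for n in range(len(x)):
        y[n] = x[n] + 2.0 * y1 - y2
        y2, y1 = y1, y[n]
    return y

def zff_epochs(s, fs, win_ms=12.0, passes=3):
    """Return candidate epoch locations (sample indices) via zero-frequency filtering."""
    x = np.diff(s, prepend=s[0])            # difference to remove DC / low-frequency bias
    y = zero_freq_resonator(zero_freq_resonator(x))
    w = int(win_ms * 1e-3 * fs) | 1         # odd window, roughly one pitch period
    for _ in range(passes):                 # repeated local-mean (trend) removal
        y = y - np.convolve(y, np.ones(w) / w, mode="same")
    # Epochs: negative-to-positive zero crossings of the trend-removed output.
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0] + 1

# Synthetic "voiced" excitation: impulses every 100 samples (80 Hz at fs = 8 kHz).
fs, period = 8000, 100
s = np.zeros(fs)
s[period::period] = 1.0
epochs = zff_epochs(s, fs)
intervals = np.diff(epochs)
```

On this clean impulse train the detected inter-epoch intervals should cluster around the true 100-sample period; on real speech the trend-removal window is usually tied to the average pitch period of the speaker.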
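The neutral-to-target prosody mapping can be pictured as a small regression network. The sketch below trains a one-hidden-layer feedforward network (plain NumPy, full-batch gradient descent) to map per-syllable [pitch, duration, energy] vectors of neutral speech to a hypothetical "anger"-like target; the scale factors (1.3, 0.8, 1.2), network size, and training data are invented for illustration and are not the paper's corpus or learned mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-syllable prosody vectors: [pitch (Hz), duration (s), energy].
neutral = np.column_stack([
    rng.uniform(100, 200, 512),
    rng.uniform(0.1, 0.4, 512),
    rng.uniform(0.5, 1.0, 512),
])
target = neutral * np.array([1.3, 0.8, 1.2])   # assumed "anger"-like prosody shifts

# Z-score both sides; the network learns the mapping in normalised space.
mx, sx = neutral.mean(0), neutral.std(0)
my, sy = target.mean(0), target.std(0)
X, Y = (neutral - mx) / sx, (target - my) / sy

# One hidden tanh layer, linear output, full-batch gradient descent on MSE.
W1 = rng.normal(0, 0.3, (3, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.3, (16, 3)); b2 = np.zeros(3)
lr = 0.05
for _ in range(5000):
    H = np.tanh(X @ W1 + b1)
    P = H @ W2 + b2
    G = 2 * (P - Y) / len(X)            # dLoss/dP for mean-squared error
    gW2, gb2 = H.T @ G, G.sum(0)
    GA = (G @ W2.T) * (1 - H ** 2)      # back-propagate through tanh
    gW1, gb1 = X.T @ GA, GA.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

def predict(syllables):
    """Map neutral prosody vectors to predicted target-emotion prosody."""
    H = np.tanh((syllables - mx) / sx @ W1 + b1)
    return (H @ W2 + b2) * sy + my

pred = predict(np.array([[150.0, 0.25, 0.8]]))
```

In the paper's framework the predicted parameters would then drive epoch-based prosody modification of the neutral waveform; here the network simply outputs the target prosody vector for each syllable.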



Correspondence to Jainath Yadav.

Cite this article

Yadav, J., Rao, K.S. Prosodic Mapping Using Neural Networks for Emotion Conversion in Hindi Language. Circuits Syst Signal Process 35, 139–162 (2016). https://doi.org/10.1007/s00034-015-0051-3
