Published in: International Journal of Speech Technology 1/2013

01.03.2013

Dynamic prosody modification using zero frequency filtered signal

Authors: D. Govind, S. R. Mahadeva Prasanna

Abstract

Modifying prosody parameters such as pitch, duration and strength of excitation by a desired factor is termed prosody modification. The objective of this work is to develop a dynamic prosody modification method based on the zero frequency filtered signal (ZFFS), a byproduct of zero frequency filtering (ZFF). Existing epoch based prosody modification techniques use epochs as pitch markers and achieve the required prosody modification by interpolating the epoch interval plot. Alternatively, this work proposes a method for prosody modification by resampling the ZFFS. The existing epoch based prosody modification method is also refined to modify the prosodic parameters at every epoch level, thus providing more flexibility for prosody modification. The general framework for deriving the modified epoch locations can also be used to obtain dynamic prosody modification from the existing PSOLA and epoch based prosody modification methods. The quality of the prosody modified speech is evaluated using waveforms, spectrograms and subjective studies. The usefulness of the proposed dynamic prosody modification is demonstrated for a neutral to emotional conversion task. The subjective evaluations performed for the emotion conversion indicate the effectiveness of dynamic prosody modification over fixed prosody modification for emotion conversion. The dynamic prosody modified speech files synthesized using the proposed, epoch based and TD-PSOLA methods are available at http://www.iitg.ac.in/eee/emstlab/demos/demo5.php.
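To make the idea of resampling the ZFFS concrete, the following is a minimal Python sketch, not the authors' implementation. It follows the commonly described zero frequency filtering recipe (difference the signal, pass it through two cascaded zero-frequency resonators, remove the trend by local mean subtraction) and takes positive-going zero crossings of the ZFFS as epoch candidates. The function names, the 1.5 pitch period trend window, the three trend removal passes and the use of linear interpolation for resampling are all assumptions made for illustration. Duration handling and waveform synthesis from the modified epochs are not shown.

```python
import numpy as np


def zero_frequency_filter(speech, fs, avg_pitch_ms=10.0, trend_passes=3):
    """Return a zero frequency filtered signal (ZFFS) for `speech`.

    `avg_pitch_ms` and `trend_passes` are illustrative defaults (assumptions).
    """
    x = np.asarray(speech, dtype=np.float64)
    # Difference the signal to remove any slowly varying recording bias.
    x = np.diff(x, prepend=x[0])
    # Two cascaded zero-frequency resonators, each y[n] = 2y[n-1] - y[n-2] + x[n].
    # One resonator is a double integrator, so the cascade is four cumulative sums.
    y = x
    for _ in range(4):
        y = np.cumsum(y)
    # Remove the polynomial trend by repeatedly subtracting a local mean taken
    # over roughly 1.5 average pitch periods (odd window length, assumed value).
    half = int(round(0.75 * avg_pitch_ms * 1e-3 * fs))
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    for _ in range(trend_passes):
        y = y - np.convolve(y, kernel, mode="same")
    # Samples within a few window lengths of either end are unreliable.
    return y


def positive_zero_crossings(zffs):
    """Indices where the signal goes from negative to non-negative."""
    z = np.asarray(zffs)
    return np.flatnonzero((z[:-1] < 0) & (z[1:] >= 0)) + 1


def resample_zffs(zffs, pitch_factor):
    """Linearly resample the ZFFS so its zero crossings move closer together
    (pitch_factor > 1) or further apart (pitch_factor < 1)."""
    z = np.asarray(zffs, dtype=np.float64)
    new_len = int(round(len(z) / pitch_factor))
    positions = np.arange(new_len) * pitch_factor
    return np.interp(positions, np.arange(len(z)), z)


if __name__ == "__main__":
    fs = 8000
    speech = np.zeros(4000)
    speech[::80] = 1.0  # synthetic stand-in for voiced speech: 100 Hz impulse train

    zffs = zero_frequency_filter(speech, fs)
    epochs = positive_zero_crossings(zffs)
    new_epochs = positive_zero_crossings(resample_zffs(zffs, 1.25))

    # Compare epoch intervals away from the signal edges.
    interior = epochs[(epochs > 500) & (epochs < 3500)]
    new_interior = new_epochs[(new_epochs > 400) & (new_epochs < 2800)]
    print("original epoch intervals:", np.unique(np.diff(interior)))
    print("modified epoch intervals:", np.unique(np.diff(new_interior)))
```

In this sketch a single `pitch_factor` is applied to the whole signal, which corresponds to fixed prosody modification. A dynamic variant would let the factor vary along the signal, for example by building `positions` from a per-epoch factor contour instead of a constant step.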


Metadata
Title
Dynamic prosody modification using zero frequency filtered signal
Authors
D. Govind
S. R. Mahadeva Prasanna
Publication date
01.03.2013
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2013
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-012-9155-3
