Published in: International Journal of Speech Technology 2/2012

01-06-2012

A HMM-WDLT framework for HNM-based voice conversion with parametric adjustment in formant bandwidth, duration and excitation

Authors: Hwai-Tsu Hu, Chu Yu


Abstract

This paper presents a framework, named Hidden Markov Model-Weighted Deviation Linear Transformation (HMM-WDLT), for performing voice conversion based on the Harmonic + Noise Model (HNM). The HMM-WDLT achieves the lowest average spectral distortion in a comparative study of spectral conversion. The problem of broadened formant bandwidths can be remedied by a weighting constraint and an ordering check, with the minimum clearance estimated from the HMM-WDLT. By jointly exploiting dynamic time warping (DTW) and the HMM-WDLT, duration conversion also becomes feasible. Moreover, the HMM-WDLT plays a part in the conversion of excitation-related parameters such as the fundamental frequency, the maximum voiced frequency, and the harmonic magnitudes for critical bands below 2.7 kHz. The ability to modify pitch and duration concurrently allows the HMM-WDLT to carry out prosody conversion. Listening tests reveal that the converted speech successfully captures the target speaker's individuality with satisfactory quality.
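As a rough illustration of the dynamic time warping step mentioned in the abstract, the sketch below aligns two one-dimensional feature sequences with the textbook DTW recurrence. It is not the authors' implementation: the function name `dtw_path`, the absolute-difference frame distance, and the scalar features are all assumptions made for this example, and the abstract does not say how DTW is combined with the HMM-WDLT.

```python
from math import inf


def dtw_path(source, target):
    """Align two 1-D feature sequences with classic dynamic time warping.

    Hypothetical helper for illustration only; returns the accumulated
    distance and the list of aligned (source_index, target_index) pairs.
    """
    n, m = len(source), len(target)
    # cost[i][j] = minimum accumulated distance aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed frame distance: absolute difference of scalar features
            d = abs(source[i - 1] - target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # source frame advances alone
                                 cost[i][j - 1],      # target frame advances alone
                                 cost[i - 1][j - 1])  # both frames advance

    # Backtrack from the end of both sequences to recover the warping path
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    # Toy data: the target holds its middle value for one extra frame
    total, path = dtw_path([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
    print(total, path)
```

In the resulting path, a source frame paired with several target frames marks a segment that is longer in the target, which is the kind of information a duration conversion step could draw on.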


Metadata
Title: A HMM-WDLT framework for HNM-based voice conversion with parametric adjustment in formant bandwidth, duration and excitation
Authors: Hwai-Tsu Hu, Chu Yu
Publication date: 01-06-2012
Publisher: Springer US
Published in: International Journal of Speech Technology, Issue 2/2012
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-012-9135-7
