
2016 | Original Paper | Book Chapter

Study of the Effect of Reducing Training Data in Speech Synthesis Adaptation Based on Frequency Warping

Authors: Agustin Alonso, Daniel Erro, Eva Navas, Inma Hernaez

Published in: Advances in Speech and Language Technologies for Iberian Languages

Publisher: Springer International Publishing


Abstract

Speaker adaptation techniques use a small amount of data to modify Hidden Markov Model (HMM) based speech synthesis systems so that they mimic a target voice. These techniques can provide personalized systems to people who suffer from a speech impairment, allowing them to communicate in a more natural way. Although adaptation techniques do not require a large amount of data, the recording process can still be tedious for users with speaking difficulties. To improve the acceptance of these systems, it is therefore important to obtain acceptable results from a minimal amount of recordings. In this work we explore how the performance of an adaptation method based on Frequency Warping, which uses only vocalic segments, varies with the amount of available training data.
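Frequency-warping adaptation maps the frequency axis of a source speaker's spectrum toward that of the target speaker. A common parametric choice in this line of work is the bilinear warping function controlled by a single coefficient alpha. The sketch below is illustrative only: the helper name `bilinear_warp` and the use of NumPy are assumptions, not details taken from the chapter.

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """Bilinear frequency warping.

    Maps normalized angular frequency omega in [0, pi] to a warped
    frequency, controlled by alpha in (-1, 1).  alpha = 0 gives the
    identity mapping; positive alpha stretches low frequencies, as
    when adapting toward a speaker with a shorter vocal tract.
    The endpoints 0 and pi are fixed points for any alpha.
    """
    return omega + 2.0 * np.arctan(
        alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega))
    )

# Identity check: warping with alpha = 0 leaves frequencies unchanged.
freqs = np.linspace(0.0, np.pi, 5)
warped = bilinear_warp(freqs, 0.0)
```

Because this warping corresponds to an all-pass bilinear transform, it can equivalently be applied as a linear transformation of cepstral coefficients, which is what makes it attractive for HMM-based adaptation.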


Metadata
Title
Study of the Effect of Reducing Training Data in Speech Synthesis Adaptation Based on Frequency Warping
Authors
Agustin Alonso
Daniel Erro
Eva Navas
Inma Hernaez
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-49169-1_1