2018 | Original Paper | Book Chapter

Phone-Level Embeddings for Unit Selection Speech Synthesis

Authors: Antoine Perquin, Gwénolé Lecorvé, Damien Lolive, Laurent Amsaleg

Published in: Statistical Language and Speech Processing

Publisher: Springer International Publishing

Abstract

Deep neural networks have become the state of the art in speech synthesis. They have been used to directly predict signal parameters or to provide unsupervised descriptions of speech segments through embeddings. In this paper, we present four models, two of which enable us to extract phone-level embeddings for unit selection speech synthesis. Three of the models rely on a feed-forward DNN, and the fourth on an LSTM. The resulting embeddings make it possible to replace the usual expert-based target costs with a Euclidean distance in the embedding space. This work is conducted on a French corpus consisting of an 11-hour audiobook. Perceptual tests show that the produced speech is preferred over that of a unit selection method in which the target cost is defined by an expert. They also show that the embeddings are general enough to be used for different speech styles without quality loss. Furthermore, objective measures and a perceptual test on statistical parametric speech synthesis show that our models perform comparably to state-of-the-art models for parametric signal generation, in spite of necessary simplifications, namely late time integration and information compression.
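
The idea of using a Euclidean distance in the embedding space as a target cost can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`target_cost`, `rank_candidates`), the two-dimensional toy embeddings, and the unit identifiers are all assumptions made for illustration, and a full unit selection system would also combine this cost with concatenation costs during the search.

```python
import math
from typing import Dict, List, Sequence, Tuple


def target_cost(target_embedding: Sequence[float],
                candidate_embedding: Sequence[float]) -> float:
    """Euclidean distance between a predicted (target) phone embedding
    and the embedding of a candidate unit from the corpus."""
    return math.dist(target_embedding, candidate_embedding)


def rank_candidates(target_embedding: Sequence[float],
                    candidates: Dict[str, Sequence[float]],
                    top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k candidate units with the lowest target cost.

    `candidates` maps a hypothetical unit identifier to its embedding.
    """
    scored = [(unit_id, target_cost(target_embedding, embedding))
              for unit_id, embedding in candidates.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:top_k]


# Toy example with made-up 2-D embeddings (real embeddings would be
# produced by a trained network and have many more dimensions).
candidates = {
    "unit_a": [0.0, 0.0],
    "unit_b": [3.0, 4.0],
    "unit_c": [1.0, 1.0],
}
ranked = rank_candidates([0.0, 0.0], candidates, top_k=2)
print(ranked)
```

In this sketch the candidates closest to the target embedding are retained, which mirrors the role a target cost plays in pruning or scoring candidate units before the final selection.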

Metadata
Title
Phone-Level Embeddings for Unit Selection Speech Synthesis
Authors
Antoine Perquin
Gwénolé Lecorvé
Damien Lolive
Laurent Amsaleg
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-00810-9_3