2008 | OriginalPaper | Book chapter

21. Corpus-Based Speech Synthesis

Written by: Prof. Thierry Dutoit

Published in: Springer Handbook of Speech Processing

Publisher: Springer Berlin Heidelberg

Abstract

In this chapter, we present the main trends in corpus-based speech synthesis, assuming a stream of phonemes and prosodic targets as input. From the early diphone-based speech synthesizers to the state-of-the-art unit-selection-based synthesizers and the promising statistical parametric techniques, we emphasize the engineering trade-offs that arise when designing such systems.
In particular, we examine the mathematical foundations of available methods for modifying the fundamental frequency and the duration of speech units for concatenative synthesis, as well as for smoothing discontinuities at concatenation points. For each of these problems, we analyze time- and frequency-domain processing, using algorithms such as time-domain pitch-synchronous overlap-add (TD-PSOLA), multiband resynthesis overlap-add (MBROLA), and the harmonic-plus-noise model (HNM).
We then provide a comprehensive description of how and why concatenative speech synthesis has progressively adopted large speech corpora, using the principle of context-oriented clustering as a smooth transition from fixed inventory synthesis to unit selection and statistical parametric synthesis.
Our description of unit selection emphasizes important issues related to the definition of optimal target and concatenation costs, as well as to the design of the speech corpus (including memory cost issues) and the reduction of computational costs.
We conclude the chapter with the mathematical framework underlying HMM-based speech synthesis and an outline of its main perspectives.
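
To make the trade-off between target and concatenation costs concrete, here is a minimal sketch of unit selection as a minimum-cost search. It is not the chapter's formulation. It assumes that each candidate unit is described by a single pitch value, that the target cost is the absolute distance to a target pitch, and that the concatenation cost is the absolute pitch jump between adjacent units. The function name `select_units`, the weights `w_target` and `w_concat`, and both cost definitions are hypothetical. The search itself is a standard dynamic-programming (Viterbi) search.

```python
def select_units(targets, candidates, w_target=1.0, w_concat=1.0):
    """Return (total_cost, path), where path[i] indexes candidates[i].

    targets: one target pitch value per position (assumed feature).
    candidates: candidates[i] lists the pitch values of the units
        available for position i (assumed one-dimensional description).
    """
    if not targets or len(targets) != len(candidates):
        raise ValueError("targets and candidates must be non-empty and aligned")
    if any(not cands for cands in candidates):
        raise ValueError("every position needs at least one candidate")

    # best[j]: lowest cumulative cost of a path ending in candidate j
    best = [w_target * abs(c - targets[0]) for c in candidates[0]]
    back = []  # back[i - 1][j]: best predecessor of candidate j at position i

    for i in range(1, len(targets)):
        prev = candidates[i - 1]
        new_best, pointers = [], []
        for c in candidates[i]:
            # Cheapest predecessor once the join cost to c is included.
            k = min(range(len(prev)),
                    key=lambda p: best[p] + w_concat * abs(c - prev[p]))
            join = w_concat * abs(c - prev[k])
            new_best.append(best[k] + join + w_target * abs(c - targets[i]))
            pointers.append(k)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path back from the last position.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, path


if __name__ == "__main__":
    targets = [100, 110]
    candidates = [[100, 130], [90, 128]]
    print(select_units(targets, candidates))                 # balanced weights
    print(select_units(targets, candidates, w_concat=0.0))   # ignore joins
    print(select_units(targets, candidates, w_concat=10.0))  # favor smooth joins
```

Raising `w_concat` pushes the search toward sequences with smaller pitch jumps at the joins, even when the individual units lie further from their targets. Setting it to zero reduces the search to picking the closest candidate at each position independently.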

Metadata
Title
Corpus-Based Speech Synthesis
Written by
Prof. Thierry Dutoit
Copyright year
2008
Publisher
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-540-49127-9_21
