
2008 | Original Paper | Book Chapter

21. Corpus-Based Speech Synthesis

Author: Thierry Dutoit, Prof.

Published in: Springer Handbook of Speech Processing

Publisher: Springer Berlin Heidelberg


Abstract

In this chapter, we present the main trends in corpus-based speech synthesis, assuming a stream of phonemes and prosodic targets as input. From the early diphone-based speech synthesizers to the state-of-the-art unit-selection-based synthesizers and the promising statistical parametric techniques, we emphasize the engineering trade-offs that arise when designing such systems.

In particular, we examine the mathematical foundations of available methods for modifying the fundamental frequency and the duration of speech units for concatenative synthesis, as well as for smoothing discontinuities at concatenation points. For each of these problems, we analyze time- and frequency-domain processing, using algorithms such as time-domain pitch-synchronous overlap-add (TD-PSOLA), multiband resynthesis overlap-add (MBROLA), and the harmonic-plus-noise model (HNM).
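As a rough illustration of the time-domain approach mentioned above, the following sketch shows the core overlap-add step of a simplified TD-PSOLA. It is not the chapter's implementation: the function name `td_psola` and its parameters are hypothetical, the signal is assumed to be fully voiced with at least two known pitch marks, and no energy normalization is applied when the pitch is changed.

```python
import numpy as np

def td_psola(x, marks, pitch_factor=1.0, duration_factor=1.0):
    """Simplified TD-PSOLA sketch (hypothetical helper, voiced speech only).

    x               : 1-D input signal
    marks           : sample indices of analysis pitch marks (assumed given)
    pitch_factor    : > 1 raises the fundamental frequency
    duration_factor : > 1 lengthens the signal
    """
    x = np.asarray(x, dtype=float)
    marks = np.asarray(marks, dtype=int)
    periods = np.diff(marks)                  # local pitch periods in samples
    out_len = int(round(len(x) * duration_factor))
    y = np.zeros(out_len)

    t = float(marks[0]) * duration_factor     # first synthesis pitch mark
    while t < out_len:
        # Analysis mark closest to the synthesis time mapped back to input time.
        i = int(np.argmin(np.abs(marks - t / duration_factor)))
        p = int(periods[min(i, len(periods) - 1)])
        lo, hi = int(marks[i]) - p, int(marks[i]) + p + 1
        if lo >= 0 and hi <= len(x):
            # Two-period Hann-windowed segment centred on the analysis mark.
            seg = x[lo:hi] * np.hanning(2 * p + 1)
            c = int(round(t))
            a, b = c - p, c + p + 1
            if a >= 0 and b <= out_len:
                y[a:b] += seg                 # overlap-add at the synthesis mark
        # Synthesis marks are spaced by the modified pitch period.
        t += p / pitch_factor
    return y
```

With both factors equal to 1 and a constant pitch period, the Hann windows sum to one and the interior of the signal is reconstructed unchanged; with other factors, segments are repeated, skipped, or re-spaced.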

We then provide a comprehensive description of how and why concatenative speech synthesis has progressively adopted large speech corpora, using the principle of context-oriented clustering as a smooth transition from fixed inventory synthesis to unit selection and statistical parametric synthesis.

Our description of unit selection emphasizes important issues related to the definition of optimal target and concatenation costs, as well as to the design of the speech corpus (including memory cost issues) and the reduction of computational costs.
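To make the roles of the target and concatenation costs concrete, here is a minimal sketch of unit selection as a dynamic-programming (Viterbi) search over candidate units. The function `select_units` and the toy cost functions are assumptions made for illustration; real systems use much richer feature-based costs and pruning.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Return (best_path, total_cost) minimising the sum of target and
    concatenation costs. `candidates[k]` lists the corpus units available
    for target position k (hypothetical data layout)."""
    # best holds, for each candidate at the current position,
    # the cheapest accumulated cost and the path that reaches it.
    best = [(target_cost(targets[0], u), [u]) for u in candidates[0]]
    for t, cands in zip(targets[1:], candidates[1:]):
        new_best = []
        for u in cands:
            # Cheapest predecessor, taking the join cost into account.
            cost, path = min(
                ((c + concat_cost(p[-1], u), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + target_cost(t, u), path + [u]))
        best = new_best
    cost, path = min(best, key=lambda cp: cp[0])
    return path, cost


# Toy example: units and targets are reduced to a single pitch value (Hz).
targets = [100, 120, 110]
candidates = [[90, 105], [100, 125], [108, 140]]
path, cost = select_units(
    targets,
    candidates,
    target_cost=lambda t, u: abs(t - u),        # distance to prosodic target
    concat_cost=lambda a, b: 0.5 * abs(a - b),  # mismatch at the join
)
print(path, cost)  # [105, 125, 108] 30.5
```

The search cost grows with the product of the candidate counts at adjacent positions, which is one reason why reducing computational cost matters for large corpora.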

We conclude the chapter with the mathematical framework underlying HMM-based speech synthesis and an outline of its main perspectives.



Title: Corpus-Based Speech Synthesis
Author: Thierry Dutoit, Prof.
Publisher: Springer Berlin Heidelberg
Book: Springer Handbook of Speech Processing
Print ISBN: 978-3-540-49125-5
Electronic ISBN: 978-3-540-49127-9
Copyright year: 2008
DOI: https://doi.org/10.1007/978-3-540-49127-9_21
