
2014 | Original Paper | Book chapter

Phonetics and Machine Learning: Hierarchical Modelling of Prosody in Statistical Speech Synthesis

Written by: Martti Vainio

Published in: Statistical Language and Speech Processing

Publisher: Springer International Publishing


Abstract

Text-to-speech synthesis is a task that solves many real-world problems such as providing speaking and reading ability to people who lack those capabilities. It is thus viewed mainly as an engineering problem rather than a purely scientific one. Therefore many of the solutions in speech synthesis are purely practical. However, from the point of view of phonetics, the process of producing speech from text artificially is also a scientific one. Here I argue – using an example from speech prosody, namely speech melody – that phonetics is the key discipline in helping to solve what is arguably one of the most interesting problems in machine learning.


Footnotes
1. For a good overview of the techniques used, see [43].
2. There are interesting developments towards more articulatory control in HMM-based TTS [53]. However, this can only be seen as a compromise, as the units are still defined acoustically and do not necessarily correspond to the actual underlying articulatory gestures.

References
2. Alku, P.: Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Commun. 11(2–3), 109–118 (1992)
3. Alku, P., Tiitinen, H., Näätänen, R.: A method for generating natural-sounding speech stimuli for cognitive brain research. Clin. Neurophysiol. 110, 1329–1333 (1999)
4. Altosaar, T., Karjalainen, M.: Multiple-resolution analysis of speech signals. In: Proceedings of IEEE ICASSP-88, New York (1988)
5. Anumanchipalli, G.K., Oliveira, L.C., Black, A.W.: A statistical phrase/accent model for intonation modeling. In: INTERSPEECH, pp. 1813–1816 (2011)
6. Arnold, D., Wagner, P., Möbius, B.: Obtaining prominence judgments from naïve listeners - influence of rating scales, linguistic levels and normalisation. In: Proceedings of Interspeech 2012 (2012)
7. Badino, L., Clark, R.A., Wester, M.: Towards hierarchical prosodic prominence generation in TTS synthesis. In: INTERSPEECH (2012)
8. Badino, L., D’Ausilio, A., Fadiga, L., Metta, G.: Computational validation of the motor contribution to speech perception. Top. Cogn. Sci. 6(3), 461–475 (2014)
9. Bailly, G., Holm, B.: SFC: a trainable prosodic model. Speech Commun. 46(3), 348–364 (2005)
10. Becker, S., Schröder, M., Barry, W.J.: Rule-based prosody prediction for German text-to-speech synthesis. In: Proceedings of Speech Prosody 2006, pp. 503–506 (2006)
12. Bengio, Y.: Deep learning of representations: looking forward. In: Dediu, A.-H., Martín-Vide, C., Mitkov, R., Truthe, B. (eds.) SLSP 2013. LNCS, vol. 7978, pp. 1–37. Springer, Heidelberg (2013)
13. Beňuš, Š.: Conversational entrainment in the use of discourse markers. In: Bassis, S., Esposito, A., Morabito, F.C. (eds.) Recent Advances of Neural Network Models and Applications, pp. 345–352. Springer, Heidelberg (2014)
14. Birkholz, P.: Modeling consonant-vowel coarticulation for articulatory speech synthesis. PLoS One 8(4), e60603 (2013)
15. Birkholz, P., Jackel, D.: A three-dimensional model of the vocal tract for speech synthesis. In: Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, Spain, pp. 2597–2600 (2003)
16. Bolinger, D.L.: Around the edge of language: intonation. Harvard Educ. Rev. 34(2), 282–296 (1964)
17. Campbell, W.N.: CHATR: a high-definition speech re-sequencing system. In: Proceedings of 3rd ASA/ASJ Joint Meeting, pp. 1223–1228 (1996)
18. Cole, J., Mo, Y., Hasegawa-Johnson, M.: Signal-based and expectation-based factors in the perception of prosodic prominence. Lab. Phonology 1(2), 425–452 (2010)
19. Cooper, F.S.: Speech synthesizers. In: Proceedings of 4th International Congress of Phonetic Sciences (ICPhS’61), pp. 3–13 (1962)
20. D’Ausilio, A., Maffongelli, L., Bartoli, E., Campanella, M., Ferrari, E., Berry, J., Fadiga, L.: Listening to speech recruits specific tongue motor synergies as revealed by transcranial magnetic stimulation and tissue-doppler ultrasound imaging. Philos. Trans. R. Soc. B: Biol. Sci. 369(1644), 20130418 (2014)
21. Denes, P.B., Pinson, E.N.: The Speech Chain, p. 121. Bell Laboratory Educational Publication, New York (1963)
22. Deng, L.: A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Trans. Signal Inf. Process. 3, e2 (2014)
23. Deng, L., Li, X.: Machine learning paradigms for speech recognition: an overview. IEEE Trans. Audio Speech Lang. Process. 21(5), 1060–1089 (2013)
24. Dutoit, T.: An Introduction to Text-to-Speech Synthesis, vol. 3. Springer, New York (1997)
25. Eriksson, A., Thunberg, G.C., Traunmüller, H.: Syllable prominence: a matter of vocal effort, phonetic distinctness and top-down processing. In: Proceedings of European Conference on Speech Communication and Technology, Aalborg, vol. 1, pp. 399–402, September 2001
26. Fant, C.G.M., Martony, J., Rengman, U., Risberg, A.: OVE II synthesis strategy. In: Proceedings of the Speech Communication Seminar F, vol. 5 (1962)
27. Farouk, M.H.: Application of Wavelets in Speech Processing. Springer, New York (2014)
28. Flanagan, J.L.: Speech Analysis, Synthesis and Perception, vol. 1, 2nd edn. Springer, Heidelberg (1972)
29. Flanagan, J.L.: Note on the design of “terminal-analog” speech synthesizers. J. Acoust. Soc. Am. 29(2), 306–310 (1957)
30. Frank, S.L., Bod, R., Christiansen, M.H.: How hierarchical is language use? Proc. R. Soc. B: Biol. Sci. 279, 4522–4531 (2012)
31. Fujisaki, H., Hirose, K.: Analysis of voice fundamental frequency contours for declarative sentences of Japanese. J. Acoust. Soc. Jpn. (E) 5(4), 233–241 (1984)
32. Fujisaki, H., Sudo, H.: A generative model for the prosody of connected speech in Japanese. Annu. Rep. Eng. Res. Inst. 30, 75–80 (1971)
33. Fukui, K., Ishikawa, Y., Sawa, T., Shintaku, E., Honda, M., Takanishi, A.: New anthropomorphic talking robot having a three-dimensional articulation mechanism and improved pitch range. In: 2007 IEEE International Conference on Robotics and Automation, pp. 2922–2927. IEEE (2007)
34. Goldsmith, J.A.: Autosegmental and Metrical Phonology, vol. 11. Blackwell, Oxford (1990)
35. Grossman, A., Morlet, J.: Decomposition of functions into wavelets of constant shape, and related transforms. Math. Phys. Lect. Recent Results 11, 135–165 (1985)
36. Halle, M., Vergnaud, J.R.: Three dimensional phonology. J. Linguist. Res. 1(1), 83–105 (1980)
37. Halle, M., Vergnaud, J.R., et al.: Metrical Structures in Phonology. MIT, Cambridge (1978)
38. Hannukainen, A., Lukkari, T., Malinen, J., Palo, P.: Vowel formants from the wave equation. J. Acoust. Soc. Am. 122(1), EL1–EL7 (2007)
39. Hertz, S.R.: From text to speech with SRS. J. Acoust. Soc. Am. 72(4), 1155–1170 (1982)
40. Hertz, S.R., Kadin, J., Karplus, K.J.: The delta rule development system for speech synthesis from text. Proc. IEEE 73(11), 1589–1601 (1985)
41. Hirschberg, J.: Pitch accent in context: predicting intonational prominence from text. Artif. Intell. 63(1–2), 305–340 (1993)
42. Hunt, A.J., Black, A.W.: Unit selection in a concatenative speech synthesis system using a large speech database. In: Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-96, vol. 1, pp. 373–376. IEEE (1996)
43. King, S.: Measuring a decade of progress in text-to-speech. Loquens 1(1) (2014)
44. Klatt, D.H.: Review of text-to-speech conversion for English. J. Acoust. Soc. Am. 82(3), 737–793 (1987)
45. Klatt, D.: Acoustic theory of terminal analog speech synthesis. In: Proceedings of 1972 International Conference on Speech Communication Processing, Boston, MA (1972)
46. Kleijn, W.B.: Principles of speech coding. In: Benesty, J., Sondhi, M.M., Huang, Y. (eds.) Springer Handbook of Speech Processing, pp. 283–306. Springer, Heidelberg (2008)
47. Kochanski, G., Shih, C.: Stem-ML: language-independent prosody description. In: INTERSPEECH, pp. 239–242 (2000)
48. Kochanski, G., Shih, C.: Prosody modeling with soft templates. Speech Commun. 39(3), 311–352 (2003)
49. Kruschke, H., Lenz, M.: Estimation of the parameters of the quantitative intonation model with continuous wavelet analysis. In: INTERSPEECH (2003)
50. Lei, M., Wu, Y.J., Soong, F.K., Ling, Z.H., Dai, L.R.: A hierarchical f0 modeling method for HMM-based speech synthesis. In: INTERSPEECH, pp. 2170–2173 (2010)
51. Liberman, A.M., Cooper, F.S., Shankweiler, D.P., Studdert-Kennedy, M.: Perception of the speech code. Psychol. Rev. 74(6), 431 (1967)
52. Liberman, A.M., Mattingly, I.G.: The motor theory of speech perception revised. Cognition 21(1), 1–36 (1985)
53. Ling, Z.H., Richmond, K., Yamagishi, J.: Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression. IEEE Trans. Audio Speech Lang. Process. 21(1), 207–219 (2013)
54. Mallat, S.: A wavelet tour of signal processing. Access Online via Elsevier (1999)
55. Mishra, T., Santen, J.V., Klabbers, E.: Decomposition of pitch curves in the general superpositional intonation model. In: Speech Prosody, Dresden, Germany (2006)
56. Moro, E.B.: A 19th-century speaking machine: the Tecnefón of Severino Pérez y Vázquez. Historiographia Linguistica 34(1), 19–36 (2007)
57. Nishikawa, K., Asama, K., Hayashi, K., Takanobu, H., Takanishi, A.: Development of a talking robot. In: Proceedings of 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems 2000 (IROS 2000), vol. 3, pp. 1760–1765. IEEE (2000)
58. Öhman, S.: Word and sentence intonation: a quantitative model. Speech Transmission Laboratory, Department of Speech Communication, Royal Institute of Technology (1967)
59. Pfeifer, R., Lungarella, M., Iida, F.: Self-organization, embodiment, and biologically inspired robotics. Science 318(5853), 1088–1093 (2007)
60. Raitio, T., Lu, H., Kane, J., Suni, A., Vainio, M., King, S., Alku, P.: Voice source modelling using deep neural networks for statistical parametric speech synthesis. In: 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, September 2014 (accepted)
61. Raitio, T., Suni, A., Juvela, L., Vainio, M., Alku, P.: Deep neural network based trainable voice source model for synthesis of speech with varying vocal effort. In: Proceedings of Interspeech, Singapore, accepted: September 2014
62. Raitio, T., Suni, A., Pohjalainen, J., Airaksinen, M., Vainio, M., Alku, P.: Analysis and synthesis of shouted speech. In: Interspeech, Lyon, France, pp. 1544–1548, August 2013
63. Raitio, T., Suni, A., Vainio, M., Alku, P.: Analysis of HMM-based Lombard speech synthesis. In: Interspeech, Florence, Italy, pp. 2781–2784, August 2011
64. Raitio, T., Suni, A., Vainio, M., Alku, P.: Synthesis and perception of breathy, normal, and Lombard speech in the presence of noise. Comput. Speech Lang. 28(2), 648–664 (2014)
65. Ramachandran, R., Mammone, R.: Modern Methods of Speech Processing. Springer, New York (1995)
66. Riley, M.D.: Speech Time-Frequency Representation, vol. 63. Springer, New York (1989)
67. van Rooij, J.C., Plomp, R.: The effect of linguistic entropy on speech perception in noise in young and elderly listeners. J. Acoust. Soc. Am. 90(6), 2985–2991 (1991)
68. van Santen, J.P., Mishra, T., Klabbers, E.: Estimating phrase curves in the general superpositional intonation model. In: Fifth ISCA Workshop on Speech Synthesis (2004)
70. Simko, J., Cummins, F.: Embodied task dynamics. Psychol. Rev. 117(4), 1229 (2010)
71. Šimko, J., O’Dell, M., Vainio, M.: Emergent consonantal quantity contrast and context-dependence of gestural phasing. J. Phonetics 44, 130–151 (2014)
72. Sondhi, M.M., Schroeter, J.: A hybrid time-frequency domain articulatory speech synthesizer. IEEE Trans. Acoust. Speech Signal Process. 35(7), 955–967 (1987)
73. Sproat, R.W.: Multilingual Text-to-Speech Synthesis. Kluwer Academic Publishers, Boston (1997)
74. Story, B.H.: A parametric model of the vocal tract area function for vowel and consonant simulation. J. Acoust. Soc. Am. 117(5), 3231–3254 (2005)
75. Suni, A., Aalto, D., Raitio, T., Alku, P., Vainio, M.: Wavelets for intonation modeling in HMM speech synthesis. In: 8th ISCA Speech Synthesis Workshop (SSW8), Barcelona, Spain, pp. 285–290, August-September 2013
76. Suni, A., Raitio, T., Vainio, M., Alku, P.: The GlottHMM speech synthesis entry for Blizzard Challenge 2010. In: Blizzard Challenge 2010 Workshop, Kyoto, Japan, September 2010
77. Suni, A., Raitio, T., Vainio, M., Alku, P.: The GlottHMM entry for Blizzard Challenge 2011: utilizing source unit selection in HMM-based speech synthesis for improved excitation generation. In: Blizzard Challenge 2011 Workshop, Florence, Italy, September 2011
78. Suni, A., Raitio, T., Vainio, M., Alku, P.: The GlottHMM entry for Blizzard Challenge 2012 - hybrid approach. In: Blizzard Challenge 2012 Workshop, Portland, Oregon, September 2012
79. Suni, A., Simko, J., Aalto, D., Vainio, M.: Continuous wavelet transform in text-to-speech synthesis prosody control (in preparation)
80. Suni, A.S., Aalto, D., Raitio, T., Alku, P., Vainio, M., et al.: Wavelets for intonation modeling in HMM speech synthesis. In: Proceedings of 8th ISCA Workshop on Speech Synthesis, Barcelona, 31 August-2 September 2013
81. Taylor, P.: Text-to-Speech Synthesis. Cambridge University Press, Cambridge (2009)
82. Tokuda, K., Kobayashi, T., Imai, S.: Speech parameter generation from HMM using dynamic features. In: 1995 International Conference on Acoustics, Speech, and Signal Processing, ICASSP-95, vol. 1, pp. 660–663. IEEE (1995)
83. Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., Kitamura, T.: Speech parameter generation algorithms for HMM-based speech synthesis. In: Proceedings of 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP’00, vol. 3, pp. 1315–1318. IEEE (2000)
84. Vainio, L., Tiainen, M., Tiippana, K., Vainio, M.: Shared processing of planning articulatory gestures and grasping. Exp. Brain Res. 232(7), 2359–2368 (2014)
85. Vainio, L., Schulman, M., Tiippana, K., Vainio, M.: Effect of syllable articulation on precision and power grip performance. PLoS One 8(1), e53061 (2013)
86. Vainio, M., Järvikivi, J.: Tonal features, intensity, and word order in the perception of prominence. J. Phonetics 34, 319–342 (2006)
87. Vainio, M., Suni, A., Aalto, D.: Continuous wavelet transform for analysis of speech prosody. In: Proceedings of TRASP 2013 - Tools and Resources for the Analysis of Speech Prosody, An Interspeech 2013 Satellite Event, August 30 2013, Laboratoire Parole et Langage, Aix-en-Provence, France (2013)
88. Vainio, M., Suni, A., Aalto, D.: Emphasis, word prominence, and continuous wavelet transform in the control of HMM based synthesis. In: Speech Prosody in Speech Synthesis - Modeling, Realizing, Converting Prosody for High Quality and Flexible Speech Synthesis, Prosody, Phonology and Phonetics. Springer (2015)
89. Vainio, M., Suni, A., Raitio, T., Nurminen, J., Järvikivi, J., Alku, P.: New method for delexicalization and its application to prosodic tagging for text-to-speech synthesis. In: Interspeech, Brighton, UK, pp. 1703–1706, September 2009
90. Vainio, M., Suni, A., Sirjola, P.: Developing a Finnish concept-to-speech system. In: Langemets, M., Penjam, P. (eds.) Proceedings of the Second Baltic Conference on Human Language Technologies, Tallinn, pp. 201–206, 4–5 April 2005
91. von Kempelen, W., de Pázmánd, W.K., Autriche, M.: Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine. bei JV Degen (1791)
92. Watts, O.S.: Unsupervised learning for text-to-speech synthesis. Ph.D. thesis (2013)
93. Zen, H., Braunschweiler, N.: Context-dependent additive log f_0 model for HMM-based speech synthesis. In: INTERSPEECH, pp. 2091–2094 (2009)
94. Zen, H., Tokuda, K., Black, A.W.: Statistical parametric speech synthesis. Speech Commun. 51(11), 1039–1064 (2009)

Metadata
Title: Phonetics and Machine Learning: Hierarchical Modelling of Prosody in Statistical Speech Synthesis
Author: Martti Vainio
Copyright year: 2014
DOI: https://doi.org/10.1007/978-3-319-11397-5_3