2014 | OriginalPaper | Chapter

Phonetics and Machine Learning: Hierarchical Modelling of Prosody in Statistical Speech Synthesis

Abstract

Text-to-speech synthesis is a task that solves many real-world problems such as providing speaking and reading ability to people who lack those capabilities. It is thus viewed mainly as an engineering problem rather than a purely scientific one. Therefore many of the solutions in speech synthesis are purely practical. However, from the point of view of phonetics, the process of producing speech from text artificially is also a scientific one. Here I argue – using an example from speech prosody, namely speech melody – that phonetics is the key discipline in helping to solve what is arguably one of the most interesting problems in machine learning.

Footnotes
1. For a good overview of the techniques used, see [43].

2. There are interesting developments towards more articulatory control in HMM-based TTS [53]. However, this can only be seen as a compromise, as the units are still defined acoustically and do not necessarily correspond to the actual underlying articulatory gestures.
 
Literature
2. Alku, P.: Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Commun. 11(2–3), 109–118 (1992)
3. Alku, P., Tiitinen, H., Näätänen, R.: A method for generating natural-sounding speech stimuli for cognitive brain research. Clin. Neurophysiol. 110, 1329–1333 (1999)
4. Altosaar, T., Karjalainen, M.: Multiple-resolution analysis of speech signals. In: Proceedings of IEEE ICASSP-88, New York (1988)
5. Anumanchipalli, G.K., Oliveira, L.C., Black, A.W.: A statistical phrase/accent model for intonation modeling. In: INTERSPEECH, pp. 1813–1816 (2011)
6. Arnold, D., Wagner, P., Möbius, B.: Obtaining prominence judgments from naïve listeners-influence of rating scales, linguistic levels and normalisation. In: Proceedings of Interspeech 2012 (2012)
7. Badino, L., Clark, R.A., Wester, M.: Towards hierarchical prosodic prominence generation in TTS synthesis. In: INTERSPEECH (2012)
8. Badino, L., D’Ausilio, A., Fadiga, L., Metta, G.: Computational validation of the motor contribution to speech perception. Top. Cogn. Sci. 6(3), 461–475 (2014)
9. Bailly, G., Holm, B.: SFC: a trainable prosodic model. Speech Commun. 46(3), 348–364 (2005)
10. Becker, S., Schröder, M., Barry, W.J.: Rule-based prosody prediction for German text-to-speech synthesis. In: Proceedings of Speech Prosody 2006, pp. 503–506 (2006)
12. Bengio, Y.: Deep learning of representations: looking forward. In: Dediu, A.-H., Martín-Vide, C., Mitkov, R., Truthe, B. (eds.) SLSP 2013. LNCS, vol. 7978, pp. 1–37. Springer, Heidelberg (2013)
13. Beňuš, Š.: Conversational entrainment in the use of discourse markers. In: Bassis, S., Esposito, A., Morabito, F.C. (eds.) Recent Advances of Neural Network Models and Applications, pp. 345–352. Springer, Heidelberg (2014)
14. Birkholz, P.: Modeling consonant-vowel coarticulation for articulatory speech synthesis. PloS One 8(4), e60603 (2013)
15. Birkholz, P., Jackel, D.: A three-dimensional model of the vocal tract for speech synthesis. In: Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, Spain, pp. 2597–2600 (2003)
16. Bolinger, D.L.: Around the edge of language: intonation. Harvard Educ. Rev. 34(2), 282–296 (1964)
17. Campbell, W.N.: CHATR: a high-definition speech re-sequencing system. In: Proceedings of 3rd ASA/ASJ Joint Meeting, pp. 1223–1228 (1996)
18. Cole, J., Mo, Y., Hasegawa-Johnson, M.: Signal-based and expectation-based factors in the perception of prosodic prominence. Lab. Phonology 1(2), 425–452 (2010)
19. Cooper, F.S.: Speech synthesizers. In: Proceedings of 4th International Congress of Phonetic Sciences (ICPhS’61), pp. 3–13 (1962)
20. D’Ausilio, A., Maffongelli, L., Bartoli, E., Campanella, M., Ferrari, E., Berry, J., Fadiga, L.: Listening to speech recruits specific tongue motor synergies as revealed by transcranial magnetic stimulation and tissue-Doppler ultrasound imaging. Philos. Trans. R. Soc. B: Biol. Sci. 369(1644), 20130418 (2014)
21. Denes, P.B., Pinson, E.N.: The Speech Chain, p. 121. Bell Laboratory Educational Publication, New York (1963)
22. Deng, L.: A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Trans. Signal Inf. Process. 3, e2 (2014)
23. Deng, L., Li, X.: Machine learning paradigms for speech recognition: an overview. IEEE Trans. Audio, Speech Lang. Process. 21(5), 1060–1089 (2013)
24. Dutoit, T.: An Introduction to Text-to-Speech Synthesis, vol. 3. Springer, New York (1997)
25. Eriksson, A., Thunberg, G.C., Traunmüller, H.: Syllable prominence: a matter of vocal effort, phonetic distinctness and top-down processing. In: Proceedings of European Conference on Speech Communication and Technology, Aalborg, vol. 1, pp. 399–402, September 2001
26. Fant, C.G.M., Martony, J., Rengman, U., Risberg, A.: OVE II synthesis strategy. In: Proceedings of the Speech Communication Seminar F, vol. 5 (1962)
27.
28. Flanagan, J.L.: Speech Analysis, Synthesis and Perception, vol. 1, 2nd edn. Springer, Heidelberg (1972)
29. Flanagan, J.L.: Note on the design of “terminal-analog” speech synthesizers. J. Acoust. Soc. Am. 29(2), 306–310 (1957)
30. Frank, S.L., Bod, R., Christiansen, M.H.: How hierarchical is language use? Proc. R. Soc. B: Biol. Sci. 279, 4522–4531 (2012)
31. Fujisaki, H., Hirose, K.: Analysis of voice fundamental frequency contours for declarative sentences of Japanese. J. Acoust. Soc. Jpn. (E) 5(4), 233–241 (1984)
32. Fujisaki, H., Sudo, H.: A generative model for the prosody of connected speech in Japanese. Annu. Rep. Eng. Res. Inst. 30, 75–80 (1971)
33. Fukui, K., Ishikawa, Y., Sawa, T., Shintaku, E., Honda, M., Takanishi, A.: New anthropomorphic talking robot having a three-dimensional articulation mechanism and improved pitch range. In: 2007 IEEE International Conference on Robotics and Automation, pp. 2922–2927. IEEE (2007)
34. Goldsmith, J.A.: Autosegmental and Metrical Phonology, vol. 11. Blackwell, Oxford (1990)
35. Grossman, A., Morlet, J.: Decomposition of functions into wavelets of constant shape, and related transforms. Math. Phys. Lect. Recent Results 11, 135–165 (1985)
36. Halle, M., Vergnaud, J.R.: Three dimensional phonology. J. Linguist. Res. 1(1), 83–105 (1980)
37. Halle, M., Vergnaud, J.R., et al.: Metrical Structures in Phonology. MIT, Cambridge (1978)
38. Hannukainen, A., Lukkari, T., Malinen, J., Palo, P.: Vowel formants from the wave equation. J. Acoust. Soc. Am. 122(1), EL1–EL7 (2007)
39. Hertz, S.R.: From text to speech with SRS. J. Acoust. Soc. Am. 72(4), 1155–1170 (1982)
40. Hertz, S.R., Kadin, J., Karplus, K.J.: The delta rule development system for speech synthesis from text. Proc. IEEE 73(11), 1589–1601 (1985)
41. Hirschberg, J.: Pitch accent in context: predicting intonational prominence from text. Artif. Intell. 63(1–2), 305–340 (1993)
42. Hunt, A.J., Black, A.W.: Unit selection in a concatenative speech synthesis system using a large speech database. In: Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-96, vol. 1, pp. 373–376. IEEE (1996)
43. King, S.: Measuring a decade of progress in text-to-speech. Loquens 1(1) (2014)
44. Klatt, D.H.: Review of text-to-speech conversion for English. J. Acoust. Soc. Am. 82(3), 737–793 (1987)
45. Klatt, D.: Acoustic theory of terminal analog speech synthesis. In: Proceedings of 1972 International Conference on Speech Communication Processing, Boston, MA (1972)
46. Kleijn, W.B.: Principles of speech coding. In: Benesty, J., Sondhi, M.M., Huang, Y. (eds.) Springer Handbook of Speech Processing, pp. 283–306. Springer, Heidelberg (2008)
47. Kochanski, G., Shih, C.: Stem-ML: language-independent prosody description. In: INTERSPEECH, pp. 239–242 (2000)
48. Kochanski, G., Shih, C.: Prosody modeling with soft templates. Speech Commun. 39(3), 311–352 (2003)
49. Kruschke, H., Lenz, M.: Estimation of the parameters of the quantitative intonation model with continuous wavelet analysis. In: INTERSPEECH (2003)
50. Lei, M., Wu, Y.J., Soong, F.K., Ling, Z.H., Dai, L.R.: A hierarchical f0 modeling method for HMM-based speech synthesis. In: INTERSPEECH, pp. 2170–2173 (2010)
51. Liberman, A.M., Cooper, F.S., Shankweiler, D.P., Studdert-Kennedy, M.: Perception of the speech code. Psychol. Rev. 74(6), 431 (1967)
52. Liberman, A.M., Mattingly, I.G.: The motor theory of speech perception revised. Cognition 21(1), 1–36 (1985)
53. Ling, Z.H., Richmond, K., Yamagishi, J.: Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression. IEEE Trans. Audio Speech Lang. Process. 21(1), 207–219 (2013)
54. Mallat, S.: A wavelet tour of signal processing. Access Online via Elsevier (1999)
55. Mishra, T., Santen, J.V., Klabbers, E.: Decomposition of pitch curves in the general superpositional intonation model. In: Speech Prosody, Dresden, Germany (2006)
56. Moro, E.B.: A 19th-century speaking machine: the tecnefón of Severino Perez y Vazquez. Historiographia Linguistica 34(1), 19–36 (2007)
57. Nishikawa, K., Asama, K., Hayashi, K., Takanobu, H., Takanishi, A.: Development of a talking robot. In: Proceedings of 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems 2000 (IROS 2000), vol. 3, pp. 1760–1765. IEEE (2000)
58. Öhman, S.: Word and sentence intonation: a quantitative model. Speech Transmission Laboratory, Department of Speech Communication, Royal Institute of Technology (1967)
59. Pfeifer, R., Lungarella, M., Iida, F.: Self-organization, embodiment, and biologically inspired robotics. Science 318(5853), 1088–1093 (2007)
60. Raitio, T., Lu, H., Kane, J., Suni, A., Vainio, M., King, S., Alku, P.: Voice source modelling using deep neural networks for statistical parametric speech synthesis. In: 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, September 2014 (accepted)
61. Raitio, T., Suni, A., Juvela, L., Vainio, M., Alku, P.: Deep neural network based trainable voice source model for synthesis of speech with varying vocal effort. In: Proceedings of Interspeech, Singapore, September 2014 (accepted)
62. Raitio, T., Suni, A., Pohjalainen, J., Airaksinen, M., Vainio, M., Alku, P.: Analysis and synthesis of shouted speech. In: Interspeech, Lyon, France, pp. 1544–1548, August 2013
63. Raitio, T., Suni, A., Vainio, M., Alku, P.: Analysis of HMM-based Lombard speech synthesis. In: Interspeech, Florence, Italy, pp. 2781–2784, August 2011
64. Raitio, T., Suni, A., Vainio, M., Alku, P.: Synthesis and perception of breathy, normal, and Lombard speech in the presence of noise. Comput. Speech Lang. 28(2), 648–664 (2014)
65. Ramachandran, R., Mammone, R.: Modern Methods of Speech Processing. Springer, New York (1995)
66. Riley, M.D.: Speech Time-Frequency Representation, vol. 63. Springer, New York (1989)
67. van Rooij, J.C., Plomp, R.: The effect of linguistic entropy on speech perception in noise in young and elderly listeners. J. Acoust. Soc. Am. 90(6), 2985–2991 (1991)
68. van Santen, J.P., Mishra, T., Klabbers, E.: Estimating phrase curves in the general superpositional intonation model. In: Fifth ISCA Workshop on Speech Synthesis (2004)
70. Simko, J., Cummins, F.: Embodied task dynamics. Psychol. Rev. 117(4), 1229 (2010)
71. Šimko, J., O’Dell, M., Vainio, M.: Emergent consonantal quantity contrast and context-dependence of gestural phasing. J. Phonetics 44, 130–151 (2014)
72. Sondhi, M.M., Schroeter, J.: A hybrid time-frequency domain articulatory speech synthesizer. IEEE Trans. Acoust. Speech Signal Process. 35(7), 955–967 (1987)
73. Sproat, R.W.: Multilingual Text-to-Speech Synthesis. Kluwer Academic Publishers, Boston (1997)
74. Story, B.H.: A parametric model of the vocal tract area function for vowel and consonant simulation. J. Acoust. Soc. Am. 117(5), 3231–3254 (2005)
75. Suni, A., Aalto, D., Raitio, T., Alku, P., Vainio, M.: Wavelets for intonation modeling in HMM speech synthesis. In: 8th ISCA Speech Synthesis Workshop (SSW8), Barcelona, Spain, pp. 285–290, August-September 2013
76. Suni, A., Raitio, T., Vainio, M., Alku, P.: The GlottHMM speech synthesis entry for Blizzard Challenge 2010. In: Blizzard Challenge 2010 Workshop, Kyoto, Japan, September 2010
77. Suni, A., Raitio, T., Vainio, M., Alku, P.: The GlottHMM entry for Blizzard Challenge 2011: utilizing source unit selection in HMM-based speech synthesis for improved excitation generation. In: Blizzard Challenge 2011 Workshop, Florence, Italy, September 2011
78. Suni, A., Raitio, T., Vainio, M., Alku, P.: The GlottHMM entry for Blizzard Challenge 2012 - hybrid approach. In: Blizzard Challenge 2012 Workshop, Portland, Oregon, September 2012
79. Suni, A., Simko, J., Aalto, D., Vainio, M.: Continuous wavelet transform in text-to-speech synthesis prosody control (in preparation)
80. Suni, A.S., Aalto, D., Raitio, T., Alku, P., Vainio, M., et al.: Wavelets for intonation modeling in HMM speech synthesis. In: Proceedings of 8th ISCA Workshop on Speech Synthesis, Barcelona, 31 August-2 September 2013
81. Taylor, P.: Text-to-Speech Synthesis. Cambridge University Press, Cambridge (2009)
82. Tokuda, K., Kobayashi, T., Imai, S.: Speech parameter generation from HMM using dynamic features. In: 1995 International Conference on Acoustics, Speech, and Signal Processing, ICASSP-95, vol. 1, pp. 660–663. IEEE (1995)
83. Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., Kitamura, T.: Speech parameter generation algorithms for HMM-based speech synthesis. In: Proceedings of 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP’00, vol. 3, pp. 1315–1318. IEEE (2000)
84. Vainio, L., Tiainen, M., Tiippana, K., Vainio, M.: Shared processing of planning articulatory gestures and grasping. Exp. Brain Res. 232(7), 2359–2368 (2014)
85. Vainio, L., Schulman, M., Tiippana, K., Vainio, M.: Effect of syllable articulation on precision and power grip performance. PloS One 8(1), e53061 (2013)
86. Vainio, M., Järvikivi, J.: Tonal features, intensity, and word order in the perception of prominence. J. Phonetics 34, 319–342 (2006)
87. Vainio, M., Suni, A., Aalto, D.: Continuous wavelet transform for analysis of speech prosody. In: Proceedings of TRASP 2013-Tools and Resources for the Analysis of Speech Prosody, An Interspeech 2013 Satellite Event, August 30 2013, Laboratoire Parole et Langage, Aix-en-Provence, France (2013)
88. Vainio, M., Suni, A., Aalto, D.: Emphasis, word prominence, and continuous wavelet transform in the control of HMM-based synthesis. In: Speech Prosody in Speech Synthesis - Modeling, Realizing, Converting Prosody for High Quality and Flexible Speech Synthesis, Prosody, Phonology and Phonetics. Springer (2015)
89. Vainio, M., Suni, A., Raitio, T., Nurminen, J., Järvikivi, J., Alku, P.: New method for delexicalization and its application to prosodic tagging for text-to-speech synthesis. In: Interspeech, Brighton, UK, pp. 1703–1706, September 2009
90. Vainio, M., Suni, A., Sirjola, P.: Developing a Finnish concept-to-speech system. In: Langemets, M., Penjam, P. (eds.) Proceedings of the Second Baltic Conference on Human Language Technologies, Tallinn, pp. 201–206, 4–5 April 2005
91. von Kempelen, W., de Pázmánd, W.K., Autriche, M.: Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine. bei JV Degen (1791)
92. Watts, O.S.: Unsupervised learning for text-to-speech synthesis. Ph.D. thesis (2013)
93. Zen, H., Braunschweiler, N.: Context-dependent additive log f_0 model for HMM-based speech synthesis. In: INTERSPEECH, pp. 2091–2094 (2009)
94. Zen, H., Tokuda, K., Black, A.W.: Statistical parametric speech synthesis. Speech Commun. 51(11), 1039–1064 (2009)
Metadata
Title
Phonetics and Machine Learning: Hierarchical Modelling of Prosody in Statistical Speech Synthesis
Author
Martti Vainio
Copyright Year
2014
DOI
https://doi.org/10.1007/978-3-319-11397-5_3
