Top

Published in:

2014 | OriginalPaper | Chapter

Phonetics and Machine Learning: Hierarchical Modelling of Prosody in Statistical Speech Synthesis

Author : Martti Vainio

Published in: Statistical Language and Speech Processing

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

Text-to-speech synthesis is a task that solves many real-world problems such as providing speaking and reading ability to people who lack those capabilities. It is thus viewed mainly as an engineering problem rather than a purely scientific one. Therefore many of the solutions in speech synthesis are purely practical. However, from the point of view of phonetics, the process of producing speech from text artificially is also a scientific one. Here I argue – using an example from speech prosody, namely speech melody – that phonetics is the key discipline in helping to solve what is arguably one of the most interesting problems in machine learning.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

previous chapter Spoken Language Processing: Time to Look Outside?

next chapter A Hybrid Approach to Compiling Bilingual Dictionaries of Medical Terms from Parallel Corpora

For a good overview of techniques used see [43].

There are interesting developments towards more articulatory control in HMM based TTS [53]. However, this can only be seen as compromise as the units are still defined acoustically and do not necessarily correspond with the actual underlying articulatory gestures.

(2014). http://www.simple4all.org

Alku, P.: Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Commun. 11(2–3), 109–118 (1992)CrossRef

Alku, P., Tiitinen, H., Näätänen, R.: A method for generating natural-sounding speech stimuli for cognitive brain research. Clin. Neurophysiol. 110, 1329–1333 (1999)CrossRef

Altosaar, T., Karjalainen, M.: Multiple-resolution analysis of speech signals. In: Proceedings of IEEE ICASSP-88, New York (1988)

Anumanchipalli, G.K., Oliveira, L.C., Black, A.W.: A statistical phrase/accent model for intonation modeling. In: INTERSPEECH, pp. 1813–1816 (2011)

Arnold, D., Wagner, P., Möbius, B.: Obtaining prominence judgments from naïve listeners-influence of rating scales, linguistic levels and normalisation. In: Proceedings of Interspeech 2012 (2012)

Badino, L., Clark, R.A., Wester, M.: Towards hierarchical prosodic prominence generation in TTS synthesis. In: INTERSPEECH (2012)

Badino, L., D’Ausilio, A., Fadiga, L., Metta, G.: Computational validation of the motor contribution to speech perception. Top. Cogn. Sci. 6(3), 461–475 (2014)CrossRef

Bailly, G., Holm, B.: SFC: a trainable prosodic model. Speech Commun. 46(3), 348–364 (2005)CrossRef

10.

Becker, S., Schröder, M., Barry, W.J.: Rule-based prosody prediction for german text-to-speech synthesis. In: Proceedings of Speech Prosody 2006, pp. 503–506 (2006)

11.

Bengio, Y.: Evolving culture vs local minima. arXiv preprint arXiv:1203.2990 (2012)

12.

Bengio, Y.: Deep learning of representations: looking forward. In: Dediu, A.-H., Martín-Vide, C., Mitkov, R., Truthe, B. (eds.) SLSP 2013. LNCS, vol. 7978, pp. 1–37. Springer, Heidelberg (2013) CrossRef

13.

Beňuš, Š.: Conversational entrainment in the use of discourse markers. In: Bassis, S., Esposito, A., Morabito, F.C. (eds.) Recent Advances of Neural Network Models and Applications, pp. 345–352. Springer, Heidelberg (2014)

14.

Birkholz, P.: Modeling consonant-vowel coarticulation for articulatory speech synthesis. PloS One 8(4), e60603 (2013)CrossRef

15.

Birkholz, P., Jackel, D.: A three-dimensional model of the vocal tract for speech synthesis. In: Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, Spain, pp. 2597–2600 (2003)

16.

Bolinger, D.L.: Around the edge of language: intonation. Harvard Educ. Rev. 34(2), 282–296 (1964)

17.

Campbell, W.N.: CHATR: a high-definition speech re-sequencing system. In: Proceedings of 3rd ASA/ASJ Joint Meeting, pp. 1223–1228 (1996)

18.

Cole, J., Mo, Y., Hasegawa-Johnson, M.: Signal-based and expectation-based factors in the perception of prosodic prominence. Lab. Phonology 1(2), 425–452 (2010)CrossRef

19.

Cooper, F.S.: Speech synthesizers. In: Proceedings of 4th International Congress of Phonetic Sciences (ICPhS’61), pp. 3–13 (1962)

20.

D’Ausilio, A., Maffongelli, L., Bartoli, E., Campanella, M., Ferrari, E., Berry, J., Fadiga, L.: Listening to speech recruits specific tongue motor synergies as revealed by transcranial magnetic stimulation and tissue-doppler ultrasound imaging. Philos. Trans. R. Soc. B: Biol. Sci. 369(1644), 20130418 (2014)CrossRef

21.

Denes, P.B., Pinson, E.N.: The Speech Chain, p. 121. Bell Laboratory Educational Publication, New York (1963)

22.

Deng, L.: A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Trans. Signal Inf. Process. 3, e2 (2014)CrossRef

23.

Deng, L., Li, X.: Machine learning paradigms for speech recognition: an overview. IEEE Trans. Audio, Speech Lang. Process. 21(5), 1060–1089 (2013)CrossRef

24.

Dutoit, T.: An Introduction to Text-to-Speech Synthesis, vol. 3. Springer, New York (1997)

25.

Eriksson, A., Thunberg, G.C., Traunmüller, H.: Syllable prominence: a matter of vocal effort, phonetic distinctness and top-down processing. In: Proceedings of European Conference on Speech Communication and Technology Aalborg, vol. 1, pp. 399–402, September 2001

26.

Fant, C.G.M., Martony, J., Rengman, U., Risberg, A.: OVE II synthesis strategy. In: Proceedings of the Speech Communication Seminar F, vol. 5 (1962)

27.

Farouk, M.H.: Application of Wavelets in Speech Processing. Springer, New York (2014)CrossRefMATH

28.

Flanagan, J.L.: Speech Analysis, Synthesis and Perception, vol. 1, 2nd edn. Springer, Heidelberg (1972)CrossRef

29.

Flanagan, J.L.: Note on the design of “terminal-analog” speech synthesizers. J. Acoust. Soc. Am. 29(2), 306–310 (1957)CrossRefMathSciNet

30.

Frank, S.L., Bod, R., Christiansen, M.H.: How hierarchical is language use? Proc. R. Soc. B: Biol. Sci. 279, 4522–4531 (2012)CrossRef

31.

Fujisaki, H., Hirose, K.: Analysis of voice fundamental frequency contours for declarative sentences of Japanese. J. Acoust. Soc. Jpn. (E) 5(4), 233–241 (1984)CrossRef

32.

Fujisaki, H., Sudo, H.: A generative model for the prosody of connected speech in japanese. Annu. Rep. Eng. Res. Inst. 30, 75–80 (1971)

33.

Fukui, K., Ishikawa, Y., Sawa, T., Shintaku, E., Honda, M., Takanishi, A.: New anthropomorphic talking robot having a three-dimensional articulation mechanism and improved pitch range. In: 2007 IEEE International Conference on Robotics and Automation pp. 2922–2927. IEEE (2007)

34.

Goldsmith, J.A.: Autosegmental and Metrical Phonology, vol. 11. Blackwell, Oxford (1990)

35.

Grossman, A., Morlet, J.: Decomposition of functions into wavelets of constant shape, and related transforms. Math. Phys. Lect. Recent Results 11, 135–165 (1985)CrossRef

36.

Halle, M., Vergnaud, J.R.: Three dimensional phonology. J. Linguist. Res. 1(1), 83–105 (1980)

37.

Halle, M., Vergnaud, J.R., et al.: Metrical Structures in Phonology. MIT, Cambridge (1978)

38.

Hannukainen, A., Lukkari, T., Malinen, J., Palo, P.: Vowel formants from the wave equation. J. Acoust. Soc. Am. 122(1), EL1–EL7 (2007)CrossRef

39.

Hertz, S.R.: From text to speech with SRS. J. Acoust. Soc. Am. 72(4), 1155–1170 (1982)CrossRef

40.

Hertz, S.R., Kadin, J., Karplus, K.J.: The delta rule development system for speech synthesis from text. Proc. IEEE 73(11), 1589–1601 (1985)CrossRef

41.

Hirschberg, J.: Pitch accent in context: predicting intonational prominence from text. Artif. Intell. 63(1–2), 305–340 (1993)CrossRef

42.

Hunt, A.J., Black, A.W.: Unit selection in a concatenative speech synthesis system using a large speech database. In: Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-96, vol. 1, pp. 373–376. IEEE (1996)

43.

King, S.: Measuring a decade of progress in text-to-speech. Loguens 1(1) (2014)

44.

Klatt, D.H.: Review of text-to-speech conversion for english. J. Acoust. Soc. Am. 82(3), 737–793 (1987)CrossRef

45.

Klatt, D.: Acoustic theory of terminal analog speech synthesis. In: Proceedings of 1972 International Conference on Speech Communication Processing, Boston, MA (1972)

46.

Kleijn, W.B.: Principles of speech coding. In: Benesty, J., Sondhi, M.M., Huang, Y. (eds.) Springer Handbook of Speech Processing, pp. 283–306. Springer, Heidelberg (2008) CrossRef

47.

Kochanski, G., Shih, C.: Stem-ml: language-independent prosody description. In: INTERSPEECH, pp. 239–242 (2000)

48.

Kochanski, G., Shih, C.: Prosody modeling with soft templates. Speech Commun. 39(3), 311–352 (2003)CrossRefMATH

49.

Kruschke, H., Lenz, M.: Estimation of the parameters of the quantitative intonation model with continuous wavelet analysis. In: INTERSPEECH (2003)

50.

Lei, M., Wu, Y.J., Soong, F.K., Ling, Z.H., Dai, L.R.: A hierarchical f0 modeling method for HMM-based speech synthesis. In: INTERSPEECH, pp. 2170–2173 (2010)

51.

Liberman, A.M., Cooper, F.S., Shankweiler, D.P., Studdert-Kennedy, M.: Perception of the speech code. Psychol. Rev. 74(6), 431 (1967)CrossRef

52.

Liberman, A.M., Mattingly, I.G.: The motor theory of speech perception revised. Cognition 21(1), 1–36 (1985)CrossRef

53.

Ling, Z.H., Richmond, K., Yamagishi, J.: Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression. IEEE Trans. Audio Speech Lang. Process. 21(1), 207–219 (2013)CrossRef

54.

Mallat, S.: A wavelet tour of signal processing. Access Online via Elsevier (1999)

55.

Mishra, T., Santen, J.V., Klabbers, E.: Decomposition of pitch curves in the general superpositional intonation model. In: Speech Prosody, Dresden, Germany (2006)

56.

Moro, E.B.: A 19th-century speaking machine: the tecnefón of severino perez y vazquez. Historiographia Linguistica 34(1), 19–36 (2007)CrossRefMathSciNet

57.

Nishikawa, K., Asama, K., Hayashi, K., Takanobu, H., Takanishi, A.: Development of a talking robot. In: Proceedings of 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems 2000 (IROS 2000), vol. 3, pp. 1760–1765. IEEE (2000)

58.

Öhman, S.: Word and sentence intonation: a quantitative model. Speech Transmission Laboratory, Department of Speech Communication, Royal Institute of Technology (1967)

59.

Pfeifer, R., Lungarella, M., Iida, F.: Self-organization, embodiment, and biologically inspired robotics. Science 318(5853), 1088–1093 (2007)CrossRef

60.

Raitio, T., Lu, H., Kane, J., Suni, A., Vainio, M., King, S., Alku, P.: Voice source modelling using deep neural networks for statistical parametric speech synthesis. In: 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, September 2014 (accepted)

61.

Raitio, T., Suni, A., Juvela, L., Vainio, M., Alku, P.: Deep neural network based trainable voice source model for synthesis of speech with varying vocal effort. In: Proceedings of Interspeech, Singapore, accepted: September 2014

62.

Raitio, T., Suni, A., Pohjalainen, J., Airaksinen, M., Vainio, M., Alku, P.: Analysis and synthesis of shouted speech. In: Interspeech, Lyon, France, pp. 1544–1548, August 2013

63.

Raitio, T., Suni, A., Vainio, M., Alku, P.: Analysis of HMM-based lombard speech synthesis. In: Interspeech, Florence, Italy, pp. 2781–2784, August 2011

64.

Raitio, T., Suni, A., Vainio, M., Alku, P.: Synthesis and perception of breathy, normal, and lombard speech in the presence of noise. Comput. Speech Lang. 28(2), 648–664 (2014)CrossRef

65.

Ramachandran, R., Mammone, R.: Modern Methods of Speech Processing. Springer, New York (1995)CrossRef

66.

Riley, M.D.: Speech Time-Frequency Representation, vol. 63. Springer, New York (1989)

67.

van Rooij, J.C., Plomp, R.: The effect of linguistic entropy on speech perception in noise in young and elderly listeners. J. Acoust. Soc. Am. 90(6), 2985–2991 (1991)CrossRef

68.

van Santen, J.P., Mishra, T., Klabbers, E.: Estimating phrase curves in the general superpositional intonation model. In: Fifth ISCA Workshop on Speech Synthesis (2004)

69.

Schroeder, M.R.: A brief history of synthetic speech. Speech Commun. 13(1), 231–237 (1993)CrossRefMathSciNet

70.

Simko, J., Cummins, F.: Embodied task dynamics. Psychol. Rev. 117(4), 1229 (2010)CrossRef

71.

Šimko, J., O’Dell, M., Vainio, M.: Emergent consonantal quantity contrast and context-dependence of gestural phasing. J. Phonetics 44, 130–151 (2014)CrossRef

72.

Sondhi, M.M., Schroeter, J.: A hybrid time-frequency domain articulatory speech synthesizer. IEEE Trans. Acoust. Speech Signal Process. 35(7), 955–967 (1987)CrossRef

73.

Sproat, R.W.: Multilingual Text-to-Speech Synthesis. Kluwer Academic Publishers, Boston (1997)

74.

Story, B.H.: A parametric model of the vocal tract area function for vowel and consonant simulation. J. Acoust. Soc. Am. 117(5), 3231–3254 (2005)CrossRef

75.

Suni, A., Aalto, D., Raitio, T., Alku, P., Vainio, M.: Wavelets for intonation modeling in HMM speech synthesis. In: 8th ISCA Speech Synthesis Workshop (SSW8), Barcelona, Spain, pp. 285–290, August-September 2013

76.

Suni, A., Raitio, T., Vainio, M., Alku, P.: The GlottHMM speech synthesis entry for Blizzard Challenge 2010. In: Blizzard Challenge 2010 Workshop, Kyoto, Japan, September 2010

77.

Suni, A., Raitio, T., Vainio, M., Alku, P.: The GlottHMM entry for Blizzard Challenge 2011: utilizing source unit selection in HMM-based speech synthesis for improved excitation generation. In: Blizzard Challenge 2011 Workshop, Florence, Italy, September 2011

78.

Suni, A., Raitio, T., Vainio, M., Alku, P.: The GlottHMM entry for Blizzard Challenge 2012 - hybrid approach. In: Blizzard Challenge 2012 Workshop, Portland, Oregon, September 2012

79.

Suni, A., Simko, J., Aalto, D., Vainio, M.: Continuous wavelet transform in text-to-speech synthesis prosody control (in preparation)

80.

Suni, A.S., Aalto, D., Raitio, T., Alku, P., Vainio, M., et al.: Wavelets for intonation modeling in HMM speech synthesis. In: Proceedings of 8th ISCA Workshop on Speech Synthesis, Barcelona, 31 August-2 September 2013

81.

Taylor, P.: Text-to-Speech Synthesis. Cambridge University Press, Cambridge (2009)CrossRef

82.

Tokuda, K., Kobayashi, T., Imai, S.: Speech parameter generation from HMM using dynamic features. In: 1995 International Conference on Acoustics, Speech, and Signal Processing, ICASSP-95, vol. 1, pp. 660–663. IEEE (1995)

83.

Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., Kitamura, T.: Speech parameter generation algorithms for HMM-based speech synthesis. In: Proceedings of 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP’00, vol. 3, pp. 1315–1318. IEEE (2000)

84.

Vainio, L., Tiainen, M., Tiippana, K., Vainio, M.: Shared processing of planning articulatory gestures and grasping. Exp. Brain Res. 232(7), 2359–2368 (2014)CrossRef

85.

Vainio, L., Schulman, M., Tiippana, K., Vainio, M.: Effect of syllable articulation on precision and power grip performance. PloS One 8(1), e53061 (2013)CrossRef

86.

Vainio, M., Järvikivi, J.: Tonal features, intensity, and word order in the perception of prominence. J. Phonetics 34, 319–342 (2006)CrossRef

87.

Vainio, M., Suni, A., Aalto, D.: Continuous wavelet transform for analysis of speech prosody. In: Proceedings of TRASP 2013-Tools and Resources for the Analysis of Speech Prosody, An Interspeech 2013 Satellite Event, August 30 2013, Laboratoire Parole et Language, Aix-en-Provence, France (2013)

88.

Vainio, M., Suni, A., Aalto, D.: Emphasis, word prominence, and continuous wavelet transform in the control of HMM based synthesis. In: Speech Prosody in Speech Synthesis - Modeling, Realizing, Converting Prosody for High Quality and Flexible Speech Synthesis, Prosody, Phonology and Phonetics. Springer (2015)

89.

Vainio, M., Suni, A., Raitio, T., Nurminen, J., Järvikivi, J., Alku, P.: New method for delexicalization and its application to prosodic tagging for text-to-speech synthesis. In: Interspeech, Brighton, UK, pp. 1703–1706, September 2009

90.

Vainio, M., Suni, A., Sirjola, P.: Developing a finnish concept-to-speech system. In: Langemets, M., Penjam, P. (eds.) Proceedings of the Second Baltic Conference on Human Language Technologies, Tallinn, pp. 201–206, 4–5 April 2005

91.

von Kempelen, W., de Pázmánd, W.K., Autriche, M.: Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine. bei JV Degen (1791)

92.

Watts, O.S.: Unsupervised learning for text-to-speech synthesis. Ph.D. thesis (2013)

93.

Zen, H., Braunschweiler, N.: Context-dependent additive log f_0 model for HMM-based speech synthesis. In: INTERSPEECH, pp. 2091–2094 (2009)

94.

Zen, H., Tokuda, K., Black, A.W.: Statistical parametric speech synthesis. Speech Commun. 51(11), 1039–1064 (2009)CrossRef

Title: Phonetics and Machine Learning: Hierarchical Modelling of Prosody in Statistical Speech Synthesis
Author: Martti Vainio
Publisher: Springer International Publishing
Book: Statistical Language and Speech Processing
Print ISBN: 978-3-319-11396-8

Electronic ISBN: 978-3-319-11397-5

Copyright Year: 2014
DOI: https://doi.org/10.1007/978-3-319-11397-5_3

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Premium Partner