Abstract
Text-to-speech conversion has traditionally been performed either by concatenating short samples of speech or by using rule-based systems to convert a phonetic representation of speech into an acoustic representation, which is then converted into speech. This paper describes a text-to-speech synthesis system for Modern Standard Arabic based on artificial neural networks and a residual-excited LPC coder. The networks offer a storage-efficient means of synthesis without the need for explicit rule enumeration. They require large, prosodically labeled continuous-speech databases for training. As no such database was available for Arabic, we developed one for this purpose, and we discuss the stages of its development. In addition to exploiting the interpolation capabilities of neural networks, the system linearly interpolates the coder parameters to create smooth transitions at segment boundaries. A residual-excited all-pole vocal tract model and a neural-network-based prosodic-information synthesizer are also described.
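The linear interpolation of coder parameters at segment boundaries can be illustrated with a minimal sketch. The abstract does not state which parameter representation is interpolated or how many frames are involved, so the function name `interpolate_frames`, the three-element parameter vectors, and the choice of three intermediate frames are all assumptions made for illustration only.

```python
from typing import List, Sequence


def interpolate_frames(left: Sequence[float],
                       right: Sequence[float],
                       n_steps: int) -> List[List[float]]:
    """Return n_steps parameter vectors moving linearly from left to right.

    The endpoints themselves are excluded, so the result can be placed
    between the last frame of one segment and the first frame of the next.
    """
    if len(left) != len(right):
        raise ValueError("parameter vectors must have the same length")
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    frames = []
    for k in range(1, n_steps + 1):
        w = k / (n_steps + 1)  # weight of the right-hand frame, 0 < w < 1
        frames.append([(1.0 - w) * a + w * b for a, b in zip(left, right)])
    return frames


# Hypothetical coder parameters for the frames on either side of a boundary
# (illustrative values, not taken from the paper).
last_frame_of_segment_a = [0.25, 0.5, 1.0]
first_frame_of_segment_b = [0.75, 1.5, 2.0]

for frame in interpolate_frames(last_frame_of_segment_a,
                                first_frame_of_segment_b, 3):
    print(frame)
```

Whether such interpolation keeps the all-pole filter stable depends on the parameter representation: interpolating direct-form LPC coefficients does not guarantee a stable filter, which is one reason alternative representations are often preferred for smoothing.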
Cite this article
Chouireb, F., Guerti, M. Towards a high quality Arabic speech synthesis system based on neural networks and residual excited vocal tract model. SIViP 2, 73–87 (2008). https://doi.org/10.1007/s11760-007-0038-z
DOI: https://doi.org/10.1007/s11760-007-0038-z