Towards a high quality Arabic speech synthesis system based on neural networks and residual excited vocal tract model

  • Original Paper
  • Published in: Signal, Image and Video Processing

Abstract

Text-to-speech conversion has traditionally been performed either by concatenating short samples of speech or by using rule-based systems to convert a phonetic representation of speech into an acoustic representation, which is then converted into speech. This paper describes a text-to-speech synthesis system for Modern Standard Arabic based on artificial neural networks and a residual-excited LPC coder. The networks offer a storage-efficient means of synthesis without the need for explicit rule enumeration. Such neural networks require large, prosodically labeled continuous speech databases for training. Since no such database was available for Arabic, we developed one for this purpose, and we discuss the various stages of this development process. In addition to the interpolation capabilities of the neural networks, a linear interpolation of the coder parameters is performed to create smooth transitions at segment boundaries. A residual-excited all-pole vocal tract model and a prosodic-information synthesizer based on neural networks are also described in this paper.
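The two signal-processing ideas the abstract names, linear interpolation of coder parameters across segment boundaries and residual-excited all-pole synthesis, can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names (`interpolate_frames`, `synthesize_frame`), the direct-form recursion, and the frame layout are assumptions, and a production coder would typically interpolate in a domain such as line spectral frequencies to keep the resulting filter stable.

```python
def interpolate_frames(frame_a, frame_b, n_steps):
    """Linearly interpolate between two coder parameter frames,
    returning n_steps frames with the endpoints included."""
    frames = []
    for i in range(n_steps):
        t = i / (n_steps - 1) if n_steps > 1 else 0.0
        frames.append([(1.0 - t) * a + t * b
                       for a, b in zip(frame_a, frame_b)])
    return frames

def synthesize_frame(residual, lpc):
    """Residual-excited all-pole synthesis: filter the residual e[n]
    through 1/A(z), with A(z) = 1 + sum_k a_k z^{-k}, i.e.
        y[n] = e[n] - sum_k a_k * y[n-k]."""
    out = []
    for n, e in enumerate(residual):
        y = e
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                y -= a * out[n - k]
        out.append(y)
    return out
```

Feeding the interpolated parameter frames, one per synthesis frame, into the all-pole recursion is what produces the smooth spectral transition at a segment boundary instead of an audible discontinuity.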



Author information

Correspondence to Fatima Chouireb.


Cite this article

Chouireb, F., Guerti, M. Towards a high quality Arabic speech synthesis system based on neural networks and residual excited vocal tract model. SIViP 2, 73–87 (2008). https://doi.org/10.1007/s11760-007-0038-z
