Towards a high quality Arabic speech synthesis system based on neural networks and residual excited vocal tract model

Chouireb, Fatima; Guerti, Mhania

doi:10.1007/s11760-007-0038-z

Original Paper
Published: 18 October 2007

Volume 2, pages 73–87, (2008)

Signal, Image and Video Processing

Fatima Chouireb¹ & Mhania Guerti²

Abstract

Text-to-speech conversion has traditionally been performed either by concatenating short samples of speech or by using rule-based systems to convert a phonetic representation of speech into an acoustic representation, which is then converted into speech. This paper describes a text-to-speech synthesis system for Modern Standard Arabic based on artificial neural networks and a residual-excited LPC coder. The networks offer a storage-efficient means of synthesis without the need for explicit rule enumeration. In their training stage, these networks require large, prosodically labeled continuous-speech databases. Because no such database was available for Arabic, we developed one for this purpose and discuss the stages of its development. In addition to the interpolation capabilities of the neural networks, the coder parameters are linearly interpolated to create smooth transitions at segment boundaries. A residual-excited all-pole vocal tract model and a neural-network-based prosodic-information synthesizer are also described.
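
The two signal-processing ideas named in the abstract, linear interpolation of coder parameters at segment boundaries and a residual-excited all-pole vocal tract model, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`interpolate_frames`, `inverse_filter`, `all_pole_synthesis`), the sign convention A(z) = 1 + Σ a_k z^-k, and the toy coefficient values are all assumptions made for illustration. The abstract does not state which parameter representation is interpolated, so the interpolation routine works on generic parameter vectors; interpolating raw predictor coefficients directly does not in general guarantee a stable filter.

```python
def interpolate_frames(left, right, n_steps):
    """Linearly interpolate between two parameter vectors of equal length.

    Returns n_steps + 1 vectors, starting at `left` and ending at `right`.
    (Hypothetical helper; the parameter representation is an assumption.)
    """
    if len(left) != len(right):
        raise ValueError("parameter vectors must have the same length")
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    frames = []
    for i in range(n_steps + 1):
        t = i / n_steps  # 0.0 at the left segment, 1.0 at the right segment
        frames.append([(1.0 - t) * l + t * r for l, r in zip(left, right)])
    return frames


def inverse_filter(signal, a):
    """Analysis filter A(z): residual e[n] = s[n] + sum_k a[k] * s[n-k].

    `a` holds a[1..p]; samples before n = 0 are treated as zero.
    """
    residual = []
    for n, s_n in enumerate(signal):
        acc = s_n
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc += a_k * signal[n - k]
        residual.append(acc)
    return residual


def all_pole_synthesis(residual, a):
    """Synthesis filter 1/A(z): s[n] = e[n] - sum_k a[k] * s[n-k].

    Exciting the all-pole filter with the residual obtained from the
    matching analysis filter reproduces the original signal.
    """
    out = []
    for n, e_n in enumerate(residual):
        acc = e_n
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc -= a_k * out[n - k]
        out.append(acc)
    return out
```

With a fixed coefficient set, passing a signal through `inverse_filter` and then feeding the residual to `all_pole_synthesis` recovers the signal up to floating-point error, which is the property that makes residual excitation attractive for high-quality resynthesis. In a synthesizer, the vectors returned by `interpolate_frames` would be applied frame by frame across a segment boundary rather than switching parameters abruptly.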

Author information

Authors and Affiliations

Département de Génie Electrique, Université Amar Telidji de Laghouat, Route de Ghardaia, BP 37G, 03000, Laghouat, Algeria
Fatima Chouireb
Ecole Nationale Polytechnique, B.P. 182, El Harrach, 16200, Algiers, Algeria
Mhania Guerti

Corresponding author

Correspondence to Fatima Chouireb.

About this article

Cite this article

Chouireb, F., Guerti, M. Towards a high quality Arabic speech synthesis system based on neural networks and residual excited vocal tract model. SIViP 2, 73–87 (2008). https://doi.org/10.1007/s11760-007-0038-z

Received: 28 October 2006
Revised: 23 September 2007
Accepted: 24 September 2007
Published: 18 October 2007
Issue Date: January 2008
DOI: https://doi.org/10.1007/s11760-007-0038-z
