Published in: International Journal of Speech Technology 3/2016

18-05-2016

Arabic speech synthesis and diacritic recognition

Authors: Ilyes Rebai, Yassine BenAyed

Abstract

Text-to-speech (TTS) systems, also known as speech synthesizers, have become an important technology in recent years owing to their expanding range of applications. Much of the work on speech synthesis has targeted English and French, whereas many other languages, including Arabic, have only recently received attention. Arabic speech synthesis has not progressed sufficiently and is still at an early stage, with low speech quality. Speech synthesis systems face several problems (e.g. speech quality and articulatory effects), and different methods have been proposed to address them, such as the use of large and varied unit sizes. This method is mainly implemented within the concatenative approach to improve speech quality, and several works have demonstrated its effectiveness. This paper presents an efficient Arabic TTS system based on the statistical parametric approach and non-uniform unit speech synthesis. Our system includes a diacritization engine: modern Arabic text is written without the vowel signs, also called diacritic marks, yet these marks are essential for determining the correct pronunciation of the text, which is why a diacritization engine is incorporated into our system. In this work, we propose a simple approach based on deep neural networks (DNNs). DNNs are trained to directly predict the diacritic marks and to predict the spectral and prosodic parameters. Furthermore, we propose a new, simple stacked neural network approach to improve the accuracy of the acoustic models. Experimental results show that our diacritization system generates fully diacritized text with high precision and that our synthesis system produces high-quality speech.
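To make the modeling idea concrete, the sketch below shows one possible way to train a feed-forward deep neural network that predicts the diacritic mark of a character from a window of surrounding characters, in the spirit of the diacritization engine described above. It is a minimal illustration only: the vocabulary size, context window, layer sizes, diacritic inventory, and the PyTorch framework are assumptions made for the example, not the configuration reported in the paper (which also trains DNNs, in a stacked arrangement, to predict spectral and prosodic parameters for synthesis).

```python
# Illustrative sketch (not the authors' exact architecture): a feed-forward
# network that predicts the diacritic mark of a character from a fixed
# window of surrounding characters. All sizes below are assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE = 40      # assumed number of distinct Arabic letters/symbols
WINDOW = 5           # assumed context: 2 characters on each side + target
EMBED_DIM = 16
NUM_DIACRITICS = 8   # assumed inventory: fatha, damma, kasra, sukun, tanween forms, ...

class DiacriticDNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.mlp = nn.Sequential(
            nn.Linear(WINDOW * EMBED_DIM, 128),
            nn.Tanh(),
            nn.Linear(128, 128),
            nn.Tanh(),
            nn.Linear(128, NUM_DIACRITICS),
        )

    def forward(self, char_ids):                 # char_ids: (batch, WINDOW)
        x = self.embed(char_ids).flatten(1)      # (batch, WINDOW * EMBED_DIM)
        return self.mlp(x)                       # unnormalized class scores

# Toy training step on random data, just to show the shape of the training loop.
model = DiacriticDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

chars = torch.randint(0, VOCAB_SIZE, (32, WINDOW))   # fake character windows
labels = torch.randint(0, NUM_DIACRITICS, (32,))     # fake diacritic labels
optimizer.zero_grad()
loss = loss_fn(model(chars), labels)
loss.backward()
optimizer.step()
print(float(loss))
```

In a stacked variant, the predictions of a first network could be appended to the input features of a second network of the same shape; the paper's exact stacking scheme is not reproduced here.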

Footnotes
1
Although the softmax activation function is popular in DNN-based classification, our preliminary experiments showed that DNNs with a tangent sigmoid activation function at the output layer consistently outperformed those with a softmax output.
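For illustration, the fragment below contrasts the two output-layer choices mentioned in this footnote: a conventional linear-plus-softmax classification head trained with cross-entropy, and a head with a tangent sigmoid output trained here against one-hot targets with a squared-error loss. This is a minimal PyTorch sketch under those assumptions, not the authors' training setup.

```python
# Minimal sketch contrasting the two output-layer choices from the footnote.
# Layer sizes are arbitrary; with the tanh output layer, the targets are
# assumed to be one-hot vectors trained with mean squared error.
import torch
import torch.nn as nn

hidden, classes = 128, 8

# Conventional classification head: linear scores + cross-entropy
# (softmax is applied implicitly inside the loss).
softmax_head = nn.Linear(hidden, classes)
ce_loss = nn.CrossEntropyLoss()

# Alternative head from the footnote: tangent sigmoid at the output,
# trained here against one-hot targets with a squared-error loss.
tanh_head = nn.Sequential(nn.Linear(hidden, classes), nn.Tanh())
mse_loss = nn.MSELoss()

features = torch.randn(4, hidden)
labels = torch.randint(0, classes, (4,))

loss_softmax = ce_loss(softmax_head(features), labels)
one_hot = nn.functional.one_hot(labels, classes).float()
loss_tanh = mse_loss(tanh_head(features), one_hot)
print(float(loss_softmax), float(loss_tanh))
```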
 
Metadata
Title
Arabic speech synthesis and diacritic recognition
Authors
Ilyes Rebai
Yassine BenAyed
Publication date
18-05-2016
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 3/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-016-9342-8
