Published in: International Journal of Speech Technology 3/2016

18.05.2016

Arabic speech synthesis and diacritic recognition

Authors: Ilyes Rebai, Yassine BenAyed



Abstract

Text-to-speech (TTS) systems, also known as speech synthesizers, have become one of the most important technologies of recent years owing to their expanding range of applications. Considerable work on speech synthesis has been done for English and French, whereas many other languages, including Arabic, have only recently received attention. Arabic speech synthesis has not made sufficient progress and is still at an early stage, with low speech quality. Speech synthesis systems face several problems (e.g. speech quality, articulatory effects). Different methods have been proposed to address these issues, such as the use of large and varied unit sizes. This method is mainly implemented within the concatenative approach to improve speech quality, and several studies have demonstrated its effectiveness. This paper presents an efficient Arabic TTS system based on the statistical parametric approach and non-uniform-unit speech synthesis. Our system includes a diacritization engine. Modern Arabic text is written without the vowels, also called diacritic marks. These marks are, however, essential for determining the correct pronunciation of the text, which is why a diacritization engine is incorporated into our system. In this work, we propose a simple approach based on deep neural networks (DNNs): DNNs are trained to directly predict the diacritic marks and to predict the spectral and prosodic parameters. Furthermore, we propose a new, simple stacked neural network approach to improve the accuracy of the acoustic models. Experimental results show that our diacritization system generates fully diacritized text with high precision and that our synthesis system produces high-quality speech.
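The two ideas sketched in the abstract, a DNN mapping textual features to diacritic classes and a second "stacked" network that also receives the first network's predictions, can be illustrated with a minimal NumPy forward pass. This is an illustrative sketch only, not the authors' implementation: the context-window size, layer widths, number of diacritic classes, and random untrained weights are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, w1, b1, w2, b2):
    """One hidden layer; tanh at the output, as in the paper's footnote."""
    h = np.tanh(x @ w1 + b1)
    return np.tanh(h @ w2 + b2)

# Hypothetical sizes: a window of 5 characters, each one-hot over a
# 40-symbol alphabet, mapped to 8 diacritic classes.
n_in, n_hid, n_out = 5 * 40, 64, 8

# First-stage network: character-context features -> diacritic scores.
w1, b1 = 0.1 * rng.normal(size=(n_in, n_hid)), np.zeros(n_hid)
w2, b2 = 0.1 * rng.normal(size=(n_hid, n_out)), np.zeros(n_out)

# Second-stage ("stacked") network: the original features and the
# first-stage predictions are concatenated and fed to a second model.
w3, b3 = 0.1 * rng.normal(size=(n_in + n_out, n_hid)), np.zeros(n_hid)
w4, b4 = 0.1 * rng.normal(size=(n_hid, n_out)), np.zeros(n_out)

x = rng.normal(size=(16, n_in))           # a batch of 16 feature vectors
stage1 = mlp_forward(x, w1, b1, w2, b2)   # first-pass diacritic scores
stacked_in = np.concatenate([x, stage1], axis=1)
stage2 = mlp_forward(stacked_in, w3, b3, w4, b4)

pred = stage2.argmax(axis=1)              # predicted diacritic class per sample
```

The same stacking pattern applies to the acoustic models, where the outputs would be spectral and prosodic parameters rather than class scores.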


Footnotes
1
Although the softmax activation function is popular in DNN-based classification, our preliminary experiments showed that the DNN with the tangent sigmoid activation function at the output layer consistently outperformed those with the softmax one.
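The footnote's comparison can be made concrete with a small NumPy sketch; the logit values below are invented for illustration. Softmax normalizes the output layer into a probability distribution, while tanh produces independent scores in (-1, 1); both preserve the argmax, so the predicted class at decoding time is identical and the reported difference concerns training behavior.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([2.0, -1.0, 0.5])   # example output-layer pre-activations

p_soft = softmax(z)               # probabilities summing to 1
p_tanh = np.tanh(z)               # independent scores in (-1, 1)

# Both activations select the same class, so only training dynamics differ.
assert p_soft.argmax() == p_tanh.argmax()
```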
 
Metadata
Title
Arabic speech synthesis and diacritic recognition
Authors
Ilyes Rebai
Yassine BenAyed
Publication date
18.05.2016
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 3/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-016-9342-8
