Duration modelling and evaluation for Arabic statistical parametric speech synthesis

Multimedia Tools and Applications

Abstract

Sound duration governs rhythm and speech rate. Moreover, in some languages phoneme length is an important phonetic and prosodic factor: in Arabic, for example, gemination and vowel quantity are two key characteristics of the language. Accurate duration modelling is therefore crucial for Arabic text-to-speech (TTS) systems. This paper aims to improve phone duration modelling for Arabic statistical parametric speech synthesis using models based on deep neural networks (DNNs). In recent years, DNNs have frequently been used for parametric speech synthesis in place of hidden Markov models (HMMs). Several variants of DNN-based duration models for Arabic are therefore investigated. The novelty lies in training a specific DNN model for each class of sounds, i.e. short vowels, long vowels, simple consonants and geminated consonants. This choice is motivated by the improvement we previously achieved in the quality of Arabic parametric speech synthesis by introducing two Arabic-specific features, gemination and vowel quantity, into the standard HTS feature set. Both objective and subjective evaluations show that using a specific model for each class of sounds yields more accurate phone duration modelling in Arabic parametric speech synthesis, outperforming state-of-the-art duration modelling systems.
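The class-specific idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-class DNN is replaced here by a placeholder that simply predicts the mean training duration, and all names (`MeanDurationModel`, `train_per_class`, `predict_duration`) and the toy durations are hypothetical. Only the routing of each phone to the model of its sound class reflects the approach named above.

```python
from collections import defaultdict
from statistics import mean

# The four sound classes named in the abstract.
CLASSES = (
    "short_vowel",
    "long_vowel",
    "simple_consonant",
    "geminated_consonant",
)


class MeanDurationModel:
    """Hypothetical stand-in for a per-class DNN: predicts the mean training duration."""

    def __init__(self):
        self.mean_ms = None

    def fit(self, durations_ms):
        self.mean_ms = mean(durations_ms)
        return self

    def predict(self, features=None):
        # A real model would map linguistic features to a duration;
        # this placeholder ignores them.
        return self.mean_ms


def train_per_class(samples):
    """Train one model per sound class from (sound_class, duration_ms) pairs."""
    grouped = defaultdict(list)
    for sound_class, duration_ms in samples:
        if sound_class not in CLASSES:
            raise ValueError(f"unknown sound class: {sound_class}")
        grouped[sound_class].append(duration_ms)
    # One independent model for each class that has training data.
    return {c: MeanDurationModel().fit(d) for c, d in grouped.items()}


def predict_duration(models, sound_class, features=None):
    """Route a phone to the model trained for its sound class."""
    return models[sound_class].predict(features)


# Toy, made-up durations in milliseconds (not data from the paper).
toy_samples = [
    ("short_vowel", 60), ("short_vowel", 80),
    ("long_vowel", 140), ("long_vowel", 160),
    ("simple_consonant", 70), ("simple_consonant", 90),
    ("geminated_consonant", 150), ("geminated_consonant", 170),
]
models = train_per_class(toy_samples)
```

In this sketch, swapping `MeanDurationModel` for a trained neural regressor would leave the routing logic unchanged, which is the point of keeping the class dispatch separate from the model itself.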

Acknowledgments

This research work was conducted in the framework of PHC-Utique Program, financed by CMCU (Comité mixte de coopération universitaire), grant No.15G1405.

Author information

Corresponding author

Correspondence to Zied Mnasri.

About this article

Cite this article

Zangar, I., Mnasri, Z., Colotte, V. et al. Duration modelling and evaluation for Arabic statistical parametric speech synthesis. Multimed Tools Appl 80, 8331–8353 (2021). https://doi.org/10.1007/s11042-020-09901-7

  • DOI: https://doi.org/10.1007/s11042-020-09901-7
