Abstract
Sound duration determines rhythm and speech rate. In some languages, phoneme length is also an important phonetic and prosodic factor. In Arabic, for example, gemination and vowel quantity are two important characteristics of the language, so accurate duration modelling is crucial for Arabic TTS systems. This paper aims to improve phone duration modelling for Arabic statistical parametric speech synthesis using DNN-based models. In recent years, DNNs have frequently been used for parametric speech synthesis in place of HMMs. Several variants of DNN-based duration models for Arabic are therefore investigated. The novelty lies in training a specific DNN model for each class of sounds, i.e. short vowels, long vowels, simple consonants and geminated consonants. This choice is motivated by the improvement we previously achieved in the quality of Arabic parametric speech synthesis by introducing two Arabic-specific features, gemination and vowel quantity, into the standard HTS feature set. Both objective and subjective evaluations show that using a specific model for each class of sounds leads to more accurate modelling of phone duration in Arabic parametric speech synthesis, outperforming the state-of-the-art duration modelling systems.
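The class-specific idea described in the abstract can be illustrated with a minimal sketch. It shows only the routing of each phone to a model trained on its own sound class. The per-class predictor here is a deliberately simple placeholder (the mean log-duration of the class), not the DNN used in the paper, and all names (`MeanLogDurationModel`, `train_per_class`, `predict_duration`) and the toy durations are assumptions made for illustration.

```python
import math
from collections import defaultdict

# The four sound classes named in the abstract.
CLASSES = ("short_vowel", "long_vowel", "simple_consonant", "geminated_consonant")


class MeanLogDurationModel:
    """Hypothetical stand-in for a class-specific DNN: predicts the
    geometric mean of the training durations for its class."""

    def __init__(self):
        self.mean_log = 0.0

    def fit(self, durations_ms):
        logs = [math.log(d) for d in durations_ms]
        self.mean_log = sum(logs) / len(logs)
        return self

    def predict(self, features=None):
        # A real model would use linguistic features; this placeholder ignores them.
        return math.exp(self.mean_log)


def train_per_class(samples):
    """Train one model per sound class from (sound_class, duration_ms) pairs."""
    grouped = defaultdict(list)
    for sound_class, duration_ms in samples:
        if sound_class not in CLASSES:
            raise ValueError(f"unknown sound class: {sound_class}")
        grouped[sound_class].append(duration_ms)
    return {c: MeanLogDurationModel().fit(d) for c, d in grouped.items()}


def predict_duration(models, sound_class, features=None):
    """Route the phone to the model trained for its class."""
    return models[sound_class].predict(features)


# Toy, made-up durations in milliseconds (not data from the paper).
toy = [
    ("short_vowel", 60.0), ("short_vowel", 80.0),
    ("long_vowel", 140.0), ("long_vowel", 160.0),
    ("simple_consonant", 70.0),
    ("geminated_consonant", 150.0),
]
models = train_per_class(toy)
```

In this sketch each class keeps its own parameters, so the statistics of long vowels or geminated consonants are never averaged together with those of their short or simple counterparts; replacing the placeholder with a trained network per class would keep the same routing structure.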
Acknowledgments
This research was conducted within the framework of the PHC-Utique Program, financed by the CMCU (Comité mixte de coopération universitaire), grant No. 15G1405.
About this article
Cite this article
Zangar, I., Mnasri, Z., Colotte, V. et al. Duration modelling and evaluation for Arabic statistical parametric speech synthesis. Multimed Tools Appl 80, 8331–8353 (2021). https://doi.org/10.1007/s11042-020-09901-7
DOI: https://doi.org/10.1007/s11042-020-09901-7