Duration modelling and evaluation for Arabic statistical parametric speech synthesis

Zangar, Imene; Mnasri, Zied; Colotte, Vincent; Jouvet, Denis

doi:10.1007/s11042-020-09901-7

Published: 02 November 2020

Volume 80, pages 8331–8353, (2021)

Multimedia Tools and Applications

Imene Zangar¹, Zied Mnasri¹ (ORCID: orcid.org/0000-0002-8929-3609), Vincent Colotte² & Denis Jouvet²

Abstract

Sound duration determines rhythm and speech rate. Furthermore, in some languages phoneme length is an important phonetic and prosodic factor. In Arabic, for example, gemination and vowel quantity are two important characteristics of the language. Accurate duration modelling is therefore crucial for Arabic TTS systems. This paper aims to improve phone duration modelling for Arabic statistical parametric speech synthesis using DNN-based models. In recent years, DNNs have frequently been used for parametric speech synthesis in place of HMMs. Several variants of DNN-based duration models for Arabic are therefore investigated. The novelty consists in training a specific DNN model for each class of sounds, i.e. short vowels, long vowels, simple consonants and geminated consonants. This choice is motivated by the improvement we previously achieved in the quality of Arabic parametric speech synthesis by introducing two Arabic-specific features, gemination and vowel quantity, into the standard HTS feature set. Both objective and subjective evaluations show that using a specific model for each class of sounds leads to more accurate modelling of phone duration in Arabic parametric speech synthesis, outperforming state-of-the-art duration modelling systems.
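
The class-specific idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Phone` fields, the `MeanDurationModel` placeholder (which stands in for a per-class DNN and simply predicts the class mean) and the toy durations are all assumptions made for illustration. The sketch only shows how phones might be routed to one of four duration models according to vowel quantity and gemination.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Phone:
    # Hypothetical record; field names are illustrative, not from the paper.
    label: str
    is_vowel: bool
    is_long: bool        # long vowel, or geminated consonant
    duration_ms: float   # observed duration (ignored at prediction time)


def sound_class(phone: Phone) -> str:
    """Map a phone to one of the four classes named in the abstract."""
    if phone.is_vowel:
        return "long_vowel" if phone.is_long else "short_vowel"
    return "geminated_consonant" if phone.is_long else "simple_consonant"


class MeanDurationModel:
    """Placeholder for a per-class DNN: predicts the class mean duration."""

    def fit(self, phones):
        self.mean_ms = mean(p.duration_ms for p in phones)
        return self

    def predict(self, phone: Phone) -> float:
        return self.mean_ms


def train_class_models(phones):
    """Split the training phones by class and fit one model per class."""
    by_class = defaultdict(list)
    for p in phones:
        by_class[sound_class(p)].append(p)
    return {c: MeanDurationModel().fit(ps) for c, ps in by_class.items()}


def predict_duration(models, phone: Phone) -> float:
    """Route the phone to the model trained for its class."""
    return models[sound_class(phone)].predict(phone)


# Toy training data with made-up durations in milliseconds.
training = [
    Phone("a", True, False, 60.0),
    Phone("u", True, False, 80.0),
    Phone("aa", True, True, 150.0),
    Phone("b", False, False, 70.0),
    Phone("bb", False, True, 140.0),
]
models = train_class_models(training)
```

In a real system each placeholder would be replaced by a network trained on the linguistic features of its own class, while the routing step would stay the same.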

Acknowledgments

This research work was conducted within the framework of the PHC-Utique Program, financed by the CMCU (Comité mixte de coopération universitaire), grant No. 15G1405.

Author information

Authors and Affiliations

University Tunis El Manar, Ecole Nationale d’Ingénieurs de Tunis, Electrical Engineering Department, BP 37, 1002, Tunis, Tunisia
Imene Zangar & Zied Mnasri
Université de Lorraine, CNRS, Inria, LORIA, F-54000, Nancy, France
Vincent Colotte & Denis Jouvet

Corresponding author

Correspondence to Zied Mnasri.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zangar, I., Mnasri, Z., Colotte, V. et al. Duration modelling and evaluation for Arabic statistical parametric speech synthesis. Multimed Tools Appl 80, 8331–8353 (2021). https://doi.org/10.1007/s11042-020-09901-7

Received: 26 June 2020
Revised: 04 September 2020
Accepted: 16 September 2020
Published: 02 November 2020
Issue Date: March 2021
DOI: https://doi.org/10.1007/s11042-020-09901-7
