2017 | Original Paper | Book Chapter

Deep Recurrent Neural Networks in Speech Synthesis Using a Continuous Vocoder

Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh

Published in: Speech and Computer

Publisher: Springer International Publishing

Abstract

In our earlier work on statistical parametric speech synthesis, we proposed a vocoder using continuous F0 in combination with Maximum Voiced Frequency (MVF), which was successfully used with a feed-forward deep neural network (DNN). The advantage of the continuous vocoder in this scenario is that its parameters are simpler to model than those of traditional vocoders with discontinuous F0. However, feed-forward DNNs lack sequence modeling capability, which can degrade the quality of synthesized speech. To avoid this problem, we propose the use of sequence-to-sequence modeling with recurrent neural networks (RNNs). In this paper, four neural network architectures (long short-term memory (LSTM), bidirectional LSTM (BLSTM), gated recurrent unit (GRU), and standard RNN) are investigated and applied with the continuous vocoder to model F0, MVF, and the Mel-Generalized Cepstrum (MGC) for more natural-sounding speech synthesis. Objective and subjective evaluations show that the proposed framework converges faster and achieves state-of-the-art speech synthesis performance, outperforming the conventional feed-forward DNN.
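To make the modeling setup concrete, the following is a minimal PyTorch sketch of one of the four investigated architectures: a bidirectional LSTM that maps frame-level linguistic features to the continuous vocoder's parameter streams (MGC, continuous F0, and MVF). This is an illustrative assumption, not the authors' implementation; the class name BLSTMAcousticModel and all feature dimensions are hypothetical.

```python
# Minimal sketch (assumption, not the authors' code): a bidirectional LSTM
# acoustic model that predicts continuous-vocoder parameters per frame.
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):                # hypothetical class name
    def __init__(self, in_dim=425, hidden=256, mgc_dim=60):
        super().__init__()
        # Stacked BLSTM layers capture the sequential context that a
        # feed-forward DNN cannot model.
        self.rnn = nn.LSTM(in_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        # One linear output layer for all streams: MGC coefficients plus
        # two continuous scalars per frame (F0 and MVF).
        self.out = nn.Linear(2 * hidden, mgc_dim + 2)

    def forward(self, x):                           # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)                           # h: (batch, frames, 2*hidden)
        y = self.out(h)
        return y[..., :-2], y[..., -2], y[..., -1]   # MGC, F0, MVF

# Dummy usage: dimensions are placeholders, not values from the paper.
model = BLSTMAcousticModel()
x = torch.randn(4, 200, 425)                         # 4 utterances, 200 frames
mgc, f0, mvf = model(x)
target_mgc = torch.randn_like(mgc)                   # extracted MGC would go here
loss = nn.functional.mse_loss(mgc, target_mgc)       # frame-level MSE training loss
loss.backward()
```

Training such a model would minimize a frame-level loss between predicted and extracted vocoder parameters; at synthesis time, the predicted streams drive the continuous vocoder to reconstruct the waveform.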

Metadata
Title
Deep Recurrent Neural Networks in Speech Synthesis Using a Continuous Vocoder
Authors
Mohammed Salah Al-Radhi
Tamás Gábor Csapó
Géza Németh
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-66429-3_27
