2019 | OriginalPaper | Chapter

Deep Neural Network Based 3D Articulatory Movement Prediction Using Both Text and Audio Inputs

Authors: Lingyun Yu, Jun Yu, Qiang Ling

Published in: MultiMedia Modeling

Publisher: Springer International Publishing


Abstract

Robust and accurate prediction of articulatory movements has various important applications, such as human-machine interaction. Many approaches have been proposed to solve the acoustic-to-articulatory mapping problem, but their precision is limited when only acoustic features are available. Recently, deep neural networks (DNNs) have brought tremendous success in many fields. To increase prediction accuracy, we propose a new network architecture, the bottleneck squeeze-and-excitation recurrent convolutional neural network (BSERCNN), for articulatory movement prediction. On the one hand, by introducing the squeeze-and-excitation (SE) module, BSERCNN can model the interdependencies and relationships between channels, which makes the model more efficient. On the other hand, phoneme-level text features and acoustic features are integrated as joint inputs to BSERCNN for better performance. Experiments show that BSERCNN achieves state-of-the-art performance, with a root-mean-squared error (RMSE) of 0.563 mm and a correlation coefficient of 0.954 when both text and audio inputs are used.
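
The BSERCNN architecture itself is not detailed in this excerpt, but the squeeze-and-excitation mechanism it builds on is well documented (Hu et al., 2018): a global pooling step summarizes each channel into a single scalar, and a small gating network then re-weights the channels before they are passed on, letting the network emphasize informative feature maps. Below is a minimal, hypothetical PyTorch sketch of such an SE block applied to 1-D (channel x time) feature maps of the kind a speech front-end produces; the class name SEBlock and the reduction parameter are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Generic squeeze-and-excitation block (sketch, not the paper's exact module).

    Recalibrates channel responses by modelling interdependencies
    between channels, as described in Hu et al. (2018).
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Squeeze: global average pooling over the time axis,
        # producing one scalar descriptor per channel.
        self.squeeze = nn.AdaptiveAvgPool1d(1)
        # Excite: a bottleneck gating network that outputs
        # per-channel weights in (0, 1).
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) feature maps from a convolutional layer.
        b, c, _ = x.shape
        w = self.squeeze(x).view(b, c)      # squeeze: (batch, channels)
        w = self.excite(w).view(b, c, 1)    # excite: per-channel gates
        return x * w                        # rescale each channel
```

Inserting a block like this after a convolutional layer is the standard way SE modules are used; it matches the abstract's claim that modelling channel interdependencies improves efficiency, though the exact placement within BSERCNN would follow the full paper.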


Metadata
Title
Deep Neural Network Based 3D Articulatory Movement Prediction Using Both Text and Audio Inputs
Authors
Lingyun Yu
Jun Yu
Qiang Ling
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-05710-7_6