Published in: Multimedia Systems 3/2016

1 June 2016 | Regular Paper

Audio-visual speech recognition integrating 3D lip information obtained from the Kinect

Authors: Jianrong Wang, Ju Zhang, Kiyoshi Honda, Jianguo Wei, Jianwu Dang

Abstract

Audio-visual speech recognition (AVSR) has shown substantial improvements over audio-only speech recognition in the presence of acoustic noise. However, because visual speech information is typically obtained from planar video data, problems with region-of-interest detection and feature extraction can degrade recognition performance. In this paper, we depart from traditional visual speech information and propose an AVSR system that integrates 3D lip information. The Microsoft Kinect multi-sensory device was used for data collection. Different feature extraction and selection algorithms were applied to the planar images and the 3D lip information, and the resulting planar-image and 3D lip features were fused into a joint visual-3D lip feature. For automatic speech recognition (ASR), fusion methods were investigated, and the audio-visual speech information was integrated into a state-synchronous two-stream hidden Markov model. The experimental results demonstrated that our AVSR system integrating 3D lip information improved recognition performance over traditional ASR and AVSR systems in acoustically noisy environments.
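
The abstract does not give the feature dimensions, stream weights, or model topology, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes the joint visual feature is a plain concatenation of a planar-image feature vector and a 3D lip feature vector, and that the state-synchronous two-stream model scores each state with a weighted sum of per-stream log-likelihoods, which is a common multi-stream formulation. The single diagonal Gaussian per stream, the weights summing to 1, and all function and key names are hypothetical and not taken from the paper.

```python
import math


def fuse_visual_features(planar_feature, lip3d_feature):
    """Concatenate planar-image and 3D lip features (assumed fusion rule)."""
    return list(planar_feature) + list(lip3d_feature)


def gaussian_log_pdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def two_stream_log_likelihood(audio_obs, visual_obs, state, audio_weight):
    """Stream-weighted emission score for one shared HMM state.

    Both streams are tied to the same state at every frame (state
    synchrony). Their log-likelihoods are combined with weights that
    are assumed here to sum to 1.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    log_audio = gaussian_log_pdf(
        audio_obs, state["audio_mean"], state["audio_var"]
    )
    log_visual = gaussian_log_pdf(
        visual_obs, state["visual_mean"], state["visual_var"]
    )
    return audio_weight * log_audio + (1.0 - audio_weight) * log_visual


# Hypothetical toy state: 1-D audio stream, 3-D joint visual stream.
toy_state = {
    "audio_mean": [0.0], "audio_var": [1.0],
    "visual_mean": [0.0, 0.0, 0.0], "visual_var": [1.0, 1.0, 1.0],
}
joint_visual = fuse_visual_features([0.1, -0.2], [0.3])
score = two_stream_log_likelihood([0.5], joint_visual, toy_state, 0.7)
```

In a scheme like this, lowering `audio_weight` as acoustic noise increases shifts the decision toward the visual stream, which is the usual motivation for weighting the streams separately.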

Metadata
Title
Audio-visual speech recognition integrating 3D lip information obtained from the Kinect
Authors
Jianrong Wang
Ju Zhang
Kiyoshi Honda
Jianguo Wei
Jianwu Dang
Publication date
1 June 2016
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 3/2016
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-015-0499-9
