
Multimedia Systems


1 June 2016 | Regular Paper

Audio-visual speech recognition integrating 3D lip information obtained from the Kinect

Authors: Jianrong Wang, Ju Zhang, Kiyoshi Honda, Jianguo Wei, Jianwu Dang

Published in: Multimedia Systems | Issue 3/2016


Abstract

Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, because the visual speech information is typically obtained from planar video data, problems in region-of-interest detection and feature extraction may affect recognition performance. In this paper, we depart from traditional visual speech information and propose an AVSR system that integrates 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. Different feature extraction and selection algorithms were applied to the planar images and the 3D lip information, and the resulting planar-image and 3D lip features were fused into a joint visual-3D lip feature. For automatic speech recognition (ASR), fusion methods were investigated, and the audio-visual speech information was integrated into a state-synchronous two-stream Hidden Markov Model. The experimental results demonstrated that our AVSR system integrating 3D lip information improved recognition performance over traditional ASR and AVSR systems in acoustically noisy environments.
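The abstract does not spell out how the two streams are combined. In a state-synchronous two-stream HMM, a common formulation scores each shared state as a weighted sum of the per-stream log-likelihoods. The Python sketch below illustrates only that general idea. The function names, the dictionary keys, the single diagonal Gaussian per stream, and the `audio_weight` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var))


def two_stream_log_emission(audio_obs, visual_obs, state, audio_weight):
    """Stream-weighted emission log-score for one shared HMM state.

    Computes w * log b_A(o_A) + (1 - w) * log b_V(o_V), where w is the
    audio stream weight. Both streams use the same state (state-synchronous).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    log_audio = diag_gaussian_logpdf(
        audio_obs, state["audio_mean"], state["audio_var"]
    )
    log_visual = diag_gaussian_logpdf(
        visual_obs, state["visual_mean"], state["visual_var"]
    )
    return audio_weight * log_audio + (1.0 - audio_weight) * log_visual


# Hypothetical toy state: 2-D audio features, 3-D joint visual-3D lip features.
toy_state = {
    "audio_mean": np.zeros(2),
    "audio_var": np.ones(2),
    "visual_mean": np.zeros(3),
    "visual_var": np.ones(3),
}

# A lower audio weight leans on the visual stream, e.g. in noisy audio.
score = two_stream_log_emission(np.zeros(2), np.zeros(3), toy_state, audio_weight=0.7)
print(score)
```

Under this assumed formulation, an audio weight of 1 reduces the score to the audio-only model and a weight of 0 to the visual-only model, so the weight is one natural place to account for the acoustic noise level.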



Title: Audio-visual speech recognition integrating 3D lip information obtained from the Kinect
Authors: Jianrong Wang, Ju Zhang, Kiyoshi Honda, Jianguo Wei, Jianwu Dang
Publication date: 1 June 2016
Publisher: Springer Berlin Heidelberg
Published in: Multimedia Systems, Issue 3/2016
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI: https://doi.org/10.1007/s00530-015-0499-9
