Top

Pattern Analysis and Applications

Published in:

01-08-2014 | Theoretical Advances

Visual-speech-pass filtering for robust automatic lip-reading

Author: Jong-Seok Lee

Published in: Pattern Analysis and Applications | Issue 3/2014

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

This paper proposes a temporal filtering technique used in extraction of visual features for improved robustness of automatic lip-reading, called visual-speech-pass filtering. A band-pass filter is applied to the pixel value sequence of the images containing the speaker’s lip region to remove unwanted variations that are not relevant to the speech information. The filter is carefully designed based on psychological, spectral, and experimental analyses. Experimental results on two speaker-independent and one speaker-dependent recognition tasks demonstrate that the proposed technique significantly improves recognition performance in both clean and visually noisy conditions.

previous article Performance enhancement of online handwritten Tamil symbol recognition with reevaluation techniques

next article A smartphone-based virtual white cane

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

This is opposite to the acoustic speech case where the noise interference appears in the low frequency range, and thus high-pass filtering of the temporal trajectories of filterbank energies is helpful for noise-robustness [11].

Amer A, Dubois E (2005) Fast and reliable structure-oriented video noise estimation. IEEE Trans Circuits Syst Video Technol 15(1):113–118CrossRef

Arsic I, Thiran JP (2006) Mutual information eigenlips for audio-visual speech recognition. In: Proceedings of European Signal Processing Conference Florence, Italy

Bregler C, Konig Y (1994) Eigenlips for robust speech recognition. In: Proceedings of International Conference Acoustics, Speech and Signal Processing, Adelaide, Australia, vol. 2, pp 669–672

Chibelushi CC, Deravi F, Mason JSD (2002) A review of speech-based bimodal recognition. IEEE Trans Multimed 4(1):23–37CrossRef

Chiou GI, Hwang JN (1997) Lipreading from color video. IEEE Trans Image Process 6(8):1192–1195CrossRef

Dupont S, Luettin J (2000) Audio-visual speech modeling for continuous speech recognition. IEEE Trans Multimed 2(3):141–151CrossRef

Fox NA, O’Mullane BA, Reilly RB (2005) VALID: a new practical audio-visual database, and comparative results. In: Proceedings of International Conference Audio- and Video-Based Biometric Person Authentication, New York, USA, pp 777–786

Frowein HW, Smoorenburg GF, Pyters L, Schinkel D (1991) Improved speech recognition through videotelephony: experiments with the hard of hearing. IEEE J Sel Areas Commun 9(4):611–616CrossRef

Gurbuz S, Tufekci Z, Patterson E, Gowdy J (2001) Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, UT, USA. vol 1, pp 177–180

10.

Hennecke ME, Prasad KV, Stork DG (1995) Automatic speech recognition system using acoustic and visual signals. In: Proceedings of 29th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, vol 2, pp 1214–1218

11.

Hermansky H, Morgan N (1994) RASTA processing of speech. IEEE Trans Speech Audio Process 2(4):578–589CrossRef

12.

Huang X, Acero A, Hon HW (2001) Spoken language processing: a guide to theory, algorithm, and system development. Prentice-Hall, Upper Saddle River

13.

Jung HY, Lee SY (2000) On the temporal decorrelation of feature parameters for noise-robust speech recognition. IEEE Trans Speech Audio Process 8(4):407–416CrossRefMathSciNet

14.

Kaynak MN, Zhi Q, Cheok AD, Sengupta K, Jian Z, Chung KC (2004) Lip geometric features for human-computer interaction using bimodal speech recognition: comparison and analysis. Speech Commun 43(1–2):1–16CrossRef

15.

Lan, Y, Harvey, R, Theobald, BJ, Ong, EJ., Bowden, R (2009) Comparing visual features for lipreading. In: Proceedings of International Conference on Audio-Visual Speech Processing, Norwich, UK, pp 102–106

16.

Lan, Y, Theobald, BJ, Harvey, R, Ong, EJ, Bowden, R (2010) Improving visual features for lip-reading. In: Proceedings of International Conference on Audio-Visual Speech Processing, Kanagawa, Japan, pp 142–147

17.

Lee JS, Park CH (2008) Robust audio-visual speech recognition based on late integration. IEEE Trans Multimed 10(5):767–779CrossRef

18.

Lee JS, Park CH (2010) Hybrid simulated annealing and its application to optimization of hidden markov models for visual speech recognition. IEEE Trans Syst Man Cybern B 40(4):1188–1196CrossRef

19.

Lucey S (2003) An evaluation of visual speech features for the tasks of speech and speaker recognition. In: Proceedings of International Conference on Audio- and Video-Based Biometric Person Authentication, Guildford, UK, pp 260–267 (2003)

20.

Matthews I, Cootes TF, Bangham JA, Cox S, Harvey R (2002) Extraction of visual features for lipreading. IEEE Trans Pattern Anal Mach Intell 24(2):198–213CrossRef

21.

Matthews I, Potamianos G, Neti C, Luettin J (2001) A comparison of model and transform-based visual features for audio-visual LVCSR. In: Proceedings of International Conference on Multimedia and Expo, Tokyo, Japan, pp 22–25

22.

Munhall K, Vatikiotis-Bateson E (1998) The moving face during speech communication. In: Campbell R, Dodd B, Burnham, D (eds) Hearing by eye II: advances in the psychology of speechreading and audio-visual speech. Psychology Press, Hove, pp 123–142

23.

Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of International Conference on Machine Learning, Bellevue, WA, USA (2011)

24.

Ohala JJ (1975) The temporal regulation of speech. In: Fant G, Tatham MA (eds) Auditory analysis and perception. Academic Press, London, pp 431–453

25.

Oppenheim AV, Schafer RW (1999) Discrete-time signal processing, 2nd edn. Prentice-Hall, Upper Saddle River (1999)

26.

O’Shaughnessy D (2008) Automatic speech recognition: history, methods and challenges. Pattern Recognit 41:2965–2979CrossRefMATH

27.

Petajan ED (1985) Automatic lipreading to enhance speech recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, pp. 40–47

28.

Potamianos G, Graf HP (1998) Linear discriminant analysis for speechreading. In: Proceedings of IEEE Workshop on Multimedia Processing, Redeondo Beach, CA, USA, pp 221–226

29.

Potamianos G, Graf HP, Cosatto E (1998) An image transform approach for HMM based automatic lipreading. In: Proceedings of International Conference on Image Processing, Chicago, IL, USA, vol 3, pp 173–177

30.

Potamianos G, Neti C (2003) Audio-visual speech recognition in challenging environments. In: Proceedings of Eurospeech, Geneva, Switzerland, pp 1293–1296

31.

Potamianos G, Neti C, Gravier G, Garg A, Senior AW (2003) Recent advances in the automatic recognition of audiovisual speech. Proc IEEE 91(9):1306–1326CrossRef

32.

Rabi, G, Lu, SW (1997) Energy minimization for extracting mouth curves in a facial image. In: Proceedings of International Conference on Intelligent Information Systems, Bahamas, pp 381–385

33.

Saenko K, Darrell T, Glass J (2004) Articulatory features for robust visual speech recognition. In: Proceedings of International Conference on Multimodal Interfaces, State College, PA, USA, pp 152–158

34.

Saenko K, Livescu K, Glass J, Darrell T (2009) Multistream articulatory feature-based models for visual speech recognition IEEE Trans Pattern Anal Mach Intell 31:1700–1707CrossRef

35.

Seymour R, Stewart D, Ming J (2008) Comparison of image transform-based features for visual speech recognition in clean and corrupted videos. EURASIP J Image Video Process

36.

Silsbee PL, Bovik AC (1996) Computer lipreading for improved accuracy in automatic speech recognition. IEEE Trans Speech Audio Process 4(5):337–351CrossRef

37.

Vitkovitch M, Barber P (1996) Visible speech as a function of image quality: effects of display parameters on lipreading ability. Appl Cogn Psychol 10:121–140CrossRef

38.

Zhao G, Barnard M, Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Trans Multimed 11(7):1254–1265CrossRef

Title: Visual-speech-pass filtering for robust automatic lip-reading
Author: Jong-Seok Lee
Publication date: 01-08-2014
Publisher: Springer London
Published in: Pattern Analysis and Applications / Issue 3/2014
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI: https://doi.org/10.1007/s10044-013-0350-x

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Other articles of this Issue 3/2014

Performance enhancement of online handwritten Tamil symbol recognition with reevaluation techniques

A new approach to signature recognition using the fuzzy method

Dimensionality reduction and topographic mapping of binary tensors

Geometric algorithm for dominant point extraction from shape contour

A smartphone-based virtual white cane

Adding a rigid motion model to foreground detection: application to moving object detection in rivers

Premium Partner