Published in: Pattern Analysis and Applications 3/2014

01-08-2014 | Theoretical Advances

Visual-speech-pass filtering for robust automatic lip-reading

Author: Jong-Seok Lee

Abstract

This paper proposes visual-speech-pass filtering, a temporal filtering technique applied during visual feature extraction to improve the robustness of automatic lip-reading. A band-pass filter is applied to the temporal sequence of pixel values in the images containing the speaker’s lip region, removing variations that carry no speech information. The filter is designed on the basis of psychological, spectral, and experimental analyses. Experimental results on two speaker-independent recognition tasks and one speaker-dependent recognition task demonstrate that the proposed technique significantly improves recognition performance in both clean and visually noisy conditions.
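
The idea of band-pass filtering each pixel's value over time can be illustrated with a minimal sketch. This is not the filter designed in the paper: the abstract gives neither the filter type nor the cutoff frequencies, so the ideal FFT mask, the function name `temporal_bandpass`, the 30 fps frame rate, and the 1 to 8 Hz pass band below are all assumptions chosen only for illustration.

```python
import numpy as np


def temporal_bandpass(frames, fps, low_hz, high_hz):
    """Band-pass every pixel's temporal trajectory with an ideal FFT mask.

    frames: array of shape (T, H, W) holding grayscale lip-region images.
    The mask-based filter and its cutoffs are illustrative assumptions,
    not the design described in the paper.
    """
    frames = np.asarray(frames, dtype=float)
    n_frames = frames.shape[0]
    # Transform along the time axis only; spatial axes are left untouched.
    spectrum = np.fft.rfft(frames, axis=0)
    freqs = np.fft.rfftfreq(n_frames, d=1.0 / fps)
    # Zero every temporal frequency bin outside the pass band.
    keep = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[~keep] = 0.0
    return np.fft.irfft(spectrum, n=n_frames, axis=0)


# Synthetic 2-second clip at an assumed 30 fps: a constant illumination
# offset, a 3 Hz "speech-like" variation, and 14 Hz flicker.
fps = 30.0
t = np.arange(60) / fps
speech = np.sin(2 * np.pi * 3.0 * t)
flicker = 0.5 * np.sin(2 * np.pi * 14.0 * t)
pixel = 100.0 + speech + flicker

# Replicate the trajectory over a tiny 4x4 "lip region".
frames = np.tile(pixel[:, None, None], (1, 4, 4))
filtered = temporal_bandpass(frames, fps, low_hz=1.0, high_hz=8.0)

# Largest deviation of the filtered pixel from the 3 Hz component.
print(np.max(np.abs(filtered[:, 0, 0] - speech)))
```

In this synthetic case the constant offset and the 14 Hz flicker fall outside the assumed pass band, so only the 3 Hz component survives. A practical system would more likely use a causal or zero-phase filter with a smooth frequency response rather than a hard spectral mask.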


Footnotes
1
This is the opposite of the acoustic speech case, where noise interference appears in the low-frequency range, so high-pass filtering of the temporal trajectories of filterbank energies helps noise robustness [11].
 
Literature
1. Amer A, Dubois E (2005) Fast and reliable structure-oriented video noise estimation. IEEE Trans Circuits Syst Video Technol 15(1):113–118
2. Arsic I, Thiran JP (2006) Mutual information eigenlips for audio-visual speech recognition. In: Proceedings of European Signal Processing Conference, Florence, Italy
3. Bregler C, Konig Y (1994) Eigenlips for robust speech recognition. In: Proceedings of International Conference Acoustics, Speech and Signal Processing, Adelaide, Australia, vol 2, pp 669–672
4. Chibelushi CC, Deravi F, Mason JSD (2002) A review of speech-based bimodal recognition. IEEE Trans Multimed 4(1):23–37
5. Chiou GI, Hwang JN (1997) Lipreading from color video. IEEE Trans Image Process 6(8):1192–1195
6. Dupont S, Luettin J (2000) Audio-visual speech modeling for continuous speech recognition. IEEE Trans Multimed 2(3):141–151
7. Fox NA, O’Mullane BA, Reilly RB (2005) VALID: a new practical audio-visual database, and comparative results. In: Proceedings of International Conference Audio- and Video-Based Biometric Person Authentication, New York, USA, pp 777–786
8. Frowein HW, Smoorenburg GF, Pyters L, Schinkel D (1991) Improved speech recognition through videotelephony: experiments with the hard of hearing. IEEE J Sel Areas Commun 9(4):611–616
9. Gurbuz S, Tufekci Z, Patterson E, Gowdy J (2001) Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, UT, USA, vol 1, pp 177–180
10. Hennecke ME, Prasad KV, Stork DG (1995) Automatic speech recognition system using acoustic and visual signals. In: Proceedings of 29th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, vol 2, pp 1214–1218
11. Hermansky H, Morgan N (1994) RASTA processing of speech. IEEE Trans Speech Audio Process 2(4):578–589
12. Huang X, Acero A, Hon HW (2001) Spoken language processing: a guide to theory, algorithm, and system development. Prentice-Hall, Upper Saddle River
13. Jung HY, Lee SY (2000) On the temporal decorrelation of feature parameters for noise-robust speech recognition. IEEE Trans Speech Audio Process 8(4):407–416
14. Kaynak MN, Zhi Q, Cheok AD, Sengupta K, Jian Z, Chung KC (2004) Lip geometric features for human-computer interaction using bimodal speech recognition: comparison and analysis. Speech Commun 43(1–2):1–16
15. Lan Y, Harvey R, Theobald BJ, Ong EJ, Bowden R (2009) Comparing visual features for lipreading. In: Proceedings of International Conference on Audio-Visual Speech Processing, Norwich, UK, pp 102–106
16. Lan Y, Theobald BJ, Harvey R, Ong EJ, Bowden R (2010) Improving visual features for lip-reading. In: Proceedings of International Conference on Audio-Visual Speech Processing, Kanagawa, Japan, pp 142–147
17. Lee JS, Park CH (2008) Robust audio-visual speech recognition based on late integration. IEEE Trans Multimed 10(5):767–779
18. Lee JS, Park CH (2010) Hybrid simulated annealing and its application to optimization of hidden Markov models for visual speech recognition. IEEE Trans Syst Man Cybern B 40(4):1188–1196
19. Lucey S (2003) An evaluation of visual speech features for the tasks of speech and speaker recognition. In: Proceedings of International Conference on Audio- and Video-Based Biometric Person Authentication, Guildford, UK, pp 260–267
20. Matthews I, Cootes TF, Bangham JA, Cox S, Harvey R (2002) Extraction of visual features for lipreading. IEEE Trans Pattern Anal Mach Intell 24(2):198–213
21. Matthews I, Potamianos G, Neti C, Luettin J (2001) A comparison of model and transform-based visual features for audio-visual LVCSR. In: Proceedings of International Conference on Multimedia and Expo, Tokyo, Japan, pp 22–25
22. Munhall K, Vatikiotis-Bateson E (1998) The moving face during speech communication. In: Campbell R, Dodd B, Burnham D (eds) Hearing by eye II: advances in the psychology of speechreading and audio-visual speech. Psychology Press, Hove, pp 123–142
23. Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of International Conference on Machine Learning, Bellevue, WA, USA
24. Ohala JJ (1975) The temporal regulation of speech. In: Fant G, Tatham MA (eds) Auditory analysis and perception. Academic Press, London, pp 431–453
25. Oppenheim AV, Schafer RW (1999) Discrete-time signal processing, 2nd edn. Prentice-Hall, Upper Saddle River
26. O’Shaughnessy D (2008) Automatic speech recognition: history, methods and challenges. Pattern Recognit 41:2965–2979
27. Petajan ED (1985) Automatic lipreading to enhance speech recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, pp 40–47
28. Potamianos G, Graf HP (1998) Linear discriminant analysis for speechreading. In: Proceedings of IEEE Workshop on Multimedia Processing, Redondo Beach, CA, USA, pp 221–226
29. Potamianos G, Graf HP, Cosatto E (1998) An image transform approach for HMM based automatic lipreading. In: Proceedings of International Conference on Image Processing, Chicago, IL, USA, vol 3, pp 173–177
30. Potamianos G, Neti C (2003) Audio-visual speech recognition in challenging environments. In: Proceedings of Eurospeech, Geneva, Switzerland, pp 1293–1296
31. Potamianos G, Neti C, Gravier G, Garg A, Senior AW (2003) Recent advances in the automatic recognition of audiovisual speech. Proc IEEE 91(9):1306–1326
32. Rabi G, Lu SW (1997) Energy minimization for extracting mouth curves in a facial image. In: Proceedings of International Conference on Intelligent Information Systems, Bahamas, pp 381–385
33. Saenko K, Darrell T, Glass J (2004) Articulatory features for robust visual speech recognition. In: Proceedings of International Conference on Multimodal Interfaces, State College, PA, USA, pp 152–158
34. Saenko K, Livescu K, Glass J, Darrell T (2009) Multistream articulatory feature-based models for visual speech recognition. IEEE Trans Pattern Anal Mach Intell 31:1700–1707
35. Seymour R, Stewart D, Ming J (2008) Comparison of image transform-based features for visual speech recognition in clean and corrupted videos. EURASIP J Image Video Process
36. Silsbee PL, Bovik AC (1996) Computer lipreading for improved accuracy in automatic speech recognition. IEEE Trans Speech Audio Process 4(5):337–351
37. Vitkovitch M, Barber P (1996) Visible speech as a function of image quality: effects of display parameters on lipreading ability. Appl Cogn Psychol 10:121–140
38. Zhao G, Barnard M, Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Trans Multimed 11(7):1254–1265
Metadata
Title: Visual-speech-pass filtering for robust automatic lip-reading
Author: Jong-Seok Lee
Publication date: 01-08-2014
Publisher: Springer London
Published in: Pattern Analysis and Applications / Issue 3/2014
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI: https://doi.org/10.1007/s10044-013-0350-x
