Published in: International Journal of Speech Technology 1/2016

23-01-2016

Multiple cameras audio visual speech recognition using active appearance model visual features in car environment

Authors: Astik Biswas, P. K. Sahu, Mahesh Chandra

Abstract

Combining visual speech features with traditional acoustic features has been shown to perform well in uncontrolled auditory environments. However, most existing audio-visual speech recognition (AVSR) systems have been developed under laboratory conditions and have rarely addressed visual-domain problems. This paper presents an active appearance model (AAM) based multiple-camera AVSR experiment. Shape and appearance information is extracted from the jaw and lip region to enhance performance in vehicle environments. First, a series of visual speech recognition (VSR) experiments is carried out to study the impact of each camera on multi-stream VSR. A four-camera in-car audio-visual corpus is used for the experiments. The individual camera streams are fused to form a four-stream synchronous hidden Markov model (HMM) visual speech recognizer. Finally, the optimum four-stream VSR system is combined with a single-stream acoustic HMM to build a five-stream AVSR system. The dual-modality AVSR system is more robust than the acoustic-only speech recognizer across all driving conditions.
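
As an illustration of the stream fusion described in the abstract, the sketch below shows the textbook state-synchronous multi-stream emission score, in which each stream's log-likelihood is scaled by a stream weight and the results are summed. It is a minimal sketch and not the authors' implementation: the function names, the single diagonal Gaussian per stream, the feature dimensions and the weight values are all assumptions made for the example, and the abstract does not state the weights or model topology used in the paper.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def multistream_log_emission(observations, stream_models, weights):
    """Weighted sum of per-stream log-likelihoods for one HMM state.

    observations:  one feature vector per stream for the current frame
    stream_models: one (mean, var) pair per stream; a single Gaussian per
                   stream is an assumption made to keep the sketch short
    weights:       one non-negative stream exponent per stream (assumed values)
    """
    if not (len(observations) == len(stream_models) == len(weights)):
        raise ValueError("need one observation, model and weight per stream")
    return sum(
        w * gaussian_logpdf(obs, mean, var)
        for obs, (mean, var), w in zip(observations, stream_models, weights)
    )


# Toy frame with five streams: four camera streams plus one acoustic stream.
# All numbers are invented for illustration only.
frame = [[0.1, -0.2], [0.0, 0.3], [0.2, 0.1], [-0.1, 0.0], [0.5, -0.4, 0.2]]
state_models = [
    ([0.0, 0.0], [1.0, 1.0]),
    ([0.0, 0.0], [1.0, 1.0]),
    ([0.0, 0.0], [1.0, 1.0]),
    ([0.0, 0.0], [1.0, 1.0]),
    ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0]),
]
stream_weights = [0.1, 0.1, 0.1, 0.1, 0.6]  # hypothetical; sums to 1
score = multistream_log_emission(frame, state_models, stream_weights)
```

Because all streams share the same state sequence, the combined score can be used in place of a single-stream emission log-probability during decoding. Lowering a stream's weight reduces its influence, which is how a noisy acoustic stream or a poorly placed camera could be de-emphasised.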

Footnotes
1
Data from the eighth microphone is not available in the database, so the effective number of microphones is seven.
 
2
Some files for some speakers are missing because of equipment failure during recording.
 
Metadata
Title
Multiple cameras audio visual speech recognition using active appearance model visual features in car environment
Authors
Astik Biswas
P. K. Sahu
Mahesh Chandra
Publication date
23-01-2016
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-016-9332-x
