Published in: International Journal of Speech Technology, Issue 1/2016

23 January 2016

Multiple cameras audio visual speech recognition using active appearance model visual features in car environment

Authors: Astik Biswas, P. K. Sahu, Mahesh Chandra

Abstract

Combining visual speech features with traditional acoustic features has been shown to improve recognition in uncontrolled auditory environments. However, most existing audio-visual speech recognition (AVSR) systems have been developed under laboratory conditions and rarely address visual-domain problems. This paper presents an active appearance model (AAM) based multiple-camera AVSR experiment. Shape and appearance information is extracted from the jaw and lip region to enhance performance in vehicle environments. First, a series of visual speech recognition (VSR) experiments is carried out to study the impact of each camera on multi-stream VSR. A four-camera in-car audio-visual corpus is used for the experiments. The individual camera streams are fused into a four-stream synchronous hidden Markov model (HMM) visual speech recognizer. Finally, the optimum four-stream VSR system is combined with a single-stream acoustic HMM to build a five-stream AVSR system. The dual-modality AVSR system is more robust than the acoustic-only speech recognizer across all driving conditions.
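The stream fusion described in the abstract can be illustrated with a minimal sketch. In a state-synchronous multi-stream HMM, the emission score of a state is commonly taken as the weighted sum of the per-stream log-likelihoods, with one weight (exponent) per stream. The code below shows only that combination step for four visual streams and one acoustic stream. It is not the authors' implementation: the function names, the single diagonal Gaussian per stream, the feature dimensions, and the stream weights are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def diag_gaussian_loglik(obs, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at `obs`."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (o - m) ** 2 / v)
        for o, m, v in zip(obs, mean, var)
    )


def fused_state_loglik(stream_logliks, stream_weights):
    """Weighted sum of per-stream log-likelihoods for one state and one frame.

    Equivalent to raising each stream's likelihood to its weight and
    multiplying the results, but computed in the log domain.
    """
    if len(stream_logliks) != len(stream_weights):
        raise ValueError("need exactly one weight per stream")
    return sum(w * ll for w, ll in zip(stream_weights, stream_logliks))


# Hypothetical weights: four camera streams plus one acoustic stream.
# These values are illustrative only and are not taken from the paper.
visual_weights = [0.1, 0.1, 0.1, 0.1]
audio_weight = 0.6
weights = visual_weights + [audio_weight]

# Toy one-dimensional observations and state parameters for each stream.
observations = [[0.2], [0.1], [-0.3], [0.0], [1.5]]
means = [[0.0], [0.0], [0.0], [0.0], [1.0]]
variances = [[1.0], [1.0], [1.0], [1.0], [0.5]]

per_stream = [
    diag_gaussian_loglik(o, m, v)
    for o, m, v in zip(observations, means, variances)
]
score = fused_state_loglik(per_stream, weights)
print(f"fused log-likelihood for this state and frame: {score:.4f}")
```

Lowering the acoustic weight and raising the visual weights shifts the decision toward the camera streams, which is the usual motivation for stream weighting when the audio is noisy.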

Footnotes
1
Data from the eighth microphone are not available in the database, so the effective number of microphones is seven.

2
Some files for some speakers are missing because of equipment failure during recording.

Metadata
Title
Multiple cameras audio visual speech recognition using active appearance model visual features in car environment
Authors
Astik Biswas
P. K. Sahu
Mahesh Chandra
Publication date
23 January 2016
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-016-9332-x
