Skip to main content
Top

2019 | OriginalPaper | Chapter

Improving Audio-Visual Speech Recognition Using Gabor Recurrent Neural Networks

Authors : Ali S. Saudi, Mahmoud I. Khalil, Hazem M. Abbas

Published in: Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

The performance of speech recognition systems can be significantly improved when visual information is used in conjunction with the audio ones, especially in noisy environments. Prompted by the great achievements of deep learning in solving Audio-Visual Speech Recognition (AVSR) problems, we propose a deep AVSR model based on Long Short-Term Memory Bidirectional Recurrent Neural Network (LSTM-BRNN). The proposed deep AVSR model utilizes the Gabor filters in both the audio and visual front-ends with Early Integration (EI) scheme. This model is termed as BRNN\(_{av}\) model. The Gabor features simulate the underlying spatiotemporal processing chain that occurs in the Primary Audio Cortex (PAC) in conjunction with Primary Visual Cortex (PVC). We named it Gabor Audio Features (GAF) and Gabor Visual Features (GVF). The experimental results show that the deep Gabor (LSTM-BRNNs)-based model achieves superior performance when compared to the (GMM-HMM)-based models which utilize the same front-ends. Furthermore, the use of GAF and GVF in both audio and visual front-ends attain significant improvement in the performance compared to the traditional audio and visual features.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Chang, S.Y., Morgan, N.: Robust CNN-based speech recognition with Gabor filter kernels. In: Fifteenth Annual Conference of the International Speech Communication Association (2014) Chang, S.Y., Morgan, N.: Robust CNN-based speech recognition with Gabor filter kernels. In: Fifteenth Annual Conference of the International Speech Communication Association (2014)
2.
go back to reference Feng, W., Guan, N., Li, Y., Zhang, X., Luo, Z.: Audio visual speech recognition with multimodal recurrent neural networks. In: IEEE International Joint Conference on Neural Networks (IJCNN), pp. 681–688. IEEE (2017) Feng, W., Guan, N., Li, Y., Zhang, X., Luo, Z.: Audio visual speech recognition with multimodal recurrent neural networks. In: IEEE International Joint Conference on Neural Networks (IJCNN), pp. 681–688. IEEE (2017)
3.
go back to reference Gabor, D.: Theory of communication. Part 1: the analysis of information. J. Inst. Electr. Eng. Part III Radio Commun. Eng. 93(26), 429–441 (1946) Gabor, D.: Theory of communication. Part 1: the analysis of information. J. Inst. Electr. Eng. Part III Radio Commun. Eng. 93(26), 429–441 (1946)
4.
go back to reference Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. ACM (2006) Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. ACM (2006)
5.
go back to reference Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: International Conference on Machine Learning, pp. 1764–1772 (2014) Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: International Conference on Machine Learning, pp. 1764–1772 (2014)
7.
go back to reference Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRef Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRef
8.
go back to reference Luan, S., Chen, C., Zhang, B., Han, J., Liu, J.: Gabor convolutional networks. IEEE Trans. Image Process. 27, 4357–4366 (2018)MathSciNetCrossRef Luan, S., Chen, C., Zhang, B., Han, J., Liu, J.: Gabor convolutional networks. IEEE Trans. Image Process. 27, 4357–4366 (2018)MathSciNetCrossRef
9.
go back to reference McGurk, H., MacDonald, J.: Hearing lips and seeing voices. Nature 264(5588), 746–748 (1976)CrossRef McGurk, H., MacDonald, J.: Hearing lips and seeing voices. Nature 264(5588), 746–748 (1976)CrossRef
10.
go back to reference Mesgarani, N., David, S., Shamma, S.: Representation of phonemes in primary auditory cortex: how the brain analyzes speech. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 765–768. IEEE (2007) Mesgarani, N., David, S., Shamma, S.: Representation of phonemes in primary auditory cortex: how the brain analyzes speech. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 765–768. IEEE (2007)
11.
go back to reference Meshgini, S., Aghagolzadeh, A., Seyedarabi, H.: Face recognition using Gabor-based direct linear discriminant analysis and support vector machine. Comput. Electr. Eng. 39(3), 727–745 (2013)CrossRef Meshgini, S., Aghagolzadeh, A., Seyedarabi, H.: Face recognition using Gabor-based direct linear discriminant analysis and support vector machine. Comput. Electr. Eng. 39(3), 727–745 (2013)CrossRef
12.
go back to reference Nakamura, S.: Statistical multimodal integration for audio-visual speech processing. IEEE Trans. Neural Netw. 13(4), 854–866 (2002)CrossRef Nakamura, S.: Statistical multimodal integration for audio-visual speech processing. IEEE Trans. Neural Netw. 13(4), 854–866 (2002)CrossRef
13.
go back to reference Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H.G., Ogata, T.: Audio-visual speech recognition using deep learning. Appl. Intell. 42(4), 722–737 (2015)CrossRef Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H.G., Ogata, T.: Audio-visual speech recognition using deep learning. Appl. Intell. 42(4), 722–737 (2015)CrossRef
14.
go back to reference Patterson, E.K., Gurbuz, S., Tufekci, Z., Gowdy, J.N.: CUAVE: a new audio-visual database for multimodal human-computer interface research. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Orlando, Florida, USA, vol. 2, pp. II-2017–II-2020. IEEE (2002) Patterson, E.K., Gurbuz, S., Tufekci, Z., Gowdy, J.N.: CUAVE: a new audio-visual database for multimodal human-computer interface research. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Orlando, Florida, USA, vol. 2, pp. II-2017–II-2020. IEEE (2002)
16.
go back to reference Petridis, S., Pantic, M.: Deep complementary bottleneck features for visual speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2304–2308. IEEE (2016) Petridis, S., Pantic, M.: Deep complementary bottleneck features for visual speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2304–2308. IEEE (2016)
17.
go back to reference Shen, L., Bai, L., Fairhurst, M.: Gabor wavelets and general discriminant analysis for face identification and verification. Image Vis. Comput. 25(5), 553–563 (2007)CrossRef Shen, L., Bai, L., Fairhurst, M.: Gabor wavelets and general discriminant analysis for face identification and verification. Image Vis. Comput. 25(5), 553–563 (2007)CrossRef
19.
go back to reference Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vis. 57(2), 137–154 (2004)CrossRef Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vis. 57(2), 137–154 (2004)CrossRef
20.
go back to reference Zhang, B., Gao, Y., Zhao, S., Liu, J.: Local derivative pattern versus local binary pattern: face recognition with high-order local pattern descriptor. IEEE Trans. Image Process. 19(2), 533–544 (2010)MathSciNetCrossRef Zhang, B., Gao, Y., Zhao, S., Liu, J.: Local derivative pattern versus local binary pattern: face recognition with high-order local pattern descriptor. IEEE Trans. Image Process. 19(2), 533–544 (2010)MathSciNetCrossRef
Metadata
Title
Improving Audio-Visual Speech Recognition Using Gabor Recurrent Neural Networks
Authors
Ali S. Saudi
Mahmoud I. Khalil
Hazem M. Abbas
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-20984-1_7

Premium Partner