2015 | OriginalPaper | Chapter

Complementary Gaussian Mixture Models for Multimodal Speech Recognition

Authors : Gonzalo D. Sad, Lucas D. Terissi, Juan C. Gómez

Published in: Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction

Publisher: Springer International Publishing

Abstract

In speech recognition systems, each word/phoneme in the vocabulary is typically represented by a model trained with samples of that particular class. Recognition is then performed by determining which model best represents the input word/phoneme to be classified. In this paper, a novel classification strategy based on complementary class models is presented. A complementary model for a particular class \(j\) is a model trained with instances of all the considered classes except those associated with class \(j\). This work describes new multi-classifier schemes for isolated word speech recognition based on the combination of standard Hidden Markov Models (HMMs) and Complementary Gaussian Mixture Models (CGMMs). Two different conditions are considered. When the data is represented by a single feature vector, a cascade classification scheme using HMMs and CGMMs is proposed. When the data is represented by multiple feature vectors, a classification scheme based on a voting strategy that combines scores from the individual HMMs and CGMMs is proposed. The proposed classification schemes are evaluated on two audio-visual speech databases under acoustic noisy conditions. Experimental results show that the proposed methodologies improve recognition rates over a wide range of signal-to-noise ratios.
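
To make the complementary-model idea concrete, the sketch below is a minimal, illustrative reconstruction (not the authors' implementation): for each class \(j\), a Gaussian mixture is fitted on the pooled feature vectors of all other classes, and a test vector is then assigned to the class whose complementary model explains it worst. The function names, the use of `sklearn.mixture.GaussianMixture`, and the toy features are assumptions for illustration only; in the paper these scores are further combined with HMM scores in cascade or voting schemes.

```python
# Minimal sketch of Complementary Gaussian Mixture Models (CGMMs).
# Assumes feature vectors have already been extracted per class;
# this is an illustrative reconstruction, not the authors' code.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_cgmms(features_by_class, n_components=8, seed=0):
    """For each class j, fit a GMM on the samples of all classes except j."""
    cgmms = {}
    for j in features_by_class:
        others = np.vstack([X for k, X in features_by_class.items() if k != j])
        cgmms[j] = GaussianMixture(n_components=n_components,
                                   covariance_type="diag",
                                   random_state=seed).fit(others)
    return cgmms

def classify(x, cgmms):
    """Pick the class whose complementary model explains x the worst,
    i.e. gives the lowest average log-likelihood."""
    scores = {j: gmm.score(x.reshape(1, -1)) for j, gmm in cgmms.items()}
    return min(scores, key=scores.get)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 2-D features for three hypothetical word classes.
    data = {j: rng.normal(loc=3.0 * j, scale=1.0, size=(200, 2)) for j in range(3)}
    cgmms = train_cgmms(data, n_components=2)
    print(classify(np.array([6.0, 6.0]), cgmms))  # expected: class 2
```

The design choice worth noting is the inverted decision rule: because the complementary model for class \(j\) has never seen samples of \(j\), a genuine instance of \(j\) should receive a low likelihood under that model, so the minimum (rather than maximum) score identifies the class.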

Metadata
Title
Complementary Gaussian Mixture Models for Multimodal Speech Recognition
Authors
Gonzalo D. Sad
Lucas D. Terissi
Juan C. Gómez
Copyright Year
2015
DOI
https://doi.org/10.1007/978-3-319-14899-1_6