Published in: International Journal of Speech Technology 2/2018

13.04.2018

Robust front-end for audio, visual and audio–visual speech classification

Authors: Lucas D. Terissi, Gonzalo D. Sad, Juan C. Gómez


Abstract

This paper proposes a robust front-end for speech classification that can be employed interchangeably with acoustic, visual or audio–visual information. Wavelet multiresolution analysis is used to represent the temporal input data associated with the speech information, and the resulting wavelet-based features are fed to a Random Forest classifier to perform the classification. The performance of the proposed scheme is evaluated in three scenarios: acoustic information only, visual information only (lip-reading), and fused audio–visual information. These evaluations are carried out over three audio–visual databases, two of them public and the third compiled by the authors of this paper. Experimental results show that the proposed system achieves good performance across the three databases and for each kind of input information considered, and that it outperforms other methods reported in the literature on the same two public databases. All the experiments were run with the same configuration parameters. These results also indicate that the proposed method performs satisfactorily without requiring the wavelet decomposition or Random Forest classifier parameters to be tuned for each particular database or input modality.
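
For illustration, the pipeline the abstract describes (wavelet multiresolution features fed to a Random Forest) can be prototyped in a few lines of Python. The following is a minimal sketch, assuming PyWavelets (pywt.wavedec) for the decomposition and scikit-learn's RandomForestClassifier; the per-subband log-energy features, the db4 wavelet, the decomposition level and the forest size are illustrative assumptions, not the configuration reported in the paper.

# Minimal sketch of a wavelet + Random Forest classification pipeline.
# Feature definition and all parameters are illustrative assumptions,
# not the exact configuration used in the paper.
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def wavelet_features(signal, wavelet="db4", level=4):
    # Multiresolution decomposition: one approximation band plus `level` detail bands.
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Summarize each subband by its log-energy to get a fixed-length feature vector.
    return np.array([np.log(np.sum(c ** 2) + 1e-12) for c in coeffs])

# Toy usage with synthetic one-dimensional "utterances" and random word labels.
rng = np.random.default_rng(0)
X = np.vstack([wavelet_features(rng.standard_normal(1024)) for _ in range(200)])
y = rng.integers(0, 10, size=200)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))  # predicted labels for the first five samples

Note that such an extractor applies unchanged to any one-dimensional temporal stream, which is what makes a wavelet front-end attractive for acoustic, visual (lip-motion) or fused audio–visual inputs.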


Metadata
Title
Robust front-end for audio, visual and audio–visual speech classification
Authors
Lucas D. Terissi
Gonzalo D. Sad
Juan C. Gómez
Publication date
13.04.2018
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 2/2018
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-018-9504-y
