Published in: International Journal of Speech Technology 1/2015

1 March 2015

Selection and enhancement of Gabor filters for automatic speech recognition

Authors: György Kovács, László Tóth, Dirk Van Compernolle


Abstract

Motivated by neurophysiological studies, the use of Gabor filters as acoustic feature extractors for speech recognition has received increasing attention in the new millennium. As the optimal parametrization of these filters is not obvious, many researchers employ feature selection methods to find the best filter set. In this study, however, we argue that these kinds of feature selection methods cannot fulfill this task, and we support this claim with experimental results. We show that one can easily construct a better filter set manually, using simple heuristic rules. Then, as an alternative to the usual filter selection methods, we propose a training method that jointly optimizes the spectro-temporal filters and the neural net acoustic model built on them. In this special neural network architecture, the filters are incorporated into the network as its lowest layer. This allows us to tune the filters using backpropagation, and to manipulate them directly rather than through their parameters. This method also has the advantage of reducing the task of filter set enhancement to that of a simple neural net training. Next, we show that we can enhance our manually selected filter set with this novel neural net architecture by using the filter coefficients as initial values for the backpropagation training. The resulting filter sets were evaluated on the phone recognition task of the TIMIT corpus, using both clean and noise-contaminated data, while cross-database phone recognition performance was evaluated on the “Szeged” Hungarian broadcast news database. The results demonstrate that the proposed filter optimization algorithm can outperform the usual feature selection-based methods, and that the filter set obtained by fine-tuning the manual filters with the neural net algorithm performs even better, beating all the other methods.
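The idea of using filter coefficients as initial weights of the lowest network layer can be illustrated with a small sketch. The NumPy example below is not the authors' implementation. It assumes a Gaussian envelope, a cosine carrier, a 9 x 9 spectro-temporal patch, a linear lowest layer, a toy three-class softmax output and a single plain gradient step; the function names, the modulation frequencies and the learning rate are all hypothetical choices made for illustration.

```python
import numpy as np

def gabor_filter(n_freq, n_time, omega_f, omega_t, sigma_f, sigma_t):
    """Real part of a 2-D Gabor filter on an n_freq x n_time patch (assumed form)."""
    f = np.arange(n_freq) - (n_freq - 1) / 2.0   # frequency axis, centred on 0
    t = np.arange(n_time) - (n_time - 1) / 2.0   # time axis, centred on 0
    F, T = np.meshgrid(f, t, indexing="ij")
    envelope = np.exp(-0.5 * ((F / sigma_f) ** 2 + (T / sigma_t) ** 2))
    return envelope * np.cos(omega_f * F + omega_t * T)

def loss_and_grads(W0, W1, x, target):
    """Cross-entropy loss and gradients for a linear filter layer plus softmax."""
    h = W0 @ x                      # filter outputs (lowest layer)
    z = W1 @ h                      # class scores
    p = np.exp(z - z.max())
    p /= p.sum()                    # softmax probabilities
    loss = -np.log(p[target])
    dz = p.copy()
    dz[target] -= 1.0               # dL/dz for softmax + cross-entropy
    grad_W1 = np.outer(dz, h)
    grad_W0 = np.outer(W1.T @ dz, x)  # gradient w.r.t. the filter coefficients
    return loss, grad_W0, grad_W1

# Hypothetical (omega_f, omega_t) pairs; not taken from the paper.
modulations = [(0.0, 0.5), (0.5, 0.0), (0.5, 0.5), (0.5, -0.5)]
W0_init = np.stack(
    [gabor_filter(9, 9, wf, wt, 3.0, 3.0).ravel() for wf, wt in modulations]
)

rng = np.random.default_rng(0)
W1_init = rng.normal(scale=0.1, size=(3, W0_init.shape[0]))  # toy output layer
x = rng.normal(size=81)   # stand-in for one flattened spectrogram patch
target = 1                # stand-in class label

# One gradient step updates the filter coefficients directly,
# not the Gabor parameters (omega, sigma) that generated them.
lr = 1e-3
loss_before, g0, g1 = loss_and_grads(W0_init, W1_init, x, target)
W0 = W0_init - lr * g0
W1 = W1_init - lr * g1
loss_after, _, _ = loss_and_grads(W0, W1, x, target)
```

Because the update acts on the coefficients themselves, the tuned rows of `W0` are no longer constrained to remain exact Gabor functions, which corresponds to manipulating the filters directly rather than through their parameters.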


Footnotes
1
As some of our references (Kleinschmidt and Gelbart 2002; Schädler et al. 2012) used log mel-scaled spectrograms with 23 channels, for purposes of comparison we ran the corresponding experiments using both 23 and 26 channels, and in the subsequent experiments we used the configuration that performed best.
 
2
Here, preexisting means that these filters were taken ‘as is’ from other authors.
 
3
One might argue, however, that these features may differ between languages. This issue needs to be examined to see whether it is really the case.
 
4
The basic idea of the algorithm has been introduced in a conference paper (Kovács and Tóth 2013). Here, we present several refinements and a more thorough evaluation of this approach.
 
References
Aertsen, A. M., & Johannesma, P. I. (1981). The spectro-temporal receptive field. A functional characteristic of auditory neurons. Biological Cybernetics, 42(2), 133–143.
Abdel-Hamid, O., Mohamed, A., Jiang, H., & Penn, G. (2012). Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition. Proceedings of ICASSP, 2012, pp. 4277–4280.
Biem, A., Mcdermott, E., & Katagiri, S. (1995). A discriminative filter bank model for speech recognition. Proceedings of ICASSP-96, pp. 545–548.
Bourlard, H., & Morgan, N. (1994). Connectionist speech recognition: A hybrid approach. Boston: Kluwer Academic Publication.
Ezzat, T., Bouvrie, J., & Poggio, T. (2007). Spectro-temporal analysis of speech using 2-D Gabor filters. Proceedings of interspeech, pp. 50–509.
Gábor, D. (1946). Theory of communication. Journal of IEE, 93, 429–457.
Gosztolya, G., & Tóth, L. (2010). Keyword spotting experiments on broadcast news data using phone-based technologies (in Hungarian). Proceedings of MSZNY, pp. 224–235.
Gramss, T. (1991). Fast algorithms to find invariant features for a word recognizing neural net. Proceedings of second international conference on artificial neural networks, pp. 180–184.
Huang, G.-B., Zhu, Q.-Y., & Siew, C.-K. (2006). Extreme learning machine: A new learning scheme of feedforward neural networks. Proceedings of international joint conference on neural networks, pp. 985–990.
Huang, L.-L., Shimizu, A., & Kobatake, H. (2005). Robust face detection using Gabor filter features. Pattern Recognition Letters, 26(11), 1641–1649.
Jaitly, N., & Hinton, G. (2011). Learning a better representation of speech soundwaves using restricted boltzmann machines. Proceedings of ICASSP, 2011, pp. 5884–5887.
Jones, J. P., & Palmer, L. A. (1987). An evaluation of two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 56(6), 1233–1258.
Kanedera, N., Arai, T., Hermansky, H., & Pavel, M. (1999). On the relative importance of various components of the modulation spectrum for automatic speech recognition. Speech Communication, 28(1), 43–55.
Kleinschmidt, M. (2002a). Methods for capturing spectro-temporal modulations in automatic speech recognition. Acta Acustica united with Acustica, 88(3), 416–422.
Kleinschmidt, M. (2002b). Spectro-temporal Gabor features as a front end for automatic speech recognition. Proceedings of triennial forum acusticum, September, 2002, Seville.
Kleinschmidt, M., & Gelbart, D. (2002). Improving word accuracy with Gabor feature extraction. Proceedings of ICSLP, pp. 25–28.
Kovács, G., & Tóth, L. (2010). Localized spectro-temporal features for noise-robust speech recognition. Proceedings of ICCC-CONTI, pp. 481–485.
Kovács, G., & Tóth, L. (2011). Phone recognition experiments with 2D DCT spectro-temporal features. Proceedings of SACI, 2011, pp. 143–146.
Kovács, G., & Tóth, L. (2013). The joint optimization of spectro-temporal features and neural net classifiers. Proceedings of TSD, 2013, pp. 552–559.
Lamel, L. F., Kassel, R., & Seneff, S. (1986). Speech database development: Design and analysis of the acoustic-phonetic corpus. Proceedings of DARPA speech recognition workshop, pp. 100–109.
Lee, C., Hyun, D., Choi, E., & Go, J. (2003). Optimizing feature extraction for speech recognition. IEEE Transactions on Speech and Audio Processing, 11, 80–87.
Lee, K. F., & Hon, H. W. (1989). Speaker-independent phone recognition using Hidden Markov models. IEEE Transactions on Acoustics Speech and Signal Processing, 37, 1641–1648.
Lee, S.-M., Fang, S.-H., Hung, J.-W., & Lee L.-S. (2001). Improved MFCC feature extraction by PCA-optimized filter-bank for speech recognition. IEEE workshop on automatic speech recognition and understanding, ASRU ’01, pp. 49–52.
Meyer, B. T., & Kollmeier, B. (2008). Optimization and evaluation of Gabor feature sets for ASR. Proceedings of interspeech, pp. 906–909.
Mohamed, A., Dahl, G. E., & Hinton, G. (2012). Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 14–22.
Palaz, D., Collobert, R., & Magimai-Doss, M. (2013). End-to-end phoneme sequence recognition using convolutional neural networks. NIPS deep learning workshop.
Sainath, T. N., Kingsbury, A., Ramebhadran, B., & Ramebhadran, M. (2013). Learning filter banks within a deep neural network. Proceedings of ASRU 2013.
Schädler, M. R., Meyer, B. T., & Kollmeier, B. (2012). Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition. The Journal of Acoustical Society of America, 132, 4134–4151.
Somol, P., Novovicova, J., & Pudil, P. (2010). Efficient feature subset selection and subset size optimization. In E. Herout (Ed.), Pattern recognition recent advances (pp. 76–98). Rijeka: InTech.
Sun, Z., Bebis, G., & Miller, R. (2003). Evolutionary Gabor filter optimization with application to vehicle detection. Proceedings of ICDM, pp. 307–314.
Tasi, D. M. (2009). Optimal Gabor filter design for texture segmentation using stochastic optimization. Image and Vision Computing, 19, 299–316.
Tiitinen, H., Miettinen, I., Alku, P., & May, P. (2012). Transient and sustained cortical activity elicited by connected speech of varying intelligibility. BMC Neuroscience, 13, 157.
Tóth, L. (2013). Convolutional deep rectifier neural nets for phone recognition. Proceedings of interspeech, 2013, pp. 1722–1726.
Varga, A., & Steeneken, H. (1993). Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3), 247–251.
Vesely, K., Karafiat, M., & Grezl, F. (2011). Convolutive bottleneck network features for LVCSR. Proceedings of ASRU, 2011, pp. 42–47.
Vinyals, O., & Deng, L. (2012). Are sparse representations rich enough for acoustic modeling? Proceedings of interspeech, 2012, pp. 1–1.
Young, S. J., Evermann, G., Gales, M. J. F., Kershaw, D., Moore, G., Odell, J. J., et al. (2006). The HTK book version 3.4. Cambridge: Cambridge University Engineering Department.
Metadata
Title
Selection and enhancement of Gabor filters for automatic speech recognition
Authors
György Kovács
László Tóth
Dirk Van Compernolle
Publication date
1 March 2015
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2015
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-014-9246-4
