Selection and enhancement of Gabor filters for automatic speech recognition

Authors: György Kovács, László Tóth, Dirk Van Compernolle

Published in: International Journal of Speech Technology | Issue 1/2015

Published: 1 March 2015

Abstract

Motivated by neurophysiological studies, the use of Gabor filters as acoustic feature extractors for speech recognition has received increasing attention in the new millennium. As the optimal parametrization of these filters is not obvious, many researchers employ feature selection methods to find the best filter set. In this study, however, we argue that these kinds of feature selection methods cannot fulfill this task, and we support this claim with experimental results. We show that one can easily construct a better filter set manually, using simple heuristic rules. Then, as an alternative to the usual filter selection methods, we propose a training method that jointly optimizes the spectro-temporal filters and the neural net acoustic model built on them. In this special neural network architecture, the filters are incorporated into the network as its lowest layer. This allows us to tune the filters using backpropagation, and to manipulate them directly rather than through their parameters. This method also has the advantage of reducing the task of filter set enhancement to that of simple neural net training. Next, we show that we can enhance our manually selected filter set with this novel neural net architecture, using the filter coefficients as initial values for the backpropagation training. The resulting filter sets were evaluated on the phone recognition task of the TIMIT corpus, using both clean and noise-contaminated data, while cross-database phone recognition performance was evaluated on the “Szeged” Hungarian broadcast news database. The results demonstrate that the proposed filter optimization algorithm can outperform the usual feature selection-based methods, and that the filter set obtained by fine-tuning the manual filters with the neural net algorithm performs even better, outperforming all the other methods.
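
The idea of treating filter coefficients as the lowest layer of a network can be illustrated with a minimal sketch. It assumes a real-valued Gabor filter with a Gaussian envelope, a single tanh unit standing in for the acoustic model, a synthetic input patch, and a squared-error loss. The function names (`gabor_filter`, `forward`, `loss`, `sgd_step`) and all parameter values are hypothetical and not taken from the paper.

```python
import math

def gabor_filter(n_freq, n_time, omega_f, omega_t, sigma_f, sigma_t):
    """Real 2-D Gabor filter: Gaussian envelope times a cosine carrier.

    Rows index frequency channels, columns index time frames. The parameter
    names are illustrative assumptions, not the paper's notation.
    """
    f0, t0 = (n_freq - 1) / 2, (n_time - 1) / 2
    return [
        [
            math.exp(-0.5 * (((f - f0) / sigma_f) ** 2 + ((t - t0) / sigma_t) ** 2))
            * math.cos(omega_f * (f - f0) + omega_t * (t - t0))
            for t in range(n_time)
        ]
        for f in range(n_freq)
    ]

def forward(weights, v, patch):
    """Filter response on a flattened patch, followed by a one-unit toy model."""
    y = sum(w * x for w, x in zip(weights, patch))
    return y, v * math.tanh(y)

def loss(weights, v, patch, target):
    """Squared error between the toy model output and the target."""
    return 0.5 * (forward(weights, v, patch)[1] - target) ** 2

def sgd_step(weights, v, patch, target, lr):
    """One backpropagation step that updates the filter coefficients directly."""
    y, out = forward(weights, v, patch)
    err = out - target                    # derivative of the loss w.r.t. the output
    h = math.tanh(y)
    grad_v = err * h                      # gradient for the upper-layer weight
    grad_y = err * v * (1.0 - h * h)      # gradient reaching the filter response
    new_weights = [w - lr * grad_y * x for w, x in zip(weights, patch)]
    return new_weights, v - lr * grad_v

# Initialise the lowest layer from a Gabor filter, then fine-tune it jointly
# with the upper-layer weight v.
filt = gabor_filter(5, 5, omega_f=0.0, omega_t=math.pi / 2, sigma_f=1.5, sigma_t=1.5)
weights = [c for row in filt for c in row]
patch = [math.sin(0.7 * i) for i in range(25)]  # synthetic stand-in for a log mel patch
v, target = 0.5, 1.0
initial_loss = loss(weights, v, patch, target)
for _ in range(20):
    weights, v = sgd_step(weights, v, patch, target, lr=0.01)
final_loss = loss(weights, v, patch, target)
```

Because the gradient is taken with respect to each coefficient, the tuned weights are no longer constrained to form a Gabor function. In this sketch, that is what manipulating the filters directly rather than through their parameters amounts to.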

Because some of our references (Kleinschmidt and Gelbart 2002; Schädler et al. 2012) used log mel-scaled spectrograms with 23 channels, we ran the corresponding experiments with both 23 and 26 channels for comparison, and used the best-performing configuration in the subsequent experiments.

Here, preexisting means that these filters were taken ‘as is’ from other authors.

One might argue, however, that these features may differ between languages. This issue needs to be examined to see whether that is really the case.

The basic idea of the algorithm has been introduced in a conference paper (Kovács and Tóth 2013). Here, we present several refinements and a more thorough evaluation of this approach.

Aertsen, A. M., & Johannesma, P. I. (1981). The spectro-temporal receptive field. A functional characteristic of auditory neurons. Biological Cybernetics, 42(2), 133–143.

Abdel-Hamid, O., Mohamed, A., Jiang, H., & Penn, G. (2012). Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition. Proceedings of ICASSP, 2012, pp. 4277–4280.

Biem, A., McDermott, E., & Katagiri, S. (1995). A discriminative filter bank model for speech recognition. Proceedings of ICASSP-96, pp. 545–548.

Bourlard, H., & Morgan, N. (1994). Connectionist speech recognition: A hybrid approach. Boston: Kluwer Academic Publishers.

Ezzat, T., Bouvrie, J., & Poggio, T. (2007). Spectro-temporal analysis of speech using 2-D Gabor filters. Proceedings of interspeech, pp. 50–509.

Gábor, D. (1946). Theory of communication. Journal of IEE, 93, 429–457.

Gelbart, D., Kleinschmidt, M., & Meyer, B. T. (2013). Gabor feature extraction for automatic speech recognition. Retrieved October 22, 2013, from http://www1.icsi.berkeley.edu/Speech/papers/gabor/.

Gosztolya, G., & Tóth, L. (2010). Keyword spotting experiments on broadcast news data using phone-based technologies (in Hungarian). Proceedings of MSZNY, pp. 224–235.

Gramss, T. (1991). Fast algorithms to find invariant features for a word recognizing neural net. Proceedings of second international conference on artificial neural networks, pp. 180–184.

Hirsch, H.-G. (2010). FaNT: Filtering and noise-adding tool. Retrieved March 22, 2010, from http://dnt.kr.hs-niederrhein.de/download.html.

Huang, G.-B., Zhu, Q.-Y., & Siew, C.-K. (2006). Extreme learning machine: A new learning scheme of feedforward neural networks. Proceedings of international joint conference on neural networks, pp. 985–990.

Huang, L.-L., Shimizu, A., & Kobatake, H. (2005). Robust face detection using Gabor filter features. Pattern Recognition Letters, 26(11), 1641–1649.

Jaitly, N., & Hinton, G. (2011). Learning a better representation of speech soundwaves using restricted Boltzmann machines. Proceedings of ICASSP, 2011, pp. 5884–5887.

Jones, J. P., & Palmer, L. A. (1987). An evaluation of two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 56(6), 1233–1258.

Kanedera, N., Arai, T., Hermansky, H., & Pavel, M. (1999). On the relative importance of various components of the modulation spectrum for automatic speech recognition. Speech Communication, 28(1), 43–55.

Kleinschmidt, M. (2002a). Methods for capturing spectro-temporal modulations in automatic speech recognition. Acta Acustica united with Acustica, 88(3), 416–422.

Kleinschmidt, M. (2002b). Spectro-temporal Gabor features as a front end for automatic speech recognition. Proceedings of triennial forum acusticum, September, 2002, Seville.

Kleinschmidt, M., & Gelbart, D. (2002). Improving word accuracy with Gabor feature extraction. Proceedings of ICSLP, pp. 25–28.

Kovács, G., & Tóth, L. (2010). Localized spectro-temporal features for noise-robust speech recognition. Proceedings of ICCC-CONTI, pp. 481–485.

Kovács, G., & Tóth, L. (2011). Phone recognition experiments with 2D DCT spectro-temporal features. Proceedings of SACI, 2011, pp. 143–146.

Kovács, G., & Tóth, L. (2013). The joint optimization of spectro-temporal features and neural net classifiers. Proceedings of TSD, 2013, pp. 552–559.

Lamel, L. F., Kassel, R., & Seneff, S. (1986). Speech database development: Design and analysis of the acoustic-phonetic corpus. Proceedings of DARPA speech recognition workshop, pp. 100–109.

Lee, C., Hyun, D., Choi, E., & Go, J. (2003). Optimizing feature extraction for speech recognition. IEEE Transactions on Speech and Audio Processing, 11, 80–87.

Lee, K. F., & Hon, H. W. (1989). Speaker-independent phone recognition using Hidden Markov models. IEEE Transactions on Acoustics Speech and Signal Processing, 37, 1641–1648.

Lee, S.-M., Fang, S.-H., Hung, J.-W., & Lee, L.-S. (2001). Improved MFCC feature extraction by PCA-optimized filter-bank for speech recognition. IEEE workshop on automatic speech recognition and understanding, ASRU ’01, pp. 49–52.

Meyer, B. T., & Kollmeier, B. (2008). Optimization and evaluation of Gabor feature sets for ASR. Proceedings of interspeech, pp. 906–909.

Mohamed, A., Dahl, G. E., & Hinton, G. (2012). Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 14–22.

Palaz, D., Collobert, R., & Magimai-Doss, M. (2013). End-to-end phoneme sequence recognition using convolutional neural networks. NIPS deep learning workshop.

Sainath, T. N., Kingsbury, A., Ramebhadran, B., & Ramebhadran, M. (2013). Learning filter banks within a deep neural network. Proceedings of ASRU 2013.

Schädler, M. R., Meyer, B. T., & Kollmeier, B. (2012). Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition. The Journal of the Acoustical Society of America, 132, 4134–4151.

Somol, P., Novovicova, J., & Pudil, P. (2010). Efficient feature subset selection and subset size optimization. In E. Herout (Ed.), Pattern recognition recent advances (pp. 76–98). Rijeka: InTech.

Sun, Z., Bebis, G., & Miller, R. (2003). Evolutionary Gabor filter optimization with application to vehicle detection. Proceedings of ICDM, pp. 307–314.

Tasi, D. M. (2009). Optimal Gabor filter design for texture segmentation using stochastic optimization. Image and Vision Computing, 19, 299–316.

Tiitinen, H., Miettinen, I., Alku, P., & May, P. (2012). Transient and sustained cortical activity elicited by connected speech of varying intelligibility. BMC Neuroscience, 13, 157.

Tóth, L. (2013). Convolutional deep rectifier neural nets for phone recognition. Proceedings of interspeech, 2013, pp. 1722–1726.

Varga, A., & Steeneken, H. (1993). Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3), 247–251.

Vesely, K., Karafiat, M., & Grezl, F. (2011). Convolutive bottleneck network features for LVCSR. Proceedings of ASRU, 2011, pp. 42–47.

Vinyals, O., & Deng, L. (2012). Are sparse representations rich enough for acoustic modeling? Proceedings of interspeech, 2012, pp. 1–1.

von Ossietzky, C. (2013). Gabor filter bank features. Retrieved September 15, 2013, from http://medi.uni-oldenburg.de/GBFB.

Young, S. J., Evermann, G., Gales, M. J. F., Kershaw, D., Moore, G., Odell, J. J., et al. (2006). The HTK book version 3.4. Cambridge: Cambridge University Engineering Department.

Title: Selection and enhancement of Gabor filters for automatic speech recognition
Authors: György Kovács, László Tóth, Dirk Van Compernolle
Publication date: 1 March 2015
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 1/2015
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-014-9246-4
