International Journal of Speech Technology, published 12-10-2019

Designing of Gabor filters for spectro-temporal feature extraction to improve the performance of ASR system

Authors: Anirban Dutta, Gudmalwar Ashishkumar, Ch. V. Rama Rao

Published in: International Journal of Speech Technology | Issue 4/2019

Abstract

Existing automatic speech recognition (ASR) systems use the spectral or temporal features of speech. Their performance is still poor compared with human hearing, especially in noisy environments. This paper concentrates on extracting spectro-temporal features using physiologically and psychoacoustically inspired approaches. Two-dimensional Gabor filters are used to estimate the spectro-temporal features from a time–frequency representation of uttered speech signals. The Gabor filters are designed using the concept of a constant Q factor, because the human auditory perception system is found to maintain an approximately constant Q in its frequency response along the chain of its filter bank. Constant-Q analysis ensures that the Gabor filters occupy a set of geometrically spaced spectral and temporal bins. The time–frequency representation of the speech signal is a key ingredient of Gabor-based feature extraction. For the time–frequency mapping, the Gammatonegram is adopted instead of conventional spectrogram representations. The performance of the ASR system with the proposed feature set is experimentally validated on the AURORA2 noisy digit database. Under clean training, the proposed features obtain a relative improvement of about 50% in word error rate (WER) compared with Mel frequency cepstral coefficient (MFCC) features. A relative improvement of 23% in WER is also obtained compared with existing spectro-temporal feature extraction methods. Further analysis is carried out on TIMIT corrupted with noise samples taken from the NOISEX-92 database. The experiments demonstrate the robustness of the proposed features for building a noise-robust acoustic model for the ASR system.
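
The constant-Q idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the Q value of 3.0, the 3-sigma truncation, the Gaussian envelope, and the modulation-frequency range are all assumptions chosen for illustration. The sketch shows only two properties: center modulation frequencies are spaced geometrically, and each filter's envelope width is inversely proportional to its center frequency, so the ratio of center frequency to bandwidth stays fixed.

```python
import numpy as np


def geometric_centers(f_min, f_max, count):
    """Return `count` center frequencies with a constant ratio between neighbors."""
    return np.geomspace(f_min, f_max, count)


def gabor_2d(omega_s, omega_t, q=3.0):
    """Complex 2D Gabor filter: Gaussian envelope times a plane-wave carrier.

    omega_s: spectral modulation frequency (radians per frequency channel)
    omega_t: temporal modulation frequency (radians per frame)
    q: assumed constant-Q value, applied to both axes (hypothetical choice)
    """
    # Constant Q: envelope width is inversely proportional to the center
    # frequency, so bandwidth grows in proportion to the center frequency.
    sigma_s, sigma_t = q / omega_s, q / omega_t
    half_s = int(np.ceil(3 * sigma_s))  # truncate the envelope at 3 sigma
    half_t = int(np.ceil(3 * sigma_t))
    s = np.arange(-half_s, half_s + 1)[:, None]  # channel axis (rows)
    t = np.arange(-half_t, half_t + 1)[None, :]  # frame axis (columns)
    envelope = np.exp(-0.5 * ((s / sigma_s) ** 2 + (t / sigma_t) ** 2))
    carrier = np.exp(1j * (omega_s * s + omega_t * t))
    return envelope * carrier


# Hypothetical modulation-frequency ranges; each step doubles the frequency.
spectral = geometric_centers(0.1, 0.8, 4)
temporal = geometric_centers(0.1, 0.8, 4)
bank = [gabor_2d(ws, wt) for ws in spectral for wt in temporal]
```

Each filter in `bank` would then be convolved in two dimensions with a time–frequency representation such as a Gammatonegram (channels by frames) to produce one spectro-temporal feature map. Low modulation frequencies give large filters and high ones give small filters, which is what places the filters on geometrically spaced bins.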

Title: Designing of Gabor filters for spectro-temporal feature extraction to improve the performance of ASR system
Authors: Anirban Dutta, Gudmalwar Ashishkumar, Ch. V. Rama Rao
Publication date: 12-10-2019
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 4/2019
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-019-09650-5
