Published in: International Journal of Speech Technology 1/2020

26-11-2019

Bark scaled oversampled WPT based speech recognition enhancement in noisy environments

Authors: Navneet Upadhyay, Hamurabi Gamboa Rosales


Abstract

The performance of a speech recognition system degrades significantly in real-world environments because of the acoustic mismatch between training and operating conditions. This paper presents a two-stage approach that makes a speech recognition system robust to additive, uncorrelated background noise. In the first stage, an oversampled wavelet packet transform (WPT) decomposes the noisy input speech into seventeen nonlinearly spaced frequency subbands that approximate the Bark scale of the human auditory system, and spectral subtraction based on adaptive noise estimation suppresses the noise in each subband signal. The oversampled WPT is linear, and because it omits the decimation that normally follows filtering at each decomposition level, it avoids the shift variance of the decimated transform. In the second stage, a nonparametric approach extracts features from the filtered speech, and these features are compared with features extracted from speech signals stored as templates to recognize the utterance. A series of experiments evaluates the proposed two-stage system in a variety of real environments, with and without the first stage. Recognition accuracy is measured at the word level over a wide range of SNRs for various types of noisy environments. The experimental results show a significant improvement in recognition performance at low SNR with the proposed system.
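The three ingredients of the first stage can be illustrated with a minimal Python sketch. It is not the authors' implementation. The Haar filter pair, the circular boundary handling, the over-subtraction factor `alpha`, the spectral floor `beta`, and the function names are all assumptions made for illustration, and the subtraction rule shown is the classic over-subtraction form rather than the paper's adaptive noise estimator. The Bark conversion uses the widely used Zwicker and Terhardt approximation.

```python
import numpy as np


def hz_to_bark(f_hz):
    """Critical-band rate in Bark (Zwicker and Terhardt approximation)."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)


def undecimated_haar_split(x, level=1):
    """One undecimated (a trous) Haar split with circular boundaries.

    Instead of downsampling the outputs, the two filter taps are spread
    2**(level-1) samples apart, so both subbands keep the input length.
    Illustrative filter choice; the paper's wavelet is not given here.
    """
    x = np.asarray(x, dtype=float)
    delayed = np.roll(x, 2 ** (level - 1))
    low = 0.5 * (x + delayed)   # lowpass (average) branch
    high = 0.5 * (x - delayed)  # highpass (difference) branch
    return low, high


def spectral_subtract(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Power spectral subtraction with over-subtraction and a spectral floor.

    alpha and beta are assumed example values, not taken from the paper.
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    cleaned = noisy_power - alpha * noise_power
    # The floor keeps the estimate positive where the noise dominates.
    return np.maximum(cleaned, beta * noise_power)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(64)

    low, high = undecimated_haar_split(x)
    low_shifted, _ = undecimated_haar_split(np.roll(x, 3))

    # Shifting the input shifts the subband by the same amount.
    print("shift-invariant:", np.allclose(low_shifted, np.roll(low, 3)))
    # With these 0.5-scaled filters the two branches sum back to the input.
    print("reconstructs:", np.allclose(low + high, x))
    # Critical-band rate at 4 kHz, for comparison with the 17 subbands.
    print("Bark at 4 kHz:", float(hz_to_bark(4000.0)))
```

Under this approximation, 4 kHz maps to a little over 17 Bark. That is consistent with seventeen critical-band-like subbands if the speech is band-limited to about 4 kHz, although the abstract does not state the sampling rate. In a full tree, the split would be applied recursively to selected nodes so that the resulting subband widths follow the Bark scale.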


Metadata
Title
Bark scaled oversampled WPT based speech recognition enhancement in noisy environments
Authors
Navneet Upadhyay
Hamurabi Gamboa Rosales
Publication date
26-11-2019
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2020
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-019-09657-y
