Published in: Cognitive Computation 4/2013

01.12.2013

Auditory-Inspired Morphological Processing of Speech Spectrograms: Applications in Automatic Speech Recognition and Speech Enhancement

Authors: Joyner Cadore, Francisco J. Valverde-Albacete, Ascensión Gallardo-Antolín, Carmen Peláez-Moreno

Published in: Cognitive Computation | Issue 4/2013

Abstract

This paper presents new auditory-inspired speech processing methods that combine spectral subtraction with two-dimensional non-linear filtering techniques originally conceived for image processing. In particular, mathematical morphology operations such as erosion and dilation are applied to noisy speech spectrograms using specifically designed structuring elements inspired by the masking properties of the human auditory system. This is complemented with a pre-processing stage comprising conventional spectral subtraction and auditory filterbanks. These methods were tested in both speech enhancement and automatic speech recognition tasks. For the former, time-frequency anisotropic structuring elements over grey-scale spectrograms were found to provide better perceptual quality than isotropic ones, proving more appropriate for retaining the structure of speech while removing background noise, as measured by a number of perceptual quality estimation measures at several signal-to-noise ratios on the Aurora database. For the latter, the combination of Spectral Subtraction and auditory-inspired Morphological Filtering was found to improve recognition rates on a noise-contaminated version of the Isolet database.
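As a rough illustration of the processing chain described above, the sketch below applies spectral subtraction followed by a grey-scale morphological opening (erosion followed by dilation) of the log-magnitude spectrogram with an anisotropic time-frequency footprint. It is a minimal example under assumed settings: the enhance function, its parameters (noise_frames, over_sub, floor) and the rectangular 3 × 5 footprint are illustrative placeholders, not the auditory-derived structuring elements designed in the paper.

```python
# Minimal sketch (not the paper's exact pipeline): spectral subtraction followed
# by grey-scale morphological opening with an anisotropic structuring element.
import numpy as np
from scipy.signal import stft, istft
from scipy.ndimage import grey_opening

def enhance(x, fs, noise_frames=10, over_sub=2.0, floor=0.002):
    f, t, Z = stft(x, fs=fs, nperseg=512)            # complex STFT
    mag, phase = np.abs(Z), np.angle(Z)

    # Spectral subtraction: noise magnitude estimated from the first frames
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean = np.maximum(mag - over_sub * noise, floor * mag)

    # Grey-scale opening (erosion then dilation) on the log-magnitude image.
    # Placeholder footprint, wider in time (5 frames) than in frequency (3 bins).
    log_mag = 20 * np.log10(clean + 1e-12)
    footprint = np.ones((3, 5), dtype=bool)          # (freq bins, time frames)
    filtered = grey_opening(log_mag, footprint=footprint)

    # Back to linear magnitude and resynthesis with the noisy phase
    mag_out = 10 ** (filtered / 20)
    _, x_out = istft(mag_out * np.exp(1j * phase), fs=fs, nperseg=512)
    return x_out
```

Making the footprint extend further along the time axis than along the frequency axis loosely reflects the idea, central to the paper, that temporal masking spreads over a longer range than simultaneous (frequency) masking.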

Footnotes
1
See [47] for a discussion on the acoustic cues employed by humans in each of the levels.
 
2
To pave the way for later notation, we are going to introduce our own names for the logarithmic scales mz and ERB rate, respectively.
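For readers unfamiliar with these scales, the hedged sketch below shows the standard published conversions for two auditory frequency scales cited in this paper: the critical-band (Bark) rate of Zwicker and Terhardt [54] and the ERB rate of Glasberg and Moore [15]. The paper's own symbols (e.g. mz) are defined in its main text and may differ in detail from these formulas.

```python
# Illustration only: standard Bark-rate [54] and ERB-rate [15] conversions.
import math

def bark_rate(f_hz: float) -> float:
    """Critical-band rate in Bark (Zwicker & Terhardt, 1980)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def erb_rate(f_hz: float) -> float:
    """ERB-rate scale, i.e. number of ERBs below f (Glasberg & Moore, 1990)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

print(bark_rate(1000.0))  # ~8.5 Bark
print(erb_rate(1000.0))   # ~15.6 ERB-rate units
```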
 
3
The basic quantity over which perception of sound is measured is sound pressure level, a normalised, logarithmic sound pressure (unless otherwise noted, log refers in this paper to base-10 logarithms): \(L_p = 20\log\frac{p}{p_0}\) (dB SPL), where p is the sound pressure and \(p_0 = 20\,\mu\hbox{Pa}\) is the reference sound pressure, the lowest audible pressure for human ears at mid-frequencies. A related quantity is sound (intensity) level, a normalised, logarithmic intensity \(L_I = 10\log\frac{I}{I_0}\) (dB SL), where \(I \propto p^2\) is the acoustic intensity, an energy-related quantity. When using \(I_0 = 10^{-12}\,\hbox{W/m}^2\) as reference, both levels can be equated and we drop the subindex:
$$ L = 20\log\frac{p}{p_0}\;(\hbox{dB SPL}) = 10\log\frac{I}{I_0}\;(\hbox{dB SL}) $$
Both dB SPL and dB SL will simply be notated as dB.
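A brief worked example of these definitions (the helper names and the 0.02 Pa test value are illustrative, not from the paper):

```python
# Worked example of the level definitions in this footnote.
import math

P0 = 20e-6   # reference sound pressure, 20 micropascals
I0 = 1e-12   # reference intensity, 10^-12 W/m^2

def spl_db(p: float) -> float:
    """Sound pressure level L_p = 20 log10(p / p0) in dB SPL."""
    return 20.0 * math.log10(p / P0)

def sl_db(i: float) -> float:
    """Sound (intensity) level L_I = 10 log10(I / I0) in dB SL."""
    return 10.0 * math.log10(i / I0)

# With these references the two levels coincide: a pressure of 0.02 Pa
# (1000 times p0) gives 20 * log10(1000) = 60 dB.
print(spl_db(0.02))   # 60.0
```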
 
4
At least not with the reduced number of filterbanks suggested by psychoacoustical experiments.
 
References
1. Baker J. The Dragon system—an overview. IEEE Trans Acoust Speech Signal Process. 1975;23(1):24–29.
2. Beerends J, Hekstra A, Rix A, Hollier M. Perceptual evaluation of speech quality (PESQ), the new ITU standard for end-to-end speech quality assessment. Part II: psychoacoustic model. J Audio Eng Soc. 2002;50(10):765–78.
3. Berouti M, Schwartz R, Makhoul J. Enhancement of speech corrupted by acoustic noise. IEEE Int Conf Acoust Speech Signal Process. 1979;4:208–211.
4. Bourlard H, Morgan N. Hybrid HMM/ANN systems for speech recognition: overview and new research directions. Adapt Process Seq Data Struct. 1998;389–417.
6. Davis S, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust Speech Signal Process. 1980;28(4):357–66.
7. Dougherty ER, Lotufo RA. Hands-on morphological image processing. Tutorial Texts in Optical Engineering, vol. TT59. SPIE Press; 2003.
8. Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process. 1984;32(6):1109–21.
9. Evans N, Mason J, Roach M, et al. Noise compensation using spectrogram morphological filtering. In: Proceedings of the 4th IASTED International Conference on Signal and Image Processing. 2002. pp. 157–61.
10. Ezeiza A, López de Ipiña K, Hernández C, Barroso N. Enhancing the feature extraction process for automatic speech recognition with fractal dimensions. Cogn Comput. 2012. pp. 1–6.
11. Fastl H, Zwicker E. Psycho-acoustics: facts and models. 3rd ed. New York: Springer; 2007.
12. Faundez-Zanuy M, Hussain A, Mekyska J, Sesa-Nogueras E, Monte-Moreno E, Esposito A, Chetouani M, Garre-Olmo J, Abel A, Smekal Z, López de Ipiña K. Biometric applications related to human beings: there is life beyond security. Cogn Comput. 2012;1–16.
13. Florentine M, Fastl H, Buus S. Temporal integration in normal hearing, cochlear impairment, and impairment simulated by masking. J Acoust Soc Am. 1998;84(1):195–203.
15. Glasberg B, Moore B. Derivation of auditory filter shapes from notched-noise data. Hear Res. 1990;47(1–2):103–38.
16. Gonzalez R, Woods R. Digital image processing. Boston: Addison-Wesley; 1993.
17. Greenberg S. The integration of phonetic knowledge in speech technology. Text, Speech and Language Technology, vol. 25, chap. From here to utility. New York: Springer; 2005. pp. 107–132.
18. Gunawan TS, Ambikairajah E, Epps J. Perceptual speech enhancement exploiting temporal masking properties of human auditory system. Speech Commun. 2010;52:381–93.
19. Hansen J, Pellom B. An effective quality evaluation protocol for speech enhancement algorithms. In: International Conference on Spoken Language Processing. Sydney, Australia; 1998. pp. 2819–22.
20. Heckmann M, Domont X, Joublin F, Goerick C. A hierarchical framework for spectro-temporal feature extraction. Speech Commun. 2010;53:736–52.
21. Hirsch H, Pearce D. The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: ASR2000—Automatic Speech Recognition: Challenges for the New Millennium, ISCA Tutorial and Research Workshop (ITRW); 2000.
22. Hu Y, Loizou P. Evaluation of objective quality measures for speech enhancement. IEEE Trans Audio Speech Lang Process. 2008;16(1):229–38.
23. Hu Y, Loizou P. Evaluation of objective measures for speech enhancement. In: Proceedings of Interspeech. 2006. pp. 1447–50.
24. Hurmalainen A, Virtanen T. Modelling spectro-temporal dynamics in factorisation-based noise-robust automatic speech recognition. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE; 2012. pp. 4113–16.
25. Irino T, Patterson R. A time-domain, level-dependent auditory filter: the gammachirp. J Acoust Soc Am. 1997;101(1):412–19.
26. Irino T, Patterson R. A dynamic compressive gammachirp auditory filterbank. IEEE Trans Audio Speech Lang Process. 2006;14(6):2222–32.
27. Jelinek F, Bahl L, Mercer R. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Trans Inf Theory. 1975;21(3):250–56.
28. Jesteadt W, Bacon SP, Lehman JR. Forward masking as a function of frequency, masker level, and signal delay. J Acoust Soc Am. 1982;71(4):950–62.
29. Klatt D. Prediction of perceived phonetic distance from critical-band spectra: a first step. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 7; 1982. pp. 1278–81.
31. Martínez C, Goddard J, Milone D, Rufiner H. Bioinspired sparse spectro-temporal representation of speech for robust classification. Comput Speech Lang. 2012;26:336–48.
32. Matheron G, Serra J. The birth of mathematical morphology. In: Proceedings of the 6th International Symposium on Mathematical Morphology. Sydney, Australia; 2002. pp. 1–16.
33. Meddis R. Simulation of mechanical to neural transduction in the auditory receptor. J Acoust Soc Am. 1986;79(3):702–11.
34. Meddis R. Simulation of auditory-neural transduction: further studies. J Acoust Soc Am. 1988;83(3):1056–63.
35. Meyer B, Kollmeier B. Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition. Speech Commun. 2010;53:753–67.
36. Moore B, Glasberg B. Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. J Acoust Soc Am. 1983;74:750.
37. Moore B, Glasberg B. A revised model of loudness perception applied to cochlear hearing loss. Hear Res. 2004;188(1–2):70–88.
38. Patterson R, Robinson K, Holdsworth J, McKeown D, Zhang C, Allerhand M. Complex sounds and auditory images. Aud Physiol Percept. 1992;83:429–46.
39. Peláez-Moreno C, García-Moral A, Valverde-Albacete F. Analyzing phonetic confusions using formal concept analysis. J Acoust Soc Am. 2010;128(3):1377–90.
40. Quackenbush S, Barnwell T, Clements M. Objective measures of speech quality. Englewood Cliffs: Prentice Hall; 1988.
41. Quatieri TF. Discrete-time speech signal processing: principles and practice. Signal Processing. Upper Saddle River: Prentice Hall; 2002.
42. Rabiner L, Juang BH. Fundamentals of speech recognition. Signal Processing. Upper Saddle River: Prentice Hall; 1993.
43. Rix A, Hollier M, Hekstra A, Beerends J. Perceptual evaluation of speech quality (PESQ), the new ITU standard for end-to-end speech quality assessment. Part I: time-delay compensation. J Audio Eng Soc. 2002;50(10):755–64.
44. Scalart P, Filho J. Speech enhancement based on a priori signal to noise estimation. In: IEEE International Conference on Acoustics, Speech, and Signal Processing; 1986. pp. 629–32.
45. Serra J, Soille P, editors. Mathematical morphology and its application to image processing. Computational Imaging and Vision. Kluwer Academic; 1994.
46. Stevens SS, Volkmann J, Newman EB. A scale for the measurement of the psychological magnitude of pitch. J Acoust Soc Am. 1937;8:185–90.
47. Summerfield Q, Culling J. Auditory segregation of competing voices: absence of effects of FM or AM coherence. Philos Trans R Soc Lond. 1992;336:357–66.
48. ten Bosch L, Kirchhoff K. Editorial note: bridging the gap between human and automatic speech recognition. Speech Commun. 2007;49(5):331–5.
49. Weiss NA, Hasset MJ. Introductory statistics. Reading: Addison-Wesley; 1993. pp. 407–08.
50. Yeh J, Chen C. Auditory front-ends for noise-robust automatic speech recognition. In: 7th International Symposium on Chinese Spoken Language Processing (ISCSLP), IEEE; 2010. pp. 205–08.
51. Yin H, Hohmann V, Nadeu C. Acoustic features for speech recognition based on gammatone filterbank and instantaneous frequency. Speech Commun. 2010;53:707–15.
52. Zwicker E, Feldtkeller R. The ear as a communication receiver. Woodbury: Acoustical Society of America; 1999.
53. Zwicker E, Jaroszewski A. Inverse frequency dependence of simultaneous tone-on-tone masking patterns at low levels. J Acoust Soc Am. 1982;71(6):1508–12.
54. Zwicker E, Terhardt E. Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J Acoust Soc Am. 1980;68:1523.
Metadata
Title
Auditory-Inspired Morphological Processing of Speech Spectrograms: Applications in Automatic Speech Recognition and Speech Enhancement
Authors
Joyner Cadore
Francisco J. Valverde-Albacete
Ascensión Gallardo-Antolín
Carmen Peláez-Moreno
Publication date
01.12.2013
Publisher
Springer US
Published in
Cognitive Computation / Issue 4/2013
Print ISSN: 1866-9956
Electronic ISSN: 1866-9964
DOI
https://doi.org/10.1007/s12559-012-9196-6
