Published in: International Journal of Speech Technology 4/2017

16-10-2017

Clean speech/speech with background music classification using HNGD spectrum

Authors: Banriskhem K. Khonglah, S. R. Mahadeva Prasanna

Abstract

This work explores the spectral characteristics of the vocal tract system to derive features for classifying clean speech versus speech with background music. The vocal tract spectrum is represented by the Hilbert envelope of the numerator of group delay (HNGD) spectrum. This representation complements existing methods of computing spectral characteristics in terms of temporal resolution. The HNGD spectrum has an additive, high-resolution property that gives a better representation of the formants, especially the higher ones. The feature extracted from the HNGD spectrum is the spectral contrast across sub-bands, which essentially represents the relative spectral characteristics of the vocal tract system. The vocal tract system is also represented approximately by mel frequency cepstral coefficients (MFCCs), which represent the average spectral characteristics. The MFCCs and the sum of the spectral contrast on the HNGD spectrum thus serve as features for the average and relative spectral characteristics of the vocal tract system, respectively. These features complement each other and can be combined in a multidimensional framework to discriminate well between clean speech and speech with background music segments. The spectral contrast on the HNGD spectrum is compared with the spectral contrast on the discrete Fourier transform (DFT) spectrum, which also represents the relative spectral characteristics of the vocal tract system; the HNGD spectrum is observed to give better performance than the DFT spectrum. The features are classified using classifiers such as Gaussian mixture models and support vector machines.
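
To make the "sum of the spectral contrast across sub-bands" concrete, the following is a minimal sketch computed on a plain DFT magnitude spectrum, the baseline the abstract compares against. It does not reproduce the HNGD spectrum or the authors' exact feature. The equal-width bands, the band count `n_bands`, the fraction `alpha` of bins averaged for each peak and valley, and the log-difference form are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """One-sided DFT magnitude of a real frame (naive O(N^2) DFT, stdlib only)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum


def spectral_contrast(spectrum, n_bands=4, alpha=0.2, eps=1e-10):
    """Per-band contrast: log(peak) - log(valley).

    Peak and valley are the means of the strongest and weakest `alpha`
    fraction of bins in each equal-width band (an assumed band layout).
    """
    band_size = len(spectrum) // n_bands
    contrasts = []
    for b in range(n_bands):
        start = b * band_size
        # The last band absorbs any leftover bins.
        stop = len(spectrum) if b == n_bands - 1 else start + band_size
        band = sorted(spectrum[start:stop])
        k = max(1, int(alpha * len(band)))
        valley = sum(band[:k]) / k
        peak = sum(band[-k:]) / k
        contrasts.append(math.log(peak + eps) - math.log(valley + eps))
    return contrasts


def contrast_sum(frame, n_bands=4, alpha=0.2):
    """Scalar feature: sum of the sub-band contrasts for one frame."""
    spectrum = magnitude_spectrum(frame)
    return sum(spectral_contrast(spectrum, n_bands=n_bands, alpha=alpha))


# Usage: a pure tone has a sharp spectral peak, a unit impulse has a flat spectrum.
N = 64
tone = [math.sin(2 * math.pi * 8 * t / N) for t in range(N)]
impulse = [1.0] + [0.0] * (N - 1)
print("tone contrast sum:   ", contrast_sum(tone))
print("impulse contrast sum:", contrast_sum(impulse))
```

In this sketch a frame with pronounced spectral peaks yields a large contrast sum and a spectrally flat frame yields a value near zero. In a pipeline like the one described, such a per-frame scalar could be appended to the MFCC vector before classification.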

Metadata
Title: Clean speech/speech with background music classification using HNGD spectrum
Authors: Banriskhem K. Khonglah, S. R. Mahadeva Prasanna
Publication date: 16-10-2017
Publisher: Springer US
Published in: International Journal of Speech Technology, Issue 4/2017
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-017-9464-7
