Published in: Wireless Personal Communications 1/2017

30.05.2017

Frame Selection for Robust Speaker Identification: A Hybrid Approach

Authors: Swati Prasad, Zheng-Hua Tan, Ramjee Prasad



Abstract

Identifying a person by voice is a challenging task in the presence of environmental noise. Selecting important, reliable frames of the time-domain speech signal for feature extraction under noise can play a significant role in improving speaker identification accuracy. This paper therefore presents a frame selection method based on a hybrid technique that combines voice activity detection (VAD) and variable frame rate (VFR) analysis. The hybrid technique efficiently captures the active speech part and the changes in the temporal characteristics of the speech signal, taking the signal-to-noise ratio (SNR) into account, and thereby captures speaker-specific information. Experiments on noisy speech, generated by artificially adding various noise signals to clean YOHO speech at different SNRs, show that frame selection by the hybrid technique improves on either of its constituent techniques alone. The proposed hybrid technique outperformed both the VFR analysis and the widely used Gaussian statistical model based VAD in all noise scenarios at all SNRs, except for Babble-noise-corrupted speech at 5 dB SNR, where VFR performed better. Averaged over the different noise scenarios, the identification accuracy shows a relative improvement of 9.79% over VFR and 18.05% over the Gaussian statistical model based VAD.
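The hybrid frame selection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the energy-based VAD and the cumulative spectral-change VFR criterion below are simplified stand-ins (the paper uses a Gaussian statistical model based VAD), and all function names, thresholds, and frame parameters here are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def energy_vad(frames, ratio=0.1):
    """Toy energy-based VAD: keep frames whose energy exceeds a fraction
    of the peak frame energy. A stand-in for the statistical-model VAD."""
    e = np.sum(frames ** 2, axis=1)
    return e > ratio * e.max()

def vfr_select(frames, thresh=1.0):
    """Toy VFR criterion: accumulate normalized spectral change between
    consecutive frames and emit a frame each time the accumulator
    crosses the threshold, so frames land densely where the signal
    changes quickly and sparsely where it is stationary."""
    spec = np.abs(np.fft.rfft(frames, axis=1))
    keep = np.zeros(len(frames), dtype=bool)
    acc = 0.0
    for i in range(1, len(frames)):
        acc += np.linalg.norm(spec[i] - spec[i - 1]) / (np.linalg.norm(spec[i - 1]) + 1e-8)
        if acc >= thresh:
            keep[i] = True
            acc = 0.0
    return keep

def hybrid_select(x):
    """Hybrid selection: keep only frames flagged by both criteria."""
    frames = frame_signal(x)
    return frames[energy_vad(frames) & vfr_select(frames)]
```

In a realistic system, the energy-based stand-in would be replaced by the statistical-model VAD, and the VFR threshold would be adapted to the estimated SNR, in the spirit of what the abstract describes.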


Metadata
Title
Frame Selection for Robust Speaker Identification: A Hybrid Approach
Authors
Swati Prasad
Zheng-Hua Tan
Ramjee Prasad
Publication date
30.05.2017
Publisher
Springer US
Published in
Wireless Personal Communications / Issue 1/2017
Print ISSN: 0929-6212
Electronic ISSN: 1572-834X
DOI
https://doi.org/10.1007/s11277-017-4544-1
