Published in: International Journal of Speech Technology 2/2015

1 June 2015

Improving the self-adaptive voice activity detector for speaker verification using map adaptation and asymmetric tapers

Authors: Nassim Asbai, Messaoud Bengherabi, Abderrahmane Amrouche, Youcef Aklouf


Abstract

This paper improves the recently proposed voice activity detector based on vector quantization and speech enhancement preprocessing (VQ-VAD) and applies it to speaker verification in noisy environments. VQ-VAD computes a likelihood ratio on an utterance-by-utterance basis, using speech and non-speech models trained on mel-frequency cepstral coefficients (MFCCs). However, the distinction between speech and non-speech segments in a speech signal is independent of the speaker. A modified VQ-VAD technique is therefore proposed: two universal background models (UBMs), one for speech and one for non-speech, are trained on long, utterance-independent speech material and then adapted to each speaker's short utterance via maximum a posteriori (MAP) adaptation, instead of using VQ models. The MFCCs are also extracted using the recently proposed asymmetric tapers instead of the traditional Hamming window. With GMM–UBM as the baseline speaker verification system, extensive simulations were carried out by adding noise at different levels to the clean TIMIT database, which is characterized by short training and very short testing utterances. The results show the superiority of the proposed GMM-MAP-VAD approach in adverse conditions. Furthermore, a drastic reduction in the equal error rate (EER) is observed when asymmetric tapers are used.
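As a rough illustration of the idea described in the abstract, the sketch below adapts the means of a speech UBM and a non-speech UBM to one utterance and then labels frames by a log-likelihood ratio. It is not the authors' implementation. The diagonal covariances, the mean-only adaptation rule (in the style of Reynolds et al., 2000), the relevance factor of 16, the zero threshold, and all function names are assumptions made for this example. How frames are assigned to the speech and non-speech classes before adaptation is not stated in the abstract and is left to the caller.

```python
import numpy as np

def _log_joint(X, weights, means, variances):
    """log(w_k) + log N(x | mu_k, diag(var_k)), shape (n_frames, n_components)."""
    d = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (d * np.log(2.0 * np.pi)
                        + np.sum(np.log(variances), axis=1)[None, :]
                        + np.sum(diff ** 2 / variances[None, :, :], axis=2))
    return np.log(weights)[None, :] + log_gauss

def frame_loglik(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM (log-sum-exp)."""
    lj = _log_joint(X, weights, means, variances)
    m = lj.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(lj - m).sum(axis=1, keepdims=True)))[:, 0]

def map_adapt_means(X, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a UBM to the frames X (assumed update rule)."""
    lj = _log_joint(X, weights, means, variances)
    lj = lj - lj.max(axis=1, keepdims=True)
    post = np.exp(lj)
    post /= post.sum(axis=1, keepdims=True)          # component posteriors
    n_k = post.sum(axis=0)                           # soft frame counts
    ex_k = (post.T @ X) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]       # data-dependent mixing
    return alpha * ex_k + (1.0 - alpha) * means

def vad_decisions(X, speech_gmm, nonspeech_gmm, threshold=0.0):
    """True where the speech model explains a frame better than the non-speech model."""
    llr = frame_loglik(X, *speech_gmm) - frame_loglik(X, *nonspeech_gmm)
    return llr > threshold

# Toy usage with hypothetical 2-dimensional "features" and one-component UBMs.
rng = np.random.default_rng(0)
speech_ubm = (np.array([1.0]), np.array([[2.0, 2.0]]), np.ones((1, 2)))
nonspeech_ubm = (np.array([1.0]), np.array([[-2.0, -2.0]]), np.ones((1, 2)))
speech_frames = rng.normal(3.0, 1.0, size=(50, 2))
noise_frames = rng.normal(-3.0, 1.0, size=(50, 2))

speech_gmm = (speech_ubm[0], map_adapt_means(speech_frames, *speech_ubm), speech_ubm[2])
nonspeech_gmm = (nonspeech_ubm[0], map_adapt_means(noise_frames, *nonspeech_ubm), nonspeech_ubm[2])
labels = vad_decisions(np.vstack([speech_frames, noise_frames]), speech_gmm, nonspeech_gmm)
```

With few frames, the adapted means stay close to the UBM means. As the soft counts grow relative to the relevance factor, they move toward the utterance statistics, which is why this kind of adaptation is often preferred over retraining a model from scratch on a short utterance.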


Metadata
Title
Improving the self-adaptive voice activity detector for speaker verification using map adaptation and asymmetric tapers
Authors
Nassim Asbai
Messaoud Bengherabi
Abderrahmane Amrouche
Youcef Aklouf
Publication date
1 June 2015
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 2/2015
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-014-9260-6
