Published in: International Journal of Speech Technology 3/2016

25 July 2016

Improvements on self-adaptive voice activity detector for telephone data

Authors: Haoran Wei, Yanhua Long, Hongwei Mao

Abstract

Voice activity detection (VAD) has been studied for many decades, and energy-based VAD is the most commonly used approach. Energy-based VAD performs well in noise-free environments but deteriorates in noisy ones. Self-adaptive VAD performs much better than traditional energy-based VAD in many respects. However, its single minimum energy threshold does not perform well across recordings with different channel conditions or background noises. In this paper, we make several improvements to the self-adaptive VAD to address this issue and enhance detection performance. A k-means-based average energy clustering approach is proposed to find a better minimum energy threshold for each speech recording. In the VAD decision phase, the new threshold is used for the likelihood ratio test. Furthermore, better results are achieved by applying median filtering as a post-processing step of the self-adaptive VAD to smooth short-time noise VAD errors. Experimental results on a subset of the NIST 2006 speaker recognition evaluation dataset show that the proposed method outperforms both the traditional energy-based and self-adaptive VAD approaches.
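
To make the two ideas named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a two-cluster k-means over per-frame log energies, takes the midpoint between the two centroids as a per-recording threshold, and smooths the binary decisions with a median filter. The function names, the frame length and hop, the choice of k = 2, the midpoint rule, and the filter width are all illustrative assumptions. The likelihood ratio test mentioned in the abstract is not reproduced here; frames are simply compared against the threshold.

```python
import numpy as np

def frame_log_energy(signal, frame_len=200, hop=80):
    """Log energy (dB) per frame. frame_len and hop are assumed values
    (25 ms and 10 ms at an 8 kHz telephone sampling rate)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Small constant avoids log(0) on silent frames.
    return 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)

def two_means_threshold(energies, n_iter=20):
    """1-D k-means with k=2 (assumed), initialised at the min and max energy.
    Returns the midpoint between the low and high centroids."""
    lo, hi = energies.min(), energies.max()
    for _ in range(n_iter):
        is_hi = np.abs(energies - hi) < np.abs(energies - lo)
        if is_hi.all() or not is_hi.any():
            break  # degenerate split: keep the current centroids
        lo, hi = energies[~is_hi].mean(), energies[is_hi].mean()
    return 0.5 * (lo + hi)

def median_smooth(decisions, width=5):
    """Odd-width median filter over binary frame decisions (0 or 1)."""
    pad = width // 2
    padded = np.pad(decisions.astype(int), pad, mode="edge")
    # Row i holds the decisions shifted by i frames; the column-wise median
    # is the median of each sliding window.
    windows = np.stack([padded[i : i + len(decisions)] for i in range(width)])
    return np.median(windows, axis=0).astype(int)

def simple_vad(signal, width=5):
    """Threshold frames on a recording-specific energy, then smooth."""
    energies = frame_log_energy(signal)
    raw = (energies > two_means_threshold(energies)).astype(int)
    return median_smooth(raw, width)
```

Calling `simple_vad(signal)` on a one-dimensional NumPy array of samples returns one 0/1 label per frame. Because the threshold is recomputed from each recording's own energy distribution, it can shift with the channel and noise level, which is the motivation the abstract gives for replacing a single fixed minimum energy threshold.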

Metadata
Title
Improvements on self-adaptive voice activity detector for telephone data
Authors
Haoran Wei
Yanhua Long
Hongwei Mao
Publication date
25 July 2016
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 3/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-016-9355-3
