
International Journal of Speech Technology

Published: 25 July 2016

Improvements on self-adaptive voice activity detector for telephone data

Authors: Haoran Wei, Yanhua Long, Hongwei Mao

Published in: International Journal of Speech Technology | Issue 3/2016


Abstract

Voice activity detection (VAD) has been studied for many decades, and energy-based VAD is the most commonly used approach. Energy VAD performs well in noise-free environments but deteriorates in noisy ones. Self-adaptive VAD performs much better than traditional energy VAD in many respects. However, its single minimum energy threshold does not perform well across different channel conditions or background noises. In this paper, we make several improvements to the self-adaptive VAD to address this issue and enhance detection performance. A k-means based average energy clustering approach is proposed to find a better minimum energy threshold for each speech recording. In the VAD decision phase, the new threshold is used for the likelihood ratio test. Furthermore, better results are achieved by applying median filtering as a post-processing step of the self-adaptive VAD to smooth short-time noise VAD errors. Experimental results on a subset of the NIST 2006 speaker recognition evaluation dataset show that the proposed method outperforms both the traditional energy-based and self-adaptive VAD approaches.
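The two ideas named in the abstract, a per-recording energy threshold found by clustering and median filtering of the frame decisions, can be illustrated with a rough sketch. This is not the authors' algorithm: the abstract gives no parameters, so the frame length, hop size, the choice of two clusters, the midpoint-between-centroids threshold, and the filter width are all assumptions, and a plain energy comparison stands in for the likelihood ratio test. All function names are hypothetical.

```python
import numpy as np

def frame_log_energy(signal, frame_len=200, hop=80):
    """Short-time log energy in dB for each frame (frame sizes are assumed)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energies[i] = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
    return energies

def two_means_threshold(energies, n_iter=50):
    """1-D k-means with k=2 on frame energies.

    Returns the midpoint between the low and high centroids; using the
    midpoint as the threshold is an assumption, not taken from the paper.
    """
    lo, hi = energies.min(), energies.max()
    for _ in range(n_iter):
        assign_hi = np.abs(energies - hi) < np.abs(energies - lo)
        if assign_hi.all() or (~assign_hi).all():
            break  # one cluster is empty, keep current centroids
        new_lo = energies[~assign_hi].mean()
        new_hi = energies[assign_hi].mean()
        if new_lo == lo and new_hi == hi:
            break  # converged
        lo, hi = new_lo, new_hi
    return 0.5 * (lo + hi)

def median_smooth(decisions, width=5):
    """Odd-width median filter over binary decisions, edges replicated."""
    half = width // 2
    padded = np.pad(np.asarray(decisions).astype(int), half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return np.median(windows, axis=1) > 0.5

def simple_vad(signal, frame_len=200, hop=80, width=5):
    """Energy thresholding per recording, followed by median smoothing."""
    energies = frame_log_energy(signal, frame_len, hop)
    threshold = two_means_threshold(energies)
    raw = energies > threshold  # stand-in for the likelihood ratio test
    return median_smooth(raw, width)
```

Because the threshold is re-estimated from each recording's own energy distribution, a quiet channel and a loud channel get different thresholds, which is the motivation the abstract gives for replacing a single fixed minimum. The median filter then removes isolated one- or two-frame flips in the decision sequence.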



Title: Improvements on self-adaptive voice activity detector for telephone data
Authors: Haoran Wei, Yanhua Long, Hongwei Mao
Publication date: 25 July 2016
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 3/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-016-9355-3

