
2017 | OriginalPaper | Chapter

13. Voice Activity Detection

Authors: Tom Bäckström, Christian Uhle

Published in: Speech Coding

Publisher: Springer International Publishing


Abstract

Voice Activity Detection (VAD) indicates whether an audio signal contains speech. Besides speech coding and transmission, many other applications in speech and audio processing benefit from this information, and their performance depends crucially on the accuracy and robustness of the applied VAD. Various approaches to detecting speech have been developed, but in challenging scenarios, e.g. hands-free communication in noisy environments or dialog over background music, there is still room for improvement. In this chapter, we describe the problem and the environments in which VAD operates, and discuss the procedure, example methods and their evaluation. The more challenging application scenarios in particular illustrate how superior human hearing can be to implementations of audio signal processing.


Metadata

Title: Voice Activity Detection
Authors: Tom Bäckström, Christian Uhle
Copyright Year: 2017
DOI: https://doi.org/10.1007/978-3-319-50204-5_13