
2015 | OriginalPaper | Chapter

Real-Life Voice Activity Detection Based on Audio-Visual Alignment

Authors: Jin Wang, Chao Liang, Xiaochen Wang, Zhongyuan Wang

Published in: Advances in Multimedia Information Processing -- PCM 2015

Publisher: Springer International Publishing


Abstract

Voice activity detection (VAD) identifies whether the people appearing in multimedia content are speaking. Most recent research combines audio and visual information for VAD, which outperforms the earlier audio-only or visual-only approaches. However, current methods rely on supervised classifiers trained on new features composed of audio and visual information. In this paper, we propose a novel method that detects voice activity through audio-visual alignment. Because voice activity follows a temporal order across the whole audio and visual streams, we use the Needleman-Wunsch algorithm to align the two different sequences. Our experimental results indicate that the proposed approach outperforms existing VAD algorithms, with an accuracy of about 85% in a real-life environment.
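The alignment step named in the abstract can be illustrated with a minimal sketch of Needleman-Wunsch global alignment. The abstract does not state the scoring scheme or how the audio and visual sequences are encoded, so everything specific here is an assumption made for illustration: the function name `needleman_wunsch`, the per-frame binary labels (1 = activity, 0 = none), and the scores `match=1`, `mismatch=-1`, `gap=-1`. It is not the authors' implementation.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b.

    Returns (score, aligned_a, aligned_b); gaps are marked with None.
    The scoring values are illustrative assumptions, not the paper's.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two elements
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append(None)
            i -= 1
        else:
            aligned_a.append(None)
            aligned_b.append(b[j - 1])
            j -= 1
    aligned_a.reverse()
    aligned_b.reverse()
    return score[n][m], aligned_a, aligned_b


# Hypothetical per-frame labels: audio activity vs. lip-motion activity.
audio_labels = [1, 1, 0]
visual_labels = [1, 0]
total, audio_aligned, visual_aligned = needleman_wunsch(audio_labels, visual_labels)
```

Because the alignment is global, every element of both sequences appears in the output, either paired with an element of the other sequence or paired with a gap. This suits two streams that cover the same time span but may be sampled or segmented differently.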


Metadata
Copyright Year
2015
DOI
https://doi.org/10.1007/978-3-319-24078-7_11