2015 | OriginalPaper | Chapter

Real-Life Voice Activity Detection Based on Audio-Visual Alignment

Authors: Jin Wang, Chao Liang, Xiaochen Wang, Zhongyuan Wang

Published in: Advances in Multimedia Information Processing -- PCM 2015

Publisher: Springer International Publishing

Abstract

Voice activity detection (VAD) identifies whether the people in multimedia content are speaking. Most research has focused on combining audio and visual information for voice activity detection, which outperforms the audio-only and visual-only approaches proposed earlier. However, current methods rely on supervised classifiers trained on new features that combine audio and visual information. In this paper, we propose a novel method that detects voice activity through audio-visual alignment. Because voice activity follows a consistent temporal order across the audio and visual information, we use the Needleman-Wunsch algorithm to align the two different sequences. Our experimental results indicate that the proposed approach outperforms existing VAD algorithms, with an accuracy of about 85% in real-life environments.
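
The abstract names the Needleman-Wunsch algorithm as the alignment step but does not give the features or scoring scheme. The following is a minimal sketch of generic Needleman-Wunsch global alignment, not the authors' implementation. The function name `needleman_wunsch`, the scores (`match=1`, `mismatch=-1`, `gap=-1`), and the per-frame binary labels in the usage lines are all assumptions made for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b.

    Returns (score, aligned_a, aligned_b); gaps are marked with None.
    The scoring values are illustrative assumptions, not taken from the paper.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two elements
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append(None)
            i -= 1
        else:
            aligned_a.append(None)
            aligned_b.append(b[j - 1])
            j -= 1
    return score[n][m], aligned_a[::-1], aligned_b[::-1]


# Hypothetical per-frame labels: 1 = activity detected, 0 = none.
audio_labels = [0, 1, 1, 1, 0, 0, 1]
visual_labels = [0, 1, 1, 0, 0, 1]
total, audio_aligned, visual_aligned = needleman_wunsch(audio_labels, visual_labels)
print(total, audio_aligned, visual_aligned)
```

Global alignment suits this setting because both sequences cover the same time span from start to end, so every element of each sequence must be accounted for, either paired with an element of the other sequence or matched against a gap.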

Title: Real-Life Voice Activity Detection Based on Audio-Visual Alignment
Authors: Jin Wang, Chao Liang, Xiaochen Wang, Zhongyuan Wang
Publisher: Springer International Publishing
Book: Advances in Multimedia Information Processing -- PCM 2015
Print ISBN: 978-3-319-24077-0

Electronic ISBN: 978-3-319-24078-7

Copyright Year: 2015
DOI: https://doi.org/10.1007/978-3-319-24078-7_11
