
2015 | OriginalPaper | Book chapter

Real-Life Voice Activity Detection Based on Audio-Visual Alignment

Authors: Jin Wang, Chao Liang, Xiaochen Wang, Zhongyuan Wang

Published in: Advances in Multimedia Information Processing -- PCM 2015

Publisher: Springer International Publishing


Abstract

Voice activity detection (VAD) is a technology for identifying whether the persons in multimedia content are speaking. Most recent research efforts have focused on using both audio and visual information for voice activity detection, which outperforms the audio-only or visual-only approaches proposed earlier. However, current methods rely on supervised classifiers trained on new features that combine audio and visual information. In this paper, we propose a novel method to detect voice activity by audio-visual alignment. Because voice activity follows a temporal order across the whole audio and visual streams, we use the Needleman-Wunsch algorithm to align the two different sequences. Compared to existing VAD algorithms, our experimental results indicate that the proposed approach performs better, and its accuracy reaches about 85% in real-life environments.
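The abstract names the Needleman-Wunsch algorithm but gives no details of how it is applied. As a rough illustration only, the following Python sketch performs a global alignment of two label sequences, for example per-frame audio activity labels and per-frame visual (lip) activity labels. The binary labels, the function name `needleman_wunsch`, and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration, not the features or parameters used in the paper.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def needleman_wunsch(
    a: Sequence[int],
    b: Sequence[int],
    match: int = 1,      # assumed reward when two labels agree
    mismatch: int = -1,  # assumed penalty when two labels disagree
    gap: int = -1,       # assumed penalty for skipping a frame
) -> Tuple[int, List[Pair]]:
    """Globally align sequences a and b.

    Returns the optimal alignment score and a list of index pairs
    (i, j); None on one side marks a gap in that sequence.
    """
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Hypothetical per-frame labels: 1 = active, 0 = inactive.
audio_labels = [0, 1, 1, 0]
visual_labels = [1, 1, 0]
total, alignment = needleman_wunsch(audio_labels, visual_labels)
```

In this toy input the first audio frame has no visual counterpart, so the alignment pairs it with a gap and matches the remaining three frames. How the aligned pairs would be turned into a speaking or not-speaking decision is left open here, since the abstract does not describe that step.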


Metadata
Title
Real-Life Voice Activity Detection Based on Audio-Visual Alignment
Authors
Jin Wang
Chao Liang
Xiaochen Wang
Zhongyuan Wang
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-24078-7_11
