Top

Published in:

2016 | OriginalPaper | Chapter

Cross-Modal Supervision for Learning Active Speaker Detection in Video

Authors : Punarjay Chakravarty, Tinne Tuytelaars

Published in: Computer Vision – ECCV 2016

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion - facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset, to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision, by transferring knowledge from one modality to another.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

previous chapter Online Action Detection

next chapter Recurrent Temporal Deep Field for Semantic Video Labeling

http://people.cs.ubc.ca/~schmidtm/Software/minFunc.html.

https://youtu.be/6GzxbrO0DHM.

Khoury, E., Sénac, C., Joly, P.: Audiovisual diarization of people in video content. Multimedia Tools Appl. 68(3), 747–775 (2014)CrossRef

Everingham, M., Sivic, J., Zisserman, A.: Hello! my name is... buffy”-automatic naming of characters in tv video. In: BMVC, vol. 2, pp. 6 (2006)

Everingham, M., Sivic, J., Zisserman, A.: Taking the bite out of automatic naming of characters in TV video. Image Vis. Comput. 27(5), 545–559 (2009)CrossRef

Haider, F., Al Moubayed, S.: Towards speaker detection using lips movements for humanmachine multiparty dialogue. In: 2012 FONETIK (2012)

Chakravarty, P., Mirzaei, S., Tuytelaars, T., Vanhamme, H.: Who’s speaking? audio-supervised classification of active speakers in video. In: ACM International Conference on Multimodal Interaction (ICMI) (2015)

Germain, F., Sun, D.L., Mysore, G.J.: Speaker and noise independent voice activity detection. In: INTERSPEECH, pp. 732–736 (2013)

Bilen, H., Namboodiri, V.P., Gool, L.J.: Object and action classification with latent window parameters. Int. J. Comput. Vis. 106(3), 237–251 (2014)CrossRef

Bilen, H., Pedersoli, M., Tuytelaars, T.: Weakly supervised object detection with posterior regularization. In: British Machine Vision Conference (2014)

Bilen, H., Pedersoli, M., Tuytelaars, T.: Weakly supervised object detection with convex clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1081–1089 (2015)

10.

Deselaers, T., Alexe, B., Ferrari, V.: Weakly supervised localization and learning with generic knowledge. Int. J. Comput. Vis. 100(3), 275–293 (2012)MathSciNetCrossRef

11.

Song, H.O., Girshick, R., Jegelka, S., Mairal, J., Harchaoui, Z., Darrell, T.: On learning to localize objects with minimal supervision. arXiv preprint arXiv:1403.1024 (2014)

12.

Nguyen, M.H., Torresani, L., de la Torre, F., Rother, C.: Weakly supervised discriminative localization and classification: a joint learning process. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 1925–1932. IEEE (2009)

13.

Bojanowski, P., Bach, F., Laptev, I., Ponce, J., Schmid, C., Sivic, J.: Finding actors and actions in movies. In: 2013 IEEE International Conference on Computer Vision (ICCV), pp. 2280–2287. IEEE (2013)

14.

Khosla, A., Zhou, T., Malisiewicz, T., Efros, A.A., Torralba, A.: Undoing the damage of dataset bias. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7572, pp. 158–171. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33718-5_12

15.

Tommasi, T., Quadrianto, N., Caputo, B., Lampert, C.H.: Beyond dataset bias: multi-task unaligned shared knowledge transfer. In: Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z. (eds.) ACCV 2012, Part I. LNCS, vol. 7724, pp. 1–15. Springer, Heidelberg (2013)

16.

Aljundi, R., Emonet, R., Muselet, D., Sebban, M.: Landmarks-based kernelized subspace alignment for unsupervised domain adaptation. In: Computer Vision and Pattern Recognition (CVPR 2015) (2015)

17.

Fernando, B., Habrard, A., Sebban, M., Tuytelaars, T.: Unsupervised visual domain adaptation using subspace alignment. In: 2013 IEEE International Conference on Computer Vision (ICCV), pp. 2960–2967. IEEE (2013)

18.

Aytar, Y., Zisserman, A.: Tabula rasa: model transfer for object category detection. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 2252–2259. IEEE (2011)

19.

Tommasi, T., Caputo, B.: The more you know, the less you learn: from knowledge transfer to one-shot learning of object categories. In: BMVC, Number LIDIAP-CONF-2009-049 (2009)

20.

Tommasi, T., Orabona, F., Caputo, B.: Safety in numbers: learning categories from few examples with multi model knowledge transfer. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3081–3088. IEEE (2010)

21.

Chen, J., Liu, X., Tu, P., Aragones, A.: Person-specific expression recognition with transfer learning. In: 2012 19th IEEE International Conference on Image Processing (ICIP), pp. 2621–2624. IEEE (2012)

22.

Chu, W.S., De la Torre, F., Cohn, J.F.: Selective transfer machine for personalized facial action unit detection. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3515–3522. IEEE (2013)

23.

Zen, G., Sangineto, E., Ricci, E., Sebe, N.: Unsupervised domain adaptation for personalized facial emotion recognition. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp. 128–135. ACM (2014)

24.

Gavves, E., Mensink, T., Tommasi, T., Snoek, C.G., Tuytelaars, T.: Active transfer learning with zero-shot priors: reusing past datasets for future tasks. arXiv preprint arXiv:1510.01544 (2015)

25.

Mirzaei, S., Van hamme, H., Norouzi, Y.: Blind audio source separation of stereo mixtures using bayesian non-negative matrix factorization. In: Signal Processing Conference (EUSIPCO), pp. 621–625, September 2014

26.

Pletscher, P., Ong, C.S., Buhmann, J.M.: Entropy and margin maximization for structured output learning. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds.) ECML PKDD 2010. LNCS (LNAI), vol. 6323, pp. 83–98. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15939-8_6 CrossRef

27.

Uijlings, J.R., Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. Int. J. Comput. Vis. 104(2), 154–171 (2013)CrossRef

28.

Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV, Sydney, Australia, pp. 3551–3558, December 2013

29.

Perronnin, F., Sánchez, J., Mensink, T.: Improving the fisher kernel for large-scale image classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 143–156. Springer, Heidelberg (2010)CrossRef

30.

Vedaldi, A., Zisserman, A.: Efficient additive kernels via explicit feature maps. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 480–492 (2012)CrossRef

31.

Peng, X., Wang, L., Wang, X., Qiao, Y.: Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. CoRR abs/1405.4506 (2014)

32.

Girshick, R.B., Felzenszwalb, P.F., McAllester, D.: Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/rbg/latent-release5/

Title: Cross-Modal Supervision for Learning Active Speaker Detection in Video
Authors: Punarjay Chakravarty
Tinne Tuytelaars
Publisher: Springer International Publishing
Book: Computer Vision – ECCV 2016
Print ISBN: 978-3-319-46453-4

Electronic ISBN: 978-3-319-46454-1

Copyright Year: 2016
DOI: https://doi.org/10.1007/978-3-319-46454-1_18

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Premium Partner