2016 | OriginalPaper | Chapter

Cross-Modal Supervision for Learning Active Speaker Detection in Video

Authors: Punarjay Chakravarty, Tinne Tuytelaars

Published in: Computer Vision – ECCV 2016

Publisher: Springer International Publishing

Abstract

In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper-body motion, namely the facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person-specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for manual supervision, by transferring knowledge from one modality to another.
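
To make the idea of audio-derived weak supervision concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not specify the VAD, the features, or the classifier, so everything below is an illustrative assumption. A toy energy threshold stands in for VAD, a single synthetic "motion" value per frame stands in for the spatio-temporal features, logistic regression stands in for the vision-based classifier, and a sliding majority vote stands in for temporal continuity. All function names are hypothetical.

```python
import math
import random


def energy_vad(frame_energies, threshold):
    """Toy VAD (assumption): label a frame 1 if its audio energy exceeds threshold."""
    return [1 if e > threshold else 0 for e in frame_energies]


def train_logistic(features, weak_labels, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent on audio-derived labels."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, weak_labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(features, w, b):
    """Per-frame visual prediction: 1 means 'this person is speaking'."""
    return [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            for x in features]


def smooth(labels, window=5):
    """Sliding majority vote, a crude stand-in for temporal continuity."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        segment = labels[max(0, i - half): i + half + 1]
        out.append(1 if 2 * sum(segment) > len(segment) else 0)
    return out


# Synthetic data (assumption): the person alternates speaking every 20 frames.
random.seed(0)
speaking = [1 if (t // 20) % 2 == 0 else 0 for t in range(200)]
audio_energy = [(1.0 if s else 0.1) + random.uniform(-0.05, 0.05) for s in speaking]
visual = [[(0.8 if s else 0.2) + random.gauss(0, 0.15)] for s in speaking]

weak = energy_vad(audio_energy, threshold=0.5)  # labels come from audio only
w, b = train_logistic(visual, weak)             # classifier sees video only
smoothed = smooth(predict(visual, w, b))
agreement = sum(p == s for p, s in zip(smoothed, speaking)) / len(speaking)
print(f"agreement with ground truth after smoothing: {agreement:.2f}")
```

The point of the sketch is the data flow: no manual labels are used, the audio stream produces the training targets, and only the visual stream is needed at prediction time.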

Metadata
Title
Cross-Modal Supervision for Learning Active Speaker Detection in Video
Authors
Punarjay Chakravarty
Tinne Tuytelaars
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-46454-1_18
