
2018 | OriginalPaper | Chapter

Learnable PINs: Cross-modal Embeddings for Person Identity

Authors : Arsha Nagrani, Samuel Albanie, Andrew Zisserman

Published in: Computer Vision – ECCV 2018

Publisher: Springer International Publishing


Abstract

We propose and investigate an identity-sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice.
We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining targeted to this task that is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios and establish a benchmark for this novel task; finally, we show an application of using the joint embedding for automatically retrieving and labelling characters in TV dramas.
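To make the idea of a joint face–voice embedding concrete, here is a minimal sketch of the kind of margin-based contrastive objective commonly used for such cross-modal self-supervision. The two-dimensional embeddings and the margin value of 0.6 are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

def contrastive_loss(face_emb, voice_emb, is_positive, margin=0.6):
    """Contrastive loss on a face/voice embedding pair.

    A positive pair (face and voice from the same speaking face-track)
    is pulled together; a negative pair is pushed beyond the margin.
    The margin of 0.6 is an illustrative choice, not the paper's value.
    """
    d = float(np.linalg.norm(face_emb - voice_emb))
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2

def l2_normalise(v):
    """Project an embedding onto the unit sphere."""
    return v / np.linalg.norm(v)

# Toy check: a perfect positive pair incurs no loss, while a negative
# pair closer than the margin is penalised.
f = l2_normalise(np.array([1.0, 0.0]))
v_same = f.copy()
v_near = l2_normalise(np.array([0.9, 0.1]))

print(contrastive_loss(f, v_same, True))          # 0.0
print(contrastive_loss(f, v_near, False) > 0.0)   # True
```

Training then amounts to sampling positive pairs from within a speaking face-track and negatives across different tracks, and minimising this loss over both embedding networks jointly.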


Footnotes
1
For a given face image and voice sampled from different speaking face-tracks, the false negative rate of the labelling diminishes as the number of identities represented in the videos grows.
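The claim above can be made concrete under a simplifying assumption: if each of N identities is equally represented across the face-tracks, the chance that a face and a voice sampled from two different tracks nonetheless share an identity is roughly 1/N, so the label noise from treating every such pair as a negative vanishes as N grows. The uniformity assumption is ours, not the paper's.

```python
def false_negative_rate(num_identities):
    """Approximate chance that a face and a voice drawn from two
    different face-tracks actually share an identity, assuming each
    identity is equally represented (a simplifying assumption)."""
    return 1.0 / num_identities

# The automatic negative labelling gets cleaner as identities grow.
for n in (10, 100, 1000):
    print(n, false_negative_rate(n))
```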
 
2
It is difficult to tune this parameter from the loss alone, since a stagnating loss curve is not necessarily indicative of a lack of progress: as the network improves at a given difficulty level, it is presented with harder pairs and continues to incur a high loss. We therefore monitored the mean distance between positive pairs, between negative pairs, and between active pairs (those contributing to the loss term) in each minibatch, and found it effective to increase \(\tau \) by \(10\%\) every two epochs, starting from \(30\%\) up to \(80\%\), and keeping it constant thereafter.
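The schedule described in this footnote can be sketched as follows. We interpret "increase \(\tau \) by \(10\%\)" as ten percentage points per step, which is an assumption on our reading.

```python
def tau_schedule(epoch, start=0.30, end=0.80, step=0.10, every=2):
    """Proportion tau of hardest negatives mined at a given epoch.

    Follows the schedule in the footnote: start at 30%, increase by
    10 percentage points every two epochs, cap at 80%. (Reading
    'increase by 10%' as percentage points is our assumption.)
    """
    return min(end, start + step * (epoch // every))

# Curriculum over the first few epochs: 0.3, 0.4, ..., capped at 0.8.
for e in range(0, 14, 2):
    print(e, round(tau_schedule(e), 2))
```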
 
Metadata
DOI
https://doi.org/10.1007/978-3-030-01261-8_5
