Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning

Authors: Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, Antonio Torralba

Published in: International Journal of Computer Vision | Issue 10/2018

Abstract

The sound of crashing waves, the roar of fast-moving cars—sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper (Owens et al. 2016b) with additional experiments and discussion.
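The abstract describes the training setup only at a high level. The sketch below illustrates the general idea under stated assumptions: the audio "statistical summary" is approximated here by simple band-energy statistics clustered with k-means, the network is a toy CNN, and the data are synthetic. These are illustrative choices, not the authors' architecture or audio features.

```python
"""Minimal sketch of the self-supervised setup described in the abstract:
summarize each clip's audio with simple statistics, cluster the summaries,
and train an image CNN to predict its clip's audio cluster (a free label)."""

import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def audio_summary(waveform, n_bands=16):
    """Crude stand-in for a sound-texture summary: per-band energy mean and std."""
    spec = np.abs(np.fft.rfft(waveform))
    bands = np.array_split(spec, n_bands)
    return np.array([s for b in bands for s in (b.mean(), b.std())])

# Synthetic "dataset": paired frames (3x64x64) and 1-second mono audio clips.
rng = np.random.default_rng(0)
n_clips, sr, n_clusters = 256, 16000, 8
frames = torch.tensor(rng.normal(size=(n_clips, 3, 64, 64)), dtype=torch.float32)
audio = rng.normal(size=(n_clips, sr))

# Cluster the audio summaries; cluster ids serve as training labels.
summaries = np.stack([audio_summary(a) for a in audio])
labels = torch.tensor(
    KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(summaries),
    dtype=torch.long)

# Small CNN trained to predict the audio cluster from the video frame alone.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_clusters))
opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(cnn(frames), labels)
    loss.backward()
    opt.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```

In this formulation the CNN never sees audio at test time; the sound only defines the training targets, which is what makes the learned visual representation useful for downstream recognition tasks.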


Footnotes
1. For conciseness, we sometimes call these “sound-making” objects, even if they are not literally the source of the sound.
2. As a result, this model has a larger pool5 layer than the other methods: \(7 \times 7\) versus \(6 \times 6\). Likewise, the fc6 layer of Wang and Gupta (2015) is smaller (1024 vs. 4096 dims.).
Literature
Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see by moving. In IEEE international conference on computer vision.
Andrew, G., Arora, R., Bilmes, J. A., & Livescu, K. (2013). Deep canonical correlation analysis. In International conference on machine learning.
Arandjelović, R., & Zisserman, A. (2017). Look, listen and learn. In IEEE international conference on computer vision.
Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning sound representations from unlabeled video. In Advances in neural information processing systems.
Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations. In IEEE conference on computer vision and pattern recognition.
de Sa, V. R. (1994a). Learning classification with unlabeled data. In Advances in neural information processing systems (p. 112).
de Sa, V. R. (1994b). Minimizing disagreement for self-supervised classification. In Proceedings of the 1993 Connectionist Models Summer School (p. 300). Psychology Press.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition.
Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In IEEE international conference on computer vision.
Doersch, C., & Zisserman, A. (2017). Multi-task self-supervised visual learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2051–2060).
Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., & Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems.
Ellis, D. P., Zeng, X., & McDermott, J. H. (2011). Classifying soundtracks with audio texture features. In IEEE international conference on acoustics, speech, and signal processing.
Eronen, A. J., Peltonen, V. T., Tuomi, J. T., Klapuri, A. P., Fagerlund, S., Sorsa, T., et al. (2006). Audio-based context recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 14(1), 321–329.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Fisher III, J. W., Darrell, T., Freeman, W. T., & Viola, P. A. (2000). Learning joint statistical models for audio-visual fusion and segregation. In Advances in neural information processing systems.
Gaver, W. W. (1993). What in the world do we hear?: An ecological approach to auditory event perception. Ecological Psychology, 5(1), 1–29.
Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., & Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. In IEEE international conference on acoustics, speech, and signal processing.
Girshick, R. (2015). Fast R-CNN. In IEEE international conference on computer vision.
Goroshin, R., Bruna, J., Tompson, J., Eigen, D., & LeCun, Y. (2015). Unsupervised feature learning from temporal data. arXiv preprint arXiv:1504.02518.
Gupta, S., Hoffman, J., & Malik, J. (2016). Cross modal distillation for supervision transfer. In IEEE conference on computer vision and pattern recognition.
Hershey, J. R., & Movellan, J. R. (1999). Audio vision: Using audio-visual synchrony to locate sounds. In Advances in neural information processing systems.
Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In ACM symposium on theory of computing.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning.
Isola, P. (2015). The discovery of perceptual structure from visual co-occurrences in space and time. PhD thesis.
Isola, P., Zoran, D., Krishnan, D., & Adelson, E. H. (2016). Learning visual groups from co-occurrences in space and time. In International conference on learning representations, workshop.
Jayaraman, D., & Grauman, K. (2015). Learning image representations tied to ego-motion. In IEEE international conference on computer vision.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM multimedia conference.
Kidron, E., Schechner, Y. Y., & Elad, M. (2005). Pixels that sound. In IEEE conference on computer vision and pattern recognition.
Krähenbühl, P., Doersch, C., Donahue, J., & Darrell, T. (2016). Data-dependent initializations of convolutional neural networks. In International conference on learning representations.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems.
Le, Q. V., Ranzato, M. A., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., & Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. In International conference on machine learning.
Lee, K., Ellis, D. P., & Loui, A. C. (2010). Detecting local semantic concepts in environmental sounds using Markov model based clustering. In IEEE international conference on acoustics, speech, and signal processing.
Leung, T., & Malik, J. (2001). Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision, 43(1), 29–44.
Lin, M., Chen, Q., & Yan, S. (2014). Network in network. In International conference on learning representations.
McDermott, J. H., & Simoncelli, E. P. (2011). Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron, 71(5), 926–940.
Mobahi, H., Collobert, R., & Weston, J. (2009). Deep learning from temporal coherence in video. In International conference on machine learning.
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. In International conference on machine learning.
Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2015). Is object localization for free? Weakly-supervised learning with convolutional neural networks. In IEEE conference on computer vision and pattern recognition.
Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., & Freeman, W. T. (2016a). Visually indicated sounds. In IEEE conference on computer vision and pattern recognition.
Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016b). Ambient sound provides supervision for visual learning. In European conference on computer vision.
Pathak, D., Girshick, R., Dollár, P., Darrell, T., & Hariharan, B. (2017). Learning features by watching objects move. In IEEE conference on computer vision and pattern recognition.
Salakhutdinov, R., & Hinton, G. (2009). Semantic hashing. International Journal of Approximate Reasoning, 50(7), 969–978.
Slaney, M., & Covell, M. (2000). FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks. In Advances in neural information processing systems.
Smith, L., & Gasser, M. (2005). The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1–2), 13–29.
Srivastava, N., & Salakhutdinov, R. R. (2012). Multimodal learning with deep Boltzmann machines. In Advances in neural information processing systems.
Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., & Li, L. J. (2015). The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817.
Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos. In IEEE international conference on computer vision.
Weiss, Y., Torralba, A., & Fergus, R. (2009). Spectral hashing. In Advances in neural information processing systems.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE conference on computer vision and pattern recognition.
Zhang, R., Isola, P., & Efros, A. A. (2016). Colorful image colorization. In European conference on computer vision (pp. 649–666). Springer.
Zhang, R., Isola, P., & Efros, A. A. (2017). Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In IEEE conference on computer vision and pattern recognition.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using Places database. In Advances in neural information processing systems.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Object detectors emerge in deep scene CNNs. In International conference on learning representations.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In IEEE conference on computer vision and pattern recognition.
Metadata
Title
Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning
Authors
Andrew Owens
Jiajun Wu
Josh H. McDermott
William T. Freeman
Antonio Torralba
Publication date
11-07-2018
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 10/2018
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-018-1083-5
