Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning

Authors: Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, Antonio Torralba

Published in: International Journal of Computer Vision | Issue 10/2018

Abstract

The sound of crashing waves, the roar of fast-moving cars—sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper (Owens et al. 2016b) with additional experiments and discussion.
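The abstract describes the training setup only at a high level. The sketch below illustrates the general idea under stated assumptions: the audio "statistical summary" is approximated here by simple band-energy statistics clustered with k-means, the network is a toy CNN, and the data are synthetic. These are illustrative choices, not the authors' architecture or audio features.

```python
"""Minimal sketch of the self-supervised setup described in the abstract:
summarize each clip's audio with simple statistics, cluster the summaries,
and train an image CNN to predict its clip's audio cluster (a free label)."""

import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def audio_summary(waveform, n_bands=16):
    """Crude stand-in for a sound-texture summary: per-band energy mean and std."""
    spec = np.abs(np.fft.rfft(waveform))
    bands = np.array_split(spec, n_bands)
    return np.array([s for b in bands for s in (b.mean(), b.std())])

# Synthetic "dataset": paired frames (3x64x64) and 1-second mono audio clips.
rng = np.random.default_rng(0)
n_clips, sr, n_clusters = 256, 16000, 8
frames = torch.tensor(rng.normal(size=(n_clips, 3, 64, 64)), dtype=torch.float32)
audio = rng.normal(size=(n_clips, sr))

# Cluster the audio summaries; cluster ids serve as training labels.
summaries = np.stack([audio_summary(a) for a in audio])
labels = torch.tensor(
    KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(summaries),
    dtype=torch.long)

# Small CNN trained to predict the audio cluster from the video frame alone.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_clusters))
opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(cnn(frames), labels)
    loss.backward()
    opt.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```

In this formulation the CNN never sees audio at test time; the sound only defines the training targets, which is what makes the learned visual representation useful for downstream recognition tasks.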


Footnotes
1. For conciseness, we sometimes call these “sound-making” objects, even if they are not literally the source of the sound.
2. As a result, this model has a larger pool5 layer than the other methods: \(7 \times 7\) versus \(6 \times 6\). Likewise, the fc6 layer of Wang and Gupta (2015) is smaller (1024 vs. 4096 dims.).
Literature
Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see by moving. In IEEE international conference on computer vision.
Andrew, G., Arora, R., Bilmes, J. A., & Livescu, K. (2013). Deep canonical correlation analysis. In International conference on machine learning.
Arandjelović, R., & Zisserman, A. (2017). Look, listen and learn. In IEEE international conference on computer vision.
Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning sound representations from unlabeled video. In Advances in neural information processing systems.
Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations. In IEEE conference on computer vision and pattern recognition.
de Sa, V. R. (1994a). Learning classification with unlabeled data. In Advances in neural information processing systems (p. 112).
de Sa, V. R. (1994b). Minimizing disagreement for self-supervised classification. In Proceedings of the 1993 Connectionist Models Summer School (p. 300). Psychology Press.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition.
Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In IEEE international conference on computer vision.
Doersch, C., & Zisserman, A. (2017). Multi-task self-supervised visual learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2051–2060).
Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., & Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems.
Ellis, D. P., Zeng, X., & McDermott, J. H. (2011). Classifying soundtracks with audio texture features. In IEEE international conference on acoustics, speech, and signal processing.
Eronen, A. J., Peltonen, V. T., Tuomi, J. T., Klapuri, A. P., Fagerlund, S., Sorsa, T., et al. (2006). Audio-based context recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 14(1), 321–329.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Fisher III, J. W., Darrell, T., Freeman, W. T., & Viola, P. A. (2000). Learning joint statistical models for audio-visual fusion and segregation. In Advances in neural information processing systems.
Gaver, W. W. (1993). What in the world do we hear?: An ecological approach to auditory event perception. Ecological Psychology, 5(1), 1–29.
Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., & Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. In IEEE international conference on acoustics, speech, and signal processing.
Girshick, R. (2015). Fast R-CNN. In IEEE international conference on computer vision.
Goroshin, R., Bruna, J., Tompson, J., Eigen, D., & LeCun, Y. (2015). Unsupervised feature learning from temporal data. arXiv preprint arXiv:1504.02518.
Gupta, S., Hoffman, J., & Malik, J. (2016). Cross modal distillation for supervision transfer. In IEEE conference on computer vision and pattern recognition.
Hershey, J. R., & Movellan, J. R. (1999). Audio vision: Using audio-visual synchrony to locate sounds. In Advances in neural information processing systems.
Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In ACM symposium on theory of computing.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning.
Isola, P. (2015). The discovery of perceptual structure from visual co-occurrences in space and time. PhD thesis.
Isola, P., Zoran, D., Krishnan, D., & Adelson, E. H. (2016). Learning visual groups from co-occurrences in space and time. In International conference on learning representations, workshop.
Jayaraman, D., & Grauman, K. (2015). Learning image representations tied to ego-motion. In IEEE international conference on computer vision.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM multimedia conference.
Kidron, E., Schechner, Y. Y., & Elad, M. (2005). Pixels that sound. In IEEE conference on computer vision and pattern recognition.
Krähenbühl, P., Doersch, C., Donahue, J., & Darrell, T. (2016). Data-dependent initializations of convolutional neural networks. In International conference on learning representations.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems.
Le, Q. V., Ranzato, M. A., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., & Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. In International conference on machine learning.
Lee, K., Ellis, D. P., & Loui, A. C. (2010). Detecting local semantic concepts in environmental sounds using Markov model based clustering. In IEEE international conference on acoustics, speech, and signal processing.
Leung, T., & Malik, J. (2001). Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision, 43(1), 29–44.
Lin, M., Chen, Q., & Yan, S. (2014). Network in network. In International conference on learning representations.
McDermott, J. H., & Simoncelli, E. P. (2011). Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron, 71(5), 926–940.
Mobahi, H., Collobert, R., & Weston, J. (2009). Deep learning from temporal coherence in video. In International conference on machine learning.
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. In International conference on machine learning.
Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2015). Is object localization for free? Weakly-supervised learning with convolutional neural networks. In IEEE conference on computer vision and pattern recognition.
Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., & Freeman, W. T. (2016a). Visually indicated sounds. In IEEE conference on computer vision and pattern recognition.
Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016b). Ambient sound provides supervision for visual learning. In European conference on computer vision.
Pathak, D., Girshick, R., Dollár, P., Darrell, T., & Hariharan, B. (2017). Learning features by watching objects move. In IEEE conference on computer vision and pattern recognition.
Salakhutdinov, R., & Hinton, G. (2009). Semantic hashing. International Journal of Approximate Reasoning, 50(7), 969–978.
Slaney, M., & Covell, M. (2000). FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks. In Advances in neural information processing systems.
Smith, L., & Gasser, M. (2005). The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1–2), 13–29.
Srivastava, N., & Salakhutdinov, R. R. (2012). Multimodal learning with deep Boltzmann machines. In Advances in neural information processing systems.
Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., & Li, L. J. (2015). The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817.
Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos. In IEEE international conference on computer vision.
Weiss, Y., Torralba, A., & Fergus, R. (2009). Spectral hashing. In Advances in neural information processing systems.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE conference on computer vision and pattern recognition.
Zhang, R., Isola, P., & Efros, A. A. (2016). Colorful image colorization. In European conference on computer vision (pp. 649–666). Springer.
Zhang, R., Isola, P., & Efros, A. A. (2017). Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In IEEE conference on computer vision and pattern recognition.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using Places database. In Advances in neural information processing systems.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Object detectors emerge in deep scene CNNs. In International conference on learning representations.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In IEEE conference on computer vision and pattern recognition.
Metadata
Title
Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning
Authors
Andrew Owens
Jiajun Wu
Josh H. McDermott
William T. Freeman
Antonio Torralba
Publication date
11-07-2018
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 10/2018
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-018-1083-5
