
20-06-2023 | S.I.: BMVC'21

Visually-Guided Audio Spatialization in Video with Geometry-Aware Multi-task Learning

Authors: Rishabh Garg, Ruohan Gao, Kristen Grauman

Published in: International Journal of Computer Vision | Issue 10/2023

Abstract

Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process. In particular, we develop a multi-task framework that learns geometry-aware features for binaural audio generation by accounting for the underlying room impulse response, the visual stream's coherence with the positions of the sound source(s), and the consistency in geometry of the sounding objects over time. Furthermore, we introduce two new large video datasets: one with realistic binaural audio simulated for real-world scanned environments, and the other with pseudo-binaural audio obtained from ambisonic sounds in YouTube \(360^{\circ }\) videos. On three datasets, we demonstrate the efficacy of our method, which achieves state-of-the-art results.
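A common formulation in this line of work (Gao & Grauman, 2019a) predicts the difference between the two binaural channels from the mono mixture: given \(x_M = x_L + x_R\), the network regresses \(x_D = x_L - x_R\) conditioned on visual features, after which both channels are recovered as \(x_L = (x_M + x_D)/2\) and \(x_R = (x_M - x_D)/2\). The sketch below illustrates how such a backbone could be combined with the auxiliary geometry-aware tasks described in the abstract into one multi-task objective; it is a minimal, hypothetical sketch, and all module names, heads, and loss weights are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch (not the authors' code): difference-channel binaural
# prediction plus auxiliary geometry-aware losses in one multi-task objective.
# Module names, heads, and loss weights are illustrative assumptions.

class GeometryAwareBinauralNet(nn.Module):
    def __init__(self, visual_net, audio_unet, rir_head, coherence_head):
        super().__init__()
        self.visual_net = visual_net          # e.g., a ResNet frame encoder
        self.audio_unet = audio_unet          # e.g., a U-Net over spectrograms
        self.rir_head = rir_head              # auxiliary: predict RIR features
        self.coherence_head = coherence_head  # auxiliary: audio-visual spatial coherence

    def forward(self, mono_spec, frames):
        v = self.visual_net(frames)                 # geometry-aware visual features
        diff_spec = self.audio_unet(mono_spec, v)   # predicted difference channel
        return diff_spec, self.rir_head(v), self.coherence_head(v, mono_spec)

def multitask_loss(diff_pred, left_gt, right_gt,
                   rir_pred, rir_gt, coh_logits, coh_labels,
                   w_rir=0.1, w_coh=0.1):
    diff_gt = left_gt - right_gt                      # difference-channel target
    l_bin = F.mse_loss(diff_pred, diff_gt)            # main binaural task
    l_rir = F.mse_loss(rir_pred, rir_gt)              # room-geometry (RIR) task
    l_coh = F.cross_entropy(coh_logits, coh_labels)   # source-position coherence task
    return l_bin + w_rir * l_rir + w_coh * l_coh
```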


Footnotes
1
The SimBinaural dataset was constructed at, and will be released by, The University of Texas at Austin.
 
2
SoundSpaces (Chen et al. 2020a) provides room impulse responses at a spatial resolution of 1 m. These state-of-the-art RIRs capture how sound from each source propagates and interacts with the surrounding geometry and materials, modeling all the major real-world features of the RIR: direct sounds, early specular/diffuse reflections, reverberations, binaural spatialization, and frequency-dependent effects from materials and air absorption.
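As a concrete illustration of how such RIRs are used, binaural audio for a given source and listener placement can be rendered by convolving the dry (anechoic) source waveform with the left- and right-ear impulse responses for that placement. The snippet below is a minimal sketch of this convolution step, assuming the mono source and the two RIRs are already loaded as NumPy arrays at the same sample rate.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(source, rir_left, rir_right):
    """Convolve a dry mono source with per-ear room impulse responses.

    Each RIR encodes direct sound, early reflections, reverberation, and
    binaural spatialization for one source/listener placement.
    """
    left = fftconvolve(source, rir_left)    # length: len(source) + len(rir_left) - 1
    right = fftconvolve(source, rir_right)
    n = max(len(left), len(right))
    out = np.zeros((2, n), dtype=left.dtype)
    out[0, :len(left)] = left
    out[1, :len(right)] = right
    return out  # (2, num_samples) binaural waveform
```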
 
3
This is the typical case. However, there can be instances where a sound source is not visible in the video at all; for example, music playing from a small radio that is outside the camera's view. While the binaural data generated in such cases is still correct, it may be harder for any model (including ours) to learn from such samples. Empirically, such clips form a very small portion of the data.
 
4
The pre-trained model provided by PseudoBinaural (Xu et al. 2021) is trained on a different split rather than the standard split of Gao and Grauman (2019a), and hence is not directly comparable in Table 2. We evaluate on the new split in Table 3.
 
Literature
Afouras, T., Chung, J. S., & Zisserman, A. (2019). My lips are concealed: Audio-visual speech enhancement through obstructions. In ICASSP.
Arandjelović, R., & Zisserman, A. (2017). Look, listen and learn. In ICCV.
Arandjelović, R., & Zisserman, A. (2018). Objects that sound. In ECCV.
Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning sound representations from unlabeled video. In NeurIPS.
Chen, C., Al-Halah, Z., & Grauman, K. (2021). Semantic audio-visual navigation. In CVPR.
Chen, C., Gao, R., Calamia, P., & Grauman, K. (2022). Visual acoustic matching. In CVPR.
Chen, C., Jain, U., Schissler, C., Gari, S. V. A., Al-Halah, Z., Ithapu, V. K., Robinson, P., & Grauman, K. (2020). SoundSpaces: Audio-visual navigation in 3D environments. In ECCV.
Chen, C., Majumder, S., Al-Halah, Z., Gao, R., Ramakrishnan, S. K., & Grauman, K. (2020). Learning to set waypoints for audio-visual navigation. In ICLR.
Chen, P., Zhang, Y., Tan, M., Xiao, H., Huang, D., & Gan, C. (2020). Generating visually aligned sound from videos. IEEE TIP.
Christensen, J. H., Hornauer, S., & Stella, X. Y. (2020). BatVision: Learning to see 3D spatial layout with two ears. In ICRA.
Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2017). Lip reading sentences in the wild. In CVPR.
Dean, V., Tulsiani, S., & Gupta, A. (2020). See, hear, explore: Curiosity via audio-visual association. In NeurIPS.
Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., & Roberts, A. (2019). GANSynth: Adversarial neural audio synthesis. In ICLR.
Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., & Rubinstein, M. (2018). Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. In SIGGRAPH.
Font, F., Roma, G., & Serra, X. (2013). Freesound technical demo. In Proceedings of the 21st ACM International Conference on Multimedia.
Gabbay, A., Shamir, A., & Peleg, S. (2018). Visual speech enhancement. In INTERSPEECH.
Gan, C., Huang, D., Chen, P., Tenenbaum, J. B., & Torralba, A. (2020). Foley music: Learning to generate music from videos. In ECCV.
Gan, C., Huang, D., Zhao, H., Tenenbaum, J. B., & Torralba, A. (2020). Music gesture for visual sound separation. In CVPR.
Gan, C., Zhang, Y., Wu, J., Gong, B., & Tenenbaum, J. B. (2020). Look, listen, and act: Towards audio-visual embodied navigation. In ICRA.
Gao, R., Chen, C., Al-Halah, Z., Schissler, C., & Grauman, K. (2020). VisualEchoes: Spatial image representation learning through echolocation. In ECCV.
Gao, R., Feris, R., & Grauman, K. (2018). Learning to separate object sounds by watching unlabeled video. In ECCV.
Gao, R., & Grauman, K. (2019a). 2.5D visual sound. In CVPR.
Gao, R., & Grauman, K. (2019b). Co-separating sounds of visual objects. In ICCV.
Gao, R., & Grauman, K. (2021). VisualVoice: Audio-visual speech separation with cross-modal consistency. In CVPR.
Gao, R., Oh, T.-H., Grauman, K., & Torresani, L. (2020). Listen to look: Action recognition by previewing audio. In CVPR.
Garg, R., Gao, R., & Grauman, K. (2021). Geometry-aware multi-task learning for binaural audio generation from video. In BMVC.
Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.
Hu, D., & Li, X. (2016). Temporal multimodal learning in audiovisual speech recognition. In CVPR.
Hu, D., Qian, R., Jiang, M., Tan, X., Wen, S., Ding, E., Lin, W., & Dou, D. (2020). Discriminative sounding objects localization via self-supervised audiovisual matching. In NeurIPS.
Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR.
Korbar, B., Tran, D., & Torresani, L. (2018). Co-training of audio and video representations from self-supervised temporal synchronization. In NeurIPS.
Lu, Y.-D., Lee, H.-Y., Tseng, H.-Y., & Yang, M.-H. (2019). Self-supervised audio spatialization with correspondence classifier. In ICIP.
Majumder, S., Al-Halah, Z., & Grauman, K. (2021). Move2Hear: Active audio-visual source separation. In ICCV.
Majumder, S., & Grauman, K. (2022). Active audio-visual separation of dynamic sound sources. In ECCV.
Morgado, P., Li, Y., & Vasconcelos, N. (2020). Learning representations from audio-visual spatial alignment. In NeurIPS.
Morgado, P., Vasconcelos, N., Langlois, T., & Wang, O. (2018). Self-supervised generation of spatial audio for 360\({}^\circ \) video. In NeurIPS.
Murphy, D. T., & Shelley, S. (2010). OpenAIR: An interactive auralization web resource and database. In Audio Engineering Society Convention 129.
Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. In ECCV.
Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., & Freeman, W. T. (2016). Visually indicated sounds. In CVPR.
Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016). Ambient sound provides supervision for visual learning. In ECCV.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E. Z., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In NeurIPS.
Perraudin, N., Balazs, P., & Søndergaard, P. L. (2013). A fast Griffin-Lim algorithm. In WASPAA.
Purushwalkam, S., Gari, S. V. A., Ithapu, V. K., Schissler, C., Robinson, P., Gupta, A., & Grauman, K. (2021). Audio-visual floorplan reconstruction. In ICCV.
Rayleigh, L. (1875). On our perception of the direction of a source of sound. In Proceedings of the Musical Association.
Richard, A., Markovic, D., Gebru, I. D., Krenn, S., Butler, G., de la Torre, F., & Sheikh, Y. (2021). Neural synthesis of binaural speech from mono audio. In ICLR.
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In MICCAI.
Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. (2019). Self-supervised audio-visual co-segmentation. In ICASSP.
Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., & Batra, D. (2019). Habitat: A platform for embodied AI research. In ICCV.
Schissler, C., Loftin, C., & Manocha, D. (2017). Acoustic classification and optimization for multi-modal rendering of real-world scenes. IEEE Transactions on Visualization and Computer Graphics.
Schroeder, M. R. (1965). New method of measuring reverberation time. The Journal of the Acoustical Society of America, 37(6), 1187–1188.
Senocak, A., Oh, T.-H., Kim, J., Yang, M.-H., & So Kweon, I. (2018). Learning to localize sound source in visual scenes. In CVPR.
Tang, Z., Bryan, N. J., Li, D., Langlois, T. R., & Manocha, D. (2020). Scene-aware audio rendering via deep acoustic analysis. IEEE Transactions on Visualization and Computer Graphics.
Tian, Y., Li, D., & Xu, C. (2020). Unified multisensory perception: Weakly-supervised audio-visual video parsing. In ECCV.
Tian, Y., Shi, J., Li, B., Duan, Z., & Xu, C. (2018). Audio-visual event localization in unconstrained videos. In ECCV.
Tzinis, E., Wisdom, S., Jansen, A., Hershey, S., Remez, T., Ellis, D. P., & Hershey, J. R. (2021). Into the wild with AudioScope: Unsupervised audio-visual separation of on-screen sounds. In ICLR.
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. JMLR.
Wu, Y., Zhu, L., Yan, Y., & Yang, Y. (2019). Dual attention matching for audio-visual event localization. In ICCV.
Xu, X., Dai, B., & Lin, D. (2019). Recursive visual sound separation using minus-plus net. In ICCV.
Xu, X., Zhou, H., Liu, Z., Dai, B., Wang, X., & Lin, D. (2021). Visually informed binaural audio generation without binaural audios. In CVPR.
Yang, K., Russell, B., & Salamon, J. (2020). Telling left from right: Learning spatial correspondence of sight and sound. In CVPR.
Yu, J., Zhang, S.-X., Wu, J., Ghorbani, S., Wu, B., Kang, S., Liu, S., Liu, X., Meng, H., & Yu, D. (2020). Audio-visual recognition of overlapped speech for the LRS2 dataset. In ICASSP.
Zaunschirm, M., Schörkhuber, C., & Höldrich, R. (2018). Binaural rendering of ambisonic signals by head-related impulse response time alignment and a diffuseness constraint. The Journal of the Acoustical Society of America, 143(6), 3616–3627.
Zhao, H., Gan, C., Ma, W.-C., & Torralba, A. (2019). The sound of motions. In ICCV.
Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (2018). The sound of pixels. In ECCV.
Zhou, H., Liu, Y., Liu, Z., Luo, P., & Wang, X. (2019). Talking face generation by adversarially disentangled audio-visual representation. In AAAI.
Zhou, H., Xu, X., Lin, D., Wang, X., & Liu, Z. (2020). Sep-stereo: Visually guided stereophonic audio generation by associating source separation. In ECCV.
Zhou, Y., Wang, Z., Fang, C., Bui, T., & Berg, T. L. (2018). Visual to sound: Generating natural sound for videos in the wild. In CVPR.
Metadata
Title
Visually-Guided Audio Spatialization in Video with Geometry-Aware Multi-task Learning
Authors
Rishabh Garg
Ruohan Gao
Kristen Grauman
Publication date
20-06-2023
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 10/2023
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-023-01816-8
