
20-06-2023 | S.I.: BMVC'21

Visually-Guided Audio Spatialization in Video with Geometry-Aware Multi-task Learning

Authors: Rishabh Garg, Ruohan Gao, Kristen Grauman

Published in: International Journal of Computer Vision | Issue 10/2023

Abstract

Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process. In particular, we develop a multi-task framework that learns geometry-aware features for binaural audio generation by accounting for the underlying room impulse response, the visual stream's coherence with the positions of the sound source(s), and the consistency in geometry of the sounding objects over time. Furthermore, we introduce two new large video datasets: one with realistic binaural audio simulated for real-world scanned environments, and the other with pseudo-binaural audio obtained from ambisonic sounds in YouTube \(360^{\circ }\) videos. On three datasets, we demonstrate the efficacy of our method, which achieves state-of-the-art results.
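A common formulation in this line of work (Gao & Grauman, 2019a) predicts the difference between the two binaural channels from the mono mixture: given \(x_M = x_L + x_R\), the network regresses \(x_D = x_L - x_R\) conditioned on visual features, after which both channels are recovered as \(x_L = (x_M + x_D)/2\) and \(x_R = (x_M - x_D)/2\). The sketch below illustrates how such a backbone could be combined with the auxiliary geometry-aware tasks described in the abstract into one multi-task objective; it is a minimal, hypothetical sketch, and all module names, heads, and loss weights are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch (not the authors' code): difference-channel binaural
# prediction plus auxiliary geometry-aware losses in one multi-task objective.
# Module names, heads, and loss weights are illustrative assumptions.

class GeometryAwareBinauralNet(nn.Module):
    def __init__(self, visual_net, audio_unet, rir_head, coherence_head):
        super().__init__()
        self.visual_net = visual_net          # e.g., a ResNet frame encoder
        self.audio_unet = audio_unet          # e.g., a U-Net over spectrograms
        self.rir_head = rir_head              # auxiliary: predict RIR features
        self.coherence_head = coherence_head  # auxiliary: audio-visual spatial coherence

    def forward(self, mono_spec, frames):
        v = self.visual_net(frames)                 # geometry-aware visual features
        diff_spec = self.audio_unet(mono_spec, v)   # predicted difference channel
        return diff_spec, self.rir_head(v), self.coherence_head(v, mono_spec)

def multitask_loss(diff_pred, left_gt, right_gt,
                   rir_pred, rir_gt, coh_logits, coh_labels,
                   w_rir=0.1, w_coh=0.1):
    diff_gt = left_gt - right_gt                      # difference-channel target
    l_bin = F.mse_loss(diff_pred, diff_gt)            # main binaural task
    l_rir = F.mse_loss(rir_pred, rir_gt)              # room-geometry (RIR) task
    l_coh = F.cross_entropy(coh_logits, coh_labels)   # source-position coherence task
    return l_bin + w_rir * l_rir + w_coh * l_coh
```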


Footnotes
1
The SimBinaural dataset was constructed at, and will be released by, The University of Texas at Austin.
 
2
SoundSpaces (Chen et al. 2020a) provides room impulse responses at a spatial resolution of 1 m. These state-of-the-art RIRs capture how sound from each source propagates and interacts with the surrounding geometry and materials, modeling all the major real-world features of the RIR: direct sounds, early specular/diffuse reflections, reverberations, binaural spatialization, and frequency-dependent effects from materials and air absorption.
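As a concrete illustration of how such RIRs are used, binaural audio for a given source and listener placement can be rendered by convolving the dry (anechoic) source waveform with the left- and right-ear impulse responses for that placement. The snippet below is a minimal sketch of this convolution step, assuming the mono source and the two RIRs are already loaded as NumPy arrays at the same sample rate.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(source, rir_left, rir_right):
    """Convolve a dry mono source with per-ear room impulse responses.

    Each RIR encodes direct sound, early reflections, reverberation, and
    binaural spatialization for one source/listener placement.
    """
    left = fftconvolve(source, rir_left)    # length: len(source) + len(rir_left) - 1
    right = fftconvolve(source, rir_right)
    n = max(len(left), len(right))
    out = np.zeros((2, n), dtype=left.dtype)
    out[0, :len(left)] = left
    out[1, :len(right)] = right
    return out  # (2, num_samples) binaural waveform
```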
 
3
This is the typical case. However, there can be instances where a sound source is not visible in the video at all; for example, music playing from a small radio that is outside the camera's view. While the binaural data generated in such cases is still correct, it may be harder for any model (including ours) to learn from such samples. Empirically, such clips form a very small portion of the data.
 
4
The pre-trained model provided by PseudoBinaural (Xu et al. 2021) is trained on a different split rather than the standard split of Gao and Grauman (2019a), and hence is not directly comparable in Table 2. We evaluate on the new split in Table 3.
 
Literature
Afouras, T., Chung, J. S., & Zisserman, A. (2019). My lips are concealed: Audio-visual speech enhancement through obstructions. In ICASSP.
Arandjelović, R., & Zisserman, A. (2017). Look, listen and learn. In ICCV.
Arandjelović, R., & Zisserman, A. (2018). Objects that sound. In ECCV.
Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning sound representations from unlabeled video. In NeurIPS.
Chen, C., Al-Halah, Z., & Grauman, K. (2021). Semantic audio-visual navigation. In CVPR.
Chen, C., Gao, R., Calamia, P., & Grauman, K. (2022). Visual acoustic matching. In CVPR.
Chen, C., Jain, U., Schissler, C., Gari, S. V. A., Al-Halah, Z., Ithapu, V. K., Robinson, P., & Grauman, K. (2020). SoundSpaces: Audio-visual navigation in 3D environments. In ECCV.
Chen, C., Majumder, S., Al-Halah, Z., Gao, R., Ramakrishnan, S. K., & Grauman, K. (2020). Learning to set waypoints for audio-visual navigation. In ICLR.
Chen, P., Zhang, Y., Tan, M., Xiao, H., Huang, D., & Gan, C. (2020). Generating visually aligned sound from videos. IEEE TIP.
Christensen, J. H., Hornauer, S., & Stella, X. Y. (2020). BatVision: Learning to see 3D spatial layout with two ears. In ICRA.
Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2017). Lip reading sentences in the wild. In CVPR.
Dean, V., Tulsiani, S., & Gupta, A. (2020). See, hear, explore: Curiosity via audio-visual association. In NeurIPS.
Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., & Roberts, A. (2019). GANSynth: Adversarial neural audio synthesis. In ICLR.
Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., & Rubinstein, M. (2018). Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. In SIGGRAPH.
Font, F., Roma, G., & Serra, X. (2013). Freesound technical demo. In Proceedings of the 21st ACM International Conference on Multimedia.
Gabbay, A., Shamir, A., & Peleg, S. (2018). Visual speech enhancement. In INTERSPEECH.
Gan, C., Huang, D., Chen, P., Tenenbaum, J. B., & Torralba, A. (2020). Foley music: Learning to generate music from videos. In ECCV.
Gan, C., Huang, D., Zhao, H., Tenenbaum, J. B., & Torralba, A. (2020). Music gesture for visual sound separation. In CVPR.
Gan, C., Zhang, Y., Wu, J., Gong, B., & Tenenbaum, J. B. (2020). Look, listen, and act: Towards audio-visual embodied navigation. In ICRA.
Gao, R., Chen, C., Al-Halah, Z., Schissler, C., & Grauman, K. (2020). VisualEchoes: Spatial image representation learning through echolocation. In ECCV.
Gao, R., Feris, R., & Grauman, K. (2018). Learning to separate object sounds by watching unlabeled video. In ECCV.
Gao, R., & Grauman, K. (2019a). 2.5D visual sound. In CVPR.
Gao, R., & Grauman, K. (2019b). Co-separating sounds of visual objects. In ICCV.
Gao, R., & Grauman, K. (2021). VisualVoice: Audio-visual speech separation with cross-modal consistency. In CVPR.
Gao, R., Oh, T.-H., Grauman, K., & Torresani, L. (2020). Listen to look: Action recognition by previewing audio. In CVPR.
Garg, R., Gao, R., & Grauman, K. (2021). Geometry-aware multi-task learning for binaural audio generation from video. In BMVC.
Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.
Hu, D., & Li, X. (2016). Temporal multimodal learning in audiovisual speech recognition. In CVPR.
Hu, D., Qian, R., Jiang, M., Tan, X., Wen, S., Ding, E., Lin, W., & Dou, D. (2020). Discriminative sounding objects localization via self-supervised audiovisual matching. In NeurIPS.
Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR.
Korbar, B., Tran, D., & Torresani, L. (2018). Co-training of audio and video representations from self-supervised temporal synchronization. In NeurIPS.
Lu, Y.-D., Lee, H.-Y., Tseng, H.-Y., & Yang, M.-H. (2019). Self-supervised audio spatialization with correspondence classifier. In ICIP.
Majumder, S., Al-Halah, Z., & Grauman, K. (2021). Move2Hear: Active audio-visual source separation. In ICCV.
Majumder, S., & Grauman, K. (2022). Active audio-visual separation of dynamic sound sources. In ECCV.
Morgado, P., Li, Y., & Vasconcelos, N. (2020). Learning representations from audio-visual spatial alignment. In NeurIPS.
Morgado, P., Vasconcelos, N., Langlois, T., & Wang, O. (2018). Self-supervised generation of spatial audio for 360\({}^\circ \) video. In NeurIPS.
Murphy, D. T., & Shelley, S. (2010). OpenAIR: An interactive auralization web resource and database. In Audio Engineering Society Convention 129.
Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. In ECCV.
Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., & Freeman, W. T. (2016). Visually indicated sounds. In CVPR.
Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016). Ambient sound provides supervision for visual learning. In ECCV.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E. Z., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In NeurIPS.
Perraudin, N., Balazs, P., & Søndergaard, P. L. (2013). A fast Griffin-Lim algorithm. In WASPAA.
Purushwalkam, S., Gari, S. V. A., Ithapu, V. K., Schissler, C., Robinson, P., Gupta, A., & Grauman, K. (2021). Audio-visual floorplan reconstruction. In ICCV.
Rayleigh, L. (1875). On our perception of the direction of a source of sound. In Proceedings of the Musical Association.
Richard, A., Markovic, D., Gebru, I. D., Krenn, S., Butler, G., de la Torre, F., & Sheikh, Y. (2021). Neural synthesis of binaural speech from mono audio. In ICLR.
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In MICCAI.
Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. (2019). Self-supervised audio-visual co-segmentation. In ICASSP.
Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., & Batra, D. (2019). Habitat: A platform for embodied AI research. In ICCV.
Schissler, C., Loftin, C., & Manocha, D. (2017). Acoustic classification and optimization for multi-modal rendering of real-world scenes. IEEE Transactions on Visualization and Computer Graphics.
Schroeder, M. R. (1965). New method of measuring reverberation time. The Journal of the Acoustical Society of America, 37(6), 1187–1188.
Senocak, A., Oh, T.-H., Kim, J., Yang, M.-H., & So Kweon, I. (2018). Learning to localize sound source in visual scenes. In CVPR.
Tang, Z., Bryan, N. J., Li, D., Langlois, T. R., & Manocha, D. (2020). Scene-aware audio rendering via deep acoustic analysis. IEEE Transactions on Visualization and Computer Graphics.
Tian, Y., Li, D., & Xu, C. (2020). Unified multisensory perception: Weakly-supervised audio-visual video parsing. In ECCV.
Tian, Y., Shi, J., Li, B., Duan, Z., & Xu, C. (2018). Audio-visual event localization in unconstrained videos. In ECCV.
Tzinis, E., Wisdom, S., Jansen, A., Hershey, S., Remez, T., Ellis, D. P., & Hershey, J. R. (2021). Into the wild with AudioScope: Unsupervised audio-visual separation of on-screen sounds. In ICLR.
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. JMLR.
Wu, Y., Zhu, L., Yan, Y., & Yang, Y. (2019). Dual attention matching for audio-visual event localization. In ICCV.
Xu, X., Dai, B., & Lin, D. (2019). Recursive visual sound separation using minus-plus net. In ICCV.
Xu, X., Zhou, H., Liu, Z., Dai, B., Wang, X., & Lin, D. (2021). Visually informed binaural audio generation without binaural audios. In CVPR.
Yang, K., Russell, B., & Salamon, J. (2020). Telling left from right: Learning spatial correspondence of sight and sound. In CVPR.
Yu, J., Zhang, S.-X., Wu, J., Ghorbani, S., Wu, B., Kang, S., Liu, S., Liu, X., Meng, H., & Yu, D. (2020). Audio-visual recognition of overlapped speech for the LRS2 dataset. In ICASSP.
Zaunschirm, M., Schörkhuber, C., & Höldrich, R. (2018). Binaural rendering of ambisonic signals by head-related impulse response time alignment and a diffuseness constraint. The Journal of the Acoustical Society of America, 143(6), 3616–3627.
Zhao, H., Gan, C., Ma, W.-C., & Torralba, A. (2019). The sound of motions. In ICCV.
Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (2018). The sound of pixels. In ECCV.
Zhou, H., Liu, Y., Liu, Z., Luo, P., & Wang, X. (2019). Talking face generation by adversarially disentangled audio-visual representation. In AAAI.
Zhou, H., Xu, X., Lin, D., Wang, X., & Liu, Z. (2020). Sep-stereo: Visually guided stereophonic audio generation by associating source separation. In ECCV.
Zhou, Y., Wang, Z., Fang, C., Bui, T., & Berg, T. L. (2018). Visual to sound: Generating natural sound for videos in the wild. In CVPR.
Metadata
Title
Visually-Guided Audio Spatialization in Video with Geometry-Aware Multi-task Learning
Authors
Rishabh Garg
Ruohan Gao
Kristen Grauman
Publication date
20-06-2023
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 10/2023
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-023-01816-8
