
20.06.2023 | S.I.: BMVC'21

Visually-Guided Audio Spatialization in Video with Geometry-Aware Multi-task Learning

Authors: Rishabh Garg, Ruohan Gao, Kristen Grauman

Published in: International Journal of Computer Vision | Issue 10/2023


Abstract

Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process. In particular, we develop a multi-task framework that learns geometry-aware features for binaural audio generation by accounting for the underlying room impulse response, the visual stream's coherence with the positions of the sound source(s), and the consistency in geometry of the sounding objects over time. Furthermore, we introduce two new large video datasets: one with realistic binaural audio simulated for real-world scanned environments, and the other with pseudo-binaural audio obtained from ambisonic sounds in YouTube \(360^{\circ }\) videos. On three datasets, we demonstrate the efficacy of our method, which achieves state-of-the-art results.
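Concretely, the multi-task idea sketched in the abstract amounts to combining a primary binaural-prediction loss with geometry-related auxiliary losses. The PyTorch-style snippet below is a minimal illustrative sketch of such an objective under stated assumptions: the specific loss functions, tensor names, and weights (w_rir, w_src, w_geo) are hypothetical and do not reproduce the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(pred_binaural, gt_binaural,
                   pred_rir_feat, gt_rir_feat,
                   pred_source_map, gt_source_map,
                   feat_t, feat_t_next,
                   w_rir=0.1, w_src=0.1, w_geo=0.05):
    """Hypothetical weighted multi-task objective (illustrative only).

    pred_binaural / gt_binaural: predicted vs. ground-truth binaural
        spectrogram targets.
    pred_rir_feat / gt_rir_feat: targets tied to the room impulse response.
    pred_source_map / gt_source_map: sound-source position coherence targets.
    feat_t / feat_t_next: visual geometry features of the sounding object
        at neighboring time steps, encouraged to stay consistent.
    """
    # Primary task: regress the binaural spectrogram.
    loss_binaural = F.mse_loss(pred_binaural, gt_binaural)

    # Auxiliary task 1: predict features related to the room impulse response.
    loss_rir = F.mse_loss(pred_rir_feat, gt_rir_feat)

    # Auxiliary task 2: keep predictions coherent with the visual positions
    # of the sound source(s).
    loss_source = F.binary_cross_entropy_with_logits(pred_source_map, gt_source_map)

    # Auxiliary task 3: temporal consistency of the sounding object's geometry.
    loss_geometry = F.mse_loss(feat_t, feat_t_next)

    return loss_binaural + w_rir * loss_rir + w_src * loss_source + w_geo * loss_geometry
```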


Footnotes
1
The SimBinaural dataset was constructed at, and will be released by, The University of Texas at Austin.
 
2
SoundSpaces (Chen et al. 2020a) provides room impulse responses at a spatial resolution of 1 m. These state-of-the-art RIRs capture how sound from each source propagates and interacts with the surrounding geometry and materials, modeling all the major real-world features of the RIR: direct sounds, early specular/diffuse reflections, reverberations, binaural spatialization, and frequency-dependent effects from materials and air absorption (a simplified rendering sketch follows these footnotes).
 
3
This is the typical case. However, there can be instances where a sound source is not visible in the video at all; for example, if music is playing from a small radio that is out of the camera's view. While the binaural data generated in such cases is still correct, it may be harder for any model (including ours) to learn from such samples. Empirically, such clips make up a very small portion of the data.
 
4
The pre-trained model provided by PseudoBinaural (Xu et al. 2021) is trained on a different split rather than the standard split of Gao and Grauman (2019a), and hence is not directly comparable in Table 2. We evaluate on the new split in Table 3.
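To illustrate how the binaural RIRs described in footnote 2 turn a mono source signal into two-channel audio, the sketch below convolves a single-channel waveform with left- and right-ear impulse responses. It is a simplified, hypothetical rendering step with placeholder arrays and simple peak normalization, not the SoundSpaces or SimBinaural pipeline itself.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono, rir_left, rir_right):
    """Convolve a mono waveform with left/right-ear RIRs to obtain binaural audio."""
    left = fftconvolve(mono, rir_left, mode="full")[: len(mono)]
    right = fftconvolve(mono, rir_right, mode="full")[: len(mono)]
    binaural = np.stack([left, right], axis=0)  # shape: (2, num_samples)
    # Normalize to avoid clipping when writing to a fixed-point audio file.
    peak = np.max(np.abs(binaural)) + 1e-8
    return binaural / peak

# Example usage with placeholder arrays (real RIRs would come from an acoustic simulator):
mono = np.random.randn(16000)          # 1 s of source audio at 16 kHz
rir = np.random.randn(2, 4000) * 0.01  # fake 2-channel impulse response
out = render_binaural(mono, rir[0], rir[1])
```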
 
References
Afouras, T., Chung, J. S., & Zisserman, A. (2019). My lips are concealed: Audio-visual speech enhancement through obstructions. In ICASSP.
Arandjelovic, R., & Zisserman, A. (2017). Look, listen and learn. In ICCV.
Arandjelović, R., & Zisserman, A. (2018). Objects that sound. In ECCV.
Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning sound representations from unlabeled video. In NeurIPS.
Chen, C., Al-Halah, Z., & Grauman, K. (2021). Semantic audio-visual navigation. In CVPR.
Chen, C., Gao, R., Calamia, P., & Grauman, K. (2022). Visual acoustic matching. In CVPR.
Chen, C., Jain, U., Schissler, C., Gari, S. V. A., Al-Halah, Z., Ithapu, V. K., Robinson, P., & Grauman, K. (2020). SoundSpaces: Audio-visual navigation in 3D environments. In ECCV.
Chen, C., Majumder, S., Al-Halah, Z., Gao, R., Ramakrishnan, S. K., & Grauman, K. (2020). Learning to set waypoints for audio-visual navigation. In ICLR.
Chen, P., Zhang, Y., Tan, M., Xiao, H., Huang, D., & Gan, C. (2020). Generating visually aligned sound from videos. IEEE TIP.
Christensen, J. H., Hornauer, S., & Yu, S. X. (2020). BatVision: Learning to see 3D spatial layout with two ears. In ICRA.
Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2017). Lip reading sentences in the wild. In CVPR.
Dean, V., Tulsiani, S., & Gupta, A. (2020). See, hear, explore: Curiosity via audio-visual association. In NeurIPS.
Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., & Roberts, A. (2019). GANSynth: Adversarial neural audio synthesis. In ICLR.
Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., & Rubinstein, M. (2018). Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. In SIGGRAPH.
Font, F., Roma, G., & Serra, X. (2013). Freesound technical demo. In Proceedings of the 21st ACM International Conference on Multimedia.
Gabbay, A., Shamir, A., & Peleg, S. (2018). Visual speech enhancement. In INTERSPEECH.
Gan, C., Huang, D., Chen, P., Tenenbaum, J. B., & Torralba, A. (2020). Foley music: Learning to generate music from videos. In ECCV.
Gan, C., Huang, D., Zhao, H., Tenenbaum, J. B., & Torralba, A. (2020). Music gesture for visual sound separation. In CVPR.
Gan, C., Zhang, Y., Wu, J., Gong, B., & Tenenbaum, J. B. (2020). Look, listen, and act: Towards audio-visual embodied navigation. In ICRA.
Gao, R., Chen, C., Al-Halah, Z., Schissler, C., & Grauman, K. (2020). VisualEchoes: Spatial image representation learning through echolocation. In ECCV.
Gao, R., Feris, R., & Grauman, K. (2018). Learning to separate object sounds by watching unlabeled video. In ECCV.
Gao, R., & Grauman, K. (2019a). 2.5D visual sound. In CVPR.
Gao, R., & Grauman, K. (2019b). Co-separating sounds of visual objects. In ICCV.
Gao, R., & Grauman, K. (2021). VisualVoice: Audio-visual speech separation with cross-modal consistency. In CVPR.
Gao, R., Oh, T.-H., Grauman, K., & Torresani, L. (2020). Listen to look: Action recognition by previewing audio. In CVPR.
Garg, R., Gao, R., & Grauman, K. (2021). Geometry-aware multi-task learning for binaural audio generation from video. In BMVC.
Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.
Hu, D., & Li, X. (2016). Temporal multimodal learning in audiovisual speech recognition. In CVPR.
Hu, D., Qian, R., Jiang, M., Tan, X., Wen, S., Ding, E., Lin, W., & Dou, D. (2020). Discriminative sounding objects localization via self-supervised audiovisual matching. In NeurIPS.
Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR.
Korbar, B., Tran, D., & Torresani, L. (2018). Co-training of audio and video representations from self-supervised temporal synchronization. In NeurIPS.
Lu, Y.-D., Lee, H.-Y., Tseng, H.-Y., & Yang, M.-H. (2019). Self-supervised audio spatialization with correspondence classifier. In ICIP.
Majumder, S., Al-Halah, Z., & Grauman, K. (2021). Move2Hear: Active audio-visual source separation. In ICCV.
Majumder, S., & Grauman, K. (2022). Active audio-visual separation of dynamic sound sources. In ECCV.
Morgado, P., Li, Y., & Vasconcelos, N. (2020). Learning representations from audio-visual spatial alignment. In NeurIPS.
Morgado, P., Vasconcelos, N., Langlois, T., & Wang, O. (2018). Self-supervised generation of spatial audio for 360\({}^\circ \) video. In NeurIPS.
Murphy, D. T., & Shelley, S. (2010). OpenAIR: An interactive auralization web resource and database. In Audio Engineering Society Convention 129.
Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. In ECCV.
Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., & Freeman, W. T. (2016). Visually indicated sounds. In CVPR.
Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016). Ambient sound provides supervision for visual learning. In ECCV.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E. Z., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In NeurIPS.
Perraudin, N., Balazs, P., & Søndergaard, P. L. (2013). A fast Griffin-Lim algorithm. In WASPAA.
Purushwalkam, S., Gari, S. V. A., Ithapu, V. K., Schissler, C., Robinson, P., Gupta, A., & Grauman, K. (2021). Audio-visual floorplan reconstruction. In ICCV.
Rayleigh, L. (1875). On our perception of the direction of a source of sound. In Proceedings of the Musical Association.
Richard, A., Markovic, D., Gebru, I. D., Krenn, S., Butler, G., de la Torre, F., & Sheikh, Y. (2021). Neural synthesis of binaural speech from mono audio. In ICLR.
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention.
Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. (2019). Self-supervised audio-visual co-segmentation. In ICASSP.
Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., & Batra, D. (2019). Habitat: A platform for embodied AI research. In ICCV.
Schissler, C., Loftin, C., & Manocha, D. (2017). Acoustic classification and optimization for multi-modal rendering of real-world scenes. IEEE Transactions on Visualization and Computer Graphics.
Schroeder, M. R. (1965). New method of measuring reverberation time. The Journal of the Acoustical Society of America, 37(6), 1187–1188.
Senocak, A., Oh, T.-H., Kim, J., Yang, M.-H., & So Kweon, I. (2018). Learning to localize sound source in visual scenes. In CVPR.
Tang, Z., Bryan, N. J., Li, D., Langlois, T. R., & Manocha, D. (2020). Scene-aware audio rendering via deep acoustic analysis. IEEE Transactions on Visualization and Computer Graphics.
Tian, Y., Li, D., & Xu, C. (2020). Unified multisensory perception: Weakly-supervised audio-visual video parsing. In ECCV.
Tian, Y., Shi, J., Li, B., Duan, Z., & Xu, C. (2018). Audio-visual event localization in unconstrained videos. In ECCV.
Tzinis, E., Wisdom, S., Jansen, A., Hershey, S., Remez, T., Ellis, D. P., & Hershey, J. R. (2021). Into the wild with AudioScope: Unsupervised audio-visual separation of on-screen sounds. In ICLR.
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. JMLR.
Wu, Y., Zhu, L., Yan, Y., & Yang, Y. (2019). Dual attention matching for audio-visual event localization. In ICCV.
Xu, X., Dai, B., & Lin, D. (2019). Recursive visual sound separation using minus-plus net. In ICCV.
Xu, X., Zhou, H., Liu, Z., Dai, B., Wang, X., & Lin, D. (2021). Visually informed binaural audio generation without binaural audios. In CVPR.
Yang, K., Russell, B., & Salamon, J. (2020). Telling left from right: Learning spatial correspondence of sight and sound. In CVPR.
Yu, J., Zhang, S.-X., Wu, J., Ghorbani, S., Wu, B., Kang, S., Liu, S., Liu, X., Meng, H., & Yu, D. (2020). Audio-visual recognition of overlapped speech for the LRS2 dataset. In ICASSP.
Zaunschirm, M., Schörkhuber, C., & Höldrich, R. (2018). Binaural rendering of ambisonic signals by head-related impulse response time alignment and a diffuseness constraint. The Journal of the Acoustical Society of America, 143(6), 3616–3627.
Zhao, H., Gan, C., Ma, W.-C., & Torralba, A. (2019). The sound of motions. In ICCV.
Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (2018). The sound of pixels. In ECCV.
Zhou, H., Liu, Y., Liu, Z., Luo, P., & Wang, X. (2019). Talking face generation by adversarially disentangled audio-visual representation. In AAAI.
Zhou, H., Xu, X., Lin, D., Wang, X., & Liu, Z. (2020). Sep-Stereo: Visually guided stereophonic audio generation by associating source separation. In ECCV.
Zhou, Y., Wang, Z., Fang, C., Bui, T., & Berg, T. L. (2018). Visual to sound: Generating natural sound for videos in the wild. In CVPR.
Metadata
Title
Visually-Guided Audio Spatialization in Video with Geometry-Aware Multi-task Learning
Authors
Rishabh Garg
Ruohan Gao
Kristen Grauman
Publication date
20.06.2023
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 10/2023
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-023-01816-8
