
2025 | OriginalPaper | Chapter

VIDEOSHOP: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion

Authors: Xiang Fan, Anand Bhattad, Ranjay Krishna

Published in: Computer Vision – ECCV 2024

Publisher: Springer Nature Switzerland


Abstract

We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantically, spatially, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc., with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher-quality edits than 6 baselines on 2 editing benchmarks across 10 evaluation metrics.
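To make the pipeline described in the abstract concrete (invert the source video's latents into noise, then regenerate the video conditioned on the edited first frame), the sketch below is a minimal illustration only, not the authors' implementation. The toy denoiser, the noise schedule, and the simple linear extrapolation of successive noise estimates are assumptions chosen for readability; a real system would run these steps in the latent space of a pretrained image-conditioned video diffusion model such as Stable Video Diffusion.

# Minimal sketch (assumed, not the authors' code): DDIM-style inversion of
# video latents with a crude noise extrapolation, then regeneration of the
# video conditioned on the edited first-frame latent.
import torch

STEPS = 10
T_FRAMES, CH, H, W = 8, 4, 32, 32
alphas = torch.linspace(0.99, 0.10, 1000)          # toy cumulative-alpha schedule
timesteps = torch.arange(0, 1000, 1000 // STEPS)   # 0, 100, ..., 900

def toy_denoiser(z, t, cond):
    # Hypothetical epsilon predictor conditioned on a first-frame latent;
    # stands in for a pretrained image-conditioned video diffusion model.
    return 0.1 * z + 0.05 * cond.unsqueeze(0) + 0.001 * t

def ddim_invert(z, cond):
    # Deterministic DDIM-style inversion from clean latents toward noise.
    # The linear extrapolation of successive noise estimates is a purely
    # illustrative stand-in for the paper's "noise extrapolation".
    eps_prev = None
    for i in range(STEPS - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a, a_next = alphas[t], alphas[t_next]
        eps = toy_denoiser(z, t, cond)
        if eps_prev is not None:
            eps = 2 * eps - eps_prev                # extrapolate noise estimate
        x0 = (z - (1 - a).sqrt() * eps) / a.sqrt()
        z = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps
        eps_prev = eps
    return z

def ddim_generate(z, cond):
    # Reverse DDIM sampling, conditioned on the *edited* first-frame latent.
    for i in reversed(range(STEPS - 1)):
        t, t_prev = timesteps[i + 1], timesteps[i]
        a, a_prev = alphas[t], alphas[t_prev]
        eps = toy_denoiser(z, t, cond)
        x0 = (z - (1 - a).sqrt() * eps) / a.sqrt()
        z = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return z

source_latents = torch.randn(T_FRAMES, CH, H, W)    # encoded source video
edited_first = source_latents[0] + 0.3              # user's edit to frame 1
noise = ddim_invert(source_latents, source_latents[0])
edited_video = ddim_generate(noise, edited_first)
print(edited_video.shape)                           # torch.Size([8, 4, 32, 32])

In the actual method, the inversion target and the conditioning signal come from the same pretrained video model, and the edited first frame may be produced with any image editor, as the abstract notes; the sketch only shows how inversion and conditioned regeneration fit together.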


Metadata
Title
VIDEOSHOP: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
Authors
Xiang Fan
Anand Bhattad
Ranjay Krishna
Copyright Year
2025
DOI
https://doi.org/10.1007/978-3-031-73254-6_14
