2025 | OriginalPaper | Chapter

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Authors : Zixin Zhu, Xuelu Feng, Dongdong Chen, Junsong Yuan, Chunming Qiao, Gang Hua

Published in: Computer Vision – ECCV 2024

Publisher: Springer Nature Switzerland

Abstract

In this paper, we explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent representation learned from a pre-trained generative T2V model encapsulates rich semantics and coherent temporal correspondences, and thereby naturally facilitates video understanding. Our hypothesis is validated through the classic referring video object segmentation (R-VOS) task. We introduce a novel framework, termed "VD-IT", with dedicated components built upon a fixed pre-trained T2V model. Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching. It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. In addition, instead of using standard Gaussian noise, we propose to predict video-specific noise with an extra noise prediction module, which helps preserve feature fidelity and elevates segmentation quality. Through extensive experiments, we surprisingly observe that fixed generative T2V diffusion models, unlike commonly used video backbones (e.g., Video Swin Transformer) pre-trained with discriminative image/video pretext tasks, exhibit better potential to maintain semantic alignment and temporal consistency. On existing standard benchmarks, our VD-IT achieves highly competitive results, surpassing many state-of-the-art methods. The code is available at https://github.com/buxiangzhiren/VD-IT.
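
To make the design concrete, below is a minimal PyTorch sketch of the three ingredients the abstract describes: text-conditioned feature extraction from a fixed T2V U-Net, image tokens appended to the text embedding as supplementary "textual" inputs, and a trainable module that predicts video-specific noise in place of a standard Gaussian sample. This is an illustration under stated assumptions, not the authors' implementation: the module names (`NoisePredictionModule`, `image_projector`), tensor shapes, and the hooked `unet.up_blocks` attribute are hypothetical; the actual code is in the repository linked above.

```python
# A minimal sketch, assuming a diffusers-style frozen T2V pipeline
# (vae, unet, text_encoder). NoisePredictionModule, image_projector,
# and the hooked unet.up_blocks are illustrative assumptions, not the
# authors' actual module names.
import torch
import torch.nn as nn


class NoisePredictionModule(nn.Module):
    """Trainable head that predicts video-specific noise from the clean
    video latents, used in place of a standard Gaussian sample."""

    def __init__(self, latent_channels: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(latent_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, latent_channels, kernel_size=3, padding=1),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (B, C, T, H, W) video latents from the frozen VAE
        return self.net(latents)


def extract_features(vae, unet, text_encoder, image_projector,
                     noise_head, video, text_ids, timestep: int = 0):
    """One text-conditioned pass through the fixed T2V U-Net, harvesting
    intermediate activations as features for a mask decoder."""
    feats = []
    hooks = [blk.register_forward_hook(lambda m, i, o: feats.append(o))
             for blk in unet.up_blocks]  # assumed U-Net attribute

    b, c, t, h, w = video.shape
    with torch.no_grad():  # the generative T2V model stays fixed
        frames = video.transpose(1, 2).reshape(b * t, c, h, w)
        z = vae.encode(frames).latent_dist.sample() * vae.config.scaling_factor
        z = z.reshape(b, t, *z.shape[1:]).transpose(1, 2)  # (B, C', T, h', w')
        text_emb = text_encoder(text_ids).last_hidden_state  # (B, L, D)

    # Image tokens appended to the text embedding as supplementary
    # "textual" inputs; the projection is trainable and hypothetical.
    img_tokens = image_projector(z)                  # (B, N, D)
    cond = torch.cat([text_emb, img_tokens], dim=1)

    # Video-specific noise instead of torch.randn_like(z); a real noising
    # step would blend z and the noise according to the scheduler's alpha_t.
    z_noisy = z + noise_head(z)
    t_step = torch.full((b,), timestep, device=z.device, dtype=torch.long)

    unet(z_noisy, t_step, encoder_hidden_states=cond)
    for hk in hooks:
        hk.remove()
    return feats  # multi-scale features for the segmentation decoder
```

The key point the sketch tries to capture is that the T2V model is never fine-tuned: gradients flow through the frozen U-Net only to reach the trainable add-ons (the image-token projection, the noise head, and a downstream mask decoder).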

Metadata
Title
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation
Authors
Zixin Zhu
Xuelu Feng
Dongdong Chen
Junsong Yuan
Chunming Qiao
Gang Hua
Copyright Year
2025
DOI
https://doi.org/10.1007/978-3-031-73254-6_26
