
2025 | OriginalPaper | Chapter

Self-supervised Video Copy Localization with Regional Token Representation

Authors: Minlong Lu, Yichen Lu, Siwei Nie, Xudong Yang, Xiaobo Zhang

Published in: Computer Vision – ECCV 2024

Publisher: Springer Nature Switzerland


Abstract

The task of video copy localization aims at finding the start and end timestamps of all copied segments within a pair of untrimmed videos. Recent approaches usually extract frame-level features and generate a frame-to-frame similarity map for the video pair. Learned detectors are used to identify distinctive patterns in the similarity map to localize the copied segments. There are two major limitations associated with these methods. First, they often rely on a single feature for each frame, which is inadequate in capturing local information for typical scenarios in video copy editing, such as picture-in-picture cases. Second, the training of the detectors requires a significant amount of human annotated data, which is highly expensive and time-consuming to acquire. In this paper, we propose a self-supervised video copy localization framework to tackle these issues. We incorporate a Regional Token into the Vision Transformer, which learns to focus on local regions within each frame using an asymmetric training procedure. A novel strategy that leverages the Transitivity Property is proposed to generate copied video pairs automatically, which facilitates the training of the detector. Extensive experiments and visualizations demonstrate the effectiveness of the proposed approach, which is able to outperform the state-of-the-art without using any human annotated data.
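To make the two ideas in the abstract concrete, here is a minimal sketch of (a) the frame-to-frame similarity map that localization detectors operate on, and (b) how the Transitivity Property could be used to derive new copied pairs automatically. All function names and details are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def similarity_map(feats_a, feats_b):
    """Cosine similarity between every frame of video A and video B.

    feats_a: (Ta, d) array of per-frame features; feats_b: (Tb, d).
    Returns a (Ta, Tb) map. Copied segments appear as high-similarity
    diagonal patterns that a detector can then localize in time.
    """
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return a @ b.T

def transitive_pairs(known):
    """Given known copied pairs {(u, v), ...}, infer candidates by
    transitivity: if u shares copied content with v, and v with w,
    then (u, w) is a candidate copied pair. (Checking that the shared
    segments actually intersect is omitted in this sketch.)"""
    inferred = set()
    for u, v in known:
        for x, w in known:
            if v == x and u != w and (u, w) not in known:
                inferred.add((u, w))
    return inferred

# Toy example: video B is an exact copy of video A's frames.
rng = np.random.default_rng(0)
fa = rng.normal(size=(3, 8))
sim = similarity_map(fa, fa.copy())   # (3, 3); diagonal is ~1.0
new_pairs = transitive_pairs({("a", "b"), ("b", "c")})
```

In practice each row of `feats_a` would come from the Vision Transformer described in the paper (including its Regional Token, so that local regions such as picture-in-picture insets are represented), and the inferred pairs would feed the detector's self-supervised training.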


Metadata
DOI: https://doi.org/10.1007/978-3-031-73254-6_2
