
01.06.2023 | Regular Paper

CLIP-based fusion-modal reconstructing hashing for large-scale unsupervised cross-modal retrieval

Authors: Li Mingyong, Li Yewen, Ge Mingyuan, Ma Longfei

Published in: International Journal of Multimedia Information Retrieval | Issue 1/2023


Abstract

As multi-modal data proliferate, a single retrieval modality no longer satisfies users' information needs. Deep hashing retrieval algorithms have attracted much attention for their efficient storage and fast query speed. Existing unsupervised hashing methods generally suffer from two limitations: (1) they fail to adequately capture the latent semantic relevance and co-occurrence information in data from different modalities, so the learned features and hash codes cannot effectively bridge the heterogeneity and semantic gaps in multi-modal data; (2) they typically construct a similarity matrix to guide hash-code learning, and inaccuracies in that matrix lead to sub-optimal retrieval performance. To address these issues, we propose a novel CLIP-based fusion-modal reconstructing hashing method for large-scale unsupervised cross-modal retrieval. First, we use CLIP to encode cross-modal features of the image and text modalities, and we learn a common hash-code representation space with modality-specific autoencoders. Second, we propose an efficient fusion approach to construct a semantically complementary affinity matrix that maximizes the potential semantic relevance between instances of different modalities. Furthermore, to retain the intrinsic semantic similarity of all similar pairs in the learned hash codes, we design a similarity-reconstruction objective based on semantic complementation, which learns high-quality hash-code representations. Extensive experiments on four multi-modal benchmark datasets (WIKI, MIRFLICKR, NUS-WIDE, and MS COCO) show that the proposed method achieves state-of-the-art image-text retrieval performance compared with several representative unsupervised cross-modal hashing methods.
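
To make the pipeline in the abstract concrete, the following is a minimal PyTorch sketch of the three stages it describes: frozen CLIP features feed modality-specific autoencoders, intra-modal cosine similarities are fused into a complementary affinity matrix, and a similarity-reconstruction loss aligns the hash codes with that matrix. Everything here is illustrative rather than the authors' exact formulation: the random tensors stand in for precomputed 512-d CLIP embeddings, and the fusion weight `alpha`, layer widths, code length, and equal loss weighting are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAutoencoder(nn.Module):
    """Modality-specific autoencoder: CLIP features -> relaxed hash codes -> reconstruction."""
    def __init__(self, dim=512, bits=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim, 1024), nn.ReLU(),
            nn.Linear(1024, bits), nn.Tanh(),   # tanh keeps relaxed codes in (-1, 1)
        )
        self.decoder = nn.Sequential(
            nn.Linear(bits, 1024), nn.ReLU(),
            nn.Linear(1024, dim),
        )

    def forward(self, x):
        h = self.encoder(x)
        return h, self.decoder(h)

def fused_affinity(img_feat, txt_feat, alpha=0.5):
    """Fuse intra-modal cosine-similarity matrices into one complementary affinity matrix."""
    s_img = F.normalize(img_feat, dim=1) @ F.normalize(img_feat, dim=1).t()
    s_txt = F.normalize(txt_feat, dim=1) @ F.normalize(txt_feat, dim=1).t()
    return alpha * s_img + (1.0 - alpha) * s_txt

def similarity_reconstruction_loss(h_img, h_txt, S, bits=64):
    """Push intra- and inter-modal hash-code similarities toward the fused affinity S."""
    loss = 0.0
    for ha, hb in [(h_img, h_img), (h_txt, h_txt), (h_img, h_txt)]:
        loss = loss + F.mse_loss(ha @ hb.t() / bits, S)
    return loss

# Toy batch: random tensors standing in for precomputed CLIP image/text embeddings.
img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)
ae_img, ae_txt = ModalityAutoencoder(), ModalityAutoencoder()

h_img, rec_img = ae_img(img_feat)
h_txt, rec_txt = ae_txt(txt_feat)
S = fused_affinity(img_feat, txt_feat)

loss = (similarity_reconstruction_loss(h_img, h_txt, S)
        + F.mse_loss(rec_img, img_feat) + F.mse_loss(rec_txt, txt_feat))
loss.backward()

binary_codes = torch.sign(h_img.detach())   # binarize only at retrieval time
```

Note that the relaxed codes are binarized with sign() only at retrieval time, so the similarity-reconstruction objective remains differentiable during training.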


Metadata
Title
CLIP-based fusion-modal reconstructing hashing for large-scale unsupervised cross-modal retrieval
Authors
Li Mingyong
Li Yewen
Ge Mingyuan
Ma Longfei
Publication date
01.06.2023
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 1/2023
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-023-00268-7
