Published in: Multimedia Systems 5/2022

13-05-2022 | Regular Article

Double-scale similarity with rich features for cross-modal retrieval

Authors: Kaiqiang Zhao, Hufei Wang, Dexin Zhao

Abstract

This paper proposes Double-scale Similarity with Rich Features for Cross-modal Retrieval (DSRF), a method for the retrieval task between images and texts. The difficulty of cross-modal retrieval lies in establishing a good similarity metric and obtaining rich, accurate semantic features. Most existing approaches map data of different modalities into a common space using category labels and pair relations, which is insufficient to model the complex semantic relationships of multimodal data. We propose a new similarity measurement (double-scale similarity) in which the similarity of multimodal data depends not only on category labels but also on the objects involved: a retrieval result of the same category that does not contain the identical objects is penalized appropriately, while the distance between the correct result and the query is further reduced. Moreover, a semantic feature extraction framework is designed to provide rich semantic features for the similarity metric. Multiple attention maps are created to focus on local features from different perspectives and obtain numerous semantic features. Unlike other works that accumulate multiple semantic representations and average them, we use an LSTM with only a forget gate to eliminate the redundancy of repeated information. Specifically, a forgetting factor is generated for each semantic feature, and a larger forgetting coefficient removes the useless semantic information. We evaluate DSRF on two public benchmarks, on which it achieves competitive performance.
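The forget-gate-only aggregation described in the abstract can be sketched as follows. This is a minimal illustration under assumed shapes and names (`aggregate_semantics`, `W`, `b` are hypothetical), not the authors' implementation: each incoming semantic feature receives a per-dimension forgetting factor, and a larger factor discards more of that feature.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregate_semantics(feats, W, b):
    """Fuse K semantic feature vectors feats (shape (K, d)) into one vector,
    using only a forget gate: each incoming feature gets a per-dimension
    forgetting factor in (0, 1); larger factors discard more of that feature.
    W has shape (d, 2d), b has shape (d,)."""
    state = feats[0]
    for f in feats[1:]:
        # forgetting factor conditioned on the running state and the new feature
        forget = sigmoid(W @ np.concatenate([state, f]) + b)
        # keep only the non-redundant part of the new feature
        state = state + (1.0 - forget) * f
    return state
```

With forgetting factors near zero the features simply accumulate; as a factor approaches one, the corresponding dimensions of later features are suppressed, which matches the abstract's claim that redundant repeated information is eliminated rather than averaged in.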

Metadata
Title
Double-scale similarity with rich features for cross-modal retrieval
Authors
Kaiqiang Zhao
Hufei Wang
Dexin Zhao
Publication date
13-05-2022
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 5/2022
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-022-00933-7
