Published in: Multimedia Systems 5/2023

27-03-2022 | Special Issue Paper

Hardest and semi-hard negative pairs mining for text-based person search with visual–textual attention

Authors: Jing Ge, Qianxiang Wang, Guangyu Gao



Abstract

Searching for persons in large-scale image databases with a natural-language query is a practical and important application in video surveillance. Intuitively, the core issue in person search is visual–textual association, which remains extremely challenging due to the contradiction between the high abstraction of textual descriptions and the intuitive expression of visual images. In this paper, aiming for more consistent visual–textual features and better inter-class discrimination, we propose a text-based person search approach with visual–textual attention built on hardest and semi-hard negative pair mining. First, for the visual and textual attentions, we design a Smoothed Global Maximum Pooling (SGMP) to extract more concentrated visual features, as well as a memory attention based on the LSTM cell unit for stricter correspondence matching. Second, since only positive pairs are labeled, we mine more valuable negative pairs by defining cross-modality hardest and semi-hard negative pairs. We then train the whole network by combining a triplet loss on the single modality with the hardest negative pairs, and a cross-entropy loss across modalities with both the hardest and semi-hard negative pairs. Finally, to evaluate the effectiveness and feasibility of the proposed approach, we conduct extensive experiments on the typical person search dataset, CUHK-PEDES, on which our approach achieves satisfactory performance, e.g., a top-1 accuracy of \(55.32\%\). Besides, we also evaluate the semi-hard pair mining method on the COCO caption dataset and validate its effectiveness and complementarity.
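The two building blocks named in the abstract can be sketched in a few lines. This is a minimal illustration under common assumptions, not the paper's implementation: the abstract does not give the SGMP formula, so we use LogSumExp as one plausible smoothing of global max pooling, and we adopt the standard definition of semi-hard negatives (farther from the anchor than the positive, but still inside the margin); the function names and the temperature parameter `tau` are our own.

```python
import numpy as np

def smoothed_global_max_pool(feat, tau=10.0):
    """LogSumExp pooling over spatial positions of a (C, H, W) feature map.

    As tau grows the result approaches global max pooling; as tau -> 0
    it approaches global average pooling, giving a smooth interpolation.
    """
    c = feat.shape[0]
    flat = feat.reshape(c, -1)  # (C, H*W)
    # log-mean-exp per channel, rescaled by the temperature
    return np.log(np.mean(np.exp(tau * flat), axis=1)) / tau

def semi_hard_negatives(d_ap, d_an, margin=0.2):
    """Indices of semi-hard negatives for one anchor.

    d_ap: scalar anchor-positive distance.
    d_an: array of anchor-negative distances.
    A negative is semi-hard when it is farther than the positive but
    still violates the margin: d_ap < d_an < d_ap + margin.
    """
    d_an = np.asarray(d_an)
    return np.where((d_an > d_ap) & (d_an < d_ap + margin))[0]

# One channel with a single strong activation: the smoothed pool lies
# between the average (0.25) and the max (1.0), approaching the max
# as tau increases.
feat = np.zeros((1, 2, 2))
feat[0, 0, 0] = 1.0
pooled = smoothed_global_max_pool(feat, tau=100.0)[0]

# For d_ap = 0.5 and margin 0.2, only the negative at distance 0.55
# is semi-hard (0.3 is harder than the positive, 0.9 is beyond margin).
idx = semi_hard_negatives(0.5, [0.3, 0.55, 0.9], margin=0.2)
```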


Metadata
Title
Hardest and semi-hard negative pairs mining for text-based person search with visual–textual attention
Authors
Jing Ge
Qianxiang Wang
Guangyu Gao
Publication date
27-03-2022
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 5/2023
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-022-00914-w
