Published in: Multimedia Systems 5/2023

27-03-2022 | Special Issue Paper

Hardest and semi-hard negative pairs mining for text-based person search with visual–textual attention

Authors: Jing Ge, Qianxiang Wang, Guangyu Gao



Abstract

Searching for persons in large-scale image databases with a natural-language query is a practical and important application in video surveillance. Intuitively, the core issue in person search is visual–textual association, which remains extremely challenging due to the contradiction between the high abstraction of textual descriptions and the intuitive expression of visual images. In this paper, aiming for more consistent visual–textual features and better inter-class discrimination, we propose a text-based person search approach with visual–textual attention built on hardest and semi-hard negative pair mining. First, for the visual and textual attentions, we design a Smoothed Global Maximum Pooling (SGMP) to extract more concentrated visual features, as well as a memory attention based on the LSTM cell unit for stricter correspondence matching. Second, since only positive pairs are labeled, we mine more valuable negative pairs by defining cross-modality hardest and semi-hard negative pairs. We then train the whole network by combining a triplet loss on the single modality with the hardest negative pairs, and a cross-entropy loss across modalities with both the hardest and semi-hard negative pairs. Finally, to evaluate the effectiveness and feasibility of the proposed approach, we conduct extensive experiments on the typical person search dataset, CUHK-PEDES, on which our approach achieves satisfactory performance, e.g., a top-1 accuracy of \(55.32\%\). Besides, we also evaluate the semi-hard pair mining method on the COCO caption dataset and validate its effectiveness and complementarity.
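The two building blocks named in the abstract can be sketched in a few lines. This is a minimal illustration under common assumptions, not the paper's implementation: the abstract does not give the SGMP formula, so we use LogSumExp as one plausible smoothing of global max pooling, and we adopt the standard definition of semi-hard negatives (farther from the anchor than the positive, but still inside the margin); the function names and the temperature parameter `tau` are our own.

```python
import numpy as np

def smoothed_global_max_pool(feat, tau=10.0):
    """LogSumExp pooling over spatial positions of a (C, H, W) feature map.

    As tau grows the result approaches global max pooling; as tau -> 0
    it approaches global average pooling, giving a smooth interpolation.
    """
    c = feat.shape[0]
    flat = feat.reshape(c, -1)  # (C, H*W)
    # log-mean-exp per channel, rescaled by the temperature
    return np.log(np.mean(np.exp(tau * flat), axis=1)) / tau

def semi_hard_negatives(d_ap, d_an, margin=0.2):
    """Indices of semi-hard negatives for one anchor.

    d_ap: scalar anchor-positive distance.
    d_an: array of anchor-negative distances.
    A negative is semi-hard when it is farther than the positive but
    still violates the margin: d_ap < d_an < d_ap + margin.
    """
    d_an = np.asarray(d_an)
    return np.where((d_an > d_ap) & (d_an < d_ap + margin))[0]

# One channel with a single strong activation: the smoothed pool lies
# between the average (0.25) and the max (1.0), approaching the max
# as tau increases.
feat = np.zeros((1, 2, 2))
feat[0, 0, 0] = 1.0
pooled = smoothed_global_max_pool(feat, tau=100.0)[0]

# For d_ap = 0.5 and margin 0.2, only the negative at distance 0.55
# is semi-hard (0.3 is harder than the positive, 0.9 is beyond margin).
idx = semi_hard_negatives(0.5, [0.3, 0.55, 0.9], margin=0.2)
```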


Metadata
Title
Hardest and semi-hard negative pairs mining for text-based person search with visual–textual attention
Authors
Jing Ge
Qianxiang Wang
Guangyu Gao
Publication date
27-03-2022
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 5/2023
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-022-00914-w
