
19-04-2021

Polysemy Deciphering Network for Robust Human–Object Interaction Detection

Authors: Xubin Zhong, Changxing Ding, Xian Qu, Dacheng Tao

Published in: International Journal of Computer Vision | Issue 6/2021


Abstract

Human–Object Interaction (HOI) detection is important to human-centric scene understanding tasks. Existing works tend to assume that the same verb has similar visual characteristics in different HOI categories, an approach that ignores the diverse semantic meanings of the verb. To address this issue, in this paper, we propose a novel Polysemy Deciphering Network (PD-Net) that decodes the visual polysemy of verbs for HOI detection in three distinct ways. First, we refine features for HOI detection to be polysemy-aware through the use of two novel modules: namely, Language Prior-guided Channel Attention (LPCA) and Language Prior-based Feature Augmentation (LPFA). LPCA highlights important elements in human and object appearance features for each HOI category to be identified; moreover, LPFA augments human pose and spatial features for HOI detection using language priors, enabling the verb classifiers to receive language hints that reduce intra-class variation for the same verb. Second, we introduce a novel Polysemy-Aware Modal Fusion module, which guides PD-Net to make decisions based on feature types deemed more important according to the language priors. Third, we propose to relieve the verb polysemy problem through sharing verb classifiers for semantically similar HOI categories. Furthermore, to expedite research on the verb polysemy problem, we build a new benchmark dataset named HOI-VerbPolysemy (HOI-VP), which includes common verbs (predicates) that have diverse semantic meanings in the real world. Finally, through deciphering the visual polysemy of verbs, our approach is demonstrated to outperform state-of-the-art methods by significant margins on the HICO-DET, V-COCO, and HOI-VP databases. Code and data in this paper are available at https://github.com/MuchHair/PD-Net.
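To make the role of the language priors more concrete, below is a minimal PyTorch-style sketch of a channel-attention layer in the spirit of LPCA: a word-embedding-based prior for the candidate HOI category is projected to per-channel gates that re-weight a pooled appearance feature. The layer sizes, the squeeze-and-excitation-style gating, and the prior dimensionality are illustrative assumptions and are not taken from the authors' released implementation.

    import torch
    import torch.nn as nn

    class LanguagePriorChannelAttention(nn.Module):
        """Illustrative sketch of language prior-guided channel attention (LPCA).

        The language prior (e.g., concatenated verb/object word vectors) is
        projected to one sigmoid gate per channel of the pooled human or object
        appearance feature. All dimensions are assumptions for illustration.
        """

        def __init__(self, feat_dim=2048, prior_dim=600, hidden_dim=512):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Linear(prior_dim, hidden_dim),
                nn.ReLU(inplace=True),
                nn.Linear(hidden_dim, feat_dim),
                nn.Sigmoid(),
            )

        def forward(self, appearance_feat, language_prior):
            # appearance_feat: (B, feat_dim) pooled human or object feature
            # language_prior:  (B, prior_dim) word-embedding prior for the HOI category
            channel_weights = self.gate(language_prior)   # (B, feat_dim) gates in [0, 1]
            return appearance_feat * channel_weights      # polysemy-aware feature

    # Example usage with random tensors standing in for real features.
    lpca = LanguagePriorChannelAttention()
    feat = torch.randn(4, 2048)     # e.g., ResNet-pooled appearance features
    prior = torch.randn(4, 600)     # e.g., verb + object word vectors
    print(lpca(feat, prior).shape)  # torch.Size([4, 2048])

An analogous gating applied to whole feature streams (appearance, pose, spatial) rather than to individual channels would correspond to the Polysemy-Aware Modal Fusion idea described in the abstract; this sketch covers only the channel-level case.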


Metadata
Title
Polysemy Deciphering Network for Robust Human–Object Interaction Detection
Authors
Xubin Zhong
Changxing Ding
Xian Qu
Dacheng Tao
Publication date
19-04-2021
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 6/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-021-01458-8
