Published in: International Journal of Computer Vision | Issue 8-9/2020

30.07.2020

Multi-task Compositional Network for Visual Relationship Detection

Authors: Yibing Zhan, Jun Yu, Ting Yu, Dacheng Tao

Abstract

Previous methods treat visual relationship detection as a combination of object detection and predicate detection. However, a natural image may contain hundreds of objects and thousands of object pairs, so relying on object detection and predicate detection alone is insufficient for effective visual relationship detection: the few significant relationships are easily overwhelmed by the far more numerous less-significant ones. In this paper, we propose a novel subtask for visual relationship detection, significance detection, as a complement to object detection and predicate detection. Significance detection refers to the task of identifying object pairs with significant relationships. We further propose a novel multi-task compositional network (MCN) that simultaneously performs object detection, predicate detection, and significance detection. MCN consists of three modules: an object detector, a relationship generator, and a relationship predictor. The object detector detects objects, the relationship generator provides useful candidate relationships, and the relationship predictor produces significance scores and predicts predicates. Furthermore, MCN adopts a multimodal feature fusion strategy based on visual, spatial, and label features and a novel correlated loss function to deeply combine object detection, predicate detection, and significance detection. MCN is validated on two datasets: the Visual Relationship Detection dataset and the Visual Genome dataset. Experimental comparisons with state-of-the-art methods verify the competitiveness of MCN and the usefulness of significance detection for visual relationship detection.
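To make the three-subtask decomposition concrete, the sketch below illustrates how a relationship-predictor stage of the kind described in the abstract could look. This is a minimal, hypothetical PyTorch illustration, not the authors' implementation (see the source-code link in the footnotes): the fusion-by-concatenation, layer sizes, predicate count, and the final significance-weighted ranking step are all illustrative assumptions.

```python
# Hypothetical sketch of a relationship-predictor stage: multimodal fusion of
# visual, spatial, and label features, followed by a significance head and a
# predicate head. Not the authors' implementation (https://github.com/Atmegal/MCN).
import torch
import torch.nn as nn


class RelationshipPredictor(nn.Module):
    """Jointly scores significance and predicts predicates for candidate object pairs."""

    def __init__(self, visual_dim=512, spatial_dim=64, label_dim=300,
                 hidden_dim=256, num_predicates=70):
        super().__init__()
        # Multimodal fusion: plain concatenation + MLP stands in here for the
        # paper's fusion strategy over visual, spatial, and label features.
        self.fuse = nn.Sequential(
            nn.Linear(visual_dim + spatial_dim + label_dim, hidden_dim),
            nn.ReLU(),
        )
        self.significance_head = nn.Linear(hidden_dim, 1)            # significance detection
        self.predicate_head = nn.Linear(hidden_dim, num_predicates)  # predicate detection

    def forward(self, visual_feat, spatial_feat, label_feat):
        # Each input: (num_pairs, dim) features for candidate object pairs
        # proposed by the relationship generator.
        h = self.fuse(torch.cat([visual_feat, spatial_feat, label_feat], dim=-1))
        significance = torch.sigmoid(self.significance_head(h)).squeeze(-1)
        predicate_logits = self.predicate_head(h)
        return significance, predicate_logits


if __name__ == "__main__":
    predictor = RelationshipPredictor()
    n_pairs = 8
    sig, logits = predictor(torch.randn(n_pairs, 512),
                            torch.randn(n_pairs, 64),
                            torch.randn(n_pairs, 300))
    # A plausible ranking step: weight predicate scores by significance so that
    # less-significant pairs do not overwhelm the final relationship ranking.
    scores = sig.unsqueeze(-1) * torch.softmax(logits, dim=-1)
    print(scores.shape)  # torch.Size([8, 70])
```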


Footnotes
1
The source code is provided at https://github.com/Atmegal/MCN.
 
Metadata
Title
Multi-task Compositional Network for Visual Relationship Detection
Authors
Yibing Zhan
Jun Yu
Ting Yu
Dacheng Tao
Publication date
30.07.2020
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 8-9/2020
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-020-01353-8
