
2020 | Original Paper | Book chapter

Cosine Meets Softmax: A Tough-to-beat Baseline for Visual Grounding

Authors: Nivedita Rufus, Unni Krishnan R Nair, K. Madhava Krishna, Vineet Gandhi

Published in: Computer Vision – ECCV 2020 Workshops

Publisher: Springer International Publishing


Abstract

In this paper, we present a simple baseline for visual grounding for autonomous driving which outperforms the state-of-the-art methods while retaining minimal design choices. Our framework minimizes the cross-entropy loss over the cosine distance between multiple image ROI features and a text embedding (representing the given sentence/phrase). We use pre-trained networks to obtain the initial embeddings and learn a transformation layer on top of the text embedding. We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%. By demonstrating the promise of a simpler alternative, our investigation suggests reconsidering approaches that employ sophisticated attention mechanisms, multi-stage reasoning, or complex metric-learning loss functions.
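The objective described above can be sketched in a few lines: score each image ROI by its cosine similarity to the sentence embedding, apply a softmax over ROIs, and take the cross-entropy against the ground-truth ROI. The following is a minimal NumPy sketch under our own assumptions — the paper's pre-trained visual/text encoders and learned transformation layer are omitted, and the function name and temperature parameter are illustrative, not taken from the paper:

```python
import numpy as np

def cosine_softmax_loss(roi_feats, text_emb, target_idx, temperature=1.0):
    """Cross-entropy over cosine similarities between ROI features and a
    sentence embedding. A hypothetical sketch of the described objective;
    `temperature` is an illustrative extra knob, not from the paper."""
    # L2-normalise both sides so dot products become cosine similarities
    roi = roi_feats / np.linalg.norm(roi_feats, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    logits = roi @ txt / temperature            # one cosine score per ROI
    logits -= logits.max()                      # stabilise the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_idx])           # CE w.r.t. the ground-truth ROI

# Toy example: 3 candidate ROIs with 4-dim features; ROI 1 matches the text.
rois = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.5, 0.5, 0.0, 0.0]])
text = np.array([0.0, 1.0, 0.0, 0.0])
loss = cosine_softmax_loss(rois, text, target_idx=1)
```

At inference time, the predicted region would simply be the ROI with the highest cosine score to the sentence embedding.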


Metadata
Title
Cosine Meets Softmax: A Tough-to-beat Baseline for Visual Grounding
Authors
Nivedita Rufus
Unni Krishnan R Nair
K. Madhava Krishna
Vineet Gandhi
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-030-66096-3_4
