
2018 | OriginalPaper | Chapter

NNEval: Neural Network Based Evaluation Metric for Image Captioning

Authors: Naeha Sharif, Lyndon White, Mohammed Bennamoun, Syed Afaq Ali Shah

Published in: Computer Vision – ECCV 2018

Publisher: Springer International Publishing


Abstract

The automatic evaluation of image descriptions is an intricate task, and it is highly important for the development and fine-grained analysis of captioning systems. Existing metrics for automatically evaluating image captioning systems fail to achieve a satisfactory level of correlation with human judgements at the sentence level. Moreover, unlike humans, these metrics tend to focus on specific aspects of quality, such as n-gram overlap or semantic meaning. In this paper, we present the first learning-based metric for evaluating image captions. Our proposed framework incorporates both lexical and semantic information into a single learned metric, resulting in an evaluator that takes various linguistic features into account when assessing caption quality. The experiments we performed to assess the proposed metric show improvements over the state of the art in terms of correlation with human judgements, and demonstrate its superior robustness to distractions.
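
To make the core idea concrete, the sketch below shows one way a learned caption-evaluation metric can be built: per-caption features (in the paper, scores from existing lexical and semantic metrics) are fed into a small feedforward network trained to separate human-written from machine-generated captions, and the network's output probability serves as the quality score. This is a minimal illustration only, not the authors' implementation: the toy features, the tiny training set, and the scikit-learn classifier settings are all assumptions made for brevity.

```python
# Minimal sketch (not the authors' code) of a learned caption-evaluation metric:
# combine per-caption features with a small feedforward network whose output
# probability acts as the quality score. Feature choices and settings are illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier

def unigram_precision(candidate, references):
    """Toy lexical feature: fraction of candidate tokens found in any reference."""
    cand = candidate.lower().split()
    ref_tokens = set(tok for ref in references for tok in ref.lower().split())
    return sum(tok in ref_tokens for tok in cand) / max(len(cand), 1)

def length_ratio(candidate, references):
    """Toy feature: candidate length relative to the mean reference length."""
    mean_ref_len = np.mean([len(r.split()) for r in references])
    return len(candidate.split()) / max(mean_ref_len, 1.0)

def features(candidate, references):
    # In the actual framework these would be scores from metrics such as
    # BLEU, METEOR, ROUGE, CIDEr, and a semantic similarity measure.
    return [unigram_precision(candidate, references), length_ratio(candidate, references)]

# Tiny illustrative training set: label 1 = human-written caption, 0 = machine-generated.
refs = [["a man rides a horse on the beach", "a person riding a horse near the ocean"]] * 4
candidates = [
    "a man riding a horse on the beach",    # human-like
    "a person rides a horse by the ocean",  # human-like
    "a horse a horse beach man",            # degenerate machine output
    "a man standing in a kitchen",          # unrelated machine output
]
labels = [1, 1, 0, 0]

X = np.array([features(c, r) for c, r in zip(candidates, refs)])
y = np.array(labels)

# Small feedforward network; the probability of the "human" class is the learned score.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X, y)

test = "a man is riding a horse along the beach"
score = net.predict_proba([features(test, refs[0])])[0, 1]
print(f"learned metric score: {score:.3f}")
```

Framing evaluation as a classification problem in this way is what lets a single learned metric absorb heterogeneous quality signals, rather than committing to one hand-designed formula.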


Footnotes
3
We thank the authors of these captioning approaches for making their codes publicly available.
 
Metadata
Title
NNEval: Neural Network Based Evaluation Metric for Image Captioning
Authors
Naeha Sharif
Lyndon White
Mohammed Bennamoun
Syed Afaq Ali Shah
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-01237-3_3
