Published in: Pattern Recognition and Image Analysis 4/2020

01.10.2020 | MATHEMATICAL THEORY OF IMAGES AND SIGNALS REPRESENTING, PROCESSING, ANALYSIS, RECOGNITION, AND UNDERSTANDING

Image Captioning using Reinforcement Learning with BLUDEr Optimization

Authors: P. R. Devi, V. Thrivikraman, D. Kashyap, S. S. Shylaja


Abstract

Image captioning is a growing research area that has attracted considerable attention. It is a challenging task owing to the complexity of natural language generation and the difficulty of extracting features from a diverse collection of images. Many models have been proposed to tackle the problem, notably encoder-decoder (sequential CNN-RNN) systems, which have achieved strong results. More recently, reinforcement learning has emerged as a new approach and has surpassed many state-of-the-art paradigms. We propose a new reward, the BLUDEr metric, a linear combination of the non-differentiable metrics BLEU and CIDEr, and directly optimize it for our model on natural language generation tasks. In our experiments, we use the Flickr30k and Flickr8k datasets, two of the benchmark datasets for image captioning systems, and achieve state-of-the-art results on both when compared with other models.
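The BLUDEr reward described in the abstract is a linear combination of BLEU and CIDEr scores. The sketch below illustrates the combination only: it uses deliberately simplified stand-ins for the two metrics (unigram-precision BLEU with a brevity penalty, and a plain cosine similarity of count vectors in place of CIDEr's TF-IDF-weighted n-gram similarity), and the weights `alpha` and `beta` are illustrative assumptions, not values from the paper.

```python
from collections import Counter
import math

def bleu1(candidate: str, reference: str) -> float:
    # Simplified BLEU: unigram precision scaled by a brevity penalty.
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

def cider_like(candidate: str, reference: str) -> float:
    # Toy stand-in for CIDEr: cosine similarity of unigram count vectors
    # (real CIDEr uses TF-IDF-weighted n-gram statistics over a corpus).
    c, r = Counter(candidate.split()), Counter(reference.split())
    dot = sum(c[w] * r[w] for w in set(c) | set(r))
    norm = math.sqrt(sum(v * v for v in c.values())) * \
           math.sqrt(sum(v * v for v in r.values()))
    return dot / norm if norm else 0.0

def bluder_reward(candidate: str, reference: str,
                  alpha: float = 0.5, beta: float = 0.5) -> float:
    # BLUDEr-style reward: a weighted sum of the two metric scores.
    return alpha * bleu1(candidate, reference) + beta * cider_like(candidate, reference)
```

In a self-critical training setup such as that of Rennie et al., a scalar reward of this form would be computed for both a sampled caption and a greedily decoded baseline caption, and their difference used to scale the policy-gradient update, which is how a non-differentiable metric can be optimized directly.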


Metadata
Title
Image Captioning using Reinforcement Learning with BLUDEr Optimization
Authors
P. R. Devi
V. Thrivikraman
D. Kashyap
S. S. Shylaja
Publication date
01.10.2020
Publisher
Pleiades Publishing
Published in
Pattern Recognition and Image Analysis / Issue 4/2020
Print ISSN: 1054-6618
Electronic ISSN: 1555-6212
DOI
https://doi.org/10.1134/S1054661820040094
