Published in: Cluster Computing 3/2019

29-03-2018

Fast image captioning using LSTM

Authors: Meng Han, Wenyu Chen, Alemu Dagmawi Moges


Abstract

Computer vision and natural language processing are long-standing challenges in artificial intelligence. In this paper, we explore a generative automatic image annotation model that draws on recent advances in both fields. Our approach uses a deep convolutional neural network to detect image regions, whose features are then fed to a recurrent neural network trained to maximize the likelihood of the target sentence describing the given image. In our experiments, we found that accuracy and training behavior improved when the image representation from the detection model was coupled with the input word embedding at each step. We also found that most of the information from the last layer of the detection model vanishes when it is fed only once, as the thought vector for our LSTM decoder. This is mainly because the last fully connected layer of the YOLO model encodes the class probabilities and bounding boxes of the detected objects, which is not a rich enough representation. We trained our model on the COCO benchmark for 60 h, using 64,000 training and 12,800 validation examples, achieving 23% accuracy. We also observed a significant drop in training speed when we increased the number of hidden units in the LSTM layer from 1470 to 4096.
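The two conditioning schemes compared in the abstract can be sketched in miniature. The following is an illustrative toy, not the authors' implementation: a minimal pure-Python LSTM cell run under scheme A, where the image feature vector only initializes the hidden state (the "thought vector" setup in which image information fades), and scheme B, where the image features are concatenated with the word embedding at every decoding step. All dimensions, weights, and the image-to-state projection are arbitrary stand-ins.

```python
import math
import random

random.seed(0)

def lstm_cell(x, h, c, W):
    """One LSTM step. x: input vector, h/c: state vectors, W: gate weight matrices."""
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    def gate(name, act):
        # Each gate is a linear map over the concatenated [input, hidden] vector.
        z = [sum(W[name][i][j] * v for j, v in enumerate(x + h)) for i in range(len(h))]
        return [act(v) for v in z]
    i, f, o = gate("i", sig), gate("f", sig), gate("o", sig)
    g = gate("g", math.tanh)
    c_new = [f[k] * c[k] + i[k] * g[k] for k in range(len(c))]
    h_new = [o[k] * math.tanh(c_new[k]) for k in range(len(c))]
    return h_new, c_new

def make_weights(in_dim, hid):
    # Random small weights for the four LSTM gates.
    return {n: [[random.uniform(-0.1, 0.1) for _ in range(in_dim + hid)]
                for _ in range(hid)] for n in ("i", "f", "o", "g")}

HID, EMB, IMG = 4, 3, 5
img_feat = [random.random() for _ in range(IMG)]                   # stand-in for detector features
words = [[random.random() for _ in range(EMB)] for _ in range(6)]  # stand-in word embeddings

# Scheme A ("thought vector"): the image only initializes the state;
# subsequent steps see word embeddings alone, so the image signal can fade.
W_a = make_weights(EMB, HID)
h, c = [sum(img_feat) / IMG] * HID, [0.0] * HID  # toy projection of image into state
for w in words:
    h, c = lstm_cell(w, h, c, W_a)

# Scheme B (coupled input): image features are concatenated with the word
# embedding at every step, keeping the image signal present throughout decoding.
W_b = make_weights(EMB + IMG, HID)
h2, c2 = [0.0] * HID, [0.0] * HID
for w in words:
    h2, c2 = lstm_cell(w + img_feat, h2, c2, W_b)

print(len(h), len(h2))  # both decoders produce hidden states of size HID
```

In a real captioning model the hidden state would feed a softmax over the vocabulary at each step; this sketch only contrasts where the image information enters the recurrence.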


Metadata
Title
Fast image captioning using LSTM
Authors
Meng Han
Wenyu Chen
Alemu Dagmawi Moges
Publication date
29-03-2018
Publisher
Springer US
Published in
Cluster Computing / Issue Special Issue 3/2019
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-018-1885-9
