
Survey of deep learning and architectures for visual captioning—transitioning between media and natural languages

Published in: Multimedia Tools and Applications

Abstract

Deep learning architectures have been among the most actively researched topics of this decade because of their ability to scale up and to solve problems that could not be solved before. Meanwhile, many natural language processing (NLP) applications have emerged, creating a need to understand how these concepts have gradually evolved since the perceptron was introduced in 1959. This document provides a detailed account of computational neuroscience, starting from the artificial neural network, and of how researchers examined the drawbacks of earlier architectures and thereby paved the way for modern deep learning. Modern deep learning has grown far beyond what was envisioned decades ago, extending to architectures of exceptional intelligence, scalability, and precision. This document provides an overview of this continuing line of work and deals specifically with applications across domains related to natural language processing and to visual and media content.
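As context for the survey's stated starting point, the following is a minimal, illustrative sketch (not code from the paper) of the classical perceptron learning rule; the train_perceptron helper, the learning-rate value, and the toy AND-gate data are all assumptions chosen for the example.

```python
# Minimal perceptron sketch (illustrative only; not from the survey).
# Trains a single linear threshold unit with the classical update rule:
#   w <- w + lr * (target - prediction) * x
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Fit weights and bias for binary targets y in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1.0 if xi @ w + b > 0 else 0.0  # step activation
            err = target - pred
            w += lr * err * xi                     # weight update
            b += lr * err                          # bias update
    return w, b

# Toy AND-gate data (an assumption made for this example).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
w, b = train_perceptron(X, y)
print([1 if x @ w + b > 0 else 0 for x in X])  # expected: [0, 0, 0, 1]
```

Because the AND function is linearly separable, this rule converges to a separating hyperplane; the survey's narrative begins from exactly this kind of single-unit model before moving to deeper architectures.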




Author information

Correspondence to Chiranjib Sur.


Cite this article

Sur, C. Survey of deep learning and architectures for visual captioning—transitioning between media and natural languages. Multimed Tools Appl 78, 32187–32237 (2019). https://doi.org/10.1007/s11042-019-08021-1
