
2019 | Original Paper | Book Chapter

Captioning the Images: A Deep Analysis

Authors: Chaitrali P. Chaudhari, Satish Devane

Published in: Computing, Communication and Signal Processing

Publisher: Springer Singapore


Abstract

Image captioning is one of the fundamental tasks in machine learning, since the ability to generate text captions for an image can have a great impact by assisting us in day-to-day life. However, it is not merely an object classification or recognition task: the model must understand the dependencies among the recognized objects and their attributes, and encode that knowledge correctly in a caption written in a natural language such as English. The internet is now overwhelmed with a huge amount of textual and visual data, consisting of billions of unstructured images and videos. Meaningful captions can serve as useful keys for retrieval, creative searching, and powerful browsing of these images. In this paper, we analyze and classify the recent state of the art in image captioning and discuss significant differences among the approaches. We provide a comparative review of existing models and techniques, with their advantages and disadvantages, and explore future directions in the field of automatic image caption generation.
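The retrieval use case described in the abstract, captions acting as keys for searching image collections, can be sketched with a toy inverted index over caption tokens. This is a minimal illustration only, not a method from the paper; all image ids, captions, and function names (`build_caption_index`, `search`) are hypothetical.

```python
# Toy sketch: index images by the tokens of their (generated) captions,
# then answer free-text queries by ranking images on token overlap.
from collections import defaultdict

def build_caption_index(captions):
    """Map each lowercase caption token to the set of image ids using it."""
    index = defaultdict(set)
    for image_id, caption in captions.items():
        for token in caption.lower().split():
            index[token].add(image_id)
    return index

def search(index, query):
    """Rank image ids by how many query tokens their captions share."""
    scores = defaultdict(int)
    for token in query.lower().split():
        for image_id in index.get(token, ()):
            scores[image_id] += 1
    # Highest overlap first; break ties by image id for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))

captions = {  # hypothetical generated captions
    "img1": "a dog catching a frisbee in the park",
    "img2": "two children playing soccer on a field",
    "img3": "a brown dog running on a grassy field",
}
index = build_caption_index(captions)
print(search(index, "dog on a field"))  # img3 shares the most tokens
```

Real systems would use learned embeddings rather than raw token overlap, but the sketch shows why caption quality directly bounds retrieval quality: an image is only findable through the words its caption contains.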


Metadata
Title
Captioning the Images: A Deep Analysis
Authors
Chaitrali P. Chaudhari
Satish Devane
Copyright year
2019
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-13-1513-8_100
