
2016 | Original Paper | Book Chapter

Learning Visual Storylines with Skipping Recurrent Neural Networks

Authors: Gunnar A. Sigurdsson, Xinlei Chen, Abhinav Gupta

Published in: Computer Vision – ECCV 2016

Publisher: Springer International Publishing


Abstract

What does a typical visit to Paris look like? Do people first take photos of the Louvre and then the Eiffel Tower? Can we visually model a temporal event like “Paris Vacation” using current frameworks? In this paper, we explore how we can automatically learn the temporal aspects, or storylines of visual concepts from web data. Previous attempts focus on consecutive image-to-image transitions and are unsuccessful at recovering the long-term underlying story. Our novel Skipping Recurrent Neural Network (S-RNN) model does not attempt to predict each and every data point in the sequence, like classic RNNs. Rather, S-RNN uses a framework that skips through the images in the photo stream to explore the space of all ordered subsets of the albums via an efficient sampling procedure. This approach reduces the negative impact of strong short-term correlations, and recovers the latent story more accurately. We show how our learned storylines can be used to analyze, predict, and summarize photo albums from Flickr. Our experimental results provide strong qualitative and quantitative evidence that S-RNN is significantly better than other candidate methods such as LSTMs on learning long-term correlations and recovering latent storylines. Moreover, we show how storylines can help machines better understand and summarize photo streams by inferring a brief personalized story of each individual album.
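
The skipping mechanism can be illustrated with a short sketch. The code below is a minimal, hypothetical rendering of the idea described in the abstract, not the authors' implementation: an Elman-style RNN predicts a feature vector for the next selected photo, and at each step the model greedily keeps the upcoming candidate (within a fixed window) whose embedding best matches that prediction, skipping the rest. The cell, the window size, the greedy cosine-matching rule, and all names are illustrative assumptions.

```python
# Minimal sketch of the "skipping" idea behind S-RNN (not the authors' code).
# Instead of consuming every photo, each step picks, from a window of upcoming
# candidates, the image the model scores highest, and skips the others.
import numpy as np

rng = np.random.default_rng(0)
D, H = 64, 32                               # feature dim, hidden dim (assumed)
Wxh = rng.normal(scale=0.1, size=(H, D))
Whh = rng.normal(scale=0.1, size=(H, H))
Who = rng.normal(scale=0.1, size=(D, H))

def step(h, x):
    """One Elman RNN step: update hidden state, predict the next feature."""
    h = np.tanh(Wxh @ x + Whh @ h)
    return h, Who @ h

def skip_decode(features, window=5, length=8):
    """Greedily select an ordered subset (a 'storyline') from a photo stream.

    features: (N, D) array of per-image embeddings, in temporal order.
    At each step the model may skip up to `window` images, keeping the
    candidate whose embedding best matches the prediction (cosine score).
    """
    h = np.zeros(H)
    idx = 0                                  # start from the first photo
    chosen = [idx]
    h, pred = step(h, features[idx])
    while len(chosen) < length and idx + 1 < len(features):
        cands = list(range(idx + 1, min(idx + 1 + window, len(features))))
        scores = [pred @ features[j] /
                  (np.linalg.norm(pred) * np.linalg.norm(features[j]) + 1e-8)
                  for j in cands]
        idx = cands[int(np.argmax(scores))]  # keep the best, skip the rest
        chosen.append(idx)
        h, pred = step(h, features[idx])
    return chosen

album = rng.normal(size=(40, D))   # stand-in for CNN features of one album
print(skip_decode(album))          # e.g. indices of an 8-photo storyline
```

Because each step ranges over a window of candidates rather than the single next image, the selected subset can jump past bursts of near-duplicate photos, which is the abstract's stated remedy for strong short-term correlations.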

Footnotes
1
In our Flickr dataset, 71.1% of consecutive image pairs have above-average (cosine) similarity (see the sketch after these footnotes).
 
2
For simplicity in notation, we assume a single training sequence, but in our experiments we use multiple albums for one concept to discover common latent storylines.
 
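As a concrete reading of footnote 1, the following sketch estimates the same kind of statistic: the fraction of consecutive photo pairs whose cosine similarity exceeds the album's mean pairwise similarity. The random stand-in features and all variable names are assumptions; only the paper's actual Flickr features would reproduce the 71.1% figure.

```python
# Hedged sketch of the footnote's statistic: share of consecutive photo pairs
# whose cosine similarity exceeds the mean pairwise similarity of the album.
import numpy as np

feats = np.random.default_rng(1).normal(size=(50, 64))    # stand-in embeddings
unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
sims = unit @ unit.T                                      # all pairwise cosines
avg = sims[np.triu_indices_from(sims, k=1)].mean()        # mean over pairs
consec = np.einsum('ij,ij->i', unit[:-1], unit[1:])       # consecutive cosines
print((consec > avg).mean())  # the paper reports 71.1% on their Flickr data
```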
DOI
https://doi.org/10.1007/978-3-319-46454-1_5