Published in: International Journal of Multimedia Information Retrieval 1/2015

01.03.2015 | Regular Paper

Aligning plot synopses to videos for story-based retrieval

Authors: Makarand Tapaswi, Martin Bäuml, Rainer Stiefelhagen


Abstract

We propose a method to facilitate search through the storyline of TV series episodes. To this end, we use human-written, crowdsourced descriptions of the story conveyed in the video, known as plot synopses. We obtain such synopses from websites such as Wikipedia and propose several methods to align each sentence of the plot synopsis to shots in the video. Thus, the semantic problem of story-based video retrieval is transformed into a much simpler text-based search: for a given query, we return the set of shots aligned to the matching synopsis sentences as the corresponding video snippet. The alignment is performed by first computing a similarity score between every shot and sentence using cues such as character identities and keyword matches between plot synopses and subtitles. We then formulate the alignment as an optimization problem and solve it efficiently using dynamic programming. We evaluate our methods on the fifth season of the TV series Buffy the Vampire Slayer and show encouraging results for both the alignment and the retrieval of story events.
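The pipeline in the abstract has three steps: score every shot/sentence pair, find the best order-preserving assignment of shots to sentences with dynamic programming, and answer a query by text search over the sentences. The Python sketch below illustrates that pipeline under simplifying assumptions; it is not the authors' method, and their DTW variants and cue weighting are not reproduced. The helper names `similarity_matrix`, `align`, `tokens` and `retrieve`, the `keyword_weight` parameter, the plain overlap-count scoring, and the constraint that every sentence receives at least one shot are all hypothetical choices made for illustration.

```python
import re
from typing import List, Sequence, Set

NEG_INF = float("-inf")


def similarity_matrix(
    sentence_chars: Sequence[Set[str]],
    sentence_words: Sequence[Set[str]],
    shot_chars: Sequence[Set[str]],
    shot_words: Sequence[Set[str]],
    keyword_weight: float = 0.5,
) -> List[List[float]]:
    """Hypothetical score: shared character names plus weighted shared
    keywords, using plain overlap counts (no tf-idf or learned weights)."""
    return [
        [len(sc & tc) + keyword_weight * len(sw & tw)
         for tc, tw in zip(shot_chars, shot_words)]
        for sc, sw in zip(sentence_chars, sentence_words)
    ]


def align(sim: List[List[float]]) -> List[int]:
    """Order-preserving alignment by dynamic programming.

    Returns, for each shot, the index of its sentence. Assumed constraints:
    every shot gets exactly one sentence, sentence indices never decrease
    over time, and every sentence gets at least one shot.
    """
    n_sent, n_shot = len(sim), len(sim[0])
    if n_shot < n_sent:
        raise ValueError("need at least as many shots as sentences")
    # D[i][j]: best total score of shots 0..j with shot j on sentence i.
    D = [[NEG_INF] * n_shot for _ in range(n_sent)]
    back = [[0] * n_shot for _ in range(n_sent)]  # sentence of shot j-1
    D[0][0] = sim[0][0]
    for j in range(1, n_shot):
        for i in range(n_sent):
            stay = D[i][j - 1]  # previous shot on the same sentence
            advance = D[i - 1][j - 1] if i > 0 else NEG_INF  # previous sentence
            if stay == NEG_INF and advance == NEG_INF:
                continue  # state unreachable
            if stay >= advance:
                D[i][j], back[i][j] = stay + sim[i][j], i
            else:
                D[i][j], back[i][j] = advance + sim[i][j], i - 1
    # Backtrack from the last shot, which must lie on the last sentence.
    path = [0] * n_shot
    i = n_sent - 1
    for j in range(n_shot - 1, -1, -1):
        path[j] = i
        i = back[i][j]
    return path


def tokens(text: str) -> Set[str]:
    """Lowercased alphabetic words of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, sentences: Sequence[str], path: List[int]) -> List[int]:
    """Stand-in text search: pick the sentence with the largest word overlap
    with the query and return the shots aligned to it."""
    q = tokens(query)
    best = max(range(len(sentences)), key=lambda k: len(q & tokens(sentences[k])))
    return [shot for shot, sent in enumerate(path) if sent == best]


# Toy example: 2 synopsis sentences, 4 shots (all data made up).
sentences = ["Buffy trains with Giles.", "Dawn argues with Joyce."]
sim = similarity_matrix(
    sentence_chars=[{"Buffy", "Giles"}, {"Dawn", "Joyce"}],
    sentence_words=[{"trains"}, {"argues"}],
    shot_chars=[{"Buffy"}, {"Buffy", "Giles"}, {"Dawn"}, {"Dawn", "Joyce"}],
    shot_words=[{"training"}, {"trains"}, {"argues"}, set()],
)
path = align(sim)
shots = retrieve("Dawn argues", sentences, path)
print(path)   # [0, 0, 1, 1]
print(shots)  # [2, 3]
```

Because each shot can only stay on the current sentence or move to the next one, this recurrence fills an N_S × N_T table in O(N_S · N_T) time for N_S sentences and N_T shots. The footnote's timings refer to the paper's DTW3 variant, which also depends on a parameter z and is not reproduced in this sketch.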


Footnotes
3
For \(z \sim 100\), \(N_S \sim 40\) and \(N_T \sim 700\) DTW3 takes a couple of minutes to solve with our unoptimized Matlab implementation.
 
Metadata
Title
Aligning plot synopses to videos for story-based retrieval
Authors
Makarand Tapaswi
Martin Bäuml
Rainer Stiefelhagen
Publication date
01.03.2015
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 1/2015
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-014-0065-9
