Published in: International Journal of Multimedia Information Retrieval 2/2016

01.06.2016 | Regular Paper

On the use of commonsense ontology for multimedia event recounting

Abstract

Textually narrating the observed evidence relevant to why a video clip is retrieved for an event remains a highly challenging problem. This paper explores the use of a commonsense ontology, namely ConceptNet, to generate short descriptions that recount audio–visual evidence. The ontology is exploited as a knowledge engine that provides event-relevant common sense, expressed as concepts and their relationships, for semantic understanding, context-based concept screening, and sentence synthesis. A principled way of exploiting the ontology, from extracting an event-relevant semantic network to forming syntactic parse trees, is outlined and discussed. Experimental results on two benchmark datasets (TRECVID MED and MediaEval) show the effectiveness of our approach. The findings offer insights into the usability of common sense for multimedia search, including the feasibility of inferring relevant concepts for event detection and the quality of the generated sentences in meeting human expectations.
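The pipeline the abstract outlines — querying a commonsense graph for event-relevant concepts, screening detected concepts against it, and verbalizing the surviving relations — can be sketched as follows. This is a minimal illustration only: the triples below are toy data, not real ConceptNet output, and the relation-to-phrase mapping is a hypothetical stand-in for the paper's parse-tree-based sentence synthesis.

```python
# Toy commonsense triples in ConceptNet's (start, relation, end) shape.
# Illustrative data only; real ConceptNet edges carry weights and URIs.
EDGES = [
    ("dog", "CapableOf", "bark"),
    ("dog", "AtLocation", "park"),
    ("person", "CapableOf", "throw frisbee"),
    ("cake", "UsedFor", "birthday party"),
]

def related_concepts(seed, edges):
    """Concepts directly linked to the event seed in the commonsense graph."""
    out = set()
    for start, _, end in edges:
        if start == seed:
            out.add(end)
        elif end == seed:
            out.add(start)
    return out

def screen(detected, seed, edges):
    """Context-based screening: keep only detected concepts that the
    ontology links to the event seed (plus the seed itself)."""
    relevant = related_concepts(seed, edges)
    return [c for c in detected if c in relevant or c == seed]

# Hypothetical verbalization templates for a few ConceptNet relations.
VERBALIZE = {"CapableOf": "can", "AtLocation": "is in the", "UsedFor": "is used for a"}

def recount(seed, edges):
    """Turn triples that mention the seed as subject into short sentences."""
    return [f"The {s} {VERBALIZE[r]} {o}." for s, r, o in edges if s == seed]

print(screen(["dog", "car", "park"], "dog", EDGES))  # noisy "car" is dropped
print(recount("dog", EDGES))
```

Under this toy graph, screening the detections `["dog", "car", "park"]` against the seed `dog` discards `car`, and recounting yields "The dog can bark." and "The dog is in the park." — a crude analogue of the concept-screening and sentence-synthesis stages described above.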


Footnotes
2
Refer to http://vireo.cs.cityu.edu.hk/mer_demo/networks.html for the twenty event networks generated for TRECVID MED 2012.

3
The suffixes -ing and -s are omitted in ConceptNet.
 
Metadata
Title
On the use of commonsense ontology for multimedia event recounting
Publication date
01.06.2016
Published in
International Journal of Multimedia Information Retrieval / Issue 2/2016
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-015-0090-3
