Published in: International Journal of Computer Vision 1/2017

30 May 2016

Spatially Coherent Interpretations of Videos Using Pattern Theory

Authors: Fillipe D. M. de Souza, Sudeep Sarkar, Anuj Srivastava, Jingyong Su

Abstract

Activity interpretation in videos results not only in recognition or labeling of dominant activities, but also in semantic descriptions of scenes. Towards this broader goal, we present a combinatorial approach that assumes the availability of algorithms for detecting and labeling objects and basic actions in videos, albeit with some errors. Given these uncertain labels and detected objects, we link them into interpretable structures using domain knowledge, under the framework of Grenander’s general pattern theory. Here a semantic description is built using basic units, termed generators, that represent either objects or actions. These generators have multiple out-bonds, each associated with different types of domain semantics, spatial constraints, and image evidence. The generators combine, according to a set of pre-defined combination rules that capture domain semantics, to form larger configurations that represent video interpretations. This framework derives its representational power from flexibility in the size and structure of configurations. We impose a probability distribution on the configuration space, with inferences generated using a Markov chain Monte Carlo-based simulated annealing process. The primary advantage of the approach is that it handles known challenges (appearance variability, errors in object labels, object clutter, simultaneous events, etc.) without the need for exponentially large labeled training data. Experimental results demonstrate its ability to provide interpretations under clutter and the simultaneity of events. They show: (1) a performance increase of more than 30% over other state-of-the-art approaches using more than 5000 video units from the Breakfast Actions dataset, and (2) an overall recall and precision improvement of more than 50% and 100%, respectively, on the YouCook dataset.
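
To make the inference idea in the abstract concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the site names, candidate labels, confidence scores, compatibility values, energy form, and cooling schedule are all invented for illustration. It only shows the general pattern of scoring a configuration of labeled generators by image evidence plus bond compatibility, and then searching the configuration space with Metropolis-style simulated annealing.

```python
import math
import random

# Hypothetical detector output: candidate labels (generators) for each
# detected site, with confidence scores. All numbers are invented.
CANDIDATES = {
    "action": {"pour": 0.4, "stir": 0.6},
    "object_a": {"milk": 0.6, "spoon": 0.4},
    "object_b": {"bowl": 0.7, "pan": 0.3},
}

# Bonds: pairs of sites that are linked in a configuration (assumed fixed here).
BONDS = [("action", "object_a"), ("action", "object_b")]

# Hypothetical domain knowledge: how well two bonded labels fit together.
COMPATIBILITY = {
    ("pour", "milk"): 1.5,
    ("pour", "bowl"): 1.0,
    ("stir", "spoon"): 1.0,
    ("stir", "bowl"): 0.5,
    ("stir", "pan"): 0.5,
}


def energy(config):
    """Lower energy means a more plausible interpretation."""
    # Evidence term: negative log of the detector confidence at each site.
    evidence = -sum(math.log(CANDIDATES[s][lab]) for s, lab in config.items())
    # Semantic term: reward compatible labels on each bond (0 if unlisted).
    semantics = -sum(
        COMPATIBILITY.get((config[a], config[b]), 0.0) for a, b in BONDS
    )
    return evidence + semantics


def anneal(steps=2000, t_start=2.0, cooling=0.995, seed=0):
    """Simulated annealing over label assignments; returns the best one seen."""
    rng = random.Random(seed)
    # Start from the top-scoring label at every site (ignores context).
    current = {s: max(c, key=c.get) for s, c in CANDIDATES.items()}
    best = dict(current)
    temperature = t_start
    for _ in range(steps):
        # Propose a new label at one randomly chosen site.
        site = rng.choice(sorted(CANDIDATES))
        proposal = dict(current)
        proposal[site] = rng.choice(sorted(CANDIDATES[site]))
        delta = energy(proposal) - energy(current)
        # Metropolis rule: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = proposal
            if energy(current) < energy(best):
                best = dict(current)
        temperature *= cooling  # geometric cooling schedule
    return best


if __name__ == "__main__":
    print(anneal())
```

In this toy setup the detector alone prefers "stir" for the action, but the bond compatibilities make "pour" with "milk" and "bowl" the lowest-energy configuration, which is the kind of label-error correction through context that the abstract describes.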

Metadata
Title
Spatially Coherent Interpretations of Videos Using Pattern Theory
Authors
Fillipe D. M. de Souza
Sudeep Sarkar
Anuj Srivastava
Jingyong Su
Publication date
30 May 2016
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 1/2017
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-016-0913-6
