Published in: International Journal of Computer Vision, Issue 1/2017

05.10.2016

Semantic Decomposition and Recognition of Long and Complex Manipulation Action Sequences

Authors: Eren Erdal Aksoy, Adil Orhan, Florentin Wörgötter



Abstract

Understanding continuous human actions is a non-trivial but important problem in computer vision. Although a large body of work exists on the recognition of action sequences, most approaches struggle with the vast variation in motions, action combinations, and scene contexts. In this paper, we introduce a novel method for the semantic segmentation and recognition of long and complex manipulation tasks, such as “preparing a breakfast” or “making a sandwich”. We represent manipulations with our recently introduced “Semantic Event Chain” (SEC) concept, which captures the underlying spatiotemporal structure of an action invariant to motion, velocity, and scene context. Based solely on the spatiotemporal interactions between manipulated objects and hands in the extracted SEC, the framework automatically parses individual manipulation streams performed either sequentially or concurrently. Using event chains, our method further extracts the basic primitive elements of each parsed manipulation. Without requiring any prior object knowledge, the proposed framework can also extract object-like scene entities that play the same role in semantically similar manipulations. We conduct extensive experiments on various recent datasets to validate the robustness of the framework.
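To make the SEC idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: an event chain is encoded as rows of spatial relations (touching / not touching) between object pairs, with columns at key frames where any relation changes, and two chains are compared row by row. The object names, relation codes, and the similarity measure are simplified assumptions for illustration.

```python
# Relation codes (simplified): 'N' = not touching, 'T' = touching.
# Each row tracks one object pair across key frames; columns mark
# moments at which some relation in the scene changes.

sec_pushing = {
    ("hand", "object"):  ["N", "T", "T", "N"],  # hand approaches, pushes, withdraws
    ("object", "table"): ["T", "T", "T", "T"],  # object slides, never leaves the table
}

sec_pick_place = {
    ("hand", "object"):  ["N", "T", "T", "N"],  # hand grasps, carries, releases
    ("object", "table"): ["T", "T", "N", "T"],  # object is lifted, then put down
}

def row_similarity(a, b):
    """Fraction of key frames at which two relation rows agree."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def sec_similarity(sec1, sec2):
    """Average row similarity over the object pairs shared by two SECs."""
    pairs = set(sec1) & set(sec2)
    return sum(row_similarity(sec1[p], sec2[p]) for p in pairs) / len(pairs)

print(sec_similarity(sec_pushing, sec_pick_place))  # 0.875
```

The two actions share an identical hand-object row but differ in one entry of the object-table row, so the similarity is below 1; because only relation *changes* are encoded, the score is unaffected by how fast or along which trajectory the motion was performed, which is the invariance the abstract refers to.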


Metadata
Title
Semantic Decomposition and Recognition of Long and Complex Manipulation Action Sequences
Authors
Eren Erdal Aksoy
Adil Orhan
Florentin Wörgötter
Publication date
05.10.2016
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 1/2017
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-016-0956-8
