Published in: International Journal of Computer Vision, Issue 1/2017

05.10.2016

Semantic Decomposition and Recognition of Long and Complex Manipulation Action Sequences

Authors: Eren Erdal Aksoy, Adil Orhan, Florentin Wörgötter



Abstract

Understanding continuous human actions is a non-trivial but important problem in computer vision. Although a large body of work exists on the recognition of action sequences, most approaches struggle with the vast variation in motions, action combinations, and scene contexts. In this paper, we introduce a novel method for the semantic segmentation and recognition of long and complex manipulation tasks, such as “preparing a breakfast” or “making a sandwich”. We represent manipulations with our recently introduced “Semantic Event Chain” (SEC) concept, which captures the underlying spatiotemporal structure of an action invariant to motion, velocity, and scene context. Based solely on the spatiotemporal interactions between manipulated objects and hands in the extracted SEC, the framework automatically parses individual manipulation streams performed either sequentially or concurrently. Using event chains, our method further extracts the basic primitive elements of each parsed manipulation. Without requiring any prior object knowledge, the proposed framework can also extract object-like scene entities that play the same role in semantically similar manipulations. We conduct extensive experiments on various recent datasets to validate the robustness of the framework.
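To make the SEC idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: an event chain is encoded as rows of spatial relations (touching / not touching) between object pairs, with columns at key frames where any relation changes, and two chains are compared row by row. The object names, relation codes, and the similarity measure are simplified assumptions for illustration.

```python
# Relation codes (simplified): 'N' = not touching, 'T' = touching.
# Each row tracks one object pair across key frames; columns mark
# moments at which some relation in the scene changes.

sec_pushing = {
    ("hand", "object"):  ["N", "T", "T", "N"],  # hand approaches, pushes, withdraws
    ("object", "table"): ["T", "T", "T", "T"],  # object slides, never leaves the table
}

sec_pick_place = {
    ("hand", "object"):  ["N", "T", "T", "N"],  # hand grasps, carries, releases
    ("object", "table"): ["T", "T", "N", "T"],  # object is lifted, then put down
}

def row_similarity(a, b):
    """Fraction of key frames at which two relation rows agree."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def sec_similarity(sec1, sec2):
    """Average row similarity over the object pairs shared by two SECs."""
    pairs = set(sec1) & set(sec2)
    return sum(row_similarity(sec1[p], sec2[p]) for p in pairs) / len(pairs)

print(sec_similarity(sec_pushing, sec_pick_place))  # 0.875
```

The two actions share an identical hand-object row but differ in one entry of the object-table row, so the similarity is below 1; because only relation *changes* are encoded, the score is unaffected by how fast or along which trajectory the motion was performed, which is the invariance the abstract refers to.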


Metadata
Title
Semantic Decomposition and Recognition of Long and Complex Manipulation Action Sequences
Authors
Eren Erdal Aksoy
Adil Orhan
Florentin Wörgötter
Publication date
05.10.2016
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 1/2017
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-016-0956-8
