This research was partially supported by the Strategic Information and Communications R&D Promotion Program (No. 142103011) of the Ministry of Internal Affairs and Communications.
This paper proposes a novel framework for generating action descriptions from human whole-body motions and the objects being manipulated. Generation relies on three modules: the first categorizes human motions and objects; the second associates the motion and object categories with words; and the third extracts sentence structures as word sequences. Human motions and the manipulated objects are classified into categories by the first module, words highly relevant to the motion and object categories are generated by the second module, and the words are finally converted into sentences in the form of word sequences by the third module. The first and second modules parametrize the motions and objects, together with the relations among motions, objects, and words, stochastically, while the third module parametrizes sentence structures from a dataset of word sequences using a dynamical system. Linking the stochastic representation of motions, objects, and words with the dynamical representation of sentences makes it possible to synthesize sentences descriptive of human actions. We tested the proposed method on synthesizing action descriptions for a human action dataset captured by an RGB-D sensor and demonstrated its validity.
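To make the three-module flow concrete, the sketch below is a minimal, purely illustrative Python rendering of the pipeline the abstract describes. The category scores, the category-to-word association table, and the bigram decoder (a simple stand-in for the paper's dynamical-system sentence model) are all hypothetical; none of the names, numbers, or structures come from the authors' implementation.

```python
# Minimal sketch of the three-module pipeline described in the abstract.
# Everything here is a hypothetical stand-in: a real implementation would
# use learned stochastic models (e.g. HMMs) for categorization and a
# trained dynamical system for sentence structure.

def categorize(category_scores):
    """Module 1: pick the category whose model scores the observed
    motion/object features highest (stand-in for HMM likelihoods)."""
    return max(category_scores, key=category_scores.get)

# Module 2: stochastic association P(word | motion category, object category),
# hand-filled here; in the framework these parameters would be learned.
WORD_GIVEN_CATEGORY = {
    ("drink", "cup"): {"drinks": 0.4, "cup": 0.3, "from": 0.2, "person": 0.1},
    ("wipe", "table"): {"wipes": 0.4, "table": 0.3, "the": 0.2, "person": 0.1},
}

def relevant_words(motion, obj, top_k=4):
    """Return the words most strongly associated with the categories."""
    dist = WORD_GIVEN_CATEGORY[(motion, obj)]
    return sorted(dist, key=dist.get, reverse=True)[:top_k]

# Module 3: a toy bigram model standing in for the dynamical system
# that parametrizes sentence structure from word-sequence data.
BIGRAMS = {
    "<s>": ["a"], "a": ["person"], "person": ["drinks", "wipes"],
    "drinks": ["from"], "from": ["the"], "wipes": ["the"],
    "the": ["cup", "table"], "cup": ["</s>"], "table": ["</s>"],
}

def generate_sentence(words):
    """Greedy decode: follow bigram transitions, preferring words that
    the category-to-word module marked as relevant."""
    sentence, current = [], "<s>"
    while current != "</s>":
        candidates = BIGRAMS[current]
        preferred = [w for w in candidates if w in words]
        current = (preferred or candidates)[0]
        if current != "</s>":
            sentence.append(current)
    return " ".join(sentence)

# Usage: classify, associate words, then decode into a sentence.
motion = categorize({"drink": -12.3, "wipe": -20.1})  # toy log-likelihoods
obj = categorize({"cup": -8.7, "table": -15.2})
print(generate_sentence(relevant_words(motion, obj)))  # a person drinks from the cup
```

The point of the sketch is the division of labor: the stochastic modules decide *which* words are relevant to the observed action, and the sequence model decides *how* those words are ordered into a grammatical sentence.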
Linking human motions and objects to language for synthesizing action sentences
Springer US
Print ISSN: 0929-5593
Electronic ISSN: 1573-7527