Abstract
Intelligent gesture recognition systems open a new era of natural human-computer interaction: gesturing is instinctive, a skill we all have, so it requires little or no thought, leaving the focus on the task itself, as it should be, rather than on the interaction modality. We present a new approach to gesture recognition that attends to both body and hands, and interprets gestures continuously from an unsegmented and unbounded input stream. This article describes the whole procedure of continuous body and hand gesture recognition, from signal acquisition and processing to the interpretation of the processed signals.
Our system takes a vision-based approach, tracking body and hands using a single stereo camera. Body postures are reconstructed in 3D space using a generative model-based approach with a particle filter, combining both static and dynamic attributes of motion as the input feature to make tracking robust to self-occlusion. The reconstructed body postures guide the search for the hands. Hand shapes are classified into one of several canonical hand shapes using an appearance-based approach with a multiclass support vector machine. Finally, the extracted body and hand features are combined and used as the input for gesture recognition. We treat this task as an online sequence labeling and segmentation problem. A latent-dynamic conditional random field is used with a temporal sliding window to perform the task continuously. We augment this with a novel technique called multilayered filtering, which operates on both the input layer and the prediction layer. Filtering on the input layer captures long-range temporal dependencies and reduces input signal noise; filtering on the prediction layer takes weighted votes over multiple overlapping prediction results and reduces estimation noise.
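The prediction-layer filtering described above can be sketched as follows. This is a minimal illustration, not the paper's exact scheme: `predict_window` stands in for the trained latent-dynamic conditional random field (it returns one label per frame of its window), and the triangular vote weighting, which favors frames near the window center where temporal context is richest, is an assumed illustrative choice.

```python
from collections import defaultdict

def sliding_window_predict(stream, predict_window, window_size, stride):
    """Label an unsegmented frame stream by running a window-level
    predictor over overlapping temporal windows, then resolving each
    frame's label by a weighted vote over the overlapping predictions."""
    votes = [defaultdict(float) for _ in stream]
    for start in range(0, len(stream) - window_size + 1, stride):
        window = stream[start:start + window_size]
        labels = predict_window(window)  # one label per frame in the window
        for offset, label in enumerate(labels):
            # Triangular weight: frames near the window center count more,
            # since the predictor sees the most context there.
            weight = 1.0 - abs(offset - window_size / 2) / (window_size / 2)
            votes[start + offset][label] += weight
    # For each frame, pick the label with the highest accumulated weight;
    # frames never covered by a window stay unlabeled (None).
    return [max(v, key=v.get) if v else None for v in votes]
```

Because each frame appears in several overlapping windows, a single noisy window-level prediction is outvoted by its neighbors, which is the "reducing estimation noise" effect the abstract refers to.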
We tested our system in a scenario of real-world gestural interaction using the NATOPS dataset, an official vocabulary of aircraft handling gestures. Our experimental results show that (1) using both static and dynamic attributes of motion in body tracking yields a statistically significant improvement in recognition performance over using static attributes alone, and (2) multilayered filtering yields a statistically significant improvement over the nonfiltering method. On a set of twenty-four NATOPS gestures, our system achieves a recognition accuracy of 75.37%.