
Continuous body and hand gesture recognition for natural human-computer interaction

Published: 20 March 2012

Abstract

Intelligent gesture recognition systems open a new era of natural human-computer interaction: Gesturing is instinctive and a skill we all have, so it requires little or no thought, leaving the focus on the task itself, as it should be, not on the interaction modality. We present a new approach to gesture recognition that attends to both body and hands, and interprets gestures continuously from an unsegmented and unbounded input stream. This article describes the whole procedure of continuous body and hand gesture recognition, from the signal acquisition to processing, to the interpretation of the processed signals.

Our system takes a vision-based approach, tracking body and hands using a single stereo camera. Body postures are reconstructed in 3D space using a generative model-based approach with a particle filter, combining both static and dynamic attributes of motion as the input feature to make tracking robust to self-occlusion. The reconstructed body postures guide searching for hands. Hand shapes are classified into one of several canonical hand shapes using an appearance-based approach with a multiclass support vector machine. Finally, the extracted body and hand features are combined and used as the input feature for gesture recognition. We consider our task as an online sequence labeling and segmentation problem. A latent-dynamic conditional random field is used with a temporal sliding window to perform the task continuously. We augment this with a novel technique called multilayered filtering, which performs filtering both on the input layer and the prediction layer. Filtering on the input layer allows capturing long-range temporal dependencies and reducing input signal noise; filtering on the prediction layer allows taking weighted votes of multiple overlapping prediction results as well as reducing estimation noise.
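The prediction-layer filtering described above takes weighted votes over the multiple overlapping per-frame predictions produced by the temporal sliding window. A minimal sketch of that idea is below; the function name, the Gaussian weighting of frames near each window's center, and the data layout are all illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def fuse_window_votes(window_preds, window_starts, seq_len, n_labels):
    """Combine per-frame label predictions from overlapping sliding
    windows by weighted voting. Frames near a window's center receive
    higher weight via a Gaussian taper. Illustrative sketch only."""
    votes = np.zeros((seq_len, n_labels))
    for preds, start in zip(window_preds, window_starts):
        n = len(preds)
        # Gaussian weights peaking at the window center
        centers = np.arange(n) - (n - 1) / 2.0
        w = np.exp(-0.5 * (centers / (n / 4.0)) ** 2)
        for i, label in enumerate(preds):
            votes[start + i, label] += w[i]
    # Per-frame label = weighted majority across all covering windows
    return votes.argmax(axis=1)
```

Because each frame's final label aggregates several windows' votes, a single noisy window prediction is outvoted by its neighbors, which is the smoothing effect the multilayered filtering aims for.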

We tested our system in a scenario of real-world gestural interaction using the NATOPS dataset, an official vocabulary of aircraft handling gestures. Our experimental results show that: (1) the use of both static and dynamic attributes of motion in body tracking allows statistically significant improvement of the recognition performance over using static attributes of motion alone; and (2) the multilayered filtering statistically significantly improves recognition performance over the nonfiltering method. We also show that, on a set of twenty-four NATOPS gestures, our system achieves a recognition accuracy of 75.37%.



• Published in

  ACM Transactions on Interactive Intelligent Systems, Volume 2, Issue 1
  Special Issue on Affective Interaction in Natural Environments
  March 2012, 171 pages
  ISSN: 2160-6455  EISSN: 2160-6463
  DOI: 10.1145/2133366

            Copyright © 2012 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

            Publication History

            • Published: 20 March 2012
            • Revised: 1 December 2011
            • Accepted: 1 December 2011
            • Received: 1 December 2010

            Qualifiers

            • research-article
            • Research
            • Refereed
