Abstract
Intelligent gesture recognition systems open a new era of natural human-computer interaction: gesturing is instinctive, a skill we all have, so it requires little or no thought, leaving the focus on the task itself, as it should be, rather than on the interaction modality. We present a new approach to gesture recognition that attends to both body and hands, and interprets gestures continuously from an unsegmented and unbounded input stream. This article describes the whole procedure of continuous body and hand gesture recognition, from signal acquisition and processing to the interpretation of the processed signals.
Our system takes a vision-based approach, tracking body and hands using a single stereo camera. Body postures are reconstructed in 3D space using a generative model-based approach with a particle filter, combining both static and dynamic attributes of motion as the input feature to make tracking robust to self-occlusion. The reconstructed body postures guide the search for the hands. Hand shapes are classified into one of several canonical hand shapes using an appearance-based approach with a multiclass support vector machine. Finally, the extracted body and hand features are combined and used as the input for gesture recognition. We treat this task as an online sequence labeling and segmentation problem. A latent-dynamic conditional random field is used with a temporal sliding window to perform the task continuously. We augment this with a novel technique called multilayered filtering, which operates on both the input layer and the prediction layer. Filtering on the input layer captures long-range temporal dependencies and reduces input signal noise; filtering on the prediction layer takes weighted votes over multiple overlapping prediction results and reduces estimation noise.
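The prediction-layer filtering described above can be sketched as follows. This is a minimal illustration, not the paper's exact scheme: `predict_window` stands in for the trained latent-dynamic conditional random field (it returns one label per frame of its window), and the triangular vote weighting, which favors frames near the window center where temporal context is richest, is an assumed illustrative choice.

```python
from collections import defaultdict

def sliding_window_predict(stream, predict_window, window_size, stride):
    """Label an unsegmented frame stream by running a window-level
    predictor over overlapping temporal windows, then resolving each
    frame's label by a weighted vote over the overlapping predictions."""
    votes = [defaultdict(float) for _ in stream]
    for start in range(0, len(stream) - window_size + 1, stride):
        window = stream[start:start + window_size]
        labels = predict_window(window)  # one label per frame in the window
        for offset, label in enumerate(labels):
            # Triangular weight: frames near the window center count more,
            # since the predictor sees the most context there.
            weight = 1.0 - abs(offset - window_size / 2) / (window_size / 2)
            votes[start + offset][label] += weight
    # For each frame, pick the label with the highest accumulated weight;
    # frames never covered by a window stay unlabeled (None).
    return [max(v, key=v.get) if v else None for v in votes]
```

Because each frame appears in several overlapping windows, a single noisy window-level prediction is outvoted by its neighbors, which is the "reducing estimation noise" effect the abstract refers to.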
We tested our system in a scenario of real-world gestural interaction using the NATOPS dataset, an official vocabulary of aircraft handling gestures. Our experimental results show that (1) using both static and dynamic attributes of motion in body tracking yields a statistically significant improvement in recognition performance over using static attributes alone, and (2) multilayered filtering yields a statistically significant improvement over the nonfiltering method. On a set of twenty-four NATOPS gestures, our system achieves a recognition accuracy of 75.37%.