Real-time gesture recognition system and application

https://doi.org/10.1016/S0262-8856(02)00113-0

Abstract

In this paper, we consider a vision-based system that can interpret a user's gestures in real time to manipulate windows and objects within a graphical user interface. A hand segmentation procedure first extracts binary hand blob(s) from each frame of the acquired image sequence. Fourier descriptors are used to represent the shape of the hand blobs and are input to radial-basis function (RBF) network(s) for pose classification. The pose likelihood vector from the RBF network output is used as input to the gesture recognizer, along with motion information. Gesture recognition performance using hidden Markov models (HMM) and recurrent neural networks (RNN) was investigated. Test results showed that the continuous HMM yielded the best performance, with a gesture recognition rate of 90.2%. Experiments with combining the continuous HMMs and RNNs revealed that a linear combination of the two classifiers improved the classification rate to 91.9%. The gesture recognition system was deployed in a prototype user interface application, and users who tested it found the gestures intuitive and the application easy to use. Real-time processing rates of up to 22 frames per second were obtained.

Introduction

As computers become more pervasive in society, facilitating natural human–computer interaction (HCI) will have a positive impact on their use. Hence, there has been growing interest in the development of new approaches and technologies for bridging the gap between humans and computers. The ultimate aim is to bring HCI to a regime where interactions with computers will be as natural as interactions between humans, and to this end, incorporating gestures in HCI is an important research area [1].

We are interested in developing a vision-based system which can interpret a user's gestures in real time to manipulate windows and objects within a graphical user interface (GUI). Various works by Kadobayashi et al. [2], Pavlovic et al. [3], Freeman et al. [4] and Kjeldsen et al. [5] indicate that there is keen interest among current researchers to incorporate gestures into traditional HCI interfaces. Our work expands on their ideas and also looks into the possibility of using two-handed gestures while imposing fewer constraints on the users.

Much of the research on real-time gesture recognition has focused on the space-time trajectory of the hand without considering the shape or posture of the hand [6], [7], [8]. These works utilized only relative or oscillatory motion of the hand to recognize the gesture. However, in many situations, the meaning of gestures depends very much on the hand posture, in addition to the hand movement. Hence, our work incorporates hand posture as well as hand motion to recognize gestures. Also, unlike other works where users are required to wear artificial devices like data gloves [9] or green markers [10], it is our aim to allow the users to perform gestures in a natural and unencumbered manner.

In the work by Kjeldsen and Kender, gestures were incorporated into a windowing user interface to manipulate windows [5]. The hand was segmented by using a neural network whose inputs were images coded in the hue-saturation-intensity (HSI) color model. Another neural network was trained to classify each pose. Gesture interpretation was performed by a state machine which implemented a gesture grammar. Their work demonstrated the feasibility of using gestures in a modern GUI. Our system differs from Ref. [5] in a few ways. We defined a slightly larger gesture set, and our system is specifically designed to allow the user to employ two-handed gestures if he wishes. Moreover, our system does not need to be re-trained for every new user; it needs only to be trained once to achieve a relatively high level of user-invariance. Finally, the processing steps are quite different.

In the following, we present an overview of the system in Section 2. In Section 3, we describe the segmentation procedure to locate the hand(s) in the image. We then discuss a wrist-cropping method to isolate the segmented hand from the rest of the arm in Section 4. Next, in Section 5, we describe pose classification using RBF networks, with Fourier descriptors of the segmented hand boundary as input features. We develop and compare the performance of the gesture recognizer based on hidden Markov models (HMM) and recurrent neural networks (RNN) in Section 6. We also consider enhancing the recognition performance by combining the classifiers. In Section 7, we present and discuss the recognition results. A prototype GUI application that was developed to run in real time is described in Section 8, and Section 9 concludes the paper.

Section snippets

System description

Fourteen gestures shown in Fig. 1 were defined for controlling the windowing system. The Point gesture is used to move the cursor on the screen. The user can select a window/object to manipulate using the Pick gesture. Windows can be minimized with the Close gesture and restored with the Open gesture. The size of the window/object can be varied in different directions using different Enlarge and Shrink gestures. The Undo gestures can be used to reverse the previous action. These gestures are
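For illustration, a minimal sketch of how such a gesture vocabulary might be mapped onto window operations is given below as a dispatch table. The gesture names follow Fig. 1, but the window-manager methods (move_cursor, select_window, resize, and so on) are hypothetical, and the directional Enlarge/Shrink variants are collapsed into single entries for brevity.

```python
# Hypothetical mapping from recognized gesture labels (Fig. 1) to window actions.
# The window-manager interface (wm) and its method names are illustrative only.

GESTURE_ACTIONS = {
    "Point":   lambda wm, ctx: wm.move_cursor(ctx["centroid"]),
    "Pick":    lambda wm, ctx: wm.select_window(ctx["centroid"]),
    "Close":   lambda wm, ctx: wm.minimize(ctx["target"]),
    "Open":    lambda wm, ctx: wm.restore(ctx["target"]),
    "Enlarge": lambda wm, ctx: wm.resize(ctx["target"], scale=1.1),
    "Shrink":  lambda wm, ctx: wm.resize(ctx["target"], scale=0.9),
    "Undo":    lambda wm, ctx: wm.undo_last_action(),
}

def dispatch(gesture, wm, ctx):
    """Invoke the window-manager action associated with a recognized gesture."""
    action = GESTURE_ACTIONS.get(gesture)
    if action is not None:
        action(wm, ctx)
```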

Segmentation

Fig. 4 depicts the segmentation process, which uses color and motion cues. The camera's field of view may contain objects moving in the background, but it is assumed that hands are the only skin-colored objects in the view, to simplify their extraction. Background differencing is used to isolate the moving object region, followed by a segmentation process to extract the skin-colored objects (hand and arm). A wrist-cropping operation is next used to separate the hand from the segmented arm.
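A minimal sketch of this two-cue segmentation is shown below, combining background differencing with a skin-colour threshold in HSV space. The threshold values, the use of a single static reference frame, and the morphological clean-up are illustrative assumptions rather than the paper's actual parameters.

```python
import cv2
import numpy as np

# Illustrative thresholds; the paper does not specify its exact values.
SKIN_LOWER = np.array([0, 40, 60], dtype=np.uint8)     # H, S, V lower bound
SKIN_UPPER = np.array([25, 255, 255], dtype=np.uint8)  # H, S, V upper bound
MOTION_THRESH = 25                                      # grey-level difference

def segment_hand(frame_bgr, background_bgr):
    """Return a binary mask of moving, skin-coloured pixels (hand/arm candidates)."""
    # Motion cue: background differencing on grey-level images.
    grey = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    bg_grey = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(grey, bg_grey)
    _, moving = cv2.threshold(diff, MOTION_THRESH, 255, cv2.THRESH_BINARY)

    # Colour cue: skin-coloured pixels in HSV space.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, SKIN_LOWER, SKIN_UPPER)

    # Combine cues and remove small speckles.
    mask = cv2.bitwise_and(moving, skin)
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```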

Wrist-cropping

The binary image obtained by segmentation is further processed to isolate the hand from the rest of the lower arm, by a wrist-cropping procedure. This is necessary because the segmented image may or may not include the lower arm, depending on whether the user is wearing a long-sleeved shirt, a watch, or other wrist ornaments. This can result in significantly different features being extracted for the same hand pose, and lead to increased complexity in the pose classifier. To avoid this, we
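The paper's exact cropping rule is not reproduced in the snippet above; as one plausible heuristic, the sketch below assumes the fingers point towards the top of the image and cuts the blob at the narrowest row below the palm, taken to be the wrist. This is an illustrative assumption, not the authors' stated method.

```python
import numpy as np

def crop_at_wrist(mask):
    """Crop a binary hand+arm blob at an estimated wrist line.

    Heuristic sketch: assuming the fingers point towards the top of the image,
    find the widest occupied row (assumed palm) and cut at the narrowest row
    below it (assumed wrist), discarding the lower arm.
    """
    rows = np.where(mask.any(axis=1))[0]
    if rows.size == 0:
        return mask
    widths = (mask[rows] > 0).sum(axis=1)     # blob width on each occupied row
    palm_idx = int(np.argmax(widths))         # widest row, assumed to be the palm
    wrist_row = rows[palm_idx + int(np.argmin(widths[palm_idx:]))]
    cropped = mask.copy()
    cropped[wrist_row + 1:, :] = 0            # discard everything below the wrist
    return cropped
```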

Fourier descriptors

We used Fourier descriptors [13] to represent the boundary of the extracted binary hand as the set of complex numbers $b_k = x_k + jy_k$, where $(x_k, y_k)$ are the boundary pixels. This is re-sampled to a fixed-length sequence $\{f_k,\; k = 0, 1, \ldots, N-1\}$ for use with the discrete Fourier transform (DFT). Denoting $\{F_n\}$ as the DFT coefficients, the set of (rotation, scale, and translation invariant) Fourier descriptors is given by
$$A_n = \frac{|F_n|}{|F_1|}, \qquad 2 \le n < N.$$
We used a set of 56 Fourier descriptors, resulting from using a
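A short NumPy sketch of computing these descriptors follows. The boundary is assumed to be an ordered list of (x, y) contour points, and the resampling length of 64 is illustrative only; normalising by |F_1| removes scale, taking magnitudes removes rotation and starting-point dependence, and dropping F_0 removes translation.

```python
import numpy as np

def fourier_descriptors(boundary_xy, num_samples=64):
    """Compute the invariant Fourier descriptors A_n = |F_n| / |F_1|, 2 <= n < N,
    from an ordered hand-boundary contour.

    `boundary_xy` is an (M, 2) array of (x, y) boundary pixels; `num_samples`
    is an illustrative resampling length, not the paper's exact value.
    """
    boundary_xy = np.asarray(boundary_xy, dtype=float)
    # Represent each boundary pixel as a complex number b_k = x_k + j*y_k.
    b = boundary_xy[:, 0] + 1j * boundary_xy[:, 1]
    # Re-sample the boundary to a fixed-length sequence f_k, k = 0..N-1.
    idx = np.linspace(0, len(b) - 1, num_samples)
    xp = np.arange(len(b))
    f = np.interp(idx, xp, b.real) + 1j * np.interp(idx, xp, b.imag)
    # DFT coefficients F_n, then the invariant descriptors A_n for n >= 2.
    F = np.fft.fft(f)
    return np.abs(F[2:]) / np.abs(F[1])
```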

Gesture recognition

Gesture recognition uses the pose classification results and motion information of the centroids of the segmented hand(s) to classify the current frame as belonging to one of the fourteen predefined gestures. Input to the gesture recognition module is a nine-element vector $\mathbf{u} = [u_0 \;\ldots\; u_8]^T$, which consists of five elements $[u_0 \;\ldots\; u_4]$ from the pose classifier, and four additional elements $[u_5 \;\ldots\; u_8]$, which encode the hand centroid motion and location. If the centroids of the primary hand and secondary
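As a sketch of how such an input vector might be assembled, the code below concatenates the five pose likelihoods with a normalised centroid displacement and position. The particular motion/location encoding used for u_5 to u_8 here is an illustrative assumption; the paper's exact definition (including how the secondary hand is handled) is not reproduced.

```python
import numpy as np

def gesture_feature_vector(pose_likelihoods, centroid, prev_centroid, frame_size):
    """Assemble a nine-element input vector u = [u_0 ... u_8]^T for the gesture
    recognizer: u_0..u_4 are the RBF pose-classifier likelihoods, and u_5..u_8
    encode hand centroid motion and location (illustrative encoding only).
    """
    w, h = frame_size
    dx = (centroid[0] - prev_centroid[0]) / w   # normalised horizontal motion
    dy = (centroid[1] - prev_centroid[1]) / h   # normalised vertical motion
    x = centroid[0] / w                         # normalised horizontal location
    y = centroid[1] / h                         # normalised vertical location
    return np.concatenate([np.asarray(pose_likelihoods, dtype=float),
                           [dx, dy, x, y]])
```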

Results and discussion

In this section, we present results and discussion on the different components of the system, as well as overall gesture recognition performance.

Application

The prototype application simulates a windowing GUI driven by gesture. The processing rate for frames acquired in real time is 22 fps. Fig. 13 shows a screen shot of the application. A major portion of the screen is the simulated desktop, where the user can manipulate windows and objects as in any other GUI desktop environment. To the right of the simulated desktop is a tweak panel for users to adjust the simulation parameters according to personal preference. Just below it is an image display

Conclusions

In this paper, we considered a vision-based system that can interpret a user's gestures in real time to manipulate windows and objects within a graphical user interface. Every frame from the acquired image sequence was processed through five different stages, viz. hand segmentation, wrist-cropping, feature extraction, pose classification, and gesture recognition. Output from the gesture recognition module was used in an application to control windows and objects in a simulated GUI.

References (25)

  • B. Raytchev et al.

    User-independent online gesture recognition by relative motion extraction

    Pattern Recogn. Lett.

    (2000)
  • V.I. Pavlovic et al.

    Visual interpretation of hand gestures for human–computer interaction: a review

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1997)
  • R. Kadobayashi et al.

    Design and evaluation of an immersive walk-through application for exploring cyberspace

    Proc. Third Int. Conf. Autom. Face Gesture Recogn.

    (1998)
  • V.I. Pavlovic et al.

    Gestural interface to a visual computing environment for molecular biologists

    Proc. Int. Conf. Autom. Face Gesture Recogn., Killington, Vt

    (1996)
  • W.T. Freeman et al.

    Computer vision for computer games

    Proc. Int. Conf. Autom. Face Gesture Recogn.

    (1996)
  • R. Kjeldsen et al.

    Towards the use of gesture in traditional user interface

    Proc. Int. Conf. Autom. Face Gesture Recogn.

    (1996)
  • C.J. Cohen et al.

    Dynamical system representation, generation, and recognition of basic oscillatory motion gestures

    Proc. Int. Conf. Autom. Face Gesture Recogn., Killington, Vt

    (1996)
  • S. Nagaya et al.

    A theoretical consideration of pattern space trajectory for gesture spotting recognition

    Proc. Int. Conf. Autom. Face Gesture Recogn., Killington, Vt

    (1996)
  • R. Liang et al.

A real-time gesture recognition system for sign language

    Proc. Third Int. Conf. Autom. Face Gesture Recogn.

    (1998)
  • M. Hoch

    A prototype system for intuitive film planning

    Proc. Third Int. Conf. Autom. Face Gesture Recogn.

    (1998)
  • J. Yang, A. Waibel, Tracking Human Faces in Real Time, Technical Report CMU-CS-95-210, Department of Computer Science,...
  • E.S. Koh, Pose Recognition System, BE Thesis, National University of Singapore,...