Real-time gesture recognition system and application

https://doi.org/10.1016/S0262-8856(02)00113-0

Abstract

In this paper, we consider a vision-based system that can interpret a user's gestures in real time to manipulate windows and objects within a graphical user interface. A hand segmentation procedure first extracts binary hand blob(s) from each frame of the acquired image sequence. Fourier descriptors are used to represent the shape of the hand blobs and are input to radial-basis function (RBF) network(s) for pose classification. The pose likelihood vector from the RBF network output is used as input to the gesture recognizer, along with motion information. Gesture recognition performance using hidden Markov models (HMM) and recurrent neural networks (RNN) was investigated. Test results showed that the continuous HMM yielded the best performance, with a gesture recognition rate of 90.2%. Experiments with combining the continuous HMMs and RNNs revealed that a linear combination of the two classifiers improved the classification rate to 91.9%. The gesture recognition system was deployed in a prototype user interface application, and users who tested it found the gestures intuitive and the application easy to use. Real-time processing rates of up to 22 frames per second were obtained.

Introduction

As computers become more pervasive in society, facilitating natural human–computer interaction (HCI) will have a positive impact on their use. Hence, there has been growing interest in the development of new approaches and technologies for bridging the gap between humans and computers. The ultimate aim is to bring HCI to a regime where interactions with computers will be as natural as interactions between humans, and to this end, incorporating gestures in HCI is an important research area [1].

We are interested in developing a vision-based system which can interpret a user's gestures in real time to manipulate windows and objects within a graphical user interface (GUI). Various works by Kadobayashi et al. [2], Pavlovic et al. [3], Freeman et al. [4] and Kjeldsen et al. [5] indicate that there is keen interest among current researchers to incorporate gestures into traditional HCI interfaces. Our work expands on their ideas and also looks into the possibility of using two-handed gestures while imposing fewer constraints on the users.

Much of the research on real-time gesture recognition has focused on the space-time trajectory of the hand without considering the shape or posture of the hand [6], [7], [8]. These works utilized only relative or oscillatory motion of the hand to recognize the gesture. However, in many situations, the meaning of gestures depends very much on the hand posture, in addition to the hand movement. Hence, our work incorporates hand posture as well as hand motion to recognize gestures. Also, unlike other works where users are required to wear artificial devices like data gloves [9] or green markers [10], it is our aim to allow the users to perform gestures in a natural and unencumbered manner.

In the work by Kjeldsen and Kender, gestures were incorporated into a windowing user interface to manipulate windows [5]. The hand was segmented by using a neural network whose inputs were images coded in the hue-saturation-intensity (HSI) color model. Another neural network was trained to classify each pose. Gesture interpretation was performed by a state machine which implemented a gesture grammar. Their work demonstrated the feasibility of using gestures in a modern GUI. Our system differs from Ref. [5] in a few ways. We defined a slightly larger gesture set, and our system is specifically designed to allow the user to employ two-handed gestures if he wishes. Moreover, our system does not need to be re-trained for every new user; it needs only to be trained once to achieve a relatively high level of user-invariance. Finally, the processing steps are quite different.

In the following, we present an overview of the system in Section 2. In Section 3, we describe the segmentation procedure to locate the hand(s) in the image. We then discuss a wrist-cropping method to isolate the segmented hand from the rest of the arm in Section 4. Next, in Section 5, we describe pose classification using RBF networks, with Fourier descriptors of the segmented hand boundary as input features. We develop and compare the performance of the gesture recognizer based on hidden Markov models (HMM) and recurrent neural networks (RNN) in Section 6. We also consider enhancing the recognition performance by combining the classifiers. In Section 7, we present and discuss the recognition results. A prototype GUI application that was developed to run in real time is described in Section 8, and Section 9 concludes the paper.

Section snippets

System description

Fourteen gestures shown in Fig. 1 were defined for controlling the windowing system. The Point gesture is used to move the cursor on the screen. The user can select a window/object to manipulate using the Pick gesture. Windows can be minimized with the Close gesture and restored with the Open gesture. The size of the window/object can be varied in different directions using different Enlarge and Shrink gestures. The Undo gestures can be used to reverse the previous action. These gestures are
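For illustration, a minimal sketch of how such a gesture vocabulary might be mapped onto window operations is given below as a dispatch table. The gesture names follow Fig. 1, but the window-manager methods (move_cursor, select_window, resize, and so on) are hypothetical, and the directional Enlarge/Shrink variants are collapsed into single entries for brevity.

```python
# Hypothetical mapping from recognized gesture labels (Fig. 1) to window actions.
# The window-manager interface (wm) and its method names are illustrative only.

GESTURE_ACTIONS = {
    "Point":   lambda wm, ctx: wm.move_cursor(ctx["centroid"]),
    "Pick":    lambda wm, ctx: wm.select_window(ctx["centroid"]),
    "Close":   lambda wm, ctx: wm.minimize(ctx["target"]),
    "Open":    lambda wm, ctx: wm.restore(ctx["target"]),
    "Enlarge": lambda wm, ctx: wm.resize(ctx["target"], scale=1.1),
    "Shrink":  lambda wm, ctx: wm.resize(ctx["target"], scale=0.9),
    "Undo":    lambda wm, ctx: wm.undo_last_action(),
}

def dispatch(gesture, wm, ctx):
    """Invoke the window-manager action associated with a recognized gesture."""
    action = GESTURE_ACTIONS.get(gesture)
    if action is not None:
        action(wm, ctx)
```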

Segmentation

Fig. 4 depicts the segmentation process, which uses color and motion cues. The camera's field of view may contain objects moving in the background, but it is assumed that hands are the only skin-colored objects in the view, to simplify their extraction. Background differencing is used to isolate the moving object region, followed by a segmentation process to extract the skin-colored objects (hand and arm). A wrist-cropping operation is next used to separate the hand from the segmented arm.
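A minimal sketch of this two-cue segmentation is shown below, combining background differencing with a skin-colour threshold in HSV space. The threshold values, the use of a single static reference frame, and the morphological clean-up are illustrative assumptions rather than the paper's actual parameters.

```python
import cv2
import numpy as np

# Illustrative thresholds; the paper does not specify its exact values.
SKIN_LOWER = np.array([0, 40, 60], dtype=np.uint8)     # H, S, V lower bound
SKIN_UPPER = np.array([25, 255, 255], dtype=np.uint8)  # H, S, V upper bound
MOTION_THRESH = 25                                      # grey-level difference

def segment_hand(frame_bgr, background_bgr):
    """Return a binary mask of moving, skin-coloured pixels (hand/arm candidates)."""
    # Motion cue: background differencing on grey-level images.
    grey = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    bg_grey = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(grey, bg_grey)
    _, moving = cv2.threshold(diff, MOTION_THRESH, 255, cv2.THRESH_BINARY)

    # Colour cue: skin-coloured pixels in HSV space.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, SKIN_LOWER, SKIN_UPPER)

    # Combine cues and remove small speckles.
    mask = cv2.bitwise_and(moving, skin)
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```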

Wrist-cropping

The binary image obtained by segmentation is further processed to isolate the hand from the rest of the lower arm, by a wrist-cropping procedure. This is necessary because the segmented image may or may not include the lower arm, depending on whether the user is wearing a long-sleeved shirt, a watch, or other wrist ornaments. This can result in significantly different features being extracted for the same hand pose, and lead to increased complexity in the pose classifier. To avoid this, we
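The paper's exact cropping rule is not reproduced in the snippet above; as one plausible heuristic, the sketch below assumes the fingers point towards the top of the image and cuts the blob at the narrowest row below the palm, taken to be the wrist. This is an illustrative assumption, not the authors' stated method.

```python
import numpy as np

def crop_at_wrist(mask):
    """Crop a binary hand+arm blob at an estimated wrist line.

    Heuristic sketch: assuming the fingers point towards the top of the image,
    find the widest occupied row (assumed palm) and cut at the narrowest row
    below it (assumed wrist), discarding the lower arm.
    """
    rows = np.where(mask.any(axis=1))[0]
    if rows.size == 0:
        return mask
    widths = (mask[rows] > 0).sum(axis=1)     # blob width on each occupied row
    palm_idx = int(np.argmax(widths))         # widest row, assumed to be the palm
    wrist_row = rows[palm_idx + int(np.argmin(widths[palm_idx:]))]
    cropped = mask.copy()
    cropped[wrist_row + 1:, :] = 0            # discard everything below the wrist
    return cropped
```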

Fourier descriptors

We used Fourier descriptors [13] to represent the boundary of the extracted binary hand as the set of complex numbers $b_k = x_k + jy_k$, where $(x_k, y_k)$ are the boundary pixels. This is re-sampled to a fixed-length sequence $\{f_k,\; k = 0, 1, \ldots, N-1\}$ for use with the discrete Fourier transform (DFT). Denoting $\{F_n\}$ as the DFT coefficients, the set of (rotation, scale, and translation invariant) Fourier descriptors is given by
$$A_n = \frac{|F_n|}{|F_1|}, \qquad 2 \le n < N.$$
We used a set of 56 Fourier descriptors, resulting from using a
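A short NumPy sketch of computing these descriptors follows. The boundary is assumed to be an ordered list of (x, y) contour points, and the resampling length of 64 is illustrative only; normalising by |F_1| removes scale, taking magnitudes removes rotation and starting-point dependence, and dropping F_0 removes translation.

```python
import numpy as np

def fourier_descriptors(boundary_xy, num_samples=64):
    """Compute the invariant Fourier descriptors A_n = |F_n| / |F_1|, 2 <= n < N,
    from an ordered hand-boundary contour.

    `boundary_xy` is an (M, 2) array of (x, y) boundary pixels; `num_samples`
    is an illustrative resampling length, not the paper's exact value.
    """
    boundary_xy = np.asarray(boundary_xy, dtype=float)
    # Represent each boundary pixel as a complex number b_k = x_k + j*y_k.
    b = boundary_xy[:, 0] + 1j * boundary_xy[:, 1]
    # Re-sample the boundary to a fixed-length sequence f_k, k = 0..N-1.
    idx = np.linspace(0, len(b) - 1, num_samples)
    xp = np.arange(len(b))
    f = np.interp(idx, xp, b.real) + 1j * np.interp(idx, xp, b.imag)
    # DFT coefficients F_n, then the invariant descriptors A_n for n >= 2.
    F = np.fft.fft(f)
    return np.abs(F[2:]) / np.abs(F[1])
```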

Gesture recognition

Gesture recognition uses the pose classification results and motion information of the centroids of the segmented hand(s) to classify the current frame as belonging to one of the fourteen predefined gestures. Input to the gesture recognition module is a nine-element vector $\mathbf{u} = [u_0 \;\ldots\; u_8]^T$, which consists of five elements $[u_0 \;\ldots\; u_4]$ from the pose classifier, and four additional elements $[u_5 \;\ldots\; u_8]$, which encode the hand centroid motion and location. If the centroids of the primary hand and secondary
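As a sketch of how such an input vector might be assembled, the code below concatenates the five pose likelihoods with a normalised centroid displacement and position. The particular motion/location encoding used for u_5 to u_8 here is an illustrative assumption; the paper's exact definition (including how the secondary hand is handled) is not reproduced.

```python
import numpy as np

def gesture_feature_vector(pose_likelihoods, centroid, prev_centroid, frame_size):
    """Assemble a nine-element input vector u = [u_0 ... u_8]^T for the gesture
    recognizer: u_0..u_4 are the RBF pose-classifier likelihoods, and u_5..u_8
    encode hand centroid motion and location (illustrative encoding only).
    """
    w, h = frame_size
    dx = (centroid[0] - prev_centroid[0]) / w   # normalised horizontal motion
    dy = (centroid[1] - prev_centroid[1]) / h   # normalised vertical motion
    x = centroid[0] / w                         # normalised horizontal location
    y = centroid[1] / h                         # normalised vertical location
    return np.concatenate([np.asarray(pose_likelihoods, dtype=float),
                           [dx, dy, x, y]])
```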

Results and discussion

In this section, we present results and discussion on the different components of the system, as well as overall gesture recognition performance.

Application

The prototype application simulates a windowing GUI driven by gesture. The processing rate for frames acquired in real time is 22 fps. Fig. 13 shows a screen shot of the application. A major portion of the screen is the simulated desktop, where the user can manipulate windows and objects as in any other GUI desktop environment. To the right of the simulated desktop is a tweak panel for users to adjust the simulation parameters according to personal preference. Just below it is an image display

Conclusions

In this paper, we considered a vision-based system that can interpret a user's gestures in real time to manipulate windows and objects within a graphical user interface. Every frame from the acquired image sequence was processed through five different stages, viz. hand segmentation, wrist-cropping, feature extraction, pose classification, and gesture recognition. Output from the gesture recognition module was used in an application to control windows and objects in a simulated GUI.

References (25)

  • B. Raytchev et al.

    User-independent online gesture recognition by relative motion extraction

    Pattern Recogn. Lett.

    (2000)
  • V.I. Pavlovic et al.

    Visual interpretation of hand gestures for human–computer interaction: a review

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1997)
  • R. Kadobayashi et al.

    Design and evaluation of an immersive walk-through application for exploring cyberspace

    Proc. Third Int. Conf. Autom. Face Gesture Recogn.

    (1998)
  • V.I. Pavlovic et al.

    Gestural interface to a visual computing environment for molecular biologists

    Proc. Int. Conf. Autom. Face Gesture Recogn., Killington, Vt

    (1996)
  • W.T. Freeman et al.

    Computer vision for computer games

    Proc. Int. Conf. Autom. Face Gesture Recogn.

    (1996)
  • R. Kjeldsen et al.

    Towards the use of gesture in traditional user interface

    Proc. Int. Conf. Autom. Face Gesture Recogn.

    (1996)
  • C.J. Cohen et al.

    Dynamical system representation, generation, and recognition of basic oscillatory motion gestures

    Proc. Int. Conf. Autom. Face Gesture Recogn., Killington, Vt

    (1996)
  • S. Nagaya et al.

    A theoretical consideration of pattern space trajectory for gesture spotting recognition

    Proc. Int. Conf. Autom. Face Gesture Recogn., Killington, Vt

    (1996)
  • R. Liang et al.

A real-time gesture recognition system for sign language

    Proc. Third Int. Conf. Autom. Face Gesture Recogn.

    (1998)
  • M. Hoch

    A prototype system for intuitive film planning

    Proc. Third Int. Conf. Autom. Face Gesture Recogn.

    (1998)
  • J. Yang, A. Waibel, Tracking Human Faces in Real Time, Technical Report CMU-CS-95-210, Department of Computer Science,...
  • E.S. Koh, Pose Recognition System, BE Thesis, National University of Singapore,...