Introduction
Taxonomy of hand gesture recognition
Sensor-based methods
Computer vision-based methods
Gesture recognition processes
- Data acquisition: acquiring gesture images with a video camera and preprocessing the images;
- Gesture detection and segmentation: detecting the position of the hand in the gesture image and segmenting the hand region;
- Gesture recognition: extracting image features from the hand region and recognizing the gesture type based on those features. In “Hand gesture recognition process”, the discussion addresses each of these parts in turn.
Hand gesture recognition process
Data acquisition
Image grayscaling
Image smoothing
Edge detection
Morphological image processing
Optimum thresholding
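The preprocessing steps listed above (grayscaling, smoothing, and optimum thresholding) can be sketched in a few lines. The following is a minimal NumPy illustration, not code from any surveyed system; in practice, libraries such as OpenCV provide optimized equivalents (`cv2.cvtColor`, `cv2.GaussianBlur`, and `cv2.threshold` with `cv2.THRESH_OTSU`).

```python
import numpy as np

def to_gray(rgb):
    """Luminance grayscaling using ITU-R BT.601 weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def box_smooth(img, k=3):
    """Mean-filter smoothing with a k x k box kernel (edge padding)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def otsu_threshold(gray):
    """Optimum (Otsu) threshold: maximize between-class variance."""
    hist, _ = np.histogram(gray.astype(np.uint8), bins=256, range=(0, 256))
    p = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t] * p[:t]).sum() / w0
        mu1 = (levels[t:] * p[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Synthetic "hand" frame: a bright region on a dark background.
frame = np.zeros((64, 64, 3), dtype=np.uint8)
frame[16:48, 16:48] = 200
gray = to_gray(frame)
smooth = box_smooth(gray)
t = otsu_threshold(smooth)
mask = smooth > t  # binary hand mask
```

Real frames have far less separable histograms, which is why the later sections combine thresholding with skin color, contours, or learned detectors.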
Gesture detection and segmentation
Methods | Representative algorithms | Advantages | Disadvantages
---|---|---|---
Based on skin color | Color space | Fast processing; invariance to rotation, partial occlusion, and pose change | Susceptible to interference from skin-like areas
Based on skin color | Edge detection operator | Fast and accurate extraction of gesture edge information | Edge extraction results may be broken, overlapping, etc., and require subsequent processing
Based on contour information | Template matching | Adapts to different shapes and sizes in gesture segmentation; for relatively simple gestures, matching accuracy is high and segmentation results are more accurate | Many templates must be prepared in advance, which increases system complexity
Based on contour information | Active contour model | Adaptively adjusts the contour shape, suiting gestures of various shapes; performs better under noise, complex backgrounds, etc. | High algorithmic complexity and significant demand for computational resources
Based on a depth sensor | – | High identification efficiency | Relies on depth cameras; accuracy needs improvement
Based on deep learning | – | No manual analysis of gesture data is required for segmentation, which is more convenient and robust | Real-time detection performance needs improvement
Gesture tracking (dynamic gestures) | Frame differencing | Simple to implement with a low programming burden; lower sensitivity to scene changes such as lighting; adapts to various dynamic environments; relatively robust | Cannot extract the complete area of an object, leaving “holes” inside it; extracts only the boundary, with an outline that is coarse and often larger than the actual object
Gesture tracking (dynamic gestures) | Background subtraction | Extracts the complete area of an object; reduced sensitivity to scene changes such as light variations; adapts to various dynamic environments | Requires initial background modeling and is sensitive to light changes
Gesture tracking (dynamic gestures) | Optical flow | Extracts the complete area of an object; insensitive to light changes | Inadequate detection when movement is fast or the object’s surface texture is unclear
Gesture tracking (dynamic gestures) | MeanShift algorithm | Good real-time performance, fast calculations | Tracking fails when the shape and size of the target vary greatly
Gesture tracking (dynamic gestures) | CAMShift algorithm | Better tracking under large variations in target shape and size | Computationally heavy and slow
Gesture tracking (dynamic gestures) | Particle filtering | Fast calculations and low storage requirements | Prone to mismatching and losing detailed features when the target resembles the background
Skin color segmentation
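A common skin-color rule thresholds the chrominance channels of the YCrCb color space, which largely decouples skin tone from brightness. The sketch below is illustrative only; the Cr/Cb ranges are typical literature values, not parameters taken from a specific surveyed reference.

```python
import numpy as np

def rgb_to_ycrcb(rgb):
    """RGB -> YCrCb (ITU-R BT.601, offset form used in digital video)."""
    rgb = rgb.astype(float)
    y = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    cr = (rgb[..., 0] - y) * 0.713 + 128
    cb = (rgb[..., 2] - y) * 0.564 + 128
    return y, cr, cb

def skin_mask(rgb, cr_range=(133, 173), cb_range=(77, 127)):
    """Flag pixels whose chrominance falls inside the skin-tone box."""
    _, cr, cb = rgb_to_ycrcb(rgb)
    return ((cr >= cr_range[0]) & (cr <= cr_range[1]) &
            (cb >= cb_range[0]) & (cb <= cb_range[1]))

# A skin-toned patch against a blue background.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:2] = (224, 172, 138)   # light skin tone
img[2:] = (0, 0, 255)       # background
mask = skin_mask(img)
```

As the table above notes, fixed chrominance boxes remain vulnerable to skin-like background regions, which is why such rules are usually combined with contour or motion cues.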
Contour information segmentation
Other segmentation approaches
Tracking
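Of the tracking methods in the table above, frame differencing is the simplest to state precisely: pixels whose intensity changes beyond a threshold between consecutive frames are marked as moving foreground. A minimal sketch, with a synthetic moving block:

```python
import numpy as np

def frame_difference(prev, curr, thresh=25):
    """Mark pixels whose absolute intensity change exceeds the threshold."""
    diff = np.abs(curr.astype(int) - prev.astype(int))
    return diff > thresh

# A bright block shifts 4 pixels to the right between two frames.
prev = np.zeros((32, 32), dtype=np.uint8)
curr = np.zeros((32, 32), dtype=np.uint8)
prev[8:16, 8:16] = 255
curr[8:16, 12:20] = 255
motion = frame_difference(prev, curr)
```

Note that the region where the block overlaps its previous position produces no difference and is not flagged, which is exactly the “holes” disadvantage listed in the table.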
Feature Extraction
References | Features | Accuracy (%)
---|---|---
[92] | LBP, PCA | 99.97
[93] | Haar-like features | 95.37
[94] | SIFT | 99
[95] | SURF | 63
[96] | Fused features consisting of blended Hu moments, finger angle counts, skin tone angles, and nonskin tone angles | 90.0
[97] | SIFT, Hu moments, LBP | 87.3, 85.1
[98] | Distance and angle from the end point of the hand | 92.13
[99] | SURF | 84.6
[100] | SURF, longest common subsequence | 93.0
[101] | Skin detector | 97
[102] | Harris | 94.8
[103] | SIFT | 90
[104] | HOG, SIFT | 91
[105] | PCA | 91.5
- Variance filtering: eliminate features whose variance is below a certain threshold, because they have little impact on the classification or regression task.
- Correlation filtering: eliminate features that have a low correlation with the target variable.
- Regularization: eliminate features by driving the weights of some features toward zero through L1 or L2 regularization.
- Filtering: evaluate each feature according to its dispersion or relevance, set a threshold (or the number of features to select), and select features accordingly.
- Wrapper: select or exclude a number of features at each step according to an objective function until the best subset is found.
- Embedding: first train a machine learning model to obtain a weight coefficient for each feature, then select features in order of decreasing coefficient. This approach is similar to filtering, but training is used to determine the utility of each feature.
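The variance-filtering strategy above reduces to a one-line mask over the feature matrix. A small sketch with an arbitrary threshold of 0.01 (the threshold is illustrative, not from a surveyed method):

```python
import numpy as np

def variance_filter(X, threshold=0.01):
    """Keep only the feature columns whose variance exceeds the threshold."""
    keep = X.var(axis=0) > threshold
    return X[:, keep], keep

# Three features over three samples; the middle feature is constant
# and therefore carries no information for classification.
X = np.array([[1.0, 5.0, 0.2],
              [2.0, 5.0, 0.9],
              [3.0, 5.0, 0.4]])
X_sel, keep = variance_filter(X)
```

Wrapper and embedded selection follow the same interface but replace the variance score with an objective-function evaluation or learned weight coefficients, respectively.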
Gesture classification
Methods | Submethods | Advantages | Disadvantages
---|---|---|---
Template matching | – | High speed for small samples; good adaptability to light and background changes; wide range of applications | Low classification accuracy and a limited number of recognizable gesture types
Geometric information-based | Fingertip detection | Quickly detects the location and number of fingers | Recognition effectiveness may differ across hand types; with small finger spacing, false or missed detections may occur
Geometric information-based | Convex hull detection | Suitable for various hand types; good at detecting the position and number of fingers, even when finger spacing is small | Greater algorithmic complexity and implementation difficulty; recognition may suffer under hand occlusion or insufficient light
Dynamic time warping | – | Matches and recognizes gestures with different motion speeds well when the gesture template library is relatively small | Recognition speed and stability degrade greatly when the template library is large, especially for complex or combined two-handed gestures
Hidden Markov model | – | Captures dynamic features and important timing information in gestures | Needs a large amount of data; may over- or under-fit if the dataset is too small
Machine learning | – | Simple algorithms that are easy to implement and debug; traditional algorithms have relatively small data requirements | Sensitive to interference from lighting, angle, and background; low accuracy; gesture features must be extracted manually
Deep learning | – | Automatically extracts gesture features, eliminating manual extraction; more robust to interference such as lighting, angle, and background; high accuracy | Greater algorithmic complexity; needs large amounts of data and computational resources; longer training time and greater computational requirements
Template matching
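Template matching slides a stored gesture template over the input and reports the best-scoring position. A minimal sum-of-squared-differences (SSD) sketch is given below; it is an illustration of the principle, not a surveyed implementation (normalized cross-correlation is the more common matching score in practice).

```python
import numpy as np

def match_template(image, template):
    """Slide the template over the image; return the top-left position
    with the smallest sum of squared differences (SSD)."""
    ih, iw = image.shape
    th, tw = template.shape
    best_pos, best_ssd = (0, 0), float("inf")
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            patch = image[y:y + th, x:x + tw].astype(float)
            ssd = ((patch - template) ** 2).sum()
            if ssd < best_ssd:
                best_ssd, best_pos = ssd, (y, x)
    return best_pos

# Embed a small cross-shaped pattern at a known offset and recover it.
img = np.zeros((20, 20), dtype=np.uint8)
tpl = np.array([[0, 255, 0],
                [255, 255, 255],
                [0, 255, 0]], dtype=np.uint8)
img[7:10, 11:14] = tpl
pos = match_template(img, tpl.astype(float))  # (7, 11)
```

The exhaustive scan explains both columns of the table above: it is fast for small samples, but the raw pixel comparison breaks down once gestures deform beyond the stored templates.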
Methods based on geometric information
Dynamic time warping
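Dynamic time warping compares two gesture trajectories of different speeds by finding the cheapest monotonic alignment between them. A classic O(nm) dynamic-programming sketch, with a one-dimensional toy trajectory as the "gesture":

```python
def dtw_distance(a, b):
    """DTW with absolute-difference local cost and the standard
    (match / insert / delete) recurrence."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# The same trajectory performed at two speeds aligns at zero cost,
# while a different trajectory does not.
fast = [0, 1, 2, 1, 0]
slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
other = [2, 2, 0, 0, 2]
d_same = dtw_distance(fast, slow)    # 0.0: pure time warping, no shape cost
d_diff = dtw_distance(fast, other)
```

The quadratic table per template pair is also why the table above notes that recognition slows considerably as the gesture template library grows.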
Hidden Markov model
Machine learning
Deep learning
Experimental evaluation
Accuracy
Precision
Recall
F1 score
Intersection over union (IoU)
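The evaluation metrics above reduce to a few lines of arithmetic. A sketch with illustrative confusion counts and boxes (the numbers are examples, not results from the surveyed papers):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 score from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

p, r, f1 = prf1(tp=8, fp=2, fn=2)              # p = r = f1 = 0.8
overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))  # 25 / 175
```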
Hand gesture recognition based on RGB-D cameras
References | Experimental results |
---|---|
[172] | Average task completion time for target capture: “semiautomatic” (176.9 s) and “manual” (287.4 s)
[173] | On the self-acquisition dataset, the EPS solution reached an accuracy of over 96.5% with an average runtime of 30 ms. On the AIR Handwriting dataset, the recognition time per gesture was 24.3 ms with an average accuracy of 95.5% |
[59] | The percentages of false negatives and false positives were 2.00% and 4.38%, respectively. Training time was approximately 16 min. Real-world image classification time was approximately 2.1 s/frame
[174] | The accuracy rate of hand inspection procedures improved to 95.43%. The average accuracy of hand part classification improved to 74.65% |
[175] | The average recognition rate was above 80% |
[177] | Average recognition success rate of 84.5% |
[178] | Highest recognition accuracy of up to 99.66% |
[180] | The accuracy rate reached 89% |
[181] | The best accuracy for static gesture recognition was 95.6%. The best accuracy for dynamic gesture recognition was 97.2% |
[176] | On the NTU Hand Digit dataset, the best obtained performance was 98.7%. On the Kinect Leap dataset, an accuracy of 96.8% was reached. On the Senz3d dataset, an accuracy of 100% was attained. On the ASL-FS dataset, an accuracy of 87.1% was obtained. On the ChaLearn LAP IsoGD dataset, an accuracy of 60.12% was reached. The average runtime per query on an average PC (without a GPU) was only 6.3 ms
Hand gesture recognition applications
- Healthcare: Emergency rooms and operating rooms can be chaotic, with significant noise from individuals and equipment. In such an environment, voice commands are not as effective as hand gestures. Touchscreens are also not an option because of the strict boundaries between sterile and nonsterile domains. However, accessing information and images during surgery or other procedures is possible with gesture recognition technology, as demonstrated by Microsoft. GestSure, a gesture control technology that can be used to control medical devices, allows physicians to examine MRI, CT, and other images with simple gestures without having to re-scrub. This touch-free interaction reduces the number of times doctors and nurses touch patients, lowering the risk of cross-contamination.
- Safe driving: Advanced driver assistance systems that incorporate gesture recognition can somewhat increase driving safety. Through such a system, drivers can adjust many in-vehicle settings using gestures, allowing them to focus more on the road and potentially reducing traffic accidents. The BMW 7 Series has an integrated hand gesture recognition system that recognizes five gestures to control music, incoming calls, and other functions. Reducing interaction with the touchscreen makes the driving experience safer and more convenient.
- Sign language awareness: The primary means of communication for hearing-impaired individuals is sign language; however, understanding sign language is difficult for those who have not received formal instruction. Sign language recognition technology can substantially enhance communication between hearing-impaired individuals and others. The Italian startup Limix combines IoT and dynamic gesture recognition technology to record sign language, translate it into text, and then play it back on a smartphone via a voice synthesizer.
- Virtual reality: Gesture recognition allows users to interact with and control virtual reality scenes more naturally, enhancing users’ immersion and experience. In 2016, Leap Motion demonstrated updated gesture recognition software that allowed users to track gestures in virtual reality in addition to controlling computers. ManoMotion’s hand-tracking application recognizes 3D gestures through a smartphone camera (on Android and iOS) and can be applied to AR and VR environments. Use cases for this technology include gaming, IoT devices, consumer electronics, and robotics.
- Device control: Intelligent robots can also be controlled by gestures. With the advancement of artificial intelligence, home robots and smart home equipment will progressively appear in millions of households, and consumers will feel more at ease using gesture control than traditional button or touchscreen input. A company called uSens develops hardware and software that enable Smart TVs to recognize finger movements and gestures. Gestoo’s artificial intelligence platform uses gesture recognition technology to enable touchless control of lighting and audio systems. With Gestoo, gestures can be created and assigned from a smartphone or another device, and a single gesture can be used to activate multiple commands.
References | Cameras | Focus |
---|---|---|
[182] | Monocular | Built a completely automated hand gesture detection system and used it for a robot that assisted individuals in libraries |
[183] | Monocular | Used a hand gesture detection system to improve conventional teaching techniques by replacing them with a simple gesture-based control scheme |
[184] | Monocular | Suggested Eureka, a deep learning-based gesture recognition method that combined a feature extractor and a neural network |
[185] | Monocular | Fed many preprocessed sample images into a CNN, which subsequently performed feature extraction, to apply the YOLOv4 method to distinguish gesture features |
[186] | RGB-D | Applied Kalman filtering to the raw data to lessen the jitter or jumps produced during data capture by Leap Motion |
[187] | RGB-D | Built a multisensor data fusion model and proposed a multilayer RNN consisting of an LSTM module |
[188] | RGB-D | Enhanced communication between the Arduino Mega chip and Microsoft Kinect V2 to construct an industrial robot with 7 degrees of freedom operated by hand gestures |
[189] | RGB-D | Presented a deep learning-based hand gesture categorization network and hand detection model, and performed pixel-level fusion of RGB and depth images with hand information |
[190] | RGB-D | Used Leap Motion to collect hand data, and developed a neural network method to categorize nine hand movements, and a finite state machine to control the robot |
References | Cameras | Focus |
---|---|---|
[191] | Monocular | Proposed an HRI system based on the HMM that could recognize meaningful gestures composed of continuous hand movements in real time |
[192] | Monocular | Tracked a robot’s subsequent gestures and used them to transmit data for movement control |
[193] | Monocular | Proposed a template matching algorithm to recognize gestures and control the motion of mobile carts by combining invariant moment matching techniques |
[194] | Monocular | Integrated the improved YOLOv5 algorithm for hand pinpointing and the Resnet-152 method for hand classification |
[195] | Monocular | Proposed E-MobileNetv2, an enhanced lightweight CNN, for classification |
[196] | Monocular | Built a gesture change detection technique by using an upgraded residual neural network, as well as a hand segmentation algorithm enhanced by skin color detection and skeletal joint tracking |
[197] | Monocular | Estimated gesture motion from time-series data of hand coordinates by using a one-dimensional fast Fourier transform and estimated two-dimensional coordinates of hand areas in images from color information |
[198] | Monocular | Applied the 3DCNN architecture to gesture recognition and implemented a system for directing robots with video-recorded gestures in real-world scenarios |
[199] | RGB-D | Improved a robot’s capacity to recognize gestures by training on the data captured by a Leap Motion controller using classifiers (SVM, KNN, and HMM) |
[200] | RGB-D | Suggested a two-handed gesture recognition method based on depth cameras for real-time control of a mechanical wheeled mobile robot |
[58] | RGB-D | Described a dynamic gesture recognition system based on depth sensors for a continuous operation of material-handling robots |
[201] | RGB-D | Created a real-time skeleton-based five-gesture detection system using depth cameras and machine learning |
[202] | RGB-D | Proposed a dynamic gesture detection technique based on 3D hand posture estimation |
Problems, outlook, and conclusion
Problems
Data gathering
Training data environment
Identification speed
Segmentation in complex background
Distance and hand anatomy
Future outlook
- More intelligent: Gesture recognition will become more intelligent with the continued development of deep learning and artificial intelligence technology. Training a model will allow it to understand more complex gestures while reducing the demands on the user, making gesture recognition more natural and intelligent.
- More accurate: As computer vision and sensor technology continue to improve, gesture recognition will become more accurate. For example, higher-resolution cameras and more sensitive sensors can capture more subtle hand movements, improving the accuracy of gesture recognition.
- More capable of real-time performance: Future gesture recognition technology will operate closer to real time and be capable of processing large numbers of gestures and translating them into commands or actions. This will enable gesture recognition’s wider use in virtual reality, gaming, medical, and other fields.
- More reliable: As the applications of gesture recognition technology expand, its reliability becomes increasingly important. Future gesture recognition technologies will require more rigorous testing and validation to ensure their reliable operation in a variety of environments.
- More personalized: Future gesture recognition technologies will be more personalized and able to adapt to different users’ gesture habits and preferences. For example, users may be able to customize specific gestures to accomplish a particular operation or function.