Regression-based Active Appearance Model initialization for facial feature tracking with missing frames

doi:10.1016/j.patrec.2013.12.005

Pattern Recognition Letters

Volume 38, 1 March 2014, Pages 113-119

https://doi.org/10.1016/j.patrec.2013.12.005

Abstract

The Active Appearance Model (AAM) has received considerable attention in facial analysis as a powerful method for modeling and segmenting deformable visual objects. Several extensions and improvements to the original AAM have been proposed, but AAMs still depend on a good initialization of the model parameters to achieve accurate fitting results. In video tracking, AAMs are usually applied directly: each frame is searched using the fitting result of the previous frame as the initialization. However, this scheme can fail when large movements occur between two frames, as happens when frames are dropped from the video by a lossy multimedia network. This paper presents a regression-based approach for automatic AAM initialization. After scattered feature correspondences are established with a dual-threshold matching strategy, the AAM shape points are initialized through a spatial map between local feature correspondences and landmarks (local-landmark, L2L). The map is learned with Kernel Ridge Regression (KRR). By establishing the spatial relationship between local and landmark points, the proposed method can successfully track frames on which general AAM trackers fail. The initialization is robust to disturbances, which enables it to outperform key-feature-tracking and detection-based methods. We demonstrate the efficacy of the approach on two challenging facial videos with different training data and report a detailed quantitative evaluation of its performance.

Introduction

The Active Appearance Model (AAM) (Cootes et al., 2001) has received significant attention from the computer vision community for the registration of deformable visual objects. Its applications include dynamic head pose and gaze estimation for real-time user interfaces, expression recognition, and lip reading.

AAMs generally treat registration as an optimization problem solved by local minimization methods. However, the gradient-descent-based fitting scheme is inherently dependent on a good initialization. With poorly initialized model parameters, the optimization easily becomes trapped in a local minimum and diverges from the target. This problem is prominent when registering images with considerable shape variation between them, which often occurs in lossy multimedia mobile networks where some frames are unavailable because of unstable wireless connections and narrow bandwidth. Robust initialization for AAM tracking is therefore highly desirable for dealing with frame loss in such environments.

The simplest way to initialize the model is brute-force search, which iteratively tests every possible configuration. However, this method is extremely time-consuming because of the huge number of possible initializations. Stegmann (2000) suggested performing AAM searches in parallel from different initialization parameters, i.e., perturbed pose and model parameters. This method is less time-consuming than brute-force search, but it is still far from efficient, especially for real-time applications.
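The sensitivity to initialization, and the multi-start idea attributed to Stegmann (2000) above, can be illustrated on a toy problem. The following is a minimal sketch, not an AAM search: the one-dimensional `cost` function, the step size, and the perturbation offsets are all hypothetical choices made for illustration. Gradient descent is run from several perturbed starting points and the candidate with the lowest cost is kept.

```python
import numpy as np

def cost(p):
    # Hypothetical 1-D stand-in for an AAM fitting residual: a shallow
    # local minimum near p = -0.87 and a deeper minimum near p = 2.1.
    return 0.25 * (p + 1) ** 2 * (p - 2) ** 2 - 0.5 * p

def grad(p, eps=1e-6):
    # Central-difference approximation of the derivative of cost.
    return (cost(p + eps) - cost(p - eps)) / (2 * eps)

def descend(p0, step=0.05, iters=500):
    # Plain gradient descent from a single starting point.
    p = float(p0)
    for _ in range(iters):
        p -= step * grad(p)
    return p

def multi_start(p0, offsets):
    # Run one descent per perturbed start and keep the lowest-cost result.
    candidates = [descend(p0 + d) for d in offsets]
    return min(candidates, key=cost)

if __name__ == "__main__":
    single = descend(-0.5)
    best = multi_start(-0.5, np.array([-1.0, 0.0, 1.0, 2.0]))
    print("single start:", single, "cost:", cost(single))
    print("multi start: ", best, "cost:", cost(best))
```

In this toy setting, a single descent from `p0 = -0.5` settles in the shallow minimum, while the perturbed starts let at least one run reach the deeper one. The price is one full descent per perturbation, which is the efficiency concern raised in the text.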

In most AAM fitting cases where the targets are facial images, the parameters of the shape model are roughly estimated from the detected face and from facial features such as the eyes, mouth centers, and nose tip (Rara et al., 2009, Rabie et al., 2008, Wong and Chung, 2010). After these facial features are detected, the AAM base mesh is warped to them for AAM initialization. The initialization accuracy increases with more detailed facial features. However, the performance of these methods depends heavily on the accuracy of feature detection, which decreases in the presence of varying facial contexts and complex backgrounds. Wimmer (2008) used a learned Active Shape Model (ASM) fitting for AAM initialization because it gave stable results across the entire set of experiments, even when the initial parameter estimates supplied by a face detector were poor.
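The warping of a base mesh onto a few detected feature points can be sketched as follows. This is an illustrative assumption, not the procedure of the cited works: the sketch assumes the warp is a 2-D similarity transform (scale, rotation, translation) estimated by least squares from anchor points, and the function names `fit_similarity` and `apply_similarity` are hypothetical.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares 2-D similarity transform mapping src points onto dst.

    Model (an assumption of this sketch):
        x' = a*x - c*y + tx
        y' = c*x + a*y + ty
    Returns the 2x2 matrix [[a, -c], [c, a]] and the translation (tx, ty).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    A = np.zeros((2 * n, 4))
    # Rows for the x' equations: unknowns ordered as (a, c, tx, ty).
    A[0::2] = np.column_stack([src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)])
    # Rows for the y' equations.
    A[1::2] = np.column_stack([src[:, 1], src[:, 0], np.zeros(n), np.ones(n)])
    b = dst.reshape(-1)  # interleaved x0', y0', x1', y1', ...
    a, c, tx, ty = np.linalg.lstsq(A, b, rcond=None)[0]
    return np.array([[a, -c], [c, a]]), np.array([tx, ty])

def apply_similarity(M, t, pts):
    # Apply the transform to every mesh point (one point per row).
    return np.asarray(pts, dtype=float) @ M.T + t

if __name__ == "__main__":
    # Hypothetical 5-point base mesh; the first three points play the role
    # of detected features (e.g. two eyes and a mouth center).
    base = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0], [2.0, 1.0], [1.0, 4.0]])
    detected = np.array([[10.0, -5.0], [16.9, -1.0], [10.5, 2.2]])
    M, t = fit_similarity(base[:3], detected)
    print(apply_similarity(M, t, base))  # initial shape for all mesh points
```

A similarity transform has four degrees of freedom, so at least two point pairs are needed. With only a handful of anchors, any detection error propagates directly to every mesh point, which is the dependence on detection accuracy noted above.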

In fitting an AAM to video sequences, conventional methods fit the AAM to each frame directly, using the fitting results of the previous frame (i.e., its shape and appearance parameters) as the initialization for the current frame (Ionita et al., 2011, Liu, 2010, Cristinacce and Cootes, 2008, Saragih et al., 2011, Sung and Kim, 2009). However, this approach is suitable only for small movements between frames. Cui and Jin (2012) used Lucas–Kanade optical flow to track several salient feature points, which were then used to constrain the AAM shape initialization. Their method takes inter-frame correspondences into account, but its performance relies on the salient feature tracking results, and only the preservation of similarity between the two frames is considered. These choices limit the initialization performance, because internal and external changes may disable salient feature tracking and the transformation between two frames is usually more complicated.

One possible solution is to locate a sparse set of local points on each image and use them to conduct the initialization. In Feng et al. (2011), local feature matching between neighboring frames was adopted to predict the initial three-dimensional (3D) AAM parameters, with 3D pose estimation conducted under a 3D shape model constraint. However, this method is also time-consuming and is designed for 3D-based face tracking.

This paper proposes a regression-based approach for efficient two-dimensional AAM initialization. Instead of looking for salient features such as the eyes and mouth, we use sparse, scattered local feature correspondences to initialize the AAM. The relationship between the local features and the global shape is learned from the training data. Because it establishes an inter-frame relationship that combines local and global facial features, the approach is more robust than previous work to external changes in facial context (illumination, viewpoint, etc.) and to internal changes (expressions, wearing glasses, etc.). This makes the approach better suited to lossy networks, where the variation between neighboring frames may be large.

Fig. 1 shows a diagram of the proposed method. The landmarks in the first frame are manually annotated to initialize tracking. For the remaining frames, the AAM is initialized during tracking by a local-landmark (L2L) mapping based on Kernel Ridge Regression (KRR) (Hastie et al., 2009). This method exploits the spatial relationship between scattered local invariant features and structured facial annotation points. To improve initialization accuracy, this paper also presents an improved local feature correspondence strategy, called dual-threshold scale invariant feature transform (SIFT) matching, as one of the supporting strategies.
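The text here does not spell out which two thresholds the dual-threshold SIFT matching uses, so the following is only a sketch of one plausible reading. It assumes that a candidate match must pass both a nearest-to-second-nearest ratio test and an absolute descriptor-distance test. The function name `dual_threshold_match`, the parameters `ratio` and `max_dist`, and their default values are hypothetical. Descriptors are passed in as plain arrays, so no particular SIFT implementation is assumed.

```python
import numpy as np

def dual_threshold_match(desc_a, desc_b, ratio=0.8, max_dist=0.5):
    """Return (i, j) index pairs of descriptors that pass two tests.

    Assumed tests (not confirmed by the source):
      1. ratio test: best distance < ratio * second-best distance
      2. absolute test: best distance < max_dist
    desc_b must contain at least two descriptors.
    """
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(desc_a), len(desc_b)).
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(d):
        order = np.argsort(row)
        best, second = row[order[0]], row[order[1]]
        if best < ratio * second and best < max_dist:
            matches.append((i, int(order[0])))
    return matches

if __name__ == "__main__":
    # Toy 2-D "descriptors"; real SIFT descriptors are 128-dimensional.
    frame_prev = [[0.05, 0.0], [0.5, 0.5], [1.0, 0.05], [2.0, 0.0]]
    frame_curr = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(dual_threshold_match(frame_prev, frame_curr))
```

In the toy data, the second descriptor is ambiguous (equidistant from all candidates) and fails the ratio test, while the fourth is unambiguous but too far from its nearest neighbor and fails the absolute test. Only matches that are both distinctive and close survive.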

The proposed AAM initialization framework makes two main contributions: (1) a data-driven approach that identifies the shape correspondence between sequential face images from their scattered local feature matches; and (2) an accurate matching strategy for local feature correspondences between consecutive frames, which improves the accuracy of the tracking results. The proposed initialization method helps AAMs to converge and to localize the facial features accurately during tracking. It outperforms other AAM initializations in terms of convergence rate and tracking accuracy, especially in lossy networks where some frames are missing.

The remainder of this paper is organized as follows. Section 2 describes the proposed regression-based AAM initialization approach in detail. Section 3 introduces some supporting strategies for performance improvement. Sections 4 and 5 present the experimental results and the conclusions, respectively.

Section snippets

Regression based AAM initialization

In this section, we briefly introduce KRR and then investigate the initialization process, which uses a map from the space of scattered, sparse local feature correspondences to the structured landmark space.
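As a rough illustration of such a map, the sketch below implements standard kernel ridge regression in its closed form: the dual coefficients are alpha = (K + lambda*I)^(-1) Y, and a prediction is k(x)^T alpha. Several things are assumptions of this sketch rather than details taken from the paper: the Gaussian (RBF) kernel, the values of `gamma` and `lam`, the class name `L2LRegressor`, and the encoding of each sample as a flattened vector of local feature coordinates (input) and a flattened vector of landmark coordinates (target).

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Gaussian kernel matrix between the rows of A and the rows of B.
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

class L2LRegressor:
    """Kernel ridge regression from local-feature vectors to landmark vectors."""

    def __init__(self, gamma=0.5, lam=1e-3):
        self.gamma = gamma  # RBF kernel width (assumed value)
        self.lam = lam      # ridge regularization strength (assumed value)

    def fit(self, X, Y):
        # X: (n_samples, d_local), Y: (n_samples, d_landmarks)
        self.X_train = np.asarray(X, dtype=float)
        K = rbf_kernel(self.X_train, self.X_train, self.gamma)
        # Closed-form dual solution: alpha = (K + lam * I)^-1 Y
        self.alpha = np.linalg.solve(K + self.lam * np.eye(len(K)),
                                     np.asarray(Y, dtype=float))
        return self

    def predict(self, X):
        # f(x) = k(x)^T alpha for each row x of X
        K_new = rbf_kernel(np.asarray(X, dtype=float), self.X_train, self.gamma)
        return K_new @ self.alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 6))           # synthetic local-feature vectors
    Y = np.tanh(X @ rng.normal(size=(6, 4)))  # synthetic landmark vectors
    model = L2LRegressor().fit(X, Y)
    print(model.predict(X[:2]).shape)
```

Because the solution is a single linear solve over the training set, prediction for a new frame costs one kernel evaluation per training sample. The regularization term `lam` trades fidelity on the training pairs for smoothness of the learned map.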

Assistant strategies

Some strategies for performance improvement are introduced in this section.

Experiments and discussion

This section demonstrates the effectiveness of the proposed approach in fitting facial video sequences. The proposed initialization method is compared with the following approaches: (1) The general AAM initialization which takes the previous frame as the initialization of the current frame (Ionita et al., 2011, Liu, 2010, Cristinacce and Cootes, 2008, Saragih et al., 2011, Sung and Kim, 2009); (2) The recently proposed initialization method which used the Lucas-Kanade optical flow to track some

Conclusion

An approach for automatic AAM initialization during facial feature tracking is presented. By establishing a spatial relationship between local and landmark points, the approach helps improve the accuracy and efficiency of AAM trackers, especially in lossy networks where some frames may be unavailable and the variation between consecutive frames is unstable. The proposed framework is validated by tracking facial features in image sequences with different training data.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (61104213), Natural Science Foundation of Jiangsu Province (BK2011146), and Opening Fund of Key Laboratory of System Control and Information Processing (Ministry of Education) at Shanghai Jiaotong University (SCIP2011008).

References (19)

D. Cristinacce et al., 2008. Automatic feature localisation with constrained local models. Pattern Recogn.
X. Liu, 2010. Video-based face model fitting using adaptive active appearance model. Image Vision Comput.
J. Sung et al., 2009. Adaptive active appearance model with incremental learning. Pattern Recogn. Lett.
Aran, O., Ari, I., Guvensan, A., Haberdar, H., Kurr, Z., Turkmen, I., Uyar, A., Akarun, L., 2007. A database of...
T. Cootes et al., 2001. Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell.
Y. Cui et al., 2012. Facial feature points tracking based on AAM with optical flow constrained initialization. J. Pattern Recogn. Res.
Feng, X., Shen, X., Zhou, M., Zhang, H., Kim, J., 2011. Robust facial expression tracking based on composite...
FGNet, 2004. Fgnet talking face video....
Gross, R., Matthews, I., Cohn, J., Kanade, T., Baker, S. 2007. Guide to the cmu multi-pie database, Technical report,...

There are more references available in the full text version of this article.

☆ This paper has been recommended for acceptance by C. Luengo.
