In this paper, we focus on the problem of performing deformable face tracking across long-term sequences within unconstrained videos. Tracking across long-term sequences is particularly challenging as the appearance of the face may change significantly during the sequence due to occlusions, illumination variation, motion artifacts and head pose. Deformable tracking is further complicated by the expectation of recovering a set of accurate fiducial points in addition to successfully tracking the object. As described in Sect. 2, current deformable facial tracking methods mainly concentrate on performing face detection per frame, followed by facial landmark localisation. We consider facial landmark localisation accuracy to be the most important metric for measuring the success of deformable face tracking. Given this, a number of strategies could feasibly be employed to minimise the total facial landmark localisation error across the entire sequence. We therefore take advantage of recent advances in face detection, model free tracking and facial landmark localisation to perform deformable face tracking. Specifically, we investigate three strategies for deformable tracking:
1. Detection + landmark localisation: face detection per frame, followed by facial landmark localisation initialised within the facial bounding boxes. This scenario is visualised in Fig. 1 (top).
2. Model free tracking + landmark localisation: model free tracking, initialised around the interior of the face within the first frame, followed by facial landmark localisation within the tracked box. This scenario is visualised in Fig. 1 (bottom).
3. Hybrid systems: hybrid methods that attempt to improve the robustness of the placement of the bounding box for landmark localisation. Namely, we investigate methods for failure detection, trajectory smoothness and reinitialisation. Examples of such methods are pictorially demonstrated in Figs. 4 and 8.
Note that we focus on combinations of methods that provide bounding boxes of the facial region followed by landmark localisation. This is because the current state-of-the-art landmark localisation methods are all local methods and require initialisation within the facial region. Although joint face detection and landmark localisation methods have been proposed (Zhu and Ramanan 2012; Chen et al. 2014), they are not competitive with the most recent landmark localisation methods. For this reason, in this paper we focus on the combination of bounding box estimators with state-of-the-art local landmark localisation techniques.
The remainder of this section gives a brief overview of the literature concerning face detection, model free tracking and facial landmark localisation.
3.1 Face Detection
Face detection is among the most important and popular tasks in Computer Vision and an essential step for applications such as face recognition and face analysis. Although it is one of the oldest tasks undertaken by researchers (the earliest works appeared about 45 years ago (Sakai et al. 1972; Fischler and Elschlager 1973)), it is still an open and challenging problem. Recent advances achieve reliable performance under moderate illumination and pose conditions, which has led to the installation of simple face detection technologies in everyday devices such as digital cameras and mobile phones. However, recent benchmarks (Jain and Learned-Miller 2010) show that the detection of faces in arbitrary images is still a very challenging problem.
Since face detection has been a research topic for so many decades, the existing literature is, naturally, extremely extensive. The fact that all recent face detection surveys (Hjelmås and Low 2001; Yang et al. 2002; Zhang and Zhang 2010; Zafeiriou et al. 2015) provide different categorisations of the relevant literature is indicative of the huge range of existing techniques. Consequently, herein we only present a basic outline of the face detection literature. For an extended review, the interested reader may refer to the most recent face detection survey of Zafeiriou et al. (2015).
According to the most recent literature review (Zafeiriou et al. 2015), existing methods can be separated into two major categories. The first includes methodologies that learn a set of rigid templates, which can be further split into the following groups: (i) boosting-based methods, (ii) approaches that utilise SVM classifiers, (iii) exemplar-based techniques, and (iv) frameworks based on Neural Networks. The second major category includes deformable part models, i.e. methodologies that learn a set of templates per part as well as the deformations between them.
Boosting Methods Boosting combines multiple “weak” hypotheses of moderate accuracy in order to determine a highly accurate hypothesis. The most characteristic example is Adaptive Boosting (AdaBoost), which is utilised by the most popular face detection methodology, i.e. the Viola–Jones (VJ) detector of Viola and Jones (2001, 2004). Characteristic examples of other methods that employ variations of AdaBoost include Li et al. (2002), Wu et al. (2004) and Mita et al. (2005). The original VJ algorithm used Haar features; however, boosting (or cascade-of-classifiers methodologies in general) has been shown to greatly benefit from robust features (Köstinger et al. 2012; Jun et al. 2013; Li et al. 2011; Li and Zhang 2013; Mathias et al. 2014; Yang et al. 2014a), such as HOG (Dalal and Triggs 2005), SIFT (Lowe 1999), SURF (Bay et al. 2008) and LBP (Ojala et al. 2002). For example, SURF features have been successfully combined with a cascade of weak classifiers in Li et al. (2011) and Li and Zhang (2013), achieving faster convergence. Additionally, Jun et al. (2013) propose robust face-specific features that combine both LBP and HOG. Mathias et al. (2014) recently proposed an approach (the so-called HeadHunter) with state-of-the-art performance that employs various robust features with boosting. Specifically, they propose the adaptation of Integral Channel Features (ICF) (Dollár et al. 2009) with HOG and LUV colour channels, combined with global feature normalisation. A similar approach is followed by Yang et al. (2014a), in which they combine gray-scale, RGB, HSV, LUV, gradient magnitude and histograms within a cascade of weak classifiers.
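To make the boosting principle concrete, the following is a minimal sketch of discrete AdaBoost with decision stumps over precomputed feature responses (e.g. Haar-like values). It is an illustrative toy rather than the VJ detector itself: all function names are ours, and the attentional cascade, integral images and large feature pool of the actual detector are omitted.

```python
import numpy as np

def train_adaboost(X, y, n_rounds=10):
    """Discrete AdaBoost with decision stumps (illustrative sketch).

    X: (n_samples, n_features) feature responses (e.g. Haar-like values).
    y: labels in {-1, +1}. Returns a list of (feature, threshold, polarity, alpha)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)            # sample weights, uniform at the start
    ensemble = []
    for _ in range(n_rounds):
        best = None
        # exhaustively pick the stump with the lowest weighted error
        for f in range(d):
            for thr in np.unique(X[:, f]):
                for pol in (1, -1):
                    pred = pol * np.sign(X[:, f] - thr + 1e-12)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol)
        err, f, thr, pol = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak hypothesis
        pred = pol * np.sign(X[:, f] - thr + 1e-12)
        w *= np.exp(-alpha * y * pred)          # re-weight: focus on mistakes
        w /= w.sum()
        ensemble.append((f, thr, pol, alpha))
    return ensemble

def predict(ensemble, X):
    """Strong classifier: sign of the alpha-weighted vote of the stumps."""
    score = np.zeros(len(X))
    for f, thr, pol, alpha in ensemble:
        score += alpha * pol * np.sign(X[:, f] - thr + 1e-12)
    return np.sign(score)
```

No single stump can separate an interval such as \(0.3 < x < 0.7\) from its complement, but a short boosted combination of stumps can, which is the essence of combining weak hypotheses into a strong one.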
SVM Classifiers Maximum margin classifiers, such as Support Vector Machines (SVMs), have become popular for face detection (Romdhani et al. 2001; Heisele et al. 2003; Rätsch et al. 2004; King 2015). Even though their detection speed was initially slow, various schemes have been proposed to speed up the process. Romdhani et al. (2001) propose a method that computes a reduced set of vectors from the original support vectors, which are used sequentially in order to make early rejections. A similar approach is adopted by Rätsch et al. (2004). A hierarchy of SVM classifiers trained on different resolutions is applied in Heisele et al. (2003). King (2015) proposes an algorithm for efficient learning of a max-margin classifier using all the sub-windows of the training images, without applying any sub-sampling, and formulates a convex optimisation that finds the global optimum. Moreover, SVM classifiers have also been used for multi-view face detection (Li et al. 2000; Wang and Ji 2004). For example, Li et al. (2000) first apply a face pose estimator based on support vector regression (SVR), followed by an SVM face detector for each pose.
Exemplar-Based Techniques These methods aim to match a test image against a large set of facial images. This approach is inspired by principles used in image retrieval and requires that the exemplar set covers the large appearance variation of the human face. Shen et al. (2013) employ bag-of-words image retrieval methods to extract features from each exemplar, creating a voting map per exemplar that functions as a weak classifier. The final detection is performed by combining the voting maps. A similar methodology is applied in Li et al. (2014), with the difference that specific exemplars are used as weak classifiers based on a boosting strategy. Recently, Kumar et al. (2015) proposed an approach that enhances the voting procedure by using semantically related visual words as well as weighted occurrence of visual words based on their spatial distributions.
Table 1 The set of face detectors used in this paper: DPM (Felzenszwalb et al. 2010; Alabort-i-Medina et al. 2014), HR-TF, MTCNN, NPD, SS-DPM, SVM+HOG, VJ and VPHR.
Convolutional Neural Networks Another category, similar to the previous rigid template-based ones, includes the employment of Convolutional Neural Networks (CNNs) and Deep CNNs (DCNNs) (Osadchy et al. 2007; Zhang and Zhang 2014; Ranjan et al. 2015; Li et al. 2015a; Yang et al. 2015b). Osadchy et al. (2007) use a network with four convolution layers and one fully connected layer that rejects the non-face hypotheses and estimates the pose of the correct face hypothesis. Zhang and Zhang (2014) propose a multi-view face detection framework, employing a multi-task DCNN for face pose estimation and landmark localisation in order to obtain better features for face detection. Ranjan et al. (2015) combine deep pyramidal features with Deformable Part Models. Recently, Yang et al. (2015b) proposed a DCNN architecture that is able to discover facial part responses from arbitrary uncropped facial images without any part supervision, and report state-of-the-art performance on current face detection benchmarks.
Deformable Part Models DPMs (Schneiderman and Kanade 2004; Felzenszwalb and Huttenlocher 2005; Felzenszwalb et al. 2010; Zhu and Ramanan 2012; Yan et al. 2013; Li et al. 2013a; Yan et al. 2014; Mathias et al. 2014; Ghiasi and Fowlkes 2014; Barbu et al. 2014) learn a patch expert for each part of an object and model the deformations between parts using spring-like connections based on a tree structure. Consequently, they perform joint facial landmark localisation and face detection. Even though they are not the best performing methods for landmark localisation, they are highly accurate for face detection in-the-wild. However, their main disadvantage is their high computational cost. Pictorial Structures (PS) (Fischler and Elschlager 1973; Felzenszwalb and Huttenlocher 2005) were the first family of DPMs to appear. They are generative DPMs that assume Gaussian distributions to model the appearance of each part, as well as the deformations. They became a very popular line of research after the influential work of Felzenszwalb and Huttenlocher (2005), which proposed a very efficient dynamic programming algorithm for finding the global optimum based on the Generalized Distance Transform. Many discriminatively trained DPMs (Felzenszwalb et al. 2010; Zhu and Ramanan 2012; Yan et al. 2013, 2014) appeared afterwards, which learn the patch experts and deformation parameters using discriminative classifiers, such as the latent SVM.
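The efficiency of this dynamic programming inference rests on the generalized distance transform, which evaluates \(D(p) = \min_q (f(q) + (p-q)^2)\) for all positions \(p\) in linear time via the lower envelope of parabolas. A minimal 1-D sketch is given below (our own illustrative implementation; the 2-D case used in DPMs applies it along rows and then columns):

```python
import numpy as np

def distance_transform_1d(f):
    """1-D generalized distance transform (Felzenszwalb & Huttenlocher):
    D(p) = min_q f(q) + (p - q)^2, computed in O(n) by maintaining the
    lower envelope of the parabolas rooted at each (q, f(q))."""
    n = len(f)
    d = np.empty(n)
    v = np.zeros(n, dtype=int)   # locations of parabolas in the envelope
    z = np.empty(n + 1)          # boundaries between adjacent parabolas
    k = 0
    z[0], z[1] = -np.inf, np.inf
    for q in range(1, n):
        while True:
            # horizontal intersection of parabola q with the rightmost one
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
            if s <= z[k]:
                k -= 1           # parabola q hides the previous one entirely
            else:
                break
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = np.inf
    k = 0
    for p in range(n):           # read the minimum off the envelope
        while z[k + 1] < p:
            k += 1
        d[p] = (p - v[k]) ** 2 + f[v[k]]
    return d
```

Here \(f\) plays the role of a (negated) part filter response, and the transform spreads each response according to the quadratic deformation cost.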
DPMs can be further separated with respect to their training scenario into: (i) weakly supervised and (ii) strongly supervised. Weakly supervised DPMs (Felzenszwalb et al. 2010; Yan et al. 2014) are trained using only the bounding boxes of the positive examples and a set of negative examples. The most representative example is the work of Felzenszwalb et al. (2010), which has proved to be very efficient for generic object detection. Under a strongly supervised scenario, it is assumed that a training database with images annotated with fiducial landmarks is available. Several strongly supervised methods exist in the literature (Felzenszwalb and Huttenlocher 2005; Zhu and Ramanan 2012; Yan et al. 2013; Ghiasi and Fowlkes 2014). Ghiasi and Fowlkes (2014) propose a hierarchical DPM that explicitly models part occlusions. In Zhu and Ramanan (2012) it is shown that a strongly supervised DPM outperforms, by a large margin, a weakly supervised one. In contrast, HeadHunter by Mathias et al. (2014) shows that a weakly supervised DPM can outperform all current state-of-the-art face detection methodologies, including the strongly supervised DPM of Zhu and Ramanan (2012).
According to FDDB (Jain and Learned-Miller 2010), which is the most well-established face detection benchmark, the currently top-performing methodology is that of Ranjan et al. (2015), which combines DCNNs with a DPM. Since some of the top-performing systems consist of commercial software, we instead used the deep methods of Hu and Ramanan (2016) and Zhang et al. (2016), which are available as open source, with the method of Hu and Ramanan (2016) reporting the latest best performance on FDDB. Additionally, we employ the top-performing SVM-based method for learning rigid templates (King 2015), the best weakly and strongly supervised DPM implementations of Mathias et al. (2014) and Zhu and Ramanan (2012) respectively, along with the best performing exemplar-based technique of Kumar et al. (2015). Finally, we also use the popular VJ algorithm (Viola and Jones 2001, 2004) as a baseline face detection method. The employed face detection implementations are summarised in Table 1.
3.2 Model Free Tracking
Model free tracking is an extremely active area of research. Given the initial state (e.g. position and size of the bounding box) of a target object in the first frame, model free tracking attempts to estimate the state of the target in subsequent frames. Therefore, model free tracking provides an excellent method of initialising landmark localisation methods.
The literature on model free tracking is vast. For the rest of this section, we provide an extremely brief overview that focuses primarily on areas relevant to the tracking methods investigated in this paper. We refer the interested reader to the wealth of tracking surveys (Li et al. 2013b; Smeulders et al. 2014; Salti et al. 2012; Yang et al. 2011) and benchmarks (Wu et al. 2013, 2015; Kristan et al. 2013, 2014, 2015, 2016; Smeulders et al. 2014) for more information on model free tracking methods.
Generative Trackers These trackers attempt to model the object's appearance directly. This includes template-based methods, such as those of Matthews et al. (2004), Baker and Matthews (2004) and Sevilla-Lara and Learned-Miller (2012), as well as parametric generative models such as Balan and Black (2006), Ross et al. (2008), Black and Jepson (1998) and Xiao et al. (2014). The work of Ross et al. (2008) introduces online subspace learning for tracking with a sample mean update, which allows the tracker to account for changes in illumination, viewing angle and pose of the object. The idea is to incrementally learn a low-dimensional subspace and adapt the appearance model to object changes. The update is based on an incremental principal component analysis (PCA) algorithm; however, it appears to be ineffective at handling large occlusions or non-rigid movements due to its holistic model. To alleviate the effect of partial occlusions, Xiao et al. (2014) suggest the use of square templates along with PCA. Another popular area of generative tracking is the use of sparse representations of appearance. In Mei and Ling (2011), a target candidate is represented by a sparse linear combination of target and trivial templates. The coefficients are extracted by solving an \(\ell_1\) minimisation problem with non-negativity constraints, while the target templates are updated online. However, solving the \(\ell_1\) minimisation for each particle is computationally expensive. A generalisation of this tracker is the work of Zhang et al. (2012), which learns the representation for all particles jointly and additionally improves robustness by exploiting the correlation among particles. An even further abstraction is achieved in Zhang et al. (2014d), where a low-rank sparse representation of the particles is encouraged. In Zhang et al. (2014c), the authors generalise the low-rank constraint of Zhang et al. (2014d) and add a sparse error term in order to handle outliers. Another low-rank formulation was used by Wu et al. (2012), which is an online version of the RASL (Peng et al. 2012) algorithm and attempts to jointly align the input sequence using convex optimisation.
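In one common form of this sparse representation (our notation; details vary across the papers above), a candidate observation \(\mathbf{y}\) is coded over the target templates \(\mathbf{T}\) together with trivial (identity) templates that absorb occlusions and corruptions:

```latex
\min_{\mathbf{c} \,\succeq\, 0} \; \left\| \mathbf{y} - [\mathbf{T}, \mathbf{I}, -\mathbf{I}]\, \mathbf{c} \right\|_2^2 + \lambda \left\| \mathbf{c} \right\|_1
```

The candidate whose reconstruction relies mainly on the target templates (i.e. has small trivial coefficients) is selected as the tracking result.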
Keypoint Trackers These trackers (Pernici and Del Bimbo 2014; Poling et al. 2014; Hare et al. 2012; Nebehay and Pflugfelder 2015) attempt to exploit the robustness of keypoint detection methodologies such as SIFT (Lowe 1999) and SURF (Bay et al. 2008) in order to perform tracking. Pernici and Del Bimbo (2014) collect multiple descriptors of weakly aligned keypoints over time and combine the matched keypoints in a RANSAC voting scheme. Nebehay and Pflugfelder (2015) utilise keypoints to vote for the object centre in each frame. A consensus-based scheme is applied for outlier detection, and the votes are transformed based on the current keypoint arrangement to account for scale and rotation. However, keypoint methods may struggle to capture the global information of the tracked target, since they only consider local points.
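The voting step can be sketched in a few lines: each keypoint remembers its offset to the object centre at initialisation and re-casts that offset as a vote in the current frame, with a robust statistic suppressing mismatched keypoints. This is a simplified illustration of the idea only; the actual CMT tracker additionally compensates each vote for scale and rotation and uses a clustering-based consensus.

```python
import numpy as np

def vote_for_center(init_pts, init_center, cur_pts):
    """Each matched keypoint votes for the object centre using the offset
    it had at initialisation; the coordinate-wise median of the votes gives
    an outlier-robust centre estimate. init_pts/cur_pts: (n, 2) arrays of
    corresponding keypoint positions, init_center: (2,) array."""
    offsets = init_center - init_pts      # per-keypoint centre offsets
    votes = cur_pts + offsets             # centre votes in the current frame
    return np.median(votes, axis=0)       # robust consensus of the votes
```

With a pure translation of the object, every correctly matched keypoint votes for exactly the same point, so even a gross mismatch is voted down by the median.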
Discriminative Trackers These trackers attempt to explicitly model the difference between the object appearance and the background. Most commonly, these methods are termed “tracking-by-detection” techniques, as they involve classifying image regions as either part of the object or the background. Grabner et al. (2006) propose an online boosting method to select and update discriminative features, which allows the system to account for minor changes in the object's appearance; however, the tracker fails to model severe appearance changes. Babenko et al. (2011) advocate the use of a multiple instance learning boosting algorithm to mitigate the drifting problem. More recently, discriminative correlation filters (DCFs) have become highly successful at tracking. A DCF is trained by performing a circular sliding window operation on the training samples. This periodic assumption enables efficient training and detection by utilising the Fast Fourier Transform (FFT). Danelljan et al. (2014) learn separate correlation filters for translation and scale estimation. In Danelljan et al. (2015), the authors introduce a sparse spatial regularisation term to mitigate the artifacts at the boundaries caused by the circular correlation. In contrast to the linear regression commonly used to learn DCFs, Henriques et al. (2015) apply kernel regression and propose its multi-channel extension, enabling the use of features such as HOG (Dalal and Triggs 2005). Li et al. (2015b) propose a new use of particle filters in order to choose reliable patches to consider part of the object; these patches are modelled using a variant of the method proposed by Henriques et al. (2015). Hare et al. (2011) propose the use of structured output prediction; by explicitly allowing the outputs to parametrise the needs of the tracker, an intermediate classification step is avoided.
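As an illustration of the closed-form training that the periodic assumption buys, the following is a minimal single-channel correlation filter in the spirit of MOSSE (our simplified sketch; modern DCF trackers add multi-channel features, kernels, scale estimation, windowing and online updates):

```python
import numpy as np

def train_dcf(patches, target, lam=1e-2):
    """Train a single-channel discriminative correlation filter.

    patches: list of 2-D training patches; target: desired correlation
    output, e.g. a Gaussian peaked on the object centre. Returns the
    (conjugate) filter in the Fourier domain, obtained in closed form."""
    G = np.fft.fft2(target)
    num = np.zeros_like(G)
    den = np.zeros_like(G)
    for p in patches:
        F = np.fft.fft2(p)
        num += G * np.conj(F)        # conjugate because we correlate, not convolve
        den += F * np.conj(F)
    return num / (den + lam)         # lam regularises against division by ~0

def respond(H, patch):
    """Correlation response of filter H on a new patch; the peak of the
    response map localises the target (circularly)."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(patch)))
```

Because the formulation is circular, a (circular) shift of the input patch shifts the response peak by the same amount, which is exactly how the tracker localises the target in the next frame.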
Table 2 The set of trackers used in this paper: CAMSHIFT, CCOT, CMT (Nebehay and Pflugfelder 2015), DF (Sevilla-Lara and Learned-Miller 2012), DLSSVM, DSST, FCT, HDT, IVT, KCF, LCT, LRST, MDNET, MEEM, MIL, ORIA, PF, RPT, SIAM-OXF (Bertinetto et al. 2016b), SPOT (Zhang and van der Maaten 2014), SPT, SRDCF, STAPLE (Bertinetto et al. 2016a), STCL, STRUCK, TGPR and TLD.
Part-based Trackers These trackers attempt to implicitly model the parts of an object in order to improve tracking performance. Adam et al. (2006) represent the object with multiple arbitrary patches; each patch votes on potential positions and scales of the object, and a robust statistic is employed to minimise the voting error. Kalal et al. (2010b) sample points on the object and track them independently in each frame by estimating optical flow. Using a forward-backward measure, the erroneous points are identified and the remaining reliable points are utilised to compute the optimal object trajectory. Yao et al. (2013) adapt the latent SVM of Felzenszwalb et al. (2010) for online tracking by restricting the search to the vicinity of the target object's location in the previous frame. In comparison to the weakly supervised part-based model of Yao et al. (2013), Zhang and van der Maaten (2013) recommend an online, strongly supervised part-based deformable model that learns representations of both the object and the background by training a classifier. Wang et al. (2015) employ a part-based tracker that makes a direct displacement prediction of the object. A cascade of regressors is utilised to localise the parts, while the model is updated online and the regressors are initialised by multiple motion models at each frame.
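The forward-backward measure of Kalal et al. can be summarised in a few lines: each point is tracked from frame t to t+1 and then back again, and the distance between its start and end positions flags unreliable tracks. The sketch below abstracts the optical-flow step behind a `track` callable (a hypothetical stand-in for e.g. pyramidal Lucas-Kanade):

```python
import numpy as np

def forward_backward_error(points, track):
    """Forward-backward consistency check: track each point forward one
    frame, track the result backward, and measure the distance to where
    it started. points: (n, 2); track(points, reverse) -> (n, 2)."""
    fwd = track(points, reverse=False)    # frame t -> t+1
    back = track(fwd, reverse=True)       # frame t+1 -> t
    return np.linalg.norm(points - back, axis=1)
```

Points whose error exceeds, e.g., the median error are then discarded before estimating the object trajectory from the survivors.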
Given the wealth of available trackers, selecting appropriate trackers for deformable tracking poses a difficult proposition. In order to give as broad an overview as possible, we selected trackers from each of the aforementioned categories. Therefore, in this paper we compare against 27 trackers, which are outlined in Table 2. SRDCF (Danelljan et al. 2015), KCF (Henriques et al. 2015), LCT (Ma et al. 2015), STAPLE (Bertinetto et al. 2016a) and DSST (Danelljan et al. 2014) are all discriminative trackers based on DCFs. They all performed well in the VOT 2015 challenge (Kristan et al. 2015) and DSST was the winner of VOT 2014 (Kristan et al. 2014). The trackers of Danelljan et al. (2016), Qi et al. (2016), Nam and Han (2016) and Bertinetto et al. (2016b) are indicative trackers that employ neural networks and achieve top results. STRUCK (Hare et al. 2011) is a discriminative tracker that performed very well in the Online Object Tracking benchmark (Wu et al. 2013), while the more recent method of Ning et al. (2016) improves the computational burden of the structural SVM of STRUCK and reports superior results. SPOT (Zhang and van der Maaten 2014) is a strong performing part-based tracker, CMT (Nebehay and Pflugfelder 2015) is a strong performing keypoint-based tracker, and LRST (Zhang et al. 2014d) and ORIA (Wu et al. 2012) are recent generative trackers. RPT (Li et al. 2015b) is a recently proposed technique that reported state-of-the-art results on the Online Object Tracking benchmark (Wu et al. 2013). TLD (Kalal et al. 2012), MIL (Babenko et al. 2011), FCT (Zhang et al. 2014c), DF (Sevilla-Lara and Learned-Miller 2012) and IVT (Ross et al. 2008) were included as baseline tracking methods with publicly available implementations. Finally, the CAMSHIFT (Bradski 1998a) and PF (Isard and Blake 1996) methods are included as very influential trackers from previous decades.
3.3 Facial Landmark Localisation
Statistical deformable models have emerged as an important research field over the last few decades, existing at the intersection of computer vision, statistical pattern recognition and machine learning. Statistical deformable models aim to solve generic object alignment in terms of localisation of fiducial points. Although deformable models can be built for a variety of object classes, the majority of ongoing research has focused on the task of facial alignment. Recent large-scale challenges on facial alignment (Sagonas et al. 2013b, a, 2015) are characteristic examples of the rapid progress being made in the field.
Table 3 The landmark localisation methods employed in this paper: AAM (Alabort-i-Medina et al. 2014), ERT (Kazemi and Sullivan 2014), CFSS and SDM (Xiong and De la Torre 2013; Alabort-i-Medina et al. 2014).
Currently, the most commonly-used and well-studied face alignment methods can be separated into two major families: (i) discriminative models that employ regression in a cascaded manner, and (ii) generative models that are iteratively optimised.
Regression-Based Models The methodologies of this category aim to learn a regression function that regresses from the object's appearance (e.g. commonly handcrafted features) to the target output variables (either the landmark coordinates or the parameters of a statistical shape model). Although the history of using linear regression to tackle the problem of face alignment spans back many years (Cootes et al. 2001), the research community turned towards alternative approaches due to the lack of sufficient data for training accurate regression functions. Nevertheless, regression-based techniques have recently prevailed in the field thanks to the wealth of annotated data and effective handcrafted features (Lowe 1999; Dalal and Triggs 2005). Recent works have shown that excellent performance can be achieved by employing a cascade of regression functions (Burgos-Artizzu et al. 2013; Xiong and De la Torre 2013, 2015; Dollár et al. 2010; Cao et al. 2014; Kazemi and Sullivan 2014; Ren et al. 2014; Asthana et al. 2014; Tzimiropoulos 2015; Zhu et al. 2015). Regression-based methods can be approximately separated into two categories depending on the nature of the regression function employed. Methods that employ linear regression, such as the supervised descent method (SDM) of Xiong and De la Torre (2013), tend to employ robust hand-crafted features (Xiong and De la Torre 2013; Asthana et al. 2014; Xiong and De la Torre 2015; Tzimiropoulos 2015; Zhu et al. 2015). On the other hand, methods that employ tree-based regressors, such as the explicit shape regression (ESR) method of Cao et al. (2014), tend to rely on data-driven features that are optimised directly by the regressor (Burgos-Artizzu et al. 2013; Cao et al. 2014; Dollár et al. 2010; Kazemi and Sullivan 2014).
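The cascade idea common to these methods can be sketched in a few lines: starting from an initial shape estimate, each stage extracts features at the current estimate and applies a learned linear regressor to update it, as in the SDM. The toy below trains on synthetic data; the `features` callable and all names are our illustrative stand-ins for, e.g., SIFT descriptors extracted around the current landmark estimates.

```python
import numpy as np

def train_cascade(x0, targets, features, n_stages=5):
    """Train a cascade of linear regressors (SDM-style sketch).

    x0: (n, d) initial shape estimates; targets: (n, d) ground-truth
    shapes; features(x) -> (n, k) feature extraction at the current
    estimates. Each stage regresses from features to the residual update."""
    x = x0.copy()
    cascade = []
    for _ in range(n_stages):
        phi = features(x)
        A = np.hstack([phi, np.ones((len(phi), 1))])   # affine regression
        # least-squares map from features to the remaining shape update
        W, *_ = np.linalg.lstsq(A, targets - x, rcond=None)
        cascade.append(W)
        x = x + A @ W                                   # apply the stage
    return cascade, x

def apply_cascade(cascade, x0, features):
    """Run a trained cascade from an initial estimate."""
    x = x0.copy()
    for W in cascade:
        A = np.hstack([features(x), np.ones((len(x), 1))])
        x = x + A @ W
    return x
```

Each stage only needs to correct the error left by the previous stages, which is why a short cascade of simple linear maps can model a strongly non-linear descent towards the target shape.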
Table 4 The set of experiments conducted in this paper (Experiments 1–6); Experiment 6 is a comparison against the state-of-the-art of the 300 VW competition (Shen et al. 2015).
Generative Models The most dominant representative algorithm of this category is, by far, the active appearance model (AAM). AAMs consist of parametric linear models of both the shape and appearance of an object, typically modelled by Principal Component Analysis (PCA). The AAM objective function involves the minimisation of the appearance reconstruction error with respect to the shape parameters. AAMs were initially proposed by Cootes et al. (1995, 2001), where the optimisation was performed by a single regression step between the current image reconstruction residual and an increment to the shape parameters. However, Matthews and Baker (2004) and Baker and Matthews (2004) linearised the AAM objective function and optimised it using the Gauss-Newton algorithm. Since then, Gauss-Newton optimisation has been the standard method for optimising AAMs. Numerous extensions have been published, related either to the optimisation procedure (Papandreou and Maragos 2008; Tzimiropoulos and Pantic 2013; Alabort-i-Medina and Zafeiriou 2014, 2015; Tzimiropoulos and Pantic 2014) or to the model structure (Tzimiropoulos et al. 2012; Antonakos et al. 2014; Tzimiropoulos et al. 2014; Antonakos et al. 2015b, a).
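In its usual Gauss-Newton formulation (our notation), the AAM objective jointly optimises the shape parameters \(\mathbf{p}\) and appearance parameters \(\mathbf{c}\) of the two PCA models:

```latex
\operatorname*{arg\,min}_{\mathbf{p},\,\mathbf{c}} \; \left\| \mathbf{i}\big(\mathcal{W}(\mathbf{p})\big) - \bar{\mathbf{a}} - \mathbf{A}\,\mathbf{c} \right\|_2^2
```

where \(\mathbf{i}(\mathcal{W}(\mathbf{p}))\) denotes the image vectorised after warping by the shape model \(\mathcal{W}\), \(\bar{\mathbf{a}}\) is the mean appearance and \(\mathbf{A}\) the appearance basis. Gauss-Newton methods linearise the warp around the current estimate of \(\mathbf{p}\) and iterate the resulting least-squares update.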
In recent challenges (Sagonas et al. 2013a, 2015), discriminative methods have been shown to represent the current state-of-the-art. However, in order to enable a fair comparison between the types of methods, we selected a representative set of landmark localisation methods to compare with in this paper, given in Table 3. We chose ERT (Kazemi and Sullivan 2014) as it is extremely fast, and the implementation provided by King (2009) is the best known implementation of a tree-based regressor. We chose CFSS (Zhu et al. 2015) as it is the current state-of-the-art on the data provided by the 300W competition of Sagonas et al. (2013a). We used the Gauss-Newton part-based AAM of Tzimiropoulos and Pantic (2014) as the top performing generative localisation method, as provided by the Menpo Project (Alabort-i-Medina et al. 2014). Finally, we also employed an SDM (Xiong and De la Torre 2013), as implemented by Alabort-i-Medina et al. (2014), as a baseline.