1 Introduction
- Labeled Face Parts in the Wild (LFPW) dataset (Belhumeur et al. 2013). Since LFPW provides only the source links to download the images and not the actual images, only 1035 images were available (out of 1287).
- Helen dataset (Le et al. 2012), which consists of 2330 high-resolution images downloaded from the flickr.com web service.
- The Annotated Faces in-the-wild (AFW) (Zhu and Ramanan 2012) dataset, which consists of 250 images with 468 faces.
- Two new datasets, namely the IBUG dataset and the 300W test set. IBUG consists of 135 images. The 300W test set consists of 300 images captured indoors and 300 images captured outdoors, and was publicly released with the second version of the competition (Sagonas et al. 2016).
2 Menpo 2D and Menpo 3D Benchmarks
2.1 Datasets
- Pose Large pose variations can cause heavy self-occlusion, and some facial components, such as half of the facial contour, can even be completely missing in a profile face.
- Occlusion Occlusion frequently happens on the facial contour and on facial organs (e.g. sunglasses over the eyes or food over the mouth) under uncontrolled conditions. Heavy occlusions pose great challenges for in-the-wild face alignment, as the facial appearance can be locally altered or even completely missing.
- Expression Some inner facial components (e.g. mouth and eyes) have their own variation patterns. In particular, the mouth shape is strongly affected by certain expressions (e.g. surprise and happiness), making face alignment under exaggerated expressions very challenging.
- Illumination Illumination changes (e.g. variations in intensity and direction) can significantly alter the facial appearance, and can even make some detailed textures disappear.
- Image Training Set 337 images from the Annotated Faces in-the-wild (AFW) (Zhu and Ramanan 2012), 1035 images from the Labeled Face Parts in the Wild (LFPW) (Belhumeur et al. 2013), 2300 images from Helen (Le et al. 2012), 135 images from IBUG (Sagonas et al. 2013), 600 images from 300W (Sagonas et al. 2013, 2016) and 7564 images from the Menpo 2D training dataset (Zafeiriou et al. 2017b).
- Video Training Set 55 videos from 300VW (Shen et al. 2015).
- Video Test Set 111 in-the-wild videos manually collected from YouTube.
Datasets | Year | Faces | Points |
---|---|---|---|
XM2VTS (Messer et al. 1999) | 1999 | 2360 | 68 |
BioID (Jesorsky et al. 2001) | 2001 | 1521 | 20 |
FRGC (Phillips et al. 2005) | 2005 | 4950 | 68 |
PUT (Kasinski et al. 2008) | 2007 | 9971 | 30 |
BUHMAP-DB (Aran et al. 2007) | 2007 | 2880 | 52 |
MUCT (Milborrow et al. 2010) | 2010 | 3755 | 76 |
Multi-PIE (Gross et al. 2010) (Semi-frontal) | 2010 | 6665 | 68 |
Multi-PIE (Gross et al. 2010) (Profile) | 2010 | 1400 | 39 |
LFW (Huang et al. 2008) | 2007 | 13,233 | 10 |
AFLW (Köstinger et al. 2011) | 2011 | 25,993 | 21 |
LFPW (Belhumeur et al. 2013) | 2011 | 1432 | 29 |
AFW (Zhu and Ramanan 2012) | 2012 | 205 | 6 |
HELEN (Le et al. 2012) | 2012 | 2330 | 194 |
COFW (Burgos-Artizzu et al. 2013) | 2013 | 1007 | 29 |
COFW (Ghiasi and Fowlkes 2015) | 2015 | 507 | 68 |
300W (Sagonas et al. 2013) | 2013 | 3837 | 68 |
300VW (Shen et al. 2015) | 2015 | 218k | 68 |
MTFL (Zhang et al. 2014) | 2014 | 12,995 | 5 |
MAFL (Zhang et al. 2016b) | 2016 | 20,000 | 5 |
Menpo 2D (Zafeiriou et al. 2017c) (Semi-frontal) | 2017 | 10,993 | 68 |
Menpo 2D (Zafeiriou et al. 2017c) (Profile) | 2017 | 3852 | 39 |
AFLW2000-3D (Zhu et al. 2016c) | 2016 | 2000 | 68 |
300W-LP (Zhu et al. 2016c) | 2016 | 61,225 | 68 |
Menpo 3D (Zafeiriou et al. 2017a) | 2017 | 11,971 + 280k | 84 |
2.2 Adopted Landmark Configurations
- Semi-frontal 2D landmarks, which we use in the Menpo 2D benchmark.
- Profile 2D landmarks, which we also use in the Menpo 2D benchmark.
- 3DA-2D (3D Aware 2D) landmarks, which we use in the Menpo 3D benchmark.
- 3D landmarks, which we also use in the Menpo 3D benchmark.
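The semi-frontal 2D configuration contains 68 points. Assuming the widely used 68-point iBUG-style markup ordering (an assumption, stated here purely for orientation), its component groups can be written as 0-based index ranges:

```python
# Component groups of a 68-point semi-frontal markup (0-based indices).
# The ordering follows the widely used iBUG convention and is an
# assumption; it is listed here purely for orientation.
GROUPS_68 = {
    "jaw": range(0, 17),
    "right_eyebrow": range(17, 22),
    "left_eyebrow": range(22, 27),
    "nose": range(27, 36),
    "right_eye": range(36, 42),
    "left_eye": range(42, 48),
    "mouth": range(48, 68),
}
```

The group sizes sum to 68; the profile configuration uses a different, 39-point markup.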
2.3 Creation of Ground-Truth Semi-frontal and Profile 2D Facial Landmarks
The landmarks were annotated using the landmarker.io annotation tool that was developed by our group, as shown in Fig. 6a. The annotation process in landmarker.io is illustrated in Fig. 6b, c.

2.4 Creation of Ground-Truth 3DA-2D and 3D Facial Landmarks
2.4.1 Dense 3D Face Shape Modelling
2.4.2 Camera Model
2.4.3 LSFM Fitting on Videos: Energy Formulation
2.4.4 Optimisation of Energy Function
2.4.5 LSFM Fitting on Images
2.4.6 Facial Landmark Sampling and Re-projection for Images and Videos
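The subsections above describe the fitting pipeline used to produce the 3DA-2D annotations, in which sampled 3D landmarks are re-projected into the image. As a minimal illustration only, assuming a weak-perspective camera (the benchmark's actual camera model is the one defined in Sect. 2.4.2):

```python
import numpy as np

def weak_perspective_project(points_3d, scale, rotation, translation_2d):
    """Project (N, 3) model points to the image plane: rotate into the
    camera frame, drop the depth coordinate, then scale and translate.
    A weak-perspective camera is assumed here for illustration."""
    rotated = points_3d @ rotation.T
    return scale * rotated[:, :2] + translation_2d
```

Under this simplification, 3DA-2D landmarks are simply the projected positions of the sampled 3D landmarks, so self-occluded points still receive image coordinates.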
3 Menpo 2D Challenge
- Localisation of Semi-frontal 2D landmarks in semi-frontal facial images.
- Localisation of Profile 2D landmarks in profile facial images.
3.1 Evaluation Metrics
Methods | Mean | Std | Median | MAD | Max error | \({\hbox {AUC}}_{0.05}\) | Failure rate |
---|---|---|---|---|---|---|---|
Yang et al. (2017) | 0.0120 | 0.0060 | 0.0107 | 0.0022 | 0.1453 | 0.7624 | 0.0024 |
He et al. (2017) | 0.0139 | 0.0260 | 0.0111 | 0.0023 | 0.9624 | 0.7478 | 0.0096 |
Wu and Yang (2017) | 0.0135 | 0.0095 | 0.0120 | 0.0024 | 0.5098 | 0.7337 | 0.0036 |
Kowalski et al. (2017) | 0.0138 | 0.0157 | 0.0120 | 0.0023 | 0.6312 | 0.7337 | 0.0049 |
Chen et al. (2017) | 0.0200 | 0.0756 | 0.0120 | 0.0026 | 1.2799 | 0.7290 | 0.0111 |
Xiao et al. (2017) | 0.0159 | 0.0201 | 0.0133 | 0.0027 | 0.6717 | 0.6986 | 0.0081 |
Shao et al. (2017) | 0.0165 | 0.0235 | 0.0138 | 0.0027 | 0.9612 | 0.6913 | 0.0101 |
Feng et al. (2017) | 0.0182 | 0.0179 | 0.0149 | 0.0033 | 0.4661 | 0.6586 | 0.0186 |
Zadeh et al. (2017a) | 0.0205 | 0.0340 | 0.0143 | 0.0035 | 0.9467 | 0.6479 | 0.0409 |
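The result tables report, for each method, statistics of the per-image normalised point-to-point error, together with the area under the cumulative error distribution up to a threshold of 0.05 (\({\hbox {AUC}}_{0.05}\)) and the failure rate (the fraction of images with error above 0.05). A minimal sketch of these statistics, assuming a plain list of per-image errors:

```python
import math

def summarise_errors(errors, threshold=0.05, steps=1000):
    """Statistics over per-image normalised point-to-point errors:
    mean, std, median, MAD, max, AUC up to `threshold` (area under the
    cumulative error distribution, normalised to [0, 1]) and failure
    rate (fraction of images with error above `threshold`)."""
    e = sorted(errors)
    n = len(e)
    mean = sum(e) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in e) / n)
    median = e[n // 2] if n % 2 else 0.5 * (e[n // 2 - 1] + e[n // 2])
    dev = sorted(abs(x - median) for x in e)
    mad = dev[n // 2] if n % 2 else 0.5 * (dev[n // 2 - 1] + dev[n // 2])
    # Cumulative error distribution: fraction of images with error <= t.
    ced = lambda t: sum(x <= t for x in e) / n
    # AUC: Riemann sum of the CED over [0, threshold], normalised.
    auc = sum(ced(threshold * i / steps) for i in range(1, steps + 1)) / steps
    return {"mean": mean, "std": std, "median": median, "mad": mad,
            "max": e[-1], "auc": auc, "failure_rate": 1.0 - ced(threshold)}
```

The normalisation of the point-to-point error itself (e.g. by face size) follows the protocol of the challenge; the sketch above only summarises already-normalised values.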
3.2 Participants
- X. Chen The method in Chen et al. (2017) proposed a four-stage coarse-to-fine framework to tackle the facial landmark localisation problem in-the-wild. In the first stage, they predict the facial landmarks on a coarse level, which sets a good initialisation for the whole framework. Then, the key points are grouped into several components and each component is refined within its local patch. After that, each key point is further refined with multi-scale local patches cropped according to its nearest 3-, 5-, and 7-neighbours, respectively. The results are further fused by an attention gate network. Since the facial landmark configuration is different for semi-frontal and profile faces in the Menpo 2D challenge, a linear transformation is finally learned with a least-squares approximation to adapt the predictions to the competition’s subsets.
- X. Shao The method in Shao et al. (2017) used a deep architecture to directly detect facial landmarks without using face detection as an initialisation. The architecture consists of two stages, a basic landmark prediction stage and a whole landmark regression stage. At the former stage, given an input image, the basic landmarks of all faces are detected by a sub-network for landmark heatmap and affinity field prediction. At the latter stage, the coarse canonical face and the pose are generated by a Pose Splitting Layer based on the visible basic landmarks. According to its pose, each canonical state is distributed to the corresponding branch of the shape regression sub-networks for the whole landmark detection.
- Z. He The method in He et al. (2017) proposed an effective facial landmark detection system, referred to as the Robust Fully End-to-end Cascaded Convolutional Neural Network (RFEC-CNN), to characterise the complex non-linearity from face appearance to shape. Moreover, a face bounding box invariant technique is adopted to reduce the sensitivity of landmark localisation to the face detector, while a model ensemble strategy is adopted to further enhance the landmark localisation performance.
- Z. Feng The method in Feng et al. (2017) presented a four-stage framework (face detection, bounding box aggregation, pose estimation and landmark localisation) for robust face detection and landmark localisation in the wild. To achieve a high detection rate, two publicly available CNN-based face detectors and two proprietary detectors are employed. Then, the detected face bounding boxes of each input image are aggregated to reduce false positives and improve face detection accuracy. After that, a cascaded shape regressor, trained using faces with a variety of pose variations, is employed for pose estimation and image pre-processing. Finally, another cascaded shape regressor is trained for fine-grained landmark localisation, using a large number of training samples with limited pose variations.
- J. Yang The method in Yang et al. (2017) explored a two-stage CNN model for robust facial landmark localisation. First, a supervised face transformation network is adopted to remove the translation, scale and rotation variation of each face, in order to reduce the variance of the regression target. Then, a deep convolutional neural network, the Stacked Hourglass Network (Newell et al. 2016), is employed to increase the capacity of the regression model.
- M. Kowalski The method in Kowalski et al. (2017) used a VGG-based Deep Alignment Network (DAN) for robust face alignment. This method uses entire face images at all stages, contrary to recently proposed face alignment methods that rely on local patches. The use of entire face images rather than patches allows DAN to handle face images with large variation in head pose and difficult initialisation. DAN consists of multiple stages, where each stage improves the locations of the facial landmarks estimated by the previous stage.
- A. Zadeh The method in Zadeh et al. (2017a) used a novel local detector, the Convolutional Experts Network (CEN), in the framework of the Constrained Local Model (CLM) for face alignment in the wild. This method brings together the advantages of deep neural architectures and mixtures of experts in an end-to-end framework.
- S. Xiao The method in Xiao et al. (2017) proposed a novel 3D-assisted coarse-to-fine extreme-pose facial landmark detection system. For a given face image, the face bounding box is first refined with landmark locations inferred from a 3D face model generated by a Recurrent 3D Regressor (R3R) at a coarse level. Then, another R3R is employed to fit a 3D face model onto the 2D face image cropped with the refined bounding box at a fine scale. 2D landmark locations inferred from the fitted 3D face are further adjusted with the popular 2D regression method LBF (Ren et al. 2014). The 3D-assisted coarse-to-fine strategy and the 2D adjustment process explicitly ensure both robustness to extreme face poses and bounding box disturbance, and accuracy down to pixel-level landmark displacement.
- W. Wu The method in Wu and Yang (2017) explored intra-dataset variation and inter-dataset variation to improve face alignment in-the-wild. Intra-dataset variation refers to bias in expression and head pose inside a given dataset, while inter-dataset variation refers to different biases across different datasets. Model robustness can be significantly improved by leveraging rich variations within and between different datasets. More specifically, Wu and Yang (2017) proposed a novel Deep Variation Leveraging Network (DVLN), which consists of two strongly coupled sub-networks, namely a Dataset-Across Network (DA-Net) and a Candidate-Decision Network (CD-Net). In particular, DA-Net takes advantage of the different characteristics and distributions across different datasets, while CD-Net makes a final decision on candidate hypotheses given by DA-Net to leverage the variation within a given dataset.
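Several of the entries above normalise the face geometry before regression; for instance, the supervised transformation step in Yang et al. removes translation, scale and rotation. A minimal sketch of such a normalisation via a least-squares 2D similarity alignment to a reference shape (the shapes and function names here are illustrative, not any participant's actual implementation):

```python
import numpy as np

def similarity_align(src, dst):
    """Least-squares 2D similarity transform (scale, rotation,
    translation) mapping landmark set `src` onto `dst`, both (N, 2).
    Returns (R, t) such that src @ R.T + t best approximates dst."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    denom = (src_c ** 2).sum()
    a = (src_c * dst_c).sum() / denom                # scaled cosine term
    b = (src_c[:, 0] * dst_c[:, 1]
         - src_c[:, 1] * dst_c[:, 0]).sum() / denom  # scaled sine term
    R = np.array([[a, -b], [b, a]])
    t = dst.mean(axis=0) - src.mean(axis=0) @ R.T
    return R, t
```

In a full pipeline the reference `dst` would typically be the mean shape of the training set, so that every face is regressed in a canonical frame.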
Methods | Mean | Std | Median | MAD | Max error | \({\hbox {AUC}}_{0.05}\) | Failure rate |
---|---|---|---|---|---|---|---|
Yang et al. (2017) | 0.0172 | 0.0105 | 0.0150 | 0.0035 | 0.2490 | 0.6613 | 0.0077 |
He et al. (2017) | 0.0247 | 0.0422 | 0.0179 | 0.0048 | 0.6280 | 0.5932 | 0.0355 |
Wu and Yang (2017) | 0.0217 | 0.0131 | 0.0193 | 0.0044 | 0.2623 | 0.5802 | 0.0221 |
Feng et al. (2017) | 0.0285 | 0.0367 | 0.0208 | 0.0057 | 0.4725 | 0.5268 | 0.0617 |
Xiao et al. (2017) | 0.0290 | 0.0417 | 0.0209 | 0.0055 | 0.6327 | 0.5237 | 0.0612 |
Zadeh et al. (2017a) | 0.0375 | 0.0630 | 0.0241 | 0.0071 | 0.7594 | 0.4604 | 0.0951 |
Chen et al. (2017) | 0.0448 | 0.1162 | 0.0265 | 0.0058 | 1.3698 | 0.4259 | 0.0642 |
Shao et al. (2017) | 0.0451 | 0.0636 | 0.0282 | 0.0088 | 0.7534 | 0.3891 | 0.1608 |
3.3 Competition Results
Methods | Mean | Std | Median | MAD | Max error | \({\hbox {AUC}}_{0.05}\) | Failure rate |
---|---|---|---|---|---|---|---|
Yang et al. (2017) | 0.0097 | 0.0053 | 0.0087 | 0.0017 | 0.1719 | 0.8084 | 0.0022 |
He et al. (2017) | 0.0117 | 0.0253 | 0.0093 | 0.0019 | 0.9520 | 0.7886 | 0.0079 |
Wu and Yang (2017) | 0.0113 | 0.0085 | 0.0101 | 0.0019 | 0.4752 | 0.7778 | 0.0024 |
Kowalski et al. (2017) | 0.0116 | 0.0147 | 0.0102 | 0.0018 | 0.6720 | 0.7765 | 0.0036 |
Chen et al. (2017) | 0.0174 | 0.0724 | 0.0099 | 0.0021 | 1.2699 | 0.7746 | 0.0096 |
Xiao et al. (2017) | 0.0132 | 0.0188 | 0.0110 | 0.0022 | 0.6411 | 0.7513 | 0.0066 |
Shao et al. (2017) | 0.0139 | 0.0220 | 0.0115 | 0.0022 | 0.9590 | 0.7420 | 0.0084 |
Zadeh et al. (2017a) | 0.0162 | 0.0319 | 0.0111 | 0.0026 | 0.9377 | 0.7200 | 0.0204 |
Feng et al. (2017) | 0.0159 | 0.0164 | 0.0129 | 0.0029 | 0.3686 | 0.7007 | 0.0161 |
Methods | Mean | Std | Median | MAD | Max error | \({\hbox {AUC}}_{0.05}\) | Failure rate |
---|---|---|---|---|---|---|---|
Yang et al. (2017) | 0.0136 | 0.0093 | 0.0110 | 0.0026 | 0.2162 | 0.7319 | 0.0036 |
He et al. (2017) | 0.0201 | 0.0414 | 0.0132 | 0.0035 | 0.6380 | 0.6778 | 0.0257 |
Wu and Yang (2017) | 0.0168 | 0.0109 | 0.0142 | 0.0034 | 0.2252 | 0.6709 | 0.0128 |
Xiao et al. (2017) | 0.0233 | 0.0416 | 0.0154 | 0.0042 | 0.7073 | 0.6231 | 0.0509 |
Feng et al. (2017) | 0.0236 | 0.0361 | 0.0161 | 0.0046 | 0.5141 | 0.6124 | 0.0483 |
Zadeh et al. (2017a) | 0.0293 | 0.0632 | 0.0157 | 0.0046 | 0.8780 | 0.5990 | 0.0617 |
Chen et al. (2017) | 0.0409 | 0.1181 | 0.0223 | 0.0051 | 1.3809 | 0.4954 | 0.0493 |
Shao et al. (2017) | 0.0388 | 0.0636 | 0.0228 | 0.0079 | 0.7769 | 0.4756 | 0.1223 |
3.4 A New Strong Baseline for 2D Face Alignment
3.4.1 Face Region Normalisation
3.4.2 Multi-view Hourglass Model
3.4.3 Experimental Results
3.5 Development of 2D Face Alignment
4 Menpo 3D Challenge
4.1 Evaluation Metrics
4.2 Participants
- D. Crispell The method in Crispell and Bazik (2017) proposed an efficient and fully automatic method for 3D face shape and pose estimation in unconstrained 2D images. More specifically, the method jointly estimates a dense set of 3D landmarks and facial geometry using a single pass of a modified version of the popular “U-Net” neural network architecture. In addition, the 3D Morphable Model (3DMM) parameters are directly predicted by using the estimated 3D landmarks and geometry as constraints in a linear system.
- A. Zadeh The method in Zadeh et al. (2017b) applied an extension of the popular Constrained Local Model (CLM), the so-called Convolutional Experts (CE)-CLM, to the problem of 3DA-2D facial landmark detection. The key module of CE-CLM is a novel convolutional local detector that brings together the advantages of neural architectures and mixtures of experts. To further improve 3D face tracking performance, the authors use two complementary networks alongside CE-CLM: an Adjustment Network that maps the output of CE-CLM to 84 landmarks, and a Deep Residual Network, called the Correction Network, that learns dataset-specific corrections for CE-CLM.
- P. Xiong The method in Xiong et al. (2017) proposed a two-stage shape regression method combining powerful local heatmap regression with global shape regression. The method is based on the popular stacked Hourglass network, which is used to generate a set of heatmaps, one for each 3D shape point. Since these heatmaps are independent of each other, a hierarchical attention mechanism, from global to local heatmaps, is applied within the network to model the correlations among neighbouring regions. Then, all these heatmaps, alongside the aligned input image, are processed by a deep residual network to further learn global features and produce the final smooth 3D shape.
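The linear-system step in the first entry above (predicting 3DMM parameters from estimated 3D landmarks and geometry) can be illustrated with a toy linear shape model; the function name and dimensions below are hypothetical, not Crispell and Bazik's actual system:

```python
import numpy as np

def fit_shape_coefficients(landmarks_3d, mean_shape, basis, reg=1e-6):
    """Least-squares fit of linear shape-model coefficients `alpha` so
    that mean_shape + basis @ alpha approximates the flattened (N, 3)
    landmarks; a small Tikhonov term keeps the normal equations well
    conditioned. A toy illustration of fitting a linear 3D shape model,
    not any participant's actual 3DMM."""
    residual = landmarks_3d.ravel() - mean_shape
    A = basis.T @ basis + reg * np.eye(basis.shape[1])
    return np.linalg.solve(A, basis.T @ residual)
```

Because the shape model is linear in its coefficients, the fit reduces to regularised normal equations and needs no iterative optimisation.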