Approaching human level facial landmark localization by deep learning☆
Introduction
Reliable face recognition crucially depends on accurate and robust face alignment. Good alignment makes the face recognizer robust to pose and expression changes [1], [2]. Facial landmark localization seeks to detect a set of predefined key points on a human face. It attracts intense interest from both industry and the research community.
Despite rapid progress in this area, facial landmark localization in uncontrolled settings remains an unsolved problem. The major challenge comes from the large variations in facial images (see Fig. 1 for examples of the images used in our experiments). The head can have large yaw or pitch angles. The lighting conditions can be extreme. In addition, some of the images are of low quality: the resolution can be low, and the image can be blurry or corrupted. When viewed locally, some of the facial landmarks are difficult to recognize. Even humans have to use the global context to identify a point.
The major requirement of face alignment is robustness. The landmark localizer must return a set of reasonable point coordinates however badly the image is corrupted. Even if a point is not visible, face recognizers typically demand that the landmark localizer “guess” a position for the point so that some global measurements (e.g. the rotation of the face) can be made. This poses a great challenge to the landmark localizer.
The most straightforward way to solve the landmark localization problem is to view it as an image-based regression problem. The input is the RGB image. The outputs are the coordinate values. Any image-based regressor can be plugged in. This framework is capable of giving very promising results if the regressor is powerful enough. In our solution we use a cascade of convolutional neural networks (CNNs). There are two key ingredients in our solution:
- Deep network. To increase robustness, we used significantly deeper networks compared to previous works [3], [4].
- Coarse-to-fine prediction. Using multiple CNNs in a coarse-to-fine manner improves accuracy.
However, obtaining good-quality training data is difficult and expensive. We conduct a human study to investigate humans' ability to locate key points in an image. Then we discuss how our findings influence our future directions for improving the system.
Face alignment is an indispensable step in modern face recognition systems [5], [6], [7], [8]. Generally, there are two schools of methods for facial landmark detection: model-based methods and regression-based methods. Model-based methods try to build models to fit input images. They can take into account the local texture appearance [9] or the part-based structure [10], [11], [12]. Their advantage is that human prior knowledge can be easily incorporated into the system.
The regression-based methods are more straightforward. Deep neural networks [4], [3], [13], [14] and boosted regressors [15] have been successfully employed. A regression-based method depends solely on the regressor's capacity to learn the extremely complex relation between pixel values and the appearance of facial features. Great care is needed in training these complex models. In our solution we aim at keeping the conceptual framework of the method as simple as possible. The major difference in our method is that our neural network's depth greatly exceeds that used by existing works [4], [3], [13], which typically have at most 4 convolutional layers (ours contains 8 convolutional layers). Thanks to the extra power of the network, we can greatly simplify the whole framework.
Section snippets
Deep CNN cascaded for facial landmark localization
We formulate the landmark localization problem as learning a function that maps image pixel arrays to point coordinates. The input image $I_{h \times w}$ is an $h \times w$ three-channel (RGB) image. It contains the face area found by the face detector. The output $P_{n \times 2}$ holds the landmark coordinates relative to the face's bounding box.
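To make the output convention concrete, the following is a minimal sketch of converting landmark coordinates between absolute pixel positions and positions relative to the face's bounding box. The box layout `(left, top, width, height)`, the scaling to the range [0, 1], and the function names `to_relative` and `to_absolute` are assumptions made for illustration; the text above only states that the coordinates are relative to the bounding box.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# Assumed box layout (not specified in the text): (left, top, width, height).
Box = Tuple[float, float, float, float]


def to_relative(points: List[Point], box: Box) -> List[Point]:
    """Map absolute pixel coordinates to box-relative coordinates.

    Assumption: coordinates are scaled by the box size, so a point inside
    the box lands in [0, 1] on both axes.
    """
    left, top, width, height = box
    return [((x - left) / width, (y - top) / height) for x, y in points]


def to_absolute(points: List[Point], box: Box) -> List[Point]:
    """Inverse of to_relative: map box-relative coordinates back to pixels."""
    left, top, width, height = box
    return [(left + u * width, top + v * height) for u, v in points]


# Hypothetical face box and three landmarks in pixel coordinates.
box = (100.0, 50.0, 200.0, 200.0)
landmarks = [(150.0, 100.0), (250.0, 100.0), (200.0, 200.0)]

relative = to_relative(landmarks, box)
print(relative)  # [(0.25, 0.25), (0.75, 0.25), (0.5, 0.75)]
restored = to_absolute(relative, box)
```

Expressing targets relative to the detected box keeps the regression target independent of where the face sits in the original photograph, which is one plausible reason for the convention.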
In the following two subsections we present detailed descriptions of the two components of our framework: the Convolutional Neural Network and coarse-to-fine prediction.
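As a rough illustration of the coarse-to-fine idea, the sketch below chains a first-level predictor with second-level refiners. Everything here is an assumption for illustration: the names `cascade_predict`, `coarse_net`, and `refine_nets`, the use of one refiner per point, the patch half-size of 0.1, and the toy stand-in regressors that replace trained CNNs. It is not the authors' architecture, only the general control flow of a two-level cascade.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Image = object  # placeholder type; a real system would pass a pixel array

# Hypothetical first-level model: whole face image -> coarse points.
CoarseNet = Callable[[Image], List[Point]]
# Hypothetical second-level model: (image, patch center, patch half-size)
# -> offset from the patch center, in units of the patch half-size.
RefineNet = Callable[[Image, Point, float], Point]


def cascade_predict(
    image: Image,
    coarse_net: CoarseNet,
    refine_nets: Sequence[RefineNet],
    patch_half_size: float = 0.1,
) -> List[Point]:
    """Predict coarse points, then correct each one from a local patch."""
    coarse = coarse_net(image)
    refined = []
    for (x, y), refine in zip(coarse, refine_nets):
        dx, dy = refine(image, (x, y), patch_half_size)
        refined.append((x + dx * patch_half_size, y + dy * patch_half_size))
    return refined


# --- Toy stand-ins for trained networks (illustration only) ---
truth = [(0.3, 0.4), (0.7, 0.4)]


def toy_coarse(image: Image) -> List[Point]:
    # Pretend the first level is off by 0.05 on each axis.
    return [(0.35, 0.45), (0.65, 0.35)]


def make_toy_refiner(target: Point) -> RefineNet:
    def refine(image: Image, center: Point, half: float) -> Point:
        # A perfect local regressor, clamped to the patch extent [-1, 1].
        dx = max(-1.0, min(1.0, (target[0] - center[0]) / half))
        dy = max(-1.0, min(1.0, (target[1] - center[1]) / half))
        return (dx, dy)
    return refine


refiners = [make_toy_refiner(t) for t in truth]
coarse_points = toy_coarse(None)
final_points = cascade_predict(None, toy_coarse, refiners)
```

The clamp in the toy refiner mirrors a property of any patch-based second level: it can only correct errors that fall inside the patch it sees, so the first level must already be reasonably close.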
Experiment
We conducted three experiments. The first experiment evaluates the number of points seen by each of the networks in the second level. The second experiment validates the effectiveness of our method. In the last experiment we study humans' ability to locate points on images.
Our training set consists of the datasets provided by 300-W [17], [18], [19], [20] and our own manually labeled web images. The IBUG dataset as well as the “test” partitions of HELEN and LFPW are left out for testing. The
End of the free lunch
One of the “free lunches” in deep learning is that a system's performance improves almost monotonically as we increase the neural network's depth and size. This is only made possible by the availability of abundant training data and ever-increasing computing power. The combination of Amazon Mechanical Turk and Nvidia's high-end GPUs has enabled researchers' impressive progress in image classification and object detection [30], [31], which is one of the most exciting stories in the current deep learning
Conclusion
In this paper we showed how to build an accurate and robust facial landmark localizer using deep learning tools. We also pointed out the limitation of our current supervised learning framework and described directions for further improving our system.
References (31)
- et al., Tom-vs-Pete classifiers and identity-preserving alignment for face verification
- et al., Blessing of dimensionality: high-dimensional feature and its efficient compression for face verification
- et al., Extensive facial landmark localization with coarse-to-fine convolutional network cascade
- et al., Deep convolutional network cascade for facial point detection
- E. Zhou, Z. Cao, Q. Yin, Naive-deep face recognition: touching the limit of LFW benchmark or not?, arXiv preprint...
- et al., Learning compact face representation: packing a face into an int32
- Y. Sun, X. Wang, X. Tang, Deeply learned face representations are sparse, selective, and robust, arXiv preprint...
- et al., DeepFace: closing the gap to human-level performance in face verification
- et al., Active appearance models
- et al., Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
- Gauss–Newton deformable part models for face alignment in-the-wild
- Pictorial structures for object recognition, Int. J. Comput. Vis.
- Learning deep representation for face alignment with auxiliary attributes, IEEE Trans. Pattern Anal. Mach. Intell.
- Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment
- Face alignment by explicit shape regression, Int. J. Comput. Vis.
☆ This paper has been recommended for acceptance by Stefanos Zafeiriou.