Elsevier

Image and Vision Computing

Volume 47, March 2016, Pages 27-35

Approaching human level facial landmark localization by deep learning

https://doi.org/10.1016/j.imavis.2015.11.004

Highlights

  • We show how to achieve state-of-the-art facial landmark localization with a CNN.

  • A deeper network improves the system's performance.

  • We show that our system's performance is close to human performance.

Abstract

In this paper we present our solution to the 300 Faces in the Wild Facial Landmark Localization Challenge. We demonstrate how to achieve very competitive localization performance with a simple deep learning based system. A human study shows that the accuracy of our system is very close to human performance. We discuss how this finding affects our future direction for improving the system.

Introduction

Reliable face recognition crucially depends on accurate and robust face alignment. Good alignment makes the face recognizer robust to pose and expression changes [1], [2]. Facial landmark localization seeks to detect a set of predefined key points on a human face. It attracts intense interest from both industry and the research community.

Despite rapid progress in this area, facial landmark localization in uncontrolled settings remains an unsolved problem. The major challenge comes from the large variations in facial images (see Fig. 1 for examples of images used in our experiments). The head can have large yaw or pitch angles. The lighting conditions can be extreme. In addition, some of the images are of low quality: the resolution can be low, and the image can be blurry or corrupted. When viewed locally, some of the facial landmarks are difficult to recognize. Even humans have to use the global context to identify a point.

The major requirement of face alignment is robustness. The landmark localizer must return a set of reasonable point coordinates no matter how the image is corrupted. Even if a point is not visible, face recognizers typically require the landmark localizer to “guess” a position for the point so that some global measurements (e.g. the rotation of the face) can be made. This poses a great challenge to the landmark localizer.

The most straightforward way to solve the landmark localization problem is to view it as an image-based regression problem. The input is the RGB image. The outputs are the coordinate values. Any image-based regressor can be plugged in. This framework can give very promising results if the regressor is powerful enough. In our solution we use a CNN (Convolutional Neural Network) cascade. There are two key ingredients in our solution:

  • Deep network. To increase robustness, we use significantly deeper networks than previous work [3], [4].

  • Coarse-to-fine prediction. Using multiple CNNs in a coarse-to-fine manner improves accuracy.

However, obtaining good quality training data is difficult and expensive. We conduct a human study to investigate humans' ability to locate key points in an image. We then discuss how our findings influence our future direction for improving the system.

Face alignment is an indispensable step in modern face recognition systems [5], [6], [7], [8]. Generally, there are two schools of methods for facial landmark detection: model-based methods and regression-based methods. Model-based methods build models and fit them to input images. They can take into account the local texture appearance [9] or the part-based structure [10], [11], [12]. Their advantage is that human prior knowledge can be easily incorporated into the system.

The regression-based methods are more straightforward. Deep neural networks [4], [3], [13], [14] and boosted regressors [15] have been successfully employed. A regression-based method depends solely on the regressor's capacity to learn the extremely complex relation between pixel values and the appearance of facial features. Great care is needed in training these complex models. In our solution we aim to keep the conceptual framework of the method as simple as possible. The major difference in our method is that our neural network is much deeper than those used in existing work [4], [3], [13], which typically have at most 4 convolutional layers (ours contains 8 convolutional layers). Thanks to the extra power of the network, we can greatly simplify the whole framework.

Deep CNN cascaded for facial landmark localization

We formulate the landmark localization problem as learning a function that maps image pixel arrays to point coordinates. The input image $I_{h \times w}$ is an $h \times w$ three-channel (RGB) image. It contains the face area found by the face detector. The output $P_{n \times 2}$ holds the coordinates of the $n$ landmarks relative to the face's bounding box.
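To make the output convention concrete, the following minimal sketch converts landmark coordinates between absolute pixels and bounding-box-relative values. The box layout `(left, top, width, height)`, the scaling to the range [0, 1], and the function names `to_relative` and `to_absolute` are assumptions made for illustration; the text above only states that the coordinates are relative to the face's bounding box.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# Hypothetical box layout: (left, top, width, height) in pixels.
Box = Tuple[float, float, float, float]


def to_relative(points: List[Point], box: Box) -> List[Point]:
    """Map absolute pixel coordinates to box-relative coordinates.

    Assumed convention: (0, 0) is the top-left corner of the box and
    (1, 1) is the bottom-right corner.
    """
    left, top, width, height = box
    return [((x - left) / width, (y - top) / height) for x, y in points]


def to_absolute(points: List[Point], box: Box) -> List[Point]:
    """Invert to_relative: map box-relative coordinates back to pixels."""
    left, top, width, height = box
    return [(left + u * width, top + v * height) for u, v in points]
```

With this convention a regressor's targets do not depend on where the detector found the face in the full image, and predictions can be mapped back to pixels with `to_absolute`.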

In the following two subsections we present detailed descriptions of the two components of our framework: the Convolutional Neural Network and coarse-to-fine prediction.
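The general idea of coarse-to-fine prediction can be sketched as follows. This is an illustrative outline only, not the paper's actual cascade: the trained CNNs are replaced by plain callables, the signatures `CoarseNet` and `RefineNet` are hypothetical, and the cropping of local patches around the coarse estimates is omitted. A first-level predictor estimates all points, and each second-level predictor returns offsets that correct its own group of points.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
# Stand-ins for trained CNNs (hypothetical signatures, not from the paper).
CoarseNet = Callable[[object], List[Point]]               # image -> all points
RefineNet = Callable[[object, List[Point]], List[Point]]  # image, group -> offsets


def coarse_to_fine(
    image: object,
    coarse_net: CoarseNet,
    refine_nets: Sequence[Tuple[List[int], RefineNet]],
) -> List[Point]:
    """Predict all points coarsely, then refine each group of points."""
    coarse = list(coarse_net(image))
    refined = list(coarse)
    for indices, net in refine_nets:
        # Each second-level predictor only sees the coarse estimates
        # of the points assigned to it.
        group = [coarse[i] for i in indices]
        offsets = net(image, group)
        for i, (dx, dy) in zip(indices, offsets):
            x, y = coarse[i]
            refined[i] = (x + dx, y + dy)
    return refined
```

For example, passing a single pair `([1], shift_net)` refines only the point at index 1 and leaves the remaining points at their coarse estimates.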

Experiment

We conducted three experiments. The first experiment evaluates the number of points seen by each of the networks in the second level. The second experiment validates the effectiveness of our method. In the last experiment we study humans' ability to locate points on images.

Our training set consists of the datasets provided by 300-W [17], [18], [19], [20] and our own manually labeled web images. The IBUG dataset and the “test” partitions of HELEN and LFPW are held out for testing. The

End of the free lunch

One of the “free lunches” in deep learning is that a system's performance improves almost monotonically as we increase the neural network's depth and size. This is only made possible by the availability of abundant training data and ever-increasing computing power. The combination of Amazon Turk and Nvidia's high-end GPUs has enabled researchers' impressive progress in image classification and object detection [30], [31], which is one of the most exciting stories in the current deep learning

Conclusion

In this paper we showed how to build an accurate and robust facial landmark localizer using deep learning tools. We pointed out the limitations of our current supervised learning framework and described directions for further improving our system.

References (31)

  • T. Berg et al., Tom-vs-Pete classifiers and identity-preserving alignment for face verification

  • D. Chen et al., Blessing of dimensionality: high-dimensional feature and its efficient compression for face verification

  • E. Zhou et al., Extensive facial landmark localization with coarse-to-fine convolutional network cascade

  • Y. Sun et al., Deep convolutional network cascade for facial point detection

  • E. Zhou, Z. Cao, Q. Yin, Naive-deep face recognition: touching the limit of lfw benchmark or not?, arXiv preprint...

  • H. Fan et al., Learning compact face representation: packing a face into an int32

  • Y. Sun, X. Wang, X. Tang, Deeply learned face representations are sparse, selective, and robust, arXiv preprint...

  • Y. Taigman et al., Deepface: closing the gap to human-level performance in face verification

  • T.F. Cootes et al., Active appearance models

  • P.F. Felzenszwalb et al., Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. (2010)

  • G. Tzimiropoulos et al., Gauss–Newton deformable part models for face alignment in-the-wild

  • P.F. Felzenszwalb et al., Pictorial structures for object recognition, Int. J. Comput. Vis. (2005)

  • Z. Zhang et al., Learning deep representation for face alignment with auxiliary attributes, IEEE Trans. Pattern Anal. Mach. Intell. (2015)

  • J. Zhang et al., Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment

  • X. Cao et al., Face alignment by explicit shape regression, Int. J. Comput. Vis. (2014)

    This paper has been recommended for acceptance by Stefanos Zafeiriou.
