Approaching human level facial landmark localization by deep learning

doi:10.1016/j.imavis.2015.11.004

Image and Vision Computing

Volume 47, March 2016, Pages 27-35

https://doi.org/10.1016/j.imavis.2015.11.004

Highlights

• We show how to achieve state-of-the-art facial landmark localization with a CNN.
• The system's performance is improved by using a deeper network.
• We show that our system's performance is close to human performance.

Abstract

In this paper we present our solution to the 300 Faces in the Wild Facial Landmark Localization Challenge. We demonstrate how to achieve very competitive localization performance with a simple deep learning based system. A human study is conducted to show that the accuracy of our system is very close to human performance. We discuss how this finding affects our future direction for improving the system.

Introduction

Reliable face recognition crucially depends on accurate and robust face alignment. Good alignment enables the face recognizer to be robust against changes in pose and expression [1], [2]. Facial landmark localization seeks to detect a set of predefined key points on a human face. It attracts intense interest from both industry and the research community.

Despite rapid progress in this area, facial landmark localization in uncontrolled settings remains an unsolved problem. The major challenge comes from the large variations in facial images (see Fig. 1 for examples of images used in our experiments). The head can have large yaw or pitch angles. The lighting conditions can be extreme. In addition, some of the images are of low quality: the resolution can be low, and the image can be blurry or corrupted. When viewed locally, some of the facial landmarks are difficult to recognize. Even humans have to use the global context to identify a point.

The major requirement of face alignment is robustness. The landmark localizer must return a set of reasonable point coordinates no matter how badly the image is corrupted. Even if a point is not visible, face recognizers typically demand that the landmark localizer “guess” a position for the point so that some global measurements (e.g. the rotation of the face) can be made. This poses a great challenge to the landmark localizer.

The most straightforward way to solve the landmark localization problem is to view it as an image-based regression problem. The input is the RGB image. The outputs are the coordinate values. Any image-based regressor can be plugged in. This framework can give very promising results if the regressor is powerful enough. In our solution we use a cascade of CNNs (Convolutional Neural Networks). There are two key ingredients in our solution:

• Deep network. To increase robustness, we used significantly deeper networks compared to previous works [3], [4].
• Coarse-to-fine prediction. Using multiple CNNs in a coarse-to-fine manner improves accuracy.
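
The coarse-to-fine idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `cascade_predict`, `patch_around`, the `patch_scale` value, and the two toy regressors are hypothetical stand-ins for the first-level and second-level CNNs, and refining each landmark in its own patch is a simplification chosen for brevity.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]               # (x, y) in image pixels
Box = Tuple[float, float, float, float]   # (left, top, width, height) in pixels


def patch_around(center: Point, face_box: Box, scale: float) -> Box:
    """Return a patch centred on `center`, sized as a fraction of the face box."""
    width, height = face_box[2] * scale, face_box[3] * scale
    return (center[0] - width / 2, center[1] - height / 2, width, height)


def cascade_predict(
    image,
    face_box: Box,
    coarse: Callable[[object, Box], List[Point]],
    fine: Callable[[object, Box, int], Point],
    patch_scale: float = 0.25,  # assumed value, not taken from the paper
) -> List[Point]:
    """Coarse stage proposes all points; fine stage corrects each one locally."""
    refined = []
    for index, (cx, cy) in enumerate(coarse(image, face_box)):
        patch = patch_around((cx, cy), face_box, patch_scale)
        dx, dy = fine(image, patch, index)  # offset from the patch centre
        refined.append((cx + dx, cy + dy))
    return refined


# Toy stand-ins so the sketch runs without a trained network.
TRUE_POINTS = [(40.0, 50.0), (80.0, 50.0)]


def toy_coarse(image, face_box: Box) -> List[Point]:
    # Pretend first-level output: ground truth shifted by a few pixels.
    return [(x + 4.0, y - 3.0) for x, y in TRUE_POINTS]


def toy_fine(image, patch: Box, index: int) -> Point:
    # Pretend second-level output: exact offset from the patch centre.
    left, top, width, height = patch
    x, y = TRUE_POINTS[index]
    return (x - (left + width / 2), y - (top + height / 2))


if __name__ == "__main__":
    print(cascade_predict(None, (20.0, 20.0, 80.0, 80.0), toy_coarse, toy_fine))
```

In a real system the two callables would wrap trained networks that read pixels from the face crop and from the local patches; the second stage only has to correct a small residual, which is why the refinement can work at a finer scale.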

However, obtaining good quality training data is difficult and expensive. We conduct a human study to investigate humans' ability to locate key points in an image. We then discuss how our findings influence our future direction for improving the system.

Face alignment is an indispensable step in modern face recognition systems [5], [6], [7], [8]. Generally, there are two schools of methods for facial landmark detection: model-based methods and regression-based methods. Model-based methods try to build models to fit input images. They can take into account the local texture appearance [9] or the part-based structure [10], [11], [12]. Their advantage is that human prior knowledge can be easily incorporated into the system.

The regression-based methods are more straightforward. Deep neural networks [3], [4], [13], [14] and boosted regressors [15] have been successfully employed. These methods depend solely on the regressor's capacity to learn the extremely complex relation between pixel values and the appearance of facial features, and great care is needed in training such complex models. In our solution we aim to keep the conceptual framework of the method as simple as possible. The major difference is that our networks are much deeper than those used in existing works [3], [4], [13], which typically have at most 4 convolutional layers (ours contains 8 convolutional layers). Thanks to the extra power of the network, the whole framework can be greatly simplified.

Section snippets

Deep CNN cascaded for facial landmark localization

We formulate the landmark localization problem as learning a function that maps image pixel arrays to point coordinates. The input image I_{h × w} is an h × w three-channel (RGB) image. It contains the face area found by the face detector. The output P_{n × 2} holds the coordinates of the n landmarks relative to the face's bounding box.
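
To make the target encoding concrete, the sketch below converts pixel coordinates into bounding-box-relative coordinates and back. It assumes the box is given as (left, top, width, height) and that each axis is scaled to the range [0, 1]; the text above does not state the exact normalization, and the function names `to_box_relative` and `to_pixels` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, width, height) in pixels


def to_box_relative(points: List[Point], box: Box) -> List[Point]:
    """Encode n pixel landmarks as an n x 2 target relative to the face box."""
    left, top, width, height = box
    if width <= 0 or height <= 0:
        raise ValueError("bounding box must have positive width and height")
    return [((x - left) / width, (y - top) / height) for x, y in points]


def to_pixels(relative: List[Point], box: Box) -> List[Point]:
    """Decode box-relative predictions back to image pixel coordinates."""
    left, top, width, height = box
    return [(left + u * width, top + v * height) for u, v in relative]


if __name__ == "__main__":
    face_box = (20.0, 30.0, 100.0, 200.0)
    landmarks = [(20.0, 30.0), (70.0, 130.0), (120.0, 230.0)]
    targets = to_box_relative(landmarks, face_box)
    print(targets)                       # regression targets for training
    print(to_pixels(targets, face_box))  # recovers the original landmarks
```

Encoding targets relative to the detected box makes the regression independent of where the face sits in the full image and of its absolute size.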

In the following two subsections we present detailed descriptions of the two components of our framework: the Convolutional Neural Network and coarse-to-fine prediction.

Experiment

We conducted three experiments. The first experiment evaluates the number of points seen by each of the networks in the second level. The second experiment validates the effectiveness of our method. In the last experiment we study humans' ability to locate points on images.
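
Comparing a system against human annotators requires a common error measure. The sketch below computes the mean point-to-point error normalized by the inter-ocular distance, a convention commonly used with the 300-W benchmark. The excerpt here does not state which metric the authors used, so the metric choice, the function name `normalized_mean_error`, and the eye-corner index parameters are all assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def normalized_mean_error(
    predicted: List[Point],
    reference: List[Point],
    left_eye: int,    # index of one eye landmark in `reference` (assumed layout)
    right_eye: int,   # index of the other eye landmark in `reference`
) -> float:
    """Mean Euclidean error over all points, divided by inter-ocular distance."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("point lists must be non-empty and of equal length")
    (lx, ly), (rx, ry) = reference[left_eye], reference[right_eye]
    inter_ocular = math.hypot(rx - lx, ry - ly)
    if inter_ocular == 0:
        raise ValueError("eye landmarks coincide; cannot normalize")
    errors = [
        math.hypot(px - qx, py - qy)
        for (px, py), (qx, qy) in zip(predicted, reference)
    ]
    return sum(errors) / len(errors) / inter_ocular


if __name__ == "__main__":
    reference = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
    system = [(0.0, 1.0), (10.0, 0.0), (5.0, 4.0)]
    human = [(0.5, 0.0), (10.0, 0.5), (5.0, 5.0)]
    print("system:", normalized_mean_error(system, reference, 0, 1))
    print("human: ", normalized_mean_error(human, reference, 0, 1))
```

Scoring a second human annotation with the same function gives a human error figure on the same scale as the system's, which is one way a human-versus-system comparison could be set up.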

Our training set consists of the datasets provided by 300-W [17], [18], [19], [20] and our own manually labeled web images. The IBUG dataset as well as the “test” partitions of HELEN and LFPW are left out for testing. The

End of the free lunch

One of the “free lunches” in deep learning is that a system's performance improves almost monotonically as we increase the neural network's depth and size. This is only made possible by the availability of abundant training data and ever-increasing computing power. The combination of Amazon Mechanical Turk and Nvidia's high-end GPUs has enabled researchers' impressive progress in image classification and object detection [30], [31], which is one of the most exciting stories in the current deep learning

Conclusion

In this paper we showed how to build an accurate and robust facial landmark localizer using deep learning tools. We point out the limitation of our current supervised learning framework and describe directions to further improve our system.

References (31)

T. Berg et al., Tom-vs-Pete classifiers and identity-preserving alignment for face verification
D. Chen et al., Blessing of dimensionality: high-dimensional feature and its efficient compression for face verification
E. Zhou et al., Extensive facial landmark localization with coarse-to-fine convolutional network cascade
Y. Sun et al., Deep convolutional network cascade for facial point detection
E. Zhou, Z. Cao, Q. Yin, Naive-deep face recognition: touching the limit of LFW benchmark or not?, arXiv preprint...
H. Fan et al., Learning compact face representation: packing a face into an int32
Y. Sun, X. Wang, X. Tang, Deeply learned face representations are sparse, selective, and robust, arXiv preprint...
Y. Taigman et al., DeepFace: closing the gap to human-level performance in face verification
T.F. Cootes et al., Active appearance models
P.F. Felzenszwalb et al., Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
G. Tzimiropoulos et al., Gauss–Newton deformable part models for face alignment in-the-wild
P.F. Felzenszwalb et al., Pictorial structures for object recognition, Int. J. Comput. Vis. (2005)
Z. Zhang et al., Learning deep representation for face alignment with auxiliary attributes, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
J. Zhang et al., Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment
X. Cao et al., Face alignment by explicit shape regression, Int. J. Comput. Vis. (2014)

☆ This paper has been recommended for acceptance by Stefanos Zafeiriou.
